Survey by the Horizon 2020 project AllScale

Is HPC software ready for the ExaScale computing era?

As we approach the ExaScale computing era, the failure of compute nodes will become commonplace. This survey aims to determine how software developers in the field of high performance computing view fault tolerance today, if they consider it at all, and how they plan to incorporate fault tolerance into their software in the ExaScale era.

This survey is being undertaken by the AllScale project funded by the European Commission under the Horizon 2020 Programme. Please respond by 16th March 2018.

Your response will be anonymous. A report summarising the results of this survey will be made available on the project web site http://www.allscale.eu in late April 2018.

* 1. Briefly describe the problem domain in which your application software is applied, for example computational chemistry, fluid dynamics or economics.

* 2. In which programming language(s) is/are your application(s) written?

Fortran

C++

Java

Assembler

Scripting languages, e.g. Python, Ruby

Other languages not listed

I do not wish to answer this question

* 3. Which of the following software technologies for parallel and distributed computing, if any, do you use in your application software today?

Choose all that apply.

HPX

Task based runtime (not HPX)

MPI

OpenMP

POSIX threads

Direct communication over IP sockets

Message queueing system (e.g. IBM MQ)

Other message passing harness not listed above

Another technology not listed above

None of the above

I do not wish to answer this question

* 4. Which of the following, if any, describe the principal mathematical solution kernels that dominate your application? Choose all that apply.

Linear algebra

Finite element methods

Numerical integration

Solving partial differential equations

Solving ordinary differential equations

Combinatorial techniques

Supervised machine learning

Unsupervised machine learning

Filtering data

None of the above

I do not wish to answer this question

* 5. Choose the option that best describes the numerical representation(s) that dominate your solution kernels.

Single precision IEEE 754 floating point

Double precision IEEE 754 floating point

Software emulated arbitrary precision

Arbitrary precision implemented in hardware (FPGA)

Integer

Mixture of single precision IEEE 754 floating point and integer

Mixture of double precision IEEE 754 floating point and integer

None of the above

I do not wish to answer this question

* 6. How many CPU cores do you use in a typical job execution today? Enter an integer value.

* 7. Consider the following statement and choose the most relevant answer.

The architects and developers of our application software are actively considering the challenges that ExaScale computing will bring to our code base if we are to gain maximum performance from it.

Yes

We have considered the challenges but have concluded that changing the code to address these is either too risky or would carry an economic cost that is too great for our organization

I do not wish to answer this question

* 8. Consider the following definitions of HARD and SOFT faults. Have you considered the impact of any of these faults on your code base, for example by implementing checks on the numerical accuracy of the computations? The questions that follow will ask for more details of any resiliency techniques that you have implemented.

Please choose the most relevant answer from the options below.

HARD faults, that is, faults due to hardware failure or data corruption, cause a running application to crash, so the application fails to return a valid result. A HARD fault could be caused, for example, by the failure of a single node within a multi-node distributed computation.

SOFT faults do not cause a program to crash and may or may not be detectable by examining the computed result. For example, a bit-flip that changes the value of an integer or floating point number can lead to an incorrect result, and it may or may not be obvious that the computed value is incorrect. SOFT faults are sometimes called silent data corruption (SDC).

SOFT faults can be TRANSIENT or PERSISTENT. An example of a TRANSIENT fault would be modification of the copy of the data held in a cache, leaving the value in RAM unmodified. Correspondingly, an example of a PERSISTENT fault would be modification of the data held in RAM, leaving the cached value unmodified.

My application software caters for neither HARD faults nor SOFT faults

My application software only attempts to detect failed nodes (HARD faults)

My application software attempts to detect both PERSISTENT and TRANSIENT SOFT faults

My application software attempts to detect only PERSISTENT SOFT faults

My application software attempts to detect only TRANSIENT SOFT faults

My application attempts to verify its final computed results but does not explicitly consider HARD and SOFT faults

I do not wish to answer this question
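
To illustrate the bit-flip example in the SOFT fault definition above, the following minimal Python sketch inverts a single bit of an IEEE 754 double precision value. The helper name flip_bit and the chosen bit positions are assumptions made purely for illustration; they do not come from the AllScale project or from any particular application.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 double encoding inverted.

    `flip_bit` is a hypothetical helper used only to illustrate a soft fault.
    Bit 0 is the least significant mantissa bit, bits 52..62 hold the
    exponent and bit 63 is the sign.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    # Invert the chosen bit and reinterpret the result as a double again.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


low = flip_bit(1.0, 0)    # lowest mantissa bit: a tiny, easily missed change
high = flip_bit(1.0, 62)  # highest exponent bit: a drastic change
print(low, high)
```

In this sketch, flipping the lowest mantissa bit changes 1.0 by about 2.2e-16, which a check on the final result is unlikely to notice, whereas flipping the highest exponent bit turns 1.0 into infinity, which is easy to detect.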

* 9. Resiliency against HARD faults can be provided by using a generic checkpoint/restart mechanism.
Does your application software employ checkpoint/restart mechanisms? If so, please describe them briefly.

* 10. Techniques exist to protect against some kinds of soft faults. Does your application use any of the following? Choose all that apply.

Computation of checksums on critical data items

Performing the same computation two or more times, perhaps on different nodes, and comparing results (often called redundant computation)

Checking numerical results against the mathematical/physical constraints of the problem (perhaps as simply as thresholding) and rolling back to a previous checkpoint

Applying numerical analysis techniques, where possible, to make the algorithm resilient to soft faults

Finite state machines

Other techniques not listed above

I do not wish to answer this question
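
As a rough illustration of the first two techniques listed in question 10, the Python sketch below computes a CRC32 checksum over a data buffer and runs the same computation twice to compare the results. The names checksum and redundant, the sample values and the simulated bit-flip are all assumptions chosen for illustration; a real application would typically run the redundant copies on different nodes.

```python
import zlib
from array import array


def checksum(buffer) -> int:
    """CRC32 checksum of a bytes-like object (hypothetical helper)."""
    return zlib.crc32(buffer)


def redundant(compute, runs: int = 2):
    """Run `compute` several times and fail if the results disagree.

    Hypothetical helper: here all runs execute in one process, which only
    mimics the idea of redundant computation on separate nodes.
    """
    results = [compute() for _ in range(runs)]
    if any(result != results[0] for result in results[1:]):
        raise RuntimeError("redundant runs disagree: possible soft fault")
    return results[0]


# Critical data item, stored as raw bytes so that it can be checksummed.
values = array("d", [1.0, 2.0, 3.0])
buffer = bytearray(values.tobytes())
stored = checksum(buffer)

# Redundant computation: both runs must agree before the result is used.
total = redundant(lambda: sum(values))

# Simulate a soft fault by inverting one bit of the buffer, then re-check.
buffer[0] ^= 0x01
fault_detected = checksum(buffer) != stored
print(total, fault_detected)
```

A CRC32 checksum detects every single-bit error in the protected buffer, so the simulated bit-flip is always caught here; redundant computation instead catches faults that corrupt a result while it is being computed, at the cost of doing the work more than once.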
