Is HPC software ready for the ExaScale computing era?

As we approach the ExaScale computing era, the failure of compute nodes will become commonplace. This survey aims to determine how software developers in the field of high-performance computing view fault tolerance today, if at all, and how they plan to incorporate fault tolerance into their software in the ExaScale era.
 
This survey is being undertaken by the AllScale project, funded by the European Commission under the Horizon 2020 Programme. Please respond by 16th March 2018.
 
Your response will be anonymous. A report summarising the results of this survey will be made available on the project web site http://www.allscale.eu in late April 2018.
 

* 1. Briefly describe the problem domain in which your application software is applied, for example computational chemistry, fluid dynamics or economics.

* 2. In which programming language(s) is/are your application(s) written?

* 3. Which of the following software technologies that enable parallel and distributed computing, if any, do you use in your application software today?

Choose all that apply.

* 4. Which of the following, if any, describe the principal mathematical solution kernels that dominate your application? Choose all that apply.

* 5. Choose the option that best describes the numerical representation(s) that dominate your solution kernels.

* 6. How many CPU cores do you use in a typical job execution today? Enter an integer value.

* 7. Consider the following statement and choose the most relevant answer.

The architects and developers of our application software are actively considering the challenges that achieving maximum performance on ExaScale systems will bring to our code base.

* 8. Consider the following definitions of HARD and SOFT faults. Have you considered the impact of any of these faults on your code base, for example by implementing checks on the numerical accuracy of the computations? The questions that follow will ask for more details of any resiliency techniques that you have implemented.

Please choose the most relevant answer from the options below.

HARD faults, that is, faults due to hardware failure or data corruption, cause a running application to crash, which means that the application fails to return a valid result. This could be caused, for example, by the failure of a single node within a multi-node distributed computation.

SOFT faults do not cause a program to crash and may or may not be detectable by examining the computed result. For example, a bit-flip that changes the value of an integer or floating point number can lead to an incorrect result being computed, and it may or may not be obvious that an incorrect value has been computed. Such faults are sometimes called silent data corruption (SDC).

SOFT faults can be TRANSIENT or PERSISTENT. An example of a TRANSIENT fault would be modification of the data copy held in a cache, leaving the value in RAM unmodified. Correspondingly, an example of a PERSISTENT fault would be modification of the data held in RAM, leaving the cache value unmodified.

* 9. Resiliency against HARD faults can be provided by using a generic checkpoint/restart mechanism.
Does your application software employ checkpoint/restart mechanisms? If so, please describe them briefly.

* 10. Techniques exist to protect against some kinds of SOFT faults. Does your application use any of the following? Choose all that apply.

T