Jack Dongarra described the use of synchronization-reducing and communication-reducing algorithms, as well as mixed-precision methods. Developers need to work with fault-resilient algorithms and implement algorithms that can recover from failures.
The strategies used are checkpoint/restart, diskless checkpointing, and algorithm-based fault tolerance (ABFT). Before the factorization starts, a checksum is taken and algorithm-based fault tolerance is used, explained Jack Dongarra.
For the dense factorization, the C matrix carries a checksum. The stated overhead covers one failure; for multiple failures, you have to multiply it by the number of failures you want to protect against.
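To make the checksum idea concrete, here is a minimal sketch in pure Python (hypothetical, illustrative code, not the talk's actual implementation): an extra row of column sums is appended to the matrix before computation, and a single lost row can later be rebuilt from the survivors and the checksum.

```python
# Hypothetical ABFT-style checksum sketch: real ABFT protects tiled
# factorizations and maintains the checksum through the computation.

def add_checksum_row(matrix):
    """Append a row holding the column sums of the matrix."""
    ncols = len(matrix[0])
    checksum = [sum(row[j] for row in matrix) for j in range(ncols)]
    return matrix + [checksum]

def recover_row(protected, lost_index):
    """Rebuild a single lost row from the surviving rows and the checksum."""
    checksum = protected[-1]
    survivors = [row for i, row in enumerate(protected[:-1])
                 if i != lost_index]
    return [checksum[j] - sum(row[j] for row in survivors)
            for j in range(len(checksum))]

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
protected = add_checksum_row(A)
print(recover_row(protected, 1))  # rebuilds the lost row [3.0, 4.0]
```

One checksum row protects against one lost row; protecting against k simultaneous failures requires k independent checksum rows, which is where the multiplied overhead mentioned above comes from.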
The Fault Tolerant Linear Algebra software package extends the ScaLAPACK codebase. It will be released next September and will support checkpoint on failure.
Jack Dongarra thinks developers should implement ABFT for dense factorizations with minimal middleware support, and enable ABFT recovery on existing MPI implementations. Checkpoint on Failure (CoF) is a hybrid of rollback recovery and ABFT.
The MPI requirements are to return control to the application after a failure and to allow termination after the checkpoint.
ABFT with checkpoint on failure runs on today's unmodified MPI implementations. There are three execution runs.
The overhead is about 3%.
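The CoF flow described above can be sketched conceptually (names, return values, and control flow here are illustrative stand-ins, not the actual MPI or ULFM API): the first run proceeds until a failure, MPI returns control instead of aborting, the survivors checkpoint and terminate, and a fresh run reloads the checkpoints and rebuilds the lost rank's data via the ABFT checksum.

```python
# Conceptual Checkpoint-on-Failure sketch; each rank's "data" is a toy
# value (rank * 10) standing in for its share of the factorization.

def run_with_cof(ranks, fail_rank, checkpoints=None):
    """One execution run: compute per-rank data, or resume from checkpoints."""
    data = dict(checkpoints) if checkpoints else {}
    for r in ranks:
        if checkpoints is None and r == fail_rank:
            # MPI returns control after the failure instead of aborting;
            # the surviving ranks write their checkpoints and terminate.
            survivors = {s: s * 10 for s in ranks if s != fail_rank}
            return "failed", survivors
        data.setdefault(r, r * 10)
    return "done", data

ranks = [0, 1, 2, 3]
status, ckpt = run_with_cof(ranks, fail_rank=2)   # first run hits a failure
if status == "failed":
    ckpt[2] = 2 * 10  # stand-in for checksum-based ABFT recovery of rank 2
    status, data = run_with_cof(ranks, fail_rank=None, checkpoints=ckpt)
print(status)  # "done" after the restarted run
```

The point of the protocol is that neither run needs a fault-tolerant MPI: the first run only needs MPI to return control on failure, and the second is an ordinary run seeded from the checkpoints.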
There is a new generation of dense linear algebra (DLA) software. Software algorithms follow hardware evolution in time, according to Jack Dongarra: we had LINPACK in the seventies, LAPACK in the eighties, and ScaLAPACK in the nineties. Today, there are new many-core-friendly algorithms, known as PLASMA, and hybrid algorithms, known as MAGMA.
Jack Dongarra introduced the parallel linear algebra software for multicore/hybrid architectures to the audience. The parallel runtime scheduler and execution controller is PaRSEC, which executes a dataflow representation of a program. The scheduler provides automatic load balancing between cores and harnesses the power of the available hardware.
Another feature is runtime DAG scheduling: every process holds the symbolic DAG representation, with background remote data transfers.
Task affinity in PaRSEC works as follows: within each node, task scheduling on hardware resources is decided dynamically.
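The dataflow idea behind such a scheduler can be illustrated with a minimal sketch (hypothetical code; PaRSEC's real scheduler uses compact symbolic DAGs, per-core queues, and remote data transfers): tasks become runnable the moment their inputs are produced, with no global synchronization between steps.

```python
# Minimal dataflow-style DAG scheduler sketch.
from collections import deque

def schedule(tasks, deps):
    """Run tasks as their dependencies complete.

    tasks: {name: callable}; deps: {name: [prerequisite names]}.
    Returns the order in which tasks actually ran.
    """
    remaining = {t: set(deps.get(t, [])) for t in tasks}
    ready = deque(t for t, d in remaining.items() if not d)
    order = []
    while ready:
        t = ready.popleft()  # a real scheduler picks per core, dynamically
        tasks[t]()
        order.append(t)
        for u, d in remaining.items():
            if t in d:
                d.remove(t)
                if not d and u not in order and u not in ready:
                    ready.append(u)
    return order

# Tile-factorization-like shape: a panel task unlocks two updates,
# which together unlock a trailing update (names are illustrative).
log = []
tasks = {n: (lambda n=n: log.append(n))
         for n in ["getrf", "trsm_r", "trsm_c", "gemm"]}
deps = {"trsm_r": ["getrf"], "trsm_c": ["getrf"],
        "gemm": ["trsm_r", "trsm_c"]}
print(schedule(tasks, deps))  # ['getrf', 'trsm_r', 'trsm_c', 'gemm']
```

Because readiness is decided at runtime from the dataflow alone, the two independent update tasks could run on any free core in either order, which is exactly the dynamic, per-node scheduling decision described above.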
Jack Dongarra also talked about hybrid clusters of accelerators. The Keeneland system has three GPU accelerators per node but suffers a severe computation/bandwidth provisioning imbalance, achieving 75% of ideal scaling and 60% of GEMM peak.
The energy used depends on the number of cores. Energy efficiency improves by up to 62% when using high-performance tuned scheduling, explained Jack Dongarra.