ExaFLOW's memory-conservative approach to resilience in CFD tools leads to fault tolerance at exascale


ULFM enables a spare node to be enrolled into the simulation to replace a failed node without needing to restart the simulation.
22 May 2017 London - Chris Cantwell explains on the ExaFLOW blog how a memory-conservative approach to resilience in CFD tools leads to fault tolerance in exascale supercomputing. He says that algorithms and software for exascale need to be developed with resilience in mind and designed to tolerate failures when they occur. The ExaFLOW project has been examining how this might be achieved in computational fluid dynamics (CFD) without adversely affecting the performance or scalability of the code. One particular concern at exascale is the amount of memory per processor, which is currently on a downward trend. ExaFLOW has therefore been seeking solutions which provide resilience in a memory-conservative manner.
Ad Emmen