Chris Cantwell explains on the ExaFLOW blog how a memory-conservative approach to resilience in CFD tools can lead to fault tolerance in exascale supercomputing. He says that algorithms and software for exascale need to be developed with resilience in mind and designed to tolerate failures when they occur. The ExaFLOW project has been examining how this might be achieved for computational fluid dynamics without adversely affecting the performance or scalability of the code. One particular concern at exascale is the amount of memory per processor, which is currently on a downward trend. ExaFLOW has therefore been seeking solutions that provide resilience in a memory-conservative manner.