At ORNL, the Center for Accelerated Application Readiness (CAAR) was created as part of the Titan project to help prepare applications for accelerated architectures. Its goals are to work with code teams to develop and implement strategies for exposing hierarchical parallelism in user applications, to maintain code portability across modern architectures, and to learn from and share the results.
Buddy Bland's colleagues have selected six applications, spanning materials science, molecular dynamics, climate change, astrophysics, combustion, and nuclear energy.
CAAR assigns a comprehensive team to each application, and a single early-science problem has been targeted for each one. The developers have set up a particular plan of attack for each application, and multiple acceleration methods have been explored during the process, as Buddy Bland explained.
For the climate change application, CAM-SE, pervasive custom acceleration was required. A new vertical remapping algorithm was coded for GPU efficiency, and the tracer transport was ported to GPUs, threaded over basis points, elements, vertical levels, and tracers.
The materials science application, Wang-Landau LSMS (WL-LSMS), combines classical statistical mechanics (Wang-Landau) for sampling atomic magnetic moment distributions with first-principles calculations (LSMS) of the associated energies, as Buddy Bland explained. The main computational effort is dense linear algebra on complex numbers. The code is written in C++, with F77 used for some computational routines.
The performance strategy is to leverage the accelerated linear algebra library cuBLAS and specialised CUDA kernels, with parallelisation over Wang-Landau Monte-Carlo walkers and over atoms via MPI processes, and the use of OpenMP in CPU sections.
Denovo, the application for nuclear reactor modelling, uses a 3D sweep algorithm. According to Buddy Bland, the parallelisation strategy was deployed as follows. Denovo was restructured to provide a new axis of parallelism for better cross-node scaling and GPU threading. The 3D sweep algorithm was rewritten for the GPU using CUDA, exploiting multiple problem dimensions for threading, the speaker told the audience. The CAAR team also restructured the loops to optimize for memory locality.
The LAMMPS application, which simulates materials, soft matter, and biomolecules with molecular dynamics, has been accelerated for neighbour-list builds, short-range force calculation, and long-range electrostatics, explained Buddy Bland. Concurrent calculations are executed on the CPU and GPU to overlap data transfer with independent computation, and MPI tasks share the GPU to allow for an efficient hybrid code.
The all-MPI S3D combustion code was first refactored into a hybrid application using OpenMP on the node and MPI between nodes, Buddy Bland showed. Once the hybrid code ran faster than the all-MPI version, OpenACC was employed to move the major computation to the accelerator. Optimizations were then performed to overlap the accelerator computation with the MPI message passing and host computation.
Buddy Bland then asked how effective GPUs are on scalable applications. The performance ratio of the Cray XK7 versus the Cray XE6 is 7.4 for LAMMPS, 2.2 for S3D, 3.8 for Denovo, and 3.8 for WL-LSMS, he showed the audience, but the performance depends strongly on the specific problem size chosen.
All codes will need rework to scale, Buddy Bland warned: porting each code from Jaguar to Titan takes one to two person-years. This is substantial work, but it is an unavoidable step if the developers want to reach exascale, regardless of the type of processor, because it comes from the level of parallelism required on the node.
Buddy Bland also explained that the effort pays off on other systems: the ported codes often run significantly faster even on CPU-only machines.
The CAAR team estimates that 70% to 80% of the developer time is spent in code restructuring, regardless of whether OpenMP, CUDA, OpenCL, or OpenACC is used. Each code team must make its own choice among OpenMP, CUDA, OpenCL, and OpenACC based on the specific case, as Buddy Bland explained, and the conclusion may differ for each code. The users and their sponsors must plan for this expense, Buddy Bland warned.
At the extreme scale, all codes need error recovery, Buddy Bland stated; simple checkpoint/restart is a minimum. At the scale of Titan, the developers are seeing several node failures per day. Jobs running on the full system for several hours should expect a node to fail during execution and be prepared to recover, Buddy Bland explained.
More advanced error detection and recovery techniques will be required as parallelism increases, the speaker expected. Fault-tolerant MPI (FT-MPI), algorithms that can ignore faults, and other research techniques for error containment and recovery will be mandatory for larger systems.
Buddy Bland said that developers have to rethink their algorithms: heterogeneous architectures can make previously infeasible or inefficient models and implementations viable. Alternative methods for electrostatics that run slower on traditional x86 processors can be significantly faster on GPUs, Buddy Bland told the audience.
Three-body coarse-grained simulations of water with greater concurrency can achieve simulation rates more than 100 times those of the fastest atomistic models, even though both run on GPUs, Buddy Bland explained.
The developers must also adopt a richer programming environment, because tools are critical to success. With complex hierarchical parallelism and heterogeneous processors, the days of debugging with print statements are over, stated Buddy Bland. Developers should now invest in good tools: debuggers, performance analysis, memory analysis, and training in how to use them. The programmers and the user services team need to become experts in these, Buddy Bland insisted.
Buddy Bland concluded that science codes are under active development, so porting to the GPU can mean pursuing a "moving target", he warned, and can be challenging to manage. More available FLOPS on the node should lead researchers to think of the new science opportunities enabled, for example, more degrees of freedom per grid cell. Programmers may need to look in unconventional places to find the additional 30x thread parallelism that may be needed for exascale, Buddy Bland ended.