Back to Table of contents

Primeur weekly 2013-07-08

Special

Exploiting parallelism on Titan to reach exascale performance, eventually ...

The Cloud

A virtual chef-nutritionist for old people ...

BRABUS draws on IBM to deliver stylish upgrades for automobiles, planes and yachts through the Cloud ...

New hardware design protects data in the cloud ...

Oracle enhances Cloud management with new third generation release of Oracle Enterprise Manager 12c ...

Oracle announces general availability of Oracle Database 12c, the first database designed for the Cloud ...

New Oracle Database 12c innovations engineered with Oracle Storage Systems deliver next level of database-to-storage performance and efficiency ...

Rackspace and CERN openlab collaborate to deliver "Big Bang" with Hybrid Cloud ...

EuroFlash

Megware supercomputer at the University of Bayreuth secures TOP500 place ...

High Performance Computing facility to receive Government funding of 8 million euro ...

RSC supercomputers continue to lead in energy efficiency among Russian HPC systems in Green500 ...

Pittsburgh Supercomputing Center and Numascale AS to collaborate on improved memory systems for research ...

Bull launches the 2013 Bull-Joseph Fourier Prize, aimed at boosting the development of computer simulation ...

Kees Neggers honoured as an Internet Pioneer in the Internet Hall of Fame ...

World record in silicon integrated nanophotonics - More energy efficiency in the data communication ...

SURFnet and Russian Skoltech embark on joint e-Infrastructure project ...

USFlash

Parallella: an open source hardware project ...

Pleiades supercomputer to be augmented with next-generation SGI ICE-X systems ...

2014 Pennsylvania State Budget includes $500,000 for Pittsburgh Supercomputing Center ...

Senator Durbin leads dedication of new Mira supercomputer at Argonne National Laboratory ...

Microscopy technique could help computer industry develop 3D components ...

Cyclica Inc. is awarded access to IBM's Blue Gene/Q supercomputer to repurpose FDA approved drugs ...

Texas Advanced Computing Center deploys 20PB Big Data hub using DataDirect Networks High-Performance Storage System ...

More of the world's TOP500 supercomputers trust DataDirect Networks for best analytics and simulation performance, scale and lowest TCO ...

Indiana University to take lead in Defense Department effort securing software-defined networks ...

Graphene-based system could lead to improved information processing ...

NSF and Mozilla announce breakthrough applications on a faster, smarter internet of the future ...

Titan completes acceptance testing ...

Exploiting parallelism on Titan to reach exascale performance, eventually


20 Jun 2013 Leipzig - In the Exascale session at the ISC'13 event in Leipzig, Buddy Bland, Project Director at the Oak Ridge Leadership Computing Facility, explained how the Titan supercomputer can help teach HPC developers to reach exascale performance. Titan, hosted at Oak Ridge National Laboratory (ORNL), is a hybrid system with a peak performance of 27.1 petaflops. The Cray XK7 system, which pairs AMD Opteron CPUs with GPU accelerators, is an upgrade of the previous no. 1 Jaguar system and has 18,688 compute nodes. Over the past 28 years at ORNL, systems have scaled from 64 cores to hundreds of thousands of cores and millions of simultaneous threads of execution, and those 28 years of application development have been about finding ways to exploit that parallelism, Buddy Bland told the audience.

At ORNL, the Center for Accelerated Application Readiness (CAAR) was created as part of the Titan project to help prepare applications for accelerated architectures. The goals are to work with code teams to develop and implement strategies for exposing hierarchical parallelism in users' applications, to maintain code portability across modern architectures, and to learn from and share the results.

Buddy Bland's colleagues selected six applications, covering materials science, molecular dynamics, climate change, astrophysics, combustion, and nuclear energy.

CAAR assigned a comprehensive team to each application, and a single early-science problem was targeted for each. The developers set up a particular plan of attack, different for each application, and multiple acceleration methods were explored during the process, as Buddy Bland explained.

For the climate change application, CAM-SE, a pervasive and widespread custom acceleration was required. A new vertical remapping algorithm for GPU efficiency has been coded. The tracer transport has been ported to GPUs, threaded over bases, elements, vertical levels, and tracers.

The code for the materials science application, Wang-Landau LSMS, combines classical statistical mechanics (WL) for atomic magnetic moment distributions with first-principles calculations (LSMS) of the associated energies, as Buddy Bland explained. The main computational effort is dense linear algebra on complex numbers. The code is C++, with F77 used for some computational routines.

The performance strategy is to leverage accelerated linear algebra through cuBLAS and specialized CUDA, with parallelization over WL Monte-Carlo walkers and over atoms via MPI processes, and the use of OpenMP in the CPU sections.

Denovo is the application for nuclear reactor modelling, which uses a 3D sweep algorithm. The parallelisation strategy was deployed as follows, according to Buddy Bland. Denovo was restructured to provide a new axis of parallelism for better cross-node scaling and GPU threading. The 3D sweep algorithm was rewritten for the GPU using CUDA, exploiting multiple problem dimensions for threading, the speaker told the audience. The CAAR team also restructured the loops to optimize for memory locality.

The LAMMPS application for simulation of materials, soft matter, and biomolecules in molecular dynamics was accelerated for neighbour-list builds, short-range force calculation, and long-range electrostatics, explained Buddy Bland. Concurrent calculations are executed on the CPU and GPU to overlap data transfer with independent calculations. MPI tasks share the GPU to allow for an efficient hybrid code.

The S3D all-MPI combustion code was first refactored into a hybrid application using OpenMP on the node and MPI between nodes, Buddy Bland showed. Once the hybrid code was running faster than the all-MPI version, OpenACC was employed to move the major computation to the accelerator. Optimizations were performed to overlap the accelerator computation with the MPI message passing and host computation.

Buddy Bland asked how effective GPUs are on scalable applications. The performance ratio of the Cray XK7 versus the Cray XE6 is 7.4 for LAMMPS, 2.2 for S3D, 3.8 for Denovo, and 3.8 for WL-LSMS, Buddy Bland showed the audience, but the performance depends strongly on the specific problem size chosen.

All codes will need rework to scale, Buddy Bland warned. Porting each code from Jaguar to Titan takes one to two person-years. This is real work, but it is an unavoidable step if developers want to reach exascale, regardless of the type of processor: it follows from the level of parallelism required on the node.

Buddy Bland also explained that the effort pays off on other systems: the ported codes often run significantly faster even with CPUs only.

The CAAR team estimates that 70% to 80% of the developer time is spent in code restructuring, regardless of whether OpenMP, CUDA, OpenCL, or OpenACC is used. Each code team must make its own choice between OpenMP, CUDA, OpenCL, and OpenACC, based on the specific case, as Buddy Bland explained, and the conclusion may be different for each code. The users and their sponsors must plan for this expense, Buddy Bland warned.

At the extreme scale, all codes need error recovery, Buddy Bland stated. Simple checkpoint/restart is a minimum. At the scale of Titan, the developers are seeing several node failures per day. Jobs running on the full system for several hours should expect a node to fail during execution and be prepared to recover, Buddy Bland explained.

More advanced error detection and recovery techniques will be required as parallelism increases, the speaker expected. FT-MPI algorithms that can ignore faults, and other research techniques for error containment and recovery, will be mandatory for larger systems.

Buddy Bland said that the developers have to rethink their algorithms. Heterogeneous architectures can make previously infeasible or inefficient models and implementations viable. Alternative methods for electrostatics that perform slower on traditional x86 can be significantly faster on GPUs, Buddy Bland told the audience.

Three-body coarse-grain simulations of water with greater concurrency can deliver more than 100 times faster simulation rates than the fastest atomistic models, even though both are run on the GPUs, Buddy Bland explained.

The developers must adopt a richer programming environment. Tools are critical to success. With complex hierarchical parallelism and heterogeneous processors, the days of debugging with print statements are over, stated Buddy Bland. Now, developers should invest in good tools - debuggers, performance analysis, memory analysis - and in the training on how to use them. The programmers and user services teams need to become experts on these, Buddy Bland insisted.

Buddy Bland concluded that science codes are under active development. Porting to the GPU can mean pursuing a "moving target", he warned, and is challenging to manage. More available FLOPS on the node should lead researchers to think of the new science opportunities enabled, for example, more degrees of freedom per grid cell. Programmers may need to look in unconventional places to find the additional 30x thread parallelism that may be needed for exascale, Buddy Bland ended.

Leslie Versweyveld
