Primeur weekly 2016-08-22

Special

ExaCT team shows how Legion S3D code is a tribute to co-design on the way to exascale supercomputing ...

Focus

Sunway TaihuLight's strengths and weaknesses highlighted by Jack Dongarra ...

Exascale supercomputing

Big PanDA tackles Big Data for physics and other future extreme scale scientific applications ...

Computer programming made easier ...

Quantum computing

Cryptographers from the Netherlands win 2016 Internet Defense Prize ...

Focus on Europe

STFC Daresbury Laboratory to host 2016 Hands-on Tutorial on CFD using open-source software Code_Saturne ...

Middleware

Germany joins ELIXIR ...

Columbus Collaboratory announces CognizeR, an Open Source R extension that accelerates data scientists' access to IBM Watson ...

Cycle Computing optimizes NASA tree count and climate impact research ...

GPU-accelerated computing made better with NVIDIA DCGM and PBS Professional ...

Hardware

Mellanox demonstrates accelerated NVMe over Fabrics at Intel Developers Forum ...

Nor-Tech has developed the first affordable supercomputers designed to be used in an office, rather than a data centre ...

NVIDIA CEO delivers world's first AI supercomputer in a box to OpenAI ...

AMD demonstrates breakthrough performance of next-generation Zen processor core ...

CAST and PLDA Group demonstrate x86-compliant high compression ratio GZIP acceleration on FPGA, accessible to non-FPGA experts using the QuickPlay software defined FPGA development tool ...

IBM Research - Almaden celebrates 30 years of innovation in Silicon Valley ...

Wiring reconfiguration saves millions for Trinity supercomputer ...

Cavium completes acquisition of QLogic ...

Applications

Soybean science blooms with supercomputers ...

NOAA launches America's first national water forecast model ...

Computers trounce pathologists in predicting lung cancer type, severity, researchers find ...

Star and planetary scientists get millions of hours on EU supercomputers ...

Bill Gropp named acting director of NCSA ...

Latest NERSC/Intel/Cray dungeon session yields impressive code speed-ups ...

User-friendly language for programming efficient simulations ...

New book presents how deep learning neural networks are designed ...

Liquid light switch could enable more powerful electronics ...

Energy Department to invest $16 million in computer design of materials ...

Pitt engineers receive grant to develop fast computational modelling for 3D printing ...

Environmental datasets help researchers double the number of microbial phyla known to be infected by viruses ...

Teaching machines to direct traffic through deep reinforcement learning ...

Simulations by PPPL physicists suggest that magnetic fields can calm plasma instabilities ...

New material discovery allows study of elusive Weyl fermion ...

New maths to predict dangerous hospital epidemics ...

Kx financial analytics technology tackles Big Data crop research at biotech leader Earlham Institute ...

The Cloud

New hacking technique imperceptibly changes memory virtual servers ...

Latest NERSC/Intel/Cray dungeon session yields impressive code speed-ups


19 Aug 2016 Berkeley - Six application development teams participating in NESAP, NERSC's next-generation code optimization effort, gathered at Intel in early August for a marathon "dungeon" session designed to help tweak their codes for the next-generation Intel Xeon Phi Knights Landing (KNL) manycore architecture and NERSC's new Cori supercomputer.

Approximately 20 members of the NESAP teams made the journey and spent three days working closely with an additional 20 Intel engineers, tapping into their expertise in compilers, vectorization, KNL architecture and more. The teams represented six codes that were chosen last year for the NESAP programme: Quantum ESPRESSO, a materials modelling code; M3D-C1, a plasma simulation code; ACME, CESM and MPAS, all climate modelling codes; and Chombo, an adaptive mesh refinement code used in flow simulations.

This was the eighth dungeon session so far. The sessions give the NESAP teams unprecedented access to Intel and Cray engineers and their expertise. Among those attending from NERSC were Taylor Barnes, Brandon Cook, Jack Deslippe, Helen He, Thorsten Kurth, Tareq Malas, Andre Ovsyannikov and Woo-Sun Yang. There were also representatives from several other facilities, including Princeton Plasma Physics Laboratory, Argonne National Laboratory, RPI and NCAR.

"Each team came in with a set of goals, parts of their applications that they wanted to investigate at a very deep level", Jack Deslippe stated. "We try to prepare ahead of time to bring the types of problems that can only be solved with the experts at Intel and Cray present - deep questions about the architecture and how applications use the Xeon Phi processor. It's all geared toward optimizing the codes to run on the new manycore architecture and on Cori."

Jack Deslippe, Taylor Barnes and Thorsten Kurth joined engineers from Cray and Intel to work on the Quantum ESPRESSO application. The team achieved speed-ups of approximately 2x in the benchmark time, primarily by using OpenMP pragmas to improve thread scaling and the generation of vector and streaming instructions, and by employing cache blocking/tiling techniques where appropriate. (OpenMP is an application programming interface for shared-memory parallel programming.) They also investigated the explicit management of data structures within the KNL high-bandwidth memory and were able to achieve speed-ups beyond the KNL performance in "cache mode", where the high-bandwidth memory is instead configured as a cache.
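As a rough illustration of what cache blocking/tiling means, the sketch below restructures a plain matrix multiplication so that it works on small tiles at a time. It is not taken from Quantum ESPRESSO: the function names, the `tile` parameter and the choice of matrix multiplication are assumptions made for illustration, and Python is used only to show the loop structure. In a compiled code the tile size would be chosen so that the working set of one tile fits in cache.

```python
def matmul_naive(a, b):
    """Reference triple loop: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same arithmetic, visited tile by tile.

    The three outer loops pick a block of rows (ii), of the shared
    dimension (kk) and of columns (jj); the three inner loops stay
    inside that block, so the same small pieces of a, b and c are
    reused before the loops move on.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, m)):
                        a_ik = a[i][k]
                        for j in range(jj, min(jj + tile, p)):
                            c[i][j] += a_ik * b[k][j]
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    # Both orderings compute the same product.
    print(matmul_naive(a, b) == matmul_tiled(a, b, tile=2))
```

The tiled version performs exactly the same multiplications and additions as the naive one; only the order in which memory is touched changes, which is the point of the technique.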

Using OpenMP also led to performance improvements with the M3D-C1 code, according to Woo-Sun Yang. The M3D-C1 team brought two source kernels to the dungeon session and focused on two key aspects of the code: optimizing the matrix assembly stage and testing particle-in-cell (PIC) codes within M3D-C1. For matrix assembly, they used the Intel Math library to streamline one of the most time-consuming processes and restructured some functions to eliminate overhead and bad speculations, which led to an overall 2.8x speed-up. They also parallelized the code using OpenMP and saw good parallel scaling for up to 68 cores on a KNL node. Adding OpenMP was "a very optimal way of optimizing this code", Woo-Sun Yang stated, allowing the team to achieve almost perfect parallel scaling.

They achieved good code optimization in the PIC code, too. "By looking at the performance profiling, where the code spends most of its time, we affected one major function and restructured the code to optimize that part", he explained. "This kernel has about 200,000 particles, and initially it was running around 200 seconds, but after doing a series of optimizations, we cut the runtime to 50 seconds - a 3.9x speed-up."
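The figures quoted for M3D-C1 can be related to two simple ratios, sketched below. The function names are illustrative, not part of any NERSC tool. With the rounded timings quoted (about 200 seconds before, 50 seconds after) the speed-up comes to 4.0x, which is consistent with the reported 3.9x given that the starting time is only approximate. Parallel efficiency, the speed-up divided by the number of cores, is one common way to express "almost perfect parallel scaling"; the 68-core timings used in the example are hypothetical, since the article gives none.

```python
def speedup(time_before, time_after):
    """Ratio of the original runtime to the optimized runtime."""
    if time_before <= 0 or time_after <= 0:
        raise ValueError("runtimes must be positive")
    return time_before / time_after


def parallel_efficiency(time_serial, time_parallel, n_cores):
    """Speed-up divided by core count; 1.0 means perfect scaling."""
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return speedup(time_serial, time_parallel) / n_cores


if __name__ == "__main__":
    # Rounded timings quoted for the PIC kernel.
    print(speedup(200.0, 50.0))
    # Hypothetical timings for a run on 68 cores (not from the article).
    print(parallel_efficiency(680.0, 10.5, 68))
```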

The CESM and ACME teams came away from the latest dungeon session with code improvements as well, according to Helen He. One of their goals was to understand process and thread affinity, which is the basis for getting optimal performance on KNL and for guiding further performance optimizations for all NESAP teams.

During a mini-dungeon session at NERSC in July, Helen He discovered that, for the CESM code, using the OpenMP4 standard affinity settings was slower than using the Intel compiler-specific KMP affinity settings, which was not the case for other applications the team had tested. She then ran seven test scenarios of the CESM code, which in theory should all have performed the same, but found that some ran twice as slowly. At this most recent dungeon session, Helen He worked closely with Intel engineers - with special thanks to Karthik Raman and Larry Meadows. In the end, they identified a glitch in the code that was causing two extra threads to be spawned at the nested OpenMP level. Once the glitch was fixed, all seven cases ran at the same speed.

This understanding is also directly applied to getting the best process and thread affinity for nested OpenMP, which both the CESM and ACME teams use to achieve better performance than with single-level OpenMP. The ACME team also looked at optimizing its code with vectorization and threading, and achieved a 35% speed-up for one of the key functions.
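For readers unfamiliar with the two kinds of affinity settings compared above, the sketch below lists typical environment variables for each and adds a small sanity check on thread counts. The variable names (`OMP_NUM_THREADS`, `OMP_PROC_BIND`, `OMP_PLACES`, `OMP_NESTED`, `KMP_AFFINITY`) are real, but the values shown, the helper function names, the 16 processes per node and the figure of 4 hardware threads per KNL core are assumptions for illustration; the article does not say which settings the CESM and ACME teams used.

```python
# Portable settings defined by the OpenMP 4 standard (values are illustrative).
standard_env = {
    "OMP_NESTED": "true",             # enable nested parallel regions
    "OMP_NUM_THREADS": "4,2",         # 4 outer threads, 2 inner threads each
    "OMP_PROC_BIND": "spread,close",  # one binding policy per nesting level
    "OMP_PLACES": "threads",          # bind to individual hardware threads
}

# Intel-compiler-specific alternative (values are illustrative).
intel_env = {
    "OMP_NESTED": "true",
    "OMP_NUM_THREADS": "4,2",
    "KMP_AFFINITY": "compact,granularity=fine",
}


def nested_thread_total(omp_num_threads):
    """Threads one process may create for a value such as "4,2":
    the product of the per-level thread counts."""
    total = 1
    for level in omp_num_threads.split(","):
        total *= int(level)
    return total


def fits_on_node(processes_per_node, omp_num_threads, hardware_threads):
    """True if the requested threads do not exceed the hardware threads."""
    needed = processes_per_node * nested_thread_total(omp_num_threads)
    return needed <= hardware_threads


if __name__ == "__main__":
    # Assumption: a 68-core KNL node with 4 hardware threads per core.
    knl_hardware_threads = 68 * 4
    print(nested_thread_total(standard_env["OMP_NUM_THREADS"]))
    print(fits_on_node(16, standard_env["OMP_NUM_THREADS"], knl_hardware_threads))
```

A check of this kind only compares the requested thread count with the hardware. It would not by itself reveal threads spawned unintentionally inside the code, which in the case described above required profiling with Intel's engineers.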

"The ACME and CESM teams appreciated the expert knowledge of the Intel people who were at the dungeon session and the guidance and tools they provided", Helen He stated. "As one team member said, 'now I am weaponized'."

The dungeon sessions are also proving beneficial for Intel, according to Mike Greenfield, director of technical computing engineering in Intel's Developer Relations Division. While the dungeon process is not new to the company's product development practices, inviting a customer to participate in the sessions is, he added, emphasizing that the collaboration with NERSC and Cray has been "a very positive experience".

"Working with NERSC is a unique opportunity for us because most of the applications NERSC has targeted are of interest to the worldwide HPC community", Mike Greenfield stated. "It's not just getting something to run on KNL; it's about getting in-depth developer feedback, accelerating product readiness and advancing the scalability of applications that are of great interest to the HPC community. Through this collaboration, we want to help NERSC be even better positioned to get more value from its huge commitment to this next-generation system."

The NESAP teams are now busy updating their production codes with the improvements they achieved during the latest dungeon session and looking forward to the next dungeon session, which will likely take place in the fall.

"Most everyone who has attended these sessions has come out with not just faster code but a much deeper knowledge of what it takes to optimize code and how the architecture and the processors work, which I think is invaluable", Jack Deslippe stated.

Source: National Energy Research Scientific Computing Center - NERSC
