Back to Table of contents

Primeur weekly 2013-09-02

Focus

Using one Cloud for Cloud computing today? Perhaps two Clouds? In the future there will be MultiClouds. An interview with Dana Petcu of the West University of Timisoara ...

The Cloud

VMware delivers vCloud Hybrid Service ...

HP and SAP collaborate to deliver HP As-a-Service solution for SAP HANA ...

Dell delivers innovative Cloud and virtualization solutions blending VMware technology with Dell end-to-end offerings ...

New FICO software enables data scientists to transform Big Data into predictive models ...

Smart is the new green in Latin America ...

VMware management solutions demonstrated on stage at VMworld ...

HP and Yorkshire Building Society boost agility for financial services providers ...

IBM closes acquisition of CSL International ...

University of California deploys Huawei's Cloud storage system to support computational astrophysics research ...

Desktop Grids

Neutron stars in the computer Cloud ...

EuroFlash

Video Design brings Dataton WATCHOUT and WATCHPAX to InfoComm India 2013 ...

New UK/USA agreement will use high performance computing to boost economic competitiveness ...

Immanuel Bloch is to receive the Körber Prize 2013 ...

Light on twenty-year-old mystery ...

University of Tübingen physicists create interface between atoms and superconductors ...

USFlash

Allinea tools help Canada close innovation gap ...

Dell furthers commitment to Asia with transformative technology solutions from the data centre to the end device ...

Control scheme dynamically maintains unstable quantum system ...

Hadoop release 2.1.0-beta available ...

Mapping the Earth's magnetosphere to predict and prepare for space-weather events ...

EMC and VMware join forces around software-defined storage and VMware virtual SAN ...

HP and VMware enable customers to unify data centre networks ...

OSC hosts first meeting of the MVAPICH Users Group ...

Vampir takes a bite out of inefficiency as codes run on bigger supercomputers ...

HP helps HoneyBaked Ham of Georgia manage seasonal demand via tenfold performance boost ...

Vampir takes a bite out of inefficiency as codes run on bigger supercomputers


22 Aug 2013 Oak Ridge - At the Oak Ridge Leadership Computing Facility (OLCF), scientists and engineers run programmes, or application codes, on America's fastest supercomputer. Called Titan, the supercomputer is a marvel of engineering with 18,688 graphic processing units and 299,008 central processing unit (CPU) cores working in concert at a peak speed of 27 quadrillion calculations per second. The codes task the processors to model hurricanes and earthquakes that put people and property at risk, simulate combustion instabilities in power plant turbines and vehicle engines, predict properties of advanced materials, and spur other discoveries and innovations. Titan accelerates breakthroughs in fields in which major advances would not be probable or possible without massively parallel computing.

That said, researchers usually don't begin their computations on Titan. They often start with scaled-down problems run on computing clusters at their home institutions or on partitions of supercomputers that provide access to hundreds of processors at most. They add detail to their models to achieve greater realism, and eventually they need to run the simulation on as many processors as possible to achieve the requisite resolution and complexity. They come to a leadership-class computing facility to do so. At the OLCF they may use a supercomputer with 100 times as many processors as they’ve used before, and they're expecting their code to run 100 times as fast. But sometimes they find the code runs just three times as fast. What went wrong?

Computer scientists have developed tools to diagnose the millions of things that can go wrong when a programme runs through its routines. In May 2012 a team from Oak Ridge National Laboratory (ORNL), Argonne National Laboratory, and the Technische Universität Dresden used Vampir, a software development toolset of the University of Dresden, to analyze a running application in detail as it executed on all 220,000 CPU processors of Titan's predecessor, Jaguar.

"Understanding code behaviour at this new scale with Vampir is huge", stated ORNL computer scientist Terry Jones, who leads system software efforts and works with tools that optimize codes to run on leadership-class supercomputers. Vampir traces events and translates the files into diagrams, charts, time lines, and statistics that help experts locate performance bottlenecks in codes. "For people that are trying to build up a fast leadership-class programme, we've given them a very powerful new tool to trace events at full size because things happen at larger scale that just don't happen at smaller scale", Terry Jones stated.

The research team first presented its findings in June 2012 at the High-Performance Parallel and Distributed Computing conference in Delft, Netherlands. The work was published on-line June 20, 2013, inCluster Computing. Before this research effort, Vampir could perform this kind of detailed analysis on, at most, 86,400 cores. Large supercomputers use a version of Message Passing Interface (MPI) to communicate between processors. Vampir started as a tool for Visualization and Analysis of MPI Resources (hence the name) but today supports parallelization paradigms beyond MPI, such as OpenMP, Pthreads, and Cuda.

Performance analysis tools explore what failed where and when. Vampir can troubleshoot Titan-sized machines by setting aside part of the memory of each core so software developers can trace events. A researcher indicates which part of a code is a potential problem area containing performance or logic bugs, and Vampir "turns on" when it enters that part of the code, records details about events, and "turns off" when it leaves that part of the code. The researcher can zoom in on events and identify problems in detail.

"We made it so that when somebody gets access to Jaguar or Titan, if they need to collect information at the scale of the full machine, with Vampir that’s now possible", Terry Jones stated. "They don't have to collect information on one-eighth of the machine and look at that and scratch their heads wondering how things will change as the code scales to run on increasing numbers of processors. They get to, as part of their code development, do the testing at the full size of the machine."

Jaguar was no. 1 on the TOP500 list of world’s fastest supercomputers from November 2009 to November 2010 and in 2012 was transformed through a series of upgrades to become Titan, which headed the November 2012 list. About 1,100 scientists and engineers use the OLCF's leadership-class machines each year. These researchers, whose backgrounds are often in such fields as physics, biology, chemistry, and Earth science, may need to call on OLCF computer scientists if their codes have trouble scaling up.

One example of an issue that can go unnoticed at small scales but cause trouble at large scales is overuse of barriers, which block any portion of the work being done in parallel from proceeding past a given point until all work has reached that same point. In an analogy in which chefs are mass-producing cakes in a shared kitchen, Terry Jones said, a barrier is equivalent to the chief baker saying, "I can say with certainty that all chefs have their flour, salt, and other dry goods in their bowls at this point." Added Terry Jones, "For parallel applications running on hundreds of thousands of cores, barrier operations can be very computationally expensive, but you don't notice them if they’re running on only 20 or 30 cores. It's easy and tempting to sprinkle unnecessary barriers into the code while you're parallel programming - I'll throw in a barrier here, I'll throw in a barrier there, and there - and people have too many barriers in their code." The programmer would need to determine if that barrier is really needed and, if so, if anything can be done to speed up operations in that part of the code.

Vampir dramatically increased the ability to capture what's going on inside a running programme, allowing Titan's user community to expose such problems. Further, these latest tool advances, which are described in the two recent publications, drove a stake through the heart of an input/output, or I/O, problem of getting massive amounts of data in and out of processors. This game-changing contribution allows Vampir to suck more information from data arteries than it previously could.

But Vampir had its own challenges when it initially touched the massive parallelism provided at ORNL. When Vampir is diagnosing a program, it first captures events into memory. But if the memory portion set aside for events is exhausted - either by a long-running application or by dialing Vampir to capture very fine detail about what's happening, the programme must pause while the huge amounts of event data are written to the file system. This transfer of data between memory and file system is a challenge of its own when performed simultaneously with more than 200,000 instances. The team made enhancements that allow that procedure to happen quickly and trouble-free. "You'd like for that data stream to be going over a bunch of fire hoses as opposed to a few straws", Terry Jones stated.

The project's team members were Jason Cope, Kamil Iskra, Dries Kimpe, and Robert Ross of Argonne National Laboratory; Thomas Ilsche, Andreas Knüpfer, Wolfgang E. Nagel, and Joseph Schuchart of Technische Universität Dresden; and Terry Jones and Stephen Poole of ORNL. They used a Director's Discretion allocation of 7.3 million processor hours in 2011 and 2012 on Jaguar. The I/O Forwarding Scalability Layer project, the aspect that increased Vampir's I/O data stream, was funded by the Advanced Scientific Computing Research program in the Department of Energy's Office of Science. The work to make Vampir function on Jaguar and Titan was funded by the OLCF through a contract with the university at Dresden.
Source: Oak Ridge National Laboratory

Back to Table of contents

Primeur weekly 2013-09-02

Focus

Using one Cloud for Cloud computing today? Perhaps two Clouds? In the future there will be MultiClouds. An interview with Dana Petcu of the West University of Timisoara ...

The Cloud

VMware delivers vCloud Hybrid Service ...

HP and SAP collaborate to deliver HP As-a-Service solution for SAP HANA ...

Dell delivers innovative Cloud and virtualization solutions blending VMware technology with Dell end-to-end offerings ...

New FICO software enables data scientists to transform Big Data into predictive models ...

Smart is the new green in Latin America ...

VMware management solutions demonstrated on stage at VMworld ...

HP and Yorkshire Building Society boost agility for financial services providers ...

IBM closes acquisition of CSL International ...

University of California deploys Huawei's Cloud storage system to support computational astrophysics research ...

Desktop Grids

Neutron stars in the computer Cloud ...

EuroFlash

Video Design brings Dataton WATCHOUT and WATCHPAX to InfoComm India 2013 ...

New UK/USA agreement will use high performance computing to boost economic competitiveness ...

Immanuel Bloch is to receive the Körber Prize 2013 ...

Light on twenty-year-old mystery ...

University of Tübingen physicists create interface between atoms and superconductors ...

USFlash

Allinea tools help Canada close innovation gap ...

Dell furthers commitment to Asia with transformative technology solutions from the data centre to the end device ...

Control scheme dynamically maintains unstable quantum system ...

Hadoop release 2.1.0-beta available ...

Mapping the Earth's magnetosphere to predict and prepare for space-weather events ...

EMC and VMware join forces around software-defined storage and VMware virtual SAN ...

HP and VMware enable customers to unify data centre networks ...

OSC hosts first meeting of the MVAPICH Users Group ...

Vampir takes a bite out of inefficiency as codes run on bigger supercomputers ...

HP helps HoneyBaked Ham of Georgia manage seasonal demand via tenfold performance boost ...