Leveraging a long history of collaboration, the three organisations worked in parallel to define the debugging interfaces and port the TotalView debugger while the Blue Gene/Q hardware was still being developed. TotalView engineers at Rogue Wave were able to begin validation and finalise the hardware-specific development shortly after the first Sequoia system was powered on. As a result of this successful collaboration, Rogue Wave Software has delivered to LLNL a pre-release version of TotalView, its massively parallel, interactive and automated debugging tool.
When it is deployed later this year at LLNL, Sequoia is expected to deliver a peak of 20 petaflops, double the speed of the fastest system currently on the TOP500 list. LLNL plans to use Sequoia's impressive computational capability to advance understanding of fundamental physics and engineering questions that arise in the National Nuclear Security Administration's (NNSA) programme to ensure the safety, security and reliability of the United States' nuclear deterrent without testing. Sequoia will also support NNSA/DOE programmes at LLNL that focus on non-proliferation, counterterrorism, energy, security, health and climate change. IBM's historic role in developing supercomputers that provide the power behind critical applications across a wide array of industrial and laboratory clients has uniquely positioned it to provide Sequoia for the vital functions that are LLNL's responsibility.
TotalView is a comprehensive parallel source code debugging and memory error detection tool that dramatically enhances developer productivity by simplifying the process of debugging parallel, data-intensive, multi-process, multi-threaded or network-distributed applications. "Our software development teams create some of the most sophisticated computational models of physical systems anywhere. These models represent many physical effects and span large scales in both time and space", stated Jim Rathkopf, Associate Programme Director of Computational Physics at LLNL. "Understanding what's happening in these large multi-physics codes running on thousands of processors is really hard, especially when things aren't working correctly. We rely on tools provided by folks like Rogue Wave to help us develop our codes and verify that they are working correctly, but also to find problems that come up both in development and production."
As LLNL takes delivery of the Sequoia system and works to move it into production, it will be migrating applications that have been running on earlier systems (Blue Gene/L and Dawn, a Blue Gene/P system) to the newer Blue Gene/Q architecture. This is a period of intense activity for the software teams as they gain experience with the new hardware and software environment. "Having an early-access version of TotalView available is vital to the installation and acceptance process for Sequoia. It is critical that our development teams have their familiar parallel debugging environment available as they iron out the inevitable issues that come up with running on a new system", stated Kim Cupps, Division Leader of Livermore Computing at LLNL.
The timely availability of an early-access version of TotalView was made possible through close collaboration between LLNL, IBM, and Rogue Wave Software. These organisations have worked together on several generations of HPC environments important to LLNL. Providing debugging capability on the Blue Gene/Q architecture involved collaboratively designing an interface called CDTI (Code Development and Tools Interface), which was then implemented by IBM and used by Rogue Wave in porting TotalView. CDTI also provides key functionality for other tools such as LLNL's STAT and SCR.
"The partnership with LLNL and IBM is key to our success in supporting the Sequoia project. The continuous, timely, clear communication among the three organisations makes it possible to meet the goals we've all agreed on", explained Sean FitzGerald, Senior Vice President of Engineering and CTO of Rogue Wave Software. "We anticipate delivering full Blue Gene/Q platform support and scalability improvements in a series of releases in 2012 and 2013."