Those engaged in computational research often reach a point where they outgrow their personal desktops or departmental computers: programs run slowly, run out of memory, or the machines lack the sophisticated software needed to run simulations. Many then turn to advanced computing resources like those at the Texas Advanced Computing Center (TACC). But newcomers to supercomputers must make the sometimes difficult transition from desktop to supercomputer and learn to share resources with hundreds or even thousands of other researchers.
Doug James, HPC research associate at TACC, compares this transition to his experience as a cyclist.
"When I started riding, I bought a Sears-Roebuck bike and it met my needs for a while, but I eventually got to a point where I knew I was outgrowing the bike. My skills, my needs, my fitness level, had reached the point where it was pretty obvious I needed a more sophisticated bike", he stated.
"The problem we're trying to solve is in this transition from desktop to supercomputer. Those who support computational research are working hard to put mechanisms in place to make it possible for thousands and thousands of people to share these resources and get their work done", James continued.
To ease this transition, members of TACC's HPC research team act as detectives: they work to understand the issues users face and help them gain control of their computing environments. By developing software tools, they make supercomputing more efficient for users, consultants, and administrators.
In 2009, Robert McLay, a TACC research associate and manager of HPC software tools, noticed a recurring issue landing on his desk. To run calculations successfully on a supercomputer, researchers must load a compatible set of applications, libraries, and compilers, which is challenging without deep knowledge of these components. So McLay developed Lmod, a software tool that accounts for the unique environment each researcher requires. It helps researchers select a working combination from thousands of possible options and prevents them from loading incompatible software.
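On systems that use Lmod, this environment management happens through a handful of `module` subcommands. A session might look something like the sketch below; the package names and versions are illustrative, not a specific TACC configuration:

```
$ module avail              # list the software available on this system
$ module load gcc/9.1.0     # load a compiler (illustrative version)
$ module load mvapich2      # load an MPI stack built against that compiler
$ module list               # show everything currently loaded
$ module swap gcc intel     # change compilers
```

Because Lmod tracks the hierarchy between compilers, MPI stacks, and the libraries built against them, a swap like the last line reloads the dependent modules in compatible versions rather than leaving the user with a broken mix.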
As one of TACC's more established tools, Lmod has seen steady growth in use over the years and is now deployed at supercomputing centers around the world. Lmod also spurred a new generation of tools that better address the particular needs of individual users. The XALT tool is a collaboration between Mark Fahey, an HPC researcher at the University of Chicago, and McLay. XALT helps consultants by generating detailed information on the software researchers use and how successfully they use it.
"XALT is designed to be a census taker, so we know what programs get used and at what rate. We also know all the other kinds of applications we do and what libraries get used so we can better manage our system", McLay stated.
XALT complements TACC Stats, a tool that helps consultants answer user questions about jobs, that is, runs of executable programs on a supercomputer. "We look at individual jobs. I can see things like who ran it, when they ran it, what queues they ran it in, if it completed successfully, how many nodes, and I can see what hosts or what nodes it ran on", stated Todd Evans, a research associate on the HPC team.
TACC Stats generates performance metrics for individual jobs and also helps the HPC team address user needs by analyzing broader trends across jobs. The tool builds performance snapshots by measuring system statistics and hardware performance data at the beginning of a job, every 10 minutes while it runs, and at its end.
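The snapshot scheme itself is easy to picture: sample a metric once when the job starts, on a timer while it runs, and once when it ends. The toy Python sketch below illustrates only that pattern; it is not TACC Stats code, and the metric, job, and interval are placeholders (TACC Stats samples every 10 minutes, not every 20 milliseconds):

```python
import threading
import time

def run_with_snapshots(job, sample, interval):
    """Run job() while recording sample() at the start,
    every `interval` seconds, and at the end."""
    snapshots = [("start", sample())]
    stop = threading.Event()

    def sampler():
        # Event.wait() returns False on timeout, so this loops
        # on a timer until stop is set by the main thread.
        while not stop.wait(interval):
            snapshots.append(("periodic", sample()))

    t = threading.Thread(target=sampler, daemon=True)
    t.start()
    try:
        job()                     # the "job" being monitored
    finally:
        stop.set()
        t.join()
        snapshots.append(("end", sample()))
    return snapshots

# Toy usage: the job just sleeps; the metric is wall-clock time.
snaps = run_with_snapshots(
    job=lambda: time.sleep(0.1),
    sample=time.monotonic,
    interval=0.02,
)
print(snaps[0][0], snaps[-1][0])  # prints "start end"
```

The bracketing start and end samples matter: they let a consultant attribute any change in a counter to the lifetime of that specific job, with the periodic samples showing how the change unfolded.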
One tool that empowers users to understand why a job failed is the Sanitytool, developed by McLay and Si Liu, a research associate on the HPC team. Error messages can be ambiguous and confusing. The Sanitytool is written in Python; by typing a simple command, a user invokes a series of customized tests that determine exactly what went wrong.
Liu cited one example in which a TACC consultant spent several hours with a user trying to figure out why a job did not run correctly. With the Sanitytool, diagnosing and resolving the issue took only a few minutes.
Designed by James Browne, Professor Emeritus of computer science at the University of Texas at Austin, the PerfExpert tool makes performance optimization at the compute-node level as simple as possible. It helps researchers detect the main issues in their codes and improve the performance of their programs without requiring deep computing expertise, automating instrumentation and profiling, bottleneck analysis, and optimization recommendations.
Remora is TACC's newest tool. It stemmed from a common user request: knowing how much memory a job uses, since jobs can crash when they run out of memory. The tool also lets consultants visualize, through graphs, how memory use evolves during a job's execution.
"One user ran a job that was creating 20,000 requests per second on the file system, so the application ran very slowly and could potentially create issues for the file system", stated Antonio Gomez, another research associate in the HPC group. "Using Remora, we showed him the issue and he changed his program so that he was only accessing files 600 times a second. The performance of the code improved by 25% to 30% thanks to the insights provided by this tool."
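In practice Remora works as a wrapper: the user prefixes the normal launch command with `remora`, and the tool collects usage data while the job runs. A sketch, with an illustrative application name:

```
$ remora ./my_simulation         # serial or threaded run
$ remora ibrun ./my_simulation   # MPI run under TACC's ibrun launcher
```

When the job finishes, the collected memory and I/O records can be plotted over the lifetime of the run, producing the kind of graphs that revealed the file-access problem Gomez describes.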
TACC's HPC team is dedicated to educating researchers and HPC professionals about the importance of software tools, broadening the scope of the community and lowering the barrier to HPC access. Each of these performance-monitoring tools is open source and available on TACC's GitHub page. Recently, the team published the paper "Tales from the Trenches: Can User Support Tools Make a Difference?" and presented its findings at the Second Annual Workshop on HPC User Support Tools.
"Software tools benefit humanities researchers; they benefit people trying to use supercomputers who come from colleges that may not have done much computational research in the past", James stated. "There's a sense in which our passion is helping the nontraditional and underrepresented disciplines, demographics, universities and colleges make this transition."