To reduce fragmentation of large scientific workloads on America's fastest supercomputer, staff at the Oak Ridge Leadership Computing Facility (OLCF), a US Department of Energy (DOE) Office of Science User Facility, are fine-tuning how jobs are scheduled on Titan. New tests indicate that by differentiating between large, long-lived jobs and small, short-lived jobs, Titan is able to do more work in less time.
At any given moment, Titan runs between 30 and 60 jobs of various sizes, from 1,000-plus nodes down to 1 or 2 nodes for tasks such as debugging. To regulate the flow of incoming jobs, the Cray XK7 supercomputer relies on the Moab workload management system. Traditionally, Moab keeps a list of all incoming jobs scheduled to run on Titan irrespective of size. When space opens up on the machine, Moab scans from the top of the list to the bottom for a job to fill the vacancy.
"If you think of Titan as a big 3D grid, you can have various portions of your job spread throughout the machine", stated Chris Zimmer, an HPC systems engineer in the Technology Integration Group at the OLCF, located at DOE's Oak Ridge National Laboratory. "This is exacerbated by the fact that we have very small, short-lived jobs popping in and out of that list."
To reduce fragmentation, OLCF staff programmed Moab to schedule workloads differently. Rather than treating every job in the list alike, Moab now moves small, short-lived jobs (those running less than 2 hours) to the bottom of the list, increasing the probability that a large, intensive workload will receive a more favourable allocation.
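The reordering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Moab's configuration or API: the `Job` class, the `reorder_queue` function, and the choice to combine a node cutoff with the 2-hour walltime limit are all hypothetical. The 16-node figure is borrowed from the first trial.

```python
from dataclasses import dataclass

# Assumed thresholds for illustration only. The 2-hour limit and the
# 16-node cutoff appear in the article; combining them is an assumption.
SMALL_NODE_LIMIT = 16
SHORT_WALLTIME_HOURS = 2.0


@dataclass
class Job:
    """Hypothetical job record; not a Moab data structure."""
    name: str
    nodes: int
    walltime_hours: float


def is_small_short(job: Job) -> bool:
    """True for jobs that are both small and short-lived."""
    return (job.nodes <= SMALL_NODE_LIMIT
            and job.walltime_hours < SHORT_WALLTIME_HOURS)


def reorder_queue(queue: list[Job]) -> list[Job]:
    """Stable partition: small, short-lived jobs move to the bottom of
    the list, while all other jobs keep their relative order at the top."""
    others = [job for job in queue if not is_small_short(job)]
    small = [job for job in queue if is_small_short(job)]
    return others + small


queue = [
    Job("debug", nodes=2, walltime_hours=0.5),
    Job("climate", nodes=4096, walltime_hours=12.0),
    Job("tiny-long", nodes=4, walltime_hours=6.0),
    Job("test", nodes=16, walltime_hours=1.0),
]
ordered_names = [job.name for job in reorder_queue(queue)]
```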
"It essentially creates a situation where your short-lived, lower-priority jobs run on one end of the machine and your bigger, longer-lived jobs are on the other side", Chris Zimmer stated. "The main benefit of this effort is that large jobs, which are affected more by fragmentation, run faster, and the small ones come and go quickly. Ultimately, the goal is to reduce variability from job-to-job runs."
During the summer, OLCF staff experimented to find the optimum demarcation point for scheduling small and large workloads. A 6-week trial relegating jobs of 16 nodes or fewer to the bottom of Moab's list resulted in improved performance for most jobs larger than 128 nodes, whereas small jobs maintained past performance levels. The improvement is measured as a reduction in "hop counts", or data transfers between nodes, that are needed to complete the job. For example, during the 6-week trial, the average 4,096-node job moved 40 to 50 percent closer to the minimum number of hop counts theoretically needed to finish the task.
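One way to read "percent closer to the minimum" is as the share of the gap between the old hop count and the theoretical minimum that the new schedule closes. The sketch below assumes that reading. The function name `gap_closed` is hypothetical, and the numbers in the example are invented for illustration; they are not measurements from the trial.

```python
def gap_closed(hops_before: float, hops_after: float, hops_min: float) -> float:
    """Fraction of the distance to the theoretical minimum hop count
    that was closed, assuming the 'percent closer' reading above."""
    gap = hops_before - hops_min
    if gap <= 0:
        raise ValueError("hops_before must exceed the theoretical minimum")
    return (hops_before - hops_after) / gap


# Invented numbers: a job that averaged 10 hops, now averages 8,
# with a theoretical minimum of 6, has closed half of the gap.
example = gap_closed(hops_before=10.0, hops_after=8.0, hops_min=6.0)
```

Under this reading, a value of 0.4 to 0.5 would correspond to the 40 to 50 percent figure reported for the average 4,096-node job.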
A second trial is under way to determine the effect of raising the demarcation point to 125 nodes. When it is complete, staff will compare the results of both trials to determine which setting will have the greater impact moving forward.