Primeur weekly 2015-10-19

Special

Blue Yonder to automate enterprise decisions with machine learning ...

Videos of presentations from the Docker Workshop at ISC Cloud & Big Data conference published ...

Exascale supercomputing

Research and Markets to investigate impact of exascale computing trends in key sectors ...

Focus on Europe

Cray launches EMEA Research Lab expanding its presence in the European HPC ecosystem ...

GCS sponsors students from Technische Universität München for the Student Cluster Challenge at SC15 ...

Middleware

Moab scheduling tweak tightens Titan's workload ...

Adaptive Computing deploys Converged HPC, Cloud & Big Data for The Hospital for Sick Children & University Health Network's Princess Margaret Cancer Centre ...

Hardware

Rugged 10/40 Gigabit Ethernet switch that brings high speed networking to 3U VPX systems introduced by Curtiss-Wright ...

Asetek receives order for largest server installation to date ...

Retrofitting Rhea cluster at OLCF ...

British Antarctic Survey navigates surge of scientific research and complex climate models with DDN hybrid flash storage appliance ...

United Nations Technology Bank recognises importance of GÉANT and research and education networks ...

KBC Group delivers customers powerful banking experience driven by IBM z13 mainframe ...

Nanoelectronics researchers employ Titan for an electrifying simulation speed-up ...

Gigabyte presents its new C230 series server & workstation motherboards ...

Applications

Berkeley Lab's Yelick lauded for advances in programmability of high-performance computing systems ...

Supercoiled DNA is far more dynamic than the 'Watson-Crick' double helix ...

Ace Computers to exhibit custom HPCs and supercomputers for the Oil and Gas industry at SEG Expo ...

27 million euro for Dutch expertise centre in data-intensive scientific research ...

The Cloud

EGI opens new training infrastructure ...

VMware advances industry leading hybrid Cloud management platform and accelerates unified services delivery across public and private Clouds ...

Moab scheduling tweak tightens Titan's workload


A side-by-side visualization depicting a highly fragmented computing job (left) and a job with reduced fragmentation (right). To reduce fragmentation of large jobs on the Titan supercomputer, OLCF staff is experimenting with how the Moab workload management system schedules jobs. The changes have allowed Moab to allocate large jobs more efficiently. Improvement is measured in the reduction of "hop counts", or data transfers between nodes, that are needed to complete the job. For example, during a 6-week trial the average 4,096-node job on Titan moved 40 to 50 percent closer to the minimum number of hop counts theoretically needed to finish the task.
13 Oct 2015 Oak Ridge - Parallel computing makes big computational problems feasible by breaking them up into smaller parts. On a supercomputer, these parts share information with one another to find a solution. Ideally, a high-performance computing (HPC) application would be packed onto a compact set of a supercomputer's nodes so that its communication paths crossed only those of collaborating nodes. But on a leadership-class machine, that is rarely the case. New jobs fill vacant nodes amidst active jobs as soon as the space becomes available, so gaps between communicating nodes commonly arise, slowing down communication and hurting performance.

To reduce fragmentation of large scientific workloads on America's fastest supercomputer, staff at the Oak Ridge Leadership Computing Facility (OLCF), a US Department of Energy (DOE) Office of Science User Facility, is fine-tuning how jobs are scheduled on Titan. By differentiating between large, long-lived jobs and small, short-lived jobs, new tests indicate Titan is able to do more work in less time.

At any given moment, Titan runs between 30 and 60 jobs of various sizes - from 1,000-plus nodes to 1 or 2 nodes, such as debugging tasks. To regulate the flow of incoming jobs, the Cray XK7 supercomputer relies on the Moab workload management system. Traditionally, Moab keeps a list of all incoming jobs scheduled to run on Titan irrespective of size. When space opens up on the machine, Moab scans from the top of the list to the bottom for a job to fill the vacancy.

"If you think of Titan as a big 3D grid, you can have various portions of your job spread throughout the machine", stated Chris Zimmer, an HPC systems engineer in the Technology Integration Group at the OLCF, located at DOE's Oak Ridge National Laboratory. "This is exacerbated by the fact that we have very small, short-lived jobs popping in and out of that list."

To reduce fragmentation, OLCF staff programmed Moab to schedule workloads differently. Instead of scanning a list from top to bottom, Moab now moves small, short-lived jobs (less than 2 hours) to the bottom of the list, increasing the probability that a large, intensive workload will receive a more favourable allocation.
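The reordering described above can be sketched as a simple list partition. This is an illustrative assumption, not Moab's actual configuration or API: the `Job` fields, the 16-node and 2-hour thresholds (taken from the trial described later in the article), and the `reorder_queue` function are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    nodes: int              # requested node count
    walltime_hours: float   # requested wall time

# Assumed thresholds mirroring the article's trial: jobs at or below
# 16 nodes and under 2 hours count as "small, short-lived".
SMALL_NODES = 16
SHORT_HOURS = 2.0

def is_small_short(job: Job) -> bool:
    return job.nodes <= SMALL_NODES and job.walltime_hours < SHORT_HOURS

def reorder_queue(queue: list[Job]) -> list[Job]:
    """Keep large, long-lived jobs at the top of the list in their
    original order; move small, short-lived jobs to the bottom."""
    large = [j for j in queue if not is_small_short(j)]
    small = [j for j in queue if is_small_short(j)]
    return large + small

queue = [
    Job("debug", 2, 0.5),
    Job("climate", 4096, 12.0),
    Job("post", 1, 1.0),
    Job("qcd", 1024, 6.0),
]
print([j.name for j in reorder_queue(queue)])
# → ['climate', 'qcd', 'debug', 'post']
```

Because the scheduler scans from the top of the list when space opens up, the large jobs now get first pick of vacancies, which is exactly what raises their odds of a contiguous allocation.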

"It essentially creates a situation where your short-lived, lower-priority jobs run on one end of the machine and your bigger, longer-lived jobs are on the other side", Chris Zimmer stated. "The main benefit of this effort is that large jobs, which are affected more by fragmentation, run faster, and the small ones come and go quickly. Ultimately, the goal is to reduce variability from job-to-job runs."

During the summer, OLCF staff experimented to find the optimum demarcation point for scheduling small and large workloads. A 6-week trial relegating jobs of 16 nodes or fewer to the bottom of Moab's list resulted in improved performance for most jobs larger than 128 nodes, whereas small jobs maintained past performance levels. This improvement is measured in the reduction of "hop counts", or data transfers between nodes, that are needed to complete the job. For example, during the 6-week trial, the average 4,096-node job moved 40 to 50 percent closer to the theoretical minimum number of hop counts needed to finish the task.
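Why fragmentation inflates hop counts can be illustrated with a toy model: a minimal sketch, assuming a plain 3D grid (a simplification of Titan's 3D torus interconnect), Manhattan-distance hops, and all-to-all communication. The node coordinates and allocations below are invented for illustration.

```python
from itertools import combinations

def hops(a, b):
    """Manhattan distance between two node coordinates in a 3D grid
    (a simplification of Titan's 3D torus topology)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def total_hops(allocation):
    """Sum of hop counts over all communicating node pairs, assuming
    all-to-all communication for illustration."""
    return sum(hops(a, b) for a, b in combinations(allocation, 2))

# A compact 2x2x2 block of eight nodes...
compact = [(x, y, z) for x in range(2) for y in range(2) for z in range(2)]
# ...versus the same eight nodes scattered along one axis.
fragmented = [(4 * i, 0, 0) for i in range(8)]

print(total_hops(compact), total_hops(fragmented))
# → 48 336
```

The scattered allocation needs seven times as many hops for the same communication pattern, which is the kind of gap the demarcation-point tuning is meant to close for large jobs.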

A second trial is under way to determine the effect of raising the demarcation point to 125 nodes. When complete, staff will compare the results of both trials to determine which will have the greatest impact moving forward.
Source: Oak Ridge Leadership Computing Facility - OLCF
