The group was formed in response to workload changes following the acquisition of the 200-petaflop IBM AC922 Summit supercomputer, located at the OLCF. This month, OLCF users began running allocation projects on Summit under the Innovative and Novel Computational Impact on Theory and Experiment, or INCITE, programme. The OLCF is a US Department of Energy (DOE) Office of Science User Facility at DOE's Oak Ridge National Laboratory.
"As a supercomputing centre grows, there comes a point where there are too many services and too many things going on at the same time in one group", stated Ryan Adamson, HPC cybersecurity engineer and interim HPC Core Ops group leader. "This was a strategic change that allows us to scale successfully and work more efficiently."
HPC Core Ops staff members were reassigned from the OLCF's High Performance Compute and Data Operations (HPC and Data Ops) Group, formerly known as HPC Ops. The split of HPC and Data Ops into two distinct groups marks a noteworthy change in the NCCS organisational structure and will allow each group to focus on a specific subset of the centre's operations.
HPC Core Ops houses three teams: a networking team, which handles the Ethernet network for all of NCCS's systems; a cybersecurity team, which monitors and secures the supercomputing centre; and a core infrastructure team, which provides necessary external services to the centre's HPC resources. HPC and Data Ops, on the other hand, focuses on the user-facing aspects of operating the centre, which include administering the HPC and cluster resources and monitoring the storage and file systems.
"The capabilities provided by HPC Core Ops impact every NCCS customer, whereas HPC and Data Ops capabilities are specifically tailored to various customer needs", Ryan Adamson stated.
The new structure will allow Kevin Thach, the HPC and Data Ops group leader, to focus his group's attention on the OLCF's supercomputing and storage resources and on individual projects such as the Collaboration of Oak Ridge, Argonne, and Livermore (CORAL), a joint HPC procurement activity among these three national laboratories.
Both groups will continue to work with the OLCF's User Assistance and Outreach Group to resolve user issues and tickets. They will also continue to share many of the same procedures, such as change management, code review, and disaster recovery.
"We must continue to work closely with HPC and Data Ops", Ryan Adamson stated. "We will not be successful in this evolving supercomputing environment unless we do."