Primeur live 2017-11-14

Exascale

Mellanox deployment collaboration with Lenovo will power Canada's largest supercomputer centre with leading performance, scalability for High Performance Computing applications ...

Middleware

Scalable clusters make HPC R&D easy as Raspberry Pi ...

NVIDIA chosen by every major computer maker and every major Cloud ...

WekaIO announces native support for Mellanox InfiniBand and Ethernet intelligent interconnect solutions ...

Bright Computing announces new product to help get enterprise data scientists up and running quickly with Deep Learning ...

OpenMP Architecture Review Board releases Technical Report that addresses top user requests ...

Diagnosing supercomputer problems ...

Hardware

Oak Ridge National Laboratory acquires Atos Quantum Learning Machine to support US Department of Energy research ...

Cavium and partners to showcase ThunderX2 Arm-based server platforms and FastLinQ Ethernet adapters for High Performance Computing at SC17 ...

HPE helps businesses capitalize on High Performance Computing and Artificial Intelligence applications with new high-density compute and storage ...

SciNet relies on Excelero for high-performance, peta-scale storage at new supercomputing facility ...

CoolIT Systems announces liquid cooled Intel Buchanan Pass server ...

Applications

Supercomputing speeds up Deep Learning training ...

INCITE grants of 5.95 billion hours awarded to 55 computational research projects ...

The Cloud

Penguin Computing announces Intel Xeon Scalable processor availability for Penguin Computing On-Demand HPC Cloud ...

Company news

Nallatech showcases next generation FPGA accelerators at Supercomputing 2017 ...

Cray supercomputer to assist Samsung's research on Artificial Intelligence and Deep Learning ...

Lenovo accelerates Artificial Intelligence initiatives to solve humanity's greatest challenges ...

DDN strengthens its HPC storage leadership with new solutions and next generation monitoring tools ...

New Dell EMC solutions bring machine and deep learning to mainstream enterprises ...

Diagnosing supercomputer problems

Sandia National Laboratories computer scientist Vitus Leung and a team of computer scientists and engineers from Sandia and Boston University won the Gauss Award at the International Supercomputing conference for their paper about using machine learning to automatically diagnose problems in supercomputers. Photo by Randy Montoya.

13 Nov 2017 Albuquerque - A team of computer scientists and engineers from Sandia National Laboratories and Boston University recently received a prestigious award at the International Supercomputing conference for their paper on automatically diagnosing problems in supercomputers.

The research, which is in its early stages, could lead to real-time diagnoses that would inform supercomputer operators of any problems and could even fix the issues autonomously, said Jim Brandt, a Sandia computer scientist and an author on the paper.

Supercomputers are used for everything from forecasting the weather and cancer research to ensuring U.S. nuclear weapons are safe and reliable without underground testing. As supercomputers get more complex, more interconnected parts and processes can go wrong, said Jim Brandt.

Physical parts can break, previous programmes can leave "zombie processes" running that gum up the works, network traffic can cause a bottleneck, or a computer code revision can cause issues. These kinds of problems can lead to programmes not running to completion and, ultimately, wasted supercomputer time, Jim Brandt added.

Jim Brandt and Vitus Leung, another Sandia computer scientist and paper author, drew up a suite of issues they had encountered over their years of supercomputing experience. Together with researchers from Boston University, they wrote code to re-create these problems, or anomalies. Then they ran a variety of programmes with and without the anomaly codes on two supercomputers - one at Sandia and a public Cloud system that Boston University helps operate.

While the programmes were running, the researchers collected extensive data on each run. They monitored how much energy, processor power and memory each node was using. Monitoring more than 700 criteria each second with Sandia's high-performance monitoring system uses less than 0.005 percent of the supercomputer's processing power. The Cloud system monitored fewer criteria less frequently but still generated large amounts of data.
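
To give a concrete flavour of this kind of telemetry, here is a minimal sketch that samples a few node-level metrics once per second in Python, using the psutil library as a hypothetical stand-in for the high-performance monitoring system the article describes:

# Minimal per-node metric sampler; psutil is a stand-in here,
# not the monitoring system Sandia actually uses.
import time
import psutil

def sample_node_metrics():
    """Collect one snapshot of basic node-level metrics."""
    mem = psutil.virtual_memory()
    return {
        "cpu_percent": psutil.cpu_percent(interval=None),
        "mem_used_bytes": mem.used,
        "mem_percent": mem.percent,
        "ctx_switches": psutil.cpu_stats().ctx_switches,
    }

if __name__ == "__main__":
    for _ in range(5):          # sample once per second, five times
        print(sample_node_metrics())
        time.sleep(1)

A real deployment would collect hundreds of such criteria per node and stream them to a central store, as the article notes.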

With the vast amount of monitoring data that current supercomputers generate, it is hard for a person to sift through it all and pinpoint the warning signs of a particular issue. However, this is exactly where machine learning excels, said Vitus Leung.

Machine learning is a broad collection of computer algorithms that can find patterns without being explicitly told which features are important. The team trained several machine learning algorithms to detect anomalies by comparing data from normal programme runs with data from runs containing anomalies.

Then they tested the trained algorithms to determine which technique was best at diagnosing the anomalies. One technique, called Random Forest, was particularly adept at analyzing vast quantities of monitoring data, deciding which metrics were important, then determining if the supercomputer was being affected by an anomaly.
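
A minimal sketch of that workflow, using scikit-learn's RandomForestClassifier on invented data (the feature count, labels and distributions below are illustrative assumptions, not the paper's dataset; synthetic per-run feature vectors of the kind described in the next paragraph stand in for real monitoring data):

# Train a Random Forest to separate normal runs from anomalous ones
# using synthetic per-run feature vectors (illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Each row summarizes one programme run as 20 statistical features;
# label 0 = normal run, 1 = run with an injected anomaly.
X_normal = rng.normal(0.0, 1.0, size=(200, 20))
X_anomaly = rng.normal(0.5, 1.5, size=(200, 20))  # shifted, noisier
X = np.vstack([X_normal, X_anomaly])
y = np.array([0] * 200 + [1] * 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

print("held-out accuracy:", clf.score(X_test, y_test))
# feature_importances_ ranks the metrics the forest found useful,
# mirroring how the technique decides which metrics are important.
print("top features:", np.argsort(clf.feature_importances_)[::-1][:5])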

To speed up the analysis process, the team calculated various statistics for each metric. Statistical values, such as the average, fifth percentile and 95th percentile, as well as more complex measures of noisiness, trends over time and symmetry, help suggest abnormal behavior and thus potential warning signs. Calculating these values doesn't take much computer power and they helped streamline the rest of the analysis.
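
As an illustration, a single metric's raw time series could be reduced to such summary statistics along these lines (a sketch; the exact statistics the team computed may differ):

# Reduce one metric's time series to a compact feature vector
# of the kind described above (illustrative sketch).
import numpy as np
from scipy.stats import skew

def summarize_metric(series):
    """Summary statistics for one monitored metric's time series."""
    series = np.asarray(series, dtype=float)
    t = np.arange(len(series))
    slope = np.polyfit(t, series, 1)[0]   # linear trend over time
    noise = np.std(np.diff(series))       # rough noisiness measure
    return {
        "mean": series.mean(),
        "p05": np.percentile(series, 5),
        "p95": np.percentile(series, 95),
        "trend": slope,
        "noisiness": noise,
        "skewness": skew(series),         # asymmetry of the values
    }

# Example: a memory-usage trace that creeps steadily upward.
trace = 100 + 0.5 * np.arange(300) + np.random.default_rng(1).normal(0, 2, 300)
print(summarize_metric(trace))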

Once the machine learning algorithm is trained, it uses less than 1 percent of the system's processing power to analyze the data and detect issues.

"I am not an expert in machine learning, I'm just using it as a tool. I'm more interested in figuring out how to take monitoring data to detect problems with the machine. I hope to collaborate with some machine learning experts here at Sandia as we continue to work on this problem", stated Vitus Leung.

Vitus Leung said the team is continuing this work with more artificial anomalies and more useful programmes. Other future work includes validating the diagnostic techniques on real anomalies discovered during normal runs, said Jim Brandt.

Because the trained machine learning algorithm is computationally cheap to run, these diagnostics could be used in real time, though this still needs to be tested. Jim Brandt hopes that someday these diagnostics could inform users and system operation staff of anomalies as they occur, or even autonomously take action to fix or work around the issue.

This work was funded by the National Nuclear Security Administration's Advanced Simulation and Computing programme and the Department of Energy's Scientific Discovery through Advanced Computing programme.
Source: DOE/Sandia National Laboratories
