Nikoli Dryden moderated the session. The SPCL Lab is headed by Torsten Hoefler. This was the first in a series of seminars that SPCL is organizing. Satoshi Matsuoka explained very clearly how the co-design approach in the Fugaku system works. The R-CCS Lab is one of the twelve research centres of the RIKEN National Research Lab in Japan. R-CCS focuses specifically on science for computing, science by computing and science of computing.
A few years ago there was another big Japanese supercomputer, called K, that took the number 1 position. As Satoshi Matsuoka explained, there is a dramatic change from K to Fugaku. The K computer was mainly aimed at basic science and simulations. The Fugaku system is centred on Society 5.0. This means that it should serve a large number of applications rather than basic science alone. Fugaku is another name for Mount Fuji. This mountain has a very broad base and is also very high, which symbolizes the application focus of Fugaku: it should have a very broad base of applications while at the same time being very fast for specific applications. This is comparable to what is called in Europe the 'pyramid of computing': a broad base of clusters and a high peak for the real petascale supercomputers.
Satoshi Matsuoka gave an overview of the history, which started as long as fifteen years ago. Since then, there has been a very thorough design cycle to get to the first position in the TOP500. This actually happened accidentally, and earlier than expected, in order to help COVID-19-related research. In fact, the system has not been officially released yet because the development team is still finding issues in it. As Satoshi Matsuoka explained, having 80,000 cores working is one thing, but when you scale up to 160,000 cores you will find new errors. So, they are still working on it.
The Fugaku design was driven by applications. The team selected a number of applications in several areas. They worked together with Fujitsu on the hardware design and put the two together. The idea was to achieve a speed-up by a factor of 100 over the K computer for a selected number of important applications. That was the goal, rather than having a top 10 or a top 1 machine. The team worked on different items to optimize the system design. In 2011/2012, they made performance projections to estimate what technology would be available in 6 to 8 years' time and asked themselves whether this would fit into an architecture that could run the applications.
The team focused on several application areas and looked at how they could map different architectures onto them. They identified the gaps for which they needed to develop new hardware and make new architecture decisions.
In 2012/2013 a feasibility study came out. Several architecture design teams worked on it, as well as several application teams, since you need to make changes in the applications in order to make them run on a very large supercomputer. The architecture teams made several designs in parallel to see which ideas would work best. The feasibility study, 600 pages in Japanese, is still available, and there is also an English summary.
If you go from applications to architecture in co-design, you have to make decisions about SIMD, which basically means vectors: how to use them and what the vector length should be. For each of these architecture parameters you make a choice. Then you look for the design whose architecture parameters best fit the applications.
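As a rough illustration of this kind of parameter study, the sketch below scores a few candidate node designs against a few application kernels with a simple roofline estimate: attainable performance is the smaller of the peak compute rate and the memory bandwidth multiplied by the kernel's arithmetic intensity. The roofline model is used here only for illustration; the talk summary does not say which models the Fugaku team used, and all design names, kernel names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    """A hypothetical candidate node design (illustrative numbers only)."""
    name: str
    peak_gflops: float   # peak floating-point rate per node
    mem_bw_gbs: float    # memory bandwidth per node, in GB/s


@dataclass(frozen=True)
class Kernel:
    """A hypothetical application kernel."""
    name: str
    flops_per_byte: float  # arithmetic intensity


def attainable_gflops(design, kernel):
    """Roofline estimate: limited by either compute or memory bandwidth."""
    return min(design.peak_gflops, design.mem_bw_gbs * kernel.flops_per_byte)


def worst_case_gflops(design, kernels):
    """Performance on the kernel this design handles worst."""
    return min(attainable_gflops(design, k) for k in kernels)


def best_design(designs, kernels):
    """Pick the design with the best worst-case kernel performance,
    i.e. favour a broad application base over a single peak number."""
    return max(designs, key=lambda d: worst_case_gflops(d, kernels))


# Hypothetical inputs, not measurements of any real system.
DESIGNS = [
    Design("wide-simd", peak_gflops=4000.0, mem_bw_gbs=200.0),
    Design("balanced", peak_gflops=3000.0, mem_bw_gbs=1000.0),
]
KERNELS = [
    Kernel("stencil", flops_per_byte=0.5),
    Kernel("particles", flops_per_byte=2.0),
    Kernel("dense-matmul", flops_per_byte=20.0),
]

if __name__ == "__main__":
    for d in DESIGNS:
        print(d.name, worst_case_gflops(d, KERNELS))
    print("chosen:", best_design(DESIGNS, KERNELS).name)
```

Ranking by the worst-performing kernel is one possible design choice among many; a real study would weigh applications, power and cost in far more detail.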
Very early on, the team chose to design the most important computational part, the SVE vector unit, themselves. They decided to base it on Arm because Arm makes it easy to extend a specific architecture with your own vector unit and other designs of your own. They also evaluated performance and power but, as Satoshi Matsuoka explained, the deciding factor was the extensibility of Arm, not so much the power.
The team came up with a design that was produced by Fujitsu, the company that was responsible for the hardware. The chip is called A64FX. It is manufactured using TSMC's 7-nanometre process technology, which is state of the art today. The A64FX processor, which is the heart of the system, consists of 48 compute cores and a number of assistant cores. It is based on Arm version 8 and can make use of the whole 64-bit Arm ecosystem. The SVE vector unit is built into the A64FX processor itself.
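One property of SVE worth illustrating is that it is vector-length agnostic: code does not hard-code the vector width, and a predicate masks off the lanes that fall past the end of the data. The Python sketch below only mimics that loop structure; it is not SVE code, and the function name `vla_axpy` and the `lanes` parameter (the number of elements per vector) are hypothetical.

```python
def vla_axpy(a, x, y, lanes):
    """Compute a*x[i] + y[i] in chunks of `lanes` elements.

    `lanes` stands in for the hardware vector length in elements;
    the loop body never assumes a particular value for it.
    """
    if lanes < 1:
        raise ValueError("lanes must be at least 1")
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    n = len(x)
    out = [0.0] * n
    i = 0
    while i < n:
        # Predicate: True for lanes that are still inside the array.
        # This plays the role of a "while less than" predicate in a
        # predicated vector instruction set.
        active = [i + lane < n for lane in range(lanes)]
        for lane, on in enumerate(active):
            if on:  # inactive lanes do no work and touch no memory
                out[i + lane] = a * x[i + lane] + y[i + lane]
        i += lanes  # advance by one full vector
    return out


if __name__ == "__main__":
    x = [float(i) for i in range(11)]
    y = [1.0] * 11
    # 2, 8 and 32 lanes correspond to 128-, 512- and 2048-bit vectors
    # of 64-bit elements; the result does not depend on the choice.
    for lanes in (2, 8, 32):
        print(lanes, vla_axpy(2.0, x, y, lanes))
```

Because the array length (11) is not a multiple of any of the lane counts, the last iteration always runs with some lanes switched off, which is the case the predicate exists to handle.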
If you compare the A64FX processor to the NVIDIA GPU, it is fast, according to Satoshi Matsuoka. If you look at the figures, it looks like the NVIDIA V100 is a little bit faster, but that figure is actually for the whole system while the A64FX speed is per chip. All in all, the A64FX is comparable in performance and power to the NVIDIA V100 GPU, which, as Satoshi Matsuoka admitted, is very nicely designed.
The rest is history. In June 2020, the Fugaku system took first place in the TOP500, and also in the HPCG, HPL-AI and Graph500 benchmarks. Fugaku is the fastest system on earth, and for AI applications it even reaches exaflop speeds.