Pre-exascale EuroHPC systems explained at EWHPC Workshop: Leonardo at 248 Petaflop/s LINPACK


19 Oct 2020 Almere - The European HPC Infrastructure Workshop EWHPC took place earlier this month. It focused on the three pre-exascale systems that are currently being procured by the EuroHPC Joint Undertaking (JU) together with the hosting entities. Pre-exascale systems are systems that achieve a LINPACK performance of 200 Petaflop/s or more. LINPACK is the benchmark used as the yardstick for the TOP500 list of the world's fastest supercomputers. So these are systems that would be in the top three of today's TOP500.
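To put that 200 Petaflop/s threshold in context: HPL, the LINPACK implementation behind the TOP500, solves a dense n-by-n linear system and credits the machine with roughly (2/3)n^3 + 2n^2 floating-point operations; the Rmax score is that operation count divided by the run time. Here is a minimal sketch of the arithmetic, with purely illustrative numbers rather than measurements from any EuroHPC system:

    # Sketch: how an HPL (LINPACK) Rmax score follows from problem size and time.
    # The numbers below are illustrative only, not measurements of any real system.

    def hpl_rmax_pflops(n: int, runtime_s: float) -> float:
        """Rmax in Petaflop/s for an n x n HPL run solved in runtime_s seconds."""
        flops = (2.0 / 3.0) * n**3 + 2.0 * n**2  # standard HPL operation count
        return flops / runtime_s / 1e15

    # A hypothetical 10-million-unknown run finishing in 53 minutes comes out
    # at roughly 210 Petaflop/s - just above the pre-exascale threshold.
    print(f"{hpl_rmax_pflops(10_000_000, 3180):.0f} Petaflop/s")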

Because the systems are paid for by the EuroHPC JU, the first presentation was given by Vangelis Floros of the JU. He talked about the current state of the Joint Undertaking. The most important point he mentioned was that since September the Joint Undertaking is an autonomous legal entity, which means that it can sign contracts and enter into negotiations. Before that, it was managed by the European Commission. Its Governing Board is now the most important entity, the one that can take all the formal decisions. The EuroHPC JU has new offices in Luxembourg.

Vangelis Floros also talked about the contracts that had already been signed at the time of the workshop: the MeLuXina supercomputer in Luxembourg, a 12 Petaflop/s system, and the 6.8 Petaflop/s system in Maribor in Slovenia. These are petascale systems, not the pre-exascale ones.

A few days later, the contract was also signed for the system with the pretty unpronounceable name "Euro-IT4I" in the Czech Republic. This is a 40.8 million euro system with 9 Petaflop/s Rmax performance. The unpronounceable name will be changed: there is now a contest in which you can propose another unpronounceable name, or a name which is easily pronounceable, I expect.

The other petascale systems are still under negotiation. We can expect the first system to be operational in May 2021. The one in Portugal will be operational somewhere in October 2021 and the PetaSC system in Bulgaria will be operational in March 2021. Of the three pre-exascale systems, the Leonardo contract was signed a few days after the workshop; at the workshop itself, the presentation was more about operational aspects. Later this week, the LUMI contract will be signed. LUMI will be available in the second quarter of 2021, and the MareNostrum5 is a little bit behind that: it will be operational somewhere in the fourth quarter. That means the pre-exascale systems will enter the TOP500 in November next year at the earliest.

Vangelis Floros also talked about the next Framework Programme, from which we can see that the financing will run until 2027. But of course, if you buy a machine in, say, 2026, you also need to maintain it for five years. That is why the operational period is 2021-2033.

The first system discussed was Leonardo. Mirko Cestari, CINECA, presented it. He mainly talked about the procurement process and the lessons learned from it. An important thing to realize is that CINECA, the hosting entity for Leonardo, has a road map which runs until somewhere in 2027. They hope to have a post-exascale system by that time frame, paid for from the next Framework Programme. The pre-exascale system should have characteristics that follow the road map, and this leads to the general requirements. The basic idea is that you have a general purpose part, a booster - or accelerator - part, and a data centric part. These parts should be able to run AI applications, general HPC applications and Big Data applications. Of course, real complex applications can use a mix of the three. The goal was to have an HPC system and a five-year maintenance contract. The acquisition cost is 120 million euro, but there is an additional amount of the same size for running the system.

The whole system was expected to have a sustained computing capability of at least 200 Petaflop/s and, of course, they wanted to use the experience that they already have. The system should be a pathway to more heterogeneous systems. One of the requirements was a 200 Gbit/s low-latency interconnect, which narrows the options down quite a bit. For the accelerator they actually did not have many requirements. What they wanted in the procurement process was a "competitive dialogue" with the vendors. In the end they negotiated with three different vendors which made it to the final round. For the Leonardo negotiation team that worked out very well. So they claim it is good to have a competitive dialogue process and not to have too many hard requirements from the start, but to keep the goals of the technology road map in mind. Of course, there should be a cost-performance analysis.

Actually, the Leonardo contract was awarded a few days later, and from the press conference we extracted the following data. The machine will be delivered by Atos. It will be a BullSequana XH2000 with 150 racks. In the end, it will have Intel Sapphire Rapids nodes for the CPU performance. The LINPACK performance will be at least 248 Petaflop/s. If it were installed today, Leonardo would be the second fastest machine in the world, after Fugaku and before Summit. But, of course, next year things might have changed. The FP16 peak performance in the booster/accelerator part is 10 Exaflop/s; FP16 is mainly used in AI. The accelerator consists of NVIDIA A100 chips. There will be 5 petabyte of memory and 100 petabyte of storage. The interconnect is from Mellanox, which is now also part of NVIDIA. Those were the data about Leonardo from the press conference.
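That 10 Exaflop/s FP16 figure can be sanity-checked with some quick arithmetic. Note that the per-GPU peaks below are NVIDIA's published A100 numbers, while the derived GPU counts are our own estimate - the press conference did not state how the 10 Exaflop/s is counted:

    # Rough estimate of the A100 count implied by 10 Exaflop/s FP16.
    # Per-GPU peaks are NVIDIA's published A100 tensor-core figures; whether
    # the booster's 10 Exaflop/s assumes structured sparsity was not stated.

    BOOSTER_FP16_EFLOPS = 10.0
    A100_FP16_TFLOPS = {"dense": 312.0, "with 2:4 sparsity": 624.0}

    for label, per_gpu_tflops in A100_FP16_TFLOPS.items():
        gpus = BOOSTER_FP16_EFLOPS * 1e6 / per_gpu_tflops  # Exa/Tera = 1e6
        print(f"{label}: ~{gpus:,.0f} A100 GPUs")
    # dense: ~32,051 GPUs; with 2:4 sparsity: ~16,026 GPUs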

Now back to the workshop, where Pekka Manninen explained the LUMI supercomputer. LUMI is the northernmost of the three pre-exascale systems. It is in Finland and it is supported by a consortium. Actually, each pre-exascale system is supported by a consortium; the LUMI consortium is by far the largest, with 10 partner countries. Together, the countries that are partners in a consortium, in this case of LUMI, get half of the capacity; the other half is for the Joint Undertaking and is distributed amongst all European researchers.

The timeline of LUMI is the following. The system procurement was finished a few months ago and the system contract will be announced in a few days. The system installation will be somewhere in the first quarter of next year, general availability in the second quarter of next year, and then the system will be in operation until the end of 2026.

The architecture of LUMI is also quite heterogeneous, with different CPU partitions, different types of memory and different types of storage. For instance, there is Ceph object storage and there is also a parallel file system based on Lustre. There is a GPU partition. All in all, there are several partitions used for different types of applications and, of course, an application can also use more than one partition.

Pekka Manninen actually also used another name for LUMI: Queen of the North. LUMI is designed as a general purpose high performance computer because it is a big system with different kinds of sections that you can use for different kinds of programs and algorithms.

Sergi Girona, from the Barcelona Supercomputing Center (BSC), talked about the MareNostrum5. Actually, he mainly talked about the infrastructure supporting the modern MareNostrum5, because that is the stage BSC is currently at. He had some observations about the infrastructure that is needed for such a large computer. First, of course, you need a building. There is a new building for the Barcelona Supercomputing Center. So, it is away with the chapel - well, not away with the chapel itself, but a lot of hardware infrastructure will move out of the chapel to the new building. The people who work there, currently about 700, will also find a place in the new headquarters, which is also designed to be able to house the MareNostrum5, a big machine with large requirements in terms of electrical power, cooling, etc.

The hosting consortium for the MareNostrum5 is smaller, but still there are four countries involved (to be expanded to five). The MareNostrum5 will be a very heterogeneous system, which is based on the experience that they now have with the MareNostrum4, which is not a monolithic system either. But the 5 will be even more heterogeneous. Sergi Girona then talked about the site preparation, which means laying down all the cabling, the cooling equipment, etc., so taking care that the machine can be fitted in there once it has been acquired. The original idea was that it should be ready in September 2020, but because of the COVID-19 crisis, which struck Barcelona and Spain very hard, the expected date is now April next year.

BSC already has a virtual computer room video, so you can have a virtual experience and walk through the computer room and the new building. You can see there will be a very wide corridor for the visitors so that they can look out over the MareNostrum5 doing its pre-exascale calculations.

Herbert Huber, LRZ, closed the workshop, announcing that the next physical workshop will be about HPC infrastructures in Europe. This will hopefully take place somewhere in April next year in Garching near Munich. If you want to know more about the systems, you can have a look at the data by visiting our EuroHPC website.

Ad Emmen