The vision of the Mont-Blanc project is to leverage the fast-growing market of mobile technology for scientific computation, HPC, and data centres. So far, there have been three editions of the Mont-Blanc project; the third is still ongoing. In the first edition, which started under the European Commission's Seventh Framework Programme in 2011, the hardware consisted of mobile technology based on the ARMv7 architecture. The software was developed to enable ARM-based clusters. The next-generation studies gathered information from the first prototypes in order to extrapolate towards future systems.
In the second edition, which also ran under the Seventh Framework Programme from 2014 until 2016, the hardware consisted of mobile and server technology based on the ARMv8 architecture. The software ecosystem improved massively with regard to development tools and resilience. The next-generation studies addressed the pre-exascale HPC compute node.
In the Mont-Blanc 3 project, under the Horizon 2020 Framework Programme, the hardware consists of ARMv8 server technology. On the software side, the team is focusing on the study of mini-apps and real industrial applications. The next-generation studies focus on an HPC System-on-Chip and on delivering a market-ready system.
Filippo Mantovani showed that the Mont-Blanc project has already made many contributions through its ARM-based prototypes, in terms of mobile and server technologies and system integration. On the software side, the project has delivered scientific libraries, performance analysis tools, runtime support, and a power monitor. The team states that there is also an educational challenge in addition to the industrial component; therefore, Mont-Blanc actively participates in the Student Cluster Competition. The project has also addressed scientific applications through the porting and benchmarking of mini-apps and full-scale applications, and has performed scalability studies on real ARM-based platforms. In addition, the project has contributed to the next-generation studies by running performance projections and simulations from System-on-Chip to full system, in order to build a multi-scale simulation infrastructure.
Filippo Mantovani stated that prototypes are critical to accelerate software development by deploying a system software stack and applications. He gave an overview of the Mont-Blanc prototype ecosystem. It started with the PRACE prototypes called Tibidabo, Carma, and Pedraforca. Next came the mini-clusters called Arndale, Odroid XU, Odroid XU-3, and NVIDIA Jetson. The first real Mont-Blanc prototype consisted of 1080 compute cards with a fine-grained power monitoring system. It was installed between January and May 2015 and has been operational at the Barcelona Supercomputing Center since May 2015. In a later phase, the team developed ARM 64-bit mini-clusters based on the APM X-GENE2, Cavium ThunderX, and NVIDIA TX1. At present, there is the Mont-Blanc 3 demonstrator, based on new-generation ARM 64-bit processors and the Cavium ThunderX2 System-on-Chip, which is targeting the HPC market.
Filippo Mantovani expanded a little on the first Mont-Blanc prototype. Each compute card has local storage (microSD up to 64 GB), 4 GB of LPDDR3-1600 memory, and an Exynos5 Dual System-on-Chip with two ARM Cortex-A15 cores and one ARM Mali-T604 GPU. The network is built from a USB 3.0 to 1 GbE bridge. The full system comprises 2 racks holding 2160 CPUs, 8 BullX chassis holding 1080 GPUs, 72 compute blades with 4.3 TB of DRAM, and 1080 compute cards with 17.2 TB of flash.
He also described the new Mont-Blanc 3 demonstrator, code-named Dibona. This system is equipped with power supply units and 48 compute nodes with Cavium ThunderX2 CPUs, totalling 3000 cores and 12,000 threads. It has direct liquid cooling, a redundant management server and storage, InfiniBand EDR switches, and a management network.
Filippo Mantovani then turned to the system software stack for ARM, which has been developed since 2011 and is today maintained in collaboration with all major OpenHPC partners. It has been tested on several ARM-based platforms and is mostly based on open-source packages. There has also been an effort to standardize power-measurement formats for the efficient use of existing systems.
The Mont-Blanc team has developed a methodology for scientific applications covering benchmarks, mini-apps, and production and industrial codes. Applications are traced with the objective of identifying and fixing limitations of current system implementations, such as memory affinity on the Cavium ThunderX. A second objective is applying OmpSs/OpenMP 4.0 and analysing its effect. Third comes understanding code limitations and helping the developers restructure the code. The fourth objective consists of performing extrapolation studies with next-generation machine parameters, using tools such as MUSA.
The Barcelona Supercomputing Center has developed Alya, a code for multi-physics problems. In order to parallelize the finite-element code, the team analyses it with Paraver. The team also performs reductions with indirect accesses on large arrays, comparing no colouring, colouring, and commutative multi-dependences. Filippo Mantovani hoped that this OmpSs feature will be included in OpenMP. Alya is used for taskification and dynamic load balancing in order to move towards throughput computing. Dynamic load balancing (DLB) helps in all cases, and even more in the bad ones. The team also observed some side effects: hybrid MPI+OmpSs Nx1 can perform better than pure MPI, and Nx1 with dynamic load balancing offers hopeful help for lazy programmers. Alya uses coupled codes. In the original version, the team observed a big impact of the configuration and of the kind of coupling. With DLB-Lend-When-Idle there was an important improvement in all cases, as well as almost constant performance independent of the configuration and the kind of coupling.
Another code is NTChem, a hybrid MPI + OpenMP code. Filippo Mantovani explained that when the node is not fully populated, the system activates TurboBoost, increasing the frequency from 2.6 to 3.1 GHz. There is load imbalance with global serialization and noise, but some gain when using a low number of threads. At a large thread count, the granularity becomes fine and the overhead high. With a simpler code, one gets better performance. There is sufficient task granularity, as well as overlap of communication and computation. DLB is needed at very large scale.
A third code is the Lattice Boltzmann D2Q37. This is a fluid-dynamics code based on MPI+OpenMP for simulating, for instance, the mixing-layer evolution of fluids at different temperatures/densities. It has a simple structure: serial initialization and closing, a propagate step which is memory-bound, and a collide step which is compute-bound. The team performed a clustering analysis on the ThunderX. The different runs were performed over the same lattice size with a varying number of threads: 6, 8, 12, 16, 24, 32, and 48. The collide function scales almost perfectly up to 48 threads. For propagate, increasing the number of threads makes the threads compete for the same resource, explained Filippo Mantovani. The team also performed clustering of OpenIFS on the ThunderX, tracking IPC under strong scaling.
Filippo Mantovani drew the following conclusions. Most hardware limitations will evolve away, possibly in the original market of the devices, when extending to the server market, or pushed by other markets such as automotive. The programming model and runtime will help address asynchrony and overlap, resilience, variability, and load balancing. Tools can help understand the real problems and suggest or evaluate alternatives, for example by correlating performance and power.
Filippo Mantovani ended by saying that the work performed on Mont-Blanc 3 applications is not strictly ARM-specific: any new platform can benefit from it.