Sunway TaihuLight is not the first Sunway supercomputer. The first machine in the series was the Sunway BlueLight, built in 2012 and located at the National Supercomputing Center in Jinan. It delivers a performance of 1 PFlops and uses a 16-core multi-core processor. The Sunway TaihuLight reaches 125 PFlops, roughly a two-orders-of-magnitude improvement in four years, a substantial jump in achievable performance and efficiency. There is also a major architectural change: a many-core processor with 260 cores on one chip.
The system vendor is the National Research Center of Parallel Computer Engineering and Technology (NRCPC). The CPU vendor is the Shanghai High Performance IC Design Center. The facility host is the National Supercomputing Center in Wuxi, run by a joint team of Tsinghua University, the City of Wuxi, and Jiangsu Province.
Haohuan Fu presented the technical features of the system. As already mentioned, the peak performance is 125 PFlops. The Linpack performance of TaihuLight is 93 PFlops, with a total memory of 1310.72 TB and a total memory bandwidth of 5591.45 TB/s. The number of nodes amounts to 40,960 and the number of cores climbs to 10,649,600.
Each node in the system holds one CPU with 260 cores. The peak performance per node is 3.06 TFlops, with 32 GB of memory and a memory bandwidth of 136.5 GB/s.
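The system-wide figures quoted above follow directly from the per-node numbers. A quick sanity check (all inputs are the numbers from the talk; the small gap on total bandwidth is rounding of the per-node figure):

```python
# Sanity check of the published TaihuLight figures.
nodes = 40_960
cores_per_node = 260
peak_per_node_tflops = 3.06
mem_per_node_gb = 32
bw_per_node_gbs = 136.5

total_cores = nodes * cores_per_node                     # 10,649,600
total_peak_pflops = nodes * peak_per_node_tflops / 1000  # ~125.34 PFlops
total_mem_tb = nodes * mem_per_node_gb / 1000            # 1310.72 TB
total_bw_tbs = nodes * bw_per_node_gbs / 1000            # ~5591 TB/s, close to the quoted 5591.45

print(total_cores, round(total_peak_pflops, 2), total_mem_tb, round(total_bw_tbs, 2))
```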
Next, Haohuan Fu described the major hardware and software components of the Sunway TaihuLight. The machine has a "0-1-0" layout: the two zeros are two circles of computing cabinets that make up the computing system, and the 1 in the middle is the interconnect cabinet housing the network system. The peripheral system includes both the storage and the management system. In addition, there are the maintenance and diagnostic system, the power system, the cooling system, and the software system supporting the parallel applications.
One of the key innovations in the Sunway TaihuLight is the SW26010 many-core processor. It is China's first homegrown processor of this kind, developed by the Shanghai High Performance IC Design Center. The chip has 260 cores per processor, organized into 4 core groups, each of which has one management processing element (MPE) and 64 computing processing elements (CPEs).
The management processing element is a 64-bit RISC core that supports both user and system modes and 256-bit vector instructions. It has a 32 KB L1 instruction cache, a 32 KB L1 data cache, and a 256 KB L2 cache. The computing processing elements are arranged in an 8x8 mesh; each CPE is a 64-bit RISC core that supports only user mode, also with 256-bit vector instructions, and has a 16 KB L1 instruction cache and a Scratch Pad Memory (SPM), Haohuan Fu explained.
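The 260-core count follows directly from this group structure, as a one-line check shows:

```python
# SW26010 layout as described in the talk: 4 core groups,
# each with 1 MPE and an 8x8 mesh of CPEs.
CORE_GROUPS = 4
MPE_PER_GROUP = 1
CPE_MESH = (8, 8)  # 64 CPEs per group

cpes_per_group = CPE_MESH[0] * CPE_MESH[1]
cores_per_chip = CORE_GROUPS * (MPE_PER_GROUP + cpes_per_group)
print(cores_per_chip)  # 260
```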
The high-density integration of the computing system involves a five-level integration hierarchy, consisting of the computing node, the computing board, the super node, the cabinet, and the entire computing system.
The network system has three levels. A central switching network connects the different super nodes. Within each super node, all 256 nodes are fully connected, providing high bandwidth and low latency for intra-super-node communication. Finally, a resource-sharing network connects the computing system to other resources, such as I/O services.
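The super node count follows from the figures above (40,960 nodes at 256 per super node); the link count below is only an illustration, assuming one direct path per node pair in the fully connected level:

```python
# Derived topology counts; the pair count assumes simple pairwise connectivity.
nodes_total = 40_960
nodes_per_super_node = 256

super_nodes = nodes_total // nodes_per_super_node            # 160 super nodes
pairs = nodes_per_super_node * (nodes_per_super_node - 1) // 2  # 32,640 node pairs

print(super_nodes, pairs)
```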
The peripheral system has a total storage of 20 PB and provides a high-speed, reliable data storage service for the computing nodes. The management system consists of the system console, the management server and the management network.
The major functions of the maintenance and diagnosis system cover online maintenance management, status and environment monitoring, fault location and recording, and security services.
The power system, as explained by Haohuan Fu, has a mutual-backup power input of 2 x 35 kV, a front-end power supply output of 300 V DC, a cabinet power supply of 12 V DC, and a CPU supply voltage of 0.9 V DC.
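The delivery chain steps the voltage down in three stages; the ratios below are simply derived from the listed voltages and are illustrative only (no efficiency figures were given):

```python
# Power delivery chain as listed in the talk, highest to lowest voltage.
chain = [
    ("mutual-backup input", 35_000.0),  # 2 x 35 kV
    ("front-end supply",    300.0),     # 300 V DC
    ("cabinet supply",      12.0),      # 12 V DC
    ("CPU supply",          0.9),       # 0.9 V DC
]

# Step-down ratio at each conversion stage:
ratios = [chain[i][1] / chain[i + 1][1] for i in range(len(chain) - 1)]
print([round(r, 2) for r in ratios])  # [116.67, 25.0, 13.33]
```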
The cooling system recycles the cooling water; the heat exchange happens at the level of the computing boards.
The software system consists of the parallel operating system environment on top of the hardware system and the storage service. The major network file system is the Sunway GFS. The compilers support C/C++, and basic software libraries are provided. For parallel programming, OpenMP and MPI are supported. Most applications on the system run in a hybrid mode that combines MPI across nodes with threading on the many-core processors.
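The essence of such a hybrid mode is a two-level work decomposition. The sketch below mimics it in plain Python: the outer split stands in for MPI ranks (one per node) and the inner split for on-chip threads (e.g. the 64 CPEs of one core group). All names are hypothetical; a real code would use MPI plus the Sunway threading interface.

```python
# Two-level decomposition sketch in the style of an MPI + threads hybrid.

def split(n_items, n_parts):
    """Partition range(n_items) into n_parts contiguous, near-equal chunks."""
    base, extra = divmod(n_items, n_parts)
    chunks, start = [], 0
    for p in range(n_parts):
        size = base + (1 if p < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

N = 10_000                                   # global problem size
node_chunks = split(N, 4)                    # "MPI" level: 4 nodes
cpe_chunks = split(len(node_chunks[0]), 64)  # "thread" level: 64 CPEs

# Every element is covered exactly once at the node level:
assert sum(len(c) for c in node_chunks) == N
print(len(node_chunks[0]), len(cpe_chunks))  # 2500 64
```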
Haohuan Fu also presented the key applications that run on the Sunway TaihuLight. The key application domains are earth system modelling and weather forecasting; advanced manufacturing (CFD/CAE); life sciences; and Big Data analytics.
A first project in the earth system modelling and weather forecasting area is the refactoring and optimization of the Community Atmosphere Model (CAM), a collaboration between Tsinghua University and Beijing Normal University. A typical challenge when traditional HPC applications are moved to many-core architectures lies in the high complexity of the application and a heavy code-base legacy involving millions of lines of code. There is a mismatch between the original design philosophy and the new architecture. Unfortunately, there is also a lack of experts with interdisciplinary knowledge and experience.
The project team addressed these challenges by porting CAM to the Sunway TaihuLight. The entire code base consists of 530,000 lines of code. The CAM-SE dynamic core has been rewritten with regular code patterns, with manual OpenACC parallelization and optimization of both the code and the data structures.
The CAM physics schemes contain irregular and complex code patterns, written by different scientists. A loop transformation tool exposes the right level of parallelism while controlling code size, and a memory footprint analysis and reduction tool has been developed. The scalability results are quite good, but the speed-up is not that significant, Haohuan Fu told the audience. However, it is a good starting point for continued work on the new system.
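The kind of loop transformation described, exposing CPE-level parallelism while keeping each working set inside the scratch pad, can be sketched as strip-mining a long loop into SPM-sized tiles. The 64 KB SPM size and the two-buffer layout below are assumptions for illustration, not CAM code:

```python
# Strip-mining sketch: break one long loop into tiles small enough for a CPE's
# scratch pad memory, then assign tiles round-robin to the 64 CPEs of a group.
SPM_BYTES = 64 * 1024        # assumed SPM size per CPE
BYTES_PER_ELEM = 8           # double precision
TILE = SPM_BYTES // (BYTES_PER_ELEM * 2)  # input + output buffer -> 4096 elems
N_CPES = 64

def tiles(n):
    """Half-open [lo, hi) tile bounds covering range(n)."""
    return [(i, min(i + TILE, n)) for i in range(0, n, TILE)]

n = 1_000_000
schedule = {cpe: [] for cpe in range(N_CPES)}
for t, (lo, hi) in enumerate(tiles(n)):
    schedule[t % N_CPES].append((lo, hi))

covered = sum(hi - lo for ts in schedule.values() for lo, hi in ts)
print(len(tiles(n)), covered)  # 245 1000000 -- every element covered once
```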
The second project is a high-resolution experimental atmospheric model for the Sunway TaihuLight, an experimental project started by the National Supercomputing Center in Wuxi. The hardware-software co-design involves a structural change of the model components; many-core acceleration of all the compute-intensive parts; and a loose coupling scheme between the dynamic core and the physics schemes to further improve the scalability. The project targets high-resolution simulation scenarios with a 10 km to 3 km global resolution. For some scenarios, it is even possible to scale to the entire TaihuLight system.
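A back-of-envelope estimate shows why the step from 10 km to 3 km is demanding: on a roughly uniform global grid, the cell count grows with the inverse square of the resolution (and the time step usually shrinks as well). This is only a rough estimate, not the model's actual mesh:

```python
import math

# Rough uniform-grid cell counts for a global model at 10 km vs 3 km.
R_EARTH_KM = 6371.0
surface_km2 = 4 * math.pi * R_EARTH_KM ** 2

cells = {res: surface_km2 / res ** 2 for res in (10, 3)}
# Refining 10 km -> 3 km multiplies the cell count by (10/3)^2, about 11x.
print({res: f"{n:.2e}" for res, n in cells.items()})
```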
The third project, in the Big Data analytics area, is called swDNNv0 and involves developing an open-source deep neural network library for the Sunway supercomputer. There are two goals. The first is architectural optimization for high performance and efficiency, achieving good support for double-precision, single-precision, and fixed-point number representations. The second is algorithmic optimization, to discover how to benefit from large-scale computing resources.
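Supporting fixed-point representations in a DNN library typically means quantizing floating-point tensors to low-bit integers with a shared scale. The sketch below is a minimal symmetric quantizer for illustration, not swDNN code; the scale selection is the simplest possible choice:

```python
# Minimal symmetric fixed-point (int8-style) quantization sketch.

def quantize(values, bits=8):
    """Map floats to signed fixed-point integers with a shared scale."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) for v in values], scale

def dequantize(q, scale):
    """Recover approximate floats from the fixed-point values."""
    return [v * scale for v in q]

weights = [0.52, -1.27, 0.03, 1.0]
q, scale = quantize(weights)
approx = dequantize(q, scale)
print(q)                                  # [52, -127, 3, 100]
print([round(a, 3) for a in approx])      # close to the original weights
```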
Haohuan Fu told the audience that there have been five Gordon Bell submissions for applications running on the Sunway TaihuLight system. They tackle a fully-implicit nonhydrostatic dynamic solver for cloud-resolving atmospheric simulation; a highly effective global surface wave numerical simulation with ultra-high resolution; a peta-scale atomistic simulation of silicon nanowires; a large-scale phase-field simulation of coarsening dynamics; and time-parallel molecular dynamics simulations.
The fully-implicit nonhydrostatic dynamic solver for cloud-resolving atmospheric simulation is a collaboration project supported by the Institute of Software of the Chinese Academy of Sciences, Tsinghua University and Beijing Normal University. The goal is to develop an efficient and scalable solver composed of a hybrid multigrid domain decomposition preconditioner, a physics-based multi-block ILU, and parallelization and optimization at the process, thread and instruction levels.
The highly effective global surface wave numerical simulation with ultra-high resolution is a collaboration project supported by the First Institute of Oceanography and Tsinghua University. It has been scaled up to 8,519,680 cores.
The peta-scale atomistic simulation of silicon nanowires is supported by the Institute of Process Engineering from the Chinese Academy of Sciences. It achieved a 6.62 to 14.7 PFlops double-precision performance.
The large-scale phase-field simulation of coarsening dynamics is supported by the Computer Network Information Center of the Chinese Academy of Sciences. It reached a sustained performance of 40 PFlops.
Three of these five projects have been selected as Gordon Bell finalists, concluded Haohuan Fu.