China's efforts in HPC in the next 5 years - from exascale prototypes to an exascale system

21 Jun 2017 Frankfurt - China has firmly held the first and second positions on the TOP500 list of the fastest supercomputers in the world for some time already. But where is China heading? Depei Qian, Beihang University/Sun Yat-sen University, gave an overview of China's efforts in HPC in the next 5 years at the HPC in Asia workshop at ISC'17 in Frankfurt. He started his presentation with a historical overview of HPC in China.

Until now, Depei Qian said, there have been three "863 Key Projects" on HPC. The first, running from 2002-2005, focused on an HPC computer, core HPC software, and setting up Tflop/s systems all around the country as part of the China National Grid (CNGrid). The second, from 2006-2010, focused on high productivity computing and a Grid service environment. As part of it, Petascale supercomputers were developed and service features were added to the HPC environment.

The third, from 2010-2016, focused on the development of 100 Petaflop/s supercomputers, developing large scale HPC applications, and upgrading CNGrid.

Depei Qian recalled that in the past twenty years, China has made significant progress in HPC: from the Dawning 1000, which had a performance of 2,15 Gflop/s, through the TH-1 and Sunway Bluelight supercomputers in 2011, which performed over one Petaflop/s. The Sunway used home-grown processors. In 2016 the Sunway TaihuLight performed 125 Petaflop/s peak - a 50 million times increase in 20 years. In 1996 China had only one national HPC centre; in 2016 there were 17 national supercomputer centres, all part of China's national Grid.

China also has what it calls 'Application villages', Depei Qian said, each specializing in a domain-oriented application space and providing services to end-users. Currently, "App villages" are being developed for industrial product design, new drug discovery, and digital media. From the limited applications available in 1996 - weather forecasting and oil exploration - the available HPC applications now span a wide range of areas. In 2016 a Gordon Bell Prize was awarded for running an application on 10 million cores in parallel on a Chinese-built supercomputer.

There are some lessons to be learned, noted Depei Qian. Coordinated effort between national research programmes and regional development plans is needed. There should also be multi-lateral collaboration between supercomputing centres, enterprises such as Lenovo and Sugon, and application organisations. Co-design is needed: a balanced and coordinated development of HPC machines, the HPC environment, and HPC applications.

Of course, a large multi-annual project like this also has its problems. Depei Qian mentioned a few of them. Lack of a long-term programme for HPC is one: HPC has to compete for funding with other areas every five years. China is also not very strong in core HPC technologies like processors/accelerators, novel devices like new memory technologies, and large scale parallel algorithms. HPC application software is still a bottleneck despite the formation of the domain application software centres. China also has a shortage of talented HPC people, especially cross-disciplinary talents. There is still a lack of multi-disciplinary collaboration.

Issues in exascale system development

Depei Qian saw some major challenges ahead when developing exascale systems: power consumption, application performance, programmability, and resilience are the most important ones. So there will probably be trade-offs to be made between performance, power consumption, and programmability. Achieving continuous, non-stop operation, or at least a very long time between system failures, will require effort. Exascale systems must also be able to support a wide range of applications with reasonable efficiency, which is not easy to do.

For the architecture of an exascale system, China is looking at a novel architecture beyond the current heterogeneous, accelerated many-core based architectures. Co-processor based or partitioned heterogeneous architectures are one option. A co-processor based architecture has some disadvantages: in some applications there is low utilisation of the co-processor; it can be difficult to use the CPU and the co-processor accelerator at the same time; and moving data between the CPU and the co-processor is a bottleneck. Application-aware architectures look promising, but there are still open questions. On-chip integration of special purpose units allows choosing the right tool to do the right thing. However, it is not clear whether such a system could be made dynamically configurable and how it could be programmed.

Memory systems are always an issue, Depei Qian continued - we all know the memory wall. China is pursuing large capacity, low latency, high bandwidth memory. One approach is to increase the capacity and lower the power consumption by using both DRAM and NVM. But then there is a data placement issue, similar to that of cache approaches. Another option is to improve the bandwidth and latency by using 3D stacking technology. Data movement should also be minimised by placing data closer to the processor: using high-bandwidth memory near the processor, using on-chip DRAM, or performing simple functions in memory. Data copying costs can also be reduced by using a unified memory space in a heterogeneous architecture.
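To make the DRAM-plus-NVM placement issue concrete, the sketch below shows how an application can steer hot data to DRAM and bulk data to NVM with the open source memkind allocator. This is purely illustrative: the library choice, the MEMKIND_DAX_KMEM kind (NVM exposed as a separate memory tier), and the array sizes are assumptions of this example, not details from the talk.

    /* Minimal sketch: tiered data placement with the memkind library.
     * Assumption: the machine exposes NVM as a system-addressable tier
     * reachable via MEMKIND_DAX_KMEM; otherwise the call fails cleanly. */
    #include <memkind.h>
    #include <stdio.h>

    int main(void)
    {
        size_t hot_n  = 1u << 20;   /* small, frequently updated working set */
        size_t cold_n = 1u << 26;   /* large, mostly-read bulk data          */

        /* hot data in DRAM, cold data in NVM */
        double *hot  = memkind_malloc(MEMKIND_DEFAULT,  hot_n  * sizeof *hot);
        double *cold = memkind_malloc(MEMKIND_DAX_KMEM, cold_n * sizeof *cold);
        if (!hot || !cold) {
            fprintf(stderr, "tiered allocation failed\n");
            return 1;
        }

        /* ... compute kernel: touch 'hot' often, stream over 'cold' ... */

        memkind_free(MEMKIND_DEFAULT,  hot);
        memkind_free(MEMKIND_DAX_KMEM, cold);
        return 0;
    }

The decision which arrays live in which tier is exactly the cache-like placement problem mentioned above, only now exposed to the programmer or runtime.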

The interconnect is another important part of an exascale system. Here, Depei Qian said, China is pursuing low latency, high bandwidth, and lower energy consumption. Highly scalable interconnects are a must for exascale systems. An interconnect should be able to connect 10.000+ nodes in a low-hop, low-latency topology and provide reliable and intelligent routing. To improve the performance of interconnects, new technologies have to be adopted, like silicon photonics for communication between components, optical interconnects and communication, and miniature optical devices.

Programming these heterogeneous systems is also a big issue: how to express the parallelism? This requires efficient expression of parallel dependencies, data sharing, and execution semantics. The problem of how to decompose an application in an appropriate way for heterogeneous many-core systems also needs to be tackled. In China a holistic approach is proposed to improve programmability, Depei Qian said. This will require new programming models and new compiler-supported programming language extensions. Parallel debugging tools and runtime support and optimisation are needed, as is architectural support for these new programming models.
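The talk did not detail what the new Chinese programming models will look like, but directive-based offload gives a feel for what expressing parallelism and data sharing on a heterogeneous node involves today. Below is a minimal OpenMP 4.5 target sketch in C; the kernel and array sizes are arbitrary example choices.

    /* Minimal sketch: directive-based offload to an accelerator with
     * OpenMP target. The map() clauses make host/device data sharing
     * explicit - one of the issues any new model has to address. */
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static float x[N], y[N];
        const float a = 2.0f;

        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        /* offload the loop; copy x to the device, copy y both ways */
        #pragma omp target teams distribute parallel for map(to: x) map(tofrom: y)
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];

        printf("y[0] = %f\n", y[0]);   /* expect 4.0 */
        return 0;
    }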

Concerning computational models and algorithms for exascale systems, China tries to follow a full-chain-of-innovation approach: from mathematical methods, to computer algorithms, to algorithm implementation and optimisation. A good mathematical method often has more influence on performance, and is more effective at reaching high performance, than hardware improvements or algorithm optimisation - a classic example is replacing a simple iterative solver by a multigrid method, which reduces the work for many elliptic problems to roughly linear in the problem size. Architecture-aware implementation and optimisation of algorithms are needed on heterogeneous systems. To improve software productivity and performance, domain specific libraries need to be developed.

Resilience is one of the key issues for a future exascale system, Depei Qian pointed out. This is because of the large scale of the system - between 50.000 and 100.000 nodes - the huge number of components, the resulting very short Mean Time Between Failures (MTBF), and the long non-stop operation times required for solving large scale problems. So reliability measures are required at different levels: device level, node level, and system level. Coordination between the hardware and software is necessary. This should allow for fast context saving and recovery for checkpointing in case of a short MTBF. The algorithms and the application software level should also be fault-tolerant.
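As one concrete piece of this hardware/software coordination, applications can checkpoint their own state so that a restarted job resumes instead of recomputing everything. The C sketch below shows the bare pattern; the file name, state layout, and checkpoint interval are assumptions of this example, and real exascale checkpointing would use parallel I/O, burst buffers, or in-memory copies.

    /* Minimal sketch: application-level checkpoint/restart control flow. */
    #include <stdio.h>
    #include <string.h>

    #define N 1024
    #define CKPT_EVERY 100
    #define CKPT_FILE "state.ckpt"

    static void save_checkpoint(int step, const double *state)
    {
        FILE *f = fopen(CKPT_FILE, "wb");
        if (!f) return;
        fwrite(&step, sizeof step, 1, f);
        fwrite(state, sizeof *state, N, f);
        fclose(f);
    }

    /* returns 0 and fills step/state if a checkpoint was found */
    static int load_checkpoint(int *step, double *state)
    {
        FILE *f = fopen(CKPT_FILE, "rb");
        if (!f) return -1;
        int ok = fread(step, sizeof *step, 1, f) == 1 &&
                 fread(state, sizeof *state, N, f) == N;
        fclose(f);
        return ok ? 0 : -1;
    }

    int main(void)
    {
        double state[N];
        int step = 0;
        if (load_checkpoint(&step, state) != 0)
            memset(state, 0, sizeof state);        /* fresh start */

        for (; step < 10000; step++) {
            for (int i = 0; i < N; i++)
                state[i] += 1.0;                   /* stand-in for real work */
            if (step % CKPT_EVERY == 0)
                save_checkpoint(step, state);
        }
        return 0;
    }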

Tools will be very important for exascale systems to improve performance. This is very critical for China, Depei Qian said, because China is forced to use its home-grown processors, so commercial tools and generally available research tools do not work on them. Three types of tools will be developed: a parallel debugger for correctness checking; a performance tuner for performance improvements; and an energy optimiser to increase energy efficiency.

There is an urgent need for a complete ecosystem that supports China's home-grown processors. This should include languages, compilers, an OS, and runtime support; tools; application development support; and application software. It is not easy to create such an ecosystem in a short time. Hence coordinated efforts are needed between the hardware manufacturers and third party software developers. The ecosystem should support a product family rather than a single system. Collaboration is needed between industry, academia, and end-users.

New key project in China's 13th 5-year plan

The national research and development system is being reformed. More than 100 different national programmes and initiatives have been merged into 5 tracks of national programmes:

  • Basic research programme (NSFC)
  • Mega-science and technology programmes
  • Key R and D (the former 863 and 973 enabling programmes)
  • Enterprise innovation programme
  • Facility and talent programme

HPC is part of the Key R and D programme.

Depei Qian said China has now launched a new key project on HPC as part of the 5-year programme that started last year. High-performance computing has been identified as a priority subject under the Key R and D programme. Strategic studies, which started in 2013, are at the basis of the new key HPC project. The HPC key project proposal was submitted for the 13th five-year plan early in 2015. It was approved in October 2015.

The motivation for the new HPC key project was threefold: exploiting the key value of exascale computers; advancing the computer industry by technology transfer; and developing self-controllable HPC systems.

One of the key values of exascale supercomputers is the ability to address grand challenges, like energy shortage, pollution, and climate change. Another key value is enabling industry transformation, for example in high-speed trains, commercial aircraft, and automobiles. China is undergoing an economic transformation from manufacturing-centric to service- and innovation-centric, and HPC is playing its role there. Another key value of the HPC project is to foster social development for the people's benefit; new drug discovery, precision medicine, and digital media are examples. The last key value is the traditional role of HPC in enabling scientific discovery; high energy physics, computational chemistry, and new materials are examples.

The need for developing HPC systems using self-controllable technologies - 'self-controllable' meaning under China's own control - was a lesson learnt from the recent US embargo on the use of US processors in China's largest supercomputers.

The goals of the key HPC project are:

  • Strengthening R and D on kernel technologies and pursuing a leading position in high performance computer development.
  • Promoting HPC applications and establishing an HPC application eco-system.
  • Building an HPC infrastructure with service features and exploring the path to an HPC service industry.

There are three major tasks in the HPC project: exascale computer development; HPC applications development; and HPC environment development.

Within the exascale computer task, R and D will be done on novel architectures and key technologies. An exascale computer will be developed based on home-grown processors, with technology transfer to promote the development of high-end servers.

The HPC applications development task will perform basic research on exascale modelling methods and parallel algorithms. It will also develop high performance application software and establish an HPC application eco-system.

The HPC environment development task will take care of developing a platform for the national HPC environment. It will also upgrade the national CNGrid environment, and develop service systems on top of it.

Each task will cover basic research, key technology development, and application development.

The basic research in the first task will concentrate on a novel high performance interconnect and on new programming and execution models for exascale systems. The theoretical work on interconnects will be based on enabling technologies such as 3D chips, silicon photonics, and on-chip networks, not only for exascale, but also for post-exascale. New programming models will be developed for heterogeneous systems, and research will be done on improving programming efficiency.

The key technology development in the first task consists of the three exascale prototypes. These prototypes need to verify exascale system technology. Possible architectures for exascale computers will be explored, as will implementation strategies and technologies for energy efficiency.

A typical prototype system will consist of 512 nodes, with 5-10 Tflop/s performance per node - roughly 2,5-5 Pflop/s peak for the whole prototype. It will deliver 10-20 Gflop/s per Watt. The point-to-point bandwidth will exceed 200 Gbps, and the MPI latency will be below 1,5 microseconds. The emphasis, Depei Qian stressed again, will be on self-controllable technologies. System software will be developed for the prototypes, and 3 applications will be used to verify the design. The three teams are NRCPC, which is developing the Sunway systems, Sugon, and NUDT.

There is no clear technical solution for exascale yet, Depei Qian said.

Other key technologies developed in task one are architectures optimised for multiple objectives; a highly efficient compute node; high performance processor and accelerator design; exascale system software; a scalable interconnect; parallel I/O; an exascale infrastructure; energy efficiency; and exascale system reliability. Based on those technologies, an exascale computer system will be developed. The exascale system will have an exaflop/s peak performance and a Linpack efficiency better than 60%. Because of budget limitations, it will have only 8-10 Petabyte of memory and an Exabyte of storage; the limited memory is a major problem. The system will have an energy efficiency of 30 GFlop/s per Watt - below the DoE's efficiency target - so the total system will need more than 30 MWatt. The interconnect bandwidth will exceed 500 Gbps. There will be large scale system management and resource scheduling. An easy-to-use parallel programming environment will be made available, and there will be support for large scale applications. The system will have system monitoring and fault tolerance.

The basic research in Task 2 - HPC application development - will concentrate on computational modelling and computational methods for exascale systems. Highly scalable and efficient parallel algorithms and libraries will be developed too.

The key technology developed in Task 2 will be a programming framework for exascale software development, including frameworks for structured and unstructured meshes, mesh-free methods, combinatorial geometry, finite elements, and graph computing. In total, 40 different types of software should be supported with million-core parallelism. Demo applications of these key technologies include numerical simulators, such as a numerical nuclear reactor, a numerical aircraft, a numerical earth system, and a numerical engine. Furthermore, Task 2 will develop high performance application software for domain applications, including computational engineering, numerical simulation of the ocean, design of energy-efficient fluid machinery, drug discovery, electromagnetic environment simulation, ship design, oil exploration, and digital media rendering. High performance applications for scientific research include material science, high energy physics, astrophysics, and the life sciences.
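To give an idea of what such a structured-mesh framework has to generate and manage, here is the canonical kernel pattern in C - a single 2D 5-point Jacobi sweep. The grid size and the kernel are generic textbook choices for illustration, not taken from the project.

    /* Minimal sketch: a structured-mesh stencil sweep (2D 5-point Jacobi).
     * A mesh framework automates the distribution, halo exchange, and
     * tuning of loops like this across millions of cores. */
    #define NX 256
    #define NY 256

    static double u[NX][NY], unew[NX][NY];

    void jacobi_sweep(void)
    {
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++)
                unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                     u[i][j-1] + u[i][j+1]);
    }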

Finally, Task 2 will take care of the very important platform for HPC application software development - an eco-system effort. As part of this, a national-level research and development centre for HPC application software will be established. A platform will be built for HPC software development and optimisation. Tools for performance, efficiency, and pre- and post-processing will be developed. A software resources repository will be built too. Domain specific application software development is also part of this task. This task will be a joint effort of national supercomputing centres, universities and research institutes, and industry. The National Supercomputer Center in Guangzhou will lead this effort.

The basic research in Task 3 - HPC environment development - will concentrate on models and architectures of computational services. In addition, it will look at virtual data spaces: an architecture for cross-domain virtual data spaces that integrates distributed storage with unified access and management. It will allow for partitioning, separation, and security.

The key technology development in Task 3 is mainly focused on mechanisms and a platform for the national HPC environment, including technical support for service-mode operation. The national HPC environment (CNGrid) will be upgraded to over 500 Petaflop/s of computing resources; more than 500 Petabyte of storage; over 500 application software packages and tools; and more than 5.000 (team) users.

The demo applications in Task 3 will look at service systems based on the national HPC environment. This includes an integrated business platform; application villages, including an SME computing and simulation platform; and a platform for HPC education that will provide computing resources and services to undergraduate and graduate students. This should also help solve the talent shortage issue.

Calls for proposals

The first Call for Proposals was issued in February 2016 and the first projects were launched in July 2016. The second Call was issued in October 2016 (for 2017); the proposal evaluation for this call has ended. These two calls combined cover most of the tasks of the key HPC project, except the development of the exascale system itself. That development will only start after the completion of the three prototypes.

A recording of the presentation is available on our Primeur video channel.

Ad Emmen