Twenty-four years ago, it was challenging to really test the machines and show their performance, not only on that benchmark but in a way that matched real applications. Over the years, the LINPACK benchmark has perhaps become less relevant than it was back then, but it is still used. There is a lot of historical data, and the TOP500 is something Jack Dongarra believes will be around for a long time. There should also be other benchmarks that measure other things, but Jack Dongarra did not want to go into that discussion.
At the start of the ISC 2016 Conference, there was a very interesting announcement. The Sunway TaihuLight supercomputer, built in China, was named the new number one in the TOP500. Sunway is a unique computer in that it is based on Chinese parts. There is a Chinese processor. The Chinese fabricated the interconnect and put the machine together at the research facility in Wuxi. The chips were actually fabricated there. The machine is based on a many-core processor that has 260 cores in it. These are very lightweight cores. They have a clock rate of 1.45 Gigahertz. Each of these cores has a peak performance of about 11 Gigaflop/s. The chip has a peak performance of about 3 Teraflop/s. The chips have been put together in a certain configuration to fill up a cabinet. This cabinet has a peak performance of about 3 Petaflop/s. The Chinese put together 40 of these cabinets, for a theoretical peak performance of roughly 125 Petaflop/s. They ran the benchmark using all the cores this machine has, which is about ten million cores in aggregate. With that full machine, the benchmark came in at 93 Petaflop/s. This is roughly 74 percent of the theoretical peak performance. This is a pretty impressive number, Jack Dongarra stated.
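The arithmetic behind these figures can be checked with a short sketch. It uses only the rounded numbers quoted above; the function and constant names are illustrative and do not come from any TOP500 tooling. Because the inputs are rounded, the bottom-up totals land slightly below the quoted peak values.

```python
# Rounded figures as quoted in the article; names are illustrative only.
CORES_PER_CHIP = 260
GFLOPS_PER_CORE = 11.0       # "about 11 Gigaflop/s" peak per core
CABINETS = 40
PFLOPS_PER_CABINET = 3.0     # "about 3 Petaflop/s" peak per cabinet
PEAK_PFLOPS = 125.0          # quoted theoretical peak
LINPACK_PFLOPS = 93.0        # quoted benchmark result


def chip_peak_tflops(cores: int, gflops_per_core: float) -> float:
    """Peak of one chip in Teraflop/s, from the per-core peak."""
    return cores * gflops_per_core / 1000.0


def linpack_efficiency(achieved_pflops: float, peak_pflops: float) -> float:
    """Fraction of the theoretical peak reached by the benchmark run."""
    return achieved_pflops / peak_pflops


if __name__ == "__main__":
    chip = chip_peak_tflops(CORES_PER_CHIP, GFLOPS_PER_CORE)
    system = CABINETS * PFLOPS_PER_CABINET
    eff = linpack_efficiency(LINPACK_PFLOPS, PEAK_PFLOPS)
    print(f"chip peak from rounded inputs:   {chip:.2f} Teraflop/s")
    print(f"system peak from rounded inputs: {system:.0f} Petaflop/s")
    print(f"LINPACK efficiency:              {eff:.1%}")
```

The rounded per-core and per-cabinet values give about 2.9 Teraflop/s per chip and 120 Petaflop/s for the system, consistent with the quoted "about 3 Teraflop/s" and "roughly 125 Petaflop/s".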
The interesting thing is that the power consumed while running the benchmark came in at roughly 15 Megawatts. This is brilliant for that benchmark. It translates into about 6 Gigaflop/s per watt. This is a pretty high number and a very impressive efficiency, Jack Dongarra pointed out. Most of the machines we see in the TOP10 come in at about 2 Gigaflop/s per watt. This machine is three times more efficient in terms of power for that benchmark and is about three times more powerful than the previous number one machine, 2.7 times more powerful to be exact. That is quite a combination.
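The power-efficiency figure follows directly from the benchmark result and the power draw. A minimal sketch of the unit conversion, using the article's rounded numbers and a hypothetical helper name:

```python
def gflops_per_watt(pflops: float, megawatts: float) -> float:
    """Convert a Petaflop/s result and a Megawatt power draw to Gigaflop/s per watt."""
    flops = pflops * 1e15      # Petaflop/s -> flop/s
    watts = megawatts * 1e6    # Megawatts  -> watts
    return flops / watts / 1e9


if __name__ == "__main__":
    taihulight = gflops_per_watt(93.0, 15.0)   # figures quoted in the article
    typical_top10 = 2.0                        # "about 2 Gigaflop/s per watt"
    print(f"TaihuLight: {taihulight:.1f} Gigaflop/s per watt")
    print(f"ratio to a typical TOP10 system: {taihulight / typical_top10:.1f}x")
```

With these inputs the result is about 6.2 Gigaflop/s per watt, roughly three times the typical TOP10 figure mentioned above.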
The machine also has some other interesting characteristics. It was used in some real applications. Those applications were written up and submitted to the Supercomputing Conference as potential Gordon Bell contenders. The Chinese actually submitted five papers. Three of those papers were chosen as Gordon Bell finalists. To put this in perspective, only six papers are chosen as Gordon Bell finalists at the SC event, so they have half of the papers that are in the running for the Gordon Bell. This shows that it is a very powerful machine with a very high impact. It is not just a stunt machine but a machine that can be used for real applications. Jack Dongarra doesn't know how much work went into developing those applications, but they represent a non-trivial implementation, whereas the benchmark is perhaps considered a trivial implementation.
It is an impressive ecosystem. There are some deficiencies in the machine, and those deficiencies come about when one starts to move large amounts of data around. It is a good architecture for doing very dense matrix computations, but if one starts to move information and new data through the memory hierarchy, one starts to see the weaknesses of the machine. The machine is using slow DDR3 memory. The network that they have provides, overall, a very poor interconnect system. The other benchmark that Jack Dongarra has is called the HPCG benchmark, a benchmark that does a lot of data movement. It implements the solving of a system of equations, but this time using an iterative method that operates on a sparse matrix. On this benchmark the machine shows an efficiency of 0.3 percent of peak performance. That is a very low number compared to some of the other machines, Jack Dongarra explained. The other machines come in between 1 and 2 percent, with the highest machine being the K computer, which comes in at a little over 4.5 percent of the theoretical peak.
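To illustrate the kind of computation described here, the following is a toy sketch of an iterative solver on a sparse matrix. It is not the HPCG code: HPCG runs a preconditioned conjugate gradient method on a three-dimensional problem, whereas this sketch assumes an unpreconditioned conjugate gradient on a small one-dimensional tridiagonal system, and all function names are illustrative. What it does show is the pattern: each iteration streams through whole vectors and performs only a few floating-point operations per value read, which is why memory and interconnect speed matter more than peak arithmetic rate.

```python
def laplacian_1d_matvec(x):
    """Sparse matrix-vector product y = A x for the tridiagonal matrix
    with 2 on the diagonal and -1 on the two off-diagonals."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        y[i] = 2.0 * x[i]
        if i > 0:
            y[i] -= x[i - 1]
        if i < n - 1:
            y[i] -= x[i + 1]
    return y


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A given only as matvec."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x, with x starting at zero
    p = list(r)              # initial search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        ap = matvec(p)
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


if __name__ == "__main__":
    x_true = [1.0] * 8
    b = laplacian_1d_matvec(x_true)
    x = conjugate_gradient(laplacian_1d_matvec, b)
    print(max(abs(xi - ti) for xi, ti in zip(x, x_true)))  # error of the computed solution
```

In the matrix-vector product, every stored value takes part in only a couple of operations before the next one has to be fetched, in contrast to the dense matrix computations of LINPACK, where data can be reused many times once loaded.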
So, this machine has the potential to do very well on certain kinds of problems, using all the cores in the system very efficiently. For other problems, however, such as those related to solving three-dimensional partial differential equations, it is going to be very hard to extract that performance from this architecture.
Primeur Magazine wanted to know what answer Europe, the USA and Japan would come up with in the coming years for the TOP500 to counter the Chinese ambition.
Jack Dongarra said that in the USA three big machines are planned to come online in the 2018 time frame. They will be phased in, one at the beginning, one in the middle, and one towards the end of that time period. These machines will have a peak performance roughly equal to or greater than that of this machine. So, that is about a year and a half away. The Chinese certainly have a lead in machines with a performance on the order of 100 Petaflop/s or more, a lead of perhaps 18 months at this point. In Europe, Jack Dongarra is not exactly sure what the situation is in terms of big machines, so he could not comment on that. The Chinese machine has been in operation for a little while. There is some speculation that they are working on a follow-up machine. That follow-up machine could take them close to half an Exaflop/s in terms of performance, if not more. To know when there will be a new number one, one will need to talk to the Chinese.
This is one of three projects in China. This is the Wuxi project. There is a project going on at the NUDT, the National University of Defense Technology in Changsha, to upgrade the machine that they have, which is the Tianhe-2. They are planning to replace all the Intel parts with parts of a different design, which would take that machine over 100 Petaflop/s. There is a rumour about a machine at the Chinese Academy which could come in with their own processor, again to be in this race.