2016 - Another year on the Road to Exascale

17 Jun 2017 Frankfurt - Just before ISC17 starts, we publish the thoughts Satoshi Matsuoka and Thomas Sterling shared last year, in 2016. The interview was conducted at ISC 2016 in Frankfurt, but it remains relevant as a historical document; after all, we are not at exascale performance yet.

Primeur Magazine:So 2016, another year on the road to exascale. The question is: how many more years? Will it be 2020, or will it be 2018, like some Europeans say, or will it be 2022?

Satoshi Matsuoka:To be very short, what the Chinese have demonstrated here at ISC 16, with the Wuxi centre and the Sunway TaihuLight machine, definitely paves the way for a peak exascale machine by the 2019-2020 timeframe, if they really push for it. It will not be 20 MW, more like 35-40 MW. On the other hand, a lot of the new centres are shooting for that capability, including Japan's Post-K, Oak Ridge and so forth. So given that, if you extrapolate the TaihuLight technology to the 2020 timescale, yes, there can be exascale. The question is: "Will they do it?" I am sure they will.
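
As a rough editorial illustration of that extrapolation: TaihuLight's commonly cited figures from the June 2016 TOP500 list are about 125 Petaflop/s peak at roughly 15.4 MW. The sketch below treats those numbers as approximate assumptions and scales them to a 1 Exaflop/s peak machine; the 35-40 MW estimate then corresponds to roughly a threefold improvement in energy efficiency over TaihuLight, while 20 MW would require about sixfold.

```python
# Back-of-envelope check of the power extrapolation Matsuoka describes.
# The TaihuLight figures below are the commonly cited June 2016 TOP500
# numbers; treat them as approximate assumptions.

taihulight_peak_pflops = 125.4   # peak performance
taihulight_power_mw    = 15.4    # power draw during LINPACK

# ~8.1 GF/W at peak
efficiency_gflops_per_w = taihulight_peak_pflops * 1e6 / (taihulight_power_mw * 1e6)

def power_for_exaflop(efficiency_gain):
    """Power (MW) of a 1 EF peak machine if energy efficiency improves
    by `efficiency_gain` over TaihuLight."""
    exaflop_gflops = 1e9                      # 1 EF = 1e9 GF
    watts = exaflop_gflops / (efficiency_gflops_per_w * efficiency_gain)
    return watts / 1e6

for gain in (1.0, 2.0, 3.0, 6.0):
    print(f"{gain:.0f}x efficiency gain -> {power_for_exaflop(gain):6.1f} MW")
# 1x -> ~123 MW, 3x -> ~41 MW, 6x -> ~20 MW: the 35-40 MW estimate implies
# roughly a 3x efficiency improvement by 2020; 20 MW would need about 6x.
```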

Thomas Sterling:The Chinese machine is a game changer. It is a controversial machine in its approach and, frankly, its aggressiveness is astounding. I had not anticipated that we would be within reaching distance of exascale for another year, that is, a peak performance in excess of 100 Petaflop/s: 125 Petaflop/s, with a LINPACK number somewhere around 93 Petaflop/s. This really changes things, especially after the static situation we have had in the Top10 of the TOP500 for the last three years, literally. It is more than simply having a big machine; it is a game changer because this machine is truly internal to the Chinese people and to the technology in the nation. This is their processor core, and that core takes a distinct departure from conventional practices.

There has been a drift in trends in some cases towards smaller cores. This is more reminiscent of the Cell architecture of eight years ago that was used, unsuccessfully in my view, in high performance computing with the introduction and almost instant demise of the Roadrunner machine, which was credited as the first Petaflop/s-scale machine. This is 100 times more powerful, with a very different balance from conventional machines. The memory-to-core ratio is very, very light. I think it is the right decision. Secondly, the aggressive use of scratchpad memory in lieu of caches is a challenge. It may be a mistake, or it may in fact be the right direction, and, most importantly, it comes with applications.

The Chinese have been cautiously criticized for inadequacy in this respect in previous years, but now they are bringing the full package. I have tremendous respect. I am not denying the outcome; I think what they have done is exciting, and they have recharged the interest. I do not see it as a competition, I see it as a call to arms: yes, in fact we can get to exaflop/s earlier. I will say, in passing, it will be fascinating to see what the response to this will be in the US. I think it will be significant, not necessarily in a positive way. I think that our colleagues in the EU will be reconsidering their plans accordingly. I think the Japanese are on a good, stable path, and they need to stay on course to produce the high-quality machines they have always been known for.

Primeur Magazine:From June 15 to 17, 2016, there was the BDEC workshop. Was there an answer there?

Satoshi Matsuoka:Certainly, the information concerning the Chinese machines has been known in inner circles. Of course, we were not able to talk about it until today. It is not as if the Chinese gave birth to this machine in a year or so. In fact, it has been in the planning for five years, because the previous Tianhe system was a petaflop/s machine in 2011. Before that there were previous-generation machines which were not so large. It does take five years or so to develop a new machine, and the Japanese Post-K supercomputer, as Thomas said, is on track. The detailed design has already been worked out. So there may be some minor changes, perhaps in response to the Chinese machine, but overall the outcome of the Japanese machine will be by and large the same, or in fact improved. It will not be as aggressive as, as Thomas said, the TaihuLight system, for better or for worse. It will be more general purpose and more accommodating, but as a consequence the performance could be less for the same power envelope.

There are two reasons why the TaihuLight has become a success. One is, of course, that since Roadrunner people have learned how to program GPUs, and there is a software stack there. People have learned how to program Xeon Phis, with multi-threading. There are now applications that scale on those many-core processors, which is very different from the Roadrunner days. So I know that many of the application teams running on the TaihuLight, achieving this incredible performance, are the same people that were using GPUs. Now they are more used to this type of architecture. Secondly, they have got the best teams, not only on the applications but also on the system software, so that the machine will be usable. Again, this is very different from their previous efforts, where much of the software was nascent and the application teams were in the dark about what did and did not work. In essence, the most impressive thing about this Chinese achievement is the whole package. It is not just the fact that they achieved 125 Petaflop/s; they achieved this in a tightly integrated, well-managed project. This is very different from previous Chinese efforts. Typically, Chinese projects were less managed and more ad hoc in their planning. This time it is definitely not that.

Primeur Magazine:If I am correct, they have several teams, with another one around Tianhe-2?

Satoshi Matsuoka:They have three teams. But, of course, the TaihuLight team now obviously has a significant lead. We will see how the other teams react.

Primeur Magazine:Are they competing inside China?

Satoshi Matsuoka:Yes. That is like asking the question: do Cray and IBM compete? Do Fujitsu and NEC compete?

Primeur Magazine:Are there any other significant things that have happened in the exascale arena?

Thomas Sterling:The US is undergoing extreme confusion. It is at the highest level of politics and the medium level of funding, and at the lowest levels there is real disagreement about future progress. At the highest level, since we last met, there is the White House executive order creating the National Strategic Computing Initiative, with oversight by the OSTP, the Office of Science and Technology Policy, as well as the OMB, the Office of Management and Budget. This was very exciting, but it proved to be a shadow of the underlying reality. There was no specified budget, and the budget has proven to be probably about one quarter of what was necessary. It was defined to be a whole-of-government exercise, but in fact there is little or almost no apparent or visible relationship between the different agencies. The funny thing, in fact, is that even within the lead agency, the Department of Energy, instead of there being a unification, things seem to have occurred along lines that are not unfamiliar to those inside federal institutions, such as the NNSA, the National Nuclear Security Administration, versus the Office of Science, and also between and among the different laboratories in the Department of Energy.

I say this somewhat unkindly, but not without validity: there is a great deal of contention in these attempts, even as there is some cooperation and coordination amongst the labs in planning and working with industry on the next machines in the hundreds of Petaflop/s range, the CORAL systems, which will deploy at the very end of next year, 2017, and into 2018: Summit at Oak Ridge National Lab, Sierra at Lawrence Livermore National Lab, and Aurora at Argonne National Lab. Two of these machines, the ones at Livermore and Oak Ridge, are very similar; the third one is very different. Curiously, if you look at the past, there are similarities, differentiating between coarse-grained accelerators at Oak Ridge and fine-grained cores at Argonne. That is probably good, as it allows past investments in legacy development of scientific codes to match the classes of machines they are developing.

Those are two activities going on currently, in 2016, and in fact they are planning up to exascale. Within the DOE, and therefore largely leading the nation, is the Exascale Computing Project, which has now been activated and is still in the final planning stages. Paul Messina is here at this meeting and will give a talk on the details of the strategy, but not the final, definitive answers about exactly what they are going to do, because there is still a lot of consensus building and optimisation. But there is a big challenge. Because of budget limitations, there will probably have to be a risk-averse approach: a single path rather than a multiplicity of paths. I personally feel that a multi-path approach is actually less risky, even though some of the individual paths might be a little bit forward looking. So we probably will not be able to explore anything other than extensions of conventional practices. For many codes that will probably be satisfactory: weak scaling and throughput computing will probably work up to the 2020s and to the first 1, 2, 3 exaflop/s-scale machines. But there are other trends working against that, and these are the areas of stronger scaling and diverse problems that are irregular in their data structures and truly time-variant. It is unclear whether we are going to be opening up new frontiers broadly or only in narrow, more focused areas.

Primeur Magazine:What about Japan?

Satoshi Matsuoka:Let us talk about Japan and Europe. The Post-K project has been in development for some time. The first inception was in 2010, and the first project money was allocated in 2014 for a feasibility study. The current plan is that the detailed design will be finished by the beginning of 2017, the chips will be fabricated by 2018, and the build process will follow by 2019. The machine should be in full operation by the end of 2019 and then open to the general public by 2020. Now, there might be some deviations, in response to the Chinese situation, as well as some other possible technological deviations, for example in the fabs, but it is largely on schedule. So it will not be an exascale machine, but it will be fairly close, at least in terms of usable capacity. I know at least one application will be 100 times faster than on the K computer, some others about 30 times faster, and some only ten times faster.

This includes improvements in the algorithms. One difference is - maybe we will cover this later - an enormous surge in interest by the government and industry in Artificial Intelligence (AI). There is a recognition by some people - and this was exactly the idea of the keynote at ISC - that HPC and AI really need to be collaborative. One of the reasons why AI has become so prominent is that you need this kind of technology, but the other is that HPC has been an enabler in getting it working for real. It is increasingly the case that you need more capacity, you need more performance. Japan was investing in AI research in the eighties and the early nineties, and it flopped, because back in those days they were using symbolic AI, which has its limitations. But now we are in the era of neural networks, and another advance is stochastic learning processes. This is more credible and more usable. You will see how this enormous increase in the requirements imposed by AI, especially in the learning process, will affect the HPC ecosystem. The interesting thing is that this is already happening in China, because one of the things that happened this year is that China not only has the top two machines, but their number of machines on the list and their total capacity have surpassed the US for the first time.

The US is now number 2 in every respect, for all the major metrics on the TOP500. Why is this so? You can understand why for the top of the TOP500, but why this proliferation of machines all over China, more than 150 on the June 2016 list? This is likely largely due to the fact that they are not used as HPC machines but are actually used for machine learning. These are machines with a bunch of GPUs, but they are listed on the TOP500. So this is an area the Chinese are investing a lot in. They are not only investing in home-grown machines, they are also investing in machine learning platforms. Currently, they are using NVIDIA's GPUs. My question to the AI community in Japan is: "China does this, and we are supposed to have a nation-wide agenda on using AI to advance industry. We are trying to put a whole lot of money into the project. Where is our machine?" China has its own machines, and they are mostly for AI. What are we doing? We can understand this for the US: the phenomenon may be that companies like Google, which have an enormous infrastructure, do not expose their machines to the outside because of corporate privacy. They will not expose their benchmark results. China, however, does.

Europe and Japan are in a better position to expose all this, but I do not think we do it. I think it is because we do not have one. So this is a serious shortcoming. That is something we may see change as people become aware of the situation; they may invest further in HPC machines for AI. Europe has now finally realized that it really needs to. In the past there were only two countries that had a number 1 machine of their own design: the US and Japan. Now China has joined the club with a number 1 machine of its own design. Some components like networks are still made elsewhere, but they have a processor of their own design. Europe, a major continent, has never had a number 1 machine of its own design.

Now you have the TaihuLight, which is 125 Petaflop/s. Japan has 11 Petaflop/s for its top machine, the K computer, but we have got new ones coming. There will be a 25 Petaflop/s machine at the University of Tokyo, and there will be TSUBAME-3. We do not know yet how much that will be, but for machine learning we can reduce precision. If I get all the funds I am asking for, it is definite, technologically definite; it is just a question of money before we have 100 Petaflop/s. These are reduced-precision Petaflop/s, not LINPACK, but something like single precision arithmetic. We will have a capacity of 100 Petaflop/s for machine learning in 2017. So that is already certain: these requests and designs are in the pipeline. In the US, of course, you have CORAL, with 100, 200, 300 Petaflop/s machines in the pipeline, and there is Post-K.
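
To make the reduced-precision arithmetic concrete: on Pascal-class GPUs such as the Tesla P100, single-precision peak is nominally about twice, and half-precision about four times, the double-precision peak. The sketch below uses those ratios with a purely illustrative 25 Petaflop/s double-precision figure; it is not a specification of TSUBAME-3 or any planned machine.

```python
# Minimal sketch of how a "reduced-precision Petaflop/s" figure relates to
# the double-precision (LINPACK-style) peak. The 2x / 4x ratios are the
# nominal FP32 / FP16 ratios of Pascal-class GPUs (e.g. Tesla P100); the
# double-precision peak below is a hypothetical figure for illustration.

fp64_peak_pflops = 25.0               # hypothetical double-precision peak

fp32_peak = fp64_peak_pflops * 2      # single precision: ~2x on Pascal
fp16_peak = fp64_peak_pflops * 4      # half precision:   ~4x on Pascal

print(f"FP64 peak: {fp64_peak_pflops:5.1f} PF")
print(f"FP32 peak: {fp32_peak:5.1f} PF")
print(f"FP16 peak: {fp16_peak:5.1f} PF  <- the 'machine learning' capacity")
# A machine with a few tens of double-precision Petaflop/s can therefore
# credibly advertise on the order of 100 reduced-precision Petaflop/s.
```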

Where is Europe? In Europe the 2016 top machine is Piz Daint, which is 7 Petaflop/s peak. There is an update to Piz Daint coming, but I do not see any credible alternatives. The second machine is in Stuttgart, Hazel Hen, which is a 6 Petaflop/s machine, but it is not a European design. When you look forward, the divergence is large, despite the immense amount of funding that is being poured into European exascale efforts. It is being poured into small projects that are mostly software or human resources, and the funding for the infrastructure has not been adequate. The difference between Europe and the other regions is widening. The European Commission knows this, and it is launching a new programme to try to resolve the situation, to use European technology to design Europe's own machine towards exascale. I am certainly hoping the IPCEI project will be a success, because Europe needs that, and it has the capability. What it really needs to do - I know how machines are planned and built - is start designing, if it would like to have anything credible by 2022. The design work should have started last year, but of course it has not. I think it will be quite a challenge to produce a dedicated design, and it is not just hardware, it is also software, though Europe is pretty strong in software. It still needs a dedicated design that equals the exascale efforts of the other continents to succeed by 2020 or 2022. But having said that, I think it is worthwhile. Otherwise, I think this is the last chance. So Europe should try to recover the lost time. Currently, I do not see any credible organisation for this. It should happen now, and accelerate now.

Primeur Magazine:Thomas, what is your opinion about the situation in Europe?

Thomas Sterling:One of the things I would like to highlight in Europe - and I include Switzerland in that space because of its activity and interrelationship with European projects - is the successful emphasis on end applications, both of a practical nature in industry and society, and in deep science. It should not be lost on us that the recent discovery of the Higgs boson, which we usually associate with a 27 kilometre ring, is in fact as much a computational achievement, both in terms of algorithms and methodologies and in terms of sheer capacities - not just the computing itself, but the data storage and the data movement, almost continent-wide, involving many, many scientists at many different levels and in many roles within Europe and outside of Europe. To cite the work at CSCS, you know the interest in applications for practical meteorological conditions in very specific geographical regions, as well as the UK work, also in meteorology, something they excel at, I might add. The most recent accomplishment I will mention, a companion to the Higgs boson, is of course the discovery of gravitational waves. Not once, but more than once. That too was a computational problem: both simulation and data analysis of the multi-kilometre measurements at LIGO. That is not Europe, but it is another example of the importance of applications. These are computational telescopes. They let us view reality at a level of precision and distance that we simply have not reached before. That is an aspect of this; it is not just the big machine on the floor.

Europe, I have to say, has to learn, if I may paraphrase my colleague, to get its act together. Not primarily because of the competition, but rather because every time there is an order-of-magnitude step, there is a new opportunity for discovery and for the service of society. While I think the European research on the application side is really dramatic, I think they are failing themselves, and to some extent the world, in this regard. If I may say one thing beyond Europe, to the broad international community, beyond Japan, beyond China, beyond the US: there is a lower-level but dramatic increase in engagement - I do not like words like Third World, or developing world, or the other hemisphere - a much broader international engagement, if you look at the Cluster Competition taking place at this conference. You see representatives of a large number of places, and they are successful. The announcement of the deployment of a Petaflop/s system in South Africa is important. This is a nation with many economic challenges, but with a vision of its future, realizing that part of its investment is in industry, society and science as they relate to computing. And this is by no means the only one. Look at the SKA project, which involves Europe, but also involves South Africa. I think the total number is 11 nations.

Of course, the US is not among those, and this may be the single most important scientific experiment. I would like to make one other comment, only because of Satoshi's remark that we are doing the right kind of AI now and that we did not do the right kind of AI in the eighties. I remember the Fifth Generation Computer project, which was perhaps wrong, but very exciting and intellectually stimulating, and it kicked the US in an appropriate location, which caused it to do a bit more. But here is my comment: symbolic processing is quite different from the very productive AI that Satoshi is talking about. Yet symbolic computing, in my personal opinion, will ultimately be the form of computing that consumes the greatest number of cycles. I may have said this before in previous interviews, but we are taking the necessary baby steps, and we are finding that there is a greater statistical domain, a very high-dimensionality domain. We are mastering these right now, with tremendous and provocative opportunity, but in the very long term, I would say, supercomputing - and I concur that it is HPC that is enabling AI - will be heavily consumed with the complex memory systems required for symbolic computing in support of even more advanced Artificial Intelligence applications.

Satoshi Matsuoka:For decision making, the output is symbolic; for classifiers, the ultimate classification is yes or no. Of course, it can be more complex, but this AI will consume significant portions of the compute cycles, and of course the data processing capabilities. This was also pointed out by Andrew Ng during his keynote, and I concur with that very much. It remains to be seen whether the standard neural network is the only methodology, as we know it today. It works very well. It now scales well thanks to various engineering challenges that have been overcome. Of course, there are limitations, but it scales now to 100 GPUs or so, which is quite an achievement. We have done some work in this area too. But there is the other option, which Andrew at this ISC 16 conference was not too positive about, but which some people still believe in: a more direct simulation of neuromorphic processing, with neurons exchanging pulses along the synaptic connections.
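
For readers unfamiliar with what such a simulation computes, here is a minimal leaky integrate-and-fire sketch: each neuron integrates incoming pulses, leaks charge over time, and emits a spike of its own when a threshold is crossed. All constants and the connectivity are arbitrary illustrative choices; this is not the Human Brain Project or SpiNNaker software.

```python
# Minimal leaky integrate-and-fire sketch of the kind of computation a
# neuromorphic simulator performs: neurons integrate incoming pulses
# ("spikes") and fire when a threshold is crossed.

import random

N = 100                       # number of neurons
dt, tau = 1.0, 20.0           # time step and membrane time constant (ms)
v_thresh, v_reset = 1.0, 0.0  # firing threshold and reset potential
w = 0.12                      # synaptic weight

v = [0.0] * N                                   # membrane potentials
# random sparse connectivity: each neuron projects to ~10 others
targets = [random.sample(range(N), 10) for _ in range(N)]

for step in range(200):
    spikes = []
    for i in range(N):
        v[i] += dt / tau * (-v[i]) + w * random.random()  # leak + noisy input
        if v[i] >= v_thresh:                              # threshold crossed
            spikes.append(i)
            v[i] = v_reset                                # reset after spike
    for i in spikes:                                      # deliver pulses
        for j in targets[i]:
            v[j] += w
    if spikes:
        print(f"t={step} ms: {len(spikes)} neurons fired")
```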

The European Human Brain Project basically takes this approach, and of course there are criticisms, but basically their approach to neuromorphic computing itself, trying to build a brain simulator at the neuron level and somehow trying to scale it and make it bigger by several tactics, is a very interesting and complementary approach. They have software they are running on HPC systems. They have two distinct architectures: one is SpiNNaker and the other is the neuromorphic simulator. SpiNNaker, for example, is using small ARM cores, which is European, British, technology, in the hundreds of thousands. Now these are small cores, optimised for simulating neurons, not for floating point, but it is still quite a feat that they were able to build a 500,000-core system and actually make it work. So it is not as if Europe does not have the capability to do excellent science with a strong connection to computer architecture. There are lots of excellent people. It is just that the commercial justification has been more difficult. When you come to think about it, I can give two sides of the coin. Of course, you always need justification to build any credible system and spend a lot of money, and the argument has always been that you need more than HPC: HPC alone cannot fund all these efforts. That is why a lot of systems use Intel processors that have basically the same architectural design as those in your laptop, your workstation or your graphics card, and so forth. Dedicated HPC systems have been waning, because they are losing out to those systems. But if you look again at the TaihuLight system, it has ten million cores. Ten million cores, how much is that? If you take the world's global shipments, in units, how many servers from Dell, HPE, Lenovo and all those server companies are shipped each year?

This may not include all the Cloud vendors; some Cloud vendors integrate their own. But taking all these major vendor shipments, there are about 6 to 7 million per year. Assuming there are some 20 cores on average in each of these machines, that is roughly 150 million cores shipped annually to the entire world, from the Cloud vendors down to the server in your closet. This TaihuLight system, one system, one machine, is ten million cores. Something like 1/15th of the entire annual server core production in the world is in one system. That is huge; that is a market worth tens of billions of dollars. If the market for machines like TaihuLight is big enough, maybe you can make a strong economic justification for their existence, especially if they are tied not so much to HPC as to another important area, machine learning. So these combine well: the backend HPC simulations plus the data processing and analytics, in terms of Big Data and machine learning. These two market segments alone may be sufficient to drive the more efficient, dedicated architectures. To go back, I think there is a question: are we going towards exascale, are we changing? Maybe it is not yet a full change, but a change we are starting to sense. This emergence of machine learning, plus the fact that we are learning how to build these more foreign architectures, means we may be deviating from consumer architectures, and this is evident even for existing vendors, NVIDIA for example.
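
As a rough editorial check on that arithmetic, with the shipment volume and average core count taken as Matsuoka's own approximate assumptions:

```python
# Back-of-envelope version of the core-count comparison above. The annual
# server shipment and average core-count figures are rough assumptions
# quoted in the interview, not precise market data.

servers_shipped_per_year = 7e6        # ~6-7 million servers shipped annually
cores_per_server         = 20         # assumed average cores per node

cores_shipped_per_year = servers_shipped_per_year * cores_per_server
taihulight_cores       = 10_649_600   # Sunway TaihuLight core count

print(f"Cores shipped per year : {cores_shipped_per_year / 1e6:6.0f} million")
print(f"TaihuLight cores       : {taihulight_cores / 1e6:6.1f} million")
print(f"Fraction of annual shipments in one machine: "
      f"1/{cores_shipped_per_year / taihulight_cores:.0f}")
# ~140 million cores shipped per year vs ~10.6 million in TaihuLight:
# roughly 1/13, i.e. on the order of the 1/15 figure cited above.
```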

Even NVIDIA has now, in 2016, announced two Pascal lines. One is the consumer Pascal, a graphics card which is not very different from the old one and is totally tuned towards graphics work. The previous generation was good for both graphics and machine learning; it was not very good at HPC, because it lacked double precision, but it was great for machine learning. Now they have a new card out which is inexpensive - 700 dollars - but the machine learning capabilities did not really go up; graphics performance did. That is because NVIDIA is now putting all the emphasis with respect to machine learning into their Pascal server line, the P100, instead of the P104 for consumers, which is completely different. So it used to be consumer graphics and machine learning here and HPC there. Now that has changed. These server-line Pascals are much more expensive; they are not 700-dollar consumer electronics, they cost thousands of dollars. They are designed to be like that. So you can see that even existing vendors are starting to shift this way. This may accelerate the exascale process, not just because of the competition, but because of this immense opportunity for the vendors to ship and make a profit on these large-scale architectures, way beyond what the Cloud vendors do with two-socket Xeon systems, but much denser and much more performant. So we may see that. I think that is a big change for 2016.

Primeur Magazine:Thomas, do you have some closing statement?

Thomas Sterling:I would say this is one of those rare moments that is very exciting. When the Earth Simulator came out - you may have been there - we were on the island of Santorini. I got a call from John Markoff. I almost said something that would have gotten me fired if he had actually quoted it. But this is one of those turning moments, not when we are given THE new answer, but when we are given THE new challenge. By challenge I do not mean any one nation being beaten by another, that is the easy reading, but rather the intellectual space of moving forward. We understand there are new opportunities, and that perhaps in Europe and in the US we have been a little lackluster in identifying and attacking those challenges, and we are being reminded of that. I think this is a very healthy thing, and I would only encourage our colleagues in China to continue to emphasize the value in the application space, where new discoveries are possible. This is a really cool moment.

Satoshi Matsuoka:I am only happy that this happened at ISC 16, which I am now chairing. To be very honest, some of the programme elements we included, like the keynote and some of the talks on exascale, are the result of anticipating this change. We were cognizant of the change that Thomas mentioned, so we tried to include those elements. It is still ongoing. Hopefully, we will hear more about these changes. There are a lot of things we can cover: technologies, quantum, or memory, all of those technologies. They are all there, and they are getting better, and they are certainly going to be game-changing elements as we move forward.

Thomas Sterling:I think, one very last statement as we are sitting here on this bench: a year from now, a lot more will be understood about the details of the directions and the plans, the priorities and the strategies. We will have a much more concrete trajectory to exascale. I think this will be a year of change, in which the international community as a whole embraces and progresses towards exascale, with some degree of uncertainty.

Primeur Magazine:Thank you very much for this 2016 interview. See you at the next ISC, in 2017.

Ad Emmen