18 Sep 2017 Frankfurt - For the eighth consecutive year, Primeur Magazine asks supercomputer experts Satoshi Matsuoka and Thomas Sterling how we are doing on the Road to Exascale. This year's interview took place at ISC'17 on June 21, 2017. What has not changed is the prediction of when we will enter the exascale era: somewhere in the early 2020s. However, there is now more clarity on what the systems will look like. And this year we saw the rise of Artificial Intelligence (AI) and machine learning to levels unseen before. But does that mean we can forget about HPC and traditional exascale computing? And what will come after exascale? Could neuromorphic computing and quantum computing be an answer? Or are we just talking about those because we do not really have an answer for the Post-Moore era yet? Let us hear what Thomas Sterling and Satoshi Matsuoka have to say about it. We publish this interview as a series of four articles.
Primeur Magazine: So another year on the Road to Exascale. We should probably discuss a little bit what happened during the past year. Thomas said something interesting during his keynote: now the whole world is committed to exascale.
Thomas Sterling: Indeed the whole world is committed to exascale. But not everyone is on the same schedule. And that is a good thing, not a bad thing.
Primeur Magazine: Shall we start with that?
Thomas Sterling: Let me start with the US. I think the biggest event in the US was the firm establishment of the Exascale Computing Project. This follows the direction set by the prior administration in 2015 with the Presidential Executive Order on the National Strategic Computing Initiative. The Exascale Computing Project (ECP) is guided and directed by the Department of Energy. Both sides of the department are involved: the Office of Science and the National Nuclear Security Administration (NNSA). It is a very risk-averse, very responsible project that has been defined under somewhat lower budgets than had been anticipated, but at least with the assumption that it would achieve its end goal shortly after 2020. It is based on the premise that we already know what the answer is, and therefore all that is required is to identify the technical gaps between where we think we are going to go and where we are right now, as defined primarily by the vendors. So, a substantial amount of the funding is going to the Path Forward Project for the vendors to decide how they want to spend it in order to reduce the risk of failing to accomplish that goal. The other side of that is the importance of end-applications and the means of developing them, so a strong part of the programme is dedicated to both of those. There is very little difference between the machines that are being deployed in the next two years and the ones that will be deployed in the 2020 - 2022 timeframe, except for, of course, scale, and perhaps, to some degree, balance, and ultimately incremental but important changes to the technology. So I would say, first and foremost, that is the big event in the US in High-Performance Computing and towards exascale.
Primeur Magazine: And what about the other countries?
Satoshi Matsuoka: We all know China has made quite a bit of progress in terms of their advances and visions towards exascale. It is very much sustained; they finally won the Gordon Bell prize on their own machine, announced at the SC16 conference. They actually had three finalists, and all of them demonstrated unprecedented performance in the tens of Petaflop/s, which had never been seen before. So it was only right that they won the Gordon Bell prize, and they have followed up on this with an increasing portfolio of applications, and then with efforts towards new machines. It has been announced that the second 100 Petaflop/s machine, the Tianhe-2A, which is the successor to the Tianhe-2, is going to be deployed sometime this year or early next year, using indigenous Chinese technology, since they have been prohibited from using Intel Xeon Phi (Knights Landing), which was their original plan. Actually, I think everybody agrees that this drove them to their goal more quickly. In addition, there are several other companies and centres in the running to reach exascale by 2020 at the earliest, or maybe 2021. Besides Sunway TaihuLight and the Tianhe-2A, there is a third project still in the running; three prototypes are to be presented and demonstrated before moving on towards exascale. Of course, the current official Chinese position is to pick one of them, but nothing stops them from actually running multiple projects if the money is there. Why throw away something that has been demonstrated if you have the money? But again, I think that China is not only progressing in hardware, although not as prominently as other places, but also their portfolio of actual applications running on these machines is steadily increasing. So, you can no longer accuse China of just building stunt machines. I think they are throwing large numbers of their younger generation into developing or porting applications to these machines for real usage.
Japan is largely on track with its Post-K. However, since the last ISC it was announced that this machine will be delayed by one or two years because semiconductor scaling is slowing down, and as a result the anticipated performance could not be reached with the original plan. Fujitsu and RIKEN had to reorganize with a new plan that targets deployment in the 2021 - 2022 timeframe. They are adding new features: it was announced in August that it will use an ARM processor with the SVE vector instruction set extensions. This was achieved by working closely with ARM, and it is an official extension, so it could also be adopted by other ARM companies like Cavium. Recently they announced they will also improve support for machine learning workflows by incorporating FP16 short precision arithmetic into their instruction set.
Also there are new machines installed in Japan: the University of Tokyo deployed their Oakforest-PACS, which is a 25 Petaflop/s machine. We, at Tokyo Tech, will deploy our TSUBAME-3 next month, which will be a 12.1 Petaflop/s machine with an AI performance of almost 50 Petaflop/s in short precision arithmetic. Then there are a number of other centres that also have new machines. Although not as aggressive as China, there is a steady increase in multi-petascale machines. There is also a trend of shifting the emphasis of high-end computing towards AI in Japan. We will cover that later.
In Europe, they announced the intention to build exascale machines with European technology, with the European Commission being very proactive in promoting this new direction. The details are not disclosed yet, so we will see next year what will happen with these European efforts. There are lots of research projects in Europe, but none of them is really, I would say, concrete enough by itself to be able to build these large-scale machines in production, but I think finally Europe is stepping up to this game. However, compared to other countries, it does not have industrial backing to the same extent. We will see what they will do here - China has stepped up to this game very quickly, maybe Europe can as well.
Primeur Magazine: Are those the countries/regions with exascale efforts?
Satoshi Matsuoka: I do not think there are other countries with exascale efforts.
Thomas Sterling: To be clear, in the general purpose sense that is true, but we also have to recognise there are specific needs, such as the Square Kilometre Array, which, when it is up to full capacity, will have an enormous amount of raw data input, and they have to build a domain-specific distributed set of computing resources. It is quite possible that, when completed, the total aggregate processing will be in the trans-exaflop/s regime, with antennas in the long wavelength area and medium wavelength area in Australia and South Africa.
Satoshi Matsuoka: So, if you mean double precision exaflop/s in a general purpose machine, then that is the trajectory we outlined, but when you consider more domain-specific machines, even the American DoE Summit and Sierra supercomputers, it can be different. These two machines, because of the NVIDIA Volta, will have significant acceleration in reduced precision FP16 arithmetic, with what they call their Tensor Cores, which are in reality 4-by-4 FP16 single cycle matrix engines. The peak FP16 performance of a Volta chip is 120 Tflop/s. So, the performance of Summit and Sierra, which will deploy these chips in the tens of thousands, may be somewhere around 130-200 Petaflop/s in double precision arithmetic, but in terms of their FP16 AI flop/s they will be at 2-3 exaflop/s. So, that proves, as Thomas remarked, that although the world had been fixated on double precision arithmetic being general purpose, in reality people are building machines that are a little bit more domain-specific, and in that sense we will already reach exascale by next year.
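The arithmetic behind these figures can be checked with a rough sketch. Only the 120 Tflop/s FP16 peak per Volta chip comes from the interview; the GPU counts of 20,000 and 25,000 are hypothetical stand-ins for "tens of thousands", and the 7.5 Tflop/s double precision peak per chip is an assumption added here for comparison, not a number given by the speakers.

```python
# Back-of-the-envelope aggregate peak of a GPU-accelerated system.
# This is an illustrative sketch, not a description of Summit or Sierra.

FP16_TFLOPS_PER_GPU = 120.0  # Volta Tensor Core peak, as quoted in the interview
FP64_TFLOPS_PER_GPU = 7.5    # ASSUMPTION: approximate Volta FP64 peak, not from the interview


def aggregate_peak_pflops(gpu_count: int, tflops_per_gpu: float) -> float:
    """Return the summed peak in Petaflop/s (1 Petaflop/s = 1000 Teraflop/s)."""
    return gpu_count * tflops_per_gpu / 1000.0


if __name__ == "__main__":
    # HYPOTHETICAL GPU counts standing in for "tens of thousands".
    for gpu_count in (20_000, 25_000):
        fp16_pflops = aggregate_peak_pflops(gpu_count, FP16_TFLOPS_PER_GPU)
        fp64_pflops = aggregate_peak_pflops(gpu_count, FP64_TFLOPS_PER_GPU)
        print(
            f"{gpu_count} GPUs: "
            f"FP16 peak {fp16_pflops / 1000.0:.2f} Exaflop/s, "
            f"FP64 peak {fp64_pflops:.1f} Petaflop/s"
        )
```

Under these assumptions the FP16 totals land in the 2-3 exaflop/s range and the double precision totals in the 130-200 Petaflop/s range quoted above. These are peak numbers only; sustained application performance would be lower.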
Thomas Sterling: I think there is this interesting, persistent mindset that says that ARM has a major role to play. And I am part of the problem: I believe that ARM has a major role to play, and yet, as the years go by, they have an enormous market of course, but in terms of the conventional HPC market it has yet to manifest itself. The rumours, and I know no more than rumours, for the European approach also suggest that ARM will be the processor of choice, and yet, it is not. So, that is another thread we need to watch. The one other thing I want to say, and this crosses national boundaries, is that the realisation of the memory bottleneck is certainly not new in anyone's mind, but it has been taken much more seriously by both NVIDIA and Intel, for example. Much of the design innovation from these companies is going into buffering and moving memory, at least for bandwidth, if not for latency. There is some shift, Satoshi has suggested this, some shift in the emphasis of the evolution of the elements that are contributing to that. There is a resurgence of interest in secondary storage and how it interfaces with primary memory, especially with the different workflows related to non-floating point problems that Satoshi referred to.
Let me make one final statement, which may be a segue into the next topic. In the US, floating point problems are still very, very important: simulation, if not the fastest growing area of interest, is a growing area of interest, not only in hardware development but also in the software libraries. So everybody is concerned with the memory wall.
Read also the second part of the interview: The rise of Machine Intelligence.