2011 - Another year on the road to Exascale

23 Jun 2011 Hamburg - On the last day of ISC'11 in Hamburg, we talked again to Satoshi Matsuoka and Thomas Sterling about where we are on the road to Exascale. We hope to talk to them each year; we started in 2010, so we can follow the progress the supercomputing community has made during the past year.

Primeur magazine:  Last year you two were very positive: there was an Exascale community; it was very lively; everybody was working together. I think that is where we stopped last year. So what has happened since then?

Satoshi Matsuoka:  Certainly this impetus has continued. In fact, many countries and regions, as well as companies, are now embracing Exascale, while in the past they did not. The application areas are really starting to think seriously about the changes they would need to put in place in order for their applications to scale. Projects such as the European Exascale FP7 programme are emerging: there was a call and three new projects were accepted. In Japan, projects and committees are being established to do research on post-Petascale computing and then on to Exascale. The whole community is embracing Exascale.

Thomas Sterling:  Let me add a couple of things to support Satoshi's comments. Satoshi is one of the leaders of the international Exascale software project, and this group is continuing to meet; the most recent meeting was in San Francisco, I believe.

Satoshi Matsuoka:  It was supposed to be in Kyoto.

Thomas Sterling:  It was supposed to be in Kyoto, but something happened there. The next one is in Cologne in October. This is about establishing a road map that allows the international community, to the degree that they wish, to coordinate and cooperate, to join forces and resources, and to avoid unnecessary competition and duplication.

Since we last met, DARPA started the UHPC programme - the Ubiquitous High Performance Computing programme - and inaugurated four different teams: the Excalibur team, led by Sandia National Laboratories; a team led by Intel; the Echelon team, led by NVIDIA and closely cooperating with Cray; and a fourth team, led by MIT. I might add that I am involved in the Excalibur team. Two of the teams - the Excalibur and Intel-led teams - are already directly collaborating on the execution model question.

DARPA also began, but then quickly terminated, the OHPC programme. A caveat: UHPC is not explicitly Exascale - it is very careful not to be Exascale - but it is intended to develop technologies that are suitable for Exascale, with one Petaflop/s sustained performance in a rack. In addition, Satoshi also mentioned EESI in Europe, and the Department of Energy in the US has launched a couple of initiatives as well, including the X-Stack project to develop Exascale software, with a second round of that to come. So a great deal has happened, including a substantial amount of work in the last year.

Satoshi Matsuoka:  Let me supplement that. The DOE has the co-design centres. Europe, in addition to EESI, has the Exascale call, which is in the final negotiation stage. Three groups have been awarded funding to develop mid-tier technologies leading up to Exascale; the goal is to have something substantive in the 2014 timeframe. In Japan there is a software development project for 100 Petaflop/s and beyond, called post-Petascale. We now have five groups working and will accept several more. In addition, there is a committee that is looking at exploiting the current K system - not K as a machine, but K as the centerpiece of the high performance computing infrastructure, in which five key application areas are explicitly involved. These are five representative labs, and all of these labs are now being asked to come up with Exascale plans: what science breakthroughs they could achieve if they had Exascale machines. So all this is similar to what the DOE is doing in the US and to what is happening in Europe. In China, although the details are not known yet, they are expected to come forward with their 100 Petaflop/s and Exascale plans. The details are not out yet, but they are certainly considering Exascale.

Thomas Sterling:  But the commitment to the Exascale direction is clear. In summary, I would say that both the concrete will has been formally established and some initial research activities are under way. In the programming area, for example, and represented here at ISC, we find several approaches: certainly the work out of Tennessee and out of Barcelona, which is exploring directed acyclic task graphs as representations, and my own work on the ParalleX project to expose parallelism. For all of these the challenge is to expose more parallelism and to hide latencies, while reducing overheads and constraining synchronisation. All of this work produces real results on real machines today.

Primeur magazine:  You were talking about the session here at ISC’11 with Jesús Labarta from Barcelona, and Jack Dongarra, and you, Thomas Sterling?

Thomas Sterling:  And also Robert Harrison. So these represent real successes already, moving beyond the conventional practices in programming, towards higher level parallelism, including ultimately Exascale.

Primeur magazine:  How far are these efforts? Is this just theory?

Thomas Sterling:  No, no, these are working software systems. Ultimately they have to be augmented with hardware architecture changes, but these are working things. My group alone has been able to deliver a factor of 2 to 3 performance advantage on adaptive mesh refinement, just by using these new concepts. And you saw factors of up to two from Jack and from Jesús as well.

Primeur magazine:  That is on current hardware?

Thomas Sterling:  Even on current hardware, and that does not even have advanced architecture changes.

Primeur magazine:  And if the architecture changes, then...

Thomas Sterling:  You see reduced overhead, with more parallelism. Remember, we need 10^4 times more parallelism in order to provide the concurrency and hide the latency, so I am afraid we are now working on the first two orders of magnitude.

Satoshi Matsuoka:  Having said that, there are tremendous challenges in Exascale programming alone. As Thomas mentioned, how do we find the parallelism? People estimate we need 100 million-way, a billion-way, perhaps even more parallelism to achieve Exascale.

Thomas Sterling:  I believe the answer is more like 10 billion, including for hiding the latencies.
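
To put these numbers in perspective, a rough back-of-the-envelope estimate (ours, not the speakers') goes as follows, assuming cores clocked at about 1 GHz that each retire a few floating-point operations per cycle:

    10^18 flop/s  /  (10^9 cycles/s  ×  ~4 flop/cycle per core)            ≈  2.5 × 10^8 cores
    ×  a factor of 10 to 100 of extra in-flight work per core to hide latency  →  10^9 to 10^10 concurrent operations

That is how one arrives at figures ranging from a few hundred million to roughly ten billion-way parallelism.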

Satoshi Matsuoka:  As Thomas mentioned, hiding the latencies will be required for two reasons. One is, of course, that the machines will have a much deeper hierarchy, so there will be great variance in the latency. Moreover, as Moore's Law pushes the scale up, the strong-scaling component of the applications will start to manifest itself well beyond what it does now. So to deal with these scalability challenges, we have to either shorten the latency or hide the latency. But hiding the latency requires more parallelism.

And, of course, there is the power problem. We think we know how to build an Exascale machine despite these problems, but the biggest problem, of course, is power. There are various ways we can tackle it. One of them is to come up with new programming models that conserve power by not moving data around too much; but that also entails moving computation around instead, and moving computation to the data requires that the system be highly asynchronous.

So all these efforts, as Thomas mentioned - and there are other efforts under way - are really moving us away from the traditional SPMD style of programming. That is kind of ironic, as Valiant just received a Turing Award for SPMD-style computing, which is the basis for many applications today. But what Thomas is doing, what Jack is doing, what some of the people in my group are doing - we are all embracing highly asynchronous, MIMD-like models; actually, a kind of rebirth of data-flow.

Thomas Sterling:  Rebirth and recasting.

Satoshi Matsuoka:  Yes, but in a very different context. The ideas we deal with are highly asynchronous data and control, computation driven by data dependencies, and very low synchronisation overhead. These are some of the underlying principles of these new types of programming models. But the question is: how do you let users use them? Because dealing with asynchrony is fundamentally hard.
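
As an illustration of what "computation driven by data dependencies" can look like in practice, here is a minimal sketch in plain C++ (ours; it is not XPI, ParalleX, or any of the systems the speakers work on). Each consumer blocks only on the specific results it needs, rather than on a global barrier:

    // Minimal sketch of dependency-driven, asynchronous execution: tasks start
    // as soon as they are launched, and the consumer waits only for the
    // individual results it actually depends on - there is no global barrier
    // as in SPMD-style codes.
    #include <future>
    #include <iostream>
    #include <numeric>
    #include <vector>

    // A producer task: computes a partial result from its own block of data.
    static double partial_sum(const std::vector<double>& block) {
        return std::accumulate(block.begin(), block.end(), 0.0);
    }

    int main() {
        std::vector<std::vector<double>> blocks(8, std::vector<double>(1000, 1.0));

        // Launch the producers asynchronously; no collective synchronisation point.
        std::vector<std::future<double>> partials;
        for (const auto& b : blocks)
            partials.push_back(std::async(std::launch::async, partial_sum, std::cref(b)));

        // The consumer is driven purely by its data dependencies: it blocks on
        // each result only when that result is actually needed.
        double total = 0.0;
        for (auto& f : partials)
            total += f.get();

        std::cout << "total = " << total << "\n";  // expected: 8000
        return 0;
    }

Real task-based runtimes go much further, distributing such tasks across nodes and tracking the dependencies between them automatically, but the basic shift away from bulk-synchronous phases is the same.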

Thomas Sterling:  For example, we are developing a simple low-level API, called XPI, which manifests and exposes, through semantic constructs and corresponding syntax, the kinds of mechanisms Satoshi just described. It is not intended for all programmers of the future, but rather to provide an easy path to experimentation and a target for higher-level language compilation, and it will inform the ultimate programming models that will come into use. I expect it to be a throw-away - but then, I thought Beowulf would be a throw-away. Satoshi is right, though: most of the community now thinks the answer is going to be MPI + X, where X is unknown. A lot of people think it is OpenMP; a lot of other people think it is CUDA. Our group takes a different view. We do not think it will be MPI + X; we think it is X, and that MPI will be targeted to X, to allow legacy codes to advance and to provide an incremental path for MPI programmes. But this is hotly debated, and, by the way, mine is the minority view - for consistency and fairness, most of the community disagrees with me - but I believe that is likely to be the long-term outcome.

Primeur magazine:  But could not the community say it is MPI + X, because they do not want to throw away everything?

Satoshi Matsuoka:  Whatever programming model wins out - and of course it may not even be a single one, just as today people program in MPI while others use OpenMP, CUDA, or automatic parallelisation tools - it will be fairly complex because of the asynchrony involved, given this zillion-way parallelism. Given the myriad of executions that will be happening concurrently, these are just too difficult for people to handle, at least without higher levels of abstraction. So Rob Harrison, and some other groups, like Berkeley, are really looking into providing much higher levels of abstraction, for example in the form of domain-specific languages. That is one way. You could of course have high-level constructs in programming languages; that is another way. People are experimenting with various approaches, but whatever it is, there is agreement that even if it ends up being MPI + X, X has to be quite a high level of abstraction, and that is one of the challenges for the parallel HPC community: how do we come up with the right level of abstraction that makes it easier for programmers to program at this massive scale of parallelism? That has been an ongoing challenge for years.

Thomas Sterling:  But now we are forced. Before, it was a choice. Now it is not a choice, it is a problem. If there is a consensus, then it is that something needs to change - whether that is an incremental change or a revolutionary change, something needs to change. That in itself is a big step for our community.

Satoshi Matsuoka:  That is true, because right now the system level is about a million-way parallel and there is skepticism that we can go much beyond that. A machine like K, which now has the highest number of processor cores, has about half a million, and that will go up to seven hundred thousand. It is no longer pragmatic to program that machine in flat MPI. So Fujitsu strongly discourages this and almost forces people to use a hybrid of either OpenMP or automatic parallel vectorisation plus MPI. So the degree to which people use MPI, the number of ranks, has been significantly reduced. We are hitting the wall both for programme building and for system implementation reasons. It is very hard to maintain a million MPI ranks; it becomes quite unfeasible from the system's point of view. So as we progress we have no other choice than to increase the level of parallelism and thus to become more heterogeneous, much more deeply hierarchical, and asynchronous. We need higher levels of abstraction.
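
For readers unfamiliar with the hybrid style described here, the sketch below shows the general pattern in C++ (our illustration, not Fujitsu's actual toolchain): MPI ranks communicate between nodes, while OpenMP threads fill the cores within each node, so the number of MPI ranks stays far below the total core count.

    // Hybrid MPI + OpenMP sketch: one MPI rank per node (or per socket),
    // OpenMP threads for the cores inside it. Compile with e.g.
    //   mpicxx -fopenmp hybrid.cpp
    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, nranks = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        // Each rank parallelises its local work across the cores of its node.
        double local = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < 1000000; ++i)
            local += 1.0 / (1.0 + i);

        // One reduction contribution per rank, not per core, keeps the number
        // of MPI ranks - and MPI's internal state - manageable.
        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            std::printf("ranks=%d, threads per rank=%d, sum=%f\n",
                        nranks, omp_get_max_threads(), global);

        MPI_Finalize();
        return 0;
    }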

Thomas Sterling:  And for completeness: we have not discussed I/O and mass storage. A large part of science is data intensive, and how do you engage what have historically been completely separate systems, with completely separate name spaces? The file I/O name space is different from that of the processing system, with orders of magnitude different latencies and overheads; and now, with all of these asynchronies mixed in with the computation, we have a much more challenging problem. Not just overhead, but also contention for these shared resources - and we cannot forget that, which we always do. And all of this has to be resolved in the next seven years.

Primeur magazine:  Why seven?

Thomas Sterling:  The TOP500 list extrapolations suggest that in seven or eight years we will have the first Linpack Rmax Exaflop/s supercomputer. I am with Satoshi on this, but that extrapolation has been a driver, putting the stake in the ground before the end of this decade, because that would be consistent with the trend.

We will not do it again, by the way. We will not get to Zettaflop/s. You can quote me on that. I am in the minority: a minority of one out of six billion. But slowly people will come to understand this. The Boltzmann constant, the speed of light, and atomic granularities will screw us. Exactly where, I am not sure - I think it will be around 32 or 64 Exaflop/s. I think you can report this to your readers.

I would ask Satoshi to confirm or contradict this: Exaflop/s is now a very serious international activity. It is no longer a casual academic activity. There are real commitments of real resources, and road maps. Challenges have been quantified and various alternative approaches have been considered. And I think that by this time next year Exaflop/s will be as serious a business as anything else. Is that a fair statement?

Satoshi Matsuoka:  Yes, it is. For example, at the SC conference in Seattle in November, there are accepted research papers that deal with some of these technical issues - not as conceptual items, but with concrete results aiming to solve some of these problems. And again, vendors that have not been open about whether they are committing to Exascale are now in some ways joining the bandwagon. It is no longer just IBM or Fujitsu or Cray that have big machines. Even other companies that you might have thought were smaller players in the market, or that have smaller systems in the market, are now very much committed to Exascale. This will become very evident over the next year.

Thomas Sterling:  The three problems that have to be solved for practical Exascale are, in my view, first parallelism, second power, and third programmability. And the fourth of those three is reliability. The way we address reliability will be a challenge. If we can address those four problems, then we can have an active Exascale community and industry starting in the next decade.

Primeur magazine:  Over the past year, has it become clearer what the achievable power consumption could be?

Thomas Sterling:  Current estimates for an optimistic system are 400 MWatt by the end of this decade. The pain threshold is considered to be between 20 and 25 MWatt. The Fujitsu K machine, I believe, is 10 MWatt for about 10 Petaflop/s - a beautiful machine, by the way. We cannot continue on that particular power envelope trajectory.
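
Putting those numbers side by side (our arithmetic, not the speakers'): K delivers roughly 10 Petaflop/s for roughly 10 MWatt, that is, about 1 Gigaflop/s per Watt.

    1 Exaflop/s at ~1 Gflop/s per Watt      ≈  1,000 MWatt (a Gigawatt)
    1 Exaflop/s within a 20 MWatt budget    →  ~50 Gflop/s per Watt, roughly a fifty-fold efficiency improvement

That is the gap the community has to close by the end of the decade.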

Satoshi Matsuoka:  So that would entail a dramatic change in the architecture and, in some cases, in device technology. That is why machine design affects the software and all the other issues. Having said that, some of the projections I have seen recently are getting closer to the target - but again, assuming great changes in both the architecture and the devices, and also in the parameters. For example, people are compromising on the memory footprint in order to conserve power; there is better cooling technology, or even recycling of energy. Combining all of those, we are starting to see realistic numbers on the horizon. The 20 MWatt that the DOE set as a target is still aggressive, but it is not as hopeless as we might have conjectured a year ago. It is starting to look like maybe we end up with a system using 40 MWatt instead of 20 - but it could be something like that.

Thomas Sterling:  One obvious issue is that our community is constantly engaged in looking under the lamp post. We constantly talk about the applications that we have been able to run. We have not talked - and I would love to see a session here at ISC'11 on this - about applications that will not scale. I would love to talk about strong-scaling problems that we have simply lost because our methods do not expand; many problems in computation simply do not scale. We do not talk about that in our field, and what we have to be very careful about is that Exascale does not prune the set of possible computational challenges and ultimately become a special-purpose machine. Many of those estimates do not examine the data movement carefully enough, just as Satoshi mentioned before. They do not examine the actual costs in energy. That 400 MWatt figure I cited is from Peter Kogge, who actually said the original analysis was done too quickly. The original analysis gave 70 MWatt, and for the same machine, after a detailed analysis of actual application data movement, they found it was more like five times as much.

Satoshi Matsuoka:  That is a kind of recurring theme: we need changes in the architecture, we need changes in the algorithms, and we need very detailed studies and tests of what changes are required to deal with strong scaling and power - which basically entails asynchrony, for which we need more parallelism and more abstraction. All of these are intertwined. They are not problems to be tackled individually. Most of these problems are intertwined because, again, Exascale is a much more aggressive target than when we went from Tera to Peta. It is a very challenging target this time. The parameters are very tight, and they are very much intertwined. But that is also why the industry is getting so excited: it is an issue that, if solved, will lead to the next plateau of computing.

Primeur magazine:  And then for a number of years, we can improve the algorithms and do real scientific work, instead of looking at the next big computer?

Thomas Sterling:  Let us hope we continue to do some scientific work, even when exploring the Exascale regime.

Primeur magazine:  We could also say: let us just stop here, 500,000 cores is enough. Let us go off and do science instead of searching for Exascale.

Thomas Sterling:  I do not think there are many people who would predict that. If we stagnate, if we go stale on the increase, entire economies will collapse. The economies are driven by constant change and advancement. So no, we cannot say that. But there are many parts of the community for which that has happened by default, because we have not provided them with a means of continued scaling. We are already finding strong-scaling problems and moving them to an order of magnitude better performance by applying new methods. That should be an indicator of the trends for the future.

Satoshi Matsuoka:  Many studies under way indicate that the application scientists do need the power. There are challenging problems; in Japan, for example, the scientific labs are asked: what are the scientific breakthroughs that will be enabled by Exascale? And then people even come up with Zettascale stories.

Thomas Sterling:  Climate research: full climate modelling at kilometer resolution. Some controlled fusion simulations for ITER and other approaches - that is clearly Exascale. And, although we do not know how to do it yet, some very complex biological simulations.

Satoshi Matsuoka:  Organ simulations, for instance.

Thomas Sterling:  Exactly. These are Exascale problems and they are all critical to the human endeavour.

Satoshi Matsuoka:  Many of the energy related problems are also Exascale.

Primeur magazine:  Whether a machine runs at 20 MWatt or at 400 MWatt does not really matter anymore once you have solved the energy problem with it. Who cares?

Satoshi Matsuoka:  To some extent.

Satoshi Matsuoka:  It is good to set a target and then blow it by a factor of two.

Thomas Sterling:  Not every computer centre can build a nuclear reactor. And if you extrapolate today's power consumption, that is exactly what you would have to do: a nuclear reactor generates a GWatt, and that is what you would need for an Exaflop/s machine built with today's technology.

Satoshi Matsuoka:  Ultimately, of course, there are arguments like that - why should you care precisely whether a machine is using 5 MWatt - and maybe that is right. But now we are aiming at Exascale machines and even at smaller portions, like a Petascale machine in a rack. With the hierarchy of the infrastructure it is really becoming evident that they are necessary for solving parts of these very important societal problems, for the well-being of human beings. Let me also mention disasters, like the disaster we had in Japan. If we had the capabilities, we could build early warning systems - very deep, detailed early warning systems - that would allow us to do very precise "what if" analyses for some of these disasters. So of course health, energy, environment: these are all very important issues. Not to mention all the industry-related applications that will form new industries.

Exascale is a driver - in itself an IT driver, which is why we are designing machines that are very efficient and scale very well - but it is also a driver for improving human society, for sustaining the world. If you look at it country by country, it is one of the key aspects of sustaining national strength on a global scale, and of sustaining humanity. That may sound very grand and old school, but that is what we as a community believe in. That is what we have to do.

Primeur magazine:  Just to end with some practical points on the hardware: what were the most important advances during the past year? People are now talking about GPUs being included in CPUs and the other way around.

Thomas Sterling:  I believe GPUs as we currently know them are a transitional technology in architecture and will not last very long. Again, this is a controversial statement. Heterogeneous computing today means GPUs alongside processors, handling rigid, data-intensive, regular data flows. People tend to forget - may I just expand on this? Initially we had complete machines built just around vectors; now we have vectors and pipelining all of the time. There were other machines that were completely SIMD; now we have SIMD extensions built into our processors. So this is another case where we are once again exploiting a particular flow of data, and these capabilities will in time migrate back into the regular socket. So I think we are in a transitional period. I would say that the hype has been more important than the actual results. Remember that of the TOP500 machines only a few, including Satoshi's, use GPUs. Out of 500 machines, only 17 incorporate GPUs. That is just one data point, and maybe the trends will prove me wrong, but we really should note it.

The technologies I think are most important are, first, the 3-D stacking of chip dies, which is going to greatly reduce internal latencies and increase bandwidth at low energy. The second is the use of 3-D transistors, and the third is the use of materials like hafnium, which greatly reduce the leakage current and will give us a viable path to at least 22 nanometer, which we are already doing, and quite possibly down to 10 nanometer. I am excited about, but not predicting the results of, advances in graphene technology. There are also continued advances in optical data movement, even on the chips, being explored. These may seem like details to you, but they are the kind of underlying details with potential near-term impact - potential meaning within this decade.

Satoshi Matsuoka:  I would add one more key technology: next-generation memory technologies, especially non-volatile ones, which will help to greatly reduce power consumption and also help with the reliability of the machine, because, for example, they allow nearly instantaneous checkpoints. The key point is that they are already starting to be incorporated into machines. For example, our machine TSUBAME-2 already has a great number of SSDs, on every node, and these are being used to develop very advanced checkpointing algorithms that have very low overhead compared to the traditional way of checkpointing on parallel file systems.
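
To illustrate the basic idea of node-local checkpointing (a simplified sketch of ours; TSUBAME-2's actual algorithms are considerably more advanced), the snippet below periodically writes the application state to a file on the node's local SSD instead of to a shared parallel file system. The mount point and state layout are hypothetical:

    // Writing a checkpoint to node-local flash is far cheaper than writing it
    // to a shared parallel file system, so checkpoints can be taken much more
    // frequently and with much lower overhead.
    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <vector>

    // Hypothetical application state: a step counter plus a flat array of doubles.
    struct State {
        std::uint64_t step = 0;
        std::vector<double> field;
    };

    // Serialise the state to a binary file on the local SSD.
    bool checkpoint_to_local_ssd(const State& s, const std::string& path) {
        std::ofstream out(path, std::ios::binary | std::ios::trunc);
        if (!out) return false;
        const std::uint64_t n = s.field.size();
        out.write(reinterpret_cast<const char*>(&s.step), sizeof(s.step));
        out.write(reinterpret_cast<const char*>(&n), sizeof(n));
        out.write(reinterpret_cast<const char*>(s.field.data()),
                  static_cast<std::streamsize>(n * sizeof(double)));
        return static_cast<bool>(out);
    }

    int main() {
        State s;
        s.field.assign(1 << 20, 1.0);                            // ~8 MB of state
        for (s.step = 1; s.step <= 10; ++s.step) {
            // ... advance the simulation by one time step here ...
            checkpoint_to_local_ssd(s, "/local_ssd/ckpt.bin");   // hypothetical mount point
        }
        return 0;
    }

A production scheme would additionally replicate or erasure-encode the local checkpoints across neighbouring nodes, so that the state of a failed node can still be recovered.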

Now about this GPU debate. I think a general statement would be, again, that machines will be heterogeneous; they will be a combination. The physics, and the way we compute, dictate that we will have very massive many-core parts driving the parallelism, but the strong-scaling component requires a small number of large scalar cores. These combinations, tightly coupled to keep the latency down, will become the building block of future architectures. Even today, GPUs are no longer the GPUs of yesterday; they look very much like normal processors, so the whole GPU naming is largely marketing. In fact, NVIDIA, for example, claims that GPUs will be the CPU - and if the definition of CPU is "what performs the bulk of the computing", then yes. Intel also has something similar. But again, that type of technology will be part of a heterogeneous node. And there will be other players in the market, like AMD, that will build heterogeneous processors combining many-core parts with what we call latency cores. This will be standard, and how we go about using those will be an issue - but tighter coupling, no question.

So people these days are advocating that we should not refer to GPUs as accelerators, because they are an integral part of the many-core processors that are an integral part of the architecture of the future. It is not so much about acceleration as about differentiating what is done on the different parts. Many-cores are here to stay; whether we call them GPUs, that is marketing.

Thomas Sterling:  I disagree a little bit with my colleague on where the convergence architecture will end up, although most of what he said is true. I think the CPU, in the conventional notion of a CPU, is dead or will be dead in the future. The CPU is a generalised machine that does nothing perfectly well. The CPU will be replaced by a combination of what we now call GPUs - which are really stream processing or throughput processing, and I agree they will be a centrepiece - augmented with probably an equal amount of embedded memory processors. Then you have two pieces: one optimised for throughput computing, the other optimised for memory-intensive, latency-sensitive computing - and for some applications, by the way, that will be the bulk, the majority of the computing, not the numerical computing. That is my view. There will always be a few of those old CPUs, because someone will have an x86 application they need to run.

Primeur magazine:  Thanks for sharing your Exascale thoughts with us. I hope we can come back next year.

Also read the 2010 interview on the Road to Exascale.

Ad Emmen