Another year on the road to Exascale

3 Jun 2010 Hamburg. - On the last day of ISC10 in Hamburg, we talk to Satoshi Matsuoka and Thomas Sterling about the state of HPC and where we are on the road to Exascale. We hope to talk to them each year, so we can track the progress the community has made during the past year.

Primeur magazine:The Exascale saga reminds me of the "Fifth Generation Computer Systems" project.

Satoshi Matsuoka:The Fifth Generation - That is an old concept!

Primeur magazine:Indeed, that is old, but basically what they did at that time was set a design goal some ten years in the future; a design goal that was too unrealistic to meet, but along the way, while trying to reach it, some nice things came out of the process. Is the Exascale Initiative similar? You set a goal some ten years from now - reaching Exaflop/s - and you have no idea whether you will get there, or how you will get there. In the end the Fifth Generation also did not deliver.

Satoshi Matsuoka:I think the situation is slightly different. For the Fifth Generation the target was something that was really qualitatively different. What they did was misestimate the goal. It was not something that could be reached in ten years - and that was obvious to see. So they missed the goal. It was more like a fifty-year project.

Exascale is a slightly easier target, although it will bring qualitative changes. But the metrics by which we get there are less of a challenge. We know the metrics and we have some ideas how to meet them. I would typically compare it to the Apollo project. To get a rocket to the Moon is not so different from just launching a space vehicle. If you look at the technology, there was not much of a difference. You had to build a bigger rocket to reach a higher velocity. You had to sustain the astronauts longer in the capsule. You had to develop a longer-range communication system. These were quantitative challenges you had to meet in order to arrive at the Moon, along with some engineering innovations. Exascale is basically similar. I think we will get there; the challenges are there, and certainly things will change along the way as a result of trying to get there.

Thomas Sterling:Let me add to what Satoshi said. First, let me take the metaphor and expand it to the history of aviation. Remember, the first computers were supercomputers; conventional computers spun off from supercomputers, not the other way around. The difference in the metaphor is that when we went to the Moon in the 1960s, we used the best technology we had. Curiously, in computing we do not do that. In computing it is as if we are trying to get to the Moon in a DC-3. We are using the same basic architecture, the same languages, the same operating system model that we were using 35 years ago. It is very telling that our most rapidly moving technologies are ultimately conservative.

So let me use that analogy to answer your question about whether Exascale is like the Fifth Generation computer project. At that time the Japanese articulated a challenge, and a strategy addressing that challenge. All of that still confronts us, thirty years later, and I do not believe that challenge has really been embraced by the international community yet. I agree with Satoshi that our Exascale goal is somewhat easier, even with the challenges facing us. The irony, however, is that I believe it will be the Exascale machine that enables the goals of the Fifth Generation computer project to be achieved. One of the major problems is that what they simply did not have in the eighties were the memory technologies, the bandwidth, the throughput, the computing models. They were on completely the wrong playing field. They could not have gotten it right.

Primeur magazine:Of course, system designers build computers for fun, but also, and mainly, for applications to run on. And we do not really know which kinds of applications will really use the Exascale systems. So should we not just tell the system designers: build the best machine you can; we will tell you a little bit about what we think, but it is your job to come up with the best machine? Or is this co-design a good idea?

Satoshi Matsuoka:Firstly, I think it is a misconception to believe that when you have an Exascale machine, or a Petascale machine, or whatever, you need to achieve near-peak performance to consider the machine a success. Applications are typically bound by different constraints in the system, and some are not flop/s bound; some of the constraints are memory bandwidth or I/O bandwidth, and some are productivity constraints. So when you build a machine to increase performance - and obviously you have to build something - you do not just increase one particular part, unless you build a specialized system, like a Grid with enormous compute power but no bandwidth whatsoever.

When you try to build a machine that is slightly more general, then you basically have a set of applications that run well given those restrictions. That is why co-design is important. It is very hard to build a machine that serves one particular community, unless that community can show that its problem alone will probably affect the global economy in a positive way. So you have to have some generality that serves a greater community, not just a single one. But you do try to co-design, in that you take a selected set of applications and then try to optimize the machine so that each application works well. That does not mean they all run at peak performance - perhaps at only 1/20 of peak - but that is fine, because peak is not the single metric. Say we design the machine with 100 times more memory bandwidth than the previous generation when going to Exascale. That may run a computational weather code 100 times faster, and that is good. Whereas if you had not done the co-design, it might have been only 1/10th of that, because the design would have been all Linpack based, and then you would not have achieved your goal.

So again, it is important to look at what really are the governing parameters of the applications that you think are important to run, and then design a system that does not compromise on those parameters. Look at previous machines: where are the bottlenecks? And look at the scalability you get for the investment.
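
To make this point concrete, here is a minimal, purely illustrative roofline-style estimate of the bandwidth-versus-peak argument. All numbers are assumptions chosen for illustration, not figures from the interview.

    # Roofline-style estimate: attainable performance is capped either by the
    # floating-point peak or by memory bandwidth times arithmetic intensity.
    # All numbers below are illustrative assumptions.

    def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
        return min(peak_gflops, bandwidth_gbs * flops_per_byte)

    peak = 10000.0                   # assumed peak of a node, GFlop/s
    bw_old, bw_new = 50.0, 5000.0    # assumed memory bandwidth, GB/s (a 100x jump)
    intensity = 0.25                 # assumed flop/byte of a stencil-like weather kernel

    old = attainable_gflops(peak, bw_old, intensity)   # 12.5 GFlop/s
    new = attainable_gflops(peak, bw_new, intensity)   # 1250 GFlop/s
    print(new / old, new / peak)     # ~100x faster, yet still only 1/8 of peak

The point is the one Matsuoka makes: a code can stay far from peak and still speed up by roughly the factor by which its real bottleneck, memory bandwidth, improves.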

Thomas Sterling:Again, I fully agree with Satoshi, but let me add a couple of points. First, the notion of co-design is new to the Exascale community, and it is a very young community: it has existed for only a couple of years. Co-design is used in two different ways. The first is to co-design the system with respect to the applications. The second, which is also important, is to return to a clean sheet of paper and take the elements of the system stack - the architecture, core architecture, system architecture, runtime system, compilers, libraries, programming languages and tools - and co-design them so that each is designed in the context of the others, so that there is an intentional set of roles and relationships optimized around the different metrics that Satoshi mentioned. Both of these are important. They require formalisms, and different methods of going about the design process.

Now with respect to applications: I would like to continue an earlier comment that seems appropriate to your question. Historically, over the past N decades, we have focused on the vector: structured vector manipulation, the use of those kinds of spatial and temporal locality and those classes of numerical operations. The scale of machine we are moving towards now also permits a completely alien set of structures, more specifically dynamic graph structures, which address a completely different set of problems, although there is overlap in the areas of adaptive mesh refinement and some n-body problems as well. These graph problems, however, can also be used for knowledge processing. Knowledge processing is different from data processing, because we are manipulating symbols. These graph forms have been understood, even in this form, for decades, but have received relatively little attention. Even the graphs we are primarily talking about today, the ones that Google and the others use for instance, are large key searches. They do not have semantics. These are important in part for national security, but also for many other things. Ultimately, the goal is machine intelligence, which brings us back to the Fifth Generation computing project. This is based on the ability to rapidly and effectively manipulate knowledge. Let me give you just one quick example. A pivotal point in the history of science was the sequence of about a hundred years from the hypothesis of Copernicus, through the mathematical modelling of Kepler, through the experiments of Galileo, to the ultimate unification by Newton. You started with strange twitches in Mars's orbit and you ended up with the inverse square law. We cannot do that today on our computers. We have neither the machines nor the formalisms that allow us to move to where we are manipulating the knowledge. You could say it is a curve-fitting problem, but you can only do that if you have already decided what the curve is.

Primeur magazine:The representation of knowledge is important.

Thomas Sterling:That is right, it is the representation of knowledge and the representation of the operations you perform on knowledge. So this is what I think we are about to see, because there are break points of capability and capacity: knowledge representation will be facilitated in the Petaflop/s era and then throughout the remainder of the Exaflop/s era. I believe we will see performance flatten out around 32 Exaflop/s.

Satoshi Matsuoka:Let me add a little bit again. Basically, graph processing is largely a sparse problem: solving very large, sparse, linear or non-linear problems. And unless we do something bizarre, the only way we can solve this is basically to do two things. One is to increase bandwidth, and the second is to increase asynchrony, in order to hide the latency of access. Of course you can try to minimize latency, but that is a hard thing to do. So in some sense the traditional architectures that are focused on very dense computations like Linpack are not very good at this. The accesses are very small and fragmented, with limited connectivity between the elements, so these architectures are not very amenable to these types of problems. But again, these are important problems for the knowledge space.
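
As a purely illustrative aside (ours, not the speakers'), the arithmetic-intensity argument behind this can be seen in a toy sparse matrix-vector multiply: each non-zero contributes only two floating-point operations but costs a dozen or more bytes of irregular memory traffic, so bandwidth and latency hiding, not flop/s, decide the performance.

    # Toy CSR sparse matrix-vector multiply (y = A @ x), written out explicitly
    # to show the traffic per non-zero: one 8-byte value, one 4-byte column
    # index, and an indexed read of x -- against only 2 flops.
    import numpy as np

    def spmv_csr(values, col_idx, row_ptr, x):
        y = np.zeros(len(row_ptr) - 1)
        for i in range(len(y)):
            for k in range(row_ptr[i], row_ptr[i + 1]):
                y[i] += values[k] * x[col_idx[k]]   # 2 flops, >12 bytes read
        return y

    # 3x3 example matrix: [[4,0,1],[0,3,0],[2,0,5]]
    vals = np.array([4.0, 1.0, 3.0, 2.0, 5.0])
    cols = np.array([0, 2, 1, 0, 2])
    rptr = np.array([0, 2, 3, 5])
    print(spmv_csr(vals, cols, rptr, np.array([1.0, 1.0, 1.0])))   # [5. 3. 7.]

With roughly 0.1 to 0.2 flop per byte, such kernels sit far below the dense Linpack regime, which is exactly the imbalance Matsuoka describes.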

Also, other commercial applications like web mining are very much sparse problems. The danger is that the Exascale effort becomes so focused on reaching Exaflop/s that we forget the properties of the system that will allow us to scale these types of problems - the sparse, asynchronous problems. Otherwise the machine will be good for only a very limited set of problems, for example certain material science problems.

Fortunately, there are sets of technologies coming out, both proprietary and in the standards space, that solve these problems partially. Not fully, but partially. Extremely multithreaded architectures like GPUs actually go a long way towards solving these problems. Some of the interconnect technologies and memory-stacking technologies are also very much working in the direction of solving the challenge. The reason I stress this is that we should put more of these types of technologies in, not just to achieve Exascale, but also to give the system the properties that are important to ultimately solve the knowledge problems, which are very sparse.

Thomas Sterling:A major difference, related to what Satoshi says, is that I hope we move away from a paradigm that is conventionally the BSP model of message passing, with SIMD-like instruction-level parallelism buried inside it. That is one characterization of how we have done it for the last 20 years. We are going to move to a technique that is known in computer science, but not applied in computer systems: data-directed execution. That is the ability to actually move the flow control - to move the control state, governed by the actual data structure that you are processing - as opposed to always having the control state in fixed locations and doing a gather-scatter, which is the standard mode. That is the transition. For the first time we start to migrate flow control, so that the data starts to drive the computation, rather than the computation simply filtering the data.

Primeur magazine:How does that compare to the old data flow machines?

Satoshi Matsuoka:It is very much like that.

Thomas Sterling:There are many similarities, but the data flow machines were still primarily about flow control. They were still a graph of operations, as opposed to the data-directed techniques that will come. (And by the way, my first computer science course at MIT was taught by Jack Dennis.) It is the merger of flow control and data structure, as opposed to using a program that is an interpreter of the data, that is the fundamental difference. I just wanted to give an example to support what Satoshi said.
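
A minimal sketch, entirely our own illustration on an assumed toy graph, of the contrast Sterling draws: in the conventional style the loop structure is fixed and the code gathers data into it, while in a data-directed style the data structure itself decides what executes next.

    from collections import deque

    # Assumed toy adjacency structure; the node names are arbitrary.
    graph = {
        "copernicus": ["kepler"],
        "kepler": ["galileo", "newton"],
        "galileo": ["newton"],
        "newton": [],
    }

    # Conventional, control-driven: a fixed loop over a flat index space,
    # gathering whatever the index points at; the loop shape ignores the data.
    degrees = [len(graph[node]) for node in sorted(graph)]

    # Data-directed: the frontier of the data structure drives what runs next;
    # the control state migrates with the data.
    def visit(start):
        seen, work = {start}, deque([start])
        while work:
            node = work.popleft()
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    work.append(nxt)
        return seen

    print(degrees, visit("copernicus"))

The second loop has no fixed iteration space: the traversal order and the amount of work emerge from the graph, which is the property a data-directed runtime would exploit in parallel.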

Primeur magazine:That could open up a new class of knowledge applications?

Satoshi Matsuoka:It is not just knowledge, it is a whole new class of applications. What is important is the asynchrony and the associated bandwidth, because these will lead to strong scaling of hard problems. But you do not necessarily reach the peak. On today's Petaflop/s system, Jaguar, you reach only 50 Tflop/s on a weather code. So you sacrifice a factor of thirty. Why is the code so slow? It is physics - and superconductivity, which does run near peak on the same machine, is physics too. So why a factor of thirty difference on apparently the same machine? Because the weather codes are fundamentally bandwidth bound. And the only way of solving this, of accelerating these codes to the point where you get reasonable performance, is simply to embrace strong scaling in the architecture, and the only way of doing that is to increase bandwidth and shorten the latency. These are also the properties required for machines that can be applied to new applications like knowledge engineering, or even the web. So it is very important, when we go to Exascale, not to think about achieving sustained Exaflop/s on Linpack; we really need to think about the class of applications that we want to accelerate on those machines, by what factors we want to increase performance on some critical applications, and then design the machines, the software stack, the programming model and so forth.

Primeur magazine:And how does that fit with the Exascale roadmap figures? You need to increase processor performance by a factor of a thousand, but in the meantime memory capacity and memory bandwidth would only increase by a factor of a hundred. Are those the wrong design choices then?

Satoshi Matsuoka:Those are design choices governed by the foreseeable restrictions when you extrapolate from the current sets of technologies. That is important too, for realistic reasons: it takes time to design these chips, and it is good to think at the software level about how to cope with this. It is also important to invest in key technologies in areas that need a breakthrough, for example optical chips for interconnects, or 3D chip design. These may actually solve some of these problems, and they can also benefit the IT industry as a whole. So it may be that technology developed to meet the requirements of the Exascale machines actually has a positive effect on the industry, which is fed back to the Exascale machines once the technology is standardized. So it is important to invest in the pain areas, to enable these types of scaling to persist despite the difficulties. But right now, the numbers are extrapolations. They are what we can conservatively expect. Any breakthrough there is welcome.

Thomas Sterling:I think you have both characterized the standard extrapolation from existing practice, which is in fact proven by demonstration. But that is not what we are going to do, except in the broad-throughput, cloud computing arena. In fact you need to decouple capacity from bandwidth, though today's packaging does not make that break possible. By doing stacking, and by using embedded processing - this time, processing embedded in memory - you will effectively get much more bandwidth from the memory banks. You will define the machine size by its memory capacity, not by its peak Flop/s. The memory bandwidth will essentially be the performance capability - the rate at which we can do loads and stores, not Flop/s - and so it is new structures that will be essential in order to avoid the strangleholds that Satoshi described.
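
To put rough numbers on the imbalance raised in the question (these are assumed round figures, not roadmap values): if flop/s grow by 1000x while memory bandwidth grows by only 100x, the byte-per-flop budget shrinks by a factor of ten.

    # Back-of-the-envelope byte-per-flop arithmetic; all inputs are assumptions.
    petascale_flops = 1.0e15                  # ~1 Pflop/s system
    petascale_bw    = 0.5 * petascale_flops   # assume 0.5 byte/s of memory bandwidth per flop/s

    exascale_flops = petascale_flops * 1000   # flop/s up by a factor of 1000
    exascale_bw    = petascale_bw * 100       # bandwidth up by only a factor of 100

    print(petascale_bw / petascale_flops)     # 0.5  byte per flop
    print(exascale_bw / exascale_flops)       # 0.05 byte per flop: ten times tighter

That shrinking ratio is exactly why Sterling argues for sizing the machine by memory capacity and bandwidth rather than by peak Flop/s.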

Primeur magazine:The Earth Simulator was basically designed to do earth simulations, and if you look at the societal costs, that did make sense. So would it make sense today to design a new Earth Simulator, and a physics machine, and some others - say 5 or 6 big targeted architectures? Would that not also pay off? Why does it have to be one design? Why can we not make a few application-focused machines?

Satoshi Matsuoka:In some areas that pays off, for example for N-body problems. But it is a hard question to answer, because the characteristics that are important for general computing to advance, not just supercomputing, are now largely in line with some of the new technology that is starting to appear: things like chip-to-chip interconnects and new silicon memory devices. Some of these requirements are coming from markets much bigger than HPC. So for those areas that are aligned, it does not make that much sense to build something new; or it makes sense maybe at the chip level, but it does not make that much sense to develop whole new technologies. So it could be that HPC drives the commodity and then the commodity drives the supercomputing. So far, if you look at HPC, that has mostly come from the alignment that, for instance, resulted in the proliferation of clusters. And clusters are now enhanced with GPUs and all kinds of other special stuff, so they are no longer purely off-the-shelf.

Now for Exascale there are challenges. Some people claim that this alignment does not work for Exascale - for example, for the interconnect. Is there a market need for advanced interconnects sufficient to drive the bandwidth required in Exascale machines, with low power and so forth? My speculation is no. Look at your laptop. You used to have 10 Mbit/s Ethernet, then you got wireless. Then you got a Gbit/s, and then wireless at 150 Mbit/s. Do you need faster? Do you need 10 Gbit/s? Probably not. And this is not only true at the low end, but also in enterprises. People have been using 10 Gbit/s for some time now. In HPC, 10 Gbit/s is laughable; it is not a plausible bandwidth. Now 40 Gbit/s is coming out, but it is still very expensive. So HPC is depending less and less on so-called industry standards. You do not need 10 Gbit/s on your laptop; servers do not need 100 Gbit/s. So it is important to differentiate which aspects we need to design ourselves for future generations, to achieve applicability to a wider span of applications.

That does not mean we build purpose-built machines. Purpose-built machines are expensive and do not leverage the technologies that are there as low-hanging fruit. So it is really important to think of a design that incorporates most advancements, leveraging existing technologies and adopting new ones, but whose result, in the end, can be used in many disciplines. That is the only way you get sustainability. Special-purpose machines are not sustainable. They cost too much, and the scientific importance may change; one day the interest may be gone. For example, about 30 years ago the departments dealing with concrete in Japan were really important, because of the amount of building taking place, so working on concrete was a really important scientific discipline. Today it is not very important. And no matter how important scientists claim their research is, each scientific discipline has its lifetime. Machine designers should not ignore that: one day a field is important, the next day it is not anymore.

Thomas Sterling:Historically, special-purpose design has made economic sense only in the case of very large markets, like embedded computing, where real-time or reliability constraints were imperative - literally life and death issues - or where streaming sensor processing allowed for high utilization. Otherwise the usual answer to this question has been "no", because even if you could see how to get more performance by a factor of 2, 4, or 10, the development time usually exceeded the time in which the technology advance of conventional machines would reach the same point, at orders of magnitude lower cost because of economies of scale. So one would tend to shy away from that for good reason, with some exceptions - FPGAs slip in there a bit, for things like gene sequencing. However, by the time we get to the Exascale era, in the range of the early 2020s, I predict we will have topped the S-curve and be on a very shallow asymptotic slope of performance gain, which will be as bad as clock rates are today.

When we are at that point, the only way we will achieve more parallelism, better performance, or better performance per cost is potentially by well-structured systems. Here is one reason to believe that. An analysis of Petaflop/s applications some years ago, looking at the data set sizes of applications, showed that the amount of memory required for different problems varied by as much as four orders of magnitude, and yet memory capacity is the principal cost of systems. If you can drop the cost of your system by an order of magnitude, triple your performance, and spend that money on communications or something else, then you get a much better trade-off. And by the way, you reduce your energy and your operating costs as well.

I am just giving you one example of one dimension of why, at the point where we can no longer rely on performance gains through conventional technology, structure may allow us to eke out that last order of magnitude and exploit the specific problem opportunity. My guess, however, is that we are going to find that the vast majority of problems have so much overlap in their various needs that, other than telling your vendor to do something about the configuration, you will benefit most from what you can gain through optimization - and we are beginning to do that today.

Satoshi Matsuoka:Some of these so-called special technologies are already making inroads, such as GPUs. GPUs are very, very different from x86, because of other market-driven needs and because of HPC growth. Silicon memory, again, will probably be used in HPC; for future machines it is possibly a prerequisite for achieving high I/O bandwidth. Otherwise I/O is a bottleneck - disks will not do.

So if you take those components, or custom-made interconnects, as Thomas mentioned, there are ways to parametrize machines to the point where they become applicable to certain domains. BlueGene is like that. The original BlueGene was supposed to have a lot less memory than it was actually built with, because it was a special-purpose machine for molecular dynamics simulations, and for molecular dynamics you have very small memory requirements.

Thomas Sterling:You are talking about the original BlueGene, Monty Denneau's design. They did build a machine called Cyclops derived from that.

Satoshi Matsuoka:So there are instances where building a special purpose-built machine has been tried. However, today you can take a single, general-purpose machine and configure it for specific domains and, as Thomas said, arrive at much higher performance within a given budget of bandwidth, cost, space or power. And we will probably see that. We will probably see these types of customization go on, and it has probably already started with Petascale. If you look in a year or so at the Petascale machines in the TOP500, you will see great diversity in the architectures and the other parameters. Some will have lots of memory, some will have only x86, some will have accelerators, some may have Cell - all these kinds of diversity. The interconnects, again, will differ among the machines, and for a good reason: given certain performance targets and certain sets of disciplines or application areas that you are looking at, you do try to optimize. So I do not think there will be one Exascale machine design. There will be a first Exascale machine, but the 20th Exascale machine may look very different from the first, for that reason. It is the same with Petaflop/s: the 20th machine, which may appear next year or in two years from now, will look very different if you look closely.

Primeur magazine:Is there a question I should have asked but I did not?

Thomas Sterling:We should meet again next year, and you should ask that question. There is real movement right now, as I said in my keynote address, in the Exascale area, with actual funded programmes kicking in within the next couple of months. Decisions have already been made for these engagements; they have just not been announced.

Satoshi Matsuoka:So here is a question: why did it happen this year, this general acceptance of the trend by the community, that Exascale is a formidable goal? There is a general consensus.

Thomas Sterling:And to fine-tune that question: what is different now, in the modality we are experiencing both nationally and internationally towards this new goal, from say 12 years ago when we were examining Petaflop/s? Because it is a completely different dynamic. Petaflop/s was a grassroots, technically driven set of forums and explorations with relatively little support, and certainly no driving requirements from the governments. This is as much top-down, and yet it embraces the technical community in a partnership. It is very different. That is really the interesting question: what is different today?

Satoshi Matsuoka:What do you think?

Thomas Sterling:First of all, there has been a serious process of workshops - Satoshi and I have seen each other more in the last couple of years than in the previous six, just because of these activities, and he is responsible, as the leader, for the fact that it is so highly international. I think the reason is that we are in a different part of the S-curve. Everyone recognizes that multi-core is a challenge; it is not an opinion, it is a fact, and it has much broader implications than just for the science and technical computing community, along with the potential opportunity and challenge of heterogeneous computing. None of this is new, but it is now being recognized as a central way to turbo-charge systems, while in the balances and costs the floating point has become a very small part of the total cost and the data movement now dominates, especially in energy. I think it is that challenge, across these three orders of magnitude, which is sufficiently demonstrative that, even though we have lived on the past track of the execution model over the previous twenty years and improved it, nobody has real confidence anymore. There is an interesting dichotomy in the mindset of our community right now. On the one hand, deep down inside, we all believe that somehow we can continue the model, with small perturbations of the methods that we used over the last twenty, twenty-five years. At the same time, the same people - I am amongst them - recognize that so many things in what we are doing have to change that we should use a clean sheet: jump paradigms, and do computation differently than we did it, just as we did it differently from the times before. And somehow those conflicting views are retained simultaneously in the same persons. Do you agree?

Satoshi Matsuoka:Certainly: people are constantly looking at both options. It has always been the case in computer science, and in engineering, that you either start from a clean sheet or follow a more evolutionary process. That is what makes the field very exciting.

Thomas Sterling:There is more humility in our community right now. It is not the religious fight you would expect to have, and that we have had. Satoshi is right: this is a class of argument that has gone on for decades. But now I see both sides recognizing the real uncertainties on both sides. And everyone seems more prepared to talk, to learn, and to truly search at a deeper level than we usually go. I see a greater energy, a great excitement, and more sobriety. People are just a little bit more self-conscious, because this is hard. We could kill the astronauts.

Satoshi Matsuoka:There is less religious force at work. In the past there were always religious wars going on, for example about machine architectures. Now people really recognize: this is it, this is real, and we really have to get down to reality. Certainly there is a software problem; we know how to solve parts of it, but there are breakthroughs that are needed, and they are probably possible. So we have to sit down and think about the right way forward in a very realistic sense. Let me add one more thing about why I think going from peta to exa is a little different. I run a centre and talk to people quite a bit. People are more serious about computing; they trust the simulations. This is, I think, the result of various programmes instituted during the Terascale era. We have seen the successes. People are now saying: "Well, maybe environmental science simulations can produce trustworthy results." The DARPA and ASCI programmes, a lot of the bio and proteomics research doing in silico drug discovery, and so forth - these were large-scale problems that people had doubted. In the scientific disciplines the big bosses used to say: "Simulation on a computer, that is bullshit. Do not waste your time in front of a computer; do real experiments, devise a theory. Computing is for second-tier or third-rate researchers." There were mentalities like that. But the successes in Terascale and then Petascale computing, and the incubation of new generations of scientists who are now much more used to trusting the results of simulations, are making HPC in general valid. So now people have very high expectations: we have got to do Petascale, we have got to do grand science like we never imagined before. Now we have all these grand requirements from science, which came out of the DoE studies, for Exascale, Zettascale and so forth. Because people trust the simulations, they want more. And it is also top-down: Obama, for instance, says Exascale computing is important because the cost of computing can make a difference in a sustainable society.

Thomas Sterling:Sometime in the Terascale era we passed a qualitative threshold. The performance capabilities and the data capacities exceeded some unspecified boundary at which the complexity of the non-linear phenomenology could be modelled with sufficiently high fidelity that only through these means could we see the results. There were no means to do it experimentally, and the mathematics was unsolvable. So we passed that point - you did not wake up and say: "My God, we are there" - but we passed some point and found we were getting results that could not be achieved in any other way. That meant new science; and once that happened, HPC became supercharged.

Satoshi Matsuoka:And now we are really supercharged. And that is why we are seeing so many top-down initiatives. You know, the new Japanese supercomputer was almost cancelled, because of what they did in the UK and the government copied that: they tried to reduce "waste", and they thought research was waste. But when they tried to effectively cancel the 10 Petaflop/s supercomputer, the biggest opposition came from the science community. It came from the general science community, it came from Nobel laureates, it came from several societies that were using HPC. They totally denounced the stupidity of the government in making such snap decisions. That felt really good to me, because if this had been ten years ago - Thomas Sterling:it would not have happened - it would not have happened. I think this is what Thomas was talking about: some point has been passed.

Primeur magazine:And there is the fact that the European Commission is now also putting money and effort into supercomputing, which it did not do for several decades.

Thomas Sterling:One of the extraordinary things that could happen - and I do not know how it is going to happen - is that people like Satoshi, Jack Dongarra and Pete Beckman could actually energize a true international partnership: not just for political goodwill, not just to do lip service, but actually harnessing the resources to increase the productivity of world civilization by making these tools available. We have never seen anything like that, with the possible exception of some of the biggest telescopes, the LHC, and a few other special cases. But there is no reason why they should not succeed in this. There is absolutely no reason to duplicate; we have to try different paths, different experiments. That is one of the things I am concerned about, by the way: that we will preordain what the answers will be before the research has been done. But this would be such a bumper crop for the world community that I would love to live long enough to see it happen.

Primeur magazine:Thanks a lot for sharing your opinions. We hope we can do it again next year.
Ad Emmen