21 Jun 2012 Hamburg - For the third time in a row, Primeur Magazine was granted the honour to double-interview Prof. Dr. Thomas Sterling, Professor of Informatics & Computing, Chief Scientist and Associate Director, CREST, Indiana University, and Professor Satoshi Matsuoka, Tokyo Institute of Technology, National Institute of Informatics/JST CREST, Japan, about their views on Exascale development.
Primeur magazine: So for the third year we are discussing how far we are on the road to Exascale. Last year there was a lot of excitement: There were several groups discussing and working towards the achievement of exascale. How did we do this year?
Thomas Sterling: Last year there was a sense of moving in a direction that we did not really see in the ramp-up to Petaflop/s. But I really think that this year some substance and form has emerged. Satoshi was one of the leaders of the now almost three years of the International Exascale Software Project (IESP) meetings, also led by Jack Dongarra, Bernd Mohr, Pete Beckman, Paul Messina, and others, which has concluded the series of 8 international meetings since we met last time. One was in Cologne, Germany and the last one was in Kobe, Japan. It took a while to reach a consensus on what the consensus was. In particular, for much of the time there was real contention and even push back against anything that was non-incremental. But towards the end we found that there was room for more extensive, more innovative approaches to be considered and compared against the possibilities of incremental approaches if and where needed.
In Europe the EESI completed its original examination and launched the much more substantive EESI2 under Jean-Yves Berthou that is setting in place a plan for software. In the USA, which some consider or perceive as being behind other nations, the first significant and formal programs have been put into place just prior to last year: the beginnings of something called the co-design centers have begun to look at applications and machines with respect to each other and understand how they should evolve together.
Most recently, under the Department of Energy, the Office of Advanced Scientific Computing Research, led by Dan Hitchcock and Bill Harrod and under the program management of Sonia Sachs, now has in place the X-stack Program, an explicit program to develop the system software stack from programming models down to the architecture interface for Exascale computing, with near-term off-ramps to be useful for the trans-Petaflop/s domain as well. We also know there are plans in China, Japan has its road map, and there are a number of other nations that are examining their possibilities as well. I should mention in particular Russia, and the key role of T-Platforms and Moscow State University.
Primeur magazine: And what about Asia?
Satoshi Matsuoka: China has re-announced its commitment towards Exascale. They are still a communist country, so they have these five year plans. They announced that they will build a 100 Petaflop/s system in the 2014/15 timeframe. It is not known whether they will use their home grown technology, like their Sunway Blue Light [processor], or will be using other countries' technologies mixed with theirs, like Tianhe-1A. But certainly they have announced their commitment to do so.
In Japan, since then we have been putting together road maps towards exascale. A number of people have been recruited. In fact almost the entire HPC community has been recruited, along with the vendors. We have come up with road maps, and recently this has turned into more explicit studies towards exascale. In our feasibility study program, four groups, three architecture groups and one application study group, have been approved. They will start this Fall and it will be about a two year project. Of course, the outcome of this project will determine how we should proceed, or if we should proceed to Exascale. Certainly, there is a momentum.
If I take a look at the global picture, to supplement Thomas' view, in 2012 we are seeing the rounds of Petascale machines finally reaching their goals. That is, we have multiple ten-Petascale machines and some more coming, and we know largely what they are, and how they will behave, and we know they work. There are lots of Petascale machines. Now all the TOP20 machines on the Top500 list are Petascale machines [measured by HPL benchmark], and if you measure by peak it is the first 28. So, by November this year it is very likely you will have to be a peak Petaflop/s machine to be in number 50.
Certainly people now have the feeling we have harnessed multi-ten Petascale technologies. That has given good confidence that the road to Exascale, although there are difficulties, is much more realistic, be it revolutionary or evolutionary. It is not known when exactly the targets, initially said to be 2018, will be met or whether 20 MWatt exaflop/s machines can be built. But certainly people feel there are opportunities: by relaxing those parameters, it would be possible to achieve Exascale some time around that range. We are starting to realize what these machines may look like. When we started IESP we did not even know what the machines would look like. We did not even know what the software stack would look like. But this year we are getting a sense of what they are, what they would be architecturally. Still there are lots of options, but we know what the options are.
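[Illustration: the two targets quoted above, one exaflop/s within 20 MWatt, imply a required energy efficiency that can be worked out directly. The Python sketch below does only that arithmetic. The function names and the 25 Gflop/s-per-watt figure are hypothetical, chosen to show what "relaxing those parameters" could mean. They are not projections from the interview.]

```python
# Arithmetic sketch only: relates the exaflop/s and 20 MW figures from the text.
EXAFLOPS = 1e18         # one exaflop/s, in floating-point operations per second
POWER_BUDGET_W = 20e6   # 20 MW, in watts


def required_gflops_per_watt(flops: float, watts: float) -> float:
    """Efficiency (Gflop/s per watt) needed to sustain `flops` within `watts`."""
    return flops / watts / 1e9


def power_needed_mw(flops: float, gflops_per_watt: float) -> float:
    """Power (MW) needed to sustain `flops` at a given efficiency."""
    return flops / (gflops_per_watt * 1e9) / 1e6


# Efficiency implied by the 20 MW target.
print(required_gflops_per_watt(EXAFLOPS, POWER_BUDGET_W))  # 50.0

# Hypothetical relaxed case: a machine reaching only 25 Gflop/s per watt
# would need a larger power budget for the same exaflop/s.
print(power_needed_mw(EXAFLOPS, 25.0))
```

Read the other way, the same relation shows the trade-off being discussed: at a fixed efficiency, either the power budget or the delivery date has to give.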
For software, again we know what the options are. We know what applications would benefit. So the path is substantially clearer this year. That is what is getting lots of vendors, including processor makers like Intel and NVIDIA, to commit very explicitly to Exascale. For example, Intel established Intel Federal and they proclaimed very explicitly that they would pursue Exascale. They are involved in many projects. They are involved in the Exascale project. They are involved in the DoE project. NVIDIA, as well, is involved. Fujitsu announced their intention and at least are considering exascale.
Primeur magazine: So Satoshi said it is already a little bit more clear what the machines will look like. Do you agree with that? And what are the options?
Thomas Sterling: I think the space is becoming clarified. And I agree with that. But if anything is clear in my opinion, and I think this is a good thing, not a bad thing, what the machine is going to look like has become less certain over the last year, as we have come to understand more clearly the options. So a year ago there was this juggernaut, this bandwagon. And Satoshi's TITech TSUBAME 2 is an archetype to demonstrate the highest effectiveness of this particular model. Not just in performance and cost, but also in energy as well. It is the tall pole among the approaches. But this year we see that the community does not necessarily agree on any particular single approach. When both Intel and IBM invest in a non-GPU-based approach, and indeed in the TOP10 of the TOP500 list (and I wish we were not doing that, but I am as guilty as any), we see that there are, at least at this point in time, multiple vectors pointing toward relatively heterogeneous or relatively homogeneous approaches. Now I happen to think there is going to be a homogeneous/heterogeneous machine. That is to say, the AMD approach, which probably will not be sustained, is closer to the model than a migration, and integration of special purpose modality processing units, like a GPU for SIMD, to directly embed them into the communication and addressing space for low latency to multi-cores and lower operational overhead. I expect this is going to come closer. But now - and this is a personal opinion - I think that we have not explored the entire necessary space of design. In particular I believe that the processor core itself (and now what I am about to say, I may be the only one on the planet to be saying this - so I do not want your readers to misconstrue my opinion as established fact) today looks at the majority of the rest of the machine as an I/O device.
This is the wrong hardware support for the semantics of a 100 million or billion core system that is working in harness on a single problem, a single tightly coupled problem, because of the need to reduce overheads, and to reduce latencies, as well.
I think in the area of applications, I fully concur with my colleague from Japan on the question that we have a very good idea on many of the important problems that could deliver breakthrough advances through Exascale. But we do not at the same time know how to extract the necessary and sufficient parallelism from those applications or what their algorithms are going to be in all cases in order to take advantage of exascale. So my view is aligned with Satoshi's but I believe that space of consideration is more rich and the key questions that are barriers to the ultimate success have yet to be answered, though I expect them to ultimately be resolved through sufficient innovation.
Primeur magazine: The hardware space options, are they still targeted towards real HPC applications as we find them today? Or also to new challenges and more knowledge type of analysis?
Thomas Sterling: I leap in here - Short answer: My view is that in the very long term (and I may have reflected this last year) knowledge management, and ultimately machine learning and understanding, will consume the majority of cycles of machines around the world. This will not ultimately be restricted to high-performance computing but, as we are seeing, will involve a combination of smaller machines and the Cloud (I am not a giant fan of the Cloud, but I recognize its potential). This will be a form of high-performance computing; it may become the form of high-performance computing. We are not there yet. But strides are being made. We have to get through this MapReduce-is-everything notion. Once we get past that, we can start looking more seriously at the really interesting complex and dynamic data structures that are ultimately symbolic in nature and which will make up the underlying codification of knowledge that machines themselves can manipulate.
Satoshi Matsuoka: Certainly. If you look at the new emerging benchmark, the Graph500, there are some misconceptions as to what it is. That is to say people regard this as a graph processing benchmark, but it is in-core: all the graph is in memory. So the nature of the benchmark really is that it will test the bandwidth and the networks, especially the bisectional message rate of the network.
Thomas Sterling: and the address streaming.
Satoshi Matsuoka: As well, so it taxes the data access, but not just in the conventional concept of MapReduce type of workloads, but rather, as Thomas said, it addresses issues that are pertinent to dynamic workloads. However, at the same time, there are other properties that the Graph500 benchmark represents in real system workloads. We have a team in Tokyo Tech that got number 3 and number 4 this time. Working with those people, the nature of the benchmark is found to be very relevant to, for example, classical CFD and other types of so called sparse problems.
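[Illustration: the search kernel of the Graph500 benchmark is a breadth-first search (BFS) over a large in-memory graph. The Python sketch below is a minimal single-node, level-by-level BFS on a toy graph. It is not the benchmark's reference code, and the adjacency data is made up for illustration. It shows why the workload stresses data access rather than arithmetic: each step follows edges to wherever the neighbours happen to live in memory, or on another node in a distributed run.]

```python
def bfs_parents(adjacency, root):
    """Level-by-level BFS returning a parent map for every reachable vertex.

    `adjacency` maps each vertex to a list of its neighbours. The root is
    recorded as its own parent (a convention assumed here).
    """
    parents = {root: root}
    frontier = [root]
    while frontier:
        next_frontier = []
        for u in frontier:
            # Irregular, data-dependent accesses: neighbours of u can be
            # anywhere, so there is little locality to exploit.
            for v in adjacency[u]:
                if v not in parents:
                    parents[v] = u
                    next_frontier.append(v)
        frontier = next_frontier
    return parents


# Hypothetical toy graph (undirected, stored as adjacency lists).
toy_graph = {
    0: [1, 2],
    1: [0, 3],
    2: [0, 3],
    3: [1, 2, 4],
    4: [3],
}

print(bfs_parents(toy_graph, 0))
```

In a distributed setting each frontier expansion turns into many small messages between nodes, which is where the network message rate mentioned above comes in.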
So what does it mean? That means that if we are to deal in Clouds with types of problems that are not MapReduce type of problems, that is to say, the flip side of the coin, MapReduce was a type of processing that was invented to circumvent the inherent architectural weaknesses of the current Cloud. That is to say very weak network bandwidth, fairly standard processors, that may not be so power efficient. Local disks which may have their local I/O, but not globally accessible bandwidth. Very, very focused on the way servers are constructed now. You slap it into a rack, with something like a Gigabit Ethernet, and there you go.
In some sense MapReduce was invented to stylize the type of data processing to match the particular architectural property of those conventional Clouds; as such for other workloads, they are completely weak. This is demonstrated by the fact that some of the Cloud infrastructures that tried to run the Graph500 benchmarks for large graphs trailed the best supercomputers miserably or were not able to execute them at all. So this means that as we advance towards symbolic knowledge processing, there will have to be convergence of supercomputing and Cloud infrastructure. Clouds will have to be populated by what we know as supercomputers today, and thus the workloads as such will converge, the machines will converge, and the numerics and the symbolics will very likely converge. This will happen at the architectural level, software level and of course the algorithms and application level - they will likely converge at all levels.
So right now the architecture of the Clouds that we see today is just accidental, that is to say it happens to just satisfy certain workloads, that is MapReduce. Basically this was the result of trying to fit the processing to the architecture that was created to serve the standard web as well as corporate workload offload such as ledger and that kind of stuff. As convergence proceeds with supercomputing, however, the Cloud will cover many applications that are known in supercomputing and the convergence will not just be numerics - the so called Big Data issues will certainly be one of the focal points of supercomputing in Exascale.
Primeur magazine: Just to go back a little bit in the past year, which highlights in, say, hardware developments do you think are worth mentioning?
Thomas Sterling: There is a strong tension to avoid change at the hardware level broadly but to emphasize incremental change locally within the hardware. We have seen this work quite successfully with NVIDIA, right now with Kepler, and the next step is Maxwell and the past was G200. We have also seen the standard progress on the part of Intel, moving down the feature size faster than I thought they would be able to, and with apparent reliability as well, through tri-gate transistors mastering the fearful leakage currents that we thought could have become, and may ultimately become, one of the major barriers. But these are good things.
We are seeing but have not implemented the next stage in hardware technology, which is likely to be the layering of dies. We know that is happening. We know there are multiple industry partners engaged in this. This will be very necessary for increasing bandwidth, reducing at least localized latencies, and reducing power. So that technology is just beyond us at the moment, but it is in the works and it is coming. There is the continued flirtation with the insertion of optics locally on the boards. The challenge there is to provide much higher bandwidth into the dies than can be handled by the limitations of pin growth. Number of pins and pin bandwidth are severe barriers. Optics onto the die could reverse this, although the short lengths are not required, for other purposes.
Usually with regard to bandwidth, the rule of thumb is about a foot. It is the smallest unit you need to benefit from optics over copper. But it is the bandwidth onto the die that is important. But this hardware certainly does not have a hold yet. We are using optics board-to-board, rack-to-rack, but not routinely effectively on or between dies. Xeon Phi "MIC" is an experiment too, and the value of MIC is that it is breaking away from the conventional by pushing density and lower power. Just the willingness of a major chip vendor to do that and, at least in marketing terms, to make a kind of commitment to the industry that they will support this is a symbolic but very important step in the hardware area.
Satoshi Matsuoka: Over the past year there has been tremendous investment, with results becoming apparent, in the area of low power. It is not just BlueGene that has taken the Green500 crown. We have seen improvements in the conventional processors like Xeons and improvements in GPUs. There are projects now where low power is the key deliverable in various academic and industry projects. When I go around the show floor it is not just the processors themselves. I see a tremendous number of liquid cooled solutions. And liquid cooling, in the old days it was necessary, not because of low power, but because we could not use CMOS. Then for a while those things disappeared, because we could cool CMOS by air. But now by necessity of low power various sorts of liquid cooling solutions have become really necessary to embrace low power and also to control the higher density of power and heat dissipation of these components.
MIC, of course, as Thomas mentioned, is a key component in Intel's commitments to HPC and towards exascale. One fact of MIC is that it is a many-core processor, just as with GPUs, so in many ways they are very similar if you look at the architecture. However, the difference is, if you look at NVIDIA or AMD, they have been in this business for some time. Many-core is the only stuff they do, basically. On the other hand, Intel does not have to do many-core; they do not have to invest in many-core processors, at least in the short term or medium term. They can make a lot of money selling Xeon. MIC in some way is dangerous, because it is a solution that, if right, could cannibalize their Xeon, and people have pointed out this danger. But they are still doing it. Last year they had to make a decision, a go or no go, and they said: go. So this is a commitment of Intel to really pursue many-core low power architectures and that is a big investment for them.
Another interesting movement is ARM coming into the picture, which is for embedded low power. Of course we have seen this before with the PowerPC and BlueGene etc. But now ARM is a bona fide ultra low power technology. Their processors are selling by the billions in the embedded space: cell phones, refrigerators, etc. Now they are making these investments in HPC and their weapon is low power. Of course on the show floor we have seen things like Calxeda, which is apparently funded by ARM themselves. The company is making these low power SoC chips based on ARM, but puts some of the HPC or web serving features around it as an SoC, and there could be some situations where actually they may be superior in terms of their absolute low power and density, driving other parameters to the point where they could be superior compared to Xeons, or even MICs or GPUs. So again low power is the key.
I forgot to mention that in the TOP500 the top machines' efficiencies have dramatically improved over the past year. They have just skyrocketed, whereas the lower tiers of the machines have not really changed. In supercomputing, technology adoption basically follows the waterfall model. The best technologies are usually tested on the big machines, and trickle down from the top. If the top machines make dramatic advances, the technology will trickle down. Low power is absolutely necessary for Exascale, so I think this was the big trend, and this will continue to be so.
Thomas Sterling: In fairness, just to add. ARM at this point is not delivering 64 bit floating point, although I think they will shortly.
Satoshi Matsuoka: Oh yes, they are. They do not have a 64 bit instruction set yet, but they have a double precision 64 bit arithmetic. But indeed 64 bit addressing is needed for HPC, and they are developing a 64-bit ISA.
Thomas Sterling: Just as a self-serving comment - I believe what we are confronting right now is at the hardware level, with some architecture at the core and system level opportunity for Exascale if we can manage the power and reliability. And as Satoshi pointed out: This year has certainly seen some reason for optimism in the power although there is an order of magnitude to go.
About the reliability question, I think, it is still largely virgin territory. It may be addressed when checkpoint-restart can no longer solve the problem. I personally believe (I may be in the minority here) that the exploitation of a billion cores and this highly distributed memory in strongly asynchronous systems (and quite possibly heterogeneity) really demands an alternative execution model to that of Communicating Sequential Processes, which was designed for a homogeneous single core per node system that was the natural fit between the program structure and the hardware architecture. Before that SIMD models were appropriate either for SIMD arrays or for vector machines.
So I believe we are in a different hardware space whichever path we take. We really do have to devise a new execution model and then use that as an important conceptual tool for defining the cut-set of responsibilities among the programming models and compilers, runtime and operating system, and the layers within the architecture. This suggests - and we have done it before - a revamping of that structure through the process of co-design. I do not think we are there. However, I would say that there is such a convergence of ideas, most of which have come from prior research over decades, that even a controversial model is itself converging. The work that my colleagues and I have done is reflected in other work, some in industry and some in academia. So I believe there is a minority view that is also emerging and becoming clarified that may in fact make it possible to run these systems. My biggest problem, and I repeat it, is that our processor cores today treat the rest of the system, all the other cores, as I/O devices. That is the wrong semantics in the hardware for minimizing the latencies and overheads. Those are both key to allowing successful exposure and exploitation of sufficient parallelism on complex problems to take advantage of Exascale.
Satoshi Matsuoka: I do not disagree with you, Thomas, I just take a different view. So the challenges are still there. That is, we have billion-way parallelism to deal with. We know that is the only way we can reach Exascale. There are strong scaling issues. Amdahl's law is still alive as well. So in order for applications to take advantage of a billion cores we have to deal with strong scaling issues that are coming to the surface. There are power issues. Obviously, a lot of studies have indicated that, despite the optimism, the power for moving data around will become the major factor. So we can no longer afford to move data around as freely as we have done in past machines. Rather we have to be very aware of where data are and then try not to move the data, but move the computing around.
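[Illustration: Amdahl's law, mentioned above, bounds the speedup of a fixed-size problem by its serial fraction: with parallel fraction p on n cores, speedup = 1 / ((1 - p) + p / n). The Python sketch below evaluates it for a billion cores. The parallel fractions are hypothetical values chosen for illustration, not measurements of any application.]

```python
def amdahl_speedup(parallel_fraction: float, cores: float) -> float:
    """Speedup of a fixed-size problem under Amdahl's law."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def speedup_limit(parallel_fraction: float) -> float:
    """Upper bound on speedup as the core count grows without limit."""
    serial_fraction = 1.0 - parallel_fraction
    return float("inf") if serial_fraction == 0.0 else 1.0 / serial_fraction


BILLION_CORES = 1e9

# Hypothetical parallel fractions: even a tiny serial part caps the speedup
# far below the core count.
for p in (0.99, 0.9999, 0.999999):
    print(p, amdahl_speedup(p, BILLION_CORES), speedup_limit(p))
```

The bound depends only on the serial fraction, which is why strong scaling to a billion cores puts so much pressure on the algorithms themselves.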
All those resiliency issues are constraining, in other words boundary conditions for the applications. Now, look at the other end: what will applications do, or what will the numerical or other types of algorithms do that underlie these applications? What do they need to do? Obviously, they will need to look at the constraints and devise their most optimal execution style, based on those constraints. In some cases they may still continue as they had in the past. They can probably use the same code, maybe with some adaptations, but some of these may scale perfectly to Exascale. There would be applications that need to change the underlying algorithm, because they can no longer work within those constraints. So they have to fundamentally change their numerical algorithm or even change their physical model in order to cope with these constraints. A classic example is this big push for changing the atmospheric code from implicit methods like spectral codes to direct solvers. There are pros and cons about that, but in any case they are totally different algorithms, and direct solvers are much more amenable to scaling. Then the question is, how do you program these? There are changes in the algorithm. There are changes in the constraints. How do you program these applications, especially, coping with strong scaling? So we have the high latency - we have to deal with latency in some way. You should not move data around. So you have to move computation around? What would be the most natural for these classes of algorithms that are within these constraints? How are they to be programmed? Is it the conventional style? Will it be an extremely new style? Or it might not be very new, but rather be based on ideas like dataflow that people have had in the past and that are now finally becoming useful and also real, albeit with pragmatic modifications, like active objects or X-caliber.
They are referred to as a reactive type of computing with local synchronization, but would they be useful or necessary?
The jury is still out. We do not really know yet, because people have started thinking about Exascale in terms of applications only fairly recently and we do not really yet know the answer. We know about hardware technology. But how do we create the whole ecosystem? That is still not specified at the interface between the application and the machine. That is a very open area. And that is going to be the topic of today's panel (here at ISC12). That is going to be a big challenge because software applications are ultimately the investments that need to be made. It is not the machine. It is not the hardware. It is really the software that survives generations of machines. It is those applications that make machines useful.
Thomas Sterling: One last word: "international": I totally believe the problem is daunting enough. The amount of expertise and resources is small enough that we must retool the way we work to at least be informed by the experience of others and, more desirably, to share the resources and plans in order to make this more practical. No one person, institution or nation is going to provide the solution. We still have strong tension around the world that we need to relax for everyone's benefit.
Primeur magazine: Thank you both very much, gentlemen, for expressing your views on Exascale. We look forward to continuing next year.