4 Jun 2010 Hamburg - This year Cray announced a major new product, the Cray XE6, whose interconnect is a major innovation. That is why Primeur magazine talked to Cray's Barry Bolding at ISC'10 in Hamburg.
Barry Bolding: We officially launched our XE6 product. Previously it was code-named Baker, and we are on track to deliver these systems in the third quarter of 2010. We have a very large backlog of orders. We have been doing a number of press releases on it over the past four or five months to let the community know about it. We have now reached the stage where we wanted to launch the system, give it its official name and be able to talk publicly about the characteristics of the system, the networking, the software, and the infrastructure built around the XE6.
Primeur magazine: Why is it called the XE6?
Barry Bolding: Well, it has a whole new network and we wanted to let users know in an easy way that it has this new network, but the same compute portion as the XT6. Someone who owns an XT6, or an XT5, can actually upgrade to the new network and keep the other parts of the system. So the naming scheme, very simply, lets our internal engineers know what type of system is in the field and lets us know what type of system the customer wants. If, for instance, an XT5 customer like Oak Ridge or someone else wanted a new network, they can upgrade and it becomes an XE5 instead of an XT5. The number represents the type of processor, the letter represents the type of network. Other than that there is no real reasoning behind the naming.
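The naming rule described in this answer can be written down as a small lookup. The following is only an illustrative sketch: the function names are hypothetical, and the mapping of "T" to SeaStar and "E" to Gemini is taken from what is said elsewhere in this interview, not from any Cray documentation.

```python
# Hypothetical decoder for the naming scheme described in the interview:
# the letter after "X" names the network, the trailing number the processor type.
NETWORKS = {
    "T": "SeaStar",  # used since the XT3, according to the interview
    "E": "Gemini",   # the new network launched with the XE6
}


def describe(model: str) -> dict:
    """Split a model name such as 'XT5' into network and processor generation."""
    if len(model) < 3 or model[0] != "X" or not model[2:].isdigit():
        raise ValueError(f"unrecognised model name: {model!r}")
    return {"network": NETWORKS[model[1]], "processor_generation": int(model[2:])}


def upgrade_network(model: str) -> str:
    """Swap only the network letter; the processor number stays the same."""
    describe(model)  # validates the name first
    return "XE" + model[2:]
```

With this sketch, `upgrade_network("XT5")` gives `"XE5"`, which mirrors the Oak Ridge example in the answer.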
Primeur magazine: And there is also no reason to have fancy names?
Barry Bolding: We do not give the systems names. The customers give them names. So Jaguar is Oak Ridge's name for their system. There is good and bad to that. It is good that the customer has the flexibility to give their system a name, and perhaps do the front in a design that represents the name. The bad is that the brand does not get exposure: it is really the XT5, it is a Cray XT5. But we think it is more important to give the customers something that gives them a feel of ownership.
Primeur magazine: Because it is only a relatively small number of customers?
Barry Bolding: You know, our naming system is typically numbers, letters. We do not typically use names. It is usually X1, X2,..., so the XE6 was just a natural follow on.
Primeur magazine: But it is, of course, the technology that is more important.
Barry Bolding: Yes, the new network is really critical. It is the first time we changed the network in over 5 years. We have been using the SeaStar network since the XT3 when it came out with the Red Storm project. And that network has done very well for us. It has competed very well over a five year period with Myrinet, Quadrics, and Infiniband. But it was time to move to the next generation that would enable multi-core support. In the multi-core era, you need an interconnect that can sustain very large message rates, because every node may have 24 cores, and those 24 cores may all be trying to send messages at the same time. So message rate is very, very important. The message rate is about 100 times better in our new network than it was in SeaStar.
Primeur magazine: With message rate you mean message throughput?
Barry Bolding: Small message throughput is the message rate. The important parameter here is how many small messages, even single word messages, you can do: simply doing a gather or put, or a SHMEM or an MPI. There are all these different programming models one has to support. And even on Infiniband you sometimes need to aggregate small messages to get good bandwidth. You need to package up many small messages and then send them as one large message. And that kind of aggregation slows things down. With our network we do not have to do aggregation. We have special hardware that allows single messages to travel very quickly.
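To make the aggregation idea concrete, here is a minimal software sketch of what "packaging up many small messages" means. It is an illustration only: the `Aggregator` class, its `threshold` parameter and the 8-byte word size are assumptions made for the example and do not describe any particular MPI, SHMEM or Infiniband implementation.

```python
from typing import Callable, List


class Aggregator:
    """Buffers small messages and sends them as one large message."""

    def __init__(self, send: Callable[[bytes], None], threshold: int = 4096):
        self.send = send            # function that puts one message on the wire
        self.threshold = threshold  # flush once this many bytes are buffered
        self.buffer: List[bytes] = []
        self.size = 0

    def put(self, msg: bytes) -> None:
        # The message is only queued here; it leaves when the buffer is full.
        self.buffer.append(msg)
        self.size += len(msg)
        if self.size >= self.threshold:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.send(b"".join(self.buffer))
            self.buffer.clear()
            self.size = 0


sent: List[bytes] = []
agg = Aggregator(sent.append, threshold=32)
for i in range(10):
    agg.put(i.to_bytes(8, "little"))  # ten single-word (8-byte) messages
agg.flush()                           # push out whatever is left
```

In this run the ten words go out as three wire messages, and the first word has to wait until the fourth arrives before anything is sent. That waiting is the cost the answer refers to when it says aggregation slows things down.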
Primeur magazine: You access the chip with the interconnect, not the individual core?
Barry Bolding: Yes.
Primeur magazine: So how many cores can that reasonably support?
Barry Bolding: That depends on the processor: we have a two-socket compute node. So it is either an eight core AMD or a twelve core AMD. With two sockets there are 16 or 24 cores trying to talk out through the interconnect. We have a couple of customers that will have the eight core parts so that will be 16 cores trying to talk out, but most have the 24 cores.
Primeur magazine: And the architecture itself?
Barry Bolding: The architecture is very similar to the current XT5. The topology is a 3D torus.
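For readers unfamiliar with the term, a 3D torus is a 3D grid whose edges wrap around, so every position has exactly six nearest neighbours. The sketch below shows the wraparound arithmetic; the function names and the 4 x 4 x 4 size are hypothetical and say nothing about how a real Cray system is dimensioned or routed.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]


def torus_neighbors(node: Coord, dims: Coord) -> List[Coord]:
    """Return the six nearest neighbours of node in a 3D torus of size dims.

    Assumes every dimension has at least three positions, so that the
    two neighbours along an axis are distinct.
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            c = list(node)
            c[axis] = (c[axis] + step) % dims[axis]  # wrap around at the edges
            neighbors.append(tuple(c))
    return neighbors


def hop_distance(a: Coord, b: Coord, dims: Coord) -> int:
    """Minimum hop count, taking the shorter way round each ring."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))
```

Because of the wraparound, opposite corners of a 4 x 4 x 4 torus are three hops apart rather than nine.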
Primeur magazine: What is the maximum number of cores it could support in total?
Barry Bolding: It could easily support the next generation of AMD parts, which are code named Interlagos. That is estimated to be a 16-core processor. Placing these new chips will be a simple processor swap. We can easily support that many cores.
Our next generation of interconnect will be a whole new infrastructure. It will really be a new generation system. This new generation is part of the DARPA HPCS programme. That will be the time when we will actually change the whole infrastructure, the cabinets, everything. We have tried to keep the cabinets in a sort of blade compatibility for the last five years and we will have blade compatibility for another couple of years, but some time in the 2012 - 2013 timeframe we will go to a whole new cabinet, a whole new infrastructure and a whole new network. That network will be very similar to this new network that we are releasing today. So a lot of the software work we are doing today will be applicable to that new network. They are all what we call high rate networks and that is the type of network we really are pushing over the next couple of generations. Today it is all electrical and in our future systems it will be partly optical and partially electrical and eventually it will probably be all optical as we move towards Exascale.
Primeur magazine: So the interconnect is the main new property?
Barry Bolding: That is the primary new property. It has roughly three times better latency than the current SeaStar and 100 times better message throughput. The bandwidth is really limited by the HyperTransport link of the AMD processor. That has a certain bandwidth, just like PCI has a certain bandwidth. That is the limitation we call the injection bandwidth. But once you have that on the network, the network can sustain very, very large messaging rates. You can literally have billions of messages per second. Each ASIC, each network chip, can support about 150 million messages passing through the chip. We designed the network topology to be just like today on the XT5. The network is integrated with the blade. So you do not have any external switch cabinet or external switches. Every switch is part of the actual whole system.
With Infiniband or Gigabit Ethernet you typically have leaf switches and spine switches, and you have to have external switch boxes: a sort of configured network in the centre of a large system. We design our network directly into the cabinets and into the blades. That is very reliable, and we have a bunch of new related features: adaptive routing and also what we call link level reliability. We can support end-to-end reliability, but we also support checksums between every two Gemini ASICs - we call the network chips Gemini - that is, between every two network places on the torus. Whenever a message goes from one to the other, a checksum is verified to make sure that all the information is correct. We support that throughout our system. It is very important for very large systems to have that feature, so that you do not have to resend too much data all the way from one side to the other.
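The benefit of checking at every link, rather than only at the destination, can be illustrated in software. This is a sketch under stated assumptions: the interview only says "check sum", so the use of CRC-32, the retry limit and all function names below are choices made for the example, not a description of the Gemini hardware.

```python
import zlib
from typing import Callable, List

Link = Callable[[bytes], bytes]  # a link takes a frame and may corrupt it


def add_checksum(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 (an assumed checksum) to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def verify(frame: bytes) -> bool:
    payload, crc = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == crc


def send_over_path(payload: bytes, links: List[Link], max_retries: int = 3) -> bytes:
    """Forward a frame hop by hop, retrying only the hop that failed."""
    frame = add_checksum(payload)
    for link in links:
        for _ in range(max_retries + 1):
            received = link(frame)
            if verify(received):
                break  # this hop succeeded, move on to the next link
        else:
            raise IOError("link failed after retries")
        frame = received
    return frame[:-4]
```

A corrupted frame is caught at the hop where the damage happened and resent over that one link only, instead of travelling the whole path again.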
Primeur magazine: Are there also changes in the software system too?
Barry Bolding: In April we announced CLE 3 (Cray Linux Environment), which will also be available and supported on the new system. So we get the cluster compatibility mode for the ISVs, we get extreme scalability and there are some new features specific to the network in the software. There are also features that allow us to hot swap blades. So basically, when we have a problem on a blade, we can tell the network that the blade is about to be removed. It will then quiet down the network and let you remove the blade. Then it will start up the network again, and the messages keep traveling around the system. So applications will not fail. An application running on some other part of the system will not be affected at all by pulling a blade out. That will really help to keep the availability of the system high, so the users are not impacted. That is an important new feature in which the operating system works together with the network.
Primeur magazine: What is the MTBF of the system?
Barry Bolding: The MTBF is very good. The reliability has gone up steadily over the last 18 to 24 months, and typically on most of our systems we are now at 99.5% to 99.9% availability. A small system, a mini system with four or five cabinets, can easily go a year without having a failure. In fact we had a four-cabinet system that was in the middle of a test for the last sixty days, and it had about 14 minutes of down time in these sixty days. It basically had a node issue. So that one node had to be serviced.
So we have a very high reliability and availability and, even better, our systems go into the centre extremely quickly. In fact one of the biggest compliments that a customer made during the last couple of days was that they are so happy with their last installation of a fairly large system of about 15 to 20 cabinets: it was installed and up and running in less than 30 hours. That is from the day it rolled in the door until the day they turned it on for their users. And that is a pretty large system: several hundreds of Tflop/s, and basically up and running in a day and a half.
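The figures in this answer can be turned into each other with simple arithmetic. The short sketch below does that; the function names are hypothetical, and it assumes that availability simply means the fraction of wall-clock time the system was up.

```python
def availability(downtime_minutes: float, period_days: float) -> float:
    """Fraction of the period during which the system was up."""
    total_minutes = period_days * 24 * 60
    return 1.0 - downtime_minutes / total_minutes


def allowed_downtime_hours_per_year(avail: float, days: float = 365) -> float:
    """Hours of downtime per year that a given availability corresponds to."""
    return (1.0 - avail) * days * 24


# The sixty-day test mentioned in the interview: about 14 minutes of downtime.
test_run = availability(14, 60)
print(f"sixty-day test: {test_run:.4%}")
print(f"99.5% allows {allowed_downtime_hours_per_year(0.995):.1f} hours per year")
print(f"99.9% allows {allowed_downtime_hours_per_year(0.999):.2f} hours per year")
```

Under that assumption the sixty-day test works out to roughly 99.98%, and the quoted range of 99.5% to 99.9% corresponds to between roughly 44 and 9 hours of downtime per year.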
Primeur magazine: Why is this all so reliable? It is all chips and connectors and these things fail and the more of them you put into a cabinet the faster it will fail.
Barry Bolding: Absolutely. Cray has learned, and has become better and better and better, at diagnosing what types of failures are about to occur and understanding them. We actually run a service called node care and when a job fails, so let us say when a job is running, and a job fails for an unexpected reason, our operating system will detect it is an abnormal failure and will actually run a quick check on all the nodes that job was running on, and if that check recognizes any of the common problems that we typically see, for instance, something like a type of memory issue, it will flag that node as needing service. This actually helps us tremendously with our system reliability. Actually we become better and better at it.
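The workflow described here (abnormal job exit, quick check on the job's nodes, flag any node that shows a known problem) can be outlined as follows. Everything in this sketch is hypothetical: the function names, the node identifiers and the single memory check are placeholders for illustration and are not taken from Cray's software.

```python
from typing import Callable, Dict, Iterable, List

Check = Callable[[str], bool]  # returns True when the node passes the check


def nodes_needing_service(job_nodes: Iterable[str],
                          checks: Dict[str, Check]) -> Dict[str, List[str]]:
    """Run every check on every node and collect the names of failed checks."""
    flagged: Dict[str, List[str]] = {}
    for node in job_nodes:
        failed = [name for name, check in checks.items() if not check(node)]
        if failed:
            flagged[node] = failed
    return flagged


def on_job_exit(exit_was_expected: bool, job_nodes: Iterable[str],
                checks: Dict[str, Check]) -> Dict[str, List[str]]:
    """Only an abnormal exit triggers the quick check on the job's nodes."""
    if exit_was_expected:
        return {}
    return nodes_needing_service(job_nodes, checks)


# Placeholder check: pretend that node "n2" has a memory problem.
example_checks = {"memory": lambda node: node != "n2"}
flagged = on_job_exit(False, ["n1", "n2", "n3"], example_checks)
```

Only the nodes the failed job ran on are examined, which keeps the check quick and leaves the rest of the system undisturbed.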
What you find in supercomputing today is these large systems having a very hardened operating system and a hardened file system. You know, we worked a lot with the Lustre stack, to harden it and to really test it in a very defined way for our customers. It is not the same as taking Lustre from the open source and then running it. People think that it is easy to do that, but it is not.
At the HPC Advisory Council meeting this week, there was actually a talk about the JuRoPa system. And there is another system in Juelich, the HPC-FF. They basically took some systems from Bull, some hardware, some open source stuff, and they started to build this. They showed the reliability curve of the system, and what you saw was that for six to nine months, the system availability was very bad. It took six to nine months to iron out all the issues that were present on the system. I think today, when Cray delivers something, it is already at that 99.9% availability: it is hardened and tested, and the software is very mature. We do all the testing and integration in a way that most vendors today just do not do. We control every aspect of the software and the infrastructure. Even though we are getting some commodity parts, like for instance the cables, we test every aspect of it. We are really dedicated to a fully integrated system. You just do not find that anywhere else. There are a few companies trying to do that, but you know, that is one of our main values.
Primeur magazine: Can you tell a little bit about the ideas you have to go to Exascale?
Barry Bolding: There will be some very large initiatives that drive us to Exascale. Now there are some huge hurdles too. A few of those hurdles cannot be overcome by market forces alone, because the market may drive the processor to a sufficient level, but the market may not drive the system infrastructure. Power and cooling: the market will drive that to reach Exascale. But memory is an area where I think the market itself probably would not get to the point where you would have very low power and enough memory close to the processor to reach Exascale productively. So you are going to need governments working with private industry and you are going to need some big initiatives that drive those technology areas where the market forces themselves would not be enough.
Primeur magazine: You work together with the US projects, of course?
Barry Bolding: Yes, we are involved in many projects. There are several in the US that we are in discussions with, and talking to partners about how to reach Exascale.
Primeur magazine: Are you also involved in PRACE?
Barry Bolding: Yes, we typically work very closely with PRACE. In fact we have been invited to come out and talk about cooling in the October meeting of PRACE, because we actually are pretty innovative in the power and cooling domain. We also have systems that are part of the PRACE testbed, including the Finnish CSC machine, and that is actually a couple of years old, but it is a good machine for PRACE to test on. We also have an Exascale initiative at Edinburgh University on applications. So we have different things we are doing in Europe with PRACE and with the Exascale initiatives. But I think we need somewhat closer ties. You know, the European initiatives on Exascale are, I think, just ramping up. I would like us to work closely with some of the PRACE initiatives. But right now it has mostly been advisory, and working through the PRACE sites.
Primeur magazine: Do you think that will also help in Europe? Because I mean Europe is not very good at hardware.
Barry Bolding: IDC this morning had some interesting things to say about it. They are doing a study with the European countries on how Europe can be more competitive. They have some public data. They are still gathering data. They are going to put out a report in August/September. I think the strength right now, is that Europe could develop into a HPC manufacturer - certainly could - but its strength today is really great software. Some of the best and innovative software: Allinea, Caps, some of the cluster management software, some of the platform work. There is a lot of innovative software work being done in Europe. I don't think that even the USA has everything to do Exascale on its own. So even if they have different initiatives in the different continents, they probably need some high level of collaboration to say: well you in Europe in your Exascale initiatives, you concentrate on the value added to the tools and software, and in the US initiatives, you really work on memory technology or system integration or chip design. I think if that works out, we can amortize some of the costs that Exascale will require. But if Europe tries to do everything, and the US tries to do everything, and China tries to do everything, I do not see that as a big benefit for everybody.
Primeur magazine: I think one of the problems with the designing large systems, is that you design them for applications that are not there yet. The benchmarks are always with applications that are already existing. So how do you see that can be best approached?
Barry Bolding: Well then, there are a couple of areas. We do believe that Exascale will probably be some type of more heterogeneous architecture that involves some type of acceleration closely tied to some type of good serial performance. You need to have both to have a good Exascale processor. The fact that it has some heterogeneous units, that it has some accelerators of some type built into it, means that you do have to have very good compilers, and you have to have very good ways of telling the compiler where the application has parallel properties.
I think the community really has to step up and improve the programming models that we use for telling compilers where parallelism exists. OpenCL is a good standard, but it is very, very immature. It needs to be developed even more, and I do not think the members of OpenCL have really put the energy into it that it deserves. OpenMP is a possibility. But OpenMP is really not designed for the level of threading in parallelism that accelerators have. So that would have to be worked on. Cray is actually on the OpenMP committee, and there is a subcommittee on acceleration. Cray is a co-chair of that subcommittee. We are working in that area. So I think that compilers, tools, and the community that gets a standard for parallelism and threading, a community that has got a standard language, or at least a programming model, for parallelism, is really important. Otherwise, we are going off into ten different directions, trying to hand code applications.
Primeur magazine: It depends, of course, also on the design of the machine. So do you think, you will have the next generation of machines to design them hand in hand with the software?
Barry Bolding: It is hard to say, but I think that there is a common vision if you look at the long term road maps of NVIDIA, Intel and AMD, and if there were other players, I cannot think of any other, but perhaps IBM, I add IBM to that. At least publicly, they all are thinking about ways to couple acceleration with serial or scalar performance, and it is almost like the old vector days, in some ways, and so there are similarities, and that we already know, we have been through the vector days. But now an accelerator is more like a vector, but it has got the power of the mass market behind it, because it has a much broader base. The challenges that vector had, are still going to be there, in the accelerator, even if they are less expensive. But there are challenges we could overcome. There were ways to let the compiler know more about vectorisation, so I think we have some directions. I think there has to be some programming paradigm that would have to work on all those different architectures. I think the market will push those in that direction. If they are too proprietary, they cannot participate, a solution that is too proprietary, only has one language or one way of programming, I do not think that will work out.
Primeur magazine: But it is still: "The market decides". But not really, because as you said, a lot of development will be driven by government funding.
Barry Bolding: Yep, some of it.
Primeur magazine: And that is not market driven, but more committee driven. It is committees that will decide. Will all those committees and rules also lead to some kind of market you think? With good realism?
Barry Bolding: I think it will. Governments can always make the stakes in the way they organize the programme. They could say: we are going to give all the money to one Exascale processor vendor. The advantage is that you do not spread yourself over several technologies. But the disadvantage is that you could go into a direction that the market does not want to go. Fortunately, there are some internal pressures within those companies to do what the market wants. The HPC market is relatively small. It is not large enough to drive any chip manufacturer to do a design just for that space. It would have to be a derivative of something else. You might have a few or a set of technologies built into a processor or a chip that were specific to HPC, but the HPC market alone would not be enough, so they would have to base it upon a chip and amortize the R&D. That is what we are doing today with GPUs. They are amortizing what they are doing for a graphics processor. They are changing it a little bit, and they are providing a few features that are specific to HPC. I think we can do that for Exascale. Maybe not, Exascale is going to be so different. But I do believe that we can and I think that the market will be able to influence even the governments on this.
Primeur magazine: Could you tell a little bit about the current and latest customers?
Barry Bolding: With the announcement of the XE6, we announced about 200 million dollars in orders. Most of those have already been press released. So we have a very large backlog. They include customers worldwide, including EPCC in the UK, with the Hector systems; the National Oceanic and Atmospheric Administration (NOAA) in the US; the US Department of Defense; the National Nuclear Security Administration (NNSA) in the US; and the Korea Meteorological Administration. That mix is really unique, because in that customer spectrum there is climate, there is production weather forecasting, there is nuclear stockpile modelling, there is hardcore fluid dynamics and physics modelling. And Hector is open science. So in our customer base, in the backlog for this new product, almost every type of HPC that exists today is included. Probably the only thing we are missing is some data intensive types of applications, which the MPP is not particularly suited for anyway. So it really is a broad customer base. And Cray has grown its customer base tremendously over the past four years. With this Gemini network we are growing, and with the mini programme, the mid range supercomputer centre, we actually sold to a bunch of universities, including the University of Duisburg-Essen, which is on the TOP500. They have an XT6M, and we are announcing tonight that KTH, the Royal Institute of Technology in Sweden, has a new XT6M.
Primeur magazine: Are those systems also sold through resellers?
Barry Bolding: No, we do not sell that system through resellers. We sell the small systems through resellers, the CX.
So we roughly doubled or tripled the number of customers we had over the last 12 to 18 months. And the unique thing is that we are shipping more cabinets this year than in the year we shipped the Oak Ridge system. In that year 200 of those cabinets were going to one customer. This year, we have a huge number of customers and no one customer is getting more than about 70 cabinets. We have diversified tremendously, and I think that is attributable to the reliability and the performance that people are seeing. We have a lot of customers coming back for more systems or for upgrades, and that is attributable to our service and maintainability: they are happy with what they are getting. Oak Ridge is actually hosting three different systems for three different customers. One is the University of Tennessee, one is NOAA, and one is the US Department of Energy's Jaguar system. You can talk to the people at Oak Ridge and ask: the reliability and the serviceability are key to their systems.
Primeur magazine: Is there anything you can say about future customers?
Mick Davis: In our last press release we talked about two systems: One was KTH; the other we cannot provide details about yet.
Primeur magazine: Are you also satisfied in Europe?
Barry Bolding: This has been a good year for Europe, because we have the Hector machine, we have several of these mid range systems, so Europe is very strong for us. Typically 20-30% of our revenue comes from our European subsidiary. So it is a great team here. We have the centre of excellence in Edinburgh. We have a centre of excellence at CSCS in Switzerland, and we have a large technical team, so we really have a broad team here in Europe and it helps us to do the benchmarks, the service and the interaction with the customers. So it is a pretty strong area for Cray. And I expect us to be a very strong contender for the large bids that come up in Europe. There are a number in the next 12 to 18 months. There are weather centres, academic and research centres and Cray is committed to participating in all of those procurements.
And we are hoping to win. We have a couple of features that are unique that will help us. For instance our network has global address space, and today that is unique to Cray, it is very much like the old T3E, that had a global address space where a job could address any remote memory location at any remote node and not interfere with the operating systems on the two nodes. And this is the first time we actually have begun to bring that back into our product, and it will be in all our future networks, it is the hardware supported network to do that.
Primeur magazine: Thanks for sharing your thoughts with us.