24 Sep 2013 Heidelberg - At the ISC Cloud'13 conference held in Heidelberg, Germany, on September 23-24, 2013, Primeur Magazine had the opportunity to talk to keynote speaker Jason Stowe from Cycle Computing about the recent exciting developments that will take HPC in the Cloud a giant leap further.
Primeur magazine: Here at the conference the focus was on HPC in the Cloud. Conference chairman Wolfgang Gentzsch was saying that HPC Clouds lag two years, and even more, behind enterprise Clouds, which sounds strange, because HPC is at the forefront of IT development, even beating Moore's law. Do you agree that HPC in the Cloud is lagging behind?
Jason Stowe: I think HPC Cloud adoption, in comparison with enterprise Cloud adoption, is, as Wolfgang pointed out, definitely two to three years behind. The earliest users we saw were start-ups, mostly web technology companies. That makes sense: there are not a lot of start-ups doing a ton of HPC right off the bat. When EC2 was launched in 2006, there were a lot of start-ups doing social media and various forms of Big Data and web analysis, so the Cloud definitely took off for those reasons. And enterprises started adopting because the value had already been demonstrated; the tip of the spear had already gone through, and enterprises followed behind that.
At Cycle Computing we were essentially the first commercial software company providing software to run HPC workloads on Amazon Web Services (AWS), back in 2007, about six to nine months after AWS was launched. So we have seen the whole progression. The first real commercial user came in 2008, but we could see very early on that a good chunk of the market - the high-throughput part - was going this way, and so we stayed with it. 2008 and 2009 were early days; we got a lot of early adopters. In 2010 and 2011 we were crossing the chasm, to follow that market-adoption pattern.
In 2011, 2012 and 2013 we have seen a lot more mainstream usage, and pretty much everybody is considering it now. The stats that Wolfgang cited from IDC about how sites adopt show that about 25% are currently in a hybrid model. Everybody else is looking at how the Cloud impacts the HPC environment they have in-house. I think that is what we see in the market too. There is definitely a shift now, and the wind is at our backs a bit in terms of usage. The thing is, I am not an enthusiast of a particular technology; I am an enthusiast of benefits. So what is exciting from my point of view is that today scientists can run workloads that have never been run before, and workloads that they could have run yesterday, but at greater cost or over a longer time, can now be run a lot more efficiently - even if the run time does not change, or is a bit slower.
The time to get the hardware is now 15 minutes, and the time to get rid of the hardware is 15 minutes, as opposed to a couple of weeks or a couple of months, and three or four years of depreciation. That is a tremendous advantage: being able to relinquish the hardware is actually just as important as the acquisition time, because it allows you to experiment. You can write a set of algorithms and deploy the hardware in the manner you think fits best, right now. But next month, if you come up with a better way of running your trading algorithm, your life-science workload, your manufacturing simulation or the set of workloads around it, then you change the hardware, and you do not have to wait out that equipment for three to five years. Which is kind of a big deal.
So, to get back to the original question, I think it does lag the web, but I am a fan of the benefits, and I think the benefits have become very clear, because there are a number of cases now. We have done some large ones, with 10,000 servers, but there are equally a number of use cases where someone borrowed forty cores for three hours and then turned them back in. There is some clear benefit now, and that is what caused the lag to close. The benefits that were clear for the web earlier are now very clear in the HPC environment. So it has shifted a bit.
Primeur magazine: So with Cycle Computing you provide, one could say, a kind of HPC management software on top of AWS? Is that correct to say?
Jason Stowe: We do not really have a management interface. We have orchestration software, and it does everything around moving workloads: moving the data associated with the workload, building the environment on the remote compute resource - whether that is your internal VMware farm or AWS as an external provider - deploying the computation after that, and all of those features. We have a number of patent applications and other IP essentially around managing those processes. Orchestration is really what we call it. It automates the entire process of submitting a workload, and has the forethought to ask: shall I run this on my internal environment, or should I run it externally?
In the event that I do not have an internal environment, I run it all externally: provisioning what is necessary externally, handling and automating the security, and handling and automating the error handling, because on the Cloud you are basically doing a burn-in every time you acquire a server. Right upfront you do the installation of the OS, then the burn-in process. All of that happens in the first 5 to 10 minutes you have the server.
So you discover error conditions at that point as well; we have a lot of software dedicated to that process. That is the orchestration piece: it really sits outside the scheduler. We are agnostic to which scheduler you use - PBS, LSF, Condor, SGE, all of these are usable. Our software is not the AWS console; it is not about managing your AWS instances, your VMware instances, or your bare metal. It is about workload orchestration: moving the workloads around and enabling scale.
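As a rough illustration, the internal-versus-external decision Stowe describes could be sketched like this (a minimal sketch under stated assumptions; all names are hypothetical and not Cycle Computing's actual API):

```python
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    cores_needed: int

def orchestrate(workload, internal_free_cores, provision_external, submit):
    """Pick a target environment for the workload, then submit it.

    provision_external and submit stand in for the cloud-provisioning
    and scheduler-submission steps described in the interview (staging
    data, building the environment, burn-in checks, and so on).
    """
    if workload.cores_needed <= internal_free_cores:
        target = "internal"  # enough local capacity: run in-house
    else:
        # Not enough local capacity: burst to an external provider.
        target = provision_external(workload.cores_needed)
    return submit(workload, target)

# Example: a 500-core job with only 100 free internal cores bursts externally.
job = Workload("drug-design-sweep", cores_needed=500)
result = orchestrate(job, internal_free_cores=100,
                     provision_external=lambda n: f"cloud-{n}-cores",
                     submit=lambda w, t: (w.name, t))
print(result)
```

The point of the sketch is only the decision step; the real orchestration, as described above, also automates data movement, security, and error handling around it.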
Primeur magazine: So you separately start up your AWS instances, etc., and then you start using Cycle Computing?
Jason Stowe: We have a set of head nodes that you run our software on - either in-house or on AWS - and this is capable of orchestrating an entire cluster, or many clusters, externally. So we have some users that on a day-to-day basis use ten small clusters for different users. Other folks have one giant cluster running tens of thousands of cores on a regular basis. Life Technologies is an example. They gave a talk at Amazon re:Invent last year about the analysis back-ends for the Ion Torrent genome sequencers, the highest-volume-selling genome sequencers today.
They are basically doing next-gen sequencing using silicon technology rather than the traditional approach of dyeing DNA and reading out the colour, fluorescence, etc. So essentially, for Ion Torrent and its back-end they are using our software. The clusters together are up to several thousand cores in size, depending on the user requests, and in that sense it is multi-tenant. Users upload sets of data and request some analysis; a web application sits in front of the cluster and basically returns the answer to the user. Our software sits on a dedicated server, but it is deploying an entire infrastructure on behalf of the users' applications.
Primeur magazine: So if you have local clusters, does it matter what type of clusters they are?
Jason Stowe: We have done a number of use cases locally. We started out in 2005 with our first customers, including JPMorgan Chase and Lockheed Martin. In 2005 there was no Cloud, so all of our software was managing workloads and environments on internal computers. Then, as usage progressed, the Cloud came out in mid-2006, and we recognized that this was an opportunity to solve the peak-versus-median use case: I want to do peak in the Cloud, because it is flexible - the local cluster is too small when I need it most and too big every other time. So far for that use case.
And then in 2009 and 2010 we saw people asking for more and more cores in their environment. They oftentimes started exceeding the core counts they had locally, and what we realised is that you can actually put together really large environments that would never have been possible before, because you cannot have thousands of cores sitting and waiting to be used. Amazon takes care of that problem for us, by amortizing the utilisation risk across all their users.
So that was the utility-computing notion of HPC. The thing you can do now that you could not have done before is really large environments - even if it is just for a couple of hours. I think one of the best examples is the run of 10,600 cores that did about 340,000 hours of computational drug design, which is 39.5 years of computing. Instead of spending 44 million dollars on hardware to deploy that, it was essentially run in 11 hours for 3,472 dollars.
But there are other use cases. I think the tightest time window we had was when we spun up a 50,000-core environment - 6,700 servers - and once it got started, it ran a workload and turned itself off in three hours; well, actually 2 hours and 50 minutes. That was probably the tightest-timeline workload. The company that accomplished this - Schrödinger - wanted to see if they could give a drug designer something that would return a result in an afternoon. So they ran this massive workload in an afternoon, something that was not possible before. Those are the things that are definitely exciting for us, and they also illustrate the kind of activities you just could not do before.
Primeur magazine: What are your next steps?
Jason Stowe: There are a few things. We will always be doing larger workloads. We also have a lot of technology for enabling really small workloads very well, and we are starting to help a lot of ISVs with Compute-as-a-Service. We have a few partners already that we can help out; today we have models with Schrödinger, for instance. An upcoming thing that is interesting for us is interactive workloads.
So there are a lot of things that are going to be exciting. One is putting scientific applications on mobile devices - iPads, tablets essentially - in front of dynamically created, very large clusters in the Cloud, because it is now cost-effective to do so. It is about 0.9 cents per instance-hour, 0.8 cents for spot instances, for example, on AWS. A 10,000-core cluster basically costs between 90 dollars and some 180 dollars per hour, depending on what processor type and how much RAM you are using. I think there is a lot of undiscovered country in that area: connecting up people's brains, with interactive access to very large compute capacity, holds a lot of potential. That again is an area where we really have not had this ability before, because normally, if you have a resource that large, you have lots of batch users going in and using it. It would be unheard of to set it aside for one person to run for a few hours and then turn it off; that really does not fit the traditional usage model. So I think that is another exciting area where there will likely be a lot of interesting work done that would not have been possible before, because now the model is different.
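The quoted cluster cost can be checked with back-of-envelope arithmetic. This sketch assumes the interview's prices are roughly per core-hour (the quoted 90-to-180-dollar range then corresponds to about 0.9 to 1.8 cents per core); real AWS pricing varies by instance type, region, and spot-market conditions:

```python
def cluster_cost_per_hour(cores, price_per_core_hour):
    """Hourly cost of renting `cores` cores at a flat per-core-hour rate."""
    return cores * price_per_core_hour

# Ballpark figures from the interview, assumed to be per core-hour.
low = cluster_cost_per_hour(10_000, 0.009)   # ~0.9 cents per core-hour
high = cluster_cost_per_hour(10_000, 0.018)  # ~1.8 cents per core-hour
print(f"10,000-core cluster: ${low:,.0f} to ${high:,.0f} per hour")
```

The same arithmetic scales linearly: at these rates even a 50,000-core environment costs on the order of hundreds of dollars per hour, which is what makes the afternoon-sized experiments described above affordable.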
Primeur magazine: Are the networks fast enough to support that?
Jason Stowe: So far we have had a lot of workloads. We have one customer right now that uploads 1.2 Petabytes onto Amazon Glacier, and I think the advantage for them is that the bandwidth is good enough now, and it is getting better. There are certain geographical areas where getting good bandwidth is difficult, but in my area, for instance, I can upload a Terabyte a day on a 60-dollar subscription.
I think other future things that are going to be interesting are advances in the technology for deploying Clouds itself - for instance, enabling InfiniBand access in OpenStack. One of the advantages of the public Cloud is scale, but at the same time we see users trying to deploy OpenStack for heavy use. So it will be interesting to see how that progresses as the years go forward.
Also, I think a lot of work needs to be done with the software providers to enable people to consume their applications. A lot of the providers are obviously concerned about usage-based pricing. As much as everyone would like to get everything for free, I actually take a slightly different view on this: I actually want the provider who gives me valuable software to make money, because I want him to further develop the software that I use.
So I think there is a balance, and my instinct is that a phenomenon like the one in the UK in the 1860s - the Jevons paradox - will repeat itself: as the efficiency of turning coal into steam became higher, and the cost of turning coal into steam became lower, the usage of coal did not go down; it went up, because it became cheaper and people could do more. And when people realised they could do more, usage went up again. It is the same with the number of cans of soda sold after vending machines became common: it went way up, because access was that much easier - you could just throw a few coins in the machine and get another bottle. I think that is going to happen with HPC software usage as well: when you move to a usage model and you lower the barrier, you do not necessarily keep the same access pattern. My guess is that over time usage will increase, because (a) people that never considered using simulation will now be able to, and (b) you will have communities like the Maker movement, where we are doing a lot of 3-D printing.
There is going to be an extra zero on the number of CAD/CAE designers, because everybody that is doing 3-D printing in some form or another will have access to CAD tools to create designs. So it will be an interesting time, I think. We are in one of those exciting periods where the mixture is right: there are a lot of interesting tools now available - GPUs, what is happening in the Cloud, 3-D printing, etc. It is definitely a thrilling time for manufacturing. Nothing but good things.
Primeur magazine: What about things like running highly parallel codes - do you think that public Clouds like AWS will adapt to that?
Jason Stowe: I do not know. My instinct is that capability machines are there for a reason, so my guess is that, especially in the short term, they will not go away by any stretch. That being said: if you look at the capabilities of x86 processors in the eighties and nineties versus RISC processors, essentially Intel's capabilities, as well as AMD's, were lower than the RISC processors'. If you had asked people running supercomputer centres at the time, "would you need desktop processors in your supercomputer?", they would have said "no". And now the majority of servers run a flavour of Intel. Many things in technology have this profile, where the initial capabilities are lower but the rate of increase is higher.
So the question is: will the rate of increase be higher for the Cloud? I do not know the answer. I do know there are a lot of workloads that today run very well there, but there are a lot that run better on bare metal. My guess is that over time that will blur a bit. That is not to say that capability machines are going anywhere. Actually, I would recommend thinking about large Cloud environments - these 50,000-core clusters and 10,000-server clusters - as just another capability machine, with different capabilities: it is not about a fast interconnect, it is about inexpensive access to massive core counts, a high-throughput orientation. That is the way I think about it.
Primeur Magazine: Thank you very much for sharing this with us.