Combustion has been around for 100 years. Yet, our modern society still depends heavily on combustion for our energy needs. What Jacqueline Chen's team likes to do with exascale combustion simulation is to be able to predict the behaviour of both the conventional petroleum-based fuels and the bio-fuels in different combustion environments and realistic operating conditions of high pressure and high temperature intense turbulence conditions. The researchers need to do this in order to develop both the co-design of fuels, custom fuels optimally tailored to advanced engines that work efficiently and clean. The researchers also need to do this to extend the longevity of the use of fossile fuels and to reduce emissions from CO2 greenhouse gases. From the US there are coming a number of government mandates, including the CAFE standard of 54 miles/gallon by 2025 and the reduction of 80 percent in greenhouse gas emissions by 2050. Within those kind of short timeframes, there may be only a limited opportunity for several design cycles and having computation in the mix to augment experimental campaigns is very useful in the short amount of time that these devices can be brought to market, Jacqueline Chen explained.
In her presentation, Jacqueline Chen wanted to focus on some high fidelity first principles continuing direct numerical simulation methodologies that her team worked on. With very high chemical fidelity the team wanted to differentiate the different effects of fuels where there is very strong coupling between the chemistry and turbulence. The team wanted to resolve all of the mechanical scales from the largest and containing scales all the way down to where the heat is dissipated. The team still has to impose or incorporate models for the thermo-chemistry. Therefore, the team has to introduce chemical kinetics models and transport models. In addition with exascale computing coming online, the team can handle a greater degree of complexity in terms of multi-physics and begin to start looking at spray combustion, soot transportation, as well as thermal radiation. A lot of these practical combustors show very complicated flow fields, so the team also wants to look at items such as compression of a charge by piston moving, swirl-stabilized flames behind the gas turbines and even at high speed flows which are stabilized on cabin cavities, Jacqueline Chen stated.
These fundamental turbulence chemistry interactions are motivated by the design of advanced internal combustion engines as well as gas turbines for transportation and for power generation. The gasoline engines that people typically drive with, operate at a fairly high temperature. The trend of the automotive community is to go towards low temperature combustion in order to avoid both the soot and the nitric-oxide (NOx) islands in the combustion process. In these partially-premixed conditions, these regimes tend to by apparently mixed-mode. Often the pre-mixture burns into a reactive mixture that is oxidized to the local auto-ignition. There is still a great deal of fundamental knowledge to be learned about combustion in these extreme conditions. In that regime there are strong sensitivities to fuel chemistry, Jacqueline Chen told the audience.
Likewise in gas turbines, the trend is also to go to very lean conditions with premixes for the purpose of getting high efficiency and low NOxand also because of the increasing amount of renewables that are coming on the market. Especially in Europe, there is a big demand for fuel flexibility and operational flexibility to meet the demands of the fluctuations. In these conditions, the flow fields tend to be very complex and there are also issues with operability, including flame blow-off, flashback in the premixing section causing damage, dilution and radical depletion of the reaction zone, and coupling of the heat release with noise, sound and the chamber geometry through thermal acoustics. All these effects have to be understood. Predictive models are used in large eddy simulations that resolve some of the energy containing scales. The average Navier-Stokes algorithms which are currently used in industry need these models, Jacqueline Chen stated.
Combustion is multi-scale and multi-physics, like many other physics and engineering disciplines, with a large range of length and time scales, from the smallest combustor which is available with a diameter of 10 cm down to very small weight scales for soot inception at nanometers where a lot of turbulence chemistry interactions occur on a micron to centimeter scale. The chemistry complexity can amount to thousands of reactions in a reduced chemical model. The multi-physics complexity involves multi-phases of sprays and soot, and the occurrence of thermal radiation. These are all tightly coupled phenomena.
It is extremely difficult, if not impossible, in some situations to stick elements in a experimental programme and get detailed diagnostics in an engine environment of high temperature and high pressure. A lot of times, you can only get a very coarse resolution that doesn't resolve the fine-scaled physics. Combustion is one of the fields where chemistry has to be molecularly mixed before it reacts. You really have to understand combustion and its impact on efficiency to be able to understand what happens at the micro-scale. With computing at petascale level and now moving towards exascale, the researchers still can't resolve the full range of scales in a DNS sense. With high performance computing at petascale, Chen's team can resolve about 4 decades of scales directly. The researchers have the opportunity to have a moving window across the range of time and length scales to resolve some of the length scales, but the researchers have to model for the rest of it. The strategy with multi-scale modelling is to go from the quantum scales up to the device-level scale. The researchers use high performance computing to help perform direct numerical simulation (DNS) in some limited regime of time and spatial scales and then to develop new parameterizations that help them bootstrap that information upscale to the actual devices. Jacqueline Chen wanted to focus the attention on the continuum regime where she was looking at a direct numerical simulation and its input in the way of upscale modelling for large eddy simulation (LES) and for Reynolds-averaged NavierStokes (RANS).
At Sandia during the last couple of decades Chen's team developed a DNS code called SVD. SVD is a compressible reactive Navier-Stokes code. It solves reactive Navier-Stokes total energy species continuity equations using very accurate product difference methods that are 8th order in space and 6th to 4th order in time and also treats a range of different fidelities for the chemical models and transport models. This code has been refactored a number of times for various high performance computing platforms. Most recently, the code has been refactored from MPI to MPI/MP version. With the help of the Oak Ridge Leadership Computing Facility and their readiness programme for typing it has been refactored for a private-based approach in OpenHPC. In November 2015, through the co-design centre ExaCT, some computer scientists led by Alex Akin and his PhD students refactored the code in an asynchronous dynamic task-based model. The code now runs well as a consequence of these various optimizations on the GPU system, on Titan, on Piz Daint, on Keeneland, and various other places. It allows the researchers a little bit of a glimpse of what they might anticipate as helpful or useful as they move towards exascale.
To perform the kind of DNS with great detail requires a lot of CPU time. The researchers have been very fortunate to have this programme in the US called the INCITE Programme which allocates very large amounts of open science time at the leadership facilities Argonne and Oak Ridge. Most of the Chen team computations so far have been performed at Oak Ridge. With this kind of computing time at the petascale, the researchers can perform a multi-scale validation approach and do many experiments including first principles DNS and ambient to moderate pressures at moderate range around numbers with radical diversity viscous forces and with somewhat complex chemistry routinely order 40 to 50 species that are transported. As the researchers move towards exascale they anticipate going to much higher pressures in the order of 40 to 60 atmospheres, at much higher Reynolds numbers with much more complex chemistry and describing from the large hydrocarbon fuels and biofuels with order 100 to 150 transported species and increasingly complex geometry, handling some of these bodies, swirling types of flows and with multi-physics with spray, soot, and radiation. Currently, at petascale level, the researchers are looking at DNS in simpler canonical configurations where you can do many parametric studies. As they move towards exascale, they can start to make direct comparisons with laboratory scale planes and with isolated phenomena in order to integrate several muli-physics and more realistic system-level types of computations, Jacqueline Chen explained.
These are just icons of the kinds of DNS calculations that Chen's team has been able to perform over the past decade with HPC support looking at a range of fundamental templates chemistry interactions from local extinction, auto-ignition, plane stabilization, flashback types of problems, and pre-mixed flame propagation of structure. Jacqueline Chen showed an example to give the audience an idea of what is feasible now at petascale level. It was an example of lifted flame stabilization that you can see in a diesel jet. There is auto-ignition occurring to help stabilize the flame, as Jacqueline Chen showed in the accompanying movie. This is the type of information that you can get from an experiment. There is still a lot of missing detail that is not possible to get experimentally. Chen's team wanted to start to shed some light on this microscale detail by performing DNS. While the researchers cannot perform DNS of the full-scale diesel jet flame at the present time, they can look at some of the important features like the lift-off stabilization in a lifted DME jet flame at much more moderate pressures of 5 atmospheres which still represents some of the auto-ignition multi-stage physics that you would see in a diesel jet plane. There is a lot of structure and transient motion of the lift-off flame effected by the turbulence and also by the low temperature emission on which it stabilizes, as Jacqueline Chen showed in a second movie.
What Chen's team wanted to do with the DNS is to drill in and look at the lift-off branches and the stabilization point. From some calculations the team found that there is a pentabrachial flame structure of that lifted base which you can never measure experimentally. The team wanted to see if that kind of structure persists in this very high and intense turbulence environment. The information about the detailed flame structure was picked up and used by commercial software, the CEMA-based ARM implemented into CONVERGE using UDF. Through their user-defined function, the team implemented this chemical explosive mode analysis detection method of the plane to explain structure. This enabled them to adaptively apply more mesh locally in these very thin premixed structures to help them get better predictions of lift-off stabilization in a real detailed template. This kind of information is being passed on to CFD in terms of better predictions.
As exascale comes online, Chen's team would like to push up in pressure to look at reactivity, control, compression, and ignition where the team can vary activity and mixing in the engine and tailor it to the fuel to control the ignition and phasing with respect to the piston motion, as well as the soot formation and the burn rate. You don't want to burn it too rapidly so that you end up with a lot of NOx. To get there will require a lot more work in fine scale resolution, a bigger range of scales, longer simulations, more details in the transporting order number of species, including multi-physics, when you talk about staged combustors and gas turbines with various spray, as well as thermal acoustics to worry about. Because the I/O won't keep up with compute throughput at the exascale, the researchers are going to also want to include in-situ analytics and uncertainty quantification, Jacqueline Chen stated.
Jacqueline Chen went on to explain what her team is doing in the co-design centre and what some of the exascale issues are. At the petascale level, the team is limited by the peak clock frequency. Flop/s are the biggest cost for the system to optimize them for compute. The researchers also have a moderate, modest level of concurrency. So far, they have been relying on MPI+X as the programming model for locality. They have uniform costs within a node and for communicating across the entire machine. Memory scaling has been somewhat flat. It maintains the byte per flop capacity and bandwidth and more or less assumed a uniform system performance.
There are new constraints for the exascale level. The power is going to be the main design constraint. It is going to cost more to move data on a very small distance on a chip that will perform a floating point operation. Data movement will dominate. It will be necessary to optimize code to minimize data movement. There will be very large concurrency since the frequencies are fixed or slightly declined. The researchers will need exponential growth and parallelism with billions of cores and be able to reason about data locality and also possibly topology. Memory scaling is decreasing while compute time is growing two times faster than capacity or bandwidth. There is no global hardware cache coherence. Heterogeneity is about everywhere, not only in the software but also in the architecture. The team will see all sorts of performance non-uniformities go up.
Jacqueline Chen asked what this all means for scientists and engineers who want to use HPC in the future. Everything in the stack from applications algorithms to the programming environments, the runtimes and the hardware need to operate in sync with one another and be able to express data locality, sometimes at the expense of Flops and independence. The researchers have to express massive parallelism, minimize data motion and reduce synchronization wherever possible, and also be able to detect and address faults.
Within the interdisciplinary co-design centre ExaCT for combustion co-design the charter was to look at all aspects of combustion simulation from the formulation of the governing equations to the basic math algorithms, the programming environments, even down to the hardware characteristics, that the team could work with vendors on, to enable these combustion simulations and include both in situ analytics and some UQ. Through this five-year project, the team was able to interact a lot with the vendors to help define some of the hardware requirements, understand some of the hardware limitations, and with several computer scientists define requirements for programming environments, and with the applied math community design locality-aware algorithms, especially for solving large systems of pressure differential equations (PDEs). The goal for combustion is to look at block structured both uniform as well as adaptive mesh refinement with embedded and in situ UQ. The question is a surrogate to a much broader range of multi-physics areas that involve solving systems of PDE.
Jacqueline Chen said that within the co-design centre the approach was to look at proxy applications and proxy machine models. In between that, there was simulation modelling of architectures. Many of these applications are million line applications. Trying to reason about sensitivities in those types of production codes is very difficult to do. The idea was to express the essential part of the application for science, express the main points for performance and then hide the rest of the complexity that is not consequential to the performance. Similarly for the proxy machine models for exascale, they are a reflection of what the underlying anticipated machine architectures are going to be. Here, it is important to look at expressing what's important for performance and again hiding what is not consequential for performance. The researchers iterate back and forth on this a lot with the computer scientists at the co-design centre. At the end of the day, to get the actual realistic compute intensities they still need to look at some full applications.
The starting point for this co-design effort are two codes: S3D and LMC, which is a block structure, Low Mach Number, adaptive mesh refinement code, that has been developed at Lawrence Berkeley National Laboratory. It also has a lot of similar physics capabilities that S3D has, in addition to adaptive mesh refinement which the researchers need as they move up in pressure where there are greater disparities in scale. At the exascale, the researchers anticipate that they need a new code base to be able to encompass these kinds of capabilities.
First in the area of algorithms and solvers, the goal is to produce and to develop high-order compressible and Low Mach Number adaptive mesh solvers. In this activity, the researchers looked at a couple of different topics. They looked at alternative time stepping strategies that were more compute-intensive and less communication-intensive. They also looked at higher order AMR, iterative linear solvers using multigrid methods, and looking at improved data distribution methods for AMR data layout, which is very dynamic and requires a lot of attention to load balancing. They also started to look out at incorporation of adjoints for sensitivity analysis within the workflow of the AMR solution, Jacqueline Chen explained. They looked at this new improved time integration method based on Spectral Deferred Corrections (SDC). This basically recasts the time evolution as an integral in time and iteratively solves the resulting spectral collocation system. It is an iterative approach. This SDC method offers a number of advantages over the explicit Runge Kutta method, including multirate integration, meaning that different processes can be integrated on their own independent different time scales. This would enhance the work intensity because of the expense of the chemistry computation which involves a lot of transcendental function evaluations and at the same time less than the communication because you take evection steps on much longer time steps. It's also an algorithmically resilient approach because of its iterative nature. This method is now being coupled to AMR and multigrid in both space and time. Initial measures of the metric show that the team gets a five-fold improvement for integration of stiff chemistry on current machines.
In addition to algorithmic improvements, another significant area of improvement or change is in the way the researchers program the software at the exascale. This is to illustrate the challenges with keeping the data locality. Currently, the engineers and scientists have to write functionally correct application code and they also have to build a map that targets the resources of the machine, the processors and the various memory components on the machine. They also have to extract parallelism, if they want to get any kind of performance out of it. They need to manage the transfer of data, doing nearest neighbour communications at low levels to communicate across different nodes. In order to hide some of the latencies on the machine, for example by going back and forth between the GPU and CPU, they have to build an interlayer between communication and computation, so latency hiding and task scheduling is another responsibility. Finally, one size does not fit all. There is a lot of data-dependent behaviour and how you code may be different depending on the type and size of your data. At the end of the day, it is no wonder that the engineers go home feeling very frustrated as they are trying to move to exascale.
The shift in paradigm and separation of concerns that they would like to see is that engineers and scientists still have a responsibility of writing functionally correct code and they also need to come up with policies on how to map it to various machine architectures. The rest of that, however, should be the job of a compiler runtime who understands the data and how to schedule tasks and adjust to the data dependencies and latencies on whatever architecture you want around. This might be a dream but they can try to push down this path as much as they can.
Fortunately, there are some really smart computer scientists on the ExaCT team, including Alex Aiken and his students, primarily Michael Bauer and Sean Treichler, who worked on the combustion co-design effort. They have this Legion programming model, which is a data-centric model to capture the structure of programme data to a separate map interface. It decouples the specification of the data from the mapping of the tasks and the data to memory and processors. It lets you have some flexibility in programming choices at different levels. For example, they wanted to use a tiling approach like TiDA. It automates the data motion, parallelism discovery, synchronization and the hiding of long latency. It has support for both task and data parallelism. Proxy apps and proxy machine models are great at first cut in the evaluations of sensitivities but in the end there is no substitution for coding of the full production code on something that has some characteristics that you might see at the exascale.
The researchers ported SVD over to the Legion programming model, using Titan as well as Piz Daint. They are still working on porting it over to various other machines. In Legion you describe the data with logical regions. These are arrays of objects that have no implied layout or location. They are described by index space or a set of keys. You can partition the logical regions into subregions and you can also slice the data by the field. You don't have to operate on all the fields, you can operate on a portion of them if you wish. Similarly to the tree of regions, the tasks can also be partitioned in subtasks. The tasks have to specify which regions or subregions they are going to use. They also have to specify the modes or access modes for these regions, including whether it is read-write or reduction access. The Legion runtime schedules these tasks on the computer resources.
Legion has a mapping interface where the application selects whether the tasks run and where the regions are placed. The tasks can either be placed on the CPUs or on the GPUs. The logical regions can be placed in different parts of the memory. Mapping is computed dynamically. Legion decouples the correctness of the mapping from the performance. You might initially have a full performing code but you can optimize the performance, Jacqueline Chen explained. The code is about 200,000 lines of Fortran. It is a DNS code using explicit MOL methods. There are various versions of the code, including Fortran + MPI, and hybrid OpenACC. One of the important features of Legion is that it interoperates with MPI, so the researchers don't have to move the entire code but they can pick and choose the most compute-intensive parts of the code and migrate just those parts. The data organized in S3D is a large 3D cartesian grid of cells. The typical per-node subgrid is about 483or 643cells. Nearly all kernels are per-cell and embarrassingly data parallel. It could involve hundreds of tasks, so there is an opportunity for a lot of task-level parallelism. However, the computational intensity is pretty low. The code has huge working sets per cell, involving thousands of temporary variables. This places a huge burden on memory access. The performance limiter in this code is not compute, but memory.
In addition to the task parallelism opportunities, the researchers also have to optimize the compute leaf tasks in order to gain the best usage and performance on the GPUs. Ninety percent of the computer available time for Titan in on the GPUs so the researchers wanted to make sure they squeezed as much of the forms out of that as possible. The researchers also have to optimize each of the leaf kernels' performance. That is where a lot of the performance is gained and improvements over all. Over all the team achieved 80 percent of performance. Legion has a run time compiler that treats the tasks as black boxes. It doesn't care how you write the tasks but you still have this need for very fast leaf tasks, especially for the compute intensive chemistry kernels. The researchers need both GPUs as well as CPUs for multiple mechanisms. A DSL compiler, called Singe, was written by Mike Bauer for chemistry kernels that emits very fast code for GPUs as well as CPUs.
One of the big challenges in combustion is that the researchers have very large working sets. For example, the PRF chemistry needs 1722 double precision reaction rates per point. The current GPU register file only stores 128 per thread, so there is a problem with the working sets. One naive way to get around this problem is to break up the loops, the fissioned kernels but this is limited by memory bandwidth. It is very slow to retrieve and read and write the data back in. This is not an optimal way to deal with large working sets, Jacqueline Chen pointed out. Mike Bauer leveraged the knowledge of the GPU underlying hardware, using work specialization. The GPUs execute warps in streams of 32-wide threads concurrently. It is a data parallel unit but it is not necessary to run all the warps, running the same identical computation. In fact, each warp can run a different computation on the same data. In that way, you can generate code that specializes each warp. These different warps do different computations. Especially for chemistry, you don't need all elementary reactions. Each reaction is only involved in about a handful of different species. There is a lot of opportunity for task parallelism. With this, the researchers can get much better use of the memory of the GPUs while also keeping the processors busy and also fit the large working sets on chip on the GPUs.
In addition to the performance, Jacqueline Chen and her team also looked at in-situ chemical explosive mode analysis (CEMA), moving eigenvalue solve into the S3D solution to steer mesh refinement. This is a tedious task if you do this manually because you would have to interleave the computation across the different RK stages in a way that doesn't impact the critical operations. The researchers were able to schedule this eigenvalue solve with the Legion's code in a day's time. It is highly advisable to use these asynchronous task execution models, not only for the solve but also for the entire workflow. As far as the performance concerns, the Singe DSL was about 3.75 times faster than previously hand-tuned data-parallel CUDA kernels. There were three testbeds including Keeneland, Titan and Piz Daint with varying numbers and types of processors, different numbers and types of GPUs, and different types of interconnect. The researchers achieved a 2.25 to 2.50 times speed-up for OpenACC over the MPI-only code and an additional 1.5 to 2.85 times speed-up, using Legion on Titan. They also compared the Legion code with MPI-only code. At one node, they got a 4 times speed-up and at larger clock counts they got over 7 times speed-up. The greater speed of the larger clock counts is due to the asynchronous dynamic task scheduling which is able to overcome the network latency at larger clock counts and which the MPI-only code is not able to do, Jacqueline Chen explained.
Looking at the execution overhead of the in-situ analytics, both on Piz Daint and Titan, it took about half a second out of seven seconds to compute the eigenvalue solves. You basically get the analytics for free. The improvements on Titan were not quite as good but still very substantial. There is 98 percent reduction in overhead on Piz Daint with the Legion code and there is 84 percent reduction in cost on Titan. This is not quite a research production code. The researchers are actually doing production runs with this code on Titan and they are able to run auto-ignition runs at high pressure of 25 bar as a result of this speed-up.
The potential with these data-centric task-based models is that the code is just easier to modify and maintain. Porting the code is developing a new mapping which is easy to tune for performance. There is a new functionality as you build up your workflow. It just means adding new tasks. The Legion run time realm will help you figure out the dependencies and scheduling. In the researchers' experience though there is still room for improved productivity and there is a lot of room for developing higher level abstraction layers for scientists and engineers to write to, rather than writing at very low levels. That is an area to work on in the future, according to Jacqueline Chen. Co-design has been a great experience for working together between computer and computational scientists. That is the only way to go to actually get applications to really run well on exascale machines.