Challenges Simulating Real Fuel Combustion Kinetics: The Role of GPUs


1 Challenges Simulating Real Fuel Combustion Kinetics: The Role of GPUs
M. J. McNenly and R. A. Whitesides
GPU Technology Conference, March 27, 2014, San Jose, CA
LLNL-PRES
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

2 Enhanced understanding of high efficiency clean combustion requires expensive models that fully couple detailed kinetics with CFD
Objective: Create faster and more accurate combustion solvers.
Accelerates R&D on three major challenges identified in the DOE VTP multi-year program plan:
- A. Lack of fundamental knowledge of advanced engine combustion regimes
- C. Lack of modeling capability for combustion and emission control
- D. Lack of effective engine controls
We used to aim for detailed chemistry in highly resolved 3D simulations:
- Detailed chemistry, ex. diesel component C20H42 (LLNL): 7.2K species, 53K reaction steps
- Highly resolved 3D simulations, ex. SI/HCCI transition: ~30M cells for Bosch in LLNL's hpc4energy incubator

3 Enhanced understanding of high efficiency clean combustion requires expensive models that fully couple detailed kinetics with CFD
Objective: Create faster and more accurate combustion solvers.
Accelerates R&D on three major challenges identified in the DOE VTP multi-year program plan:
- A. Lack of fundamental knowledge of advanced engine combustion regimes
- C. Lack of modeling capability for combustion and emission control
- D. Lack of effective engine controls
Now we want to use detailed chemistry in highly resolved 3D simulations:
- Detailed chemistry, ex. 9-component diesel surrogate (AVFL18), C. Mueller et al., Energy Fuels, 2012: +10K species, +75K reaction steps

4 Long-term challenges require new algorithms and lower-cost computing architectures (flops/$)
Current industrial research achieves 30M cells with LES/flamelet, or 1.4B species-cells with RANS/detailed finite-rate kinetics.
Pick one: turbulence or chemistry.
The John Deur [Cummins] Challenge: all of the above with 150M cells and 10K species.
Bare memory requirements: 12TB for a snapshot
- 400 nodes of cab (2x 8 core Intel, 2GB per core)
- 2000 NVIDIA K20X (4 Indiana University Big Red II)
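The 12TB figure can be checked with a short back-of-envelope calculation. The sketch below assumes one 8-byte double-precision value per species per cell, decimal units (1 TB = 10^12 bytes), and 6 GB of memory per K20X board; none of these assumptions is stated on the slide, and the function names are illustrative only.

```python
# Back-of-envelope check of the "bare memory" figure on this slide.
# Assumptions (not stated in the talk): one 8-byte double per species per
# cell, decimal units, and 6 GB of memory per NVIDIA K20X board.
CELLS = 150_000_000
SPECIES = 10_000
BYTES_PER_VALUE = 8  # assumed double precision

GB = 10**9
TB = 10**12


def snapshot_bytes(cells, species, bytes_per_value=BYTES_PER_VALUE):
    """Memory needed to hold one species field snapshot."""
    return cells * species * bytes_per_value


def units_needed(total_bytes, bytes_per_unit):
    """Smallest whole number of nodes or boards that can hold total_bytes."""
    return -(-total_bytes // bytes_per_unit)  # integer ceiling division


snapshot = snapshot_bytes(CELLS, SPECIES)
cab_node_bytes = 2 * 8 * 2 * GB  # 2 sockets x 8 cores x 2 GB per core
k20x_bytes = 6 * GB              # assumed K20X board memory

print(snapshot / TB)                           # 12.0
print(units_needed(snapshot, cab_node_bytes))  # 375
print(units_needed(snapshot, k20x_bytes))      # 2000
```

Under these assumptions the snapshot fits in 375 cab nodes, which is consistent with the slide's rounder figure of 400, and in exactly 2000 K20X boards.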

5 HCCI relies on the chemical kinetics of the charge mixture, not on mixing, to determine ignition timing and burn duration
[Videos: Spark-Ignition (SI) and HCCI combustion. Courtesy of Y. Ikeda (U of Kobe)]

6 Predictive HCCI models require accurate chemistry more than detailed fluid dynamic transport
Aceves et al. SAE-2005-01-2134: How do the in-cylinder transport processes affect HCCI performance?
[Figure: turbulence intensity (m/s) versus crank angle (degrees) for the disc and square cases; solid lines: experimental, dotted lines: numerical]

7 Approach
Accelerate research in advanced combustion regimes by developing faster and more predictive engine models
1. Better algorithms and applied mathematics: same solution, only faster
2. New computing architecture: more flops per second, per dollar, per watt
3. Improved physical models: more accuracy, better error control

8 Our GPU research into combustion algorithms had very modest roots
General Purpose Graphical Processing Units (GPGPUs) bring Tflop/s computing power to the partners' desktop (from Aug. 2010 presentation to program)
Originally used for graphics-intensive applications:
- video games
CPU: Intel/AMD. Cores: up to 12. Mem: up to 256GB. Tflop/s: up to 0.13. Price: up to $1300.
GPU: NVIDIA GTX. Cores, Mem (GB), and Tflop/s: [values missing]. Price: <$300, $500.

9 Enthalpy and entropy calculations show a greater speedup because they perform more arithmetic operations per memory access


11 Better Algorithms: adaptive preconditioner using on-the-fly reduction produces the same solution significantly faster
Two approaches to faster chemistry solutions (ex. iso-octane, 874 species, 3796 reactions):
1. Classic mechanism reduction (ex. 197 species)
- Smaller ODE size
- Smaller Jacobian matrix
- Solution is faster but is not accurate over the entire operating range
2. LLNL's adaptive preconditioner
- Identical ODE
- Reduced mechanism only in the preconditioner
- Filter out 50-75% of the least important reactions
[Figure: Jacobian matrix (species coupling frequency), ordered from slower to faster]
Our solver is as fast as the reduced mechanism without any loss of accuracy: more than 10x speedup.
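The reason a reduced mechanism can live in the preconditioner without changing the answer can be illustrated on a small linear system. The sketch below is not the LLNL solver: it uses a plain preconditioned Richardson iteration as a stand-in for whatever iterative linear solver the production code uses, and it filters small matrix entries by a relative threshold instead of filtering reactions. The matrix, the threshold, and all function names are hypothetical.

```python
def solve_dense(a, b):
    """Gaussian elimination with partial pivoting on copies of a and b."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def filter_small(a, rel_tol):
    """Zero off-diagonal entries smaller than rel_tol times the row's diagonal."""
    return [[v if i == j or abs(v) >= rel_tol * abs(row[i]) else 0.0
             for j, v in enumerate(row)] for i, row in enumerate(a)]


def preconditioned_solve(a, b, m, tol=1e-12, max_iter=200):
    """Richardson iteration: the residual uses the full matrix a,
    only the correction step uses the filtered matrix m."""
    x = [0.0] * len(b)
    for it in range(max_iter):
        r = [bi - sum(aij * xj for aij, xj in zip(row, x))
             for row, bi in zip(a, b)]
        if max(abs(v) for v in r) < tol:
            return x, it
        z = solve_dense(m, r)  # cheap solve with the filtered matrix
        x = [xi + zi for xi, zi in zip(x, z)]
    raise RuntimeError("no convergence")


# Hypothetical 4x4 system with strong and weak couplings.
A = [[4.0, 1.0, 0.01, 0.0],
     [1.0, 5.0, 0.0, 0.02],
     [0.01, 0.0, 3.0, 1.0],
     [0.0, 0.02, 1.0, 6.0]]
b = [1.0, 2.0, 3.0, 4.0]

M = filter_small(A, rel_tol=0.05)   # drops the 0.01 and 0.02 couplings
x, iterations = preconditioned_solve(A, b, M)
x_ref = solve_dense(A, b)           # direct solve with the full matrix
print(iterations, max(abs(u - v) for u, v in zip(x, x_ref)))
```

Because the residual is always evaluated with the full matrix, the iteration converges to the solution of the unfiltered system; the filtering only affects how many iterations are needed. In this toy case the filtered matrix is block diagonal, which hints at why a sparser preconditioner is cheaper to factor.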

12 LLNL's solver brings well-resolved chemistry and 3D CFD to 1-day engine design iterations using iso-octane (874 species)
Simulation time (chemistry-only) for 10^6 cells on 32 processors
2-methylnonadecane, 7172 species:
- Traditional dense matrix ODE solvers still found in KIVA and OpenFOAM - 90 years
- New commercial solvers using sparse systems - days
- New LLNL solvers created for VTP program - 11 days
874 species iso-octane: now only 1 day


14 The calculation of the species production rate serves as a practical case study to illustrate GPU algorithm development
[Equation: the creation rate of species i is the summation, over all reactions containing species i as a product, of the rate of progress of reaction step j]
Example: hydrogen mechanism
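The summation described on this slide can be sketched in a few lines. The three reactions below are familiar chain steps from hydrogen-oxygen chemistry, chosen here for illustration; the stoichiometric weighting, the rate-of-progress values, and the names `REACTIONS` and `creation_rates` are assumptions, not taken from the talk.

```python
# Each reaction lists its products with stoichiometric coefficients.
REACTIONS = [
    {"name": "H + O2 -> O + OH", "products": {"O": 1, "OH": 1}},
    {"name": "O + H2 -> H + OH", "products": {"H": 1, "OH": 1}},
    {"name": "OH + H2 -> H + H2O", "products": {"H": 1, "H2O": 1}},
]


def creation_rates(reactions, rate_of_progress):
    """Sum, for each species, the rate of progress of every reaction step
    that lists the species as a product (weighted by its coefficient)."""
    rates = {}
    for rxn, q in zip(reactions, rate_of_progress):
        for species, nu in rxn["products"].items():
            rates[species] = rates.get(species, 0.0) + nu * q
    return rates


q = [2.0, 3.0, 5.0]  # hypothetical rates of progress, arbitrary units
rates = creation_rates(REACTIONS, q)
print(rates)  # {'O': 2.0, 'OH': 5.0, 'H': 8.0, 'H2O': 5.0}
```

Each reaction touches only a handful of species, so the work per memory access is small and the write pattern depends entirely on the mechanism.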

15 Low arithmetic intensity and unstructured memory access of the species production rate algorithm make it ill-suited for the GPU
- Scatter-add for the species production rates
- Unstructured read or write operations are very costly on the GPU
iso-octane (874 species): GPU: 0.13 GB/s, CPU: 6.6 GB/s

16 Processing multiple instructions that update the same species within the same thread block increases throughput
Process 4 instructions per block
[Figure: GPU thread block 1 writing to a shared memory bank]
iso-octane (874 species): GPU: [value missing] GB/s (2^n instr), CPU: 17 GB/s
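The difference between the plain scatter-add and the grouped version can be sketched on the CPU. The code below is a serial Python analogy and not the GPU kernel from the talk: a "block" is simply a slice of the sorted instruction list, and the local partial sum stands in for a reduction in shared memory. The instruction encoding `(destination species, source reaction)` and the species numbering are assumptions made for this example.

```python
from itertools import groupby


def scatter_add(n_species, instructions, q):
    """One global write per instruction, in mechanism order."""
    out = [0.0] * n_species
    for dest, src in instructions:
        out[dest] += q[src]
    return out, len(instructions)


def grouped_add(n_species, instructions, q, block_size=4):
    """Sort so updates to the same species are adjacent, reduce them
    inside each block, then write once per species per block."""
    out = [0.0] * n_species
    writes = 0
    ordered = sorted(instructions)
    for start in range(0, len(ordered), block_size):
        block = ordered[start:start + block_size]
        for dest, items in groupby(block, key=lambda ins: ins[0]):
            partial = sum(q[src] for _, src in items)  # local reduction
            out[dest] += partial                       # single global write
            writes += 1
    return out, writes


# Species indices (assumed): H=0, O=1, OH=2, H2O=3.
instructions = [(1, 0), (2, 0), (0, 1), (2, 1), (0, 2), (3, 2)]
q = [2.0, 3.0, 5.0]  # hypothetical rates of progress

naive, naive_writes = scatter_add(4, instructions, q)
grouped, grouped_writes = grouped_add(4, instructions, q)
print(naive, naive_writes)      # [8.0, 2.0, 5.0, 5.0] 6
print(grouped, grouped_writes)  # [8.0, 2.0, 5.0, 5.0] 5
```

Both versions give the same rates; the grouped version issues fewer global writes, and the saving grows as more updates to the same species land in the same block.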

17 Similar algorithm improvements allow for an order of magnitude speedup in the calculation of the system derivatives on the GPU

18 The system derivatives for chemistry are only half the CPU cost; the matrix operations must also be accelerated
[Figure: total simulation time (sec) for A: CPU only (1T); D: vector ops on GPU; E: matrix skipping (1T); F: matrix CPU threads (8T)]

19 NVIDIA has provided us with a new sparse matrix software library
GLU sparse matrix package
- Developed internally at NVIDIA - lead developers Maxim Naumov & Sharan Chetlur
- Key application: integrated circuits (SPICE)
- Non-symmetric matrices
- Beta software: LLNL has been provided early access
presentations/s3364-spice-acceleration-on-gpus.pdf


21 First GLU results are promising: it is faster than the best work-avoidance CPU heuristic
[Figure: total simulation time (sec) for A: CPU only (1T); D: vector ops on GPU; E: matrix skipping (1T); F: matrix CPU threads (8T); G: GLU matrix ops]

22 Second GLU results are even better: now comparable in speed to the multithreaded CPU, with greater room for future growth
[Figure: total simulation time (sec) for A: CPU only (1T); D: vector ops on GPU; E: matrix skipping (1T); F: matrix CPU threads (8T); G: GLU matrix ops; H: GLU matrix ops #2]

23 Further improvements are expected with new mix & match solver capability added to the next build
[Figure: total simulation time (sec), bars grouped as CPU matrix and GPU matrix, for A: CPU only (1T); D: vector ops on GPU; E: matrix skipping (1T); F: matrix CPU threads (8T); G: GLU matrix ops; H: GLU matrix ops #2]

24 Chemistry integration over a fluid dynamic time step
Single chemical reactor
[Figure: timeline from t_0 to t_N with integrator steps dt_0, dt_1, ...]
- The integrator decides how long a step to take and when to perform a new factorization.
- Black lines indicate a derivative evaluation and backsolve.
- Red lines indicate a matrix factorization (traditionally expensive).


26 Reactor coupling: Ideal Case
[Figure: step timelines for cell 1, cell 2, cell 3, cell 4, and the combined system]
Exactly the same initial conditions mean exactly the same combined time steps and matrix factorizations.


28 Reactor coupling: Poor Case
[Figure: step timelines for cell 1, cell 2, cell 3, cell 4, and the combined system]
Closely related initial conditions can lead to a very large coupling penalty.

29 Reactor coupling: Acceptable Case
[Figure: step timelines for cell 1, cell 2, cell 3, cell 4, and the combined system]
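The ideal and poor cases can be mimicked with a toy step-counting model. This is only a sketch under strong assumptions that are not from the talk: each cell needs small steps inside a short window around its own ignition time and large steps elsewhere, and a combined group of reactors must take the smallest step any member requires. Times are integers in arbitrary ticks, and `step_size` and `count_steps` are hypothetical names.

```python
def step_size(t, t_ignite, big=100, small=1, half_width=10):
    """Step one cell would accept at time t in this toy model."""
    start, end = t_ignite - half_width, t_ignite + half_width
    if start <= t < end:
        return small                  # stiff window around ignition
    if t < start:
        return min(big, start - t)    # do not step over the stiff window
    return big


def count_steps(ignition_times, t_end=1000):
    """Steps taken when all listed cells advance with one shared step."""
    t, steps = 0, 0
    while t < t_end:
        dt = min(step_size(t, ti) for ti in ignition_times)
        t += min(dt, t_end - t)
        steps += 1
    return steps


ideal_group = [450, 450, 450, 450]   # identical cells
poor_group = [200, 400, 600, 800]    # staggered ignition times

alone = count_steps([200])
print(count_steps(ideal_group), count_steps([450]))
print(count_steps(poor_group), alone)
```

In this model the identical cells take exactly as many combined steps as a single cell, while the staggered group forces every member through each of the other members' stiff windows, so the combined step count is several times what any one cell would need alone.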

30 Under what conditions can GPU-friendly multireactor groups be solved efficiently?
[Figure: speedup (CPU time / GPU time) versus ΔT_i (0 to 50) and ΔΦ_i (0 to 0.2), for T_i = 800 and Φ_i = 0.2]

31 Initial study reveals many operating conditions where the multireactor approach is near-ideal
[Figure: speedup (CPU time / GPU time) versus ΔT_i (0 to 50) and ΔΦ_i (0 to 0.2), repeated over T_i from 800 to 2000 and Φ_i from 0.2 to 1.0]

32 Future research directions to improve combustion algorithms on the GPU
Near term (FY14):
1. Evaluate new metrics for organizing GPU-friendly multireactor groups
2. Take full advantage of the hybrid CPU/GPU architecture, using the CPU to avoid poorly coupled multireactors
Long term:
3. Sensitivity calculations and other mechanism development tools used by fuel chemists
4. Lagrangian spray models
5. New control strategies for stiff ODE integrators
6. Many-species transport

33 Summary: GPU-accelerated multireactor solver will enable detailed fuel chemistry to be used in IC engine design
[Figure: total simulation time (sec) for A: CPU only (1T); D: vector ops on GPU; E: matrix skipping (1T); F: matrix CPU threads (8T); G: GLU matrix ops; H: GLU matrix ops #2]

34 Acknowledgements
We gratefully acknowledge the support of Gurpreet Singh, the Advanced Combustion Engines Program Leader for the US Department of Energy Vehicle Technologies Office. We would like to thank Stan Posey of NVIDIA for providing us with new K20 Tesla GPUs (2496 CUDA cores) for testing, and Maxim Naumov for adding new features to the GLU library and providing us with support. We would also like to thank Prof. Bill Green and his research group at MIT for providing large mechanisms to test from their Reaction Mechanism Generator (RMG) package.
