60x Computational Fluid Dynamics and Visualisation


1 60x Computational Fluid Dynamics and Visualisation Jamil Appa BAE Systems Advanced Technology Centre 1

2 Outline
BAE Systems - Introduction
Aerodynamic Design Challenges
Why GPUs?
CFD on GPUs
Example Kernel
Compute and Visualisation on GPUs
Summary

3 Some of our products 3



5 Aerodynamic Design Challenges
Current fluid simulation tools and technologies are very capable, but only for a limited range of design tasks.
The design space cannot be fully explored within useable timescales and to the required (bounded) accuracy.
Phenomena such as turbulence and flow separation are difficult to simulate accurately, yet have a significant impact on the performance of the product.



8 Why GPUs? 8



10 CUDA enabled CFD
3D explicit finite volume
2nd order in time and space
Time accurate
Two-equation turbulence model
Arbitrary polyhedra
Multiple-GPU implementation
Uses MPI to enable use of a GPU cluster

11 Mesh and Physics Complexity 11

12 Validation 12

13 Rotating Laminar Flow Cylinder 13

14 GPU Speed-up

15 Nehalem Calc (with Tau)
Volumes (cells) (M): 33.5
Volumes (points) (M): 18.5
Calc unit (edges) (M): 68.3
Iterations: 5902
Computing unit (cores): 128
Time (s):
Time (s) per iteration:
Time (s) per iteration per calc unit (x10^-6):
Computing unit    Time (s) per iteration per calc unit (x10^-6)

16 Veloxi CFD calc
Volumes (cells) (M): 6.7
Calc unit (faces) (M): 21.9
Iterations: 10
Computing unit (cards): 1
Time (s): 94
Time (s) per iteration: 9.4
Time (s) per iteration per calc unit (x10^-6):
Computing unit    Time (s) per iteration per calc unit (x10^-6)
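The normalised metric in these tables divides the wall-clock time by the iteration count and by the number of calculation units, so that runs on meshes of different sizes can be compared. As a rough worked example using only the GPU figures listed above (94 s, 10 iterations, 21.9 million faces), the sketch below gives about 0.43 x10^-6 s per iteration per face. The function names are hypothetical and are not taken from the solver.

```cpp
#include <cassert>

// Seconds per iteration per calculation unit, the metric used in the tables.
// Hypothetical helper: the name and signature are illustrative only.
double secondsPerIterationPerUnit(double totalSeconds, int iterations,
                                  double calcUnits) {
    assert(iterations > 0 && calcUnits > 0.0);
    return totalSeconds / iterations / calcUnits;
}

// GPU run from the table: 94 s total, 10 iterations, 21.9 million faces.
double veloxiMetric() {
    return secondsPerIterationPerUnit(94.0, 10, 21.9e6);
}
```

The same function applies to the CPU run once its total time is known, with edges rather than faces as the calculation unit.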

17 Comparison Figures
Nehalem core equivalent per GPU card:
No. of cards needed for 128-core equivalent:


19 Example Kernel
__global__ void UpdateKernel(VarTypes::var_t *cellData_d,
                             VarTypes::var_t *cellDataCopy_d,
                             VarTypes::var_t *residualData_d,
                             Precision::Stype4 *timeStepData_d,
                             Precision::Stype *cellVolume,
                             Precision::Stype4 *cellVelocity_d,
                             Precision::Stype RK, Precision::Stype cfl,
                             int end, int start)
{
    const int T = threadIdx.x;
    const int B = blockIdx.x;
    const int D_INDEX = T + (B * UPDATE_K_THREAD) + start;

    if (D_INDEX < end) {
        // Data types chosen to maximise coalesced data transfer
        VarTypes::var_t cellData;
        VarTypes::var_t cellDataCopy;
        VarTypes::var_t residualData;
        Precision::Stype4 timeStep;
        Precision::Stype4 cellVelocity;

        // Copy data into registers (hopefully!)
#ifdef PRECON
        cellData = cellData_d[D_INDEX];
#endif
        cellDataCopy = cellDataCopy_d[D_INDEX];
        residualData = residualData_d[D_INDEX];
        timeStep = timeStepData_d[D_INDEX];
        cellVelocity = cellVelocity_d[D_INDEX];

        // Call function using data in registers.
        // This function is the same for host or GPU execution.
        Update::Execute((Precision::Stype*)&cellData,
                        (Precision::Stype*)&cellDataCopy,
                        (Precision::Stype*)&residualData,
                        (Precision::Stype*)&timeStep,
                        (Precision::Stype)1.0/cellVolume[D_INDEX],
                        RK, cfl,
                        // Key solver parameters stored in constant memory
                        params.gamma, params.gasconstant, params.pinf, params.velref,
                        (Precision::Stype*)&cellVelocity);

        // Copy data back to main memory
        cellData_d[D_INDEX] = cellData;
    }
}
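The index arithmetic at the top of the kernel maps each (block, thread) pair to one cell in the range [start, end), and the `D_INDEX < end` guard switches off the surplus threads in the last block. The host-side sketch below reproduces that mapping in plain C++ so it can be checked without a GPU. The block size of 128 is an assumption (the slide only names the constant `UPDATE_K_THREAD`), and `blocksFor`, `globalIndex` and `activeThreads` are hypothetical helpers.

```cpp
#include <cassert>

// Assumed block size; the slide does not give the value of UPDATE_K_THREAD.
constexpr int UPDATE_K_THREAD = 128;

// Number of blocks needed so every index in [start, end) gets one thread.
int blocksFor(int start, int end) {
    assert(end >= start);
    return (end - start + UPDATE_K_THREAD - 1) / UPDATE_K_THREAD;
}

// Global cell index for a given thread and block, mirroring D_INDEX.
int globalIndex(int thread, int block, int start) {
    return thread + block * UPDATE_K_THREAD + start;
}

// Counts the (block, thread) pairs that pass the kernel's D_INDEX < end guard.
int activeThreads(int start, int end) {
    int active = 0;
    const int blocks = blocksFor(start, end);
    for (int b = 0; b < blocks; ++b) {
        for (int t = 0; t < UPDATE_K_THREAD; ++t) {
            if (globalIndex(t, b, start) < end) {
                ++active;
            }
        }
    }
    return active;
}
```

With these definitions, a range of 1000 cells needs 8 blocks of 128 threads, of which 1000 threads do work and 24 are masked out by the guard.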

20 Example Kernel (continued)
        // Copy data into registers (hopefully!)
#ifdef PRECON
        cellData = cellData_d[D_INDEX];
#endif
        cellDataCopy = cellDataCopy_d[D_INDEX];
        residualData = residualData_d[D_INDEX];
        timeStep = timeStepData_d[D_INDEX];
        cellVelocity = cellVelocity_d[D_INDEX];

21 Example Kernel (continued)
        // Call function using data in registers.
        // This function is the same for host or GPU execution.
        Update::Execute((Precision::Stype*)&cellData,
                        (Precision::Stype*)&cellDataCopy,
                        (Precision::Stype*)&residualData,
                        (Precision::Stype*)&timeStep,
                        (Precision::Stype)1.0/cellVolume[D_INDEX],
                        RK, cfl,
                        // Key solver parameters stored in constant memory
                        params.gamma, params.gasconstant, params.pinf, params.velref,
                        (Precision::Stype*)&cellVelocity);

        // Copy data back to main memory
        cellData_d[D_INDEX] = cellData;
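One common way to keep a per-cell function identical for host and GPU execution is to hide the CUDA qualifiers behind a macro, so the same source compiles under both nvcc and an ordinary C++ compiler. The sketch below illustrates that idea only. The body of `Update::Execute` is not shown in the slides, so `explicitUpdate` is a hypothetical stand-in that assumes a generic explicit update of the form new = old - RK * dt * residual / volume, and `HOST_DEVICE` and `updateOnHost` are likewise illustrative names.

```cpp
#include <cassert>

// Expands to CUDA qualifiers under nvcc and to nothing under a host compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical stand-in for Update::Execute: one explicit update of a single
// conserved variable. The formula is an assumption, not the solver's scheme.
HOST_DEVICE inline float explicitUpdate(float oldValue, float residual,
                                        float timeStep, float invVolume,
                                        float rkCoeff) {
    return oldValue - rkCoeff * timeStep * invVolume * residual;
}

// Host-side loop over cells; a CUDA kernel would instead call
// explicitUpdate once per thread with the same arguments.
void updateOnHost(float* cell, const float* cellCopy, const float* residual,
                  const float* timeStep, const float* volume,
                  float rkCoeff, int start, int end) {
    assert(end >= start);
    for (int i = start; i < end; ++i) {
        cell[i] = explicitUpdate(cellCopy[i], residual[i], timeStep[i],
                                 1.0f / volume[i], rkCoeff);
    }
}
```

Keeping the arithmetic in one function means a CPU reference run and the GPU run exercise the same code path, which simplifies validation.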

22 Profiler Output 22

23 Lessons Learnt
Maximise coalesced memory transfers
Use appropriate data structures
Pay close attention to .cubin and .ptx outputs
Manually move data that the compiler places in local memory (lmem) into shared memory
Do NOT rely on the compiler to do this (at the moment)
Use the profiler...
Maintaining optimum code for different generations of products is time consuming
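As an illustration of what "appropriate data structures" can mean for coalescing, the sketch below packs per-cell variables into 16-byte, float4-like chunks and stores them chunk-major, so that neighbouring threads (cells) read neighbouring 16-byte elements. This is an assumed layout for illustration, not the solver's actual `VarTypes::var_t` or `Precision::Stype4`. `Float4`, `chunksFor` and `chunkIndex` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical float4-like type: four floats aligned to 16 bytes.
struct alignas(16) Float4 {
    float x, y, z, w;
};
static_assert(sizeof(Float4) == 16, "Float4 must occupy exactly 16 bytes");

// Number of Float4 chunks needed to hold nVars floats per cell (padded up).
constexpr std::size_t chunksFor(std::size_t nVars) {
    return (nVars + 3) / 4;
}

// Chunk-major index: for a fixed chunk, consecutive cells are consecutive
// in memory, so consecutive threads touch adjacent 16-byte elements.
std::size_t chunkIndex(std::size_t chunk, std::size_t cell,
                       std::size_t nCells) {
    assert(cell < nCells);
    return chunk * nCells + cell;
}
```

For example, a cell state with seven variables would be padded to two chunks, with all first chunks stored contiguously ahead of all second chunks.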

24 Wish List
Toolkit to simplify the use of shared memory as a register spill-over
Improved debugging, especially for complex kernels
Improved error diagnosis: many bugs can only be found by trial and error
Resolution of the Infiniband/CUDA pinned memory conflict: zero-copy DMA is required for both stacks
Next-generation GPUs as soon as possible...


26 Visualisation Techniques
Vector plots: messy, difficult to interpret
Streamlines: results depend on start points; without prior knowledge of the flow field, features can be missed

27 Line Integral Convolution
A local streamline is calculated for each pixel.
A white-noise image is smeared along these streamlines.
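The two steps above can be sketched as a small CPU reference routine. This is a minimal illustration, not the GPU implementation from the talk. It assumes unit-length integration steps, nearest-pixel sampling and a simple box filter (plain average) along the streamline, and the names `Vec2` and `lineIntegralConvolution` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical 2D velocity sample, one per pixel.
struct Vec2 {
    float x, y;
};

// For each pixel, trace a short streamline forwards and backwards through
// the field and average the noise values met along it (box filter).
std::vector<float> lineIntegralConvolution(const std::vector<Vec2>& field,
                                           const std::vector<float>& noise,
                                           int w, int h, int steps) {
    assert(field.size() == static_cast<std::size_t>(w) * h);
    assert(noise.size() == field.size());
    std::vector<float> out(noise.size(), 0.0f);
    for (int py = 0; py < h; ++py) {
        for (int px = 0; px < w; ++px) {
            float sum = noise[py * w + px];
            int count = 1;
            for (int dir = -1; dir <= 1; dir += 2) {
                float x = px + 0.5f;  // start at the pixel centre
                float y = py + 0.5f;
                for (int s = 0; s < steps; ++s) {
                    const Vec2 v = field[static_cast<int>(y) * w +
                                         static_cast<int>(x)];
                    const float len = std::sqrt(v.x * v.x + v.y * v.y);
                    if (len == 0.0f) break;  // stagnation point
                    x += dir * v.x / len;    // unit step along the flow
                    y += dir * v.y / len;
                    if (x < 0.0f || y < 0.0f || x >= w || y >= h) break;
                    sum += noise[static_cast<int>(y) * w +
                                 static_cast<int>(x)];
                    ++count;
                }
            }
            out[py * w + px] = sum / count;
        }
    }
    return out;
}
```

Each output pixel depends only on read-only inputs, so the two outer loops map naturally onto one GPU thread per pixel.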


30 Results
Size (pixels)    CPU Time (s)    GPU Time (s)    Speedup    Frame Time (s)
100x x x x x


32 Summary
GPU-based high-fidelity CFD is possible today.
An NVIDIA PSC-sized system is equivalent to 100 Opteron cores or 60 Nehalem cores.
Large TCO savings (software, hardware and power) are possible.
Combined compute and visualisation will enable real-time simulation and interpretation of results.
