Generic Refinement and Block Partitioning enabling efficient GPU CFD on Unstructured Grids

Size: px

Start display at page:

Download "Generic Refinement and Block Partitioning enabling efficient GPU CFD on Unstructured Grids"

Maximillian Matthews
6 years ago
Views:

Generic Refinement and Block Partitioning enabling efficient GPU CFD on Unstructured Grids Matthieu Lefebvre 1, Jean-Marie Le Gouez 2 1 PhD at Onera, now post-doc at

1 Generic Refinement and Block Partitioning enabling efficient GPU CFD on Unstructured Grids Matthieu Lefebvre 1, Jean-Marie Le Gouez 2 1 PhD at Onera, now post-doc at Princeton, department of Geosciences, ml15@princeton.edu 2 Director of CFD and Aeroacoustics Department, Onera, Châtillon 1 GPU technology Conference GTC March 2013 San José

2 Efficient GPU CFD on Unstructured Grids Context Objectives Development stages, data models, languages Results of CFD method implementations Outlook 2 GPU technology Conference GTC March 2013 San José

Context: Simulation & HPC CFD legacy softwares elsa : aerodynamics and aeroelasticity, Cedre : combustion and aerothermics, Sabrina/Space/Funk : LES, Aeroacoustics, Euler equations in perturbations,

3 Context: Simulation & HPC CFD legacy softwares elsa : aerodynamics and aeroelasticity, Cedre : combustion and aerothermics, Sabrina/Space/Funk : LES, Aeroacoustics, Euler equations in perturbations, linear or not Hundreds of thousands lines, mixed Fortran, C++, Python for elsa High usage in aerospace industry External user expectations : need solutions to process Larger flow domains effect of wakes on downstream components : BVI, Full system modeling like multistage aerodynamics in turbomachinery, coupling with the combustion More multiscale : near-wall modeling, technological details to gain in system efficiency, flow control, Up-to-date CFD usage : Shape optimization, building and use of response surfaces, robust design, i.e. effect of existing uncertainties on input data : on-going integration in mainstream codes 3 GPU technology Conference GTC March 2013 San José

4 Context: Simulation & HPC Internal users expectations, as front end to the external users : New class of methods to develop better models of flow behavior confidence that novel numerical schemes will better preserve their models from spoiling by numerical diffusion/dispersion (still not a consensus), more efficiency : accuracy versus number of cells, grid convergence Objective : Billion cells, long unsteady runs, deforming & overlapping meshes (LES, DES, Aeracoustics) Evidences : 1/ Ongoing heavy development plans of the legacy software to develop & validate new functionalities risks on the scalability towards Petaflop s computing 2/ Innovative research at Onera : numerous prototypes Grid strategies : - structured - unstructured - overset, self generated, adaptive associated to model certification, international benchmarks Numerical schemes and non-linear solvers : DGM (h-p-m adaptation), VMS turbulence (Variational multiscale with hierarchy in basis functions), High Order FV Mesh/model/physics coupling strategies 4 GPU technology Conference GTC March 2013 San José

5 Objectives : Simulation & HPC Proceed with innovative solver prototypes, h,p-model adaptive : DGM, Hybrid&Overset FV?, Isogeometric FE with high order curved boundary elements Analyse the hierarchy in hardware computing core implementations, coprocessor farms, the associated hierarchy in memory access, Specify modular solvers on partitioned grids, adaptive and hierarchical data models, reuse (rewrite, simplify) existing modules code lines (already subdomain&mpi) Follow the way for a CFD Domain Specific Language 5 GPU technology Conference GTC March 2013 San José

6 Generic GPU Finite Volume prototype Generic GPU Finite Volume prototype Uses a High Order in space FV method (NXO) From a Fortran basis, which uses complex data structures on partitioned grids, calls the coprocessor modules (C subroutines, CUDA kernels), mocks-up and arranges the data model in C and Cuda, manages the pointers on GPU memory and thread submission for high CPU efficiency on TESLA 1st (easy) stage : on i,j,k structured grids multi-gpu implementation 2nd stage : on an existing arbitrary mesh of triangles 2a : domain partitioning of fine linear triangles (gmsh) addressed to blocks of threads : efficient usage of cache 2b : partitioning and generic refinement of an unstructured coarse mesh of triangles (high geometrical order triangles by gmsh) : efficient coalescent memory acesses in all the solver inner stages (kernel computations element-based, face-based, node-based) 6 GPU technology Conference GTC March 2013 San José

Euler equations with immersed boundaries i,j,k grid 2 nd order schemes Roe Fluxes / MUSCL Cartesian structuration Cartesian Euler Used to determine

7 Euler equations with immersed boundaries i,j,k grid 2 nd order schemes Roe Fluxes / MUSCL Cartesian structuration Cartesian Euler Used to determine language / API / tools to use CUDA C/Fortran, OpenCL PGI Accelerator 2D and 3D solvers 7 GPU technology Conference GTC March 2013 San José

8 Cartesian Euler : Multi-GPUs Communication Scheme CUDA C model preferred 2 Fermi GPUs collaboration through the pthread library Flat plate 25 M points Mach degree incidence 8 GPU technology Conference GTC March 2013 San José

9 3D Experiments Speedup and Scalability Results Speedup Problem size Speedup 1 M2050 vs. 2 M2050 : up to GPU technology Conference GTC March 2013 San José

Challenge: Accelerating Non GPU Compliant Codes NXO Unstructured grids Large stencils (up to the third neighbor to compute the fluxes for each cell interface, very high memory stress Reaches

10 Challenge: Accelerating Non GPU Compliant Codes NXO Unstructured grids Large stencils (up to the third neighbor to compute the fluxes for each cell interface, very high memory stress Reaches 4th-order spatial accuracy However algorithmic efficiency is hampered by the numerous indirect memory accesses through the arbitrary connectivity lists accessed at successive stages of the algorithm (cell-,face-, node-, stencilbased descriptions). On a GPU such non consecutive memory fetches are penalizing 10 GPU technology Conference GTC March 2013 San José

11 1st Approach: Block Structuration of a Regular Linear Grid Partition the mesh into small blocks Map the GPU scalable structure SM SM SM SM Block Block Block Block SM: Stream Multiprocessor 11 GPU technology Conference GTC March 2013 San José

12 Advantage of the Block Structuration Bigger blocks provide Better occupancy Less latency due to kernel launch Less transfers between blocks Smaller blocks provide Much more data caching L1 hit rate 1 0,8 0,6 0,4 0,2 0 Fluxes time (normalized) 1,2 1 0,8 0,6 0,4 0,2 0 Overall time (normalized) Final speedup wrt. to 2 hyperthreaded Westmere CPU: ~ GPU technology Conference GTC March 2013 San José

NXO-GPU Phase 2 : Imposing an Inner Structuration to the Grid (inspired by the tesselation tesselation mechanism in surface rendering) Hints for further optimization : relax the grid

13 NXO-GPU Phase 2 : Imposing an Inner Structuration to the Grid (inspired by the tesselation tesselation mechanism in surface rendering) Hints for further optimization : relax the grid coordinates once the substructuration is done the whole fine grid as such could remain unknown to the host CPU red block partition 13 GPU technology Conference GTC March 2013 San José

is allocated to an inner thread (threadid.

14 NXO-GPU Phase 3 : Imposing an Inner Structuration to the Grid Unique grid connectivity for the algorithm Unique subsets of cell groups for communications Optimal to organize data for coalescent memory access during the algorithm and communication phases Each coarse element in a block is allocated to an inner thread (threadid.x) Sub-structured interface vector for communication To improve the percentage of coalescent communication : Renumber the coarse cells in a partition Change the orientation (permutaion of the sides) 14 GPU technology Conference GTC March 2013 San José

15 Code structure Preprocessing Mesh generation and block and generic refinement generation Solver Allocation and initialization of data structure from the modified mesh file Computational routine GPU allocation and initialization binders Computational binders CUDA kernels Time stepping Data fetching binder C C CUDA C Fortran Fortran Postprocessing 15 GPU technology Conference GTC March 2013 San José Visualization and data analysis

Stencils Code and structure Ghost cells Coupling mechanisms : Identify ghost cells with real fine cells from neighbor coarse cells (transfer its metrics) or : do an overset grid interpolation to

16 Stencils Code and structure Ghost cells Coupling mechanisms : Identify ghost cells with real fine cells from neighbor coarse cells (transfer its metrics) or : do an overset grid interpolation to obtain the volume average of the conserved variables on the extruded ghost cells Reference element to be mapped on curvilinear cells 16 GPU technology Conference GTC March 2013 San José

17 Code structure CPU Data structure Block List * Block + Conservative variables + Stencils coefficients +... Coarse Element + Generic geometry arrays + Connectivity between fine elements Fine element number Coarse element number 17 GPU technology Conference GTC March 2013 San José Array filling : face AND cell values are consecutives for coarse element indexing

18 Code structure GPU Data structure Block List CPU 1 1..* Block CPU +... CPU +... GPU 1..* Fortran 2003 interface Block List +... CPU GPU 1 Block +... CPU GPU Fortran 2003 ''C pointer'' localisation Block ''C pointer'' List +... CPU CPU Coarse Element CPU Coarse Element CPU Fortran +... CPU +... GPU C / CUDA 18 GPU technology Conference GTC March 2013 San José

19 Code structure Language interoperability typedef struct { int number_elements; int number_faces; int size_stencil; double *wall_distance; double *dissipation_rate; //... double *flu_u; double *flu_v; //... double *producted_kinetic_energy; double *pseudo_timestep; double *ro; double *sfx; double *sfy; double *temperature; double *viscosity; double *vol; //... int *face_vertices; int *index_element_faces; int *index_face_elements; int *number_elements_stencils; int *stencil_indexes; int *index_stencil_faces; // inter coarse element communications int *message_src; // Boundaries conditions int number_adherent_faces; int number_open_faces; int *adherent_faces; int *open_faces; } block_data_t; TYPE block_data_t integer(c_int) :: number_elements integer(c_int) :: number_faces integer(c_int) :: size_stencil real(c_double), dimension(:), allocatable :: wall_distance real(c_double), dimension(:), allocatable :: dissipation_rate!... real(c_double), dimension(:), allocatable :: flu_u real(c_double), dimension(:), allocatable :: flu_v!... real(c_double), dimension(:), allocatable :: producted_kinetic_energy real(c_double), dimension(:), allocatable :: pseudo_timestep real(c_double), dimension(:), allocatable :: ro real(c_double), dimension(:), allocatable :: sfx real(c_double), dimension(:), allocatable :: sfy real(c_double), dimension(:), allocatable :: temperature real(c_double), dimension(:), allocatable :: total_energy real(c_double), dimension(:), allocatable :: viscosity real(c_double), dimension(:), allocatable :: vol!... integer(c_int), dimension(:), allocatable :: face_vertices integer(c_int), dimension(:), allocatable :: index_element_faces integer(c_int), dimension(:), allocatable :: index_face_elements integer(c_int), dimension(:), allocatable :: number_elements_stencils integer(c_int), dimension(:), allocatable :: stencil_indexes integer(c_int), dimension(:), allocatable :: index_stencil_faces // inter coarse element communications integer(c_int), dimension(:), allocatable :: message_src! Boundaries conditions integer(c_int) :: number_adherent_faces integer(c_int) :: number_open_faces integer(c_int), dimension(:), allocatable :: adherent_faces integer(c_int), dimension(:), allocatable :: open_faces END TYPE block_data_t 19 GPU technology Conference GTC March 2013 San José

Measured efficiency on a Tesla 2050 (with respect to 2 Cpu Xeon 5650, OMP loop-based) Simple duplication Fine grid partitioning Coarse grid refinement and partitioning Refinement without partitioning

20 Measured efficiency on a Tesla 2050 (with respect to 2 Cpu Xeon 5650, OMP loop-based) Simple duplication Fine grid partitioning Coarse grid refinement and partitioning Refinement without partitioning Recent results on a K20C : Max. Acceleration = 38 wrt to 2 Westmere sockets, very good scaling from the C2050 with the number of cores (2000 / 480), due to a lower pressure on registers? 20 GPU technology Conference GTC March 2013 San José

21 Conclusion Generic refinement methodology permits coalesced data accesses in global memory Exchange step between coarse cells ~ 10% global time Refinement level adaptable to the target computer architecture Refinement level adaptable to the solution (clustering in space and blocks of threads of coarse cells with identical refinement) Natural extension of the partitioned structure to multi-gpus, Further works : extension to 3D and other types of elements A prototype of hierarchical data model, further methods (DGM, 2nd order compact schemes on unstructured grids) and turbulence models will be ported in new kernels. At present non linear implicit in real time by dual time stepping, a linearized implicit formulation will be implemented 21 GPU technology Conference GTC March 2013 San José

Recent results with elsa on multi-cores

Recent results with elsa on multi-cores Michel Gazaix (ONERA) Steeve Champagneux (AIRBUS) October 15th, 2009 Outline Short introduction to elsa elsa benchmark on HPC platforms Detailed performance evaluation IBM Power5, AMD Opteron, INTEL Nehalem