Automated Finite Element Computations in the FEniCS Framework using GPUs

1 Automated Finite Element Computations in the FEniCS Framework using GPUs
Florian Rathgeber, Advanced Modelling and Computation Group (AMCG), Department of Earth Science & Engineering, Imperial College London
2nd UK GPU Computing Conference, December 13th 2010

2 Outline Introduction: The FEniCS Project Parallel Finite Element Assembly Avoiding Global Assembly Benchmarks and Profiling Results Automating Finite Element Assembly Conclusions

4 The FEniCS Project Automation of Computational Mathematical Modeling (ACMM) DOLFIN FEniCS UNICORN

5 Automating finite element assembly
Pipeline: mathematical problem, weak form, form implementation, UFC interface, global matrix assembly, PDE solver
Components: UFL, FFC, UFC, DOLFIN, Unicorn
UFL, FFC, and UFC: the link between variational form in mathematical notation and assembly in DOLFIN

6 Modular DOLFIN library architecture source: Logg and Wells - DOLFIN: Automated finite element computing (2010)

7 The missing link source: NVIDIA

8 The missing link Goals: GPU backend for DOLFIN; performance gain; no sacrifice in degree of automation source: NVIDIA

9 Introduction: The FEniCS Project Outline Parallel Finite Element Assembly The Finite Element Method (FEM) Finite Element Assembly Finite Element Assembly on the GPU A Question of Data Layout Avoiding Global Assembly Benchmarks and Profiling Results Automating Finite Element Assembly Conclusions

10 The Finite Element Method (FEM) FEM Triangulation FEM Assembly Sparse matrix FEM Solution source: Wikimedia Commons

11 Finite Element Assembly Assembling a 3 × 3 element matrix into the global system matrix source: Alnaes et al. - Unified Framework for Finite Element Assembly (2009)
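The assembly step on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the DOLFIN implementation: the two-triangle mesh, the dense list-of-lists storage, and the names `assemble_global`, `K_REF` and `cell_dofs` are all hypothetical. The 3 × 3 element matrix is the P1 Laplace stiffness matrix of a right triangle with unit legs.

```python
def assemble_global(num_dofs, cell_dofs, element_matrices):
    """Add every local element matrix into a dense global matrix."""
    A_global = [[0.0] * num_dofs for _ in range(num_dofs)]
    for dofs, Ae in zip(cell_dofs, element_matrices):
        for i, gi in enumerate(dofs):        # local row i -> global row gi
            for j, gj in enumerate(dofs):    # local column j -> global column gj
                A_global[gi][gj] += Ae[i][j]
    return A_global


# P1 Laplace element matrix of a right triangle with unit legs
# (local node 0 sits at the right angle).
K_REF = [[1.0, -0.5, -0.5],
         [-0.5, 0.5, 0.0],
         [-0.5, 0.0, 0.5]]

# Unit square split into two such triangles; vertices are numbered
# 0=(0,0), 1=(1,0), 2=(0,1), 3=(1,1). Both cells share the edge 1-2.
cell_dofs = [(0, 1, 2), (3, 2, 1)]

A_global = assemble_global(4, cell_dofs, [K_REF, K_REF])
```

Entries belonging to the shared vertices 1 and 2 receive contributions from both cells, which is exactly where a parallel version has to guard against conflicting writes.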

12 Finite element assembly on the GPU
Assembly phases over assembly time, with their kernels:
tabulation of degrees of freedom: tabulate_dofs
evaluation of coefficients (constants, expressions, functions): eval_expression, interpolate_func
tabulation of element matrices: tabulate_tensor
assembly of global system matrix: matrix_addto
Data in global memory: entity indices, function values, DOF mapping, coefficient values, vertex coords, element matrices, global matrix
A data flow diagram showing input and output of kernels for the different assembly stages and how data is streamed between them
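A serial sketch of how these stages hand data to one another may help. The function names follow the kernels named on the slide, but everything else is an assumption made for illustration: a 1D mesh of interval cells, P1 elements, the Laplace operator, no coefficients, and Python lists standing in for arrays in global device memory. Each loop over cells corresponds to what one GPU thread per cell would do.

```python
def tabulate_dofs(cells):
    """Stage 1: for P1 elements the DOFs of a cell are its vertex indices."""
    return [tuple(cell) for cell in cells]


def tabulate_tensor(cells, vertex_coords):
    """Stage 2: 2 x 2 Laplace element matrix for every interval cell."""
    element_matrices = []
    for v0, v1 in cells:
        h = abs(vertex_coords[v1] - vertex_coords[v0])  # cell length
        element_matrices.append([[1.0 / h, -1.0 / h],
                                 [-1.0 / h, 1.0 / h]])
    return element_matrices


def matrix_addto(num_dofs, dof_map, element_matrices):
    """Stage 3: add all element matrices into the global matrix."""
    A = [[0.0] * num_dofs for _ in range(num_dofs)]
    for dofs, Ae in zip(dof_map, element_matrices):
        for i, gi in enumerate(dofs):
            for j, gj in enumerate(dofs):
                A[gi][gj] += Ae[i][j]
    return A


def assemble(cells, vertex_coords):
    """Run the stages in order; each output feeds the next stage."""
    dof_map = tabulate_dofs(cells)
    element_matrices = tabulate_tensor(cells, vertex_coords)
    return matrix_addto(len(vertex_coords), dof_map, element_matrices)


vertex_coords = [0.0, 0.5, 1.0]
cells = [(0, 1), (1, 2)]
A = assemble(cells, vertex_coords)
```

The first two stages are independent per cell, whereas the last one writes to shared matrix entries; this serial version sidesteps the write conflicts a parallel kernel would have to handle.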

13 A question of data layout
[Figure: array elements labelled by consecutive thread ID] GPU data layout to achieve coalesced transfers from and to global device memory (right) compared to a corresponding layout in CPU memory (left)
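The difference between the two layouts comes down to index arithmetic. The sketch below assumes, for illustration only, that every thread owns a fixed number of values; the function names are hypothetical. In the CPU-style layout a thread's values sit next to each other, while in the GPU-style layout value k of all threads is stored contiguously, so that consecutive threads touch consecutive addresses.

```python
def cpu_index(thread_id, k, values_per_thread):
    """CPU-style layout: all values of one thread are contiguous."""
    return thread_id * values_per_thread + k


def gpu_index(thread_id, k, num_threads):
    """GPU-style layout: value k of all threads is contiguous."""
    return k * num_threads + thread_id


def to_gpu_layout(data, num_threads, values_per_thread):
    """Reorder a CPU-layout array into the GPU layout."""
    out = [None] * len(data)
    for t in range(num_threads):
        for k in range(values_per_thread):
            out[gpu_index(t, k, num_threads)] = data[cpu_index(t, k, values_per_thread)]
    return out


num_threads, values_per_thread = 4, 3

# Addresses touched when every thread reads its first value (k = 0).
cpu_addresses = [cpu_index(t, 0, values_per_thread) for t in range(num_threads)]
gpu_addresses = [gpu_index(t, 0, num_threads) for t in range(num_threads)]

reordered = to_gpu_layout(list(range(12)), num_threads, values_per_thread)
```

With four threads and three values each, the CPU layout makes the threads read addresses 0, 3, 6, 9, while the GPU layout makes them read 0, 1, 2, 3, which is the access pattern that can be coalesced.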

14 Outline Introduction: The FEniCS Project Parallel Finite Element Assembly Avoiding Global Assembly Benchmarks and Profiling Results Automating Finite Element Assembly Conclusions

15 Sparse matrix-vector product without assembling the matrix
[Figure: the global matrix written as the product $A^T M_e A$]
Linear algebra representation of the sparse matrix-vector multiplication:
$M_e$: block-diagonal matrix of element matrices
$A$: sparse matrix corresponding to the local-to-global mapping (rows: all local degrees of freedom, columns: global degrees of freedom)
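Because $A$ only selects entries and $M_e$ is block-diagonal, the product can be evaluated cell by cell: gather the local values, multiply by the element matrix, and add the result back. The following is a minimal sketch under assumed data (two 1D interval cells of length 0.5 with Laplace element matrices); the name `lma_matvec` and the vector names `x` and `y` are hypothetical and do not come from the slides.

```python
def lma_matvec(num_dofs, dof_map, element_matrices, x):
    """Compute y = A^T (M_e (A x)) without forming the global matrix."""
    y = [0.0] * num_dofs
    for dofs, Me in zip(dof_map, element_matrices):
        n = len(dofs)
        x_local = [x[g] for g in dofs]                 # gather: rows of A x
        y_local = [sum(Me[i][j] * x_local[j] for j in range(n))
                   for i in range(n)]                  # one diagonal block of M_e
        for i, g in enumerate(dofs):                   # scatter-add: A^T
            y[g] += y_local[i]
    return y


dof_map = [(0, 1), (1, 2)]
element_matrices = [[[2.0, -2.0], [-2.0, 2.0]],
                    [[2.0, -2.0], [-2.0, 2.0]]]
x = [1.0, 2.0, 4.0]

y = lma_matvec(3, dof_map, element_matrices, x)

# Reference: the same product with the explicitly assembled matrix.
assembled = [[2.0, -2.0, 0.0], [-2.0, 4.0, -2.0], [0.0, -2.0, 2.0]]
y_reference = [sum(a * b for a, b in zip(row, x)) for row in assembled]
```

Both routes give the same vector; the cell-wise version trades the one-off cost of global assembly for extra work in every matrix-vector product, which matters when an iterative solver applies the operator many times.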

16 Introduction: The FEniCS Project Outline Parallel Finite Element Assembly Avoiding Global Assembly Benchmarks and Profiling Results Profiling GPU Assembly Assembly Benchmarks Assembly and Solve Benchmarks Interpreting Speedup Figures Automating Finite Element Assembly Conclusions

17 Profiling GPU assembly
[Figure: GPU kernel execution time (%) per kernel, one panel per problem. Poisson's equation (P1): tabulate_dofs, tabulate_tensor, matrix_addto, memcpy. Stabilized Navier-Stokes momentum term (P1): tabulate_dofs, GPUVelSource, GPUSource, tabulate_tensor, matrix_addto, memcpy]

18 Speedup reassembly
[Plot: speedup against number of cells. Series: CUDA, NSE momentum (coefficients); CUDA, Poisson 3rd order; CUDA, Poisson 2nd order; CUDA, Poisson 1st order]

19 Total runtime assembly and solve
[Plot: runtime [s] against number of iterations. Series: Poisson 3rd order, CUDA LMA; Poisson 3rd order, CUDA assemble; Poisson 3rd order, PETSc LMA; Poisson 3rd order, PETSc assemble; Poisson 1st order, CUDA LMA; Poisson 1st order, CUDA assemble; Poisson 1st order, PETSc LMA; Poisson 1st order, PETSc assemble]

20 Interpreting speedup figures
Computations in double precision: factor 8 penalty for floating point computation, factor 2 for memory transfer; 78 GFlop/s peak performance, on par with Intel Nehalem 8-core CPU; Fermi architecture improves double precision penalty to factor 2.
High speedup figures often: mediocre CPU against highly optimized GPU implementations. Here: highly optimized linear algebra (PETSc) and tensor contraction (generated by FFC) against generated GPU kernels without any hand-optimization.
Performance comparisons include data transfer between host and device: true one-to-one comparisons including all transfer and setup times.
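The effect of including transfer and setup times can be made concrete with a small calculation. The timings below are made up purely for illustration and are not measurements from this work; the function name `speedup` and its parameters are likewise hypothetical.

```python
def speedup(cpu_time, gpu_kernel_time, transfer_time=0.0, setup_time=0.0):
    """CPU runtime divided by the total GPU-side runtime."""
    return cpu_time / (gpu_kernel_time + transfer_time + setup_time)


# Illustrative timings in seconds (invented, not measured).
cpu_time = 8.0
gpu_kernel_time = 1.0
transfer_time = 0.75   # host <-> device copies
setup_time = 0.25      # allocation, kernel launch overhead, etc.

kernel_only = speedup(cpu_time, gpu_kernel_time)
end_to_end = speedup(cpu_time, gpu_kernel_time, transfer_time, setup_time)
```

With these invented numbers the kernel-only figure is twice the end-to-end figure, which shows why a comparison that leaves out transfers can look much better than what a user would observe.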

21 Outline Introduction: The FEniCS Project Parallel Finite Element Assembly Avoiding Global Assembly Benchmarks and Profiling Results Automating Finite Element Assembly Conclusions

22 FFC code generation
[Diagram elements: Poisson.ufl, parser, compiler, format, Poisson.h, FIAT, FERARI]
Compilation of a variational form given as a UFL file to a UFC-compliant header using FFC

23 Integration of generated code with a user program
User program using DOLFIN library: Poisson.ufl is compiled with ffc -l dolfin to Poisson.h; main.cpp includes Poisson.h and dolfin.h, is compiled with gcc and linked against libdolfin.so.
User program using DOLFIN-GPU library: Poisson.ufl is compiled with ffc -l dolfin-gpu to Poisson.cu and Poisson.h; Poisson.cu is compiled with nvcc to Poisson.o; main.cpp includes Poisson.h and dolfin-gpu.h, is compiled with gcc and linked against libdolfin-gpu.so.

24 Outline Introduction: The FEniCS Project Parallel Finite Element Assembly Avoiding Global Assembly Benchmarks and Profiling Results Automating Finite Element Assembly Conclusions Conclusions Future work

25 Conclusions Code generation for Finite Element Computations on GPUs 1. Finite element computations in the FEniCS framework on the GPU, showing a speedup of up to a factor of 9 over PETSc on a single CPU 2. Automated assembly from a variational form in mathematical notation using the FEniCS Form Compiler FFC

26 Conclusions Code generation for Finite Element Computations on GPUs 1. Finite element computations in the FEniCS framework on the GPU, showing a speedup of up to a factor of 9 over PETSc on a single CPU 2. Automated assembly from a variational form in mathematical notation using the FEniCS Form Compiler FFC
Current limitations: 1. Only cell integral assembly 2. No support for strongly enforced boundary conditions 3. Only a conjugate gradient solver, hence only symmetric positive definite matrices 4. Bugs in the nvcc compiler prohibit complex forms

27 Future work: integrate with OP2 and Fluidity; optimise generation of CUDA kernels; MPI-parallelisation (distributed memory); fully implement UFC; port to OpenCL

28 Thank You!

29 The NVIDIA Tesla GPU architecture
[Diagram labels: TCP, SM, MT scheduler, instr. cache, const. cache, SP, SFU, DP-SP, registers, shared mem, texture unit, interconnect network, DRAM]
NVIDIA Tesla C: multiprocessors (MP), 8 stream processors (SPs), 16 KB shared memory, 1.30 GHz shader clock, 4096 MB global memory (frame buffer), GFlop/s arithmetic peak, GB/s memory bandwidth

30 Finite element assembly on the GPU
Assembly kernels over assembly time: tabulate_dofs, eval_expression, interpolate_func, tabulate_tensor, matrix_addto
Data in global memory: entity indices, function values, DOF mapping, coefficient values, vertex coords, element matrices, global matrix
Referencing classes: GPUVector, Function, GPUFunctionSpace, GPUMesh, GPUForm, GPUAssembler, GPUMatrix
A data flow diagram showing input and output of the assembly kernels, how data is streamed between them, and where it is stored

31 Speedup assembly
[Plot: speedup against number of cells. Series: CUDA, NSE momentum (coefficients); CUDA, Poisson 3rd order; CUDA, Poisson 2nd order; CUDA, Poisson 1st order]

32 Speedup assembly and solve
[Plot: speedup (factor) against number of iterations. Series: Poisson 3rd order, CUDA LMA; Poisson 3rd order, CUDA assemble; Poisson 3rd order, PETSc LMA; Poisson 1st order, CUDA LMA; Poisson 1st order, CUDA assemble; Poisson 1st order, PETSc LMA]
