PyFR: Heterogeneous Computing on Mixed Unstructured Grids with Python. F.D. Witherden, M. Klemm, P.E. Vincent

Size: px

Start display at page:

Download "PyFR: Heterogeneous Computing on Mixed Unstructured Grids with Python. F.D. Witherden, M. Klemm, P.E. Vincent"

Amie Ford
5 years ago
Views:

1 PyFR: Heterogeneous Computing on Mixed Unstructured Grids with Python F.D. Witherden, M. Klemm, P.E. Vincent 1

2 Overview Motivation. Accelerators and Modern Hardware Python and PyFR. Summary.

3 Motivation [1] Murray Cross, Airbus, Technology Product Leader - Future Simulations (2012)

4 Motivation Objective is to advance industrial CFD capabilities from their current RANS plateau. Achieved by intelligently leveraging benefits of high-order Flux Reconstruction (FR) methods for unstructured grids and massivelyparallel modern hardware platforms.

5 Modern Hardware 10 6 MFLOP/s, MB/s Measure Peak FLOPS Peak bandwidth Year Intel Xeon CPUs from

6 Accelerator Adoption FLOPS contributed by accelerators to the TOP500. [HPCwire]

7 Accelerator Adoption Within the top ten: Intel Xeon Phi 2 of 10 NVIDIA Tesla 2 of 10

8 Increasing Heterogeneity Consider Stampede at TACC. Currently #7 on the TOP500 list. Intel Xeon CPUs 2.2 PFLOP/S Intel Xeon Phis 7.4 PFLOP/S

9 Increasing Heterogeneity and it is not just the high end.

10 Performance Portability It is a challenging environment

11 PyFR Our solution PyFR. Written in Python. Uses flux reconstruction to solve the Navier- Stokes equations on mixed unstructured grids in 2D/3D. Performance portable across a variety of hardware platforms.

12 PyFR Python outer layer. Python Outer Layer (Hardware Independent) Setup Distributed memory parallelism Outer for loop and calls to Hardware Specific Kernels

13 PyFR Need to generate hardware specific kernels. Python Outer Layer (Hardware Independent) Setup Distributed memory parallelism Outer for loop and calls to Hardware Specific Kernels

14 PyFR In FR two types of kernel are required. Python Outer Layer (Hardware Independent) Setup Distributed memory parallelism Outer for loop and calls to Hardware Specific Kernels Matrix Multiply Kernels Data interpolation/ extrapolation etc. Point-Wise Nonlinear Kernels Flux functions, Riemann solvers etc.

15 PyFR Matrix multiplications are quite simple. Python Outer Layer (Hardware Independent) Setup Distributed memory parallelism Outer for loop and calls to Hardware Specific Kernels Matrix Multiply Kernels Data interpolation/ extrapolation etc. Point-Wise Nonlinear Kernels Flux functions, Riemann solvers etc. Use GEMM from vendor supplied BLAS

16 PyFR Harder for point-wise nonlinear kernels. Python Outer Layer (Hardware Independent) Setup Distributed memory parallelism Outer for loop and calls to Hardware Specific Kernels Matrix Multiply Kernels Data interpolation/ extrapolation etc. Point-Wise Nonlinear Kernels Flux functions, Riemann solvers etc. Use GEMM from vendor supplied BLAS Pass templates through Mako derived templating engine

17 PyFR These can now be called. Python Outer Layer (Hardware Independent) Setup Distributed memory parallelism Outer for loop and calls to Hardware Specific Kernels Matrix Multiply Kernels Data interpolation/ extrapolation etc. Point-Wise Nonlinear Kernels Flux functions, Riemann solvers etc. C/OpenMP Hardware Specific Kernels CUDA Hardware Specific Kernels OpenCL Hardware Specific Kernels Use GEMM from vendor supplied BLAS Pass templates through Mako derived templating engine

18 PyFR These can now be called. Python Outer Layer (Hardware Independent) Setup Distributed memory parallelism Outer for loop and calls to Hardware Specific Kernels Matrix Multiply Kernels Data interpolation/ extrapolation etc. Point-Wise Nonlinear Kernels Flux functions, Riemann solvers etc. C/OpenMP Hardware Specific Kernels CUDA Hardware Specific Kernels OpenCL Hardware Specific Kernels Use GEMM from vendor supplied BLAS Pass templates through Mako derived templating engine

19 Mako Template <%pyfr:kernel name='negdivconf' ndim='2' t='scalar fpdtype_t' tdivtconf='inout fpdtype_t[${str(nvars)}]' ploc='in fpdtype_t[${str(ndims)}]' rcpdjac='in fpdtype_t'> % for i, ex in enumerate(srcex): tdivtconf[${i}] = -rcpdjac*tdivtconf[${i}] + ${ex}; % endfor </%pyfr:kernel> 19

20 PyFR Core Tenants Exploit the rich Python ecosystem: numpy, mpi4py, PyCUDA, PyOpenCL, pymic, h5py, mako Offload all computation: overhead from the interpreter < 1%.

21 PyFR Results I Benchmark problem: flow over a cylinder: Isosurfaces of density Ma = 0.2, Re =

22 PyFR Results II Single node performance. W9100 (OpenCL) K40c (CUDA) E (C/OpenMP) Sustained GFLOP/s Polynomial Order

23 PyFR Results III Multi-node heterogeneous performance on the same mesh. Achieved Lost Polynomial Order Sustained GFLOP/s

24 PyFR Results IV Weak scaling on Piz Daint at CSCS for a NACA Normalised runtime PFLOP/s ~ 50% peak 32 billion DOFs # NVIDIA K20X

25 Summary Funded and supported by Any questions? 25

A Scalable GPU-Based Compressible Fluid Flow Solver for Unstructured Grids

A Scalable GPU-Based Compressible Fluid Flow Solver for Unstructured Grids Patrice Castonguay and Antony Jameson Aerospace Computing Lab, Stanford University GTC Asia, Beijing, China December 15 th, 2011