Performance-Portable Many Core Plasma Simulations: Porting PIConGPU to OpenPower and Beyond

1 Performance-Portable Many Core Plasma Simulations: Porting PIConGPU to OpenPower and Beyond. Erik Zenker1,2, René Widera1, Axel Huebl1,2, Guido Juckeland1, Andreas Knüpfer2, Wolfgang E. Nagel2, Michael Bussmann1. 1 Helmholtz-Zentrum Dresden-Rossendorf, 2 Technische Universität Dresden

2 PIConGPU: Electron Acceleration with Lasers, Ion Acceleration with Lasers, Plasma Instabilities, Compact X-Ray Sources, Tumor Therapy, Astrophysics

3 Domain Decomposition: Field and Particle Domain. Moving particles create fields. Particles change cells. Fields act back on particles.

4 Creating Vectorized Data Structures for Particles and Fields. Figure labels: Field Domain, Particle Domain, Cell 1, Cell 2, Cell 4. Keywords: chunked in supercells; fixed-size frames; line-wise aligned; struct of aligned arrays.
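The keywords above (supercells, fixed-size frames, struct of aligned arrays) can be illustrated with a minimal C++ sketch. It is not PIConGPU's actual data structure: the names `ParticleFrame`, `Supercell`, `kFrameSize` and `push`, the frame size of 8, and the two attributes are all assumptions chosen for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical frame size (particles per frame); not taken from the slides.
constexpr std::size_t kFrameSize = 8;

// One fixed-size frame in struct-of-arrays layout: every attribute lives in
// its own contiguous, aligned array, so neighbouring particles of the same
// attribute are neighbours in memory.
struct ParticleFrame {
    alignas(16) std::array<float, kFrameSize> x{};   // positions
    alignas(16) std::array<float, kFrameSize> px{};  // momenta
    std::size_t used = 0;                            // occupied slots
};

// A supercell owns the frames of all particles located in its cells.
struct Supercell {
    std::vector<ParticleFrame> frames;

    // Append a particle, opening a new frame when the last one is full.
    void add(float x, float px) {
        if (frames.empty() || frames.back().used == kFrameSize) {
            frames.emplace_back();
        }
        ParticleFrame& f = frames.back();
        assert(f.used < kFrameSize);
        f.x[f.used] = x;
        f.px[f.used] = px;
        ++f.used;
    }

    // Total number of particles stored in this supercell.
    std::size_t particleCount() const {
        std::size_t n = 0;
        for (const ParticleFrame& f : frames) n += f.used;
        return n;
    }
};

// Advance all positions by px * dt. The inner loop walks one contiguous
// array per attribute, which is the access pattern that vectorizes well.
inline void push(Supercell& sc, float dt) {
    for (ParticleFrame& f : sc.frames) {
        for (std::size_t i = 0; i < f.used; ++i) {
            f.x[i] += f.px[i] * dt;
        }
    }
}
```

Adding ten particles to an empty `Supercell` fills one frame completely and opens a second one; `push(sc, dt)` then updates both frames attribute by attribute.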

5 Algorithm Driven Cache Strategy. Figure labels: Global Memory, Shared Memory, Cell 1 to Cell 4.

6 High Utilization of Threads. Figure labels: Shared Memory, Global Memory, Thread Block, Thread 1 to Thread 4, Cell 1 to Cell 4.

7 Task-Parallel Execution of Kernels. Asynchronous Communication.

8 PIConGPU Scales up to 18,432 GPUs. Figure: strong scaling (speedup over number of GPUs) and weak scaling (efficiency [%] over number of GPUs), each comparing ideal scaling with PIConGPU. Efficiency >95%, 6.9 PFlop/s (SP).
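The slide does not spell out how speedup and efficiency are defined. A common convention, assumed here rather than taken from the talk, uses the runtime T(N) on N GPUs and a baseline GPU count N_0 (both symbols are introduced only for this illustration):

```latex
\begin{align*}
S(N) &= \frac{T(N_0)}{T(N)}
  && \text{strong-scaling speedup (fixed total problem size), ideal value } N/N_0 \\
E_{\text{strong}}(N) &= \frac{N_0}{N}\, S(N)
  && \text{strong-scaling efficiency} \\
E_{\text{weak}}(N) &= \frac{T(N_0)}{T(N)}
  && \text{weak-scaling efficiency (fixed problem size per GPU)}
\end{align*}
```

Under this convention, a weak-scaling efficiency above 95% would mean that the runtime grows by only about 5% when both the GPU count and the total problem size are scaled up together.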

9 More Physics, More Computations, More Power! Atom-physical effects. Figure: a matrix with entries t1,1 ... tn,n applied to atom states s1,1 ... sn,m (old atom state to new atom state). Really Big Data Task: random access on big amounts of data (> 100 GB). Good job for powerful CPUs. Efficient CPU/GPU cooperation.

10 Small Open Source Communities need Maintainable Codes. Heterogeneity: write once, execute everywhere. Testability: validate once, get correct results everywhere. Sustainability: porting implies minimal code changes. Optimizability: tune for good performance at minimum coding effort. Openness: open source and open standards. Single Source.

11 cupla Alpaka

12 Good News: there are Alpakas on the Compute Meadow. Figure labels: C++ Threads, Fibers, TBB. Single zero-overhead interface to existing parallelism models. Single-source C++11 kernels. Data-agnostic memory model.

13 Abstract Hierarchical Redundant Parallelism Model. Levels: Grid, Block, Thread, Element (figure labels: Synchronize, Parallel, Sequential). The element level is an explicit sequential layer.
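A minimal C++ sketch can make the four levels concrete. It is not the Alpaka API: `WorkDiv`, `VectorAddKernel` and `runSerial` are hypothetical names, the work division is one-dimensional, and the grid and block levels are mapped to plain loops, as a serial back-end might do.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1D work division; the field names mirror the slide's levels.
struct WorkDiv {
    std::size_t blocksPerGrid;
    std::size_t threadsPerBlock;
    std::size_t elementsPerThread;
};

// Kernel body executed by one thread. The element level is an explicit
// sequential loop inside the thread.
struct VectorAddKernel {
    void operator()(std::size_t block, std::size_t thread, const WorkDiv& wd,
                    std::size_t n, const int* x, int* y) const {
        const std::size_t first =
            (block * wd.threadsPerBlock + thread) * wd.elementsPerThread;
        for (std::size_t e = 0; e < wd.elementsPerThread; ++e) {
            const std::size_t i = first + e;
            if (i < n) {
                y[i] += x[i];
            }
        }
    }
};

// Serial mapping of the hierarchy: blocks and threads are visited one after
// another. A parallel back-end would distribute these two loops instead.
template <typename Kernel>
void runSerial(const WorkDiv& wd, const Kernel& kernel, std::size_t n,
               const int* x, int* y) {
    assert(wd.elementsPerThread > 0);
    for (std::size_t b = 0; b < wd.blocksPerGrid; ++b) {
        for (std::size_t t = 0; t < wd.threadsPerBlock; ++t) {
            kernel(b, t, wd, n, x, y);
        }
    }
}
```

With `WorkDiv{2, 2, 3}` the sketch covers up to 12 elements, so a 10-element vector add is handled completely and the bounds check skips the two surplus indices.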

14 Data Structure Agnostic Memory Model. Memory levels: host memory; device global memory (grid); shared memory (block); register memory (thread). Explicit deep copy between host and device memory.

15 Map the Abstraction Model to your Desired Acceleration Back-End. Specific unsupported levels of the model can be ignored. The abstract interface allows extending the set of mappings.

16 Alpaka in Source Code. Steps: initialization, work division, memory allocation & copy, kernel execution, memory copy.

    // Configure Alpaka
    using Dim = alpaka::dim::DimInt<3u>;
    using Size = std::size_t;
    using Acc = alpaka::acc::AccCpuSerial<Dim, Size>;
    using Host = alpaka::acc::AccCpuSerial<Dim, Size>;
    using Stream = alpaka::stream::StreamCpuSync;
    using WorkDiv = alpaka::workdiv::WorkDivMembers<Dim, Size>;
    using Elem = float;

    // Retrieve devices and stream
    DevHost devHost(alpaka::dev::DevMan<Host>::getDevByIdx(0));
    DevAcc devAcc(alpaka::dev::DevMan<Acc>::getDevByIdx(0));
    Stream stream(devAcc);

    // Specify work division
    auto elementsPerThread(alpaka::Vec<Dim, Size>::ones());
    auto threadsPerBlock(alpaka::Vec<Dim, Size>::all(2u));
    auto blocksPerGrid(alpaka::Vec<Dim, Size>(4u, 8u, 16u));
    WorkDiv workdiv(alpaka::workdiv::WorkDivMembers<Dim, Size>(
        blocksPerGrid, threadsPerBlock, elementsPerThread));

    // Memory allocation and host to device memory copy
    auto X_h = alpaka::mem::buf::alloc<Elem, Size>(devHost, extent);
    auto Y_h = alpaka::mem::buf::alloc<Elem, Size>(devHost, extent);
    auto X_d = alpaka::mem::buf::alloc<Elem, Size>(devAcc, extent);
    auto Y_d = alpaka::mem::buf::alloc<Elem, Size>(devAcc, extent);
    alpaka::mem::view::copy(stream, X_d, X_h, extent);
    alpaka::mem::view::copy(stream, Y_d, Y_h, extent);

    // Kernel creation and execution
    VectorAdd kernel;
    auto const exec(alpaka::exec::create<Acc>(
        workdiv, kernel, numElements,
        alpaka::mem::view::getPtrNative(X_d),
        alpaka::mem::view::getPtrNative(Y_d)));
    alpaka::stream::enqueue(stream, exec);

    // Copy memory back to host
    alpaka::mem::view::copy(stream, Y_h, Y_d, extent);

Very explicit, but chatty interface!

17 cupla: Provides you an Interface you are Familiar With. The slide sets the Alpaka code of slide 16 against the equivalent cupla code:

    #include <cuda_to_cupla.hpp>

    // Specify work division
    dim3 grid(4, 8, 16);
    dim3 threads(2, 2, 2);
    dim3 elems(1, 1, 1);

    // Memory allocation and host to device memory copy
    int *x_h, *y_h, *x_d, *y_d;
    x_h = (int*) malloc(n * sizeof(int));
    y_h = (int*) malloc(n * sizeof(int));
    cudaMalloc((void **)&x_d, n * sizeof(int));
    cudaMalloc((void **)&y_d, n * sizeof(int));
    cudaMemcpy(x_d, x_h, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(y_d, y_h, n * sizeof(int), cudaMemcpyHostToDevice);

    // Kernel execution
    CUPLA_KERNEL_ELEM(kernel)(grid, threads, elems)(n, x_d, y_d);

    // Copy memory back to host
    cudaMemcpy(y_h, y_d, n * sizeof(int), cudaMemcpyDeviceToHost);

18 cupla to the rescue: very fast porting!

19 Porting CUDA to Alpaka With cupla Needs Minimal Interventions. CUDA version:

    #include <cuda_runtime.h>

    // Specify work division
    dim3 blocks(4, 8, 16);
    dim3 threads(2, 2, 2);
    dim3 elems(1, 1, 1);

    // Memory allocation and host to device memory copy
    int *x_h, *y_h, *x_d, *y_d;
    x_h = (int*) malloc(n * sizeof(int));
    y_h = (int*) malloc(n * sizeof(int));
    cudaMalloc((void **)&x_d, n * sizeof(int));
    cudaMalloc((void **)&y_d, n * sizeof(int));
    cudaMemcpy(x_d, x_h, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(y_d, y_h, n * sizeof(int), cudaMemcpyHostToDevice);

    // Kernel execution
    kernel<<<blocks, threads>>>(elems, n, x_d, y_d);

    // Copy memory back to host
    cudaMemcpy(y_h, y_d, n * sizeof(int), cudaMemcpyDeviceToHost);

20 Most of the CUDA-API Calls Stay Untouched. cupla version; only the include and the kernel execution differ from the CUDA code:

    #include <cuda_to_cupla.hpp>

    // Specify work division
    dim3 blocks(4, 8, 16);
    dim3 threads(2, 2, 2);
    dim3 elems(1, 1, 1);

    // Memory allocation and host to device memory copy
    int *x_h, *y_h, *x_d, *y_d;
    x_h = (int*) malloc(n * sizeof(int));
    y_h = (int*) malloc(n * sizeof(int));
    cudaMalloc((void **)&x_d, n * sizeof(int));
    cudaMalloc((void **)&y_d, n * sizeof(int));
    cudaMemcpy(x_d, x_h, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(y_d, y_h, n * sizeof(int), cudaMemcpyHostToDevice);

    // Kernel execution
    CUPLA_KERNEL_ELEM(kernel)(blocks, threads, elems)(n, x_d, y_d);

    // Copy memory back to host
    cudaMemcpy(y_h, y_d, n * sizeof(int), cudaMemcpyDeviceToHost);

21 Single Source cupla DGEMM Kernel on Various Architectures. Figure: DGEMM performance per architecture; theoretical peak performance labels: 480 GFLOPS, 560 GFLOPS, 540 GFLOPS, 150 GFLOPS, 1450 GFLOPS. DGEMM: C = αAB + βC
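For reference, the DGEMM operation named on the slide can be written as a plain triple loop. This is a naive C++ sketch for illustration only, not the cupla kernel that was benchmarked; the function name `dgemm`, the row-major layout and the argument order are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive reference DGEMM on row-major matrices: C = alpha * A * B + beta * C,
// where A is m x k, B is k x n and C is m x n.
inline void dgemm(std::size_t m, std::size_t n, std::size_t k, double alpha,
                  const std::vector<double>& A, const std::vector<double>& B,
                  double beta, std::vector<double>& C) {
    assert(A.size() == m * k);
    assert(B.size() == k * n);
    assert(C.size() == m * n);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t p = 0; p < k; ++p) {
                sum += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = alpha * sum + beta * C[i * n + j];
        }
    }
}
```

The operation performs roughly 2·m·n·k floating-point operations, which is the count usually divided by the runtime to obtain GFLOPS figures of the kind shown on the slide.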

22 What happened so far...

23 PIConGPU Runtime on Various Architectures. Simulation parameters: 1000 time steps, 3D3V, 128 cells in each dimension, quadratic-spline interpolation.

24 PIConGPU Efficiency on Various Architectures

25 Conclusion. Alpaka: uniform zero-overhead C++ interface to various many-core programming models; supports acceleration of all major multi- and many-core architectures. cupla: CUDA-like interface to Alpaka; very fast porting of existing C++ CUDA code. PICon(GPU, PPC, x86, ARM, MIC): first prototype was ported from CUDA to cupla within 2 weeks; execution of KHI simulation on CPU and GPU architectures; switch between back-ends with a simple CMake variable.

26 Clone us from GitHub git clone git clone git clone Alpaka paper pre-print:

Alpaka: An Abstraction Library for Parallel Kernel Acceleration. arXiv:1602.08477v1 [cs.DC], 26 Feb 2016. Erik Zenker 1,2, Benjamin Worpitz 1,2, René Widera 1, Axel Huebl 1,2, Guido Juckeland 1,2, Andreas Knüpfer