Finite Element Integration and Assembly on Modern Multi and Many-core Processors

1 Finite Element Integration and Assembly on Modern Multi and Many-core Processors
Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń, AGH University of Science and Technology, Mickiewicza 30, Kraków, Poland
Filip Krużel, Cracow University of Technology, Warszawska 24, Kraków, Poland
ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 1/34

2 Outline
1. Motivation - processor architectures
2.
3.

3 Motivation - processor architectures
Current computer architectures:
- clusters with homogeneous or heterogeneous nodes
- multicore processors with vector capabilities
- manycore accelerators (including GPUs)

4 Motivation - processor architectures
The principal question:
- what is the best processor architecture for running FEM codes?
- are the new architectures (GPUs, MIC) worth the investment?
[Figure: CPU core and GPU core]

5 Motivation - processor architectures
Disadvantages of new architectures:
- a programming model more complex than the traditional one
- different optimization strategies
- price

6 Motivation - processor architectures
GPU (and MIC) advantages:
- higher floating point performance
- higher memory bandwidth

7 Architecture comparison
Architecture: Kepler / Pascal | Xeon Phi | Xeon (E5)
Processor: GK110 / GP P / / 2699v3
Year of introduction: 2013 / / / 2014
Number of multiprocessors/cores: 13 / (59) / 64 6 / 18
Number of SP SIMD lanes: 2496(x2) / 3584(x2) | 960(x2) / 2048(x2) | 48 / 288(x2)
Number of DP SIMD lanes: 832(x2) / 1792(x2) | 480(x2) / 1024(x2) | 24 / 144(x2)
Fast global memory size [GB]: 4.8 / 12 (16) | 8 / 16 (384) | 384 / 768
LLC memory size [MB]: 1.5 / 4.0 [L2] | 30 / 32 [L2] | 15 / 45 [L3]
Frequency [GHz]: 0.7 / / / 2.3
Performance characteristics
Peak SP performance [TFlops]: 3.52 / / / 1.3
Peak DP performance [TFlops]: 1.17 / / / 0.66
Benchmark (DGEMM) performance: 1.10 / ??? 0.84 / / 0.48
Peak memory bandwidth [GB/s]: 208 / 549 (732) 320 / > / 68
Benchmark (STREAM) bandwidth: 144 / ??? | 171 / 480 (85) | 33 / 58
Machine balance [DP flops/access]: 45 / 58 (44) | 25 / <43 | 18 / 77
Benchmark machine balance: 61 / ??? 39 / / 66
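The machine balance rows appear to follow from the performance and bandwidth rows: dividing the DP rate by the number of 8-byte accesses per second reproduces the values that survive in the table (for example 1.17 TFlops and 208 GB/s give 45 for Kepler). The formula is inferred from those numbers rather than stated on the slide, and the function name below is only illustrative. A minimal sketch in C:

```c
#include <assert.h>

/* Machine balance: the number of DP floating point operations a processor
 * can perform in the time needed to move one double (8 bytes) to or from
 * main memory. Assumption: this is the definition behind the table row
 * "Machine balance [DP flops/access]"; it matches the surviving values. */
double machine_balance(double gflops, double bandwidth_gb_s)
{
    assert(bandwidth_gb_s > 0.0);
    const double bytes_per_access = 8.0;   /* sizeof(double) */
    double gaccesses_per_s = bandwidth_gb_s / bytes_per_access;
    return gflops / gaccesses_per_s;
}
```

With the peak values from the table, `machine_balance(1170.0, 208.0)` gives the Kepler entry and `machine_balance(660.0, 68.0)` the Xeon 2699v3 entry; feeding in DGEMM and STREAM results instead gives the benchmark machine balance.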

8 Architecture comparison
Architecture: Kepler / Pascal | Xeon Phi | Xeon (E5)
Processor: GK110 / GP P / / 2699v3
Multiprocessor/core characteristics
Number of SP SIMD lanes: 192 (x2) / 64 (x2) | 16 (x2) / 32 (x2) | 8 / 16 (x2)
Number of DP SIMD lanes: 64 (x2) / 32 (x2) | 8 (x2) / 16 (x2) | 4 / 8 (x2)
Shared memory / L2 size [KB]: 16 or 48 / / / 256
L1 cache memory size [KB]: 48 or 16 / 24(?) 32 / / 32
Number of 32 bit registers: / >2048 / > / 1680
Resources per single SP SIMD lane (+latency hiding?)
Number of SP registers: 341 / (x4) / 32 (x4) 16 (x2) / 16 (x2)
Number of SP entries in SM/L1: 64 / / / 512
Number of SP entries in L2 cache: 131 / / / 4096
Resources per single DP SIMD lane (+latency hiding?)
Number of DP registers: 512 / (x4) / 32 (x4) 16 (x2) / 16 (x2)
Number of DP entries in SM/L1: 96 / / / 512
Number of DP entries in L2: 196 / / / 4096

9 Two phases:
- creation of the system of linear equations (integration and assembly)
- linear system solution (direct or iterative)

10 The creation of FEM systems of linear equations
Finite element integration and assembly

11 Finite element integration and assembly
for e = 1 to N_E do
    read input data and initialize output arrays A^e and b^e
    for i_Q = 1 to N_Q do
        compute auxiliary terms (vol) and arrays (φ, c and d)
        for i_S = 1 to N_S do
            for j_S = 1 to N_S do
                update A^e[i_S][j_S] using vol[i_Q], c[i_Q], φ[i_S][i_Q], φ[j_S][i_Q]
                if (i_S == j_S) then
                    update b^e[i_S] using vol[i_Q], d[i_D][i_Q], φ[i_S][i_Q]
                end if
            end for
        end for
    end for
    assemble A^e and b^e into the global arrays
end for
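The loop structure of the algorithm can be made concrete for the simplest possible case. The sketch below is not the authors' kernel: it assumes a 1D linear element of length h, a 2-point Gauss rule on the reference interval [0,1], and constant coefficients c and d, so that A^e becomes a scaled mass matrix and b^e a scaled load vector. The function name and the macros are illustrative.

```c
#include <assert.h>

#define N_Q 2   /* integration points (assumed 2-point Gauss rule) */
#define N_S 2   /* shape functions (assumed linear 1D element) */

/* Integrate A_e[i][j] = int c*phi_i*phi_j and b_e[i] = int d*phi_i over one
 * element of length h, keeping the loop order of the slide's algorithm:
 * integration points outermost, then the double loop over shape functions. */
void integrate_element(double h, double c, double d,
                       double A_e[N_S][N_S], double b_e[N_S])
{
    assert(h > 0.0);
    const double g = 0.28867513459481288;        /* 1 / (2*sqrt(3)) */
    const double xi[N_Q] = { 0.5 - g, 0.5 + g }; /* Gauss points on [0,1] */
    const double w[N_Q]  = { 0.5, 0.5 };         /* Gauss weights on [0,1] */

    /* initialize output arrays */
    for (int is = 0; is < N_S; is++) {
        b_e[is] = 0.0;
        for (int js = 0; js < N_S; js++)
            A_e[is][js] = 0.0;
    }

    for (int iq = 0; iq < N_Q; iq++) {
        /* auxiliary terms: volume element and shape function values */
        double vol = w[iq] * h;                  /* weight times Jacobian */
        double phi[N_S] = { 1.0 - xi[iq], xi[iq] };

        for (int is = 0; is < N_S; is++) {
            for (int js = 0; js < N_S; js++) {
                A_e[is][js] += vol * c * phi[is] * phi[js];
                if (is == js)
                    b_e[is] += vol * d * phi[is];
            }
        }
    }
}
```

For h = c = d = 1 this yields the familiar values 1/3 on the diagonal and 1/6 off the diagonal of A_e, and 1/2 in each entry of b_e. Assembly into the global arrays is left out of the sketch.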

12 Computational complexity: the dependence on the order of approximation (an example for discontinuous Galerkin approximation)
[Figure: execution time [s] and number of operations [Mflop] versus degree of approximation; series: elements integration, faces integration, preconditioner set-up, iterations, total time]

13 Finite element integration and assembly
Size of input, output and auxiliary arrays for numerical integration
[Table: the size of arrays versus degree of approximation p, including ξ_Q, w_Q, φ, c, d, A^e and b^e]

14 Finite element integration and assembly
Computational and memory complexity: the dependence on the order of approximation (an example for discontinuous Galerkin approximation)

15 Offloading model
- OpenCL devices with PCIe
- APU with unified memory

16 Available resources for application programmers:
Processing environment:
- multi-threading
  - standard, for multi-core - Xeon
  - massive (also for latency hiding) - GPUs, Xeon Phi
- vectorization
  - wide SIMD units - Xeon (8, 16)
  - wider SIMD units - GPUs (32), Xeon Phi (16, 32)
- different communication and synchronization mechanisms

17 Available resources for application programmers:
Memory hierarchy:
- large, latency optimized DRAM memory
  - below 150 GB/s (for 2-socket configurations)
- fast global memory (MCDRAM, HBM) - GPU and MIC
  - above 400 GB/s for single processor
- shared memory, caches
  - smaller, explicitly managed - GPUs with CUDA, OpenCL
  - larger, implicitly managed - Xeon, Xeon Phi
- registers
  - large number, explicitly managed - GPUs with CUDA, OpenCL
  - implicitly managed - Xeon, Xeon Phi

18 Programming models:
- Standard - OpenMP:
  - parallel loops
  - implicit management of data placement
- GPU oriented - CUDA, OpenCL (can be used as well for x86)
  - explicit and implicit thread organization (warps, threadblocks, grid)
  - memory hierarchy (registers, shared, global)

19 Parallelization strategies
One element one thread
- loop over elements parallelized
- small resource requirements for low orders of approximation
  - possible explicit placement of some arrays in the shared memory to speed up calculations for CUDA and OpenCL
- large resource requirements for higher orders of approximation
  - can be handled by the flexible memory hierarchy of Xeons
  - prevent OpenCL kernels from executing on GPUs
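For a CPU, the one element one thread strategy amounts to an OpenMP parallel loop over elements in which every iteration works on its own private element array. The sketch below illustrates only that structure: `element_matrix` is a hypothetical stand-in for the real integration routine (it returns the exact mass matrix of a linear 1D element), and the results are written to per-element storage rather than assembled, so no two threads touch the same memory.

```c
#include <assert.h>
#include <stddef.h>

#define N_S 2   /* shape functions per element (assumed linear 1D case) */

/* Hypothetical stand-in for element integration: exact mass matrix of a
 * linear 1D element of length h. */
static void element_matrix(double h, double A_e[N_S][N_S])
{
    A_e[0][0] = A_e[1][1] = h / 3.0;
    A_e[0][1] = A_e[1][0] = h / 6.0;
}

/* One element - one thread: the loop over elements is parallelized and each
 * iteration keeps its element array in private (stack) storage. The output
 * A_out holds n_elems consecutive N_S x N_S blocks in row-major order. */
void integrate_all(size_t n_elems, const double *h, double *A_out)
{
    assert(h != NULL && A_out != NULL);
    #pragma omp parallel for
    for (long e = 0; e < (long)n_elems; e++) {
        double A_e[N_S][N_S];              /* private to this iteration */
        element_matrix(h[e], A_e);
        for (int i = 0; i < N_S; i++)
            for (int j = 0; j < N_S; j++)
                A_out[(e * N_S + i) * N_S + j] = A_e[i][j];
    }
}
```

The private array grows as the square of the number of shape functions, which is why the slide notes that higher orders fit the cache hierarchy of a Xeon but can exceed the per-thread resources of a GPU. Compiled without OpenMP support, the pragma is ignored and the loop runs serially with the same result.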

20 Parallelization strategies
One element several threads
- loop over elements parallelized
- additionally loops over the entries of output arrays parallelized
  - in domain decomposition manner, with no dependencies
- the number of threads usually up to the size of threadblocks for CUDA and OpenCL
  - sufficient shared memory resources for GPUs
  - for high orders even several threadblocks can operate on a single element (or an additional loop over parts of the output arrays is introduced)
- serial fraction associated with some auxiliary calculations at integration points
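The decomposition of the output array among the threads working on one element can be sketched as follows. The sketch assumes the entries of A^e are flattened into a single index k and dealt out cyclically, which is one common choice and not necessarily the one used by the authors; `thread_update`, `tid` and `n_threads` are illustrative names, and the function describes the work of a single thread so it can be run serially for checking.

```c
#include <assert.h>

#define N_Q 2   /* integration points (illustrative size) */
#define N_S 4   /* shape functions (illustrative size) */

/* Work of thread `tid` out of `n_threads` cooperating on one element:
 * it computes entries k = tid, tid + n_threads, ... of the flattened
 * N_S x N_S array. Every entry belongs to exactly one thread, so there
 * are no write conflicts and no dependencies between threads. */
void thread_update(int tid, int n_threads,
                   const double vol[N_Q], const double phi[N_S][N_Q],
                   double A_e[N_S * N_S])
{
    assert(n_threads > 0 && tid >= 0 && tid < n_threads);
    for (int k = tid; k < N_S * N_S; k += n_threads) {
        int is = k / N_S;                  /* row index of the entry */
        int js = k % N_S;                  /* column index of the entry */
        double sum = 0.0;
        for (int iq = 0; iq < N_Q; iq++)
            sum += vol[iq] * phi[is][iq] * phi[js][iq];
        A_e[k] = sum;
    }
}
```

In a CUDA or OpenCL kernel `tid` would be the local thread index and `n_threads` the threadblock (work-group) size; the arrays `vol` and `phi` are the auxiliary data whose computation forms the serial fraction mentioned on the slide.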

21 Parallelization strategies
One element two kernels strategy
- kernel 1: auxiliary terms calculations
  - loop over elements parallelized
  - loop over integration points parallelized
  - auxiliary terms calculated and stored in global memory
- kernel 2: actual calculations of element arrays
  - loop over elements parallelized
  - additionally loops over the entries of output arrays parallelized in the same way as for one element several threads strategy
  - no serial fraction
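The two-kernel split can be illustrated with two plain C functions standing in for the kernels. This is a sketch under simplifying assumptions, not the authors' code: the only auxiliary term is vol (weight times Jacobian) for a 1D element of length h with a 2-point Gauss rule, the shape function values `phi` are the same for all elements, and the "global memory" is an ordinary array. Each loop iteration corresponds to one GPU thread.

```c
#include <assert.h>

#define N_Q 2   /* integration points (assumed 2-point Gauss rule) */
#define N_S 2   /* shape functions (assumed linear 1D element) */

/* Kernel 1: one "thread" per (element, integration point) pair computes the
 * auxiliary term vol and stores it in the global array vol[e * N_Q + iq]. */
void kernel_aux(int n_elems, const double *h, double *vol)
{
    assert(n_elems >= 0);
    const double w[N_Q] = { 0.5, 0.5 };    /* Gauss weights on [0,1] */
    #pragma omp parallel for
    for (int t = 0; t < n_elems * N_Q; t++) {
        int e = t / N_Q, iq = t % N_Q;
        vol[t] = w[iq] * h[e];
    }
}

/* Kernel 2: one "thread" per (element, output entry) pair. It only reads
 * the stored auxiliary terms, so no part of the work is serial. */
void kernel_entries(int n_elems, const double *vol,
                    const double phi[N_S][N_Q], double *A)
{
    assert(n_elems >= 0);
    #pragma omp parallel for
    for (int t = 0; t < n_elems * N_S * N_S; t++) {
        int e = t / (N_S * N_S), k = t % (N_S * N_S);
        int is = k / N_S, js = k % N_S;
        double sum = 0.0;
        for (int iq = 0; iq < N_Q; iq++)
            sum += vol[e * N_Q + iq] * phi[is][iq] * phi[js][iq];
        A[t] = sum;
    }
}
```

The price of removing the serial fraction is the extra traffic: the auxiliary terms are written to and read back from global memory between the two kernels.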

22 Computational example: elasticity problem
- Higher order approximation
  - numerical integration for several degrees p (from 2 to 7)
  - parallelization for single element based on data decomposition for output arrays
  - parallelization of double loop over shape functions
  - different options for placement of auxiliary arrays
- Linux, C, OpenCL
- several processor architectures (with the same, portable OpenCL kernels):
  - Intel Xeon E
  - AMD Radeon HD5870 (Cypress) and HD7950 (Tahiti PRO)
  - Nvidia GeForce GTX580 (Fermi) and Tesla M2075 (Fermi)

23 Performance results for elasticity problem and p=3

24 Performance results for elasticity problem and p=3

25 Performance results for elasticity problem and p=5

26 Performance results for elasticity problem and p=5

27 Computational example: convection-diffusion problem
- First order approximation
  - simple Poisson (Laplace) problem and more computationally intensive conv-diff problem
  - prismatic and tetrahedral elements
  - one element one thread parallelization strategy
  - different options for placement of auxiliary arrays
- Linux, C, OpenCL
- several processor architectures (with the same, portable OpenCL kernels):
  - Intel Xeon E (in dual socket configuration)
  - Intel Xeon Phi 5110P (Knights Corner)
  - Nvidia Tesla K20M (Kepler)

28 Performance results for convection diffusion problem
[Figure: execution time [ns] for Poisson - Tetra, Poisson - Prism, Conv-Diff - Tetra and Conv-Diff - Prism; processors: Tesla K20m, Xeon Phi 5110P, Xeon E]

29 Performance results for convection diffusion problem
[Figure: performance in GB/s and in GFLOPS for Poisson - Tetra, Poisson - Prism, Conv-Diff - Tetra and Conv-Diff - Prism; processors: Tesla K20m, Xeon Phi 5110P, Xeon E]

30 Performance results for convection diffusion problem
[Figure: performance as percentage of the benchmark maximum for Poisson - Tetra, Poisson - Prism, Conv-Diff - Tetra and Conv-Diff - Prism; processors: Tesla K20m, Xeon Phi 5110P, Xeon E]

31 Computational example: convection-diffusion problem
Code optimizations for GPUs
- classical optimizations (automatic)
  - loop invariant code motion, common subexpression elimination, loop unrolling, induction variable simplification, etc.
- compiler directives based parameter tuning
  - placing different arrays in different levels of memory hierarchy
  - large number of options
  - automatic testing of the parameter space
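Automatic testing of the parameter space can be as simple as an exhaustive search when every tuning option is binary, for example "place array i in shared memory or not". The sketch below assumes exactly that: options are bits of a mask, and a caller-supplied callback builds and times the kernel variant for a given mask. Both `best_configuration` and the synthetic `toy_cost` function are hypothetical; `toy_cost` only stands in for a real measurement so the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Try every combination of n_opts binary tuning options (bit i set means
 * option i is enabled) and return the mask with the lowest measured cost.
 * `measure` is a hypothetical callback that would compile the kernel with
 * the matching compiler directives and return its execution time. */
unsigned best_configuration(int n_opts, double (*measure)(unsigned mask),
                            double *best_cost)
{
    assert(n_opts > 0 && n_opts < 16 && measure != NULL);
    unsigned best = 0;
    double best_t = measure(0);
    for (unsigned mask = 1; mask < (1u << n_opts); mask++) {
        double t = measure(mask);
        if (t < best_t) {
            best_t = t;
            best = mask;
        }
    }
    if (best_cost != NULL)
        *best_cost = best_t;
    return best;
}

/* Synthetic cost model used only for illustration: options 0 and 2 reduce
 * the run time, option 1 increases it. Not measured data. */
double toy_cost(unsigned mask)
{
    double t = 10.0;
    if (mask & 1u) t -= 3.0;
    if (mask & 2u) t += 4.0;
    if (mask & 4u) t -= 2.0;
    return t;
}
```

The search visits 2^n variants, which is why the slide stresses the large number of options; for more than a handful of parameters a real tuner would need to prune or sample the space.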

32 Parameter based performance tuning

33 Parameter based performance tuning

34 Thank you.
