Finite Element Integration and Assembly on Modern Multi- and Many-core Processors

Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń
AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków, Poland

Filip Krużel
Cracow University of Technology, Warszawska 24, 31-155 Kraków, Poland

pobanas@cyf-kr.edu.pl, janbielanski@agh.edu.pl, chlon@agh.edu.pl, fkruzel@pk.edu.pl

ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 1/34
Outline

1 Motivation - processor architectures
Motivation - processor architectures

Current computer architectures:
- clusters with homogeneous or heterogeneous nodes
- multicore processors with vector capabilities
- manycore accelerators (including GPUs)
Motivation - processor architectures

The principal question:
- what is the best processor architecture for running FEM codes?
- are the new architectures (GPUs, MIC) worth the investment?

[Figure: schematic comparison of a CPU core and a GPU core]
Motivation - processor architectures

Disadvantages of the new architectures:
- programming model more complex than the traditional one
- different optimization strategies
- price
Motivation - processor architectures

GPU (and MIC) advantages:
- higher floating point performance
- higher memory bandwidth
Architecture comparison

Architecture                            Kepler / Pascal        Xeon Phi             Xeon (E5)
Processor                               GK110 / GP100          5110P / 7230         2620 / 2699v3
Year of introduction                    2013 / 2016            2012 / 2016          2012 / 2014
Number of multiprocessors/cores         13 / 56                60 (59) / 64         6 / 18
Number of SP SIMD lanes                 2496(x2) / 3584(x2)    960(x2) / 2048(x2)   48 / 288(x2)
Number of DP SIMD lanes                 832(x2) / 1792(x2)     480(x2) / 1024(x2)   24 / 144(x2)
Fast global memory size [GB]            4.8 / 12 (16)          8 / 16 (384)         384 / 768
LLC memory size [MB]                    1.5 / 4.0 [L2]         30 / 32 [L2]         15 / 45 [L3]
Frequency [GHz]                         0.7 / 1.1              1.0 / 1.3            2.5 / 2.3

Performance characteristics
Peak SP performance [TFlops]            3.52 / 8               2.02 / 5.3           0.24 / 1.3
Peak DP performance [TFlops]            1.17 / 4               1.01 / 2.6           0.10 / 0.66
Benchmark (DGEMM) performance [TFlops]  1.10 / ???             0.84 / 1.9           0.09 / 0.48
Peak memory bandwidth [GB/s]            208 / 549 (732)        320 / > 400          42.6 / 68
Benchmark (STREAM) bandwidth [GB/s]     144 / ???              171 / 480 (85)       33 / 58
Machine balance [DP flops/access]       45 / 58 (44)           25 / <43             18 / 77
Benchmark machine balance               61 / ???               39 / 32              21 / 66
Architecture comparison

Architecture                        Kepler / Pascal       Xeon Phi            Xeon (E5)
Processor                           GK110 / GP100         5110P / 7230        2620 / 2699v3

Multiprocessor/core characteristics
Number of SP SIMD lanes             192 (x2) / 64 (x2)    16 (x2) / 32 (x2)   8 / 16 (x2)
Number of DP SIMD lanes             64 (x2) / 32 (x2)     8 (x2) / 16 (x2)    4 / 8 (x2)
Shared memory / L2 size [KB]        16 or 48 / 64         512 / 512           256 / 256
L1 cache memory size [KB]           48 or 16 / 24(?)      32 / 32             32 / 32
Number of 32 bit registers          65536 / 65536         >2048 / >2560       1472 / 1680

Resources per single SP SIMD lane (+ latency hiding?)
Number of SP registers              341 / 1024            32 (x4) / 32 (x4)   16 (x2) / 16 (x2)
Number of SP entries in SM/L1       64 / 256              512 / 256           1024 / 512
Number of SP entries in L2 cache    131 / 292             8192 / 4096         8192 / 4096

Resources per single DP SIMD lane (+ latency hiding?)
Number of DP registers              512 / 1024            32 (x4) / 32 (x4)   16 (x2) / 16 (x2)
Number of DP entries in SM/L1       96 / 256              512 / 256           1024 / 512
Number of DP entries in L2          196 / 292             8192 / 4096         8192 / 4096
Two phases:
- creation of the system of linear equations (integration and assembly)
- linear system solution (direct or iterative)
The creation of FEM systems of linear equations

Finite element integration and assembly
Finite element integration and assembly

for e = 1 to N_E do
    read input data and initialize output arrays A_e and b_e
    for i_Q = 1 to N_Q do
        compute auxiliary terms (vol) and arrays (φ, c and d)
        for i_S = 1 to N_S do
            for j_S = 1 to N_S do
                update A_e[i_S][j_S] using vol[i_Q], c[i_Q], φ[i_S][i_Q], φ[j_S][i_Q]
                if i_S == j_S then
                    update b_e[i_S] using vol[i_Q], d[i_S][i_Q], φ[i_S][i_Q]
                end if
            end for
        end for
    end for
    assemble A_e and b_e into the global arrays
end for
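The loop nest above can be sketched in serial C. This is only an illustration of the loop structure: the sizes N_S and N_Q, the array contents and the update formulas are placeholders, not the actual weak-form terms; the b_e update is hoisted out of the innermost loop, which is equivalent to the i_S == j_S guard in the pseudocode.

```c
#include <assert.h>

#define N_S 4   /* number of shape functions per element (placeholder) */
#define N_Q 4   /* number of integration points per element (placeholder) */

/* One-element integration: at every integration point i_Q,
   A_e[i][j] += vol * c * phi_i * phi_j  and  b_e[i] += vol * d_i * phi_i. */
void integrate_element(const double vol[N_Q], const double c[N_Q],
                       double d[N_S][N_Q], double phi[N_S][N_Q],
                       double A_e[N_S][N_S], double b_e[N_S])
{
    for (int iq = 0; iq < N_Q; ++iq)
        for (int is = 0; is < N_S; ++is) {
            for (int js = 0; js < N_S; ++js)
                A_e[is][js] += vol[iq] * c[iq] * phi[is][iq] * phi[js][iq];
            b_e[is] += vol[iq] * d[is][iq] * phi[is][iq];
        }
}
```

In the real code the innermost updates involve the problem coefficients and shape function derivatives; the point of the sketch is the iteration order (integration points outermost within an element), which determines which arrays must be kept live across loop iterations.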
Computational complexity: the dependence on the order of approximation - an example for discontinuous Galerkin approximation

[Figure: execution time [s] and number of operations [Mflop] vs degree of approximation (1 to 6) for elements integration, faces integration, preconditioner set-up, iterations and total time]
Finite element integration and assembly

Size of input, output and auxiliary arrays for numerical integration

The size of arrays              Degree of approximation p
                                1      2      3      4      5
ξ_Q, w_Q                        24     72     192    320    600
φ                               24     72     160    300    504
max c, d                        20     20     20     20     20
Total ∂x/∂ξ, det(∂x/∂ξ)         60     180    480    800    1500
Total φ                         144    1296   7680   24000  75600
Total max c, d                  120    360    960    1600   3000
A_e, b_e                        42     342    1640   5700   16002
Finite element integration and assembly

Computational and memory complexity: the dependence on the order of approximation - an example for discontinuous Galerkin approximation
Offloading model

- OpenCL devices with PCIe
- APU with unified memory
Available resources for application programmers:

Processing environment:
- multi-threading for multi-core
  - standard - Xeon
  - massive (also for latency hiding) - GPUs, Xeon Phi
- vectorization
  - wide SIMD units - Xeon (8, 16)
  - wider SIMD units - GPUs (32), Xeon Phi (16, 32)
- different communication and synchronization mechanisms
Available resources for application programmers:

Memory hierarchy:
- large, latency optimized DRAM memory
  - below 150 GB/s (for 2-socket configurations)
- fast global memory (MCDRAM, HBM) - GPU and MIC
  - above 400 GB/s for a single processor
- shared memory, caches
  - smaller, explicitly managed - GPUs with CUDA, OpenCL
  - larger, implicitly managed - Xeon, Xeon Phi
- registers
  - large number, explicitly managed - GPUs with CUDA, OpenCL
  - implicitly managed - Xeon, Xeon Phi
Programming models:
- standard - OpenMP:
  - parallel loops
  - implicit management of data placement
- GPU oriented - CUDA, OpenCL (can be used as well for x86):
  - explicit and implicit thread organization (warps, threadblocks, grid)
  - memory hierarchy (registers, shared, global)
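The standard OpenMP model mentioned above can be sketched as a single parallel loop over elements; the element computation here is a placeholder (a trivial formula standing in for integrate-and-assemble), and data placement is left entirely to the runtime:

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>   /* the pragma below is simply ignored without OpenMP */
#endif

#define N_E 1000   /* number of elements (placeholder value) */

/* One worksharing loop over elements: OpenMP distributes the
   iterations among threads; no explicit memory management. */
void process_all_elements(double result[N_E])
{
    #pragma omp parallel for
    for (int e = 0; e < N_E; ++e)
        result[e] = 2.0 * e;   /* stands in for per-element integration */
}
```

This is the key contrast with CUDA/OpenCL: the same loop there becomes a kernel launch with an explicit grid of threadblocks, and the programmer decides which arrays go to registers, shared or global memory.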
Parallelization strategies - one element, one thread

- loop over elements parallelized
- small resource requirements for low orders of approximation
  - possible explicit placement of some arrays in shared memory to speed up calculations for CUDA and OpenCL
- large resource requirements for higher orders of approximation
  - can be handled by the flexible memory hierarchy of Xeons
  - can prevent OpenCL kernels from executing on GPUs
Parallelization strategies - one element, several threads

- loop over elements parallelized
- additionally, loops over the entries of output arrays parallelized in a domain decomposition manner, with no dependencies
- the number of threads usually up to the size of threadblocks for CUDA and OpenCL
  - sufficient shared memory resources for GPUs
- for high orders even several threadblocks can operate on a single element (or an additional loop over parts of the output arrays is introduced)
- serial fraction associated with some auxiliary calculations at integration points
Parallelization strategies - one element, two kernels

- kernel 1: auxiliary terms calculations
  - loop over elements parallelized
  - loop over integration points parallelized
  - auxiliary terms calculated and stored in global memory
- kernel 2: actual calculations of element arrays
  - loop over elements parallelized
  - additionally, loops over the entries of output arrays parallelized in the same way as for the one element, several threads strategy
- no serial fraction
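The two-kernel strategy can be sketched in plain C, with the two kernels modelled as functions whose loops would be the parallelized dimensions, and with a per-(element, point) buffer standing in for the global-memory array of auxiliary terms. All sizes and formulas are placeholders for illustration:

```c
#include <assert.h>

#define N_E 2   /* elements (placeholder) */
#define N_Q 3   /* integration points per element (placeholder) */
#define N_S 2   /* shape functions per element (placeholder) */

/* Kernel 1 analogue: both loops (elements and integration points)
   are parallel; results go to "global memory" (the aux buffer). */
void kernel1_aux_terms(double aux[N_E][N_Q])
{
    for (int e = 0; e < N_E; ++e)
        for (int iq = 0; iq < N_Q; ++iq)
            aux[e][iq] = 1.0 + e;   /* stands in for vol, c, d at a point */
}

/* Kernel 2 analogue: loops over elements and over output-array
   entries are parallel; the stored terms are only read. */
void kernel2_element_arrays(double aux[N_E][N_Q], double A[N_E][N_S][N_S])
{
    for (int e = 0; e < N_E; ++e)
        for (int is = 0; is < N_S; ++is)
            for (int js = 0; js < N_S; ++js) {
                double s = 0.0;
                for (int iq = 0; iq < N_Q; ++iq)
                    s += aux[e][iq];   /* placeholder entry update */
                A[e][is][js] = s;
            }
}
```

Splitting the work this way removes the serial fraction of the one element, several threads strategy, at the cost of a round trip of the auxiliary terms through global memory between the two kernel launches.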
Computational example: elasticity problem

- higher order approximation
  - numerical integration for several degrees p (from 2 to 7)
- parallelization for a single element based on data decomposition for output arrays
  - parallelization of the double loop over shape functions
- different options for placement of auxiliary arrays
- Linux, C, OpenCL
- several processor architectures (with the same, portable OpenCL kernels):
  - Intel Xeon E5-2670
  - AMD Radeon HD5870 (Cypress) and HD7950 (Tahiti PRO)
  - Nvidia GeForce GTX580 (Fermi) and Tesla M2075 (Fermi)
Performance results for elasticity problem and p=3

Performance results for elasticity problem and p=5
Computational example: convection-diffusion problem

- first order approximation
  - simple Poisson (Laplace) problem and a more computationally intensive conv-diff problem
  - prismatic and tetrahedral elements
- one element, one thread parallelization strategy
- different options for placement of auxiliary arrays
- Linux, C, OpenCL
- several processor architectures (with the same, portable OpenCL kernels):
  - Intel Xeon E5-2620 (in dual socket configuration)
  - Intel Xeon Phi 5110P (Knights Corner)
  - Nvidia Tesla K20M (Kepler)
Performance results for convection diffusion problem

[Figure: execution time [ns] on Tesla K20m, Xeon Phi 5110P and Xeon E5-2620 for the Poisson - Tetra, Poisson - Prism, Conv-Diff - Tetra and Conv-Diff - Prism test cases]
Performance results for convection diffusion problem

[Figure: memory bandwidth [GB/s] for the Poisson - Tetra and Conv-Diff - Tetra cases, and performance [GFLOPS] for the Poisson - Prism and Conv-Diff - Prism cases, on Tesla K20m, Xeon Phi 5110P and Xeon E5-2620]
Performance results for convection diffusion problem

[Figure: performance as a percentage of the benchmark maximum on Tesla K20m, Xeon Phi 5110P and Xeon E5-2620 for the Poisson - Tetra, Poisson - Prism, Conv-Diff - Tetra and Conv-Diff - Prism test cases]
Computational example: convection-diffusion problem - code optimizations for GPUs

- classical optimizations (automatic)
  - loop invariant code motion, common subexpression elimination, loop unrolling, induction variable simplification, etc.
- compiler directives based parameter tuning
  - placing different arrays in different levels of the memory hierarchy
  - large number of options
  - automatic testing of the parameter space
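Automatic testing of the parameter space can be sketched as an exhaustive sweep. In this illustration the tunable parameters are hypothetical placement choices for three arrays across three memory levels, and the kernel timing is replaced by a toy cost model; in the real tuning each candidate configuration would be compiled and timed:

```c
#include <assert.h>

#define N_ARRAYS 3   /* hypothetical tunable arrays */
#define N_LEVELS 3   /* e.g. registers, shared, global (placeholder) */

/* Toy cost model standing in for an actual timed kernel run:
   here, lower memory levels are simply assumed faster. */
static double measure(int placement[N_ARRAYS])
{
    double t = 0.0;
    for (int a = 0; a < N_ARRAYS; ++a)
        t += (double)(placement[a] + 1);
    return t;
}

/* Exhaustive sweep of all placement combinations; returns the best
   "time" and writes the winning configuration to best[]. */
double tune_placements(int best[N_ARRAYS])
{
    double best_t = 1e30;
    for (int p0 = 0; p0 < N_LEVELS; ++p0)
        for (int p1 = 0; p1 < N_LEVELS; ++p1)
            for (int p2 = 0; p2 < N_LEVELS; ++p2) {
                int pl[N_ARRAYS] = {p0, p1, p2};
                double t = measure(pl);
                if (t < best_t) {
                    best_t = t;
                    for (int a = 0; a < N_ARRAYS; ++a)
                        best[a] = pl[a];
                }
            }
    return best_t;
}
```

With a large number of options the sweep grows combinatorially, which is why the tuning is driven automatically (e.g. by regenerating compiler directives per configuration) rather than by hand.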
Parameter based performance tuning
Thank you.