Boundary element quadrature schemes for multi- and many-core architectures


1 Boundary element quadrature schemes for multi- and many-core architectures. Jan Zapletal, Michal Merta, Lukáš Malý. IT4Innovations, Dept. of Applied Mathematics, VŠB-TU Ostrava. Intel MIC programming workshop, February 7, 2017.

2 Outline: 1 BEM4I (Boundary element method, OpenMP threading, OpenMP vectorization, Adaptive cross approximation); 2 Numerical experiments (Full assembly, ACA assembly); 3 Conclusion.

3 BEM4I

4 BEM4I: Boundary element method
BEM is an alternative to FEM for the solution of PDEs. BEM4I is a 3D boundary element library for the Laplace equation (heat transfer), the Helmholtz equation (wave scattering), the Lamé equation (linear elasticity), and the time-dependent wave equation.
Specifications: C++, interface to MKL or other BLAS/LAPACK, ACA for matrix sparsification.
Strategies: SIMD vectorization of surface integrals (OpenMP, Vc library); OpenMP threading over local element contributions / individual ACA blocks; MPI for distributed matrices; BETI with the ESPRESO library; Intel MIC offload/native execution.

8 BEM4I: Boundary element method
Boundary value problem for the Laplace equation:
$$-\Delta u = 0 \ \text{in}\ \Omega, \qquad u = f \ \text{on}\ \Gamma_D, \qquad \frac{\partial u}{\partial n} = g \ \text{on}\ \Gamma_N.$$
Representation formula for $x \in \Omega$:
$$u(x) = \frac{1}{4\pi} \int_{\partial\Omega} \frac{1}{\|x-y\|}\, \frac{\partial u}{\partial n}(y) \,\mathrm{d}s_y - \frac{1}{4\pi} \int_{\partial\Omega} \frac{\langle x-y,\, n(y)\rangle}{\|x-y\|^3}\, u(y) \,\mathrm{d}s_y.$$
Boundary integral equations for $x \in \partial\Omega$:
$$\frac{1}{4\pi} \int_{\partial\Omega} \frac{1}{\|x-y\|}\, \frac{\partial u}{\partial n}(y) \,\mathrm{d}s_y = \frac{1}{2}\, u(x) + \frac{1}{4\pi} \int_{\partial\Omega} \frac{\langle x-y,\, n(y)\rangle}{\|x-y\|^3}\, u(y) \,\mathrm{d}s_y,$$
$$-\frac{\partial}{\partial n_x}\, \frac{1}{4\pi} \int_{\partial\Omega} \frac{\langle x-y,\, n(y)\rangle}{\|x-y\|^3}\, u(y) \,\mathrm{d}s_y = \frac{1}{2}\, \frac{\partial u}{\partial n}(x) - \frac{1}{4\pi} \int_{\partial\Omega} \frac{\langle y-x,\, n(x)\rangle}{\|x-y\|^3}\, \frac{\partial u}{\partial n}(y) \,\mathrm{d}s_y.$$

9 BEM4I: Boundary element method
Discretization leads to the systems
$$V_h g = \left(\tfrac{1}{2} M_h + K_h\right) f, \qquad D_h f = \left(\tfrac{1}{2} M_h^\top - K_h^\top\right) g$$
with the matrices
$$V_h[l,k] := \frac{1}{4\pi} \int_{\tau_l} \int_{\tau_k} \frac{1}{\|x-y\|} \,\mathrm{d}s_y \,\mathrm{d}s_x,$$
$$K_h[l,i] := \frac{1}{4\pi} \int_{\tau_l} \int_{\partial\Omega} \frac{\langle x-y,\, n(y)\rangle}{\|x-y\|^3}\, \varphi_i(y) \,\mathrm{d}s_y \,\mathrm{d}s_x,$$
$$D_h[j,i] := \frac{1}{4\pi} \int_{\partial\Omega} \int_{\partial\Omega} \frac{\langle \operatorname{\mathbf{curl}}_\Gamma \varphi_i(y),\, \operatorname{\mathbf{curl}}_\Gamma \varphi_j(x)\rangle}{\|x-y\|} \,\mathrm{d}s_y \,\mathrm{d}s_x = T_h^\top \operatorname{diag}(V_h, V_h, V_h)\, T_h,$$
$$M_h[l,i] := \int_{\tau_l} \varphi_i(x) \,\mathrm{d}s_x.$$
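To make the Galerkin assembly concrete, here is a minimal sketch (not BEM4I code) that assembles V_h with a crude one-point (centroid) quadrature for well-separated element pairs. The Triangle type, the storage layout, and the skipping of singular pairs are illustrative assumptions only; identical and touching element pairs need the regularized Duffy-type quadrature discussed later.

    #include <array>
    #include <cmath>
    #include <vector>

    // Hypothetical flat triangle: three vertices in R^3.
    struct Triangle { std::array<double, 3> v0, v1, v2; };

    static std::array<double, 3> centroid( const Triangle & t ) {
      return { ( t.v0[ 0 ] + t.v1[ 0 ] + t.v2[ 0 ] ) / 3.0,
               ( t.v0[ 1 ] + t.v1[ 1 ] + t.v2[ 1 ] ) / 3.0,
               ( t.v0[ 2 ] + t.v1[ 2 ] + t.v2[ 2 ] ) / 3.0 };
    }

    static double area( const Triangle & t ) {
      double a1 = t.v1[ 0 ] - t.v0[ 0 ], a2 = t.v1[ 1 ] - t.v0[ 1 ], a3 = t.v1[ 2 ] - t.v0[ 2 ];
      double b1 = t.v2[ 0 ] - t.v0[ 0 ], b2 = t.v2[ 1 ] - t.v0[ 1 ], b3 = t.v2[ 2 ] - t.v0[ 2 ];
      double c1 = a2 * b3 - a3 * b2, c2 = a3 * b1 - a1 * b3, c3 = a1 * b2 - a2 * b1;
      return 0.5 * std::sqrt( c1 * c1 + c2 * c2 + c3 * c3 );
    }

    // One-point approximation of V_h[l,k] = 1/(4 pi) int int 1/|x-y| ds_y ds_x
    // for disjoint element pairs; singular pairs (here only l == k) are skipped
    // and left to a regularized quadrature scheme.
    std::vector<double> assembleVNaive( const std::vector<Triangle> & mesh ) {
      const int E = static_cast<int>( mesh.size( ) );
      std::vector<double> V( static_cast<std::size_t>( E ) * E, 0.0 );
      for ( int k = 0; k < E; ++k ) {       // columns
        for ( int l = 0; l < E; ++l ) {     // rows
          if ( l == k ) continue;           // singular pair: needs Duffy quadrature
          std::array<double, 3> x = centroid( mesh[ l ] ), y = centroid( mesh[ k ] );
          double d1 = x[ 0 ] - y[ 0 ], d2 = x[ 1 ] - y[ 1 ], d3 = x[ 2 ] - y[ 2 ];
          double dist = std::sqrt( d1 * d1 + d2 * d2 + d3 * d3 );
          V[ static_cast<std::size_t>( l ) * E + k ] =
              area( mesh[ l ] ) * area( mesh[ k ] ) / ( 4.0 * M_PI * dist );
        }
      }
      return V;
    }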

11 BEM4I: OpenMP threading
OpenMP threading for V_h:
    #pragma omp parallel for
    for ( int tau_k = 0; tau_k < E; ++tau_k ) {   // columns
      for ( int tau_l = 0; tau_l < E; ++tau_l ) { // rows
        SLIntegrator.getlocalmatrix( *tau_l, *tau_k, Vloc );
        V.set( *tau_l, *tau_k, Vloc.get( 0, 0 ) );
      }
    }
OpenMP threading for K_h (threads work on columns; neighbouring elements share nodes, so updates of K must be atomic):
    #pragma omp parallel for
    for ( int tau_k = 0; tau_k < E; ++tau_k ) {   // columns
      for ( int tau_l = 0; tau_l < E; ++tau_l ) { // rows
        DLIntegrator.getlocalmatrix( *tau_l, *tau_k, Kloc );
        // #pragma omp atomic ( inside of add_atomic )
        K.add_atomic( *tau_l, tau_k->node[ 0 ], Kloc.get( 0, 0 ) );
        K.add_atomic( *tau_l, tau_k->node[ 1 ], Kloc.get( 0, 1 ) );
        K.add_atomic( *tau_l, tau_k->node[ 2 ], Kloc.get( 0, 2 ) );
      }
    }
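The comment above refers to the atomic update being hidden inside add_atomic. The actual BEM4I implementation is not shown on the slides; as an illustration only, such a member function might look roughly like this for a dense column-major matrix:

    #include <cstddef>
    #include <vector>

    // Hypothetical dense column-major matrix with a thread-safe accumulation;
    // illustrates what the "#pragma omp atomic inside of add_atomic" comment means.
    struct FullMatrix {
      std::size_t nRows;
      std::vector<double> data;   // column-major storage, size nRows * nCols

      void add_atomic( std::size_t i, std::size_t j, double value ) {
        double & entry = data[ j * nRows + i ];
    #pragma omp atomic update
        entry += value;
      }
    };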

12 BEM4I: OpenMP threading
Alternatively, exchange the loops for K_h so that threads work on rows; each thread then owns its rows of K and no synchronization is needed:
    #pragma omp parallel for
    for ( int tau_l = 0; tau_l < E; ++tau_l ) {   // rows
      for ( int tau_k = 0; tau_k < E; ++tau_k ) { // columns
        DLIntegrator.getlocalmatrix( *tau_l, *tau_k, Kloc );
        K.add( *tau_l, tau_k->node[ 0 ], Kloc.get( 0, 0 ) );
        K.add( *tau_l, tau_k->node[ 1 ], Kloc.get( 0, 1 ) );
        K.add( *tau_l, tau_k->node[ 2 ], Kloc.get( 0, 2 ) );
      }
    }
Avoid #pragma omp critical, use #pragma omp atomic if applicable.

13 BEM4I: OpenMP vectorization
Single Instruction Multiple Data (SIMD): processes a vector with a single operation, provides data-level parallelism, elements are of the same type.
Vector length: 256 bits for AVX2 (Haswell), i.e. 4 DP operands; 512 bits for IMCI (KNC) and AVX-512 (KNL), i.e. 8 DP operands.
Vectorization achieved by: compiler auto-vectorization (code refactoring can help), OpenMP 4.0 pragmas (#pragma omp simd), intrinsic functions, a wrapper library (Vc), assembly. An intrinsics sketch is given below.
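For comparison with the portable #pragma omp simd route used in BEM4I, a hand-written AVX2 intrinsics version of a weighted reduction (the quadrature loop pattern used on the following slides) could look roughly as follows. This is a generic sketch, not BEM4I code; it assumes the trip count S is a multiple of 4 and that the arrays are 32-byte aligned.

    #include <immintrin.h>

    // Weighted sum  result = sum_l w[l] * v[l]  with AVX2 intrinsics
    // (4 doubles per 256-bit register). Assumes S % 4 == 0 and 32-byte
    // aligned arrays; otherwise a scalar remainder loop would be needed.
    double weightedSumAvx2( const double * w, const double * v, int S ) {
      __m256d acc = _mm256_setzero_pd( );
      for ( int l = 0; l < S; l += 4 ) {
        __m256d wv = _mm256_load_pd( w + l );
        __m256d vv = _mm256_load_pd( v + l );
        acc = _mm256_add_pd( acc, _mm256_mul_pd( wv, vv ) );   // acc += w * v
      }
      // horizontal reduction of the four partial sums
      alignas( 32 ) double tmp[ 4 ];
      _mm256_store_pd( tmp, acc );
      return tmp[ 0 ] + tmp[ 1 ] + tmp[ 2 ] + tmp[ 3 ];
    }

The OpenMP pragma achieves the same effect while staying portable across AVX2, IMCI, and AVX-512, which is why it is preferred in the slides that follow.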

15 BEM4I: OpenMP vectorization
Tuned vectorized integration over a square. Baseline version (quadrature nodes stored interleaved, AoS):
    double * w = new double[ S ];
    w[ 0 ] = ...;  // init. weights
    // x^1_1, x^1_2, ..., x^S_1, x^S_2
    double x[ ] = { ... };

    for ( int l = 0; l < S; ++l ) {
      result += w[ l ] * f( x[ 2 * l ], x[ 2 * l + 1 ] );
    }

    delete [] w;
Loop peeling & strided access to memory.

16 BEM4I: OpenMP vectorization
Tuned vectorized integration over a square. Adding the OpenMP SIMD pragma:
    double * w = new double[ S ];
    w[ 0 ] = ...;  // init. weights
    // x^1_1, x^1_2, ..., x^S_1, x^S_2
    double x[ ] = { ... };

    #pragma omp simd reduction( + : result )
    for ( int l = 0; l < S; ++l ) {
      result += w[ l ] * f( x[ 2 * l ], x[ 2 * l + 1 ] );
    }

    delete [] w;
Loop peeling & strided access to memory.

17 [Figure: loop peeling and strided memory access, AoS layout (peel / main body / remainder regions at unaligned addresses) vs. SoA layout.]

18 BEM4I: OpenMP vectorization
Tuned vectorized integration over a square. Aligned SoA version:
    double * w = (double *) _mm_malloc( S * sizeof( double ), 64 );
    w[ 0 ] = ...;  // init. weights
    // x^1_1, ..., x^S_1
    double x1[ ] __attribute__( ( aligned( 64 ) ) ) = { ... };
    // x^1_2, ..., x^S_2
    double x2[ ] __attribute__( ( aligned( 64 ) ) ) = { ... };

    __assume_aligned( w, 64 );  // tell compiler about alignment

    #pragma omp simd reduction( + : result )
    for ( int l = 0; l < S; ++l ) {
      result += w[ l ] * f( x1[ l ], x2[ l ] );
    }

    _mm_free( w );
Loop peeling & strided access to memory: the SoA layout with 64-byte aligned data gives unit-stride aligned loads and removes the peel loop.
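As a portable alternative to the Intel-specific _mm_malloc/__assume_aligned pair, one could (this is a generic sketch, not the BEM4I code) allocate aligned storage with C++17 and pass the alignment to the vectorizer through the OpenMP aligned clause:

    #include <cstdlib>

    // Aligned SoA buffers via C++17 std::aligned_alloc; alignment information
    // is handed to the vectorizer through the OpenMP "aligned" clause instead
    // of compiler-specific builtins.
    double integrate( int S, double ( *f )( double, double ) ) {
      // aligned_alloc requires the size to be a multiple of the alignment
      std::size_t bytes = ( ( S * sizeof( double ) + 63 ) / 64 ) * 64;
      double * w  = static_cast<double *>( std::aligned_alloc( 64, bytes ) );
      double * x1 = static_cast<double *>( std::aligned_alloc( 64, bytes ) );
      double * x2 = static_cast<double *>( std::aligned_alloc( 64, bytes ) );
      // ... fill w, x1, x2 with quadrature weights and nodes ...

      double result = 0.0;
    #pragma omp simd aligned( w, x1, x2 : 64 ) reduction( + : result )
      for ( int l = 0; l < S; ++l ) {
        result += w[ l ] * f( x1[ l ], x2[ l ] );
      }

      std::free( x1 );  std::free( x2 );  std::free( w );
      return result;
    }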

19 BEM4I: OpenMP vectorization
Duffy-type substitution for $\tau_l \times \tau_k$:
$$V_h[l,k] = \sum_s \int_0^1\!\int_0^1\!\int_0^1\!\int_0^1 k\bigl(F_s(\eta_1,\eta_2,\eta_3,\xi)\bigr)\, S_s(\eta_1,\eta_2,\eta_3,\xi) \,\mathrm{d}\eta_1 \,\mathrm{d}\eta_2 \,\mathrm{d}\eta_3 \,\mathrm{d}\xi$$
with
$$F_s\colon [0,1]^4 \to \tau_l \times \tau_k, \quad F_s(\eta_1,\eta_2,\eta_3,\xi) = (x,y), \quad S_s(\eta_1,\eta_2,\eta_3,\xi) \,\mathrm{d}\eta_1 \,\mathrm{d}\eta_2 \,\mathrm{d}\eta_3 \,\mathrm{d}\xi = \mathrm{d}s_x \,\mathrm{d}s_y.$$
Approximated by tensor Gauss quadrature:
$$V_h[l,k] \approx \sum_s \sum_m \sum_n \sum_o \sum_p w_m w_n w_o w_p\, k\bigl(F_s(x_m,x_n,x_o,x_p)\bigr)\, S_s(x_m,x_n,x_o,x_p).$$
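The collapsed loop on the next slide runs over a single index c that enumerates all S1*S2*S3*S4 tensor-product quadrature points of one simplex. A sketch of how the combined weights (the array called weights_jacv in the listing) could be precomputed once is given below; the 1D weight arrays and the flattening order are assumptions for illustration, and in the real code the Jacobian factor S_s would be folded into the same array (hence the _jac in the name).

    #include <vector>

    // Precompute the collapsed tensor-product Gauss weights
    //   weights[c] = w_m * w_n * w_o * w_p,  c = ((m*S2 + n)*S3 + o)*S4 + p,
    // so that the 4 nested quadrature loops can be fused into one SIMD loop.
    std::vector<double> collapseWeights( const std::vector<double> & w1,
                                         const std::vector<double> & w2,
                                         const std::vector<double> & w3,
                                         const std::vector<double> & w4 ) {
      const std::size_t S1 = w1.size( ), S2 = w2.size( ), S3 = w3.size( ), S4 = w4.size( );
      std::vector<double> weights( S1 * S2 * S3 * S4 );
      std::size_t c = 0;
      for ( std::size_t m = 0; m < S1; ++m )
        for ( std::size_t n = 0; n < S2; ++n )
          for ( std::size_t o = 0; o < S3; ++o )
            for ( std::size_t p = 0; p < S4; ++p )
              weights[ c++ ] = w1[ m ] * w2[ n ] * w3[ o ] * w4[ p ];
      return weights;
    }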

20 BEM4I: OpenMP vectorization
Collapsed integration loop in getlocalmatrix:
    __assume_aligned( x1ss, 64 );  // all data aligned
    ...
    switch ( type ) {
      case ( identicalelements ):
        for ( int simplex = 0; simplex < 6; ++simplex ) {

          reftotri( simplex, x1, ..., y3, x1ref, ..., y2ref, x1ss, ..., y3ss );

          #pragma omp simd reduction( + : entry )
          for ( int c = 0; c < S1 * S2 * S3 * S4; ++c ) {  // collapsed
            kernel = weights_jacv[ c ]
                * evalsinglelayerkernel( x1ss[ c ], x2ss[ c ], x3ss[ c ],
                                         y1ss[ c ], y2ss[ c ], y3ss[ c ] );
            entry += kernel;
          }
        }
        break;
      ...  // quadrature over other pairs of elements
    }

21 BEM4I: OpenMP vectorization
SIMD evaluation of quadrature points in reftotri:
    __assume_aligned( x1ss, 64 );  // all data aligned
    ...
    #pragma omp simd
    for ( int c = 0; c < S1 * S2 * S3 * S4; ++c ) {
      x1ss[ c ] = x1[ 0 ]
          + ( x2[ 0 ] - x1[ 0 ] ) * x1ref[ simplex ][ c ]
          + ( x3[ 0 ] - x1[ 0 ] ) * x2ref[ simplex ][ c ];
      ...  // compute x2ss, x3ss, y1ss, y2ss, y3ss
    }
SIMD evaluation of the kernel in evalsinglelayerkernel:
    #pragma omp declare simd
    double evalsinglelayerkernel(
        double x1, double x2, double x3,
        double y1, double y2, double y3
        ) const {

      double d1 = x1 - y1, d2 = x2 - y2, d3 = x3 - y3;
      double norm = sqrt( d1 * d1 + d2 * d2 + d3 * d3 );

      return ( 1.0 / ( norm * 4.0 * M_PI ) );
    }

23 BEM4I: OpenMP vectorization
Intel Advisor [screenshot].

24 BEM4I: OpenMP vectorization
Intel compiler reports: -qopt-report=5 -qopt-report-phase=vec
    LOOP BEGIN at BEIntegratorScalar.cpp(304,5)
      remark 15340: pragma supersedes previous setting [ BEIntegratorScalar.cpp(300,) ]
      remark 15388: vectorization support: reference x1ss[ i ] has aligned access [ BEIntegratorScalar.cpp(305,55) ]
      ...  // all data reported as aligned
      remark 15305: vectorization support: vector length 4
      remark 15309: vectorization support: normalized vectorization overhead
      remark 15301: OpenMP SIMD LOOP WAS VECTORIZED
      remark 15448: unmasked aligned unit stride loads: 6
      remark 15475: --- begin vector cost summary ---
      remark 15476: scalar cost: 97
      remark 15477: vector cost:
      remark 15478: estimated potential speedup:
      remark 15488: --- end vector cost summary ---
    LOOP END

25 BEM4I: Adaptive cross approximation
Adaptive cross approximation (ACA): complexity reduced from O(n^2) to O(n log n); the mesh is divided into clusters, non-admissible clusters are assembled in full, admissible clusters are approximated as C ≈ U * V^T, and the assembly of the blocks is distributed by OpenMP. A sketch of the low-rank approximation itself follows after the listing.
    #pragma omp parallel for
    for ( int i = 0; i < n_nonadmissible_blocks; ++i ) {
      getnonadmissibleblock( i, localblock );
      ACAMatrix.addnonadmissibleblock( i, localblock );
    }

    #pragma omp parallel for
    for ( int i = 0; i < n_admissible_blocks; ++i ) {
      getadmissibleblock( i, localblock );
      ACAMatrix.addadmissibleblock( i, localblock );
    }
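For illustration only (a generic textbook variant of ACA with partial pivoting, not the BEM4I implementation), an admissible m x n block given by an entry evaluator can be compressed into factors U (m x r) and V (n x r) with C ≈ U * V^T roughly as follows; the entry callback, the stopping heuristic, and the vector-of-vectors storage are assumptions for the sketch.

    #include <cmath>
    #include <functional>
    #include <vector>

    // ACA with partial pivoting: builds U and V, stored as sequences of column
    // vectors, such that the block C is approximated by U * V^T. "entry"
    // evaluates a single matrix entry C(i, j); eps is the relative stopping
    // tolerance, maxRank a hard limit on the rank.
    void acaPartialPivoting( int m, int n,
                             const std::function<double( int, int )> & entry,
                             double eps, int maxRank,
                             std::vector<std::vector<double>> & U,
                             std::vector<std::vector<double>> & V ) {
      U.clear( );  V.clear( );
      double normF2 = 0.0;           // squared Frobenius norm of the approximation
      int pivotRow = 0;
      std::vector<bool> usedRow( m, false );

      for ( int k = 0; k < maxRank; ++k ) {
        usedRow[ pivotRow ] = true;

        // residual of the pivot row: C(pivotRow, :) - sum_l U[l][pivotRow] * V[l][:]
        std::vector<double> v( n );
        for ( int j = 0; j < n; ++j ) {
          v[ j ] = entry( pivotRow, j );
          for ( std::size_t l = 0; l < U.size( ); ++l )
            v[ j ] -= U[ l ][ pivotRow ] * V[ l ][ j ];
        }

        // pivot column = largest residual entry in modulus
        int pivotCol = 0;
        for ( int j = 1; j < n; ++j )
          if ( std::fabs( v[ j ] ) > std::fabs( v[ pivotCol ] ) ) pivotCol = j;
        double delta = v[ pivotCol ];
        if ( std::fabs( delta ) < 1e-14 ) break;   // (near) zero residual row

        // residual of the pivot column: C(:, pivotCol) - sum_l U[l][:] * V[l][pivotCol]
        std::vector<double> u( m );
        for ( int i = 0; i < m; ++i ) {
          u[ i ] = entry( i, pivotCol );
          for ( std::size_t l = 0; l < U.size( ); ++l )
            u[ i ] -= U[ l ][ i ] * V[ l ][ pivotCol ];
        }
        for ( int j = 0; j < n; ++j ) v[ j ] /= delta;   // scale the row vector

        // update the squared Frobenius norm of the low-rank approximation
        double nu2 = 0.0, nv2 = 0.0;
        for ( int i = 0; i < m; ++i ) nu2 += u[ i ] * u[ i ];
        for ( int j = 0; j < n; ++j ) nv2 += v[ j ] * v[ j ];
        for ( std::size_t l = 0; l < U.size( ); ++l ) {
          double uu = 0.0, vv = 0.0;
          for ( int i = 0; i < m; ++i ) uu += u[ i ] * U[ l ][ i ];
          for ( int j = 0; j < n; ++j ) vv += v[ j ] * V[ l ][ j ];
          normF2 += 2.0 * uu * vv;
        }
        normF2 += nu2 * nv2;

        U.push_back( u );
        V.push_back( v );
        if ( std::sqrt( nu2 * nv2 ) <= eps * std::sqrt( normF2 ) ) break;

        // next pivot row = largest residual entry of u among unused rows
        pivotRow = -1;
        for ( int i = 0; i < m; ++i )
          if ( !usedRow[ i ] && ( pivotRow < 0 || std::fabs( u[ i ] ) > std::fabs( u[ pivotRow ] ) ) )
            pivotRow = i;
        if ( pivotRow < 0 ) break;
      }
    }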

27 Numerical experiments

28 Numerical experiments: Experiment setting
Full (ACA) assembly tested on a mesh with (8.920) surface elements; 4 quadrature points in each dimension (256 per simplex) to utilize the SIMD registers.
ACA settings: maximal number of elements in clusters: 500, preallocation: 0 %.
Assembly performed on single nodes:
Salomon: 2x Xeon E5-2680v3, AVX2, 2x12 cores, 2.5 GHz, 128 GB RAM,
Salomon: Xeon Phi 7120P, IMCI, 61 cores, 1.238 GHz, 16 GB RAM,
Endeavor: Xeon Phi 7210, AVX-512, 64 cores, 1.3 GHz, GB RAM.

30 Numerical experiments: Full assembly
OMP scalability, full assembly, 256 points per simplex.
[Plots: assembly time t [s] vs. number of OMP threads for the single-layer matrix V_h and the double-layer matrix K_h (full, double, (4,4,4,4)) on 2xHSW, KNC, and KNL, compared with optimal scaling; accompanying tables list the times for the Xeon E5-2680v3, Xeon Phi 7120P, and Xeon Phi 7210 nodes.]

33 Numerical experiments: Full assembly
SIMD scalability, full assembly, 256 points per simplex.
[Plots: assembly time t [s] vs. SIMD register size for V_h and K_h (full, double, (4,4,4,4)) on 2xHSW, KNC, and KNL, compared with optimal scaling.]
[Table: SIMD speed-up over scalar code with SSE4.2, AVX2, IMCI, and AVX-512 for V_h and K_h on Xeon E5-2680v3 (24 threads), Xeon Phi 7120P (244 threads), and Xeon Phi 7210; legible entries (Xeon Phi 7120P, IMCI): V_h 3.77, K_h 5.05.]

36 Numerical experiments: ACA assembly
OMP scalability, ACA assembly, 256 points per simplex.
[Plots: assembly time t [s] vs. number of OMP threads for V_h and K_h (ACA, double, (4,4,4,4)) on 2xHSW, KNC, and KNL, compared with optimal scaling; accompanying tables list the times for the Xeon E5-2680v3, Xeon Phi 7120P, and Xeon Phi 7210 nodes.]

39 Numerical experiments: ACA assembly
SIMD scalability, ACA assembly, 256 points per simplex.
[Plots: assembly time t [s] vs. SIMD register size for V_h and K_h (ACA, double, (4,4,4,4)) on 2xHSW, KNC, and KNL, compared with optimal scaling.]
[Table: SIMD speed-up over scalar code with SSE4.2, AVX2, IMCI, and AVX-512 for V_h and K_h on Xeon E5-2680v3 (24 threads), Xeon Phi 7120P (244 threads), and Xeon Phi 7210; legible entries (Xeon Phi 7120P, IMCI): V_h 2.94, K_h 3.47.]

42 Conclusion

43 Conclusion
KNL vs. HSW speedup: full assembly V_h 2.29, K_h 1.82; ACA assembly V_h 2.64, K_h 1.40.
What we learned: SIMD processing is becoming more efficient with KNL; multi-core code benefits from many-core optimizations.
[Diagram: work distributed between CPU and MIC.]
Work in progress: object-oriented offload to Xeon Phi (full, ACA), load balancing for the offload mode (full, ACA), massively parallel BETI with the ESPRESO library (MPI, OpenMP, SIMD, offload). A minimal offload sketch is given below.
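The offload work in progress is not shown in code on the slides. As an illustration only, a block-wise assembly loop could be offloaded to a Xeon Phi with the standard OpenMP 4.0 target construct roughly as follows; BEM4I may instead use the Intel-specific #pragma offload, and the entry evaluator, buffer layout, and interface below are assumptions.

    #include <vector>

    #pragma omp declare target
    // Hypothetical device-side entry evaluator; a real one would run the
    // vectorized quadrature of the previous section. Here just a placeholder.
    inline double assembleEntry( int row, int col ) {
      return 1.0 / ( 1.0 + row + col );
    }
    #pragma omp end declare target

    // Offload the quadrature-heavy assembly of one matrix block to the
    // coprocessor; the host keeps the cluster tree and the ACA logic.
    void assembleBlockOnDevice( int rowOffset, int colOffset,
                                int nRows, int nCols,
                                std::vector<double> & block ) {
      block.resize( static_cast<std::size_t>( nRows ) * nCols );
      double * data = block.data( );

      // the assembled block is copied back from the device after the computation
    #pragma omp target map( from : data[ 0 : nRows * nCols ] )
    #pragma omp parallel for collapse( 2 )
      for ( int i = 0; i < nRows; ++i ) {
        for ( int j = 0; j < nCols; ++j ) {
          data[ static_cast<std::size_t>( i ) * nCols + j ]
              = assembleEntry( rowOffset + i, colOffset + j );
        }
      }
    }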

46 Conclusion: References
Zapletal, J.; Merta, M.; Malý, L. Boundary element quadrature schemes for multi- and many-core architectures. Computers and Mathematics with Applications (accepted).
Merta, M.; Zapletal, J.; Jaroš, J. Many core acceleration of the boundary element method. In: High Performance Computing in Science and Engineering: Second International Conference, HPCSE 2015, Soláň, Czech Republic, May 25-28, 2015, Revised Selected Papers, Springer International Publishing, 2016.
Preprints available at researchgate.com.
Thank you.
