Boundary element quadrature schemes for multi- and many-core architectures


1 Boundary element quadrature schemes for multi- and many-core architectures. Jan Zapletal, Michal Merta, Lukáš Malý. IT4Innovations, Dept. of Applied Mathematics, VŠB-TU Ostrava. Intel MIC programming workshop, February 7, 2017.

2 Outline: 1 BEM4I (Boundary element method, OpenMP threading, OpenMP vectorization, Adaptive cross approximation); 2 Numerical experiments (Full assembly, ACA assembly); 3 Conclusion.

3 BEM4I

4 BEM4I: Boundary element method
BEM is an alternative to FEM for the solution of PDEs. BEM4I is a 3D boundary element library for the Laplace equation (heat transfer), the Helmholtz equation (wave scattering), the Lamé equation (linear elasticity), and the time-dependent wave equation.
Specifications: C++, interface to MKL or other BLAS/LAPACK, ACA for matrix sparsification.
Strategies: SIMD vectorization of surface integrals (OpenMP, Vc library); OpenMP threading over local element contributions / individual ACA blocks; MPI for distributed matrices; BETI with the ESPRESO library; Intel MIC offload/native execution.

8 BEM4I: Boundary element method
Boundary value problem for the Laplace equation:
$$-\Delta u = 0 \ \text{in}\ \Omega, \qquad u = f \ \text{on}\ \Gamma_D, \qquad \frac{\partial u}{\partial n} = g \ \text{on}\ \Gamma_N.$$
Representation formula for $x \in \Omega$:
$$u(x) = \frac{1}{4\pi} \int_{\partial\Omega} \frac{1}{\|x-y\|}\, \frac{\partial u}{\partial n}(y) \,\mathrm{d}s_y - \frac{1}{4\pi} \int_{\partial\Omega} \frac{\langle x-y,\, n(y)\rangle}{\|x-y\|^3}\, u(y) \,\mathrm{d}s_y.$$
Boundary integral equations for $x \in \partial\Omega$:
$$\frac{1}{4\pi} \int_{\partial\Omega} \frac{1}{\|x-y\|}\, \frac{\partial u}{\partial n}(y) \,\mathrm{d}s_y = \frac{1}{2}\, u(x) + \frac{1}{4\pi} \int_{\partial\Omega} \frac{\langle x-y,\, n(y)\rangle}{\|x-y\|^3}\, u(y) \,\mathrm{d}s_y,$$
$$-\frac{\partial}{\partial n_x}\, \frac{1}{4\pi} \int_{\partial\Omega} \frac{\langle x-y,\, n(y)\rangle}{\|x-y\|^3}\, u(y) \,\mathrm{d}s_y = \frac{1}{2}\, \frac{\partial u}{\partial n}(x) - \frac{1}{4\pi} \int_{\partial\Omega} \frac{\langle y-x,\, n(x)\rangle}{\|x-y\|^3}\, \frac{\partial u}{\partial n}(y) \,\mathrm{d}s_y.$$

9 BEM4I: Boundary element method
Discretization leads to the systems
$$V_h g = \left(\tfrac{1}{2} M_h + K_h\right) f, \qquad D_h f = \left(\tfrac{1}{2} M_h^\top - K_h^\top\right) g$$
with the matrices
$$V_h[l,k] := \frac{1}{4\pi} \int_{\tau_l} \int_{\tau_k} \frac{1}{\|x-y\|} \,\mathrm{d}s_y \,\mathrm{d}s_x,$$
$$K_h[l,i] := \frac{1}{4\pi} \int_{\tau_l} \int_{\partial\Omega} \frac{\langle x-y,\, n(y)\rangle}{\|x-y\|^3}\, \varphi_i(y) \,\mathrm{d}s_y \,\mathrm{d}s_x,$$
$$D_h[j,i] := \frac{1}{4\pi} \int_{\partial\Omega} \int_{\partial\Omega} \frac{\langle \operatorname{\mathbf{curl}}_\Gamma \varphi_i(y),\, \operatorname{\mathbf{curl}}_\Gamma \varphi_j(x)\rangle}{\|x-y\|} \,\mathrm{d}s_y \,\mathrm{d}s_x = T_h^\top \operatorname{diag}(V_h, V_h, V_h)\, T_h,$$
$$M_h[l,i] := \int_{\tau_l} \varphi_i(x) \,\mathrm{d}s_x.$$
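To make the Galerkin assembly concrete, here is a minimal sketch (not BEM4I code) that assembles V_h with a crude one-point (centroid) quadrature for well-separated element pairs. The Triangle type, the storage layout, and the skipping of singular pairs are illustrative assumptions only; identical and touching element pairs need the regularized Duffy-type quadrature discussed later.

    #include <array>
    #include <cmath>
    #include <vector>

    // Hypothetical flat triangle: three vertices in R^3.
    struct Triangle { std::array<double, 3> v0, v1, v2; };

    static std::array<double, 3> centroid( const Triangle & t ) {
      return { ( t.v0[ 0 ] + t.v1[ 0 ] + t.v2[ 0 ] ) / 3.0,
               ( t.v0[ 1 ] + t.v1[ 1 ] + t.v2[ 1 ] ) / 3.0,
               ( t.v0[ 2 ] + t.v1[ 2 ] + t.v2[ 2 ] ) / 3.0 };
    }

    static double area( const Triangle & t ) {
      double a1 = t.v1[ 0 ] - t.v0[ 0 ], a2 = t.v1[ 1 ] - t.v0[ 1 ], a3 = t.v1[ 2 ] - t.v0[ 2 ];
      double b1 = t.v2[ 0 ] - t.v0[ 0 ], b2 = t.v2[ 1 ] - t.v0[ 1 ], b3 = t.v2[ 2 ] - t.v0[ 2 ];
      double c1 = a2 * b3 - a3 * b2, c2 = a3 * b1 - a1 * b3, c3 = a1 * b2 - a2 * b1;
      return 0.5 * std::sqrt( c1 * c1 + c2 * c2 + c3 * c3 );
    }

    // One-point approximation of V_h[l,k] = 1/(4 pi) int int 1/|x-y| ds_y ds_x
    // for disjoint element pairs; singular pairs (here only l == k) are skipped
    // and left to a regularized quadrature scheme.
    std::vector<double> assembleVNaive( const std::vector<Triangle> & mesh ) {
      const int E = static_cast<int>( mesh.size( ) );
      std::vector<double> V( static_cast<std::size_t>( E ) * E, 0.0 );
      for ( int k = 0; k < E; ++k ) {       // columns
        for ( int l = 0; l < E; ++l ) {     // rows
          if ( l == k ) continue;           // singular pair: needs Duffy quadrature
          std::array<double, 3> x = centroid( mesh[ l ] ), y = centroid( mesh[ k ] );
          double d1 = x[ 0 ] - y[ 0 ], d2 = x[ 1 ] - y[ 1 ], d3 = x[ 2 ] - y[ 2 ];
          double dist = std::sqrt( d1 * d1 + d2 * d2 + d3 * d3 );
          V[ static_cast<std::size_t>( l ) * E + k ] =
              area( mesh[ l ] ) * area( mesh[ k ] ) / ( 4.0 * M_PI * dist );
        }
      }
      return V;
    }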

11 BEM4I: OpenMP threading
OpenMP threading for V_h:
    #pragma omp parallel for
    for ( int tau_k = 0; tau_k < E; ++tau_k ) {   // columns
      for ( int tau_l = 0; tau_l < E; ++tau_l ) { // rows
        SLIntegrator.getlocalmatrix( *tau_l, *tau_k, Vloc );
        V.set( *tau_l, *tau_k, Vloc.get( 0, 0 ) );
      }
    }
OpenMP threading for K_h (threads work on columns; neighbouring elements share nodes, so updates of K must be atomic):
    #pragma omp parallel for
    for ( int tau_k = 0; tau_k < E; ++tau_k ) {   // columns
      for ( int tau_l = 0; tau_l < E; ++tau_l ) { // rows
        DLIntegrator.getlocalmatrix( *tau_l, *tau_k, Kloc );
        // #pragma omp atomic ( inside of add_atomic )
        K.add_atomic( *tau_l, tau_k->node[ 0 ], Kloc.get( 0, 0 ) );
        K.add_atomic( *tau_l, tau_k->node[ 1 ], Kloc.get( 0, 1 ) );
        K.add_atomic( *tau_l, tau_k->node[ 2 ], Kloc.get( 0, 2 ) );
      }
    }
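The comment above refers to the atomic update being hidden inside add_atomic. The actual BEM4I implementation is not shown on the slides; as an illustration only, such a member function might look roughly like this for a dense column-major matrix:

    #include <cstddef>
    #include <vector>

    // Hypothetical dense column-major matrix with a thread-safe accumulation;
    // illustrates what the "#pragma omp atomic inside of add_atomic" comment means.
    struct FullMatrix {
      std::size_t nRows;
      std::vector<double> data;   // column-major storage, size nRows * nCols

      void add_atomic( std::size_t i, std::size_t j, double value ) {
        double & entry = data[ j * nRows + i ];
    #pragma omp atomic update
        entry += value;
      }
    };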

12 BEM4I: OpenMP threading
Alternatively, exchange the loops for K_h so that threads work on rows; each thread then owns its rows of K and no synchronization is needed:
    #pragma omp parallel for
    for ( int tau_l = 0; tau_l < E; ++tau_l ) {   // rows
      for ( int tau_k = 0; tau_k < E; ++tau_k ) { // columns
        DLIntegrator.getlocalmatrix( *tau_l, *tau_k, Kloc );
        K.add( *tau_l, tau_k->node[ 0 ], Kloc.get( 0, 0 ) );
        K.add( *tau_l, tau_k->node[ 1 ], Kloc.get( 0, 1 ) );
        K.add( *tau_l, tau_k->node[ 2 ], Kloc.get( 0, 2 ) );
      }
    }
Avoid #pragma omp critical, use #pragma omp atomic if applicable.

13 BEM4I: OpenMP vectorization
Single Instruction Multiple Data (SIMD): processes a vector with a single operation, provides data-level parallelism, elements are of the same type.
Vector length: 256 bits for AVX2 (Haswell), i.e. 4 DP operands; 512 bits for IMCI (KNC) and AVX-512 (KNL), i.e. 8 DP operands.
Vectorization achieved by: compiler auto-vectorization (code refactoring can help), OpenMP 4.0 pragmas (#pragma omp simd), intrinsic functions, a wrapper library (Vc), assembly. An intrinsics sketch is given below.
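For comparison with the portable #pragma omp simd route used in BEM4I, a hand-written AVX2 intrinsics version of a weighted reduction (the quadrature loop pattern used on the following slides) could look roughly as follows. This is a generic sketch, not BEM4I code; it assumes the trip count S is a multiple of 4 and that the arrays are 32-byte aligned.

    #include <immintrin.h>

    // Weighted sum  result = sum_l w[l] * v[l]  with AVX2 intrinsics
    // (4 doubles per 256-bit register). Assumes S % 4 == 0 and 32-byte
    // aligned arrays; otherwise a scalar remainder loop would be needed.
    double weightedSumAvx2( const double * w, const double * v, int S ) {
      __m256d acc = _mm256_setzero_pd( );
      for ( int l = 0; l < S; l += 4 ) {
        __m256d wv = _mm256_load_pd( w + l );
        __m256d vv = _mm256_load_pd( v + l );
        acc = _mm256_add_pd( acc, _mm256_mul_pd( wv, vv ) );   // acc += w * v
      }
      // horizontal reduction of the four partial sums
      alignas( 32 ) double tmp[ 4 ];
      _mm256_store_pd( tmp, acc );
      return tmp[ 0 ] + tmp[ 1 ] + tmp[ 2 ] + tmp[ 3 ];
    }

The OpenMP pragma achieves the same effect while staying portable across AVX2, IMCI, and AVX-512, which is why it is preferred in the slides that follow.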

15 BEM4I: OpenMP vectorization
Tuned vectorized integration over a square. Baseline version (quadrature nodes stored interleaved, AoS):
    double * w = new double[ S ];
    w[ 0 ] = ...;  // init. weights
    // x^1_1, x^1_2, ..., x^S_1, x^S_2
    double x[ ] = { ... };

    for ( int l = 0; l < S; ++l ) {
      result += w[ l ] * f( x[ 2 * l ], x[ 2 * l + 1 ] );
    }

    delete [] w;
Loop peeling & strided access to memory.

16 BEM4I: OpenMP vectorization
Tuned vectorized integration over a square. Adding the OpenMP SIMD pragma:
    double * w = new double[ S ];
    w[ 0 ] = ...;  // init. weights
    // x^1_1, x^1_2, ..., x^S_1, x^S_2
    double x[ ] = { ... };

    #pragma omp simd reduction( + : result )
    for ( int l = 0; l < S; ++l ) {
      result += w[ l ] * f( x[ 2 * l ], x[ 2 * l + 1 ] );
    }

    delete [] w;
Loop peeling & strided access to memory.

17 [Figure: loop peeling and strided memory access, AoS layout (peel / main body / remainder regions at unaligned addresses) vs. SoA layout.]

18 BEM4I: OpenMP vectorization
Tuned vectorized integration over a square. Aligned SoA version:
    double * w = (double *) _mm_malloc( S * sizeof( double ), 64 );
    w[ 0 ] = ...;  // init. weights
    // x^1_1, ..., x^S_1
    double x1[ ] __attribute__( ( aligned( 64 ) ) ) = { ... };
    // x^1_2, ..., x^S_2
    double x2[ ] __attribute__( ( aligned( 64 ) ) ) = { ... };

    __assume_aligned( w, 64 );  // tell compiler about alignment

    #pragma omp simd reduction( + : result )
    for ( int l = 0; l < S; ++l ) {
      result += w[ l ] * f( x1[ l ], x2[ l ] );
    }

    _mm_free( w );
Loop peeling & strided access to memory: the SoA layout with 64-byte aligned data gives unit-stride aligned loads and removes the peel loop.
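As a portable alternative to the Intel-specific _mm_malloc/__assume_aligned pair, one could (this is a generic sketch, not the BEM4I code) allocate aligned storage with C++17 and pass the alignment to the vectorizer through the OpenMP aligned clause:

    #include <cstdlib>

    // Aligned SoA buffers via C++17 std::aligned_alloc; alignment information
    // is handed to the vectorizer through the OpenMP "aligned" clause instead
    // of compiler-specific builtins.
    double integrate( int S, double ( *f )( double, double ) ) {
      // aligned_alloc requires the size to be a multiple of the alignment
      std::size_t bytes = ( ( S * sizeof( double ) + 63 ) / 64 ) * 64;
      double * w  = static_cast<double *>( std::aligned_alloc( 64, bytes ) );
      double * x1 = static_cast<double *>( std::aligned_alloc( 64, bytes ) );
      double * x2 = static_cast<double *>( std::aligned_alloc( 64, bytes ) );
      // ... fill w, x1, x2 with quadrature weights and nodes ...

      double result = 0.0;
    #pragma omp simd aligned( w, x1, x2 : 64 ) reduction( + : result )
      for ( int l = 0; l < S; ++l ) {
        result += w[ l ] * f( x1[ l ], x2[ l ] );
      }

      std::free( x1 );  std::free( x2 );  std::free( w );
      return result;
    }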

19 BEM4I: OpenMP vectorization
Duffy-type substitution for $\tau_l \times \tau_k$:
$$V_h[l,k] = \sum_s \int_0^1\!\int_0^1\!\int_0^1\!\int_0^1 k\bigl(F_s(\eta_1,\eta_2,\eta_3,\xi)\bigr)\, S_s(\eta_1,\eta_2,\eta_3,\xi) \,\mathrm{d}\eta_1 \,\mathrm{d}\eta_2 \,\mathrm{d}\eta_3 \,\mathrm{d}\xi$$
with
$$F_s\colon [0,1]^4 \to \tau_l \times \tau_k, \quad F_s(\eta_1,\eta_2,\eta_3,\xi) = (x,y), \quad S_s(\eta_1,\eta_2,\eta_3,\xi) \,\mathrm{d}\eta_1 \,\mathrm{d}\eta_2 \,\mathrm{d}\eta_3 \,\mathrm{d}\xi = \mathrm{d}s_x \,\mathrm{d}s_y.$$
Approximated by tensor Gauss quadrature:
$$V_h[l,k] \approx \sum_s \sum_m \sum_n \sum_o \sum_p w_m w_n w_o w_p\, k\bigl(F_s(x_m,x_n,x_o,x_p)\bigr)\, S_s(x_m,x_n,x_o,x_p).$$
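The collapsed loop on the next slide runs over a single index c that enumerates all S1*S2*S3*S4 tensor-product quadrature points of one simplex. A sketch of how the combined weights (the array called weights_jacv in the listing) could be precomputed once is given below; the 1D weight arrays and the flattening order are assumptions for illustration, and in the real code the Jacobian factor S_s would be folded into the same array (hence the _jac in the name).

    #include <vector>

    // Precompute the collapsed tensor-product Gauss weights
    //   weights[c] = w_m * w_n * w_o * w_p,  c = ((m*S2 + n)*S3 + o)*S4 + p,
    // so that the 4 nested quadrature loops can be fused into one SIMD loop.
    std::vector<double> collapseWeights( const std::vector<double> & w1,
                                         const std::vector<double> & w2,
                                         const std::vector<double> & w3,
                                         const std::vector<double> & w4 ) {
      const std::size_t S1 = w1.size( ), S2 = w2.size( ), S3 = w3.size( ), S4 = w4.size( );
      std::vector<double> weights( S1 * S2 * S3 * S4 );
      std::size_t c = 0;
      for ( std::size_t m = 0; m < S1; ++m )
        for ( std::size_t n = 0; n < S2; ++n )
          for ( std::size_t o = 0; o < S3; ++o )
            for ( std::size_t p = 0; p < S4; ++p )
              weights[ c++ ] = w1[ m ] * w2[ n ] * w3[ o ] * w4[ p ];
      return weights;
    }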

20 BEM4I: OpenMP vectorization
Collapsed integration loop in getlocalmatrix:
    __assume_aligned( x1ss, 64 );  // all data aligned
    ...
    switch ( type ) {
      case ( identicalelements ):
        for ( int simplex = 0; simplex < 6; ++simplex ) {

          reftotri( simplex, x1, ..., y3, x1ref, ..., y2ref, x1ss, ..., y3ss );

          #pragma omp simd reduction( + : entry )
          for ( int c = 0; c < S1 * S2 * S3 * S4; ++c ) {  // collapsed
            kernel = weights_jacv[ c ]
                * evalsinglelayerkernel( x1ss[ c ], x2ss[ c ], x3ss[ c ],
                                         y1ss[ c ], y2ss[ c ], y3ss[ c ] );
            entry += kernel;
          }
        }
        break;
      ...  // quadrature over other pairs of elements
    }

21 BEM4I: OpenMP vectorization
SIMD evaluation of quadrature points in reftotri:
    __assume_aligned( x1ss, 64 );  // all data aligned
    ...
    #pragma omp simd
    for ( int c = 0; c < S1 * S2 * S3 * S4; ++c ) {
      x1ss[ c ] = x1[ 0 ]
          + ( x2[ 0 ] - x1[ 0 ] ) * x1ref[ simplex ][ c ]
          + ( x3[ 0 ] - x1[ 0 ] ) * x2ref[ simplex ][ c ];
      ...  // compute x2ss, x3ss, y1ss, y2ss, y3ss
    }
SIMD evaluation of the kernel in evalsinglelayerkernel:
    #pragma omp declare simd
    double evalsinglelayerkernel(
        double x1, double x2, double x3,
        double y1, double y2, double y3
        ) const {

      double d1 = x1 - y1, d2 = x2 - y2, d3 = x3 - y3;
      double norm = sqrt( d1 * d1 + d2 * d2 + d3 * d3 );

      return ( 1.0 / ( norm * 4.0 * M_PI ) );
    }

23 BEM4I: OpenMP vectorization
Intel Advisor [screenshot].

24 BEM4I: OpenMP vectorization
Intel compiler reports: -qopt-report=5 -qopt-report-phase=vec
    LOOP BEGIN at BEIntegratorScalar.cpp(304,5)
      remark 15340: pragma supersedes previous setting [ BEIntegratorScalar.cpp(300,) ]
      remark 15388: vectorization support: reference x1ss[ i ] has aligned access [ BEIntegratorScalar.cpp(305,55) ]
      ...  // all data reported as aligned
      remark 15305: vectorization support: vector length 4
      remark 15309: vectorization support: normalized vectorization overhead
      remark 15301: OpenMP SIMD LOOP WAS VECTORIZED
      remark 15448: unmasked aligned unit stride loads: 6
      remark 15475: --- begin vector cost summary ---
      remark 15476: scalar cost: 97
      remark 15477: vector cost:
      remark 15478: estimated potential speedup:
      remark 15488: --- end vector cost summary ---
    LOOP END

25 BEM4I: Adaptive cross approximation
Adaptive cross approximation (ACA): complexity reduced from O(n^2) to O(n log n); the mesh is divided into clusters, non-admissible clusters are assembled in full, admissible clusters are approximated as C ≈ U * V^T, and the assembly of the blocks is distributed by OpenMP. A sketch of the low-rank approximation itself follows after the listing.
    #pragma omp parallel for
    for ( int i = 0; i < n_nonadmissible_blocks; ++i ) {
      getnonadmissibleblock( i, localblock );
      ACAMatrix.addnonadmissibleblock( i, localblock );
    }

    #pragma omp parallel for
    for ( int i = 0; i < n_admissible_blocks; ++i ) {
      getadmissibleblock( i, localblock );
      ACAMatrix.addadmissibleblock( i, localblock );
    }
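For illustration only (a generic textbook variant of ACA with partial pivoting, not the BEM4I implementation), an admissible m x n block given by an entry evaluator can be compressed into factors U (m x r) and V (n x r) with C ≈ U * V^T roughly as follows; the entry callback, the stopping heuristic, and the vector-of-vectors storage are assumptions for the sketch.

    #include <cmath>
    #include <functional>
    #include <vector>

    // ACA with partial pivoting: builds U and V, stored as sequences of column
    // vectors, such that the block C is approximated by U * V^T. "entry"
    // evaluates a single matrix entry C(i, j); eps is the relative stopping
    // tolerance, maxRank a hard limit on the rank.
    void acaPartialPivoting( int m, int n,
                             const std::function<double( int, int )> & entry,
                             double eps, int maxRank,
                             std::vector<std::vector<double>> & U,
                             std::vector<std::vector<double>> & V ) {
      U.clear( );  V.clear( );
      double normF2 = 0.0;           // squared Frobenius norm of the approximation
      int pivotRow = 0;
      std::vector<bool> usedRow( m, false );

      for ( int k = 0; k < maxRank; ++k ) {
        usedRow[ pivotRow ] = true;

        // residual of the pivot row: C(pivotRow, :) - sum_l U[l][pivotRow] * V[l][:]
        std::vector<double> v( n );
        for ( int j = 0; j < n; ++j ) {
          v[ j ] = entry( pivotRow, j );
          for ( std::size_t l = 0; l < U.size( ); ++l )
            v[ j ] -= U[ l ][ pivotRow ] * V[ l ][ j ];
        }

        // pivot column = largest residual entry in modulus
        int pivotCol = 0;
        for ( int j = 1; j < n; ++j )
          if ( std::fabs( v[ j ] ) > std::fabs( v[ pivotCol ] ) ) pivotCol = j;
        double delta = v[ pivotCol ];
        if ( std::fabs( delta ) < 1e-14 ) break;   // (near) zero residual row

        // residual of the pivot column: C(:, pivotCol) - sum_l U[l][:] * V[l][pivotCol]
        std::vector<double> u( m );
        for ( int i = 0; i < m; ++i ) {
          u[ i ] = entry( i, pivotCol );
          for ( std::size_t l = 0; l < U.size( ); ++l )
            u[ i ] -= U[ l ][ i ] * V[ l ][ pivotCol ];
        }
        for ( int j = 0; j < n; ++j ) v[ j ] /= delta;   // scale the row vector

        // update the squared Frobenius norm of the low-rank approximation
        double nu2 = 0.0, nv2 = 0.0;
        for ( int i = 0; i < m; ++i ) nu2 += u[ i ] * u[ i ];
        for ( int j = 0; j < n; ++j ) nv2 += v[ j ] * v[ j ];
        for ( std::size_t l = 0; l < U.size( ); ++l ) {
          double uu = 0.0, vv = 0.0;
          for ( int i = 0; i < m; ++i ) uu += u[ i ] * U[ l ][ i ];
          for ( int j = 0; j < n; ++j ) vv += v[ j ] * V[ l ][ j ];
          normF2 += 2.0 * uu * vv;
        }
        normF2 += nu2 * nv2;

        U.push_back( u );
        V.push_back( v );
        if ( std::sqrt( nu2 * nv2 ) <= eps * std::sqrt( normF2 ) ) break;

        // next pivot row = largest residual entry of u among unused rows
        pivotRow = -1;
        for ( int i = 0; i < m; ++i )
          if ( !usedRow[ i ] && ( pivotRow < 0 || std::fabs( u[ i ] ) > std::fabs( u[ pivotRow ] ) ) )
            pivotRow = i;
        if ( pivotRow < 0 ) break;
      }
    }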

27 Numerical experiments

28 Numerical experiments: Experiment setting
Full (ACA) assembly tested on a mesh with (8.920) surface elements; 4 quadrature points in each dimension (256 per simplex) to utilize the SIMD registers.
ACA settings: maximal number of elements in clusters: 500, preallocation: 0 %.
Assembly performed on single nodes:
Salomon: 2x Xeon E5-2680v3, AVX2, 2x12 cores, 2.5 GHz, 128 GB RAM,
Salomon: Xeon Phi 7120P, IMCI, 61 cores, 1.238 GHz, 16 GB RAM,
Endeavor: Xeon Phi 7210, AVX-512, 64 cores, 1.3 GHz, GB RAM.

30 Numerical experiments: Full assembly
OMP scalability, full assembly, 256 points per simplex.
[Plots: assembly time t [s] vs. number of OMP threads for the single-layer matrix V_h and the double-layer matrix K_h (full, double, (4,4,4,4)) on 2xHSW, KNC, and KNL, compared with optimal scaling; accompanying tables list the times for the Xeon E5-2680v3, Xeon Phi 7120P, and Xeon Phi 7210 nodes.]

33 Numerical experiments: Full assembly
SIMD scalability, full assembly, 256 points per simplex.
[Plots: assembly time t [s] vs. SIMD register size for V_h and K_h (full, double, (4,4,4,4)) on 2xHSW, KNC, and KNL, compared with optimal scaling.]
[Table: SIMD speed-up over scalar code with SSE4.2, AVX2, IMCI, and AVX-512 for V_h and K_h on Xeon E5-2680v3 (24 threads), Xeon Phi 7120P (244 threads), and Xeon Phi 7210; legible entries (Xeon Phi 7120P, IMCI): V_h 3.77, K_h 5.05.]

36 Numerical experiments: ACA assembly
OMP scalability, ACA assembly, 256 points per simplex.
[Plots: assembly time t [s] vs. number of OMP threads for V_h and K_h (ACA, double, (4,4,4,4)) on 2xHSW, KNC, and KNL, compared with optimal scaling; accompanying tables list the times for the Xeon E5-2680v3, Xeon Phi 7120P, and Xeon Phi 7210 nodes.]

39 Numerical experiments: ACA assembly
SIMD scalability, ACA assembly, 256 points per simplex.
[Plots: assembly time t [s] vs. SIMD register size for V_h and K_h (ACA, double, (4,4,4,4)) on 2xHSW, KNC, and KNL, compared with optimal scaling.]
[Table: SIMD speed-up over scalar code with SSE4.2, AVX2, IMCI, and AVX-512 for V_h and K_h on Xeon E5-2680v3 (24 threads), Xeon Phi 7120P (244 threads), and Xeon Phi 7210; legible entries (Xeon Phi 7120P, IMCI): V_h 2.94, K_h 3.47.]

42 Conclusion

43 Conclusion
KNL vs. HSW speedup: full assembly V_h 2.29, K_h 1.82; ACA assembly V_h 2.64, K_h 1.40.
What we learned: SIMD processing is becoming more efficient with KNL; multi-core code benefits from many-core optimizations.
[Diagram: work distributed between CPU and MIC.]
Work in progress: object-oriented offload to Xeon Phi (full, ACA), load balancing for the offload mode (full, ACA), massively parallel BETI with the ESPRESO library (MPI, OpenMP, SIMD, offload). A minimal offload sketch is given below.
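The offload work in progress is not shown in code on the slides. As an illustration only, a block-wise assembly loop could be offloaded to a Xeon Phi with the standard OpenMP 4.0 target construct roughly as follows; BEM4I may instead use the Intel-specific #pragma offload, and the entry evaluator, buffer layout, and interface below are assumptions.

    #include <vector>

    #pragma omp declare target
    // Hypothetical device-side entry evaluator; a real one would run the
    // vectorized quadrature of the previous section. Here just a placeholder.
    inline double assembleEntry( int row, int col ) {
      return 1.0 / ( 1.0 + row + col );
    }
    #pragma omp end declare target

    // Offload the quadrature-heavy assembly of one matrix block to the
    // coprocessor; the host keeps the cluster tree and the ACA logic.
    void assembleBlockOnDevice( int rowOffset, int colOffset,
                                int nRows, int nCols,
                                std::vector<double> & block ) {
      block.resize( static_cast<std::size_t>( nRows ) * nCols );
      double * data = block.data( );

      // the assembled block is copied back from the device after the computation
    #pragma omp target map( from : data[ 0 : nRows * nCols ] )
    #pragma omp parallel for collapse( 2 )
      for ( int i = 0; i < nRows; ++i ) {
        for ( int j = 0; j < nCols; ++j ) {
          data[ static_cast<std::size_t>( i ) * nCols + j ]
              = assembleEntry( rowOffset + i, colOffset + j );
        }
      }
    }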

46 Conclusion: References
Zapletal, J.; Merta, M.; Malý, L. Boundary element quadrature schemes for multi- and many-core architectures. Computers and Mathematics with Applications (accepted).
Merta, M.; Zapletal, J.; Jaroš, J. Many core acceleration of the boundary element method. In: High Performance Computing in Science and Engineering: Second International Conference, HPCSE 2015, Soláň, Czech Republic, May 25-28, 2015, Revised Selected Papers, Springer International Publishing, 2016.
Preprints available at researchgate.com.
Thank you.
