Boundary element quadrature schemes for multi- and many-core architectures
Jan Zapletal, Michal Merta, Lukáš Malý
IT4Innovations, Dept. of Applied Mathematics, VŠB-TU Ostrava
Intel MIC programming workshop, February 7, 2017
1. BEM4I: boundary element method, OpenMP threading, OpenMP vectorization, adaptive cross approximation
2. Numerical experiments: full assembly, ACA assembly
3. Conclusion
BEM4I
Boundary element method

BEM is an alternative to FEM for the solution of PDEs. BEM4I is a 3D boundary element library for the Laplace equation (heat transfer), the Helmholtz equation (wave scattering), the Lamé equation (linear elasticity), and the time-dependent wave equation.

Specifications: C++, interface to MKL or other BLAS/LAPACK libraries, ACA for matrix sparsification.

Strategies: SIMD vectorization of surface integrals (OpenMP, Vc library), OpenMP threading over local element contributions / individual ACA blocks, MPI for distributed matrices, BETI with the ESPRESO library, Intel MIC offload/native execution.
Boundary value problem for the Laplace equation:
$$-\Delta u = 0 \ \text{in}\ \Omega, \qquad u = f \ \text{on}\ \Gamma_D, \qquad \frac{\partial u}{\partial n} = g \ \text{on}\ \Gamma_N.$$

Representation formula for $x \in \Omega$:
$$u(x) = \frac{1}{4\pi} \int_{\partial\Omega} \frac{1}{\|x-y\|} \frac{\partial u}{\partial n}(y) \,\mathrm{d}s_y - \frac{1}{4\pi} \int_{\partial\Omega} \frac{\langle x-y, n(y)\rangle}{\|x-y\|^3}\, u(y) \,\mathrm{d}s_y.$$

Boundary integral equations for $x \in \partial\Omega$:
$$\frac{1}{4\pi} \int_{\partial\Omega} \frac{1}{\|x-y\|} \frac{\partial u}{\partial n}(y) \,\mathrm{d}s_y = \frac{1}{2} u(x) + \frac{1}{4\pi} \int_{\partial\Omega} \frac{\langle x-y, n(y)\rangle}{\|x-y\|^3}\, u(y) \,\mathrm{d}s_y,$$
$$-\frac{1}{4\pi} \frac{\partial}{\partial n_x} \int_{\partial\Omega} \frac{\langle x-y, n(y)\rangle}{\|x-y\|^3}\, u(y) \,\mathrm{d}s_y = \frac{1}{2} \frac{\partial u}{\partial n}(x) - \frac{1}{4\pi} \int_{\partial\Omega} \frac{\langle y-x, n(x)\rangle}{\|x-y\|^3} \frac{\partial u}{\partial n}(y) \,\mathrm{d}s_y.$$
Discretization leads to the systems
$$V_h\, \mathbf{g} = \left(\tfrac{1}{2} M_h + K_h\right) \mathbf{f}, \qquad D_h\, \mathbf{f} = \left(\tfrac{1}{2} M_h^\top - K_h^\top\right) \mathbf{g}$$
with the matrices
$$V_h[l,k] := \frac{1}{4\pi} \int_{\tau_l} \int_{\tau_k} \frac{1}{\|x-y\|} \,\mathrm{d}s_y \,\mathrm{d}s_x,$$
$$K_h[l,i] := \frac{1}{4\pi} \int_{\tau_l} \int_{\partial\Omega} \frac{\langle x-y, n(y)\rangle}{\|x-y\|^3}\, \varphi_i(y) \,\mathrm{d}s_y \,\mathrm{d}s_x,$$
$$D_h[j,i] := \frac{1}{4\pi} \int_{\partial\Omega} \int_{\partial\Omega} \frac{\langle \mathbf{curl}\, \varphi_i(y), \mathbf{curl}\, \varphi_j(x)\rangle}{\|x-y\|} \,\mathrm{d}s_y \,\mathrm{d}s_x = T_h^\top \operatorname{diag}(V_h, V_h, V_h)\, T_h,$$
$$M_h[l,i] := \int_{\tau_l} \varphi_i(x) \,\mathrm{d}s_x.$$
OpenMP threading

OpenMP threading for V_h:

    #pragma omp parallel for
    for ( int tau_k = 0; tau_k < E; ++tau_k ) {   // columns
      for ( int tau_l = 0; tau_l < E; ++tau_l ) { // rows
        SLIntegrator.getLocalMatrix( *tau_l, *tau_k, Vloc );
        V.set( *tau_l, *tau_k, Vloc.get( 0, 0 ) );
      }
    }

OpenMP threading for K_h. The columns of K_h correspond to nodes shared by neighbouring elements, so with the parallel loop over columns (elements tau_k) several threads may accumulate into the same entry and the updates must be synchronized:

    #pragma omp parallel for
    for ( int tau_k = 0; tau_k < E; ++tau_k ) {   // columns
      for ( int tau_l = 0; tau_l < E; ++tau_l ) { // rows
        DLIntegrator.getLocalMatrix( *tau_l, *tau_k, Kloc );
        // #pragma omp atomic is applied inside of add_atomic
        K.add_atomic( *tau_l, tau_k->node[ 0 ], Kloc.get( 0, 0 ) );
        K.add_atomic( *tau_l, tau_k->node[ 1 ], Kloc.get( 0, 1 ) );
        K.add_atomic( *tau_l, tau_k->node[ 2 ], Kloc.get( 0, 2 ) );
      }
    }

Alternatively, with the parallel loop over rows (elements tau_l) each thread owns its rows of K_h and the plain add suffices:

    #pragma omp parallel for
    for ( int tau_l = 0; tau_l < E; ++tau_l ) {   // rows
      for ( int tau_k = 0; tau_k < E; ++tau_k ) { // columns
        DLIntegrator.getLocalMatrix( *tau_l, *tau_k, Kloc );
        K.add( *tau_l, tau_k->node[ 0 ], Kloc.get( 0, 0 ) );
        K.add( *tau_l, tau_k->node[ 1 ], Kloc.get( 0, 1 ) );
        K.add( *tau_l, tau_k->node[ 2 ], Kloc.get( 0, 2 ) );
      }
    }

Avoid #pragma omp critical; use #pragma omp atomic if applicable.
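The slides only hint at what happens inside add_atomic. A minimal sketch of such a method, using an assumed dense-matrix class rather than the actual BEM4I interface, could look as follows:

    // Sketch only: DenseMatrix is an assumed stand-in, not the BEM4I matrix class.
    #include <vector>

    class DenseMatrix {
      int nRows;
      std::vector<double> data; // column-major storage, for illustration

    public:
      DenseMatrix( int rows, int cols ) : nRows( rows ), data( (size_t) rows * cols, 0.0 ) { }

      // Unsynchronized update: safe when each thread writes disjoint entries,
      // e.g. when the parallel loop runs over the rows it owns.
      void add( int i, int j, double value ) {
        data[ (size_t) j * nRows + i ] += value;
      }

      // Synchronized update via #pragma omp atomic: safe when several threads
      // may accumulate into the same entry, as in the column-parallel K_h loop.
      void add_atomic( int i, int j, double value ) {
        double & entry = data[ (size_t) j * nRows + i ];
        #pragma omp atomic
        entry += value;
      }
    };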
OpenMP vectorization

Single Instruction Multiple Data (SIMD): a single operation processes a whole vector of elements of the same type, providing data-level parallelism.

Vector length: 256 bits for AVX2 (Haswell), i.e. 4 double-precision operands; 512 bits for IMCI (KNC) and AVX-512 (KNL), i.e. 8 double-precision operands.

Vectorization can be achieved by compiler auto-vectorization (code refactoring can help), OpenMP 4.0 pragmas (#pragma omp simd), intrinsic functions, wrapper libraries (Vc), or assembly.
Tuned vectorized integration over a square

Original scalar version:

    double * w = new double[ S ];
    w[ 0 ] = ... // init. weights
    // x^1_1, x^1_2, ..., x^S_1, x^S_2 (interleaved coordinates, AoS)
    double x[ ] = { ... };

    for ( int l = 0; l < S; ++l ) {
      result += w[ l ] * f( x[ 2 * l ], x[ 2 * l + 1 ] );
    }

    delete [] w;

Adding #pragma omp simd vectorizes the loop, but the interleaved array-of-structures layout still causes loop peeling and strided access to memory:

    #pragma omp simd reduction( + : result )
    for ( int l = 0; l < S; ++l ) {
      result += w[ l ] * f( x[ 2 * l ], x[ 2 * l + 1 ] );
    }

[Figure: memory layout of the quadrature data, AoS with peel, main body, and remainder loops vs. a contiguous SoA layout.]

Aligned structure-of-arrays version with unit-stride access:

    double * w = (double *) _mm_malloc( S * sizeof( double ), 64 );
    w[ 0 ] = ... // init. weights
    // x^1_1, ..., x^S_1
    double x1[ ] __attribute__( ( aligned( 64 ) ) ) = { ... };
    // x^1_2, ..., x^S_2
    double x2[ ] __attribute__( ( aligned( 64 ) ) ) = { ... };

    __assume_aligned( w, 64 ); // tell compiler about alignment

    #pragma omp simd reduction( + : result )
    for ( int l = 0; l < S; ++l ) {
      result += w[ l ] * f( x1[ l ], x2[ l ] );
    }

    _mm_free( w );
Duffy substitution for $\tau_l \times \tau_k$:
$$V_h[l,k] = \sum_s \int_0^1\!\int_0^1\!\int_0^1\!\int_0^1 k(F_s(\eta_1, \eta_2, \eta_3, \xi))\, S_s(\eta_1, \eta_2, \eta_3, \xi) \,\mathrm{d}\eta_1 \,\mathrm{d}\eta_2 \,\mathrm{d}\eta_3 \,\mathrm{d}\xi$$
with $F_s \colon [0,1]^4 \to \tau_l \times \tau_k$, $F_s(\eta_1, \eta_2, \eta_3, \xi) = (x, y)$, and $S_s(\eta_1, \eta_2, \eta_3, \xi) \,\mathrm{d}\eta_1 \,\mathrm{d}\eta_2 \,\mathrm{d}\eta_3 \,\mathrm{d}\xi = \mathrm{d}s_x \,\mathrm{d}s_y$.

Approximated by tensor Gauss quadrature:
$$V_h[l,k] \approx \sum_s \sum_m \sum_n \sum_o \sum_p w_m w_n w_o w_p\, k(F_s(x_m, x_n, x_o, x_p))\, S_s(x_m, x_n, x_o, x_p).$$
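The collapsed loop shown next runs over a single index c through all S1·S2·S3·S4 tensor-product quadrature points, so the products of the 1D Gauss weights have to be precomputed in the same flattened order. A minimal sketch of such a precomputation (the names below are assumptions, not the BEM4I code; in BEM4I the array weights_jacv presumably also carries the Jacobian factor S_s):

    #include <vector>

    // Flatten the 4D tensor-product Gauss weights w_m * w_n * w_o * w_p into a
    // single array indexed by c = ((m * S2 + n) * S3 + o) * S4 + p, matching the
    // collapsed quadrature loop.
    std::vector<double> collapseWeights( const std::vector<double> & w1,
                                         const std::vector<double> & w2,
                                         const std::vector<double> & w3,
                                         const std::vector<double> & w4 ) {
      std::vector<double> wProd;
      wProd.reserve( w1.size( ) * w2.size( ) * w3.size( ) * w4.size( ) );
      for ( double wm : w1 )
        for ( double wn : w2 )
          for ( double wo : w3 )
            for ( double wp : w4 )
              wProd.push_back( wm * wn * wo * wp );
      return wProd;
    }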
Collapsed integration loop in getLocalMatrix:

    __assume_aligned( x1ss, 64 ); // all data aligned
    ...
    switch ( type ) {
      case ( identicalElements ):
        for ( int simplex = 0; simplex < 6; ++simplex ) {

          refToTri( simplex, x1, ..., y3, x1ref, ..., y2ref, x1ss, ..., y3ss );

          #pragma omp simd reduction( + : entry )
          for ( c = 0; c < S1 * S2 * S3 * S4; ++c ) { // collapsed
            kernel = weights_jacv[ c ]
              * evalSingleLayerKernel( x1ss[ c ], x2ss[ c ], x3ss[ c ],
                                       y1ss[ c ], y2ss[ c ], y3ss[ c ] );
            entry += kernel;
          }
        }
        break;
      ... // quadrature over other pairs of elements
    }
SIMD evaluation of quadrature points in refToTri:

    __assume_aligned( x1ss, 64 ); // all data aligned
    ...
    #pragma omp simd
    for ( int c = 0; c < S1 * S2 * S3 * S4; ++c ) {
      x1ss[ c ] = x1[ 0 ]
        + ( x2[ 0 ] - x1[ 0 ] ) * x1ref[ simplex ][ c ]
        + ( x3[ 0 ] - x1[ 0 ] ) * x2ref[ simplex ][ c ];
      ... // compute x2ss, x3ss, y1ss, y2ss, y3ss
    }

SIMD evaluation of the kernel in evalSingleLayerKernel:

    #pragma omp declare simd
    double evalSingleLayerKernel(
      double x1, double x2, double x3,
      double y1, double y2, double y3
    ) const {

      double d1 = x1 - y1, d2 = x2 - y2, d3 = x3 - y3;
      double norm = sqrt( d1 * d1 + d2 * d2 + d3 * d3 );

      return ( 1.0 / ( norm * 4.0 * M_PI ) );
    }
Intel Advisor

[Figure: Intel Advisor screenshot.]
Intel compiler reports: -qopt-report=5 -qopt-report-phase=vec

    LOOP BEGIN at BEIntegratorScalar.cpp(304,5)
      remark #15340: pragma supersedes previous setting [ BEIntegratorScalar.cpp(300,1) ]
      remark #15388: vectorization support: reference x1ss[i] has aligned access [ BEIntegratorScalar.cpp(305,55) ]
      ... // all data reported as aligned
      remark #15305: vectorization support: vector length 4
      remark #15309: vectorization support: normalized vectorization overhead
      remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
      remark #15448: unmasked aligned unit stride loads: 6
      remark #15475: --- begin vector cost summary ---
      remark #15476: scalar cost: 97
      remark #15477: vector cost:
      remark #15478: estimated potential speedup:
      remark #15488: --- end vector cost summary ---
    LOOP END
Adaptive cross approximation

Complexity is reduced from O(n^2) to O(n log n): the mesh is divided into clusters, non-admissible cluster pairs are assembled in full, admissible cluster pairs are approximated as C ≈ U V^T, and the assembly of the blocks is distributed by OpenMP.

    #pragma omp parallel for
    for ( int i = 0; i < n_nonadmissible_blocks; ++i ) {
      getNonadmissibleBlock( i, localBlock );
      ACAMatrix.addNonadmissibleBlock( i, localBlock );
    }

    #pragma omp parallel for
    for ( int i = 0; i < n_admissible_blocks; ++i ) {
      getAdmissibleBlock( i, localBlock );
      ACAMatrix.addAdmissibleBlock( i, localBlock );
    }
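The payoff of the low-rank representation C ≈ U V^T shows up when the block acts on a vector: the product costs O(k(m+n)) operations instead of O(mn). A minimal sketch under assumed storage conventions (column-major flat arrays; not the BEM4I interface):

    #include <cstddef>
    #include <vector>

    // y += U * ( V^T * x ) for a low-rank block with U of size m x k and
    // V of size n x k, both stored column-major as flat arrays.
    void applyLowRankBlock( const std::vector<double> & U,
                            const std::vector<double> & V,
                            int m, int n, int k,
                            const double * x, double * y ) {
      std::vector<double> t( k, 0.0 );
      // t = V^T * x: k dot products of length n
      for ( int j = 0; j < k; ++j )
        for ( int i = 0; i < n; ++i )
          t[ j ] += V[ (size_t) j * n + i ] * x[ i ];
      // y += U * t: accumulate k rank-one contributions of length m
      for ( int j = 0; j < k; ++j )
        for ( int i = 0; i < m; ++i )
          y[ i ] += U[ (size_t) j * m + i ] * t[ j ];
    }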
Numerical experiments
Experiment setting

Full (ACA) assembly tested on a mesh with 81,920 surface elements; 4 quadrature points in each dimension (256 per simplex) to utilize the SIMD registers.

ACA settings: maximal number of elements per cluster 500, preallocation 10 %.

Assembly performed on single nodes:
- Salomon: 2× Xeon E5-2680v3 (Haswell), AVX2, 2×12 cores, 2.5 GHz, 128 GB RAM,
- Salomon: Xeon Phi 7120P (KNC), IMCI, 61 cores, 1.238 GHz, 16 GB RAM,
- Endeavor: Xeon Phi 7210 (KNL), AVX-512, 64 cores, 1.3 GHz.
Full assembly

OMP scalability, full assembly, 256 points per simplex

[Figure: assembly times of the single-layer and double-layer matrices (full assembly, double precision, (4,4,4,4) quadrature) versus the number of OpenMP threads on 2× Haswell, KNC, and KNL, compared against optimal scaling; accompanying tables list the V_h and K_h assembly times on the Xeon E5-2680v3, Xeon Phi 7120P, and Xeon Phi 7210.]
SIMD scalability, full assembly, 256 points per simplex

[Figure: assembly times of the single-layer and double-layer matrices (full assembly, double precision, (4,4,4,4) quadrature) for increasing SIMD register width (SSE4.2, AVX2, IMCI, AVX-512) on the Xeon E5-2680v3 (24 threads), Xeon Phi 7120P (244 threads), and Xeon Phi 7210; the accompanying table reports V_h and K_h values per instruction set, including 3.77 for V_h and 5.05 for K_h.]
ACA assembly

OMP scalability, ACA assembly, 256 points per simplex

[Figure: assembly times of the single-layer and double-layer matrices (ACA assembly, double precision, (4,4,4,4) quadrature) versus the number of OpenMP threads on 2× Haswell, KNC, and KNL, compared against optimal scaling; accompanying tables list the V_h and K_h assembly times on the Xeon E5-2680v3, Xeon Phi 7120P, and Xeon Phi 7210.]
SIMD scalability, ACA assembly, 256 points per simplex

[Figure: assembly times of the single-layer and double-layer matrices (ACA assembly, double precision, (4,4,4,4) quadrature) for increasing SIMD register width (SSE4.2, AVX2, IMCI, AVX-512) on the Xeon E5-2680v3 (24 threads), Xeon Phi 7120P (244 threads), and Xeon Phi 7210; the accompanying table reports V_h and K_h values per instruction set, including 2.94 for V_h and 3.47 for K_h.]
Conclusion
KNL vs. HSW speedup: full assembly V_h 2.29, K_h 1.82; ACA assembly V_h 2.64, K_h 1.40.

What we learned: SIMD processing is becoming more efficient with KNL, and multi-core code benefits from many-core optimizations.

[Figure: CPU + MIC offload scheme.]

Work in progress: object-oriented offload to Xeon Phi (full, ACA), load balancing for the offload mode (full, ACA), massively parallel BETI with the ESPRESO library (MPI, OpenMP, SIMD, offload).
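Offloading to the Xeon Phi is listed above as work in progress; the classic Intel Language Extensions for Offload give the flavour of such an offload region. A generic sketch, not the BEM4I implementation (array names are placeholders):

    // Offload a simple vectorized loop to the first Xeon Phi coprocessor.
    // Requires the Intel compiler; depending on compiler options, execution
    // may fall back to the host if no coprocessor is present.
    void squareOnMic( const double * a, double * b, int n ) {
      #pragma offload target(mic:0) in( a : length(n) ) out( b : length(n) )
      {
        #pragma omp parallel for simd
        for ( int i = 0; i < n; ++i ) {
          b[ i ] = a[ i ] * a[ i ];
        }
      }
    }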
References

Zapletal, J.; Merta, M.; Malý, L.: Boundary element quadrature schemes for multi- and many-core architectures. Computers and Mathematics with Applications (accepted).

Merta, M.; Zapletal, J.; Jaroš, J.: Many core acceleration of the boundary element method. In: High Performance Computing in Science and Engineering: Second International Conference, HPCSE 2015, Soláň, Czech Republic, May 25–28, 2015, Revised Selected Papers, Springer International Publishing, pp. 116–125, 2016.

Preprints available at researchgate.com.

Thank you