Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X


1 Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X. IXPUG '17 presentation. J. Hashmi, M. Li, H. Subramoni and D. K. Panda, The Ohio State University, {hashmi.29,li.292,subramoni.1,panda.2}@osu.edu

2 Parallel Programming Models Overview. The slide contrasts three abstract machine models: the Shared Memory model (processes P1-P3 over a single shared memory; e.g., SHMEM, DSM), the Distributed Memory model (a private memory per process; e.g., MPI, the Message Passing Interface), and the Partitioned Global Address Space (PGAS) model (a logical shared memory layered over distributed memories; e.g., Global Arrays, UPC, Chapel, X10, CAF). Programming models provide abstract machine models that can be mapped onto different types of systems, e.g., Distributed Shared Memory (DSM) or MPI within a node. PGAS models and hybrid MPI+PGAS models are gradually gaining importance.

3 Partitioned Global Address Space (PGAS) Models. Key features: simple shared-memory abstractions, lightweight one-sided communication, and easier expression of irregular communication. Different approaches to PGAS: Languages - Unified Parallel C (UPC), Co-Array Fortran (CAF), X10, Chapel; Libraries - OpenSHMEM, UPC++, Global Arrays. A minimal one-sided example follows below.
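To make the lightweight one-sided style concrete, here is a minimal OpenSHMEM sketch (not taken from the talk) in which PE 0 writes directly into a symmetric variable owned by PE 1 with no matching receive on the target; the variable name and value are illustrative.

#include <stdio.h>
#include <shmem.h>

int main(void) {
    shmem_init();
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric allocation: every PE owns a copy at the same symmetric address */
    long *dest = shmem_malloc(sizeof(long));
    *dest = -1;
    shmem_barrier_all();

    if (me == 0 && npes > 1) {
        long value = 42;
        /* One-sided write into PE 1's copy of dest; PE 1 never posts a receive */
        shmem_long_put(dest, &value, 1, 1);
    }
    shmem_barrier_all();   /* completes the put before PE 1 reads */

    if (me == 1)
        printf("PE 1 observed %ld\n", *dest);

    shmem_free(dest);
    shmem_finalize();
    return 0;
}

The same pattern maps onto memory puts in UPC and asynchronous puts in UPC++, which is what the Put/Get comparisons later in the talk measure.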

4 Hybrid (MPI+PGAS) Programming. Application sub-kernels can be re-written in MPI or PGAS based on their communication characteristics. Benefits: best of the distributed-memory computing model and best of the shared-memory computing model. Cons: two different runtimes, need great care while programming, prone to deadlock if not careful. The slide depicts an HPC application whose kernels alternate between MPI and PGAS (Kernel 1 in MPI, Kernel 2 in PGAS/MPI, Kernel 3 in MPI, ..., Kernel N in PGAS/MPI). A hybrid sketch follows below.
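A minimal hybrid MPI+OpenSHMEM sketch of this pattern is shown below; it assumes a unified runtime such as MVAPICH2-X in which both models can be initialized in one executable and MPI ranks coincide with OpenSHMEM PEs. The two toy kernels, the initialization order, and the OpenSHMEM 1.3-style call names are illustrative assumptions, not the slide's application.

#include <stdio.h>
#include <mpi.h>
#include <shmem.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    shmem_init();              /* a unified runtime lets both models coexist */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Kernel 1: bulk-synchronous phase expressed in MPI */
    int local = rank, sum = 0;
    MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* Kernel 2: irregular one-sided updates expressed in OpenSHMEM */
    long *counter = shmem_malloc(sizeof(long));
    *counter = 0;
    shmem_barrier_all();
    shmem_long_inc(counter, 0);    /* every PE increments the counter on PE 0 */
    shmem_barrier_all();

    if (rank == 0)
        printf("MPI sum of ranks = %d, OpenSHMEM counter on PE 0 = %ld\n",
               sum, *counter);

    shmem_free(counter);
    shmem_finalize();
    MPI_Finalize();
    return 0;
}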

5 MVAPICH2 Software Family: High-Performance Parallel Programming Libraries
- MVAPICH2: Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
- MVAPICH2-X: Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with unified communication runtime
- MVAPICH2-GDR: Optimized MPI for clusters with NVIDIA GPUs
- MVAPICH2-Virt: High-performance and scalable MPI for hypervisor- and container-based HPC cloud
- MVAPICH2-EA: Energy-aware and high-performance MPI
- MVAPICH2-MIC: Optimized MPI for clusters with Intel KNC
Microbenchmarks:
- OMB: Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs
Tools:
- OSU INAM: Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
- OEMT: Utility to measure the energy consumption of MPI applications

6 MVAPICH2-X for Hybrid MPI + PGAS Applications. Current model: separate runtimes for OpenSHMEM/UPC/UPC++/CAF and MPI, with possible deadlock if both runtimes are not progressed, and higher network resource consumption. MVAPICH2-X instead provides a unified communication runtime for MPI, UPC, UPC++, OpenSHMEM, and CAF. Available since 2012 (starting with MVAPICH2-X 1.9). http://mvapich.cse.ohio-state.edu

7 Application-Level Performance with Graph500 and Sort. Graph500 execution time (MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid MPI+OpenSHMEM at 4K, 8K, and 16K processes): the hybrid (MPI+OpenSHMEM) Graph500 design gives, at 8,192 processes, a 2.4X improvement over MPI-CSR and a 7.6X improvement over MPI-Simple; at 16,384 processes, a 1.5X improvement over MPI-CSR and a 13X improvement over MPI-Simple. Sort execution time (MPI vs. Hybrid for input data / process counts of 500GB-512, 1TB-1K, 2TB-2K, and 4TB-4K): at 4,096 processes with a 4 TB input, MPI takes 2408 sec (0.16 TB/min) while the hybrid design takes 1172 sec (0.36 TB/min), a 51% improvement over the MPI design.
References: J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013; J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012; J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013.

8 Performance of PGAS Models on KNL: Performance of Put and Get with OpenSHMEM, UPC, and UPC++; Comparison of KNL and Broadwell for OpenSHMEM point-to-point, collective, and atomic operations; Impact of AVX-512 vectorization and MCDRAM on OpenSHMEM application kernels; Performance of UPC++ application kernels on the MVAPICH2-X communication runtime.

9 Performance of PGAS Models on KNL using MVAPICH2-X. Figures: intra-node PUT and GET latency (us) vs. message size (1K to 1M) for shmem_put, upc_putmem, and upcxx_async_put/upcxx_async_get. Intra-node performance of one-sided Put/Get operations of PGAS libraries/languages using the MVAPICH2-X communication conduit; near-native communication performance is observed on KNL. A latency-measurement sketch follows below.
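The latency curves above come from put/get microbenchmarks; the following is a simplified sketch in the spirit of the OSU OpenSHMEM put test (not the actual OMB code). The message-size sweep, iteration count, and use of shmem_quiet for remote completion are illustrative assumptions; run it with at least two PEs.

#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <shmem.h>

#define MAX_MSG (1 << 20)   /* 1 MB, the largest size in the plots */
#define ITERS   1000

static double wtime(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void) {
    shmem_init();
    int me = shmem_my_pe();

    /* Symmetric source and destination buffers on every PE */
    char *src = shmem_malloc(MAX_MSG);
    char *dst = shmem_malloc(MAX_MSG);
    memset(src, 'a', MAX_MSG);

    for (size_t size = 1024; size <= MAX_MSG; size *= 4) {   /* 1K ... 1M */
        shmem_barrier_all();
        if (me == 0) {
            double t0 = wtime();
            for (int i = 0; i < ITERS; i++) {
                shmem_putmem(dst, src, size, 1);   /* one-sided put to PE 1 */
                shmem_quiet();                     /* wait for remote completion */
            }
            printf("%8zu bytes  %10.2f us\n", size, (wtime() - t0) / ITERS * 1e6);
        }
    }

    shmem_barrier_all();
    shmem_free(src);
    shmem_free(dst);
    shmem_finalize();
    return 0;
}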

10 Performance of PGAS Models on KNL using MVAPICH2-X. Figures: inter-node PUT and GET latency (us) vs. message size (1K to 1M) for shmem_put, upc_putmem, and upcxx_async_put/upcxx_async_get. Inter-node performance of one-sided Put/Get operations using the MVAPICH2-X communication conduit with an InfiniBand HCA (MT4115); native IB performance is observed for all three PGAS models.

11 Performance of PGAS Models on KNL: Performance of Put and Get with OpenSHMEM, UPC, and UPC++; Comparison of KNL and Broadwell for OpenSHMEM point-to-point, collective, and atomic operations; Impact of AVX-512 vectorization and MCDRAM on OpenSHMEM application kernels; Performance of UPC++ application kernels on the MVAPICH2-X communication runtime.

12 Microbenchmark Evaluations (Intra-node Put/Get). Figures: shmem_putmem and shmem_getmem latency (us) vs. message size (bytes) on Broadwell and KNL. Broadwell shows about 3X better performance than KNL for large messages. Multi-threaded memcpy routines on KNL could offset the degradation caused by the slower cores in basic Put/Get operations. J. Hashmi, M. Li, H. Subramoni, D. K. Panda, Exploiting and Evaluating OpenSHMEM on KNL Architecture, Fourth Workshop on OpenSHMEM, Aug 2017.

13 Microbenchmark Evaluations (Inter-node Put/Get). Figures: shmem_putmem and shmem_getmem latency (us) vs. message size (bytes) on Broadwell and KNL. Inter-node small-message latency is only about 2X worse on KNL, while large-message performance is almost the same on KNL and Broadwell.

14 Microbenchmark Evaluations (Collectives). Figures: shmem_reduce (~4X gap) and shmem_broadcast (~2X gap) latency (us) vs. message size (bytes) on 128 processes, Broadwell vs. KNL. Setup: 2 KNL nodes (64 ppn) and 8 Broadwell nodes (16 ppn). Up to 4X degradation is observed on KNL in the collective benchmarks; the basic point-to-point performance difference is reflected in the collectives as well. A collectives sketch follows below.
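For reference, the reduce and broadcast operations being timed look like the following OpenSHMEM 1.3-style sketch (not the benchmark code itself); the element count, the work/sync array sizing, and the printed check are illustrative.

#include <stdio.h>
#include <shmem.h>

#define N 4

/* Symmetric work/synchronization arrays required by the 1.3-style collectives */
long pSyncR[SHMEM_REDUCE_SYNC_SIZE];
long pSyncB[SHMEM_BCAST_SYNC_SIZE];
int  pWrk[SHMEM_REDUCE_MIN_WRKDATA_SIZE + N];

int src[N], dst[N], bcast[N];   /* globals are symmetric */

int main(void) {
    for (int i = 0; i < SHMEM_REDUCE_SYNC_SIZE; i++) pSyncR[i] = SHMEM_SYNC_VALUE;
    for (int i = 0; i < SHMEM_BCAST_SYNC_SIZE; i++)  pSyncB[i] = SHMEM_SYNC_VALUE;

    shmem_init();
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    for (int i = 0; i < N; i++) src[i] = me + i;
    shmem_barrier_all();

    /* Sum-reduction across all PEs, as exercised by the shmem_reduce benchmark */
    shmem_int_sum_to_all(dst, src, N, 0, 0, npes, pWrk, pSyncR);
    shmem_barrier_all();

    /* Broadcast from PE 0 to the other PEs, as in the shmem_broadcast benchmark */
    shmem_broadcast32(bcast, dst, N, 0, 0, 0, npes, pSyncB);

    if (me == npes - 1)
        printf("PE %d: reduced[0]=%d, broadcast[0]=%d\n", me, dst[0], bcast[0]);

    shmem_finalize();
    return 0;
}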

15 Microbenchmark Evaluations (Atomics). Figure: latency (us) of OpenSHMEM atomics (fadd32, fadd64, cswap32, cswap64, inc32, inc64) on 128 processes, KNL vs. Broadwell. Using multiple KNL nodes, atomic operations showed about 2.5X degradation on compare-swap and inc atomics; fetch-and-add (32-bit) showed up to 4X degradation on KNL. An atomics sketch follows below.
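The atomics in the figure can be exercised with a sketch like the one below, using OpenSHMEM 1.3 names (shmem_int_fadd, shmem_longlong_fadd, shmem_int_inc, shmem_int_cswap); the flag-acquisition use of compare-and-swap and the printed checks are illustrative assumptions.

#include <stdio.h>
#include <shmem.h>

/* Symmetric targets for the atomic operations (globals are symmetric) */
int       counter32;
long long counter64;
int       flag;

int main(void) {
    shmem_init();
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    counter32 = 0;
    counter64 = 0;
    flag = 0;
    shmem_barrier_all();

    /* Every PE targets PE 0, mirroring the fadd32/fadd64/inc32/inc64 tests */
    (void)shmem_int_fadd(&counter32, 1, 0);         /* 32-bit fetch-and-add */
    (void)shmem_longlong_fadd(&counter64, 1, 0);    /* 64-bit fetch-and-add */
    shmem_int_inc(&counter32, 0);                   /* non-fetching increment */

    /* cswap-style test: exactly one PE swaps the flag on PE 0 from 0 to me+1 */
    int won = (shmem_int_cswap(&flag, 0, me + 1, 0) == 0);

    shmem_barrier_all();
    if (won)
        printf("PE %d won the compare-and-swap\n", me);
    if (me == 0)
        printf("counter32 on PE 0 = %d (expected %d)\n", counter32, 2 * npes);

    shmem_finalize();
    return 0;
}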

16 Performance of PGAS Models on KNL: Performance of Put and Get with OpenSHMEM, UPC, and UPC++; Comparison of KNL and Broadwell for OpenSHMEM point-to-point, collective, and atomic operations; Impact of AVX-512 vectorization and MCDRAM on OpenSHMEM application kernels; Performance of UPC++ application kernels on the MVAPICH2-X communication runtime.

17 NAS Parallel Benchmark Evaluation. Figures: execution time (s) vs. number of processes for NAS-BT (PDE solver), CLASS=B and NAS-EP (RNG), CLASS=B with KNL (Default), KNL (AVX-512), KNL (AVX-512+MCDRAM), and Broadwell. AVX-512 vectorized execution of the BT kernel on KNL showed 30% improvement over the default execution, while the EP kernel didn't show any improvement. Broadwell showed 20% improvement over optimized KNL on BT, and 4X improvement over all KNL executions on the EP kernel (random number generation).

18 NAS Parallel Benchmark Evaluation (Cont'd). Figures: execution time (s) vs. number of processes for NAS-SP (non-linear PDE), CLASS=B and NAS-MG (MultiGrid solver), CLASS=B with KNL (Default), KNL (AVX-512), KNL (AVX-512+MCDRAM), and Broadwell. Similar performance trends are observed on the SP and MG kernels as well; on the SP kernel, MCDRAM-based execution showed up to 20% improvement over the default at 16 processes.

19 Application Kernels Evaluation. Figures: execution time (s) vs. number of processes for the Heat-2D kernel (Jacobi method) and the Heat-Image kernel with KNL (Default), KNL (AVX-512), KNL (AVX-512+MCDRAM), and Broadwell. On the heat-diffusion-based kernels, AVX-512 vectorization showed better performance. MCDRAM showed significant benefits on the Heat-Image kernel for all process counts; combined with AVX-512 vectorization, it showed up to 4X improved performance. A Jacobi halo-exchange sketch follows below.
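The Heat-2D kernel is described as Jacobi-based; the sketch below shows a minimal 1-D-decomposed Jacobi iteration with one-sided halo exchange in OpenSHMEM (not the kernel used in the study). Grid dimensions, iteration count, boundary handling, and the barrier-based synchronization are all illustrative assumptions; the inner stencil loop is the part that AVX-512 vectorization and MCDRAM bandwidth would act on.

#include <stdio.h>
#include <string.h>
#include <shmem.h>

#define NX 512      /* interior rows per PE (illustrative) */
#define NY 512      /* columns (illustrative) */
#define ITERS 100

/* Symmetric grids: NX interior rows plus one halo row above and below */
static double grid[NX + 2][NY], next[NX + 2][NY];

int main(void) {
    shmem_init();
    int me = shmem_my_pe(), npes = shmem_n_pes();
    int up = me - 1, down = me + 1;

    memset(grid, 0, sizeof(grid));
    if (me == 0)
        for (int j = 0; j < NY; j++) grid[1][j] = 100.0;   /* initial hot row */
    shmem_barrier_all();

    for (int it = 0; it < ITERS; it++) {
        /* One-sided halo exchange: push my boundary rows into the neighbors' halos */
        if (up >= 0)
            shmem_putmem(grid[NX + 1], grid[1],  NY * sizeof(double), up);
        if (down < npes)
            shmem_putmem(grid[0],      grid[NX], NY * sizeof(double), down);
        shmem_barrier_all();   /* simple, non-overlapped synchronization */

        /* 5-point Jacobi update on the interior (vectorizable stride-1 inner loop) */
        for (int i = 1; i <= NX; i++)
            for (int j = 1; j < NY - 1; j++)
                next[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                     grid[i][j-1] + grid[i][j+1]);
        memcpy(grid, next, sizeof(grid));
        shmem_barrier_all();
    }

    if (me == 0) printf("grid[1][%d] = %f\n", NY / 2, grid[1][NY / 2]);
    shmem_finalize();
    return 0;
}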

20 Application Kernels Evaluation (Cont'd). Figures: execution time (s) vs. number of processes for the Matrix Multiplication and DAXPY kernels with KNL (Default), KNL (AVX-512), KNL (AVX-512+MCDRAM), and Broadwell. Vectorization helps in matrix multiplication and vector operations. Due to the heavily compute-bound nature of these kernels, MCDRAM didn't show any significant performance improvement. A vectorization-friendly DAXPY sketch follows below.
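The DAXPY observation hinges on the compiler vectorizing a simple stride-1 loop; a minimal sketch follows (the problem size is illustrative, and the mention of icc's -xMIC-AVX512 target in the comment is only an example of how such a loop would be compiled for KNL).

#include <stdio.h>
#include <stdlib.h>

/* DAXPY: y = a*x + y. With restrict-qualified pointers and a stride-1 loop,
   a compiler targeting KNL (e.g., icc -xMIC-AVX512) can auto-vectorize this
   loop with AVX-512 instructions. */
static void daxpy(size_t n, double a,
                  const double *restrict x, double *restrict y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    const size_t n = (size_t)1 << 24;          /* illustrative problem size */
    double *x = malloc(n * sizeof(double));
    double *y = malloc(n * sizeof(double));
    if (!x || !y) return 1;

    for (size_t i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }
    daxpy(n, 3.0, x, y);

    printf("y[0] = %f (expected 5.0)\n", y[0]);
    free(x); free(y);
    return 0;
}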

21 Application Kernels Evaluation (Cont'd). Figure: execution time (s) vs. number of processes for the Scalable Integer Sort kernel (ISx) with KNL (Default), KNL (AVX-512), KNL (AVX-512+MCDRAM), and Broadwell. Up to 3X improvement over the un-optimized execution is observed on KNL; Broadwell showed up to 2X better performance in a core-by-core comparison.

22 Node-by-node Evaluation using Application Kernels. Figure: execution time (s) of the application kernels (DHeat, HeatImg, MatMul, DAXPY, ISx strong-scaling) on a single KNL node (68 cores) vs. a single Broadwell node (28 cores), with a ~30% difference annotated. A single node of KNL is evaluated against a single node of Broadwell using all available physical cores; the Heat-Image kernel showed better performance on KNL than on the Intel Xeon (Broadwell) node.

23 Performance of PGAS Models on KNL: Performance of Put and Get with OpenSHMEM, UPC, and UPC++; Comparison of KNL and Broadwell for OpenSHMEM point-to-point, collective, and atomic operations; Impact of AVX-512 vectorization and MCDRAM on OpenSHMEM application kernels; Performance of UPC++ application kernels on the MVAPICH2-X communication runtime.

24 UPC++ Application Kernels Performance on KNL. Two application kernels were developed and used to evaluate the UPC++ model with MVAPICH2-X as the communication runtime: Sparse Matrix-Vector Multiplication (SpMV) and an Adaptive Mesh Refinement (AMR) kernel performing 2D heat conduction with the Jacobi iterative method.

25 Application Kernels Performance of UPC++ on MVAPICH2-X. Figures: strong-scaling performance (time in s vs. number of processes) of the UPC++ SpMV kernel (2K x 2K) and the UPC++ 2D-Heat kernel (512 x 512). The SpMV and 2D-Heat kernels running over MVAPICH2-X show good scalability as the number of KNL processes increases.

26 Performance Results Summary. Figure: radar-style comparison of KNL (Default), KNL (AVX-512), KNL (AVX-512 + MCDRAM), and Broadwell across Put/Get and Atomics performance, Collectives performance, Node-by-node application performance, and Core-by-core application performance (closer to the center is better).

27 Conclusion. Comprehensive performance evaluation of MVAPICH2-X based OpenSHMEM, UPC, and UPC++ models on the KNL architecture. Significant performance gains were observed on application kernels when using AVX-512 vectorization, up to 2.5X in terms of execution time. MCDRAM benefits are not prominent on most of the application kernels, owing to the lack of memory-bound operations. KNL showed up to 3X worse performance than Broadwell in the core-by-core evaluation, but better or on-par performance relative to Broadwell on the Heat-Image and ISx kernels in the node-by-node evaluation. The runtime implementations need to take advantage of the concurrency of KNL cores.

28 Funding Acknowledgments. Funding support and equipment support acknowledged (sponsor logos shown on the slide).

29 Personnel Acknowledgments.
Current Students: A. Awan (Ph.D.), M. Bayatpour (Ph.D.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), S. Guganani (Ph.D.), J. Hashmi (Ph.D.), N. Islam (Ph.D.), M. Li (Ph.D.), M. Rahman (Ph.D.), D. Shankar (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)
Current Research Scientists: X. Lu, H. Subramoni
Current Post-doc: A. Ruhela
Current Research Specialists: J. Smith, M. Arnold
Past Students: A. Augustine (M.S.), P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), K. Kulkarni (M.S.), R. Kumar (M.S.), S. Krishnamoorthy (M.S.), K. Kandalla (Ph.D.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), S. Sur (Ph.D.), H. Subramoni (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)
Past Research Scientists: K. Hamidouche, S. Sur
Past Programmers: D. Bureddy, J. Perkins
Past Post-Docs: D. Banerjee, J. Lin, S. Marcarelli, X. Besseron, M. Luo, J. Vienne, H.-W. Jin, E. Mancini, H. Wang

30 Thank You! panda@cse.ohio-state.edu. Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/. The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/. The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/. The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
