Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X

Size: px

Start display at page:

Download "Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X"

Austin Jenkins
6 years ago
Views:

1 Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X Intel Nerve Center (SC 7) Presentation Dhabaleswar K (DK) Panda The Ohio State University panda@cse.ohio-state.edu

Parallel Programming Models Overview P P2 P3 P P2 P3 P P2 P3 Shared Memory Memory Memory Memory Logical shared memory Memory Memory Memory Shared Memory Model Distributed Memory Model Partitioned

2 Parallel Programming Models Overview P P2 P3 P P2 P3 P P2 P3 Shared Memory Memory Memory Memory Logical shared memory Memory Memory Memory Shared Memory Model Distributed Memory Model Partitioned Global Address Space (PGAS) SHMEM, DSM MPI (Message Passing Interface) Global Arrays, UPC, Chapel, X, CAF, Programming models provide abstract machine models Models can be mapped on different types of systems e.g. Distributed Shared Memory (DSM), MPI within a node, etc. PGAS models and Hybrid MPI+PGAS models are gradually receiving importance Network Based Computing Laboratory Intel-Booth (SC 7) 2

3 Intel-Booth (SC 7) 3 Partitioned Global Address Space (PGAS) Models Key features - Simple shared memory abstractions - Light weight one-sided communication - Easier to express irregular communication Different approaches to PGAS - Languages - Libraries Unified Parallel C (UPC) OpenSHMEM Co-Array Fortran (CAF) UPC++ X Global Arrays Chapel

4 Intel-Booth (SC 7) 4 Hybrid (MPI+PGAS) Programming Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics Benefits: Cons Best of Distributed Computing Model Best of Shared Memory Computing Model Two different runtimes Need great care while programming Prone to deadlock if not careful HPC Application Kernel MPI Kernel 2 PGAS MPI Kernel 3 MPI Kernel N PGAS MPI

5 Intel-Booth (SC 7) 5 MVAPICH2 Software Family High-Performance Parallel Programming Libraries MVAPICH2 MVAPICH2-X MVAPICH2-GDR MVAPICH2-Virt MVAPICH2-EA MVAPICH2-MIC Microbenchmarks OMB Tools OSU INAM OEMT Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with unified communication runtime Optimized MPI for clusters with NVIDIA GPUs High-performance and scalable MPI for hypervisor and container based HPC cloud Energy aware and High-performance MPI Optimized MPI for clusters with Intel KNC Microbenchmarks suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration Utility to measure the energy consumption of MPI applications

MVAPICH2-X for Hybrid MPI + PGAS Applications Current Model Separate Runtimes for OpenSHMEM/UPC/UPC++/CAF and MPI Possible deadlock if both runtimes are not progressed Consumes more network resource

6 MVAPICH2-X for Hybrid MPI + PGAS Applications Current Model Separate Runtimes for OpenSHMEM/UPC/UPC++/CAF and MPI Possible deadlock if both runtimes are not progressed Consumes more network resource Unified communication runtime for MPI, UPC, UPC++, OpenSHMEM, CAF Available with since 202 (starting with MVAPICH2-X.9) Network Based Computing Laboratory Intel-Booth (SC 7) 6

7 Application Level Performance with Graph500 and Sort Time (s) Network Based Computing Laboratory Graph500 Execution Time MPI-Simple MPI-CSC MPI-CSR Hybrid (MPI+OpenSHMEM) 4K 8K 6K No. of Processes 7.6X J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC 3), June 203 J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '2), September 202 3X Performance of Hybrid (MPI+ OpenSHMEM) Graph500 Design 8,92 processes - 2.4X improvement over MPI-CSR - 7.6X improvement over MPI-Simple 6,384 processes -.5X improvement over MPI-CSR - 3X improvement over MPI-Simple J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '3), October 203. Time (seconds) MPI Sort Execution Time Hybrid 500GB-52 TB-K 2TB-2K 4TB-4K Input Data - No. of Processes Performance of Hybrid (MPI+OpenSHMEM) Sort Application 4,096 processes, 4 TB Input Size - MPI 2408 sec; 0.6 TB/min - Hybrid 72 sec; 0.36 TB/min - 5% improvement over MPI-design Intel-Booth (SC 7) 7 5%

8 Intel-Booth (SC 7) 8 Performance of PGAS Models on KNL Performance of Put and Get with OpenSHMEM, UPC, and UPC++ Evaluation of KNL many-core processor for OpenSHMEM pointto-point, collectives, and atomics Operations Impact of AVX-52 Vectorization and MCDRAM on OpenSHMEM Application Kernels Performance of UPC++ Application kernels on MVAPICH2-X communication runtime

9 Latency (us) Performance of PGAS Models on KNL using MVAPICH2-X K 4K 6K 64K 256K M Network Based Computing Laboratory Intra-node PUT shmem_put upc_putmem upcxx_async_put Message Size Latency (us) Intra-node GET shmem_get upc_getmem upcxx_async_get K 4K 6K 64K 256K M Message Size Intra-node performance of one-sided Put/Get operations of PGAS libraries/languages using MVAPICH2-X communication conduit Near-native communication performance is observed on KNL Intel-Booth (SC 7) 9

10 Latency (us) Performance of PGAS Models on KNL using MVAPICH2-X Inter-node PUT shmem_put upc_putmem upcxx_async_put K 4K 6K 64K 256K M Message Size K 4K 6K 64K 256K M Network Based Computing Laboratory Intel-Booth (SC 7) Latency (us) Inter-node GET shmem_get upc_getmem upcxx_async_get Message Size Inter-node performance of one-sided Put/Get operations using MVAPICH2-X communication conduit with InfiniBand HCA (MT45) Native IB performance for all three PGAS models is observed

11 Intel-Booth (SC 7) Performance of PGAS Models on KNL Performance of Put and Get with OpenSHMEM, UPC, and UPC++ Evaluation of KNL many-core processor for OpenSHMEM pointto-point, collectives, and atomics Operations Impact of AVX-52 Vectorization and MCDRAM on OpenSHMEM Application Kernels Performance of UPC++ Application kernels on MVAPICH2-X communication runtime

12 Microbenchmark Evaluations (Intra-node Put/Get) 00 0 KNL 00 0 KNL Time (us) Time (us) Message size (bytes) Shmem_putmem Message size (bytes) Shmem_getmem Muti-threaded memcpy routines on KNL can further improve the performance of basic Put/Get operations J. Hashmi, M. Li, H. Subramoni, D. Panda, Exploiting and Evaluating OpenSHMEM on KNL Architecture, Fourth Workshop on OpenSHMEM, Aug 207 Network Based Computing Laboratory Intel-Booth (SC 7) 2

13 Microbenchmark Evaluations (Inter-node Put/Get) Time (us) Time (us) Message size (bytes) Shmem_putmem Message size (bytes) Shmem_getmem Inter-node one-sided Put and Get using 2 KNL nodes with process per node KNL showed good scalability on inter-node one-sided Put and Get operations Network Based Computing Laboratory Intel-Booth (SC 7) 3

14 Microbenchmark Evaluations (Collectives) 000 KNL 000 KNL Time (us) 0 Time (us) 0 Message size (bytes) Shmem_reduce on 28 processes Message size (bytes) Shmem_broadcast on 28 processes Inter-node collectives runs using 2 KNL nodes with 64 processes per node Good scalability of collectives is observed on KNL using collective benchmarks Basic point-to-point performance difference is reflected in collectives as well Network Based Computing Laboratory Intel-Booth (SC 7) 4

Microbenchmark Evaluations (Atomics) 2 KNL Time (us) 8 6 4 2 Available in MVAPICH2-X 2.

15 Microbenchmark Evaluations (Atomics) 2 KNL Time (us) Available in MVAPICH2-X 2.3b 0 fadd32 fadd64 cswap32 cswap64 inc32 inc64 OpenSHMEM atomics on 28 processes Using multiple nodes of KNL, atomic operations showed about 2.5X degradation on compare-swap, and Inc atomics Fetch-and-add (32-bit) showed up to 4X degradation on KNL Network Based Computing Laboratory Intel-Booth (SC 7) 5

16 Intel-Booth (SC 7) 6 Performance of PGAS Models on KNL Performance of Put and Get with OpenSHMEM, UPC, and UPC++ Evaluation of KNL many-core processor for OpenSHMEM pointto-point, collectives, and atomics Operations Impact of AVX-52 Vectorization and MCDRAM on OpenSHMEM Application Kernels Performance of UPC++ Application kernels on MVAPICH2-X communication runtime

17 Intel-Booth (SC 7) 7 Time (s) NAS Parallel Benchmark Evaluation NAS-BT (PDE solver), CLASS=B KNL (Default) KNL (AVX-52) KNL (AVX-52+MCDRAM) No. of processes NAS-EP (RNG), CLASS=B AVX-52 vectorized and MCDRAM based execution of NAS kernels on KNL NAS-bT showed 30% improvement over default execution EP kernel didn t show much improvement Time (s) KNL (Default) KNL (AVX-52) KNL (AVX-52+MCDRAM) No. of processes

18 Intel-Booth (SC 7) 8 Time (s) NAS Parallel Benchmark Evaluation (Cont d) NAS-SP (non-linear PDE), CLASS=B KNL (Default) KNL (AVX-52) KNL (AVX-52+MCDRAM) No. of processes NAS-MG (MultiGrid solver), CLASS=B Similar performance trends are observed on BT and MG kernels as well On SP kernel, MCDRAM based execution showed up to 20% improvement over default at 6 processes Time (s) KNL (Default) KNL (AVX-52) KNL (AVX-52+MCDRAM) No. of processes

Intel-Booth (SC 7) 9 Application Kernels Evaluation Heat-2D Kernel using Jacobi method Heat Image Kernel 00 KNL (Default) 0 KNL (Default) KNL (AVX-52) KNL (AVX-52) 0 KNL (AVX-52+MCDRAM) KNL

19 Intel-Booth (SC 7) 9 Application Kernels Evaluation Heat-2D Kernel using Jacobi method Heat Image Kernel 00 KNL (Default) 0 KNL (Default) KNL (AVX-52) KNL (AVX-52) 0 KNL (AVX-52+MCDRAM) KNL (AVX-52+MCDRAM) Time (s) Time (s) No. of processes No. of processes On heat diffusion based kernels AVX-52 vectorization showed better performance MCDRAM showed significant benefits on Heat-Image kernel for all process counts. Combined with AVX-52 vectorization, it showed up to 4X improved performance

20 Intel-Booth (SC 7) 20 Time (s) Application Kernels Evaluation (Cont d) 00 0 Matrix Multiplication kernel KNL (Default) KNL (AVX-52) KNL (AVX-52+MCDRAM) Time (s) 0 DAXPY kernel KNL (Default) KNL (AVX-52) KNL (AVX-52+MCDRAM) No. of processes No. of processes Vectorization helps in matrix multiplication and vector operations Due to heavily compute bound nature of these kernels, MCDRAM didn t show any significant performance improvement

21 Intel-Booth (SC 7) 2 Application Kernels Evaluation (Cont d) Scalable Integer Sort Kernel (ISx) Time (s) 0 0. KNL (Default) KNL (AVX-52+MCDRAM) KNL (AVX-52) No. of processes Scalable Integer Sort kernel evaluation on KNL for different configuration Up to 3X improvement on un-optimized execution is observed on KNL

22 Application Kernels Performance on KNL Network Based Computing Laboratory Intel-Booth (SC 7) 22 Application Kernels on a single KNL 0 KNL (68) Time (s) DHeat HeatImg MatMul DAXPY ISx(strong) A single node of KNL is evaluated under different application kernels using all the available physical cores

23 Intel-Booth (SC 7) 23 Performance of PGAS Models on KNL Performance of Put and Get with OpenSHMEM, UPC, and UPC++ Evaluation of KNL many-core processor for OpenSHMEM pointto-point, collectives, and atomics Operations Impact of AVX-52 Vectorization and MCDRAM on OpenSHMEM Application Kernels Performance of UPC++ Application kernels on MVAPICH2-X communication runtime

24 Intel-Booth (SC 7) 24 UPC++ Application Kernels Performance on KNL Developed and Used two application kernels to evaluate UPC++ model using MVAPICH2-X as communication runtime Sparse Matrix Vector Multiplication (SpMV) Adaptive Mesh Refinement (AMR) kernel 2D-Heat conduction using Jacobi iterative solver

25 Intel-Booth (SC 7) 25 Application Kernels Performance of UPC++ on MVAPICH2-X Time (s) UPC++ SpMV No. of Processes Strong-scaling Performance of SpMV kernel (2Kx2K) Time (s) 20 UPC++ 2D Heat No. of Processes Strong-scaling Performance of 2D-Heat kernel (52x52) SpMV and 2D Heat kernels using MVAPICH2-X shows good scalability on increasing number of processes of KNL

26 Performance Results Summary Network Based Computing Laboratory Intel-Booth (SC 7) 26 Put/Get and Atomics Performance KNL (Default) KNL (AVX52) KNL (AVX52 + MCDRAM) Collectives Performance Node-by-node Application Performance (Closer to center is better) Core-by-core Application Performance

27 Conclusion Comprehensive performance evaluation of MVAPICH2-X based OpenSHMEM, UPC, and UPC++ models over the KNL architecture Observed significant performance gains on application kernels when using AVX-52 vectorization 2.5x performance benefits in terms of execution time MCDRAM benefits are not prominent on most of the application kernels Lack of memory bound operations KNL showed good scalability on application kernels such as Heat-Image and Isx The runtime implementations need to take advantage of the concurrency of KNL cores All proposed enhancements are available in the latest MVAPICH2-X 2.3b release ( Network Based Computing Laboratory Intel-Booth (SC 7) 27

28 Intel-Booth (SC 7) 28 Funding Acknowledgments Funding Support by Equipment Support by

29 Personnel Acknowledgments Current Students A. Awan (Ph.D.) M. Bayatpour (Ph.D.) S. Chakraborthy (Ph.D.) C.-H. Chu (Ph.D.) S. Guganani (Ph.D.) J. Hashmi (Ph.D.) N. Islam (Ph.D.) M. Li (Ph.D.) M. Rahman (Ph.D.) D. Shankar (Ph.D.) A. Venkatesh (Ph.D.) J. Zhang (Ph.D.) Current Research Scientists X. Lu H. Subramoni Current Post-doc A. Ruhela Current Research Specialist J. Smith M. Arnold Past Students A. Augustine (M.S.) P. Balaji (Ph.D.) S. Bhagvat (M.S.) A. Bhat (M.S.) D. Buntinas (Ph.D.) L. Chai (Ph.D.) B. Chandrasekharan (M.S.) N. Dandapanthula (M.S.) V. Dhanraj (M.S.) T. Gangadharappa (M.S.) K. Gopalakrishnan (M.S.) W. Huang (Ph.D.) W. Jiang (M.S.) J. Jose (Ph.D.) S. Kini (M.S.) M. Koop (Ph.D.) K. Kulkarni (M.S.) R. Kumar (M.S.) S. Krishnamoorthy (M.S.) K. Kandalla (Ph.D.) P. Lai (M.S.) J. Liu (Ph.D.) M. Luo (Ph.D.) A. Mamidala (Ph.D.) G. Marsh (M.S.) V. Meshram (M.S.) A. Moody (M.S.) S. Naravula (Ph.D.) R. Noronha (Ph.D.) X. Ouyang (Ph.D.) S. Pai (M.S.) S. Potluri (Ph.D.) R. Rajachandrasekar (Ph.D.) G. Santhanaraman (Ph.D.) A. Singh (Ph.D.) J. Sridhar (M.S.) S. Sur (Ph.D.) H. Subramoni (Ph.D.) K. Vaidyanathan (Ph.D.) A. Vishnu (Ph.D.) J. Wu (Ph.D.) W. Yu (Ph.D.) Past Research Scientist K. Hamidouche S. Sur Past Programmers D. Bureddy J. Perkins Past Post-Docs D. Banerjee X. Besseron H.-W. Jin J. Lin M. Luo E. Mancini S. Marcarelli J. Vienne H. Wang Network Based Computing Laboratory Intel-Booth (SC 7) 29

30 Intel-Booth (SC 7) 30 Thank You! Network-Based Computing Laboratory The High-Performance MPI/PGAS Project The High-Performance Big Data Project The High-Performance Deep Learning Project

Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X

Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X IXPUG 7 PresentaNon J. Hashmi, M. Li, H. Subramoni and DK Panda The Ohio State University E-mail: {hashmi.29,li.292,subramoni.,panda.2}@osu.edu