MVAPICH2 and MVAPICH2-MIC: Latest Status

Size: px

Start display at page:

Download "MVAPICH2 and MVAPICH2-MIC: Latest Status"

Prosper Elliott
5 years ago
Views:

1 MVAPICH2 and MVAPICH2-MIC: Latest Status Presentation at IXPUG Meeting, July 214 by Dhabaleswar K. (DK) Panda and Khaled Hamidouche The Ohio State University {panda,

2 Number of Clusters Percentage of Clusters Trends for Commodity Computing Clusters in the Top 5 List ( Percentage of Clusters Number of Clusters Timeline 2

3 Large-scale InfiniBand Installations 223 IB Clusters (44.3%) in the June 214 Top5 list ( Installations in the Top 5 (25 systems): 519,64 cores (Stampede) at TACC (7 th ) 12, 64 cores (Nebulae) at China/NSCS (28 th ) 62,64 cores (HPC2) in Italy (11 th ) 72,288 cores (Yellowstone) at NCAR (29 th ) 147, 456 cores (Super MUC) in Germany (12 th ) 7,56 cores (Helios) at Japan/IFERC (3 th ) 76,32 cores (Tsubame 2.5) at Japan/GSIC (13 th ) 138,368 cores (Tera-1) at France/CEA (35 th ) 194,616 cores (Cascade) at PNNL (15 th ) 222,72 cores (QUARTETTO) in Japan (37 th ) 11,4 cores (Pangea) at France/Total (16 th ) 53,54 cores (PRIMERGY) in Australia (38 th ) 96,192 cores (Pleiades) at NASA/Ames (21 st ) 77,52 cores (Conte) at Purdue University (39th) 73,584 cores (Spirit) at USA/Air Force (24 th ) 44,52 cores (Spruce A) at AWE in UK (4 th ) 77,184 cores (Curie thin nodes) at France/CEA (26 h ) 48,896 cores (MareNostrum) at Spain/BSC (41 st ) 65,32-cores, idataplex DX36M4 at Germany/Max- Planck (27 th ) and many more! 3

4 MVAPICH2/MVAPICH2-X Software High Performance open-source MPI Library for InfiniBand, 1Gig/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE) MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.), Available since 22 MVAPICH2-X (MPI + PGAS), Available since 212 Support for GPGPUs and MIC Used by more than 2,15 organizations in 72 countries More than 218, downloads from OSU site directly Empowering many TOP5 clusters 7 th ranked 519,64-core cluster (Stampede) at TACC 13 th ranked 74,358-core cluster (Tsubame 2.5) at Tokyo Institute of Technology 23 rd ranked 96,192-core cluster (Pleiades) at NASA Available with software stacks of many IB, HSE, and server vendors including Linux Distros (RedHat and SuSE) Partner in the U.S. NSF-TACC Stampede System 4

5 MVAPICH2 2. GA and MVAPICH2-X 2. GA Released on 6/2/14 Major Features and Enhancements Based on MPICH-3.1 MPI-3 RMA Support Support for Non-blocking collectives MPI-T support CMA support is default for intra-node communication Optimization of collectives with CMA support Large message transfer support Reduced memory footprint Improved Job start-up time Optimization of collectives and tuning for multiple platforms Updated hwloc to version 1.9 MVAPICH2-X 2. GA supports hybrid MPI + PGAS (UPC and OpenSHMEM) programming models Based on MVAPICH2 2. GA including MPI-3 features Compliant with UPC and OpenSHMEM v1.f 5

6 MVAPICH2-MIC: Optimized MPI Library for Xeon Phi Clusters MPI libraries run out of box (or with minor changes) on the Xeon Phi Critical to optimize the runtimes for better performance Tune existing designs Designs using lower level features offered by MPSS Designs to address system-level limitations Initial version of MVAPICH2-MIC (based on MVAPICH2 2.a release) is available on Stampede since Oct 13 Supports all modes of usage host-only, offload, coprocessor-only and symmetric Improved shared memory communication channel SCIF-based designs for improved communication within MIC and between MICs and Hosts Proxy-based design to work around bandwidth limitations on Sandy Bridge platform Enhanced version based on MVAPICH2 2.GA release is being worked out. Will be coming out soon!! 6

7 MPI Applications on MIC Clusters MPI (+X) continues to be the predominant programming model in HPC Flexibility in launching MPI jobs on clusters with Xeon Phi Xeon Xeon Phi Multi-core Centric Host-only MPI Program Offload (/reverse Offload) MPI Program Offloaded Computation Symmetric MPI Program MPI Program Coprocessor-only Many-core Centric MPI Program 7

8 MPI Applications on MIC Clusters Offload mode: lesser overhead way to extract performance from Xeon Phi Xeon Xeon Phi Multi-core Centric Host-only MPI Program Offload MPI Program Offloaded Computation Symmetric MPI Program MPI Program Coprocessor-only Many-core Centric MPI Program 8

9 MIC MIC PCIe MPI Data Movement in Offload Mode MPI communication only from the host Node Node 1 CPU QPI CPU1 CPU1 CPU IB MPI Process 1. Intra-Socket 2. Inter-Socket 3. Inter-Node Supported by the Standard MVAPICH2 Library (Latest Release - MVAPICH2 2. GA) 9

10 Latency (us) Latency (us) One-way Latency: MPI over IB with MVAPICH Small Message Latency Large Message Latency Qlogic-DDR Qlogic-QDR ConnectX-DDR ConnectX2-PCIe2-QDR ConnectX3-PCIe3-FDR Sandy-ConnectIB-DualFDR Ivy-ConnectIB-DualFDR Message Size (bytes) Message Size (bytes) DDR, QDR GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch FDR GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch ConnectIB-Dual FDR GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch ConnectIB-Dual FDR GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch 1

11 Bandwidth (MBytes/sec) Bandwidth (MBytes/sec) Bandwidth: MPI over IB with MVAPICH Unidirectional Bandwidth Bidirectional Bandwidth Qlogic-DDR Qlogic-QDR ConnectX-DDR ConnectX2-PCIe2-QDR ConnectX3-PCIe3-FDR Sandy-ConnectIB-DualFDR Ivy-ConnectIB-DualFDR Message Size (bytes) Message Size (bytes) DDR, QDR GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch FDR GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch ConnectIB-Dual FDR GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch ConnectIB-Dual FDR GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch 11

12 Bandwidth (MB/s) Bandwidth (MB/s) Latency (us) MVAPICH2 Two-Sided Intra-Node Performance (Shared memory and Kernel-based Zero-copy Support (LiMIC and CMA)) Latency Intra-Socket Inter-Socket.45 us.18 us K Latest MVAPICH2 2. Intel Ivy-bridge Bandwidth (Intra-socket) intra-socket-cma intra-socket-shmem intra-socket-limic Bandwidth (Inter-socket) 14,25 MB/s inter-socket-cma 13,749 MB/s 12 inter-socket-shmem 1 inter-socket-limic

13 Latency (us) Bandwidth (Mbytes/sec) Latency (us) Bandwidth (Mbytes/sec) MPI-3 RMA Get/Put with Flush Performance Inter-node Get/Put Latency Get Put 2.4 us 1.56 us K 2K 4K Inter-node Get/Put Bandwidth Get Put K 4K 16K 64K 256K Intra-socket Get/Put Latency Get Put 2 15 Intra-socket Get/Put Bandwidth Get Put us K 2K 4K 8K K 4K 16K 64K 256K Latest MVAPICH2 2., Intel Sandy-bridge with Connect-IB (single-port) 13

14 GFlops GFlops HPCG Benchmark with Offload Mode MVAPICH2-Offload MVAPICH2-Offload # Nodes # Nodes Full subscription of a node: 1 MPI process + 1 OpenMP threads on the CPU 3 MPI processes + 6 OpenMP threads on CPU + 24 OpenMP threads offload on MIC Input data size = 128^3 Binding using MV2_CPU_MAPPING 1 node MVAPICH2 achieves 24 GFlops 124 nodes MVAPICH2 achieves 18.2 TFlops 14

15 MPI Applications on Xeon Phi Clusters - IntraNode Xeon Phi as a many-core node Xeon Xeon Phi Multi-core Centric Host-only MPI Program Offload MPI Program Offloaded Computation Symmetric MPI Program MPI Program Coprocessor-only Many-core Centric MPI Program 15

16 MIC PCIe MPI Data Movement in Coprocessor-Only Mode CPU Node QPI CPU1 Xeon Phi C L2 C L2 C L2 C L2 GDDR GDDR GDDR GDDR C C C C L2 L2 L2 L2 MPI Process Intra-MIC Communication 16

17 MVAPICH2-MIC Channels for Intra-MIC Communication MVAPICH2 provides a hybrid of two channels CH3-SHM Derives the design for shared memory communication on host Tuned for the Xeon Phi architecture CH3-SCIF SCIF is a lower level API provided by MPSS Provides user control of the DMA Takes advantage of SCIF to improve performance CH3-SHM POSIX Calls Xeon Phi MVAPICH2 CH3-SCIF SCIF S. Potluri, A. Venkatesh, D. Bureddy, K. Kandalla, and D. K. Panda - Efficient Intra-node Communication on Intel-MIC Clusters - International Symposium on Cluster, Cloud and Grid Computing, May 213 (CCGrid 13). S. Potluri, K. Tomko, D. Bureddy and D. K. Panda - Intra-MIC MPI Communication using MVAPICH2: Early Experience - TACC-Intel Highly-Parallel Computing Symposium (TI-HPCS), April Best Student Paper 17

18 Bandwidth (MB/s) Better Bandwidth (MB/s) Better Latency (usec) Latency (usec) Intra-MIC - Point-to-Point Communication MV2-MIC-2.GA MV2-MIC-2.a Better K 4K 66% Better osu_latency (small) osu_bw osu_latency (medium) 3% K 4K osu_bibw 18

19 Latency (usec) Latency (usec) Intra-MIC - Collective Communication AlltoAll MV2-MIC-2.GA MV2-MIC-2.a Better % Better K 4K K 4K 8 processes osu_alltoall (small) 16 processes osu_alltoall (small) 8 Processes: 4% and 19% improvement for 4 and 4K bytes messages 16 Processes: 38% and 71% improvement for 64 and 1K bytes messages 19

a 16 14 12 1 8 6 4 2 8% 32 64 Input size 4 MPI

20 GFlops Better Intra-MIC HPCG Benchmark MV2-MIC-2.GA MV2-MIC-2.a % Input size 4 MPI processes with 6 OpenMP threads per process Binding using MV2_MIC_MAPPING environment variable 2

21 MPI Applications on MIC Clusters - InterNode Xeon Xeon Phi Multi-core Centric Host-only MPI Program Offload MPI Program Offloaded Computation Symmetric MPI Program MPI Program Coprocessor-only Many-core Centric MPI Program 21

22 MIC MIC PCIe MPI Data Movement in Coprocessor-Only Mode Node Node 1 CPU QPI CPU1 CPU1 CPU IB MPI Process 1. MIC-RemoteMIC 22

23 MVAPICH2 Channels for MIC-RemoteMIC Communication Xeon Phi MVAPICH2 OFA-IB-CH3 IB Verbs mlx4_ Xeon Phi MVAPICH2 CH3 IB Verbs mlx4_ IB-HCA CH3-IB channel High-performance InfiniBand channel in MVAPICH2 Implemented using IB Verbs - DirectIB Currently does not support advanced features like hardware multicast, etc. Limited by P2P Bandwidth offered on Sandy Bridge platform 23

24 Host Proxy-based Designs in MVAPICH2-MIC SNB E MB/s 6296 MB/s 528 MB/s MB/s P2P Write/IB Write to Xeon Phi P2P Read / IB Read from Xeon Phi Xeon Phi-to-Host IB Read from Host Direct IB channels is limited by P2P read bandwidth MVAPICH2-MIC uses a hybrid DirectIB + host proxy-based approach to work around this S. Potluri, D. Bureddy, K. Hamidouche, A. Venkatesh, K. Kandalla, H. Subramoni and D. K. Panda, MVAPICH- PRISM: A Proxy-based Communication Framework using InfiniBand and SCIF for Intel MIC Clusters Int'l Conference on Supercomputing (SC '13), November

25 Bandwidth (MB/sec) Better Bandwidth (MB/sec) Better Latency (usec) Latency (usec) MIC-RemoteMIC Point-to-Point Communication (Active Proxy) MV2-MIC-2.GA w/ proxy MV2-MIC-2.GA w/o proxy Better Better K 8K 32K 128K 512K 2M 6 osu_latency (small) osu_latency (large) K 64K 1M K 64K 1M osu_bw osu_bibw 25

26 GFlops InterNode - Coprocessor-only Mode - HPCG MV2-MIC-2.GA # MICs Full subscription of a MIC: (Host is not involved) 3 MPI processes + 24 OpenMP threads on MIC Input size = 96^3 Binding using MV2_MIC_MAPPING 1 node MVAPICH2-MIC achieves 15 Gflops 8 nodes MVAPICH2-MIC achieves 15 GFlops 26

27 MPI Applications on MIC Clusters - InterNode Xeon Xeon Phi Multi-core Centric Host-only MPI Program Offload MPI Program Offloaded Computation Symmetric MPI Program MPI Program Coprocessor-only Many-core Centric MPI Program 27

28 MIC PCIe MPI Data Movement in Symmetric Mode - InterNode Node Node 1 CPU QPI CPU1 CPU1 CPU IB MPI Process 1. MIC- RemoteMIC 2. MIC- RemoteHost (Closer to HCA) 3. MIC- RemoteHost (Farther from HCA) Uses OFA-IB-CH3 channel backed by host-based proxy 28

29 K 4K 16K 64K 256K 1M 4M K 4K 16K 64K 256K 1M 4M Bandwidth (MB/s) Better Bandwidth (MB/s) Better Latency (usec) Latency (usec) MIC-RemoteHost Point-to-Point Communication (Active Proxy) MV2-MIC-2.GA w/ proxy MV2-MIC-2.GA w/o proxy K Better K 32K 128K 512K 2M Better osu_latency (small) osu_latency (large) 8742 osu_bw osu_bibw 29

30 Latency (usec) Latency (usec) Inter-Node Symmetric Mode Passive Proxy 64 MPI processes on 2 nodes (16H+16M per node) MV2-MIC-2.GA MV2-MIC-2.a Better X 6 1.8X K 4K 16K64K Better osu_alltoall osu_allgather Fully-subscribed mode (16 processes on the Host) Passive proxy design outperforms the active proxy (default in MV2-MIC-2.a) 3

31 Inter-Node Symmetric Mode Redesigning Collectives Latency (us) x 1 Latency (us) x MV2-MIC-2.a MV2-MIC-2.GA Message Size (bytes) Allgather: 32 Nodes (16H+16M) Message Size (bytes) Alltoall: 16 Nodes (8H+8M) 3X Latency (us) x 1 Latency (us) x Message Size (bytes) Allgather: 32 Nodes (8H+8M) Message Size (bytes) Alltoall: 16 Nodes (8H+8M) A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda - High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters; IPDPS 14, May 214 2X 2.5X 31 Better Better

32 GFlops InterNode Symmetric Mode - HPCG MV2-MIC-2.GA # Nodes Full subscription of a node: 2 MPI processes + 16 OpenMP threads on CPU 3 MPI processes + 24 OpenMP threads on MIC Input size = 96^3 Binding using both MV2_CPU_MAPPING and MV2_MIC_MAPPING To explore different threading levels for host and Xeon Phi 1 node MVAPICH2-MIC achieves 23.5 GFlops 16 nodes MVAPICH2-MIC achieves 34 GFlops 32

33 On-Going and Future Work Redesigning of other collective operations Optimizing MPI3-RMA operations for MIC Automatic proxy selection using architecture topology detection Evaluating application performance and scaling Extending designs to KNL-based self-hosted nodes 33

34 Conclusion MVAPICH2-MIC provides a high-performance and scalable MPI library for InfiniBand clusters using Xeon Phi Initial version takes advantage of SCIF to improve Intra-MIC and Intranode MIC-Host communication Host-Proxy based designs help work around P2P bandwidth bottlenecks Provides initial support for heterogeneity-aware collectives Enhanced version of MVAPICH2-MIC (based on 2. GA) will be available soon 34

35 Upcoming 2 nd Annual MVAPICH User Group (MUG) Meeting August 25-27, 214; Columbus, Ohio, USA Keynote Talks, Invited Talks, Contributed Presentations Tutorial on MVAPICH2 and MVAPICH2-X optimization and tuning Speakers (Confirmed so far): Keynote: Dan Stanzione (TACC) Keynote: Darren Kerbyson (PNNL) Sadaf Alam (CSCS, Switzerland) Onur Celebioglu (Dell) Jens Glaser (Univ. of Michigan) Jeff Hammond (Intel) Adam Moody (LLNL) David Race (Cray) Davide Rossetti (NVIDIA) Gilad Shainer (Mellanox) Sayantan Sur (Intel) Mahidhar Tatineni (SDSC) Jerome Vienne (TACC) More details at: 35

36 Web Pointers MVAPICH Web Page 36

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda