Networking Technologies and Middleware for Next-Generation Clusters and Data Centers: Challenges and Opportunities


1 Networking Technologies and Middleware for Next-Generation Clusters and Data Centers: Challenges and Opportunities Keynote Talk at HPSR 18 by Dhabaleswar K. (DK) Panda The Ohio State University

2 HPSR 18 2 High-End Computing (HEC): Towards Exascale - the march from 1 PFlops, through tens to hundreds of PFlops, to 1 EFlops; an ExaFlop system is expected in the coming years!

3 Big Data How Much Data Is Generated Every Minute on the Internet? The global Internet population grew 7.5% from 2016 and now represents 3.7 Billion People. Courtesy: HPSR 18 3

4 Resurgence of AI/Machine Learning/Deep Learning Courtesy: HPSR 18 4

5 HPSR 18 5 A Typical Multi-Tier Data Center Architecture Routers/ Servers Application Server Database Server Routers/ Servers Routers/ Servers.. Routers/ Servers Switch Application Server Application Server.. Application Server Switch Database Server Database Server Database Server Switch Tier1 Tier2 Tier3

6 Data Management and Processing on Modern Datacenters HPSR 18 6 Substantial impact on designing and utilizing data management and processing systems in multiple tiers Front-end data accessing and serving (Online) Memcached + DB (e.g. MySQL), HBase Back-end data analytics (Offline) HDFS, MapReduce, Spark

7 Communication and Computation Requirements HPSR 18 7 Clients WAN Proxy Server Web-server (Apache) More Computation and Communication Requirements Storage Application Server (PHP) Database Server (MySQL) Requests are received from clients over the WAN Proxy nodes perform caching, load balancing, resource monitoring, etc. If not cached, the request is forwarded to the next tiers Application Server Application server performs the business logic (CGI, Java servlets, etc.) Retrieves appropriate data from the database to process the requests

8 HPSR 18 8 Increasing Usage of HPC, Big Data and Deep Learning on Modern Datacenters HPC (MPI, RDMA, Lustre, etc.) Big Data (Hadoop, Spark, HBase, Memcached, etc.) Convergence of HPC, Big Data, and Deep Learning! Deep Learning (Caffe, TensorFlow, BigDL, etc.) Increasing Need to Run these applications on the Cloud!!

9 HPSR 18 9 Can We Run HPC, Big Data and Deep Learning Jobs on Existing HPC Infrastructure? Physical Compute

10 Can We Run HPC, Big Data and Deep Learning Jobs on Existing HPC Infrastructure? HPSR 18 1

11 Can We Run HPC, Big Data and Deep Learning Jobs on Existing HPC Infrastructure? HPSR 18 11

12 Can We Run HPC, Big Data and Deep Learning Jobs on Existing HPC Infrastructure? HPSR Hadoop Job Deep Learning Job Spark Job

13 Drivers of Modern HPC Cluster and Data Center Architecture Multi-/Many-core Processors High Performance Interconnects InfiniBand (with SR-IOV) <1usec latency, 200Gbps Bandwidth> Accelerators high compute density, high performance/watt >1 TFlop DP on a chip SSD, NVMe-SSD, NVRAM Multi-core/many-core technologies Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand, iWARP, RoCE, and Omni-Path) Single Root I/O Virtualization (SR-IOV) Solid State Drives (SSDs), NVMe/NVMf, Parallel Filesystems, Object Storage Clusters Accelerators (NVIDIA GPGPUs and FPGAs) SDSC Comet TACC Stampede Cloud HPSR 18 13

14 Trends in Network Speed Acceleration
Ethernet ( - ) 10 Mbit/sec
Fast Ethernet (1993 -) 100 Mbit/sec
Gigabit Ethernet (1995 -) 1000 Mbit/sec
ATM (1995 -) 155/622/1024 Mbit/sec
Myrinet (1993 -) 1 Gbit/sec
Fibre Channel (1994 -) 1 Gbit/sec
InfiniBand (2001 -) 2 Gbit/sec (1X SDR)
10-Gigabit Ethernet (2001 -) 10 Gbit/sec
InfiniBand (2003 -) 8 Gbit/sec (4X SDR)
InfiniBand (2005 -) 16 Gbit/sec (4X DDR), 24 Gbit/sec (12X SDR)
InfiniBand (2007 -) 32 Gbit/sec (4X QDR)
40-Gigabit Ethernet (2010 -) 40 Gbit/sec
InfiniBand (2011 -) 54.6 Gbit/sec (4X FDR)
InfiniBand (2012 -) 2 x 54.6 Gbit/sec (4X Dual-FDR)
25-/50-Gigabit Ethernet (2014 -) 25/50 Gbit/sec
100-Gigabit Ethernet (2015 -) 100 Gbit/sec
Omni-Path (2015 -) 100 Gbit/sec
InfiniBand (2015 -) 100 Gbit/sec (4X EDR)
InfiniBand (2016 -) 200 Gbit/sec (4X HDR)
100 times in the last 17 years HPSR 18 14

15 Available Interconnects and Protocols for Data Centers Application / Middleware Interface Application / Middleware Sockets Verbs OFI Protocol Kernel Space TCP/IP TCP/IP RSockets SDP TCP/IP RDMA RDMA RDMA Ethernet Driver IPoIB Hardware Offload User Space RDMA User Space User Space User Space User Space Adapter Ethernet Adapter InfiniBand Adapter Ethernet Adapter InfiniBand Adapter InfiniBand Adapter iWARP Adapter RoCE Adapter InfiniBand Adapter Omni-Path Adapter Switch Ethernet Switch 1/10/25/40/50/100 GigE InfiniBand Switch IPoIB Ethernet Switch 1/10/25/40/50/100 GigE-TOE InfiniBand Switch RSockets InfiniBand Switch SDP Ethernet Switch iWARP Ethernet Switch RoCE InfiniBand Switch IB Native Omni-Path Switch 100 Gb/s HPSR 18 15

16 Open Standard InfiniBand Networking Technology Introduced in Oct 2000 High Performance Data Transfer Interprocessor communication and I/O Low latency (<1.0 microsec), High bandwidth (up to 25 GigaBytes/sec -> 200Gbps), and low CPU utilization (5-10%) Flexibility for LAN and WAN communication Multiple Transport Services Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), Unreliable Datagram (UD), and Raw Datagram Provides flexibility to develop upper layers Multiple Operations Send/Recv RDMA Read/Write Atomic Operations (very unique) high performance and scalable implementations of distributed locks, semaphores, collective communication operations Leading to big changes in designing HPC clusters, file systems, cloud computing systems, grid computing systems, etc. HPSR 18 16
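To make the verbs-level operations listed above concrete, here is a minimal, hedged libibverbs sketch of posting a one-sided RDMA Write. It assumes a connected RC queue pair, a registered local buffer, and an out-of-band exchange of the remote address and rkey have already been done; the function name is illustrative, not taken from the talk.

```c
/* Hedged sketch: one-sided RDMA Write with libibverbs (illustrative). */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *local_buf, size_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,   /* registered local buffer   */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* no remote CPU involvement */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* ask for a completion      */
    wr.wr.rdma.remote_addr = remote_addr;        /* exchanged out of band     */
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);      /* 0 on success */
}
```

The caller would then poll the completion queue (ibv_poll_cq) to learn when the write has completed locally.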

17 Large-scale InfiniBand Installations 163 IB Clusters (32.6%) in the Nov '17 Top500 list Installations in the Top 50 (17 systems): 19,860,000-core (Gyoukou) in Japan (4th) 60,512-core (DGX SATURN V) at NVIDIA/USA (36th) 241,108-core (Pleiades) at NASA/Ames (17th) 72,000-core (HPC2) in Italy (37th) 220,800-core (Pangea) in France (21st) 152,692-core (Thunder) at AFRL/USA (40th) 144,900-core (Cheyenne) at NCAR/USA (24th) 99,072-core (Mistral) at DKRZ/Germany (42nd) 155,150-core (Jureca) in Germany (29th) 147,456-core (SuperMUC) in Germany (44th) 72,800-core Cray CS-Storm in US (30th) 86,016-core (SuperMUC Phase 2) in Germany (45th) 72,800-core Cray CS-Storm in US (31st) 74,520-core (Tsubame 2.5) at Japan/GSIC (48th) 78,336-core (Electra) at NASA/USA (33rd) 66,000-core (HPC3) in Italy (51st) 124,200-core (Topaz) SGI ICE at ERDC DSRC in US (34th) 194,616-core (Cascade) at PNNL (53rd) 60,512-core (NVIDIA DGX-1/Relion) at Facebook in USA (35th) and many more! #1st system also uses InfiniBand Upcoming US DOE Summit (200 PFlops, to be the new #1) uses InfiniBand HPSR 18 17

18 High-speed Ethernet Consortium (10GE/25GE/40GE/50GE/100GE) 10GE Alliance formed by several industry leaders to take the Ethernet family to the next speed step Goal: To achieve a scalable and high performance communication architecture while maintaining backward compatibility with Ethernet 40-Gbps (Servers) and 100-Gbps Ethernet (Backbones, Switches, Routers): IEEE 802.3 WG 25-Gbps Ethernet Consortium targeting 25/50Gbps (July 2014) Energy-efficient and power-conscious protocols On-the-fly link speed reduction for under-utilized links Ethernet Alliance Technology Forum looking forward to HPSR 18 18

19 HPSR 18 19 TOE and iWARP Accelerators TCP Offload Engines (TOE) Hardware Acceleration for the entire TCP/IP stack Initially patented by Tehuti Networks Strictly speaking, refers to the IC on the network adapter that implements TCP/IP In practice, usually refers to the entire network adapter Internet Wide-Area RDMA Protocol (iWARP) Standardized by IETF and the RDMA Consortium Supports acceleration features (like IB) for Ethernet

20 RDMA over Converged (Enhanced) Ethernet (RoCE and RoCE v2) Network stack and packet header comparison (figure, courtesy OFED/Mellanox): native InfiniBand runs IB Verbs / IB Transport / IB Network over the InfiniBand link layer; RoCE runs the same IB Verbs / IB Transport / IB Network (IB GRH L3 header, Ethertype) over the Ethernet link layer; RoCE v2 runs IB Verbs / IB Transport (IB BTH+ L4 header) over UDP/IP and the Ethernet link layer. Takes advantage of IB and Ethernet Software written with IB Verbs Link layer is Converged (Enhanced) Ethernet (CE) 100 Gb/s support from latest EDR and ConnectX-3 Pro adapters IB vs. RoCE Pros: Works natively in Ethernet environments Entire Ethernet management ecosystem is available Has all the benefits of IB Verbs Link layer is very similar to the link layer of native IB, so there are no missing features RoCE v2: Additional Benefits over RoCE Traditional Network Management Tools Apply ACLs (Metering, Accounting, Firewalling) IGMP Snooping for Optimized Multicast Network Monitoring Tools HPSR 18 20

21 HSE Scientific Computing Installations 24 HSE compute systems with ranking in the Nov '17 Top500 list 39,680-core installation in China (#73) 66,560-core installation in China (#101) new 66,280-core installation in China (#103) new 64,000-core installation in China (#104) new 64,000-core installation in China (#105) new 72,000-core installation in China (#108) new 78,000-core installation in China (#125) 59,520-core installation in China (#128) new 59,520-core installation in China (#129) new 64,800-core installation in China (#130) new 67,200-core installation in China (#134) new 57,600-core installation in China (#135) new 57,600-core installation in China (#136) new 64,000-core installation in China (#138) new 84,000-core installation in China (#139) 84,000-core installation in China (#140) 51,840-core installation in China (#151) new 51,200-core installation in China (#156) new and many more! HPSR 18 21

22 Omni-Path Fabric Overview Derived from QLogic InfiniBand Layer 1.5: Link Transfer Protocol Features Traffic Flow Optimization Packet Integrity Protection Dynamic Lane Switching Error detection/replay occurs in Link Transfer Packet units Retransmit request via NULL LTP; carries replay command flit Layer 2: Link Layer Supports 24 bit fabric addresses Allows 10KB of L4 payload; 10,368 byte max packet size Congestion Management Adaptive / Dispersive Routing Explicit Congestion Notification QoS support Traffic Class, Service Level, Service Channel and Virtual Lane Layer 3: Data Link Layer Fabric addressing, switching, resource allocation and partitioning support Courtesy: Intel Corporation HPSR 18 22

23 HPSR 18 23 Large-scale Omni-Path Installations 35 Omni-Path Clusters (7%) in the Nov '17 Top500 list 556,104-core (Oakforest-PACS) at JCAHPC in Japan (9th) 54,432-core (Marconi Xeon) at CINECA in Italy (72nd) 368,928-core (Stampede2) at TACC in USA (12th) 46,464-core (Peta4) at University of Cambridge in UK (75th) 135,828-core (Tsubame 3.0) at TiTech in Japan (13th) 53,352-core (Grizzly) at LANL in USA (85th) 314,384-core (Marconi XeonPhi) at CINECA in Italy (14th) 45,680-core (Endeavor) at Intel in USA (86th) 153,216-core (MareNostrum) at BSC in Spain (16th) 59,776-core (Cedar) at SFU in Canada (94th) 95,472-core (Quartz) at LLNL in USA (49th) 27,200-core (Peta HPC) in Taiwan (95th) 95,472-core (Jade) at LLNL in USA (50th) 40,392-core (Serrano) at SNL in USA (112th) 49,432-core (Mogon II) at Universitaet Mainz in Germany (65th) 40,392-core (Cayenne) at SNL in USA (113th) 38,552-core (Molecular Simulator) in Japan (70th) 39,774-core (Nel) at LLNL in USA (101st) 35,280-core (Quriosity) at BASF in Germany (71st) and many more!

24 HPSR 18 24 IB, Omni-Path, and HSE: Feature Comparison

Features               | IB                | iWARP/HSE    | RoCE              | RoCE v2  | Omni-Path
Hardware Acceleration  | Yes               | Yes          | Yes               | Yes      | Yes
RDMA                   | Yes               | Yes          | Yes               | Yes      | Yes
Congestion Control     | Yes               | Optional     | Yes               | Yes      | Yes
Multipathing           | Yes               | Yes          | Yes               | Yes      | Yes
Atomic Operations      | Yes               | No           | Yes               | Yes      | Yes
Multicast              | Optional          | No           | Optional          | Optional | Optional
Data Placement         | Ordered           | Out-of-order | Ordered           | Ordered  | Ordered
Prioritization         | Optional          | Optional     | Yes               | Yes      | Yes
Fixed BW QoS (ETS)     | No                | Optional     | Yes               | Yes      | Yes
Ethernet Compatibility | No                | Yes          | Yes               | Yes      | Yes
TCP/IP Compatibility   | Yes (using IPoIB) | Yes          | Yes (using IPoIB) | Yes      | Yes

25 HPSR 18 25 Designing Communication and I/O Libraries for Clusters and Data Center Middleware: Challenges Applications Cluster and Data Center Middleware Upper level (MPI, PGAS, Memcached, HDFS, MapReduce, HBase, and gRPC/TensorFlow) Changes? RDMA-based Communication Substrate Programming Models (Sockets) Communication and I/O Library Threaded Models and Synchronization RDMA? Virtualization (SR-IOV) I/O and File Systems QoS & Fault Tolerance Performance Tuning Networking Technologies (InfiniBand, 1/10/40/100 GigE and Intelligent NICs) Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators) Storage Technologies (HDD, SSD, NVM, and NVMe-SSD)

26 HPSR Designing Next-Generation Middleware for Clusters and Datacenters High-Performance Programming Models Support for HPC Clusters RDMA-Enabled Communication Substrate for Common Services in Datacenters High-Performance and Scalable Memcached RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce) Deep Learning with Scale-Up and Scale-Out Caffe, CNTK, and TensorFlow Virtualization Support with SR-IOV and Containers

27 Parallel Programming Models Overview Shared Memory Model (e.g., SHMEM, DSM): processes P1, P2, P3 operate on one shared memory Distributed Memory Model (e.g., MPI - Message Passing Interface): P1, P2, P3 each with private memory Partitioned Global Address Space (PGAS) Model (e.g., Global Arrays, UPC, Chapel, X10, CAF, ...): a logical shared memory built over per-process memories Programming models provide abstract machine models Models can be mapped on different types of systems e.g. Distributed Shared Memory (DSM), MPI within a node, etc. PGAS models and Hybrid MPI+PGAS models are gradually receiving importance HPSR 18 27
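As a small illustration of the PGAS model mentioned above, the hedged OpenSHMEM sketch below has each PE deposit its rank directly into a symmetric variable on its right neighbor with a one-sided put; the variable name and communication pattern are illustrative, not taken from the talk.

```c
/* Hedged OpenSHMEM sketch (illustrative): one-sided put into a
 * symmetric variable that exists on every PE. Compile with oshcc. */
#include <shmem.h>
#include <stdio.h>

int main(void)
{
    static int neighbor_rank = -1;   /* symmetric: allocated on every PE */

    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* write my rank into the symmetric variable on my right neighbor */
    shmem_int_p(&neighbor_rank, me, (me + 1) % npes);

    shmem_barrier_all();             /* complete all puts and synchronize */
    printf("PE %d of %d: left neighbor is PE %d\n", me, npes, neighbor_rank);

    shmem_finalize();
    return 0;
}
```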

28 Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges Application Kernels/Applications Point-to-point Communication Programming Models MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc. Networking Technologies (InfiniBand, 40/100GigE, Aries, and Omni-Path) Middleware Communication Library or Runtime for Programming Models Collective Communication Energy-Awareness Synchronization and Locks Multi-/Many-core Architectures I/O and File Systems Fault Tolerance Accelerators (GPU and FPGA) Co-Design Opportunities and Challenges across Various Layers Performance Scalability Resilience HPSR 18 28

29 Broad Challenges in Designing Communication Middleware for (MPI+X) at Exascale Scalability for million to billion processors Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided) Scalable job start-up Scalable Collective communication Offload Non-blocking Topology-aware Balancing intra-node and inter-node communication for next generation nodes ( cores) Multiple end-points per node Support for efficient multi-threading Integrated Support for GPGPUs and Accelerators Fault-tolerance/resiliency QoS support for communication and I/O Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, MPI+UPC++, CAF, ) Virtualization Energy-Awareness HPSR 18 29
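To illustrate the hybrid MPI+X programming and multi-threading support mentioned above, here is a minimal, hedged MPI+OpenMP sketch: OpenMP threads do the intra-node work and MPI combines results across processes, with MPI_THREAD_FUNNELED requested so that only the master thread calls MPI. The per-thread work is a placeholder.

```c
/* Hedged hybrid MPI+OpenMP sketch (illustrative). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    #pragma omp parallel reduction(+:local)
    {
        /* each thread contributes some per-thread work (placeholder) */
        local += omp_get_thread_num() + 1;
    }

    double global = 0.0;   /* combine per-process results across the job */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```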

30 HPSR 18 30 Overview of the MVAPICH2 Project High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE) MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002 MVAPICH2-X (MPI + PGAS), Available since 2011 Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014 Support for Virtualization (MVAPICH2-Virt), Available since 2015 Support for Energy-Awareness (MVAPICH2-EA), Available since 2015 Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015 Used by more than 2,900 organizations in 86 countries More than 474,000 (>0.47 million) downloads from the OSU site directly Empowering many TOP500 clusters (Nov '17 ranking) 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China 9th, 556,104 cores (Oakforest-PACS) in Japan 12th, 368,928-core (Stampede2) at TACC 17th, 241,108-core (Pleiades) at NASA 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology Available with software stacks of many vendors and Linux Distros (RedHat and SuSE) Empowering Top500 systems for over a decade

31 MVAPICH2 Release Timeline and Downloads (figure: cumulative number of downloads from Sep 2004 through Jan 2018, annotated with releases from MVAPICH 0.9.4 and MVAPICH2 0.9.0 through MVAPICH2 2.3rc1, plus MVAPICH2-X, MVAPICH2-GDR, MVAPICH2-MIC, MVAPICH2-Virt, and OSU INAM) HPSR 18 31

32 Architecture of MVAPICH2 Software Family High Performance Parallel Programming Models Message Passing Interface (MPI) PGAS (UPC, OpenSHMEM, CAF, UPC++) Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk) High Performance and Scalable Communication Runtime Diverse APIs and Mechanisms Point-to-point Primitives Collectives Algorithms Job Startup Energy-Awareness Remote Memory Access I/O and File Systems Fault Tolerance Virtualization Active Messages Introspection & Analysis Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path) Support for Modern Multi-/Many-core Architectures (Intel-Xeon, OpenPower, Xeon-Phi, ARM, NVIDIA GPGPU) Transport Protocols RC XRC UD DC UMR ODP Modern Features SR-IOV Multi Rail Shared Memory Transport Mechanisms CMA IVSHMEM XPMEM* Modern Features MCDRAM* NVLink* CAPI* (* Upcoming) HPSR 18 32

33 One-way Latency: MPI over IB with MVAPICH2 (figure: small- and large-message latency vs. message size for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR, ConnectX-5-EDR, and Omni-Path; all measurements on Intel Deca-core (Haswell/IvyBridge) nodes with PCI Gen3 and the corresponding IB or Omni-Path switch) HPSR 18 33

34 Bandwidth: MPI over IB with MVAPICH2 (figure: unidirectional and bidirectional bandwidth vs. message size for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR, ConnectX-5-EDR, and Omni-Path; peak bidirectional bandwidths shown are 24,136, 22,564, 21,983, 12,161, and 6,228 MBytes/sec; all measurements on Intel Deca-core (Haswell/IvyBridge) nodes with PCI Gen3 and the corresponding IB or Omni-Path switch) HPSR 18 34

35 Hardware Multicast-aware MPI_Bcast on Stampede (figures: latency of Default vs. Multicast designs for small and large (2K-128K) messages at 102,400 cores, and for 16 Byte and 32 KByte messages vs. number of nodes) ConnectX-3-FDR (54 Gbps): 2.7 GHz Dual Octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR switch HPSR 18 35

36 Performance of MPI_Allreduce on Stampede2 (10,240 Processes) (figure: OSU Micro Benchmark latency vs. message size, 64 PPN, for MVAPICH2, MVAPICH2-OPT, and IMPI, including 8K-256K message sizes) For MPI_Allreduce latency with 32K bytes, MVAPICH2-OPT can reduce the latency by 2.4X M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, SuperComputing '17. Available in MVAPICH2-X 2.3b HPSR 18 36

37 Application Level Performance with Graph500 and Sort Graph500 Execution Time (figure: MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM) at 4K, 8K, and 16K processes) J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC 13), June 2013 J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012 Performance of Hybrid (MPI+OpenSHMEM) Graph500 Design 8,192 processes - 2.4X improvement over MPI-CSR - 7.6X improvement over MPI-Simple 16,384 processes - 1.5X improvement over MPI-CSR - 13X improvement over MPI-Simple J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013. MPI Sort Execution Time (figure: MPI vs. Hybrid for 500GB-512, 1TB-1K, 2TB-2K, 4TB-4K input data - no. of processes) Performance of Hybrid (MPI+OpenSHMEM) Sort Application 4,096 processes, 4 TB Input Size - MPI 2408 sec; 0.16 TB/min - Hybrid 1172 sec; 0.36 TB/min - 51% improvement over MPI-design HPSR 18 37

38 GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU Standard MPI interfaces used for unified data movement Takes advantage of Unified Virtual Addressing (>= CUDA 4.0) Overlaps data movement from GPU with RDMA transfers At Sender: MPI_Send(s_devbuf, size, ...); inside MVAPICH2 At Receiver: MPI_Recv(r_devbuf, size, ...); High Performance and High Productivity HPSR 18 38
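Building on the MPI_Send/MPI_Recv snippet on the slide, a minimal, hedged CUDA-aware MPI sketch could look as follows; the buffer size and rank layout are illustrative and error checking is omitted. The point is that the GPU device pointer is passed directly to the MPI calls, with the library handling the data movement.

```c
/* Hedged sketch: CUDA-aware MPI point-to-point with a device buffer
 * (as supported by CUDA-aware MPI libraries such as MVAPICH2-GDR). */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t n = 1 << 20;                    /* 1M floats, illustrative */
    float *devbuf;
    cudaMalloc((void **)&devbuf, n * sizeof(float));

    if (rank == 0)                               /* device pointer goes straight to MPI */
        MPI_Send(devbuf, (int)n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(devbuf, (int)n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}
```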

39 HPSR 18 39 Optimized MVAPICH2-GDR Design (figures: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth vs. message size for MV2-(NO-GDR) and MV2-GDR-2.3a; 1.88us latency, 10x latency improvement, 9x bandwidth improvement, 11X bi-bandwidth improvement) Platform: MVAPICH2-GDR-2.3a, Intel Haswell (3.1 GHz) node - 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPU-Direct-RDMA

40 Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland (figures: normalized execution time vs. number of GPUs for Default, Callback-based, and Event-based designs on the Wilkes GPU cluster and the CSCS GPU cluster) 2X improvement on 32 GPUs; 30% improvement on 96 GPU nodes (8 GPUs/node) Cosmo model: /tasks/operational/meteoswiss/ On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and the Cosmo application C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS 16 HPSR 18 40

41 HPSR Designing Next-Generation Middleware for Clusters and Datacenters High-Performance Programming Models Support for HPC Clusters RDMA-Enabled Communication Substrate for Common Services in Datacenters High-Performance and Scalable Memcached RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce) Deep Learning with Scale-Up and Scale-Out Caffe, CNTK, and TensorFlow Virtualization Support with SR-IOV and Containers

42 HPSR Data-Center Service Primitives Common Services needed by Data-Centers Better resource management Higher performance provided to higher layers Service Primitives Soft Shared State Distributed Lock Management Global Memory Aggregator Network Based Designs RDMA, Remote Atomic Operations
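As a hedged sketch of how the distributed lock management primitive above can be built on remote atomic operations, the following libibverbs fragment issues an RDMA atomic compare-and-swap against an 8-byte lock word held on a remote node; the setup (QP creation, memory registration, exchange of the remote address and rkey) and the function name are assumptions for illustration, not the talk's actual design.

```c
/* Hedged sketch: lock acquisition via RDMA atomic compare-and-swap. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int try_remote_lock(struct ibv_qp *qp, struct ibv_mr *local_mr,
                    uint64_t *result_buf,        /* receives the old value  */
                    uint64_t lock_addr, uint32_t lock_rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)result_buf,         /* registered local buffer */
        .length = sizeof(uint64_t),
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode                = IBV_WR_ATOMIC_CMP_AND_SWP;
    wr.sg_list               = &sge;
    wr.num_sge               = 1;
    wr.send_flags            = IBV_SEND_SIGNALED;
    wr.wr.atomic.remote_addr = lock_addr;        /* remote lock word        */
    wr.wr.atomic.rkey        = lock_rkey;
    wr.wr.atomic.compare_add = 0;                /* expect "unlocked" (0)   */
    wr.wr.atomic.swap        = 1;                /* install "locked" (1)    */

    /* After completion, *result_buf == 0 means the lock was acquired. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```

Releasing the lock would be a matching RDMA write (or compare-and-swap back to 0) of the lock word.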

43 HPSR Soft Shared State Data-Center Application Get Put Data-Center Application Data-Center Application Get Shared State Put Data-Center Application Get Put Data-Center Application Data-Center Application

44 HPSR Active Caching Dynamic data caching challenging! Cache Consistency and Coherence Become more important than in static case Proxy Nodes Back-End Nodes User Requests Update

45 HPSR RDMA based Client Polling Design Front-End Back-End Request Response Cache Hit Version Read Response Cache Miss

46 Active Caching Performance Benefits (figures: data-center throughput for No Cache, Invalidate All, and Dependency Lists across Traces 2-5 with increasing update rate; effect of load (compute threads) on throughput for No Cache and Dependency Lists) Higher overall performance Up to an order of magnitude Performance is sustained under loaded conditions S. Narravula, P. Balaji, K. Vaidyanathan, H.-W. Jin and D. K. Panda, Architecture for Caching Responses with Multiple Dynamic Dependencies in Multi-Tier Data-Centers over InfiniBand. CCGrid-2005 HPSR 18 46

47 HPSR Resource Monitoring Services Traditional approaches Coarse-grained in nature Assume resource usage is consistent throughout the monitoring granularity (in the order of seconds) This assumption is no longer valid Resource usage is becoming increasingly divergent Fine-grained monitoring is desired but has additional overheads High overheads, less accurate, slow in response Can we design fine-grained resource monitoring scheme with low overhead and accurate resource usage?

48 HPSR Synchronous Resource Monitoring using RDMA (RDMA-Sync) CPU CPU App Threads App Threads Front-end Monitoring Process Front-end Node Memory Memory Back-end Node User Space RDMA /proc User Space Kernel Space Kernel Data Structures Kernel Space

49 Impact of Fine-grained Monitoring with Applications (figure: % improvement on RUBiS and Zipf traces for Zipf alpha values 0.9, 0.75, 0.5, and 0.25 with Socket-Sync, RDMA-Async, RDMA-Sync, and e-RDMA-Sync) Our schemes (RDMA-Sync and e-RDMA-Sync) achieve significant performance gain over existing schemes K. Vaidyanathan, H. W. Jin and D. K. Panda, Exploiting RDMA operations for Providing Efficient Fine-Grained Resource Monitoring in Cluster-Based Servers, Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations and Technologies, 2006 HPSR 18 49

50 HPSR 18 5 Designing Next-Generation Middleware for Clusters and Datacenters High-Performance Programming Models Support for HPC Clusters RDMA-Enabled Communication Substrate for Common Services in Datacenters High-Performance and Scalable Memcached RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce) Deep Learning with Scale-Up and Scale-Out Caffe, CNTK, and TensorFlow Virtualization Support with SR-IOV and Containers

51 Architecture Overview of Memcached Three-layer architecture of Web 2.0: Web Servers, Memcached Servers, Database Servers Memcached is a core component of the Web 2.0 architecture Distributed Caching Layer Allows aggregating spare memory from multiple nodes General purpose Typically used to cache database queries, results of API calls Scalable model, but typical usage very network intensive HPSR 18 51

52 HPSR Memcached-RDMA Design Sockets Client RDMA Client Master Thread Sockets Worker Thread Verbs Worker Thread Sockets Worker Thread Verbs Worker Thread Shared Data Memory Slabs Items Server and client perform a negotiation protocol Master thread assigns clients to appropriate worker thread Once a client is assigned a verbs worker thread, it can communicate directly and is bound to that thread All other Memcached data structures are shared among RDMA and Sockets worker threads Memcached Server can serve both socket and verbs clients simultaneously Memcached applications need not be modified; uses verbs interface if available
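Since unmodified Memcached applications keep using the standard client API on top of the RDMA-enabled design described above, a plain libmemcached client such as the hedged sketch below runs unchanged; the server address, key, and value are illustrative.

```c
/* Hedged sketch of an unmodified libmemcached client (illustrative);
 * link with -lmemcached. */
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "127.0.0.1", 11211);   /* illustrative server */

    const char *key = "user:42";                      /* illustrative key/value */
    const char *val = "cached database row";
    memcached_return_t rc = memcached_set(memc, key, strlen(key),
                                          val, strlen(val),
                                          (time_t)0, (uint32_t)0);
    if (rc != MEMCACHED_SUCCESS)
        fprintf(stderr, "set failed: %s\n", memcached_strerror(memc, rc));

    size_t len = 0;
    uint32_t flags = 0;
    char *got = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
    if (got) {
        printf("got: %.*s\n", (int)len, got);
        free(got);                                    /* caller frees the result */
    }

    memcached_free(memc);
    return 0;
}
```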

53 HPSR Time (us) 1 1 Memcached Performance (FDR Interconnect) 1 1 Memcached GET Latency OSU-IB (FDR) IPoIB (FDR) Latency Reduced by nearly 2X K 2K 4K Message Size Memcached Get latency 4 bytes OSU-IB: 2.84 us; IPoIB: us 2K bytes OSU-IB: 4.49 us; IPoIB: us Memcached Throughput (4bytes) Thousands of Transactions per Second (TPS) Experiments on TACC Stampede (Intel SandyBridge Cluster, IB: FDR) 48 clients OSU-IB: 556 Kops/sec, IPoIB: 233 Kops/s Nearly 2X improvement in throughput Memcached Throughput No. of Clients 2X

54 Micro-benchmark Evaluation for OLDP workloads Illustration with Read-Cache-Read access pattern using modified mysqlslap load testing tool (figures: query latency (sec) and throughput (Kq/s) vs. number of clients for Memcached-IPoIB (32Gbps) and Memcached-RDMA (32Gbps); latency reduced by 66%) Memcached-RDMA can - improve query latency by up to 66% over IPoIB (32Gbps) - improve throughput by up to 69% over IPoIB (32Gbps) D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, Can RDMA Benefit On-Line Data Processing Workloads with Memcached and MySQL, ISPASS 15 HPSR 18 54

55 Performance Evaluation on IB FDR + SATA/NVMe SSDs slab allocation (SSD write) cache check+load (SSD read) cache update server response client wait miss-penalty 25 2 Latency (us) Set Get Set Get Set Get Set Get Set Get Set Get Set Get Set Get IPoIB-Mem RDMA-Mem RDMA-Hybrid-SATA RDMA-Hybrid- NVMe IPoIB-Mem RDMA-Mem RDMA-Hybrid-SATA RDMA-Hybrid- NVMe Data Fits In Memory Data Does Not Fit In Memory Memcached latency test with Zipf distribution, server with 1 GB memory, 32 KB key-value pair size, total size of data accessed is 1 GB (when data fits in memory) and 1.5 GB (when data does not fit in memory) When data fits in memory: RDMA-Mem/Hybrid gives 5x improvement over IPoIB-Mem When data does not fit in memory: RDMA-Hybrid gives 2x-2.5x over IPoIB/RDMA-Mem HPSR 18 55

56 Accelerating Hybrid Memcached with RDMA, Non-blocking Extensions and SSDs CLIENT SERVER BLOCKING API NON-BLOCKING API REQ. NON-BLOCKING API REPLY RDMA-ENHANCED COMMUNICATION LIBRARY RDMA-ENHANCED COMMUNICATION LIBRARY Blocking API Flow LIBMEMCACHED LIBRARY HYBRID SLAB MANAGER (RAM+SSD) Non-Blocking API Flow RDMA-Accelerated Communication for Memcached Get/Set Hybrid RAM+SSD slab management for higher data retention Non-blocking API extensions memcached_(iset/iget/bset/bget/test/wait) Achieve near in-memory speeds while hiding bottlenecks of network and SSD I/O Ability to exploit communication/computation overlap Optional buffer re-use guarantees Adaptive slab manager with different I/O schemes for higher throughput. D. Shankar, X. Lu, N. S. Islam, M. W. Rahman, and D. K. Panda, High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits, IPDPS, May 216 HPSR 18 56

57 Average Latency (us) Performance Evaluation with Non-Blocking Memcached API Set Get Set Get Set Get Set Get Set Get Set Get Data does not fit in memory: Non-blocking Memcached Set/Get API Extensions can achieve >16x latency improvement vs. blocking API over RDMA-Hybrid/RDMA-Mem w/ penalty >2.5x throughput improvement vs. blocking API over default/optimized RDMA-Hybrid Data fits in memory: Non-blocking Extensions perform similar to RDMA-Mem/RDMA-Hybrid and >3.6x improvement over IPoIB-Mem Miss Penalty (Backend DB Access Overhead) Client Wait Server Response Cache Update Cache check+load (Memory and/or SSD read) Slab Allocation (w/ SSD write on Out-of-Mem) IPoIB-Mem RDMA-Mem H-RDMA-Def H-RDMA-Opt-Block H-RDMA-Opt-NonBb H-RDMA-Opt-NonB- H = Hybrid Memcached over SATA SSD Opt = Adaptive slab manager Block = Default Blocking API NonB-i = Non-blocking iset/iget API NonB-b = Non-blocking bset/bget API w/ buffer re-use guarantee HPSR 18 57

58 HPSR Designing Next-Generation Middleware for Clusters and Datacenters High-Performance Programming Models Support for HPC Clusters RDMA-Enabled Communication Substrate for Common Services in Datacenters High-Performance and Scalable Memcached RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce) Deep Learning with Scale-Up and Scale-Out Caffe, CNTK, and TensorFlow Virtualization Support with SR-IOV and Containers

59 HPSR 18 59 The High-Performance Big Data (HiBD) Project RDMA for Apache Spark RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x) Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions RDMA for Apache HBase RDMA for Memcached (RDMA-Memcached) RDMA for Apache Hadoop 1.x (RDMA-Hadoop) OSU HiBD-Benchmarks (OHB) HDFS, Memcached, HBase, and Spark Micro-benchmarks Users Base: 285 organizations from 34 countries More than 26,700 downloads from the project site Available for InfiniBand and RoCE Also run on Ethernet Available for x86 and OpenPOWER Support for Singularity and Docker

60 HPSR 18 6 Performance Numbers of RDMA for Apache Hadoop 2.x RandomWriter & TeraGen in OSU-RI2 (EDR) Execution Time (s) IPoIB (EDR) Reduced by 3x IPoIB (EDR) Reduced by 4x 7 OSU-IB (EDR) 6 OSU-IB (EDR) Data Size (GB) Data Size (GB) RandomWriter TeraGen Cluster with 8 Nodes with a total of 64 maps Execution Time (s) RandomWriter 3x improvement over IPoIB for 8-16 GB file size TeraGen 4x improvement over IPoIB for 8-24 GB file size

61 HPSR Performance Evaluation of RDMA-Spark on SDSC Comet HiBench PageRank Time (sec) IPoIB 37% RDMA Huge BigData Gigantic Data Size (GB) 32 Worker Nodes, 768 cores, PageRank Total Time 64 Worker Nodes, 1536 cores, PageRank Total Time InfiniBand FDR, SSD, 32/64 Worker Nodes, 768/1536 Cores, (768/1536M 768/1536R) RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node. 32 nodes/768 cores: Total time reduced by 37% over IPoIB (56Gbps) 64 nodes/1536 cores: Total time reduced by 43% over IPoIB (56Gbps) Time (sec) IPoIB 43% RDMA Huge BigData Gigantic Data Size (GB)

62 Performance Evaluation on SDSC Comet: Astronomy Application Kira Toolkit 1 : Distributed astronomy image processing toolkit implemented using Apache Spark. Source extractor application, using a 65GB dataset from the SDSS DR2 survey that comprises 11,150 image files. Compare RDMA Spark performance with the standard Apache implementation using IPoIB. 1. Z. Zhang, K. Barbary, F. A. Nothaft, E.R. Sparks, M.J. Franklin, D.A. Patterson, S. Perlmutter. Scientific Computing meets Big Data Technology: An Astronomy Use Case. CoRR, vol: abs/ , Aug 2015. 21. 8% RDMA Spark Apache Spark (IPoIB) Execution times (sec) for Kira SE benchmark using 65 GB dataset, 48 cores. M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE 16, July 2016 HPSR 18 62

63 HPSR Designing Next-Generation Middleware for Clusters and Datacenters High-Performance Programming Models Support for HPC Clusters RDMA-Enabled Communication Substrate for Common Services in Datacenters High-Performance and Scalable Memcached RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce) Deep Learning with Scale-Up and Scale-Out Caffe, CNTK, and TensorFlow Virtualization Support with SR-IOV and Containers

64 Deep Learning: New Challenges for Communication Runtimes Deep Learning frameworks are a different game altogether Unusually large message sizes (order of megabytes) Most communication based on GPU buffers Existing State-of-the-art cudnn, cublas, NCCL --> scale-up performance CUDA-Aware MPI --> scale-out performance For small and medium message sizes only! Proposed: Can we co-design the MPI runtime (MVAPICH2- GDR) and the DL framework (Caffe) to achieve both? Efficient Overlap of Computation and Communication Efficient Large-Message Communication (Reductions) What application co-designs are needed to exploit communication-runtime co-designs? Scale-up Performance NCCL cudnn cublas Hadoop grpc Scale-out Performance Proposed Co-Designs MPI A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17) HPSR 18 64
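The large-message reduction at the heart of this co-design is gradient averaging across data-parallel workers; a minimal, hedged MPI sketch of that pattern is shown below. The gradient size and values are illustrative, and a real framework would typically operate on GPU buffers through a CUDA-aware MPI such as MVAPICH2-GDR.

```c
/* Hedged sketch: data-parallel gradient averaging with MPI_Allreduce. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 16 * 1024 * 1024;            /* e.g. 16M parameters (64 MB)   */
    float *grad = malloc(n * sizeof(float));   /* local gradients after backprop */
    for (int i = 0; i < n; i++)
        grad[i] = 1.0f;                        /* placeholder gradient values    */

    /* sum gradients across all workers in place, then scale to an average */
    MPI_Allreduce(MPI_IN_PLACE, grad, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    for (int i = 0; i < n; i++)
        grad[i] /= (float)nprocs;

    free(grad);
    MPI_Finalize();
    return 0;
}
```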

65 Large Message Optimized Collectives for Deep Learning MV2-GDR provides optimized collectives for large message sizes Optimized Reduce, Allreduce, and Bcast Good scaling with large number of GPUs Available since MVAPICH2-GDR 2.2GA (figures: latency vs. message size and vs. number of GPUs for Reduce (192 GPUs, 64 MB), Allreduce (64 GPUs), and Bcast (64 GPUs, 128 MB)) HPSR 18 65

66 MVAPICH2-GDR vs. NCCL2 Allreduce Operation Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance for most cases MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllReduce (NCCL2) on 16 GPUs (figures: latency vs. message size for MVAPICH2-GDR and NCCL2; ~3X better for small messages, ~1.2X better for large messages) *Will be available with upcoming MVAPICH2-GDR 2.3b Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPU, and EDR InfiniBand Inter-connect HPSR 18 66

67 MVAPICH2: Allreduce Comparison with Baidu and OpenMPI 16 GPUs (4 nodes) MVAPICH2-GDR(*) vs. Baidu-Allreduce and OpenMPI 3.0 (figures: latency vs. message size for MVAPICH2, BAIDU, and OPENMPI; ~3X better for small messages, ~10X and ~4X better at 512K-4M; for the largest messages OpenMPI is ~5X slower than Baidu while MV2 is ~2X better than Baidu) *Available with MVAPICH2-GDR 2.3a HPSR 18 67

68 OSU-Caffe: Scalable Deep Learning Caffe: A flexible and layered Deep Learning framework. Benefits and Weaknesses Multi-GPU Training within a single node Performance degradation for GPUs across different sockets Limited Scale-out OSU-Caffe: MPI-based Parallel Training Enable Scale-up (within a node) and Scale-out (across multi-GPU nodes) Scale-out on 64 GPUs for training CIFAR-10 network on CIFAR-10 dataset Scale-out on 128 GPUs for training GoogLeNet network on ImageNet dataset OSU-Caffe publicly available from (figure: GoogLeNet (ImageNet) training time on up to 128 GPUs for Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); the Caffe baseline is marked as an invalid use case at larger GPU counts) HPSR 18 68

69 HPSR Architecture Overview of grpc Key Features: Simple service definition Works across languages and platforms C++, Java, Python, Android Java etc Linux, Mac, Windows. Start quickly and scale Bi-directional streaming and integrated authentication Used by Google (several of Google s cloud products and Google externally facing APIs, TensorFlow), Netflix, Docker, Cisco, Juniper Networks etc. Uses sockets for communication! Large-scale distributed systems composed of micro services Source:

70 Performance Benefits for AR-gRPC with Micro-Benchmark Latency (us) Default grpc (IPoIB-56Gbps) AR-gRPC (RDMA-56Gbps) payload (Bytes) K 2K 4K 8K Latency (us) Default grpc (IPoIB-56Gbps) AR-gRPC (RDMA-56 Gbps) 16K 32K 64K 128K 256K 512K Payload (Bytes) Latency (us) Default grpc (IPoIB- 56Gbps) AR-gRPC (RDMA-56Gbps) 1M 2M 4M 8M Payload (Bytes) Point-to-Point Latency AR-gRPC (OSU design) Latency on SDSC-Comet-FDR Up to 2.7x performance speedup over Default grpc (IPoIB) for Latency for small messages. Up to 2.8x performance speedup over Default grpc (IPoIB) for Latency for medium messages. Up to 2.5x performance speedup over Default grpc (IPoIB) for Latency for large messages. R. Biswas, X. Lu, and D. K. Panda, Accelerating TensorFlow with Adaptive RDMA-based grpc. (Under Review) HPSR 18 7

71 Performance Benefit for TensorFlow (Resnet50) (figures: images/second vs. batch size per GPU on 4, 8, and 12 nodes for gRPC (IPoIB-100Gbps), Verbs (RDMA-100Gbps), MPI (RDMA-100Gbps), and AR-gRPC (RDMA-100Gbps)) TensorFlow Resnet50 performance evaluation on an IB EDR cluster Up to 26% performance speedup over Default gRPC (IPoIB) for 4 nodes Up to 127% performance speedup over Default gRPC (IPoIB) for 8 nodes Up to 133% performance speedup over Default gRPC (IPoIB) for 12 nodes HPSR 18 71

72 Performance Benefit for TensorFlow (Inception3) (figures: images/second vs. batch size per GPU on 4, 8, and 12 nodes for gRPC (IPoIB-100Gbps), Verbs (RDMA-100Gbps), MPI (RDMA-100Gbps), and AR-gRPC (RDMA-100Gbps)) TensorFlow Inception3 performance evaluation on an IB EDR cluster Up to 47% performance speedup over Default gRPC (IPoIB) for 4 nodes Up to 116% performance speedup over Default gRPC (IPoIB) for 8 nodes Up to 153% performance speedup over Default gRPC (IPoIB) for 12 nodes HPSR 18 72

73 High-Performance Deep Learning over Big Data (DLoBD) Stacks Benefits of Deep Learning over Big Data (DLoBD) Easily integrate deep learning components into Big Data processing workflow Easily access the stored data in Big Data systems No need to set up new dedicated deep learning clusters; Reuse existing big data analytics clusters Challenges Can RDMA-based designs in DLoBD stacks improve performance, scalability, and resource utilization on highperformance interconnects, GPUs, and multi-core CPUs? What are the performance characteristics of representative DLoBD stacks on RDMA networks? Characterization on DLoBD Stacks CaffeOnSpark, TensorFlowOnSpark, and BigDL IPoIB vs. RDMA; In-band communication vs. Out-of-band communication; CPU vs. GPU; etc. Performance, accuracy, scalability, and resource utilization RDMA-based DLoBD stacks (e.g., BigDL over RDMA-Spark) can achieve 2.6x speedup compared to the IPoIB based scheme, while maintain similar accuracy Epochs Time (secs) IPoIB-Time RDMA-Time IPoIB-Accuracy RDMA-Accuracy Epoch Number 2.6X Accuracy (%) X. Lu, H. Shi, M. H. Javed, R. Biswas, and D. K. Panda, Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-capable Networks, HotI 217. HPSR 18 73

74 HPSR Designing Next-Generation Middleware for Clusters and Datacenters High-Performance Programming Models Support for HPC Clusters RDMA-Enabled Communication Substrate for Common Services in Datacenters High-Performance and Scalable Memcached RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce) Deep Learning with Scale-Up and Scale-Out Caffe, CNTK, and TensorFlow Virtualization Support with SR-IOV and Containers

75 Single Root I/O Virtualization (SR-IOV) Single Root I/O Virtualization (SR-IOV) is providing new opportunities to design cloud-based datacenters with very low overhead Allows a single physical device, or a Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs) VFs are designed based on the existing non-virtualized PFs, no need for driver change Each VF can be dedicated to a single VM through PCI pass-through Works with 10/40/100 GigE and InfiniBand HPSR 18 75

76 Can High-Performance and Virtualization be Combined? Virtualization has many benefits Fault-tolerance Job migration Compaction Has not been very popular in HPC due to overhead associated with Virtualization New SR-IOV (Single Root IO Virtualization) support available with Mellanox InfiniBand adapters changes the field Enhanced MVAPICH2 support for SR-IOV MVAPICH2-Virt 2.2 supports: OpenStack, Docker, and Singularity J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? EuroPar'14 J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D.K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC 14 J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: an Efficient Approach to build HPC Clouds, CCGrid 15 HPSR 18 76

77 Application-Level Performance on Chameleon (figures: Graph500 execution time for problem sizes (scale, edgefactor) from 22 to 26, and SPEC MPI2007 execution time for milc, leslie3d, pop2, GAPgeofem, zeusmp2, and lu, comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native) 32 VMs, 6 Core/VM Compared to Native, 2-5% overhead for Graph500 with 128 Procs Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 Procs HPSR 18 77

78 Application-Level Performance on Docker with MVAPICH2 (figures: NAS execution time for MG.D, FT.D, EP.D, LU.D, CG.D with 1Cont*16P, 2Conts*8P, 4Conts*4P, and Graph500 BFS execution time at scale 20, edgefactor 16, comparing Container-Def, Container-Opt, and Native) 64 Containers across 16 nodes, pinning 4 Cores per Container Compared to Container-Def, up to 11% and 73% of execution time reduction for NAS and Graph500 Compared to Native, less than 9% and 5% overhead for NAS and Graph500 HPSR 18 78

79 Application-Level Performance on Singularity with MVAPICH2 (figures: NPB Class D execution time for CG, EP, FT, IS, LU, MG with 512 processes across 32 nodes, and Graph500 BFS execution time for problem sizes (scale, edgefactor) from (22,16) to (26,20), comparing Singularity and Native) Less than 7% and 6% overhead for NPB and Graph500, respectively J. Zhang, X. Lu and D. K. Panda, Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds?, UCC 17 HPSR 18 79

80 RDMA-Hadoop with Virtualization (figures: CloudBurst and Self-Join execution time for RDMA-Hadoop and RDMA-Hadoop-Virt in Default Mode and Distributed Mode; 30% and 55% reductions) 14% and 24% improvement with Default Mode for CloudBurst and Self-Join 30% and 55% improvement with Distributed Mode for CloudBurst and Self-Join S. Gugnani, X. Lu, D. K. Panda. Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-enabled Clouds. CloudCom, 2016. HPSR 18 80

81 HPSR Concluding Remarks Next generation Clusters and Data Centers need to be designed with a holistic view of HPC, Big Data, Deep Learning, and Cloud Presented an overview of the networking technology trends Presented some of the approaches and results along these directions Enable HPC, Big Data, Deep Learning and Cloud community to take advantage of modern networking technologies Many other open issues need to be solved

82 HPSR Funding Acknowledgments Funding Support by Equipment Support by

83 Personnel Acknowledgments Current Students (Graduate) A. Awan (Ph.D.) R. Biswas (M.S.) M. Bayatpour (Ph.D.) S. Chakraborthy (Ph.D.) C.-H. Chu (Ph.D.) S. Guganani (Ph.D.) Past Students A. Augustine (M.S.) P. Balaji (Ph.D.) S. Bhagvat (M.S.) A. Bhat (M.S.) D. Buntinas (Ph.D.) L. Chai (Ph.D.) B. Chandrasekharan (M.S.) N. Dandapanthula (M.S.) V. Dhanraj (M.S.) T. Gangadharappa (M.S.) K. Gopalakrishnan (M.S.) Current Students (Undergraduate) J. Hashmi (Ph.D.) N. Sarkauskas (B.S.) H. Javed (Ph.D.) P. Kousha (Ph.D.) D. Shankar (Ph.D.) H. Shi (Ph.D.) J. Zhang (Ph.D.) W. Huang (Ph.D.) J. Liu (Ph.D.) W. Jiang (M.S.) M. Luo (Ph.D.) J. Jose (Ph.D.) A. Mamidala (Ph.D.) S. Kini (M.S.) G. Marsh (M.S.) M. Koop (Ph.D.) V. Meshram (M.S.) K. Kulkarni (M.S.) A. Moody (M.S.) R. Kumar (M.S.) S. Naravula (Ph.D.) S. Krishnamoorthy (M.S.) R. Noronha (Ph.D.) K. Kandalla (Ph.D.) X. Ouyang (Ph.D.) M. Li (Ph.D.) S. Pai (M.S.) P. Lai (M.S.) S. Potluri (Ph.D.) Current Research Scientists X. Lu H. Subramoni Current Post-doc A. Ruhela K. Manian R. Rajachandrasekar (Ph.D.) G. Santhanaraman (Ph.D.) A. Singh (Ph.D.) J. Sridhar (M.S.) S. Sur (Ph.D.) H. Subramoni (Ph.D.) K. Vaidyanathan (Ph.D.) A. Vishnu (Ph.D.) J. Wu (Ph.D.) W. Yu (Ph.D.) Current Research Specialist J. Smith M. Arnold Past Research Scientist K. Hamidouche S. Sur Past Programmers D. Bureddy J. Perkins Past Post-Docs D. Banerjee X. Besseron H.-W. Jin J. Lin M. Luo E. Mancini S. Marcarelli J. Vienne H. Wang HPSR 18 83

84 HPSR 18 84 Multiple Positions Available in My Group Looking for Bright and Enthusiastic Personnel to join as Post-Doctoral Researchers PhD Students MPI Programmer/Software Engineer Hadoop/Big Data Programmer/Software Engineer If interested, please contact me at this conference and/or send an e-mail to

85 HPSR Thank You! Network-Based Computing Laboratory The High-Performance MPI/PGAS Project The High-Performance Big Data Project The High-Performance Deep Learning Project


Support Hybrid MPI+PGAS (UPC/OpenSHMEM/CAF) Programming Models through a Unified Runtime: An MVAPICH2-X Approach Support Hybrid MPI+PGAS (UPC/OpenSHMEM/CAF) Programming Models through a Unified Runtime: An MVAPICH2-X Approach Talk at OSC theater (SC 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures

Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures Talk at Bench 2018 by Xiaoyi Lu The Ohio State University E-mail: luxi@cse.ohio-state.edu http://www.cse.ohio-state.edu/~luxi

More information

High Performance MPI Support in MVAPICH2 for InfiniBand Clusters

High Performance MPI Support in MVAPICH2 for InfiniBand Clusters High Performance MPI Support in MVAPICH2 for InfiniBand Clusters A Talk at NVIDIA Booth (SC 11) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Designing MPI and PGAS Libraries for Exascale Systems: The MVAPICH2 Approach

Designing MPI and PGAS Libraries for Exascale Systems: The MVAPICH2 Approach Designing MPI and PGAS Libraries for Exascale Systems: The MVAPICH2 Approach Talk at OpenFabrics Workshop (April 216) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

High-Performance and Scalable Designs of Programming Models for Exascale Systems

High-Performance and Scalable Designs of Programming Models for Exascale Systems High-Performance and Scalable Designs of Programming Models for Exascale Systems Keynote Talk at HPCAC, Switzerland (April 217) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services P. Balaji, K. Vaidyanathan, S. Narravula, H. W. Jin and D. K. Panda Network Based Computing Laboratory

More information

Designing Convergent HPC, Deep Learning and Big Data Analytics Software Stacks for Exascale Systems

Designing Convergent HPC, Deep Learning and Big Data Analytics Software Stacks for Exascale Systems Designing Convergent HPC, Deep Learning and Big Data Analytics Software Stacks for Exascale Systems Talk at HPCAC-AI Switzerland Conference (April 19) by Dhabaleswar K. (DK) Panda The Ohio State University

More information

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Mohammadreza Bayatpour, Hari Subramoni, D. K.

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Mohammadreza Bayatpour, Hari Subramoni, D. K. Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Mohammadreza Bayatpour, Hari Subramoni, D. K. Panda Department of Computer Science and Engineering The Ohio

More information

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning Talk at Mellanox Theater (SC 16) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters

Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters Designing High Performance Heterogeneous Broadcast for Streaming Applications on Clusters 1 Ching-Hsiang Chu, 1 Khaled Hamidouche, 1 Hari Subramoni, 1 Akshay Venkatesh, 2 Bracy Elton and 1 Dhabaleswar

More information

High Performance File System and I/O Middleware Design for Big Data on HPC Clusters

High Performance File System and I/O Middleware Design for Big Data on HPC Clusters High Performance File System and I/O Middleware Design for Big Data on HPC Clusters by Nusrat Sharmin Islam Advisor: Dhabaleswar K. (DK) Panda Department of Computer Science and Engineering The Ohio State

More information

High-performance and Scalable MPI+X Library for Emerging HPC Clusters & Cloud Platforms

High-performance and Scalable MPI+X Library for Emerging HPC Clusters & Cloud Platforms High-performance and Scalable MPI+X Library for Emerging HPC Clusters & Cloud Platforms Talk at Intel HPC Developer Conference (SC 17) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters

Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters Jie Zhang Dr. Dhabaleswar K. Panda (Advisor) Department of Computer Science & Engineering The

More information

The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing

The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing Presentation at Mellanox Theatre (SC 16) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems

Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems Keynote Talk at HPCAC Stanford Conference (Feb 18) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, D. Bureddy and D. K. Panda Presented by Dr. Xiaoyi

More information

Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X

Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X IXPUG 7 PresentaNon J. Hashmi, M. Li, H. Subramoni and DK Panda The Ohio State University E-mail: {hashmi.29,li.292,subramoni.,panda.2}@osu.edu

More information

MVAPICH2 Project Update and Big Data Acceleration

MVAPICH2 Project Update and Big Data Acceleration MVAPICH2 Project Update and Big Data Acceleration Presentation at HPC Advisory Council European Conference 212 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning Ammar Ahmad Awan, Khaled Hamidouche, Akshay Venkatesh, and Dhabaleswar K. Panda Network-Based Computing Laboratory Department

More information

GPU- Aware Design, Implementation, and Evaluation of Non- blocking Collective Benchmarks

GPU- Aware Design, Implementation, and Evaluation of Non- blocking Collective Benchmarks GPU- Aware Design, Implementation, and Evaluation of Non- blocking Collective Benchmarks Presented By : Esthela Gallardo Ammar Ahmad Awan, Khaled Hamidouche, Akshay Venkatesh, Jonathan Perkins, Hari Subramoni,

More information

Scaling with PGAS Languages

Scaling with PGAS Languages Scaling with PGAS Languages Panel Presentation at OFA Developers Workshop (2013) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters

CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan and Dhabaleswar K. (DK) Panda Speaker: Sourav Chakraborty

More information

Overview of the MVAPICH Project: Latest Status and Future Roadmap

Overview of the MVAPICH Project: Latest Status and Future Roadmap Overview of the MVAPICH Project: Latest Status and Future Roadmap MVAPICH2 User Group (MUG) Meeting by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

MVAPICH2 and MVAPICH2-MIC: Latest Status

MVAPICH2 and MVAPICH2-MIC: Latest Status MVAPICH2 and MVAPICH2-MIC: Latest Status Presentation at IXPUG Meeting, July 214 by Dhabaleswar K. (DK) Panda and Khaled Hamidouche The Ohio State University E-mail: {panda, hamidouc}@cse.ohio-state.edu

More information

Accelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached

Accelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached Accelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached Talk at HPC Advisory Council Lugano Conference (213) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Designing MPI and PGAS Libraries for Exascale Systems: The MVAPICH2 Approach

Designing MPI and PGAS Libraries for Exascale Systems: The MVAPICH2 Approach Designing MPI and PGAS Libraries for Exascale Systems: The MVAPICH2 Approach Talk at OpenFabrics Workshop (March 217) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Study. Dhabaleswar. K. Panda. The Ohio State University HPIDC '09

Study. Dhabaleswar. K. Panda. The Ohio State University HPIDC '09 RDMA over Ethernet - A Preliminary Study Hari Subramoni, Miao Luo, Ping Lai and Dhabaleswar. K. Panda Computer Science & Engineering Department The Ohio State University Introduction Problem Statement

More information

Acceleration for Big Data, Hadoop and Memcached

Acceleration for Big Data, Hadoop and Memcached Acceleration for Big Data, Hadoop and Memcached A Presentation at HPC Advisory Council Workshop, Lugano 212 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Accelerating Data Management and Processing on Modern Clusters with RDMA-Enabled Interconnects

Accelerating Data Management and Processing on Modern Clusters with RDMA-Enabled Interconnects Accelerating Data Management and Processing on Modern Clusters with RDMA-Enabled Interconnects Keynote Talk at ADMS 214 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Accelerating and Benchmarking Big Data Processing on Modern Clusters

Accelerating and Benchmarking Big Data Processing on Modern Clusters Accelerating and Benchmarking Big Data Processing on Modern Clusters Open RG Big Data Webinar (Sept 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing

The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing Presentation at OSU Booth (SC 18) by Hari Subramoni The Ohio State University E-mail: subramon@cse.ohio-state.edu http://www.cse.ohio-state.edu/~subramon

More information

Accelerating and Benchmarking Big Data Processing on Modern Clusters

Accelerating and Benchmarking Big Data Processing on Modern Clusters Accelerating and Benchmarking Big Data Processing on Modern Clusters Keynote Talk at BPOE-6 (Sept 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Designing High-Performance Non-Volatile Memory-aware RDMA Communication Protocols for Big Data Processing

Designing High-Performance Non-Volatile Memory-aware RDMA Communication Protocols for Big Data Processing Designing High-Performance Non-Volatile Memory-aware RDMA Communication Protocols for Big Data Processing Talk at Storage Developer Conference SNIA 2018 by Xiaoyi Lu The Ohio State University E-mail: luxi@cse.ohio-state.edu

More information

Memcached Design on High Performance RDMA Capable Interconnects

Memcached Design on High Performance RDMA Capable Interconnects Memcached Design on High Performance RDMA Capable Interconnects Jithin Jose, Hari Subramoni, Miao Luo, Minjia Zhang, Jian Huang, Md. Wasi- ur- Rahman, Nusrat S. Islam, Xiangyong Ouyang, Hao Wang, Sayantan

More information

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda Department of Computer Science and Engineering

More information

MVAPICH2-GDR: Pushing the Frontier of MPI Libraries Enabling GPUDirect Technologies

MVAPICH2-GDR: Pushing the Frontier of MPI Libraries Enabling GPUDirect Technologies MVAPICH2-GDR: Pushing the Frontier of MPI Libraries Enabling GPUDirect Technologies GPU Technology Conference (GTC 218) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?

Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects? Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects? N. S. Islam, X. Lu, M. W. Rahman, and D. K. Panda Network- Based Compu2ng Laboratory Department of Computer

More information

Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters

Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning GPU Technology Conference GTC 217 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Advanced RDMA-based Admission Control for Modern Data-Centers

Advanced RDMA-based Admission Control for Modern Data-Centers Advanced RDMA-based Admission Control for Modern Data-Centers Ping Lai Sundeep Narravula Karthikeyan Vaidyanathan Dhabaleswar. K. Panda Computer Science & Engineering Department Ohio State University Outline

More information

Building the Most Efficient Machine Learning System

Building the Most Efficient Machine Learning System Building the Most Efficient Machine Learning System Mellanox The Artificial Intelligence Interconnect Company June 2017 Mellanox Overview Company Headquarters Yokneam, Israel Sunnyvale, California Worldwide

More information

Overview of MVAPICH2 and MVAPICH2- X: Latest Status and Future Roadmap

Overview of MVAPICH2 and MVAPICH2- X: Latest Status and Future Roadmap Overview of MVAPICH2 and MVAPICH2- X: Latest Status and Future Roadmap MVAPICH2 User Group (MUG) MeeKng by Dhabaleswar K. (DK) Panda The Ohio State University E- mail: panda@cse.ohio- state.edu h

More information

MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits

MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits Ashish Kumar Singh, Sreeram Potluri, Hao Wang, Krishna Kandalla, Sayantan Sur, and Dhabaleswar K. Panda Network-Based

More information

Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications

Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications K. Vaidyanathan, P. Lai, S. Narravula and D. K. Panda Network Based Computing Laboratory

More information

An Assessment of MPI 3.x (Part I)

An Assessment of MPI 3.x (Part I) An Assessment of MPI 3.x (Part I) High Performance MPI Support for Clouds with IB and SR-IOV (Part II) Talk at HP-CAST 25 (Nov 2015) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Job Startup at Exascale:

Job Startup at Exascale: Job Startup at Exascale: Challenges and Solutions Hari Subramoni The Ohio State University http://nowlab.cse.ohio-state.edu/ Current Trends in HPC Supercomputing systems scaling rapidly Multi-/Many-core

More information

The Future of Supercomputer Software Libraries

The Future of Supercomputer Software Libraries The Future of Supercomputer Software Libraries Talk at HPC Advisory Council Israel Supercomputing Conference by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information


Designing High Performance Communication Middleware with Emerging Multi-core Architectures. Dhabaleswar K. (DK) Panda, The Ohio State University.
Unifying UPC and MPI Runtimes: Experience with MVAPICH. Jithin Jose, Miao Luo, Sayantan Sur and D. K. Panda, Network-Based Computing Laboratory, The Ohio State University.
Impact of HPC Cloud Networking Technologies on Accelerating Hadoop RPC and HBase. Xiaoyi Lu, Dipti Shankar, Shashank Gugnani et al., IEEE 8th International Conference on Cloud Computing Technology and Science.
Interconnect Your Future: Smart Interconnect for Next Generation HPC Platforms. Gilad Shainer, Mellanox; 4th Annual MVAPICH User Group (MUG) Meeting, August 2016.


Interconnect Your Future. Gilad Shainer, Mellanox; 2nd Annual MVAPICH User Group (MUG) Meeting, August 2014.
MVAPICH2 MPI Libraries to Exploit Latest Networking and Accelerator Technologies. Talk at the NRL booth (SC 2016) by Dhabaleswar K. (DK) Panda, The Ohio State University.
Intra-MIC MPI Communication using MVAPICH2: Early Experience. Sreeram Potluri, Karen Tomko, Devendar Bureddy and Dhabaleswar K. Panda, The Ohio State University.
Communication Frameworks for HPC and Big Data. Talk at the HPC Advisory Council Spain Conference (2015) by Dhabaleswar K. (DK) Panda, The Ohio State University.
Software Libraries and Middleware for Exascale Systems. Talk at the HPC Advisory Council China Workshop (2012) by Dhabaleswar K. (DK) Panda, The Ohio State University.
Exploiting HPC Technologies for Accelerating Big Data Processing and Storage. Talk in the 5194 class by Xiaoyi Lu, The Ohio State University.
High Performance Computing. Dror Goldenberg, HPCAC Switzerland Conference, March 2015.
The Road to ExaScale: Advances in High-Performance Interconnect Infrastructure. Mellanox Technologies, September 2011.
Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters. Krishna Kandalla, Emilio P. Mancini, Sayantan Sur and Dhabaleswar K. Panda, The Ohio State University.
Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters. Hari Subramoni, Ping Lai, Sayantan Sur and Dhabaleswar K. Panda, The Ohio State University.
DLoBD: An Emerging Paradigm of Deep Learning over Big Data Stacks on RDMA-enabled Clusters. Talk at the OFA Workshop 2018 by Xiaoyi Lu, The Ohio State University.
CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart. Xiangyong Ouyang, Raghunath Rajachandrasekar, Xavier Besseron, Hao Wang, Jian Huang and Dhabaleswar K. Panda, The Ohio State University.
Designing Software Libraries and Middleware for Exascale Systems: Opportunities and Challenges. Talk at Brookhaven National Laboratory (Oct 2014) by Dhabaleswar K. (DK) Panda, The Ohio State University.
Support for GPUs with GPUDirect RDMA in MVAPICH2. Talk at the NVIDIA booth (SC 13) by D. K. Panda, The Ohio State University.