Accelerating Data Management and Processing on Modern Clusters with RDMA-Enabled Interconnects


1 Accelerating Data Management and Processing on Modern Clusters with RDMA-Enabled Interconnects Keynote Talk at ADMS 2014 by Dhabaleswar K. (DK) Panda The Ohio State University

2 Introduction to Big Data Applications and Analytics Big Data has become one of the most important elements of business analytics Provides groundbreaking opportunities for enterprise information management and decision making The amount of data is exploding; companies are capturing and digitizing more information than ever The rate of information growth appears to be exceeding Moore's Law ADMS '14 2

3 4V Characteristics of Big Data Commonly accepted 3V's of Big Data Volume, Velocity, Variety Michael Stonebraker: Big Data Means at Least Three Different Things 4/5V's of Big Data 3V + *Veracity, *Value ADMS '14 3

4 Velocity of Big Data How Much Data Is Generated Every Minute on the Internet? The global Internet population grew 6.59% from 2010 to 2011 and now represents 2.1 Billion People. ADMS '14 4

5 Data Management and Processing in Modern Datacenters Substantial impact on designing and utilizing modern data management and processing systems in multiple tiers Front-end data accessing and serving (Online) Memcached + DB (e.g. MySQL), HBase Back-end data analytics (Offline) HDFS, MapReduce, Spark ADMS '14 5

6 Overview of Web 2.0 Architecture and Memcached Three-layer architecture of Web 2.0 Web Servers, Memcached Servers, Database Servers Memcached is a core component of Web 2.0 architecture ADMS '14 6

7 Memcached Architecture Distributed Caching Layer Allows aggregation of spare memory from multiple nodes General purpose Typically used to cache database queries, results of API calls Scalable model, but typical usage very network intensive ADMS '14 7

8 HBase Overview Apache Hadoop Database - a semi-structured database, which is highly scalable Integral part of many datacenter applications e.g., Facebook Social Inbox Developed in Java for platform-independence and portability (HBase Architecture) Uses sockets for communication! ADMS '14 8

9 Hadoop Distributed File System (HDFS) Primary storage of Hadoop; highly reliable and fault-tolerant Adopted by many reputed organizations e.g., Facebook, Yahoo! NameNode: stores the file system namespace DataNode: stores data blocks Developed in Java for platform-independence and portability Uses sockets for communication! (HDFS Architecture) ADMS '14 9

10 Data Movement in Hadoop MapReduce Bulk Data Transfer Disk Operations Map and Reduce Tasks carry out the total job execution Map tasks read from HDFS, operate on the data, and write the intermediate data to local disk Reduce tasks get these data by shuffle from TaskTrackers, operate on them, and write to HDFS Communication in the shuffle phase uses HTTP over Java Sockets ADMS '14 10

11 Spark Architecture Overview An in-memory data-processing framework Iterative machine learning jobs Interactive data analytics Scala-based implementation Standalone, YARN, Mesos Scalable and communication intensive Wide dependencies between Resilient Distributed Datasets (RDDs) MapReduce-like shuffle operations to repartition RDDs Sockets-based communication ADMS '14 11

12 Presentation Outline Overview of Modern Clusters, Interconnects and Protocols Challenges for Accelerating Data Management and Processing The High-Performance Big Data (HiBD) Project RDMA-based design for Memcached and HBase RDMA-based Memcached Case study with OLTP SSD-assisted hybrid Memcached RDMA-based HBase RDMA-based designs for Apache Hadoop and Spark Case studies with HDFS, MapReduce, and Spark RDMA-based MapReduce on HPC Clusters with Lustre Ongoing and Future Activities Conclusion and Q&A ADMS '14 12

13 High-End Computing (HEC): PetaFlop to ExaFlop 1 PFlops in 2008; 1 EFlops in 2018? Expected to have an ExaFlop system in the near future! ADMS '14 13

14 Trends for Commodity Computing Clusters in the Top500 List (Chart: number and percentage of clusters in the Top500 over the timeline) ADMS '14 14

15 High End Computing (HEC) High End Computing (HEC) grows dramatically High Performance Computing Big Data Computing Technology Advancement Multi-core/many-core technologies and accelerators Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE) Solid State Drives (SSDs) and Non-Volatile Random-Access Memory (NVRAM) Accelerators (NVIDIA GPGPUs and Intel Xeon Phi) Tianhe-2 (1) Titan (2) Stampede (6) Tianhe-1A (10) ADMS '14 15

16 Overview of High Performance Interconnects High-Performance Computing (HPC) has adopted advanced interconnects and protocols InfiniBand 10 Gigabit Ethernet/iWARP RDMA over Converged Enhanced Ethernet (RoCE) Very Good Performance Low latency (few microseconds) High Bandwidth (100 Gb/s with dual FDR InfiniBand) Low CPU overhead (5-10%) OpenFabrics software stack with IB, iWARP and RoCE interfaces is driving HPC systems Many such systems in the Top500 list ADMS '14 16

17 All interconnects and protocols in OpenFabrics Stack (Diagram: Application / Middleware over Sockets or Verbs interfaces, going through kernel-space TCP/IP, hardware-offloaded TCP/IP, IPoIB, user-space RSockets/SDP, or user-space RDMA, down to Ethernet and InfiniBand adapters and switches) Resulting options: 1/10/40 GigE, IPoIB, 10/40 GigE-TOE, RSockets, SDP, iWARP, RoCE, IB Native ADMS '14 17

18 Trends of Networking Technologies in TOP500 Systems Percentage share of InfiniBand is steadily increasing (Chart: interconnect family systems share over time) ADMS '14 18

19 Large-scale InfiniBand Installations 223 IB Clusters (44.3%) in the June 2014 Top500 list Installations in the Top 50 (25 systems): 519,640 cores (Stampede) at TACC (7th) 120,640 cores (Nebulae) at China/NSCS (28th) 62,640 cores (HPC2) in Italy (11th) 72,288 cores (Yellowstone) at NCAR (29th) 147,456 cores (SuperMUC) in Germany (12th) 70,560 cores (Helios) at Japan/IFERC (30th) 76,320 cores (Tsubame 2.5) at Japan/GSIC (13th) 138,368 cores (Tera-100) at France/CEA (35th) 194,616 cores (Cascade) at PNNL (15th) 222,072 cores (QUARTETTO) in Japan (37th) 110,400 cores (Pangea) at France/Total (16th) 53,540 cores (PRIMERGY) in Australia (38th) 96,192 cores (Pleiades) at NASA/Ames (21st) 77,520 cores (Conte) at Purdue University (39th) 73,584 cores (Spirit) at USA/Air Force (24th) 44,520 cores (Spruce A) at AWE in UK (40th) 77,184 cores (Curie thin nodes) at France/CEA (26th) 48,896 cores (MareNostrum) at Spain/BSC (41st) 65,320-core iDataPlex DX360M4 at Germany/Max-Planck (27th) and many more! ADMS '14 19

20 Open Standard InfiniBand Networking Technology Introduced in Oct 2000 High Performance Data Transfer Interprocessor communication and I/O Low latency (<1.0 microsec), High bandwidth (up to 12.5 GigaBytes/sec -> 100Gbps), and low CPU utilization (5-10%) Flexibility for LAN and WAN communication Multiple Transport Services Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), Unreliable Datagram (UD), and Raw Datagram Provides flexibility to develop upper layers Multiple Operations Send/Recv RDMA Read/Write Atomic Operations (very unique) high performance and scalable implementations of distributed locks, semaphores, collective communication operations Leading to big changes in designing HPC clusters, file systems, cloud computing systems, grid computing systems, etc. ADMS '14 20

21 Communication in the Channel Semantics (Send/Receive Model) (Diagram: memory segments, QPs with Send/Recv queues, and CQs on both processors, connected through InfiniBand devices with hardware ACK) Processor is involved only to: 1. Post receive WQE 2. Post send WQE 3. Pull out completed CQEs from the CQ Send WQE contains information about the send buffer (multiple non-contiguous segments) Receive WQE contains information on the receive buffer (multiple non-contiguous segments); incoming messages have to be matched to a receive WQE to know where to place them ADMS '14 21
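
The channel semantics above map directly onto the OpenFabrics verbs API. The fragment below is a minimal sketch, assuming a QP that has already been created and connected and a buffer already registered with ibv_reg_mr; it only illustrates how the receive WQE, the send WQE and the CQ polling from the three steps look in code, and is not taken from MVAPICH2 or the HiBD packages.

```c
/* Sketch of the Send/Recv (channel) semantics with libibverbs.
 * Assumes the QP is already connected and the buffer registered (mr);
 * error handling is reduced to return codes for brevity. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

int post_recv(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, size_t len)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = (uint32_t)len,
                           .lkey = mr->lkey };
    struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad;
    return ibv_post_recv(qp, &wr, &bad);            /* step 1: post receive WQE */
}

int post_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, size_t len)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = (uint32_t)len,
                           .lkey = mr->lkey };
    struct ibv_send_wr wr = { .wr_id = 2, .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_SEND,
                              .send_flags = IBV_SEND_SIGNALED };
    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wr, &bad);            /* step 2: post send WQE */
}

int wait_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;
    do {                                            /* step 3: pull a CQE from the CQ */
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);
    return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}
```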

22 Communication in the Memory Semantics (RDMA Model) (Diagram: initiator and target memory segments, QPs and CQs, connected through InfiniBand devices with hardware ACK) Initiator processor is involved only to: 1. Post send WQE 2. Pull out completed CQE from the send CQ No involvement from the target processor Send WQE contains information about the send buffer (multiple segments) and the receive buffer (single segment) ADMS '14 22
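
The one-sided memory semantics look similar on the initiator side, except that the work request names the remote buffer through an address/rkey pair that must have been exchanged out of band (for example over a send/receive handshake). Below is a minimal sketch, again assuming an already connected QP and a registered local buffer; it is illustrative only and not code from the OSU designs.

```c
/* Sketch of the RDMA (memory) semantics: a one-sided RDMA Write.
 * remote_addr and rkey are assumed to have been exchanged beforehand;
 * the target processor is not involved in the transfer. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

int rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *local_buf, size_t len,
               uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = { .addr = (uintptr_t)local_buf, .length = (uint32_t)len,
                           .lkey = mr->lkey };
    struct ibv_send_wr wr = { .wr_id = 3, .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_RDMA_WRITE,   /* IBV_WR_RDMA_READ for reads */
                              .send_flags = IBV_SEND_SIGNALED };
    struct ibv_send_wr *bad;

    wr.wr.rdma.remote_addr = remote_addr;   /* where the data lands on the target */
    wr.wr.rdma.rkey        = rkey;          /* remote memory key */
    return ibv_post_send(qp, &wr, &bad);    /* only the initiator posts a WQE */
}
```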

23 Presentation Outline Overview of Modern Clusters, Interconnects and Protocols Challenges for Accelerating Data Management and Processing The High-Performance Big Data (HiBD) Project RDMA-based design for Memcached and HBase RDMA-based designs for Apache Hadoop and Spark Case studies with HDFS, MapReduce, and Spark RDMA-based MapReduce on HPC Clusters with Lustre Ongoing and Future Activities Conclusion and Q&A ADMS '14 23

24 Wide Adaptation of RDMA Technology Message Passing Interface (MPI) for HPC Parallel File Systems Lustre GPFS Delivering excellent performance (latency, bandwidth and CPU Utilization) Delivering excellent scalability ADMS '14 24

25 MVAPICH2/MVAPICH2-X Software High Performance open-source MPI Library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE) MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002 MVAPICH2-X (MPI + PGAS), Available since 2012 Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC) Used by more than 2,200 organizations in 73 countries More than 221,000 downloads from the OSU site directly Empowering many TOP500 clusters 7th ranked 519,640-core cluster (Stampede) at TACC 13th ranked 74,358-core cluster (Tsubame 2.5) at Tokyo Institute of Technology 23rd ranked 96,192-core cluster (Pleiades) at NASA many others... Available with software stacks of many IB, HSE, and server vendors including Linux Distros (RedHat and SuSE) Partner in the U.S. NSF-TACC Stampede System ADMS '14 25

26 One-way Latency: MPI over IB with MVAPICH2 (Plots: small- and large-message latency in us vs. message size in bytes for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR and Ivy-ConnectIB-DualFDR) DDR, QDR: Quad-core (Westmere) Intel PCI Gen2 with IB switch FDR: Octa-core (SandyBridge) Intel PCI Gen3 with IB switch ConnectIB-Dual FDR: Octa-core (SandyBridge) Intel PCI Gen3 with IB switch ConnectIB-Dual FDR: Deca-core (IvyBridge) Intel PCI Gen3 with IB switch ADMS '14 26

27 Bandwidth: MPI over IB with MVAPICH2 (Plots: unidirectional and bidirectional bandwidth in MBytes/sec vs. message size in bytes for the same set of adapters) DDR, QDR: Quad-core (Westmere) Intel PCI Gen2 with IB switch FDR: Octa-core (SandyBridge) Intel PCI Gen3 with IB switch ConnectIB-Dual FDR: Octa-core (SandyBridge) Intel PCI Gen3 with IB switch ConnectIB-Dual FDR: Deca-core (IvyBridge) Intel PCI Gen3 with IB switch ADMS '14 27

28 Can High-Performance Interconnects Benefit Data Management and Processing? Most of the current Big Data systems use Ethernet Infrastructure with Sockets Concerns for performance and scalability Usage of High-Performance Networks is beginning to draw interest Oracle, IBM, Google, Intel are working along these directions What are the challenges? Where do the bottlenecks lie? Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)? Can HPC Clusters with High-Performance networks be used for Big Data applications using Hadoop and Memcached? ADMS '14 28

29 Designing Communication and I/O Libraries for Big Data Systems: Challenges Applications Benchmarks Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) Programming Models (Sockets) Other Protocols? Point-to-Point Communication Communication and I/O Library Threaded Models and Synchronization Virtualization I/O and File Systems QoS Fault-Tolerance Networking Technologies (InfiniBand, 1/10/40GigE and Intelligent NICs) Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators) Storage Technologies (HDD and SSD) ADMS '14 29

30 Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols? Current Design Enhanced Designs Our Approach Application Application Application Sockets Accelerated Sockets Verbs / Hardware Offload OSU Design Verbs Interface 1/10 GigE Network 10 GigE or InfiniBand 10 GigE or InfiniBand Sockets not designed for high performance Stream semantics often mismatch for upper layers (Memcached, HBase, Hadoop) Zero-copy not available for non-blocking sockets ADMS '14 30

31 Presentation Outline Overview of Modern Clusters, Interconnects and Protocols Challenges for Accelerating Data Management and Processing The High-Performance Big Data (HiBD) Project RDMA-based design for Memcached and HBase RDMA-based designs for Apache Hadoop and Spark Case studies with HDFS, MapReduce, and Spark RDMA-based MapReduce on HPC Clusters with Lustre Ongoing and Future Activities Conclusion and Q&A ADMS '14 31

32 The High-Performance Big Data (HiBD) Project RDMA for Memcached (RDMA-Memcached) RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x) RDMA for Apache Hadoop 1.x (RDMA-Hadoop) OSU HiBD-Benchmarks (OHB) RDMA for Apache HBase and Spark ADMS '14 32

33 RDMA for Memcached Distribution High-Performance Design of Memcached over RDMA-enabled Interconnects High performance design with native InfiniBand and RoCE support at the verbs-level for Memcached and libmemcached components Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB) Current release: 0.9.1 Based on Memcached and libmemcached Compliant with Memcached APIs and applications Tested with Mellanox InfiniBand adapters (DDR, QDR and FDR) RoCE support with Mellanox adapters Various multi-core platforms ADMS '14 33

34 RDMA for Apache Hadoop 1.x/2.x Distributions High-Performance Design of Hadoop over RDMA-enabled Interconnects High performance design with native InfiniBand and RoCE support at the verbs-level for HDFS, MapReduce, and RPC components Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB) Current release: 0.9.9/0.9.1 Based on Apache Hadoop 1.2.1/2.4.1 Compliant with Apache Hadoop 1.2.1/2.4.1 APIs and applications Tested with Mellanox InfiniBand adapters (DDR, QDR and FDR) RoCE support with Mellanox adapters Various multi-core platforms Different file systems with disks and SSDs ADMS '14 34

35 OSU HiBD Micro-Benchmark (OHB) Suite - Memcached Released in OHB 0.7.1 (ohb_memlat) Evaluates the performance of stand-alone Memcached Three different micro-benchmarks SET Micro-benchmark: Micro-benchmark for memcached set operations GET Micro-benchmark: Micro-benchmark for memcached get operations MIX Micro-benchmark: Micro-benchmark for a mix of memcached set/get operations (Read:Write ratio is 9:1) Calculates average latency of Memcached operations Can measure throughput in Transactions Per Second ADMS '14 35
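
The measurement loop behind a latency benchmark like ohb_memlat is simple: time a large number of operations against a stand-alone server through the regular libmemcached client API and report the average. The sketch below is not the OHB source; the server address, key and iteration count are placeholders, and it only illustrates the GET-latency measurement idea.

```c
/* Minimal GET-latency measurement loop in the spirit of ohb_memlat.
 * Not the OHB code; server, key and iteration count are placeholders. */
#include <libmemcached/memcached.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "127.0.0.1", 11211);

    const char *key = "ohb_key";
    char value[64] = "x";
    memcached_set(memc, key, strlen(key), value, sizeof(value), 0, 0);

    const int iters = 10000;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) {
        size_t len; uint32_t flags; memcached_return_t rc;
        char *val = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
        free(val);                         /* libmemcached returns a malloc'd copy */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("average GET latency: %.2f us\n", us / iters);

    memcached_free(memc);
    return 0;
}
```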

36 Presentation Outline Overview of Modern Clusters, Interconnects and Protocols Challenges for Accelerating Data Management and Processing The High-Performance Big Data (HiBD) Project RDMA-based design for Memcached and HBase RDMA-based Memcached Case study with OLTP SSD-assisted hybrid Memcached RDMA-based HBase RDMA-based designs for Apache Hadoop and Spark Case studies with HDFS, MapReduce, and Spark RDMA-based MapReduce on HPC Clusters with Lustre Ongoing and Future Activities Conclusion and Q&A ADMS '14 36

37 Design Challenges of RDMA-based Memcached Can Memcached be re-designed from the ground up to utilize RDMA capable networks? How efficiently can we utilize RDMA for Memcached operations? Can we leverage the best features of RC and UD to deliver both high performance and scalability to Memcached? Memcached applications need not be modified; the middleware uses verbs interface if available ADMS '14 37

38 Memcached-RDMA Design (Diagram: sockets and RDMA clients, a master thread, sockets and verbs worker threads, and shared data with memory slabs and items) Server and client perform a negotiation protocol Master thread assigns clients to appropriate worker thread Once a client is assigned a verbs worker thread, it can communicate directly and is bound to that thread All other Memcached data structures are shared among RDMA and Sockets worker threads Native IB-verbs-level design and evaluation with Server: Memcached, Client: libmemcached Different networks and protocols: 10GigE, IPoIB, native IB (RC, UD) ADMS '14 38

39 Memcached Get Latency (Small Message) (Plots: latency in us vs. message size up to 2K for IPoIB, 10GigE, OSU-RC-IB and OSU-UD-IB on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR)) Memcached Get latency 4 bytes RC/UD DDR: 6.82/7.55 us; QDR: 4.28/4.86 us 2K bytes RC/UD DDR: 12.31/12.78 us; QDR: 8.19/8.46 us Almost a factor of four improvement over 10GigE (TOE) for 2K bytes on the DDR cluster ADMS '14 39

40 Memcached Get TPS (4 bytes) (Plots: thousands of transactions per second vs. number of clients for IPoIB (QDR), OSU-RC-IB (QDR) and OSU-UD-IB (QDR)) Memcached Get transactions per second for 4 bytes On IB QDR 1.4M/s (RC), 1.3M/s (UD) for 8 clients Significant improvement with native IB QDR compared to IPoIB ADMS '14 40

41 Memcached Performance (FDR Interconnect) (Plots: Memcached GET latency in us vs. message size up to 4K, and throughput in thousands of transactions per second vs. number of clients, for OSU-IB (FDR) and IPoIB (FDR); latency reduced by nearly 20X) Memcached Get latency OSU-IB: 2.84 us for small messages; 4.49 us for 2K bytes Memcached Throughput (4 bytes) Experiments on TACC Stampede (Intel SandyBridge Cluster, IB: FDR) 48 clients OSU-IB: 556 Kops/sec, IPoIB: 233 Kops/sec Nearly 2X improvement in throughput ADMS '14 41

42 Application Level Evaluation Olio Benchmark (Plots: time in ms vs. number of clients for IPoIB (QDR), OSU-RC-IB (QDR), OSU-UD-IB (QDR) and OSU-Hybrid-IB (QDR)) Olio Benchmark RC 1.6 sec, UD 1.9 sec, Hybrid 1.7 sec for 1024 clients 4X better than IPoIB for 8 clients Hybrid design achieves comparable performance to that of the pure RC design ADMS '14 42

43 Application Level Evaluation Real Application Workloads (Plots: time in ms vs. number of clients for IPoIB (QDR), OSU-RC-IB (QDR), OSU-UD-IB (QDR) and OSU-Hybrid-IB (QDR)) Real Application Workload RC 302 ms, UD 318 ms, Hybrid 314 ms for 1024 clients 12X better than IPoIB for 8 clients Hybrid design achieves comparable performance to that of the pure RC design J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP 11 J. Jose, H. Subramoni, K. Kandalla, M. W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached design for InfiniBand Clusters using Hybrid Transport, CCGrid 12 ADMS '14 43

44 Memcached-RDMA Case Studies with OLTP Workloads Design of scalable architectures using Memcached and MySQL Case study to evaluate benefits of using RDMA-Memcached for traditional OLTP workloads ADMS '14 44

45 Evaluation with Traditional OLTP Workloads - mysqlslap (Plots: latency in sec and throughput in q/ms vs. number of clients for Memcached-IPoIB (32Gbps) and Memcached-RDMA (32Gbps)) Illustration with Read-Cache-Read access pattern using a modified mysqlslap load testing tool Up to 4 nodes, 1 concurrent threads per node, 2 queries per client Memcached-RDMA can improve query latency by up to 66% over IPoIB (32Gbps) and throughput by up to 69% over IPoIB (32Gbps) ADMS '14 45
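
The Read-Cache-Read access pattern used above first looks a query result up in Memcached and only goes to MySQL on a miss, refilling the cache afterwards. The C sketch below only illustrates that pattern with the stock libmemcached and MySQL client APIs; it is not the modified mysqlslap driver from the evaluation, connection setup is omitted, and the key and SQL text are placeholders.

```c
/* Sketch of the Read-Cache-Read pattern: check Memcached, fall back to
 * MySQL on a miss, then populate the cache. Handles are assumed to be
 * connected already; key and query are placeholders. */
#include <libmemcached/memcached.h>
#include <mysql/mysql.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

char *cached_query(memcached_st *memc, MYSQL *db, const char *key, const char *sql)
{
    size_t len; uint32_t flags; memcached_return_t rc;
    char *val = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
    if (val != NULL)
        return val;                              /* cache hit */

    if (mysql_query(db, sql) != 0)               /* cache miss: ask the database */
        return NULL;
    MYSQL_RES *res = mysql_store_result(db);
    if (res == NULL)
        return NULL;
    MYSQL_ROW row = mysql_fetch_row(res);
    char *copy = row ? strdup(row[0]) : NULL;
    mysql_free_result(res);

    if (copy)                                    /* populate the cache for next time */
        memcached_set(memc, key, strlen(key), copy, strlen(copy), 0, 0);
    return copy;
}
```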

46 Traditional Memcached Deployments Existing memcached deployment: low hit ratio at the Memcached server due to limited main memory size; frequent access to backend DB mmap() SSD into virtual memory: higher hit ratio due to SSD-mapped virtual memory; significant overhead at virtual memory management ADMS '14 46
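
The "mmap() SSD into virtual memory" approach above simply maps a large SSD-resident file into the server's address space and lets the kernel page data in and out on demand. A minimal sketch of the idea follows, with a placeholder file path and region size: every miss becomes a page fault serviced by kernel virtual memory management, which is exactly the overhead the hybrid design on the next slides targets.

```c
/* Sketch of mapping an SSD-backed file into virtual memory to extend the
 * cache. Path and size are placeholders; the kernel VM handles all paging. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t map_size = 1UL * 1024 * 1024 * 1024;    /* 1 GB region on the SSD */
    int fd = open("/ssd/memcached.ext", O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, map_size) != 0) {
        perror("open/ftruncate");
        return 1;
    }

    /* The region looks like ordinary memory to the cache code, but touching
     * an unmapped page triggers a fault and an SSD read under kernel control. */
    char *region = mmap(NULL, map_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (region == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    region[0] = 'x';              /* faults in the first page from the SSD file */

    munmap(region, map_size);
    close(fd);
    return 0;
}
```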

47 SSD-Assisted Hybrid Memory for Memcached (Table: Get latency in us over IB Verbs, IPoIB, 10GigE and 1GigE for MySQL, Memcached in RAM, and Memcached with SSD virtual memory) SSD Basic Performance (us) (Fusion-io ioDrive): Random Read 68, Random Write 7 SSD-Assisted Hybrid Memory to expand memory size A user-level library that bypasses the kernel-based virtual memory management overhead ADMS '14 47
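
As a contrast to kernel-managed paging, a user-level library can keep the kernel out of the data path by reading SSD blocks directly into buffers it manages itself. The sketch below is only a generic illustration of that idea using O_DIRECT; it is not the Hmem library from the ICPP '12 paper, and the file path, block size and offset are placeholders.

```c
/* Generic illustration of user-level SSD access: an aligned O_DIRECT read
 * into an application-managed buffer, avoiding kernel page-cache/VM paging.
 * Not the Hmem library; path, block size and offset are placeholders. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const size_t blk = 4096;                  /* O_DIRECT requires aligned I/O */
    void *buf;
    if (posix_memalign(&buf, blk, blk) != 0)
        return 1;

    int fd = open("/ssd/object.store", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* The application decides what to keep in RAM and what to fetch from the
     * SSD; there are no page faults on the cached data path. */
    ssize_t n = pread(fd, buf, blk, 0);
    printf("read %zd bytes directly from the SSD\n", n);

    free(buf);
    close(fd);
    return 0;
}
```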

48 Evaluation on SSD-Assisted Hybrid Memory for Memcached - Latency Memcached Get Latency: 347 us vs. 93 us (3.6X improvement) Memcached Put Latency: 48 us vs. 16 us (3X improvement) Memcached with InfiniBand DDR transport (16Gbps) 3GB data in SSD, 256 MB read/write buffer Get / Put a random object X. Ouyang, N. S. Islam, R. Rajachandrasekar, J. Jose, M. Luo, H. Wang and D. K. Panda, SSD-Assisted Hybrid Memory to Accelerate Memcached over High Performance Networks, ICPP 12 ADMS '14 48

49 Evaluation on SSD-Assisted Hybrid Memory for Memcached - Throughput (Plot: transactions/sec in thousands vs. number of processes for Native Hmem (upper bound), Memcached+IB+Hmem and Memcached+IB+VMS; up to 5.3X improvement) Memcached with InfiniBand DDR transport (16Gbps) 3GB data in SSD, object size = 1KB, 256 MB read/write buffer 1, 2, 4 and 8 client processes perform random get() ADMS '14 49

50 Motivation - Detailed Analysis on HBase Put/Get (Stacked-bar plots: time in us broken into communication, communication preparation, server processing, server serialization, client processing and client serialization for IPoIB (DDR), 10GigE and OSU-IB (DDR), for HBase 1KB Put and Get) HBase 1KB Put Communication Time 8.9 us, a factor of 6X improvement over 10GigE for communication time HBase 1KB Get Communication Time 8.9 us, a factor of 6X improvement over 10GigE for communication time M. W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, Poster Paper, ISPASS 12 ADMS '14 50

51 HBase-RDMA Design Overview Applications HBase Java Socket Interface Java Native Interface (JNI) OSU-IB Design IB Verbs 1/10 GigE, IPoIB Networks RDMA Capable Networks (IB, 10GE/iWARP, RoCE..) JNI Layer bridges Java-based HBase with communication library written in native code Enables high performance RDMA communication, while supporting traditional socket interface ADMS '14 51
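
The JNI layer is what lets the Java-level HBase code hand buffers to a native communication library. The fragment below is a hypothetical illustration of such a bridge; the Java class, method and the send_over_verbs() helper are invented for this sketch and do not come from the OSU-IB sources. Passing a direct ByteBuffer lets the native side obtain the raw buffer address without copying.

```c
/* Hypothetical JNI bridge between a Java client and a native verbs library.
 * Class/method names and send_over_verbs() are invented for this sketch. */
#include <jni.h>
#include <stddef.h>

/* Assumed to be provided by the native communication library. */
int send_over_verbs(void *buf, size_t len);

/* Matches a Java declaration such as:
 *   package edu.example.rdma;
 *   class RdmaBridge { native int rdmaSend(java.nio.ByteBuffer buf, int len); }
 */
JNIEXPORT jint JNICALL
Java_edu_example_rdma_RdmaBridge_rdmaSend(JNIEnv *env, jobject self,
                                          jobject byte_buffer, jint len)
{
    void *addr = (*env)->GetDirectBufferAddress(env, byte_buffer);
    if (addr == NULL)
        return -1;                   /* not a direct buffer */
    return (jint)send_over_verbs(addr, (size_t)len);
}
```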

52 HBase-RDMA Design - Communication Flow Application (HBase Client) (HBase Server) Response Queue Handler Threads IB Reader HBase Client Responder Reader Threads Call Queue Helper Threads Network Selector JNI Interface Sockets Sockets JNI Interface OSU-IB Design OSU-IB Design 1/10G Network 1/10G Network IB Verbs IB Verbs RDMA design components Network Selector, IB Reader, Helper threads JNI Adaptive Interface ADMS '14 52

53 HBase Micro-benchmark (Single-Server-Multi-Client) Results (Plots: Get latency in us and throughput in ops/sec vs. number of clients for IPoIB (DDR), OSU-IB (DDR) and 10GigE) HBase Get latency 4 clients: 104.5 us HBase Get throughput 4 clients: 37.1 Kops/sec; 16 clients: 53.4 Kops/sec 27% improvement in throughput for 16 clients over 10GigE J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS 12 ADMS '14 53

54 HBase YCSB Read-Write Workload (Plots: read and write latency vs. number of clients for IPoIB (QDR), 10GigE and OSU-IB (QDR)) HBase Get latency (Yahoo! Cloud Service Benchmark) 64 clients: 2.0 ms; 128 clients: 3.5 ms 42% improvement over IPoIB for 128 clients HBase Put latency 64 clients: 1.9 ms; 128 clients: 3.5 ms 40% improvement over IPoIB for 128 clients ADMS '14 54

55 Presentation Outline Overview of Modern Clusters, Interconnects and Protocols Challenges for Accelerating Data Management and Processing The High-Performance Big Data (HiBD) Project RDMA-based design for Memcached and HBase RDMA-based designs for Apache Hadoop and Spark Case studies with HDFS, MapReduce, and Spark RDMA-based MapReduce on HPC Clusters with Lustre Ongoing and Future Activities Conclusion and Q&A ADMS '14 55

56 Acceleration Case Studies and In-Depth Performance Evaluation RDMA-based Designs and Performance Evaluation HDFS MapReduce Spark ADMS '14 56

57 Design Overview of HDFS with RDMA Applications Design Features Others HDFS Write RDMA-based HDFS write RDMA-based HDFS replication Java Socket Interface Java Native Interface (JNI) Parallel replication support OSU Design On-demand connection setup 1/10 GigE, IPoIB Network Verbs RDMA Capable Networks (IB, 10GE/iWARP, RoCE..) InfiniBand/RoCE support Enables high performance RDMA communication, while supporting traditional socket interface JNI Layer bridges Java-based HDFS with communication library written in native code ADMS '14 57
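
Since the RDMA-based design sits underneath the standard HDFS client interfaces, an existing application writes files exactly as before. The sketch below uses the stock libhdfs C API to write a small file; the path and replication factor are placeholders, and nothing in it is specific to the OSU design, which is the point of keeping the upper-level APIs unchanged.

```c
/* Writing a file through the stock libhdfs C API; client code like this is
 * unchanged whether sockets or verbs carry the bytes underneath.
 * Path and replication factor are placeholders. */
#include <hdfs.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    hdfsFS fs = hdfsConnect("default", 0);    /* use fs.defaultFS from the config */
    if (fs == NULL) {
        fprintf(stderr, "cannot connect to HDFS\n");
        return 1;
    }

    const char *path = "/tmp/rdma_hdfs_demo.txt";
    hdfsFile out = hdfsOpenFile(fs, path, O_WRONLY | O_CREAT, 0, 3 /* replicas */, 0);
    if (out == NULL) {
        fprintf(stderr, "cannot open %s for writing\n", path);
        return 1;
    }

    const char *msg = "hello over HDFS\n";
    hdfsWrite(fs, out, msg, strlen(msg));
    hdfsFlush(fs, out);
    hdfsCloseFile(fs, out);
    hdfsDisconnect(fs);
    return 0;
}
```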

58 Communication Times in HDFS (Plot: communication time in sec vs. file size from 2GB to 10GB for 10GigE, IPoIB (QDR) and OSU-IB (QDR); reduced by 30%) Cluster with HDD DataNodes 30% improvement in communication time over IPoIB (QDR) 56% improvement in communication time over 10GigE Similar improvements are obtained for SSD DataNodes N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012 ADMS '14 58

59 Evaluations using OHB HDFS Micro-benchmark (Plots: write latency in sec, reduced by 25%, and total write throughput in MBps with 2 clients, increased by 5%, vs. file size in GB for 10GigE, IPoIB (QDR) and OSU-IB (QDR)) Cluster with 4 HDD DataNodes, single disk per node 25% improvement in latency over IPoIB (QDR) for 1GB file size 5% improvement in throughput over IPoIB (QDR) for 1GB file size N. S. Islam, X. Lu, M. W. Rahman, J. Jose, and D. K. Panda, A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Int'l Workshop on Big Data Benchmarking (WBDB '12), December 2012 ADMS '14 59

60 Evaluations using TestDFSIO (Plots: total throughput in MBps vs. file size in GB for 10GigE, IPoIB (QDR) and OSU-IB (QDR); 8 HDD nodes with TestDFSIO with 8 maps, increased by 24%, and 4 SSD nodes with TestDFSIO with 4 maps, increased by 61%) Cluster with 8 HDD DataNodes, single disk per node 24% improvement over IPoIB (QDR) for 2GB file size Cluster with 4 SSD DataNodes, single SSD per node 61% improvement over IPoIB (QDR) for 2GB file size ADMS '14 60

61 Evaluations using Enhanced DFSIO of Intel HiBench on TACC-Stampede Aggregated Throughput (MBps) IPoIB (FDR) OSU-IB (FDR) Increased by 64% File Size (GB) Execution Time (s) IPoIB (FDR) Reduced by 37% OSU-IB (FDR) File Size (GB) Cluster with 64 DataNodes, single HDD per node 64% improvement in throughput over IPoIB (FDR) for 256GB file size 37% improvement in latency over IPoIB (FDR) for 256GB file size ADMS '14 61

62 Evaluations using YCSB (32 Region Servers: 100% Update) (Plots: average Put latency in us and throughput in Kops/sec vs. number of records, record size 1KB, for IPoIB (QDR) and OSU-IB (QDR)) HBase using TCP/IP, running over HDFS-IB HBase Put latency for 48K records 201 us for OSU Design; 272 us for IPoIB (32Gbps) HBase Put throughput for 48K records 4.42 Kops/sec for OSU Design; 3.63 Kops/sec for IPoIB (32Gbps) 26% improvement in average latency; 24% improvement in throughput ADMS '14 62

63 HDFS and HBase Integration over IB (OSU-IB) (Plots: latency in us and throughput in Kops/sec vs. number of records, record size 1KB, for IPoIB (QDR), OSU-IB (HDFS) - IPoIB (HBase) (QDR), and OSU-IB (HDFS) - OSU-IB (HBase) (QDR)) YCSB Evaluation with 4 RegionServers (100% update) HBase Put Latency and Throughput for 36K records - 37% improvement over IPoIB (32Gbps) - 18% improvement over OSU-IB HDFS only ADMS '14 63

64 Acceleration Case Studies and In-Depth Performance Evaluation RDMA-based Designs and Performance Evaluation HDFS MapReduce Spark ADMS '14 64

65 Design Overview of MapReduce with RDMA Java Socket Interface MapReduce Job Tracker 1/10 GigE, IPoIB Network Applications Task Tracker Map Reduce Java Native Interface (JNI) OSU Design Verbs RDMA Capable Networks (IB, 10GE/iWARP, RoCE..) Design Features RDMA-based shuffle Prefetching and caching map output Efficient Shuffle Algorithms In-memory merge On-demand Shuffle Adjustment Advanced overlapping map, shuffle, and merge shuffle, merge, and reduce On-demand connection setup InfiniBand/RoCE support Enables high performance RDMA communication, while supporting traditional socket interface JNI Layer bridges Java-based MapReduce with communication library written in native code ADMS '14 65

66 Advanced Overlapping among different phases Default Architecture A hybrid approach to achieve maximum possible overlapping in MapReduce across all phases compared to other approaches Efficient Shuffle Algorithms Dynamic and Efficient Switching On-demand Shuffle Adjustment Enhanced Overlapping Advanced Overlapping M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014. ADMS '14 66

67 Evaluations using OHB MapReduce Micro-benchmark (Plots: 8 slave nodes and 16 slave nodes, IPoIB (FDR) vs. OSU-IB (FDR)) Stand-alone MapReduce micro-benchmark (MR-AVG) 1 KB key/value pair size For 8 slave nodes, RDMA has up to 30% improvement over IPoIB (56Gbps) For 16 slave nodes, RDMA has up to 28% improvement over IPoIB (56Gbps) D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks, BPOE-5 (2014). ADMS '14 67

68 Evaluations using Sort (HDD, SSD) (Plots: job execution time in sec vs. data size in GB for 10GigE, IPoIB (QDR), UDA-IB (QDR) and OSU-IB (QDR), with 8 HDD DataNodes and 8 SSD DataNodes) With 8 HDD DataNodes for 40GB sort 43% improvement over IPoIB (QDR) 44% improvement over UDA-IB (QDR) With 8 SSD DataNodes for 40GB sort 52% improvement over IPoIB (QDR) 45% improvement over UDA-IB (QDR) ADMS '14 68

69 Evaluations with TeraSort (Plot: job execution time in sec vs. data size in GB for 10GigE, IPoIB (QDR), UDA-IB (QDR) and OSU-IB (QDR)) 100 GB TeraSort with 8 DataNodes with 2 HDD per node 49% benefit compared to UDA-IB (QDR) 54% benefit compared to IPoIB (QDR) 56% benefit compared to 10GigE ADMS '14 69

70 Performance Evaluation on Larger Clusters (Plots: job execution time in sec for Sort on the OSU cluster with IPoIB (QDR), UDA-IB (QDR) and OSU-IB (QDR) at data sizes 60 GB, 120 GB and 240 GB on cluster sizes 16, 32 and 64, and for TeraSort on TACC Stampede with IPoIB (FDR), UDA-IB (FDR) and OSU-IB (FDR) at data sizes 80 GB, 160 GB and 320 GB on cluster sizes 16, 32 and 64) For 240GB Sort in 64 nodes 40% improvement over IPoIB (QDR) with HDD used for HDFS For 320GB TeraSort in 64 nodes 38% improvement over IPoIB (FDR) with HDD used for HDFS ADMS '14 70

71 Evaluations using PUMA Workload (Plot: normalized execution time for 10GigE, IPoIB (QDR) and OSU-IB (QDR) on the AdjList (30GB), SelfJoin (80GB), SeqCount (30GB), WordCount (30GB) and InvertIndex (30GB) benchmarks) 50% improvement in Self Join over IPoIB (QDR) for 80 GB data size 49% improvement in Sequence Count over IPoIB (QDR) for 30 GB data size ADMS '14 71

72 Evaluations using SWIM (Plot: execution time in sec vs. job number for IPoIB (QDR) and OSU-IB (QDR)) 50 small MapReduce jobs executed in a cluster size of 4 Maximum performance benefit 24% over IPoIB (QDR) Average performance benefit 13% over IPoIB (QDR) ADMS '14 72

73 Acceleration Case Studies and In-Depth Performance Evaluation RDMA-based Designs and Performance Evaluation HDFS MapReduce Spark ADMS '14 73

74 Design Overview of Spark with RDMA Design Features RDMA-based shuffle SEDA-based plugins Dynamic connection management and sharing Non-blocking and out-of-order data transfer Off-JVM-heap buffer management InfiniBand/RoCE support Enables high performance RDMA communication, while supporting traditional socket interface JNI Layer bridges Scala-based Spark with communication library written in native code X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014 ADMS '14 74

75 Preliminary Results of Spark-RDMA Design - GroupBy (Plots: GroupBy time in sec vs. data size in GB for 10GigE, IPoIB and RDMA; 4 HDD nodes with 32 cores and 8 HDD nodes with 64 cores) Cluster with 4 HDD Nodes, single disk per node, 32 concurrent tasks 18% improvement over IPoIB (QDR) for 10GB data size Cluster with 8 HDD Nodes, single disk per node, 64 concurrent tasks 20% improvement over IPoIB (QDR) for 20GB data size ADMS '14 75

76 Presentation Outline Overview of Modern Clusters, Interconnects and Protocols Challenges for Accelerating Data Management and Processing The High-Performance Big Data (HiBD) Project RDMA-based design for Memcached and HBase RDMA-based designs for Apache Hadoop and Spark Case studies with HDFS, MapReduce, and Spark RDMA-based MapReduce on HPC Clusters with Lustre Ongoing and Future Activities Conclusion and Q&A ADMS '14 76

77 Optimize Apache Hadoop over Parallel File Systems Compute Nodes TaskTracker Map Reduce Lustre Client MetaData Servers Object Storage Servers Lustre Setup HPC Cluster Deployment Hybrid topological solution of Beowulf architecture with separate I/O nodes Lean compute nodes with light OS; more memory space; small local storage Sub-cluster of dedicated I/O nodes with parallel file systems, such as Lustre MapReduce over Lustre Local disk is used as the intermediate data directory Lustre is used as the intermediate data directory ADMS '14 77

78 Case Study - Performance Improvement of MapReduce over Lustre on TACC-Stampede Local disk is used as the intermediate data directory (Plots: job execution time in sec for IPoIB (FDR) and OSU-IB (FDR) across data sizes and cluster sizes 4 to 128) For 5GB Sort in 64 nodes 44% improvement over IPoIB (FDR) For 64GB Sort in 128 nodes 48% improvement over IPoIB (FDR) M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-based Approach Benefit?, Euro-Par, August 2014. ADMS '14 78

79 Case Study - Performance Improvement of MapReduce over Lustre on TACC-Stampede Lustre is used as the intermediate data directory (Plots: job execution time in sec for IPoIB (FDR) and OSU-IB (FDR) at 8 GB, 16 GB and 32 GB on cluster sizes 8, 16 and 32) For 16GB Sort in 16 nodes 35% improvement over IPoIB (FDR) For 32GB Sort in 32 nodes 33% improvement over IPoIB (FDR) Can more optimizations be achieved by leveraging more features of Lustre? ADMS '14 79

80 Presentation Outline Overview of Modern Clusters, Interconnects and Protocols Challenges for Accelerating Data Management and Processing The High-Performance Big Data (HiBD) Project RDMA-based design for Memcached and HBase RDMA-based designs for Apache Hadoop and Spark Case studies with HDFS, MapReduce, and Spark RDMA-based MapReduce on HPC Clusters with Lustre Ongoing and Future Activities Conclusion and Q&A ADMS '14 80

81 Ongoing and Future Activities for Hadoop Accelerations Some other existing Hadoop solutions Hadoop-A (UDA) RDMA-based implementation of Hadoop MapReduce shuffle engine Uses plug-in based solution : UDA (Unstructured Data Accelerator) Cloudera Distributions of Hadoop CDH: Open Source distribution Cloudera Standard: CDH with automated cluster management Cloudera Enterprise: Cloudera Standard + enhanced management capabilities and support Ethernet/IPoIB Hortonworks Data Platform HDP: Open Source distribution Developed as projects through the Apache Software Foundation (ASF), NO proprietary extensions or add-ons Functional areas: Data Management, Data Access, Data Governance and Integration, Security, and Operations. Ethernet/IPoIB ADMS '14 81

82 Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges Applications Benchmarks Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) Upper level Changes? Programming Models (Sockets) Other RDMA Protocols? Point-to-Point Communication Communication and I/O Library Threaded Models and Synchronization Virtualization I/O and File Systems QoS Fault-Tolerance Networking Technologies (InfiniBand, 1/10/40GigE and Intelligent NICs) Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators) Storage Technologies (HDD and SSD) ADMS '14 82

83 More Challenges Multi-threading and Synchronization Multi-threaded model exploration Fine-grained synchronization and lock-free design Unified helper threads for different components Multi-endpoint design to support multi-threading communications QoS and Virtualization Network virtualization and locality-aware communication for Big Data middleware Hardware-level virtualization support for End-to-End QoS I/O scheduling and storage virtualization Live migration ADMS '14 83

84 More Challenges (Cont'd) Support of Accelerators Efficient designs for Big Data middleware to take advantage of NVIDIA GPGPUs and Intel MICs Offload computation-intensive workload to accelerators Explore maximum overlapping between communication and offloaded computation Fault Tolerance Enhancements Exploration of light-weight fault tolerance mechanisms for Big Data Support of Parallel File Systems Optimize Big Data middleware over parallel file systems (e.g. Lustre) on modern HPC clusters Big Data Benchmarking ADMS '14 84

85 Are the Current Benchmarks Sufficient for Big Data Management and Processing? The current benchmarks provide some performance behavior However, do not provide any information to the designer/developer on: What is happening at the lower-layer? Where the benefits are coming from? Which design is leading to benefits or bottlenecks? Which component in the design needs to be changed and what will be its impact? Can performance gain/loss at the lower-layer be correlated to the performance gain/loss observed at the upper layer? ADMS '14 85

86 Challenges in Benchmarking of RDMA-based Designs Applications Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) Programming Models (Sockets) Benchmarks Other RDMA Protocols? Current Benchmarks Correlation? Communication and I/O Library Point-to-Point Communication Threaded Models and Synchronization Virtualization No Benchmarks I/O and File Systems QoS Fault-Tolerance Networking Technologies (InfiniBand, 1/10/40GigE and Intelligent NICs) Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators) Storage Technologies (HDD and SSD) ADMS '14 86

87 OSU MPI Micro-Benchmarks (OMB) Suite A comprehensive suite of benchmarks to Compare performance of different MPI libraries on various networks and systems Validate low-level functionalities Provide insights into the underlying MPI-level designs Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth and bi-directional bandwidth Extended later to MPI-2 one-sided Collectives GPU-aware data movement OpenSHMEM (point-to-point and collectives) UPC Has become an industry standard Extensively used for design/development of MPI libraries, performance comparison of MPI libraries and even in procurement of large-scale systems Available from the OSU site and in an integrated manner with the MVAPICH2 stack ADMS '14 87
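
The OMB latency test follows the classic two-rank ping-pong pattern. A stripped-down sketch of that pattern is shown below; it is not the osu_latency source, and the message size and iteration count are arbitrary.

```c
/* Stripped-down ping-pong latency measurement in the style of osu_latency.
 * Not the OMB source; message size and iteration count are arbitrary. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int size = 4, iters = 10000;
    char buf[4] = {0};
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)                  /* one-way latency = round trip / 2 */
        printf("%d-byte latency: %.2f us\n", size, (t1 - t0) * 1e6 / (2.0 * iters));

    MPI_Finalize();
    return 0;
}
```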

88 Iterative Process Requires Deeper Investigation and Design for Benchmarking Next Generation Big Data Systems and Applications Applications Benchmarks Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) Applications-Level Benchmarks Programming Models (Sockets) Other RDMA Protocols? Communication and I/O Library Point-to-Point Communication I/O and File Systems Threaded Models and Synchronization QoS Virtualization Fault-Tolerance Micro-Benchmarks Networking Technologies (InfiniBand, 1/10/40GigE and Intelligent NICs) Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators) Storage Technologies (HDD and SSD) ADMS '14 88

89 Future Plans of OSU High Performance Big Data Project Upcoming Releases of RDMA-enhanced Packages will support Hadoop 2.x MapReduce & RPC Spark HBase Upcoming Releases of OSU HiBD Micro-Benchmarks (OHB) will support HDFS MapReduce RPC Exploration of other components (Threading models, QoS, Virtualization, Accelerators, etc.) Advanced designs with upper-level changes and optimizations ADMS '14 89

90 Presentation Outline Overview of Modern Clusters, Interconnects and Protocols Challenges for Accelerating Data Management and Processing The High-Performance Big Data (HiBD) Project RDMA-based design for Memcached and HBase RDMA-based designs for Apache Hadoop and Spark Case studies with HDFS, MapReduce, and Spark RDMA-based MapReduce on HPC Clusters with Lustre Ongoing and Future Activities Conclusion and Q&A ADMS '14 90

91 Concluding Remarks Presented an overview of data management and processing middleware in different tiers Provided an overview of modern networking technologies Discussed challenges in accelerating Big Data middleware Presented initial designs to take advantage of InfiniBand/RDMA for Memcached, HBase, Hadoop and Spark Results are promising Many other open issues need to be solved Will enable Big Data management and processing community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner ADMS '14 91

92 Thank You! Network-Based Computing Laboratory The High-Performance Big Data Project ADMS '14 92


More information

Intra-MIC MPI Communication using MVAPICH2: Early Experience

Intra-MIC MPI Communication using MVAPICH2: Early Experience Intra-MIC MPI Communication using MVAPICH: Early Experience Sreeram Potluri, Karen Tomko, Devendar Bureddy, and Dhabaleswar K. Panda Department of Computer Science and Engineering Ohio State University

More information

Designing High-Performance MPI Collectives in MVAPICH2 for HPC and Deep Learning

Designing High-Performance MPI Collectives in MVAPICH2 for HPC and Deep Learning 5th ANNUAL WORKSHOP 209 Designing High-Performance MPI Collectives in MVAPICH2 for HPC and Deep Learning Hari Subramoni Dhabaleswar K. (DK) Panda The Ohio State University The Ohio State University E-mail:

More information

Enhancing Checkpoint Performance with Staging IO & SSD

Enhancing Checkpoint Performance with Staging IO & SSD Enhancing Checkpoint Performance with Staging IO & SSD Xiangyong Ouyang Sonya Marcarelli Dhabaleswar K. Panda Department of Computer Science & Engineering The Ohio State University Outline Motivation and

More information

The Future of Supercomputer Software Libraries

The Future of Supercomputer Software Libraries The Future of Supercomputer Software Libraries Talk at HPC Advisory Council Israel Supercomputing Conference by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2

Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2 Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2 H. Wang, S. Potluri, M. Luo, A. K. Singh, X. Ouyang, S. Sur, D. K. Panda Network-Based

More information

Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services P. Balaji, K. Vaidyanathan, S. Narravula, H. W. Jin and D. K. Panda Network Based Computing Laboratory

More information

Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters

Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

High-Performance Broadcast for Streaming and Deep Learning

High-Performance Broadcast for Streaming and Deep Learning High-Performance Broadcast for Streaming and Deep Learning Ching-Hsiang Chu chu.368@osu.edu Department of Computer Science and Engineering The Ohio State University OSU Booth - SC17 2 Outline Introduction

More information

Designing Shared Address Space MPI libraries in the Many-core Era

Designing Shared Address Space MPI libraries in the Many-core Era Designing Shared Address Space MPI libraries in the Many-core Era Jahanzeb Hashmi hashmi.29@osu.edu (NBCL) The Ohio State University Outline Introduction and Motivation Background Shared-memory Communication

More information

Exploiting HPC Technologies for Accelerating Big Data Processing and Storage

Exploiting HPC Technologies for Accelerating Big Data Processing and Storage Exploiting HPC Technologies for Accelerating Big Data Processing and Storage Talk in the 5194 class by Xiaoyi Lu The Ohio State University E-mail: luxi@cse.ohio-state.edu http://www.cse.ohio-state.edu/~luxi

More information

Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications

Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications K. Vaidyanathan, P. Lai, S. Narravula and D. K. Panda Network Based Computing Laboratory

More information

Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth

Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth by D.K. Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Outline Overview of MVAPICH2-GPU

More information

CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters

CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan and Dhabaleswar K. (DK) Panda Speaker: Sourav Chakraborty

More information

Exploiting InfiniBand and GPUDirect Technology for High Performance Collectives on GPU Clusters

Exploiting InfiniBand and GPUDirect Technology for High Performance Collectives on GPU Clusters Exploiting InfiniBand and Direct Technology for High Performance Collectives on Clusters Ching-Hsiang Chu chu.368@osu.edu Department of Computer Science and Engineering The Ohio State University OSU Booth

More information

Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems?

Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems? Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems? Sayantan Sur, Abhinav Vishnu, Hyun-Wook Jin, Wei Huang and D. K. Panda {surs, vishnu, jinhy, huanwei, panda}@cse.ohio-state.edu

More information

MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand

MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand Matthew Koop 1,2 Terry Jones 2 D. K. Panda 1 {koop, panda}@cse.ohio-state.edu trj@llnl.gov 1 Network-Based Computing Lab, The

More information

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing Talk at HPCAC Stanford Conference (Feb 18) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

RDMA for Memcached User Guide

RDMA for Memcached User Guide 0.9.5 User Guide HIGH-PERFORMANCE BIG DATA TEAM http://hibd.cse.ohio-state.edu NETWORK-BASED COMPUTING LABORATORY DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING THE OHIO STATE UNIVERSITY Copyright (c)

More information

InfiniBand Strengthens Leadership as the Interconnect Of Choice By Providing Best Return on Investment. TOP500 Supercomputers, June 2014

InfiniBand Strengthens Leadership as the Interconnect Of Choice By Providing Best Return on Investment. TOP500 Supercomputers, June 2014 InfiniBand Strengthens Leadership as the Interconnect Of Choice By Providing Best Return on Investment TOP500 Supercomputers, June 2014 TOP500 Performance Trends 38% CAGR 78% CAGR Explosive high-performance

More information

Advanced RDMA-based Admission Control for Modern Data-Centers

Advanced RDMA-based Admission Control for Modern Data-Centers Advanced RDMA-based Admission Control for Modern Data-Centers Ping Lai Sundeep Narravula Karthikeyan Vaidyanathan Dhabaleswar. K. Panda Computer Science & Engineering Department Ohio State University Outline

More information

Chelsio Communications. Meeting Today s Datacenter Challenges. Produced by Tabor Custom Publishing in conjunction with: CUSTOM PUBLISHING

Chelsio Communications. Meeting Today s Datacenter Challenges. Produced by Tabor Custom Publishing in conjunction with: CUSTOM PUBLISHING Meeting Today s Datacenter Challenges Produced by Tabor Custom Publishing in conjunction with: 1 Introduction In this era of Big Data, today s HPC systems are faced with unprecedented growth in the complexity

More information

Efficient Data Access Strategies for Hadoop and Spark on HPC Cluster with Heterogeneous Storage*

Efficient Data Access Strategies for Hadoop and Spark on HPC Cluster with Heterogeneous Storage* 216 IEEE International Conference on Big Data (Big Data) Efficient Data Access Strategies for Hadoop and Spark on HPC Cluster with Heterogeneous Storage* Nusrat Sharmin Islam, Md. Wasi-ur-Rahman, Xiaoyi

More information

High Performance MPI Support in MVAPICH2 for InfiniBand Clusters

High Performance MPI Support in MVAPICH2 for InfiniBand Clusters High Performance MPI Support in MVAPICH2 for InfiniBand Clusters A Talk at NVIDIA Booth (SC 11) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters Matthew Koop 1 Miao Luo D. K. Panda matthew.koop@nasa.gov {luom, panda}@cse.ohio-state.edu 1 NASA Center for Computational

More information

Advanced Topics in InfiniBand and High-speed Ethernet for Designing HEC Systems

Advanced Topics in InfiniBand and High-speed Ethernet for Designing HEC Systems Advanced Topics in InfiniBand and High-speed Ethernet for Designing HEC Systems A Tutorial at SC 13 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Accelerating Hadoop Applications with the MapR Distribution Using Flash Storage and High-Speed Ethernet

Accelerating Hadoop Applications with the MapR Distribution Using Flash Storage and High-Speed Ethernet WHITE PAPER Accelerating Hadoop Applications with the MapR Distribution Using Flash Storage and High-Speed Ethernet Contents Background... 2 The MapR Distribution... 2 Mellanox Ethernet Solution... 3 Test

More information

Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication

Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication Sreeram Potluri* Hao Wang* Devendar Bureddy* Ashish Kumar Singh* Carlos Rosales + Dhabaleswar K. Panda* *Network-Based

More information

High Performance Migration Framework for MPI Applications on HPC Cloud

High Performance Migration Framework for MPI Applications on HPC Cloud High Performance Migration Framework for MPI Applications on HPC Cloud Jie Zhang, Xiaoyi Lu and Dhabaleswar K. Panda {zhanjie, luxi, panda}@cse.ohio-state.edu Computer Science & Engineering Department,

More information

Solutions for Scalable HPC

Solutions for Scalable HPC Solutions for Scalable HPC Scot Schultz, Director HPC/Technical Computing HPC Advisory Council Stanford Conference Feb 2014 Leading Supplier of End-to-End Interconnect Solutions Comprehensive End-to-End

More information

Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand

Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand Qi Gao, Weikuan Yu, Wei Huang, Dhabaleswar K. Panda Network-Based Computing Laboratory Department of Computer Science & Engineering

More information

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Krishna Kandalla, Emilio P. Mancini, Sayantan Sur, and Dhabaleswar. K. Panda Department of Computer Science & Engineering,

More information

Designing High Performance Communication Middleware with Emerging Multi-core Architectures

Designing High Performance Communication Middleware with Emerging Multi-core Architectures Designing High Performance Communication Middleware with Emerging Multi-core Architectures Dhabaleswar K. (DK) Panda Department of Computer Science and Engg. The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Exploiting HPC Technologies for Accelerating Big Data Processing and Associated Deep Learning

Exploiting HPC Technologies for Accelerating Big Data Processing and Associated Deep Learning Exploiting HPC Technologies for Accelerating Big Data Processing and Associated Deep Learning Keynote Talk at Swiss Conference (April 18) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT

High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT Krishna Kandalla (1), Hari Subramoni (1), Karen Tomko (2), Dmitry Pekurovsky

More information

High Performance Design for HDFS with Byte-Addressability of NVM and RDMA

High Performance Design for HDFS with Byte-Addressability of NVM and RDMA High Performance Design for HDFS with Byte-Addressability of and RDMA Nusrat Sharmin Islam, Md Wasi-ur-Rahman, Xiaoyi Lu, and Dhabaleswar K (DK) Panda Department of Computer Science and Engineering The

More information

Efficient and Truly Passive MPI-3 RMA Synchronization Using InfiniBand Atomics

Efficient and Truly Passive MPI-3 RMA Synchronization Using InfiniBand Atomics 1 Efficient and Truly Passive MPI-3 RMA Synchronization Using InfiniBand Atomics Mingzhe Li Sreeram Potluri Khaled Hamidouche Jithin Jose Dhabaleswar K. Panda Network-Based Computing Laboratory Department

More information

Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA

Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA Pak Lui, Gilad Shainer, Brian Klaff Mellanox Technologies Abstract From concept to

More information

Designing and Modeling High-Performance MapReduce and DAG Execution Framework on Modern HPC Systems

Designing and Modeling High-Performance MapReduce and DAG Execution Framework on Modern HPC Systems Designing and Modeling High-Performance MapReduce and DAG Execution Framework on Modern HPC Systems Dissertation Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

More information

2008 International ANSYS Conference

2008 International ANSYS Conference 2008 International ANSYS Conference Maximizing Productivity With InfiniBand-Based Clusters Gilad Shainer Director of Technical Marketing Mellanox Technologies 2008 ANSYS, Inc. All rights reserved. 1 ANSYS,

More information

Communication Frameworks for HPC and Big Data

Communication Frameworks for HPC and Big Data Communication Frameworks for HPC and Big Data Talk at HPC Advisory Council Spain Conference (215) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures

Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures Talk at Bench 2018 by Xiaoyi Lu The Ohio State University E-mail: luxi@cse.ohio-state.edu http://www.cse.ohio-state.edu/~luxi

More information

Memory Management Strategies for Data Serving with RDMA

Memory Management Strategies for Data Serving with RDMA Memory Management Strategies for Data Serving with RDMA Dennis Dalessandro and Pete Wyckoff (presenting) Ohio Supercomputer Center {dennis,pw}@osc.edu HotI'07 23 August 2007 Motivation Increasing demands

More information

NFS/RDMA over 40Gbps iwarp Wael Noureddine Chelsio Communications

NFS/RDMA over 40Gbps iwarp Wael Noureddine Chelsio Communications NFS/RDMA over 40Gbps iwarp Wael Noureddine Chelsio Communications Outline RDMA Motivating trends iwarp NFS over RDMA Overview Chelsio T5 support Performance results 2 Adoption Rate of 40GbE Source: Crehan

More information

High-Performance Key-Value Store on OpenSHMEM

High-Performance Key-Value Store on OpenSHMEM High-Performance Key-Value Store on OpenSHMEM Huansong Fu*, Manjunath Gorentla Venkata, Ahana Roy Choudhury*, Neena Imam, Weikuan Yu* *Florida State University Oak Ridge National Laboratory Outline Background

More information

Accelerating HPL on Heterogeneous GPU Clusters

Accelerating HPL on Heterogeneous GPU Clusters Accelerating HPL on Heterogeneous GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Outline

More information

Mellanox Technologies Maximize Cluster Performance and Productivity. Gilad Shainer, October, 2007

Mellanox Technologies Maximize Cluster Performance and Productivity. Gilad Shainer, October, 2007 Mellanox Technologies Maximize Cluster Performance and Productivity Gilad Shainer, shainer@mellanox.com October, 27 Mellanox Technologies Hardware OEMs Servers And Blades Applications End-Users Enterprise

More information

Infiniband and RDMA Technology. Doug Ledford

Infiniband and RDMA Technology. Doug Ledford Infiniband and RDMA Technology Doug Ledford Top 500 Supercomputers Nov 2005 #5 Sandia National Labs, 4500 machines, 9000 CPUs, 38TFlops, 1 big headache Performance great...but... Adding new machines problematic

More information

Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters

Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters Jie Zhang Dr. Dhabaleswar K. Panda (Advisor) Department of Computer Science & Engineering The

More information

Memcached Design on High Performance RDMA Capable Interconnects

Memcached Design on High Performance RDMA Capable Interconnects 211 International Conference on Parallel Processing Memcached Design on High Performance RDMA Capable Interconnects Jithin Jose, Hari Subramoni, Miao Luo, Minjia Zhang, Jian Huang, Md. Wasi-ur-Rahman,

More information

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Mohammadreza Bayatpour, Hari Subramoni, D. K.

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Mohammadreza Bayatpour, Hari Subramoni, D. K. Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Mohammadreza Bayatpour, Hari Subramoni, D. K. Panda Department of Computer Science and Engineering The Ohio

More information

DB2 purescale: High Performance with High-Speed Fabrics. Author: Steve Rees Date: April 5, 2011

DB2 purescale: High Performance with High-Speed Fabrics. Author: Steve Rees Date: April 5, 2011 DB2 purescale: High Performance with High-Speed Fabrics Author: Steve Rees Date: April 5, 2011 www.openfabrics.org IBM 2011 Copyright 1 Agenda Quick DB2 purescale recap DB2 purescale comes to Linux DB2

More information

High Performance MPI on IBM 12x InfiniBand Architecture

High Performance MPI on IBM 12x InfiniBand Architecture High Performance MPI on IBM 12x InfiniBand Architecture Abhinav Vishnu, Brad Benton 1 and Dhabaleswar K. Panda {vishnu, panda} @ cse.ohio-state.edu {brad.benton}@us.ibm.com 1 1 Presentation Road-Map Introduction

More information

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning Ammar Ahmad Awan, Khaled Hamidouche, Akshay Venkatesh, and Dhabaleswar K. Panda Network-Based Computing Laboratory Department

More information

The Exascale Architecture

The Exascale Architecture The Exascale Architecture Richard Graham HPC Advisory Council China 2013 Overview Programming-model challenges for Exascale Challenges for scaling MPI to Exascale InfiniBand enhancements Dynamically Connected

More information