Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing


1 Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing Talk at HPCAC Stanford Conference (Feb 18) by Dhabaleswar K. (DK) Panda The Ohio State University

2 Increasing Usage of HPC, Big Data and Deep Learning: HPC (MPI, RDMA, Lustre, etc.), Big Data (Hadoop, Spark, HBase, Memcached, etc.), and Deep Learning (Caffe, TensorFlow, BigDL, etc.). Convergence of HPC, Big Data, and Deep Learning! Increasing need to run these applications on the cloud!!

3 Focus of Yesterday's Talk: HPC, Deep Learning, and Cloud. Traditional HPC: Message Passing Interface (MPI), including MPI + OpenMP; support for PGAS and MPI + PGAS (OpenSHMEM, UPC); exploiting accelerators. Deep Learning: Caffe and CNTK. Cloud for HPC: virtualization with SR-IOV and containers.

4 Focus of Today's Talk: HPC, Big Data, Deep Learning, and Cloud. Big Data/Enterprise/Commercial Computing: Spark and Hadoop (HDFS, HBase, MapReduce); Memcached for Web 2.0; Kafka for stream processing. Deep Learning: TensorFlow with gRPC; Deep Learning over Big Data (Caffe on Spark, TensorFlow on Spark, ...). Cloud Support for Big Data Stacks: virtualization with SR-IOV and containers; Swift Object Store.

5 Big Velocity: How Much Data Is Generated Every Minute on the Internet? The global Internet population grew 7.5% from 2016 and now represents 3.7 billion people.

6 Data Management and Processing on Modern Clusters. Substantial impact on designing and utilizing data management and processing systems in multiple tiers: front-end data accessing and serving (online) with Memcached + DB (e.g., MySQL) and HBase; back-end data analytics (offline) with HDFS, MapReduce, and Spark.

7 Deep Learning over Big Data (DLoBD). DLoBD is one of the most efficient analytics paradigms: more and more deep learning tools and libraries (e.g., Caffe, TensorFlow) are starting to run over big data stacks such as Apache Hadoop and Spark. Benefits of the DLoBD approach: easily build a powerful data analytics pipeline (e.g., the Flickr DL/ML pipeline described in "How Deep Learning Powers Flickr": (1) prepare, (2) deep learning, (3) non-deep learning, (4) apply ML); better data locality; efficient resource sharing and cost effectiveness.

8 Not Only in Internet Services: Big Data in Scientific Domains. Scientific data management, analysis, and visualization. Application examples: climate modeling, combustion, fusion, astrophysics, bioinformatics. Data-intensive tasks: run large-scale simulations on supercomputers, dump data on parallel storage systems, collect experimental/observational data, and move experimental/observational data to analysis sites; visual analytics help understand data visually.

9 Drivers of Modern HPC Cluster and Data Center Architecture: multi-/many-core processors; high-performance interconnects such as InfiniBand (with SR-IOV) offering <1 usec latency and 200 Gbps bandwidth; Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE); Single Root I/O Virtualization (SR-IOV); SSDs, NVMe-SSDs, NVM/NVRAM, parallel filesystems, and object storage clusters; accelerators/coprocessors (NVIDIA GPGPUs and Intel Xeon Phi) with high compute density and high performance/watt (>1 TFlop DP on a chip). Examples: SDSC Comet, TACC Stampede, Cloud.

10 Interconnects and Protocols in OpenFabrics Stack for HPC. The stack exposes applications/middleware to either the sockets interface (over the kernel TCP/IP stack or user-space libraries) or the verbs interface (RDMA). Options include 1/10/40/100 GigE with kernel TCP/IP, 10/40 GigE with TCP offload engines (TOE), IPoIB over InfiniBand, user-space RSockets and SDP over InfiniBand, iWARP over Ethernet, RoCE, and native InfiniBand verbs (IB Native), each mapped to the corresponding adapter and switch hardware.

11 How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data Applications? What are the major bottlenecks in current Big Data processing middleware (e.g., Hadoop, Spark, and Memcached)? Can these bottlenecks be alleviated with new designs that take advantage of HPC technologies? Can RDMA-enabled high-performance interconnects benefit Big Data processing? Can HPC clusters with high-performance storage systems (e.g., SSDs, parallel file systems) benefit Big Data applications? How much performance benefit can be achieved through enhanced designs? How should benchmarks be designed for evaluating the performance of Big Data middleware on HPC clusters? Goal: bring HPC and Big Data processing onto a convergent trajectory!

14 Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure? [Figure: Hadoop, Deep Learning, and Spark jobs running on the same HPC infrastructure.]

15 Designing Communication and I/O Libraries for Big Data Systems: Challenges. Applications run over Big Data middleware (HDFS, MapReduce, HBase, Spark, and Memcached) and programming models (sockets today; other protocols?), evaluated with benchmarks; upper-level changes may be needed. The communication and I/O library underneath must address point-to-point communication, threaded models and synchronization, virtualization (SR-IOV), I/O and file systems, QoS and fault tolerance, and performance tuning. It sits on top of commodity computing system architectures (multi- and many-core architectures and accelerators), networking technologies (InfiniBand, 1/10/40/100 GigE, and intelligent NICs), and storage technologies (HDD, SSD, NVM, and NVMe-SSD).

16 The High-Performance Big Data (HiBD) Project: RDMA for Apache Spark; RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x), with plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions; RDMA for Apache HBase; RDMA for Memcached (RDMA-Memcached); RDMA for Apache Hadoop 1.x (RDMA-Hadoop); OSU HiBD-Benchmarks (OHB), with HDFS, Memcached, HBase, and Spark micro-benchmarks. Available for InfiniBand and RoCE; also runs on Ethernet. User base: 275 organizations from 34 countries; more than 25,000 downloads from the project site.

17 HiBD Release Timeline and Downloads. [Chart: cumulative number of downloads over time, annotated with release points for RDMA-Hadoop 1.x, RDMA-Memcached & OHB, RDMA-Hadoop 2.x, RDMA-Spark, and RDMA-HBase.]

18 RDMA for Apache Hadoop 2.x Distribution: high-performance design of Hadoop over RDMA-enabled interconnects. RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for the HDFS, MapReduce, and RPC components; enhanced HDFS with in-memory and heterogeneous storage; high-performance design of MapReduce over Lustre; Memcached-based burst buffer for MapReduce over Lustre-integrated HDFS (HHH-L-BB mode); plugin-based architecture supporting RDMA-based designs for Apache Hadoop, CDH, and HDP; support for OpenPOWER. Current release: based on Apache Hadoop; compliant with Apache Hadoop 2.8.0, HDP, and CDH APIs and applications; tested with Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR), RoCE support with Mellanox adapters, various multi-core platforms (x86, POWER), and different file systems with disks, SSDs, and Lustre.

19 Different Modes of RDMA for Apache Hadoop 2.x. HHH: heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation for better fault tolerance as well as performance; this mode is enabled by default in the package. HHH-M: a high-performance in-memory based setup that performs all I/O operations in memory to obtain as much performance benefit as possible. HHH-L: with parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster. HHH-L-BB: this mode deploys a Memcached-based burst buffer system to reduce the bandwidth bottleneck of shared file system access; the burst buffer design is hosted by Memcached servers, each of which has a local SSD. MapReduce over Lustre, with/without local disks: besides the HDFS-based solutions, this package also supports running MapReduce jobs on top of Lustre alone, in two modes: with local disks and without local disks. Running with Slurm and PBS: supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in the different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).

20 RDMA for Apache Spark Distribution: high-performance design of Spark over RDMA-enabled interconnects. RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for Spark; RDMA-based data shuffle with a SEDA-based shuffle architecture; non-blocking and chunk-based data transfer; off-JVM-heap buffer management; support for OpenPOWER; easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB). Current release: based on Apache Spark; tested with Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR), RoCE support with Mellanox adapters, various multi-core platforms (x86, POWER), and RAM disks, SSDs, and HDDs.

21 RDMA for Apache HBase Distribution: high-performance design of HBase over RDMA-enabled interconnects. RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for HBase; compliant with Apache HBase APIs and applications; on-demand connection setup; easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB). Current release: based on Apache HBase; tested with Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR), RoCE support with Mellanox adapters, and various multi-core platforms.
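Because the package is compliant with the Apache HBase client APIs, an existing HBase application should not need source changes to run on it. The minimal sketch below uses the standard HBase Java client; the table name "usertable" and column family "cf" are placeholders that must already exist, and nothing here is specific to the RDMA package itself.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        // Standard HBase client configuration (hbase-site.xml on the classpath);
        // the RDMA-enabled package is API-compliant, so no changes here.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("usertable"))) {

            // Put a single cell: row "row1", column family "cf", qualifier "q1".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q1"),
                          Bytes.toBytes("value1"));
            table.put(put);

            // Read the same cell back with a Get.
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q1"));
            System.out.println("Read back: " + Bytes.toString(value));
        }
    }
}
```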

22 RDMA for Memcached Distribution: high-performance design of Memcached over RDMA-enabled interconnects. RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for the Memcached and libmemcached components; high-performance design of SSD-assisted hybrid memory; non-blocking libmemcached set/get API extensions; support for burst-buffer mode in the Lustre-integrated design of HDFS in RDMA for Apache Hadoop 2.x; easily configurable for native InfiniBand, RoCE, and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB). Current release: based on Memcached and libmemcached; compliant with libmemcached APIs and applications; tested with Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR), RoCE support with Mellanox adapters, various multi-core platforms, and SSDs.
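The OSU extensions live in the C libmemcached library; the sketch below only illustrates the standard blocking set/get usage pattern from Java using the independent spymemcached client over the sockets-compatible path. The host name "memcached-host" is a placeholder, and the verbs/RDMA path itself would instead require the modified libmemcached client.

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class MemcachedSetGetExample {
    public static void main(String[] args) throws Exception {
        // Connect to a Memcached server on its standard port; this works over
        // the traditional sockets/IPoIB support described above.
        MemcachedClient client =
            new MemcachedClient(new InetSocketAddress("memcached-host", 11211));

        // Blocking-style set: store "value" under "key" with no expiration,
        // then wait on the returned future until the store completes.
        client.set("key", 0, "value").get();

        // Blocking get.
        Object value = client.get("key");
        System.out.println("GET key -> " + value);

        client.shutdown();
    }
}
```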

23 OSU HiBD Micro-Benchmark (OHB) Suite: HDFS, Memcached, HBase, and Spark. Micro-benchmarks for the Hadoop Distributed File System (HDFS): Sequential Write Latency (SWL), Sequential Read Latency (SRL), Random Read Latency (RRL), Sequential Write Throughput (SWT), and Sequential Read Throughput (SRT) benchmarks; supports benchmarking of Apache Hadoop 1.x and 2.x HDFS, Hortonworks Data Platform (HDP) HDFS, and Cloudera Distribution of Hadoop (CDH) HDFS. Micro-benchmarks for Memcached: Get, Set, and Mixed Get/Set benchmarks, Non-Blocking API Latency benchmark, Hybrid Memory Latency benchmark; Yahoo! Cloud Serving Benchmark (YCSB) extension for RDMA-Memcached. Micro-benchmarks for HBase: Get Latency and Put Latency benchmarks. Micro-benchmarks for Spark: GroupBy, SortBy.

24 HPCAC-Stanford (Feb 18) 24 Using HiBD Packages on Existing HPC Infrastructure

26 HiBD Packages on SDSC Comet and Chameleon Cloud. RDMA for Apache Hadoop 2.x and RDMA for Apache Spark are installed and available on SDSC Comet. Examples for various modes of usage are available in /share/apps/examples/hadoop (RDMA for Apache Hadoop 2.x) and /share/apps/examples/spark/ (RDMA for Apache Spark). Please reach out (referencing Comet as the machine and SDSC as the site) if you have any further questions about usage and configuration. RDMA for Apache Hadoop is also available on Chameleon Cloud as an appliance. M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE 16, July 2016.

27 Acceleration Case Studies and Performance Evaluation. Basic Designs for HiBD Packages: HDFS and MapReduce, Spark, Memcached, Kafka. Deep Learning: TensorFlow with gRPC; Deep Learning over Big Data (Caffe on Spark, TensorFlow on Spark, ...). Cloud Support for Big Data Stacks: Virtualization with SR-IOV and Containers; Swift Object Store.

28 Design Overview of HDFS with RDMA. Applications use the standard HDFS write path; the OSU design adds a verbs-based path alongside the Java socket interface, bridged through the Java Native Interface (JNI). Design features: RDMA-based HDFS write; RDMA-based HDFS replication; parallel replication support; on-demand connection setup; InfiniBand/RoCE support over RDMA-capable networks (IB, iWARP, RoCE), while 1/10/40/100 GigE and IPoIB networks remain supported through the traditional socket interface. The JNI layer bridges the Java-based HDFS with the communication library written in native code. N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012. N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014.
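Because the RDMA-enhanced write and replication paths sit behind the standard HDFS client API, a plain client write like the minimal sketch below is what gets accelerated; the code itself is ordinary Hadoop FileSystem usage, with the output path chosen arbitrarily for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Standard HDFS client write; with the RDMA-enhanced HDFS the same
        // call path is used, and the verbs-based transport is picked up
        // underneath via JNI without application changes.
        Configuration conf = new Configuration();  // reads core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/tmp/rdma-hdfs-demo.txt"))) {
            byte[] block = new byte[4 * 1024];      // 4 KB writes, repeated
            for (int i = 0; i < 1024; i++) {
                out.write(block);                   // data is pipelined to the DataNodes
            }
            out.hsync();                            // flush to the replicas
        }
    }
}
```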

29 Enhanced HDFS with In-Memory and Heterogeneous Storage (Triple-H). Architecture: data placement policies, hybrid replication, and eviction/promotion over heterogeneous storage (RAM disk, SSD, HDD, Lustre). Design features: three modes, Default (HHH), In-Memory (HHH-M), and Lustre-Integrated (HHH-L); policies to efficiently utilize the heterogeneous storage devices (RAM, SSD, HDD, Lustre); eviction/promotion based on data usage pattern; hybrid replication; Lustre-Integrated mode with Lustre-based fault tolerance. N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid 15, May 2015.

30 Design Overview of MapReduce with RDMA. The Job Tracker and Task Trackers (Map/Reduce) use the Java socket interface over 1/10/40/100 GigE and IPoIB networks, while the OSU design adds a verbs-based path through JNI over RDMA-capable networks (IB, iWARP, RoCE). Design features: RDMA-based shuffle; prefetching and caching of map output; efficient shuffle algorithms; in-memory merge; on-demand shuffle adjustment; advanced overlapping of map, shuffle, and merge, as well as shuffle, merge, and reduce; on-demand connection setup; InfiniBand/RoCE support. Enables high-performance RDMA communication while supporting the traditional socket interface; the JNI layer bridges the Java-based MapReduce with the communication library written in native code. M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014.
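The MapReduce programming interfaces are unchanged; shuffle-heavy jobs are where the RDMA-based shuffle and map/shuffle/merge/reduce overlapping apply. The standard WordCount driver below is such a job (its map output is shuffled to the reducers) and runs unmodified on either the stock or the RDMA-enhanced package; input and output paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);   // map output is what gets shuffled
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```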

31 Performance Numbers of RDMA for Apache Hadoop 2.x: RandomWriter and TeraGen on OSU-RI2 (EDR), cluster with 8 nodes and a total of 64 maps, IPoIB (EDR) vs. OSU-IB (EDR). RandomWriter: execution time reduced by 3x over IPoIB for the tested GB-scale data sizes. TeraGen: execution time reduced by 4x over IPoIB for the tested GB-scale data sizes.

32 Performance Numbers of RDMA for Apache Hadoop 2.x: Sort and TeraSort on OSU-RI2 (EDR), IPoIB (EDR) vs. OSU-IB (EDR). Sort (8 nodes, 64 maps, 14 reduces): execution time reduced by 61% over IPoIB for the tested GB-scale data sizes. TeraSort (8 nodes, 64 maps, 32 reduces): execution time reduced by 18% over IPoIB for the tested GB-scale data sizes.

33 Evaluation with Spark on SDSC Gordon (HHH vs. Tachyon/Alluxio), IPoIB (QDR) vs. Tachyon vs. OSU-IB (QDR), cluster size : data size of 8:50, 16:100, and 32:200 (GB). Spark-TeraGen: HHH has 2.4x improvement over Tachyon and 2.3x over HDFS-IPoIB (QDR) for 200 GB TeraGen on 32 nodes. Spark-TeraSort: HHH has 25.2% improvement over Tachyon and 17% over HDFS-IPoIB (QDR). N. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In-Memory File Systems for Hadoop and Spark Applications on HPC Clusters, IEEE BigData 15, October 2015.

34 Acceleration Case Studies and Performance Evaluation. Basic Designs for HiBD Packages: HDFS and MapReduce, Spark, Memcached, Kafka. Deep Learning: TensorFlow with gRPC; Deep Learning over Big Data (Caffe on Spark, TensorFlow on Spark, ...). Cloud Support for Big Data Stacks: Virtualization with SR-IOV and Containers; Swift Object Store.

35 Design Overview of Spark with RDMA. Apache Spark benchmarks, applications, libraries, and frameworks run on Spark Core, whose Shuffle Manager (Sort, Hash, Tungsten-Sort) uses a Block Transfer Service with Netty, NIO, or RDMA-plugin servers and clients; the RDMA path goes through JNI to a native RDMA-based communication engine over RDMA-capable networks (IB, iWARP, RoCE), while the Java socket interface continues to serve 1/10/40/100 GigE and IPoIB networks. Design features: RDMA-based shuffle plugin; SEDA-based architecture; dynamic connection management and sharing; non-blocking data transfer; off-JVM-heap buffer management; InfiniBand/RoCE support. Enables high-performance RDMA communication while supporting the traditional socket interface; the JNI layer bridges the Scala-based Spark with the communication library written in native code. X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014. X. Lu, D. Shankar, S. Gugnani, and D. K. Panda, High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads, IEEE BigData 16, Dec 2016.
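A minimal sketch of the kind of shuffle-intensive workload (GroupBy/SortBy, as in the evaluation that follows) whose block transfers would flow through the RDMA shuffle plugin once it is enabled in the cluster's Spark configuration. This is plain Spark Java API code with made-up data; nothing in the application references the plugin directly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ShuffleHeavyJob {
    public static void main(String[] args) {
        // Standard Spark application; when the RDMA shuffle plugin is enabled
        // cluster-wide, the shuffle traffic generated by groupByKey/sortByKey
        // below goes over the verbs-based engine instead of Netty/NIO.
        SparkConf conf = new SparkConf().setAppName("ShuffleHeavyJob");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            List<Tuple2<Integer, Integer>> data = new ArrayList<>();
            Random rnd = new Random(42);
            for (int i = 0; i < 1_000_000; i++) {
                data.add(new Tuple2<>(rnd.nextInt(10_000), i));
            }
            JavaPairRDD<Integer, Integer> pairs =
                sc.parallelizePairs(data, 64);          // 64 partitions

            long groups = pairs.groupByKey().count();   // GroupBy-style shuffle
            long sorted = pairs.sortByKey().count();    // SortBy-style shuffle
            System.out.println("groups=" + groups + " sorted=" + sorted);
        }
    }
}
```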

36 Performance Evaluation on SDSC Comet: SortBy/GroupBy. InfiniBand FDR, SSD, 64 worker nodes, 1536 cores (1536 maps, 1536 reduces); RDMA vs. IPoIB with 1536 concurrent tasks, single SSD per node. SortBy: total time reduced by up to 80% over IPoIB (56 Gbps). GroupBy: total time reduced by up to 74% over IPoIB (56 Gbps).

37 Performance Evaluation on SDSC Comet: HiBench PageRank (Huge, BigData, and Gigantic data sizes). InfiniBand FDR, SSD, 32/64 worker nodes, 768/1536 cores (768/1536 maps, 768/1536 reduces); RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node. 32 nodes/768 cores: total time reduced by 37% over IPoIB (56 Gbps). 64 nodes/1536 cores: total time reduced by 43% over IPoIB (56 Gbps).

38 Application Evaluation on SDSC Comet. Kira Toolkit: a distributed astronomy image-processing toolkit implemented using Apache Spark; the source extractor application processed a 65 GB dataset from the SDSS DR2 survey comprising 11,150 image files on 48 cores, with RDMA Spark reducing execution time compared to Apache Spark over IPoIB. BigDL: a distributed deep learning tool using Apache Spark; training the VGG model on the CIFAR-10 dataset showed up to 4.58x improvement in per-epoch time with RDMA over IPoIB as the core count increased. M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE 16, July 2016.

39 Accelerating Hadoop and Spark with RDMA over OpenPOWER. [Figure: IBM POWER8 architecture with two sockets of SMT cores, DDR4 memory controllers, chip interconnect, and PCIe-attached IB HCAs.] Challenges: what are the performance characteristics of RDMA-based Hadoop and Spark over OpenPOWER systems with IB networks? Can RDMA-based Big Data middleware demonstrate good performance benefits on OpenPOWER systems? Are there new accelerations for Big Data middleware on OpenPOWER systems to attain better benefits? Results: for the Spark SortBy benchmark (32 GB to 96 GB), RDMA-Opt outperforms IPoIB and RDMA-Def by 23.5% and 10.5%, respectively. X. Lu, H. Shi, D. Shankar and D. K. Panda, Performance Characterization and Acceleration of Big Data Workloads on OpenPOWER System, IEEE BigData.

40 Acceleration Case Studies and Performance Evaluation. Basic Designs for HiBD Packages: HDFS and MapReduce, Spark, Memcached, Kafka. Deep Learning: TensorFlow with gRPC; Deep Learning over Big Data (Caffe on Spark, TensorFlow on Spark, ...). Cloud Support for Big Data Stacks: Virtualization with SR-IOV and Containers; Swift Object Store.

41 Memcached Performance (FDR Interconnect), experiments on TACC Stampede (Intel SandyBridge cluster, IB: FDR). Memcached GET latency: OSU-IB (FDR) reduces latency by nearly 20X over IPoIB across message sizes; for 4-byte values OSU-IB achieves 2.84 us and for 2K-byte values 4.49 us (the corresponding IPoIB latencies, per the chart, are roughly 20X higher). Memcached throughput (4-byte values): with 4080 clients, OSU-IB achieves 556 Kops/sec vs. 233 Kops/sec for IPoIB, nearly a 2X improvement in throughput. J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP 11. J. Jose, H. Subramoni, K. Kandalla, M. W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached design for InfiniBand Clusters using Hybrid Transport, CCGrid 12.
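A rough client-side sketch of how a small-value GET latency number like the ones above can be measured: store a 4-byte value, warm up, then average many timed GETs. This is not the OSU benchmark suite and uses the spymemcached Java client over sockets rather than the RDMA-enabled libmemcached; "memcached-host" is a placeholder.

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class GetLatencyProbe {
    public static void main(String[] args) throws Exception {
        MemcachedClient client =
            new MemcachedClient(new InetSocketAddress("memcached-host", 11211));
        client.set("lat-key", 0, "abcd").get();          // 4-byte payload

        final int warmup = 1_000, iters = 10_000;
        for (int i = 0; i < warmup; i++) client.get("lat-key");   // warm caches/connection

        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) client.get("lat-key");    // timed GETs
        long elapsed = System.nanoTime() - start;

        System.out.printf("avg GET latency: %.2f us%n", (elapsed / 1e3) / iters);
        client.shutdown();
    }
}
```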

42 Micro-benchmark Evaluation for OLDP Workloads, illustrated with a Read-Cache-Read access pattern using a modified mysqlslap load-testing tool, Memcached-IPoIB (32Gbps) vs. Memcached-RDMA (32Gbps), for increasing numbers of clients. Memcached-RDMA can improve query latency by up to 66% over IPoIB (32Gbps) and throughput by up to 69% over IPoIB (32Gbps). D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, Can RDMA Benefit On-Line Data Processing Workloads with Memcached and MySQL, ISPASS 15.

43 Performance Evaluation with Non-Blocking Memcached API. Set/Get latency is broken down into miss penalty (backend DB access overhead), client wait, server response, cache update, cache check+load (memory and/or SSD read), and slab allocation (with SSD write on out-of-memory). Configurations: IPoIB-Mem, RDMA-Mem, H-RDMA-Def, H-RDMA-Opt-Block, H-RDMA-Opt-NonB-i, and H-RDMA-Opt-NonB-b, where H = hybrid Memcached over SATA SSD, Opt = adaptive slab manager, Block = default blocking API, NonB-i = non-blocking iset/iget API, and NonB-b = non-blocking bset/bget API with buffer re-use guarantee. When data does not fit in memory: the non-blocking Memcached set/get API extensions achieve >16x latency improvement vs. the blocking API over RDMA-Hybrid/RDMA-Mem with miss penalty, and >2.5x throughput improvement vs. the blocking API over default/optimized RDMA-Hybrid. When data fits in memory: the non-blocking extensions perform similarly to RDMA-Mem/RDMA-Hybrid and give >3.6x improvement over IPoIB-Mem.
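The OSU iset/iget/bset/bget extensions are C-level libmemcached APIs; the sketch below only mirrors the issue-overlap-reap pattern they enable, using the asynchronous futures of the unrelated spymemcached Java client. The point is that the application issues operations without blocking, overlaps other work, and collects results only when needed.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Future;
import net.spy.memcached.MemcachedClient;

public class NonBlockingPattern {
    public static void main(String[] args) throws Exception {
        MemcachedClient client =
            new MemcachedClient(new InetSocketAddress("memcached-host", 11211));

        // Issue a batch of non-blocking sets without waiting on each one.
        List<Future<Boolean>> pending = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            pending.add(client.set("key-" + i, 0, "value-" + i));
        }

        // ... the application can overlap computation here while I/O progresses ...

        // Reap completions only when the results are actually needed.
        for (Future<Boolean> f : pending) {
            f.get();
        }

        // Non-blocking get: issue now, consume later.
        Future<Object> getFuture = client.asyncGet("key-0");
        System.out.println("key-0 -> " + getFuture.get());
        client.shutdown();
    }
}
```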

44 RDMA-Kafka: High-Performance Message Broker for Streaming Workloads. Kafka producer benchmark (1 broker, 4 producers) run on the OSU-RI2 cluster (2.4 GHz, 28 cores, InfiniBand EDR, 512 GB RAM, 400 GB SSD), RDMA vs. IPoIB across record sizes. Up to 98% improvement in latency compared to IPoIB and up to 7x increase in throughput over IPoIB. H. Javed, X. Lu, and D. K. Panda, Characterization of Big Data Stream Processing Pipeline: A Case Study using Flink and Kafka, BDCAT 17.
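For reference, a producer benchmark of this shape repeatedly sends fixed-size records to a broker. The minimal sketch below uses the standard Apache Kafka Java producer and contains nothing RDMA-specific; the broker address "broker-host:9092", topic "bench-topic", and 1 KB record size are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        // Standard Apache Kafka producer configuration.
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-host:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.ByteArraySerializer");

        byte[] record = new byte[1024];          // fixed 1 KB record payload
        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 100_000; i++) {
                producer.send(new ProducerRecord<>("bench-topic", "key-" + i, record));
            }
            producer.flush();                    // wait for in-flight sends to complete
        }
    }
}
```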

45 Acceleration Case Studies and Performance Evaluation. Basic Designs for HiBD Packages: HDFS and MapReduce, Spark, Memcached, Kafka. Deep Learning: TensorFlow with gRPC; Deep Learning over Big Data (Caffe on Spark, TensorFlow on Spark, ...). Cloud Support for Big Data Stacks: Virtualization with SR-IOV and Containers; Swift Object Store.

46 Overview of RDMA-gRPC with TensorFlow. Each TensorFlow worker (e.g., /job:ps/task:0 and /job:worker/task:0, each with CPU and GPU resources) runs an RDMA-gRPC server/client, and the client/master communicates with them in the same way; worker services communicate among each other using RDMA-gRPC.
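A minimal sketch of the request/response pattern that gRPC-based worker services exchange, written with the standard grpc-java API rather than the RDMA-gRPC/AR-gRPC design. It assumes EchoGrpc, EchoRequest, and EchoReply were generated by protoc from a hypothetical echo.proto (rpc Echo(EchoRequest) returns (EchoReply), with a string field "payload"); those names are not from the talk or from TensorFlow.

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.Server;
import io.grpc.ServerBuilder;
import io.grpc.stub.StreamObserver;

public class GrpcEchoSketch {
    // Hypothetical service implementation over classes generated from echo.proto.
    static class EchoService extends EchoGrpc.EchoImplBase {
        @Override
        public void echo(EchoRequest req, StreamObserver<EchoReply> resp) {
            resp.onNext(EchoReply.newBuilder().setPayload(req.getPayload()).build());
            resp.onCompleted();
        }
    }

    public static void main(String[] args) throws Exception {
        // Start a server and issue one blocking RPC from the same JVM for
        // illustration; TensorFlow's worker services use the same pattern
        // between processes, with RDMA-gRPC replacing the transport.
        Server server = ServerBuilder.forPort(50051)
                                     .addService(new EchoService())
                                     .build()
                                     .start();
        ManagedChannel channel = ManagedChannelBuilder
                .forAddress("localhost", 50051)
                .usePlaintext()
                .build();
        EchoGrpc.EchoBlockingStub stub = EchoGrpc.newBlockingStub(channel);

        EchoReply reply = stub.echo(
                EchoRequest.newBuilder().setPayload("tensor-chunk").build());
        System.out.println("echoed: " + reply.getPayload());

        channel.shutdownNow();
        server.shutdownNow();
    }
}
```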

47 Performance Benefits for RDMA-gRPC with Micro-Benchmark: RPC latency of default gRPC vs. AR-gRPC (OSU design) on SDSC-Comet-FDR for small (up to 8K bytes), medium (16K-512K bytes), and large (1M-8M bytes) payloads. Up to 2.7x latency speedup over IPoIB for small messages, up to 2.8x for medium messages, and up to 2.5x for large messages. R. Biswas, X. Lu, and D. K. Panda, Characterizing and Accelerating TensorFlow: Can Adaptive and RDMA-based RPC Benefit It? Under Review.

48 Performance Benefit for TensorFlow on an IB EDR cluster with 4 nodes (1 PS, 3 workers), comparing default gRPC, gRPC + Verbs, gRPC + MPI, and AR-gRPC. Up to 19% performance speedup over IPoIB for sigmoid net (20 epochs); up to 35% and 31% performance speedup over IPoIB for the resnet50 and inception3 CNNs (batch size 8).

49 Acceleration Case Studies and Performance Evaluation. Basic Designs for HiBD Packages: HDFS and MapReduce, Spark, Memcached, Kafka. Deep Learning: TensorFlow with gRPC; Deep Learning over Big Data (Caffe on Spark, TensorFlow on Spark, ...). Cloud Support for Big Data Stacks: Virtualization with SR-IOV and Containers; Swift Object Store.

50 Overview of DLoBD Stacks. Layers of DLoBD stacks: deep learning application layer, deep learning library layer, big data analytics framework layer, resource scheduler layer, distributed file system layer, and hardware resource layer. How much performance benefit can we achieve for end deep learning applications, and where does the stack deliver sub-optimal performance?

51 Performance Evaluation for IPoIB and RDMA with CaffeOnSpark and TensorFlowOnSpark (IB EDR). CaffeOnSpark (CIFAR-10 Quick and LeNet on MNIST, GPU and GPU-cuDNN configurations, 2/4/8 GPUs): end-to-end time reductions of roughly 13.3% to 45.6% with RDMA over IPoIB; CaffeOnSpark benefits from the high performance of RDMA compared to IPoIB once communication overhead becomes significant. TensorFlowOnSpark: 33.01% reduction for CIFAR10 (2 GPUs/node) but only 4.9% for MNIST (1 GPU/node); our experiments show that the default RDMA design in TensorFlowOnSpark is not fully optimized yet, and for the MNIST tests RDMA is not showing obvious benefits.

52 Epoch-Level Evaluation with BigDL on SDSC Comet: training the VGG model on CIFAR-10 using BigDL on default Spark with IPoIB and on our RDMA-based Spark, comparing accumulative epoch time and accuracy. The RDMA version consistently takes less time than the IPoIB version to finish every epoch; RDMA finishes epoch 18 about 2.6x faster than IPoIB.

53 Acceleration Case Studies and Performance Evaluation. Basic Designs for HiBD Packages: HDFS and MapReduce, Spark, Memcached, Kafka. Deep Learning: TensorFlow with gRPC; Deep Learning over Big Data (Caffe on Spark, TensorFlow on Spark, ...). Cloud Support for Big Data Stacks: Virtualization with SR-IOV and Containers; Swift Object Store.

54 Overview of RDMA-Hadoop-Virt Architecture. Big Data applications (CloudBurst, MR-MS Polygraph, and others) run on virtual machines, containers, or bare-metal nodes. Virtualization-aware modules are added in all four main Hadoop components: HDFS (virtualization-aware block management to improve fault tolerance), YARN (extensions to the container allocation policy to reduce network traffic), MapReduce (extensions to the map task scheduling policy to reduce network traffic), and Hadoop Common (a topology detection module for automatic topology detection). Communications in HDFS, MapReduce, and RPC go through RDMA-based designs over SR-IOV enabled InfiniBand. S. Gugnani, X. Lu, D. K. Panda, Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-enabled Clouds, CloudCom.

55 Evaluation with Applications: CloudBurst and Self-Join, RDMA-Hadoop vs. RDMA-Hadoop-Virt, in Default Mode and Distributed Mode. 14% and 24% improvement with Default Mode for CloudBurst and Self-Join, respectively; 30% and 55% improvement with Distributed Mode for CloudBurst and Self-Join, respectively. Will be released in the future.

56 Acceleration Case Studies and Performance Evaluation. Basic Designs for HiBD Packages: HDFS and MapReduce, Spark, Memcached, Kafka. Deep Learning: TensorFlow with gRPC; Deep Learning over Big Data (Caffe on Spark, TensorFlow on Spark, ...). Cloud Support for Big Data Stacks: Virtualization with SR-IOV and Containers; Swift Object Store.

57 OpenStack Swift Overview. Swift is a distributed, cloud-based object storage service, deployed as part of an OpenStack installation or as a standalone storage solution, providing worldwide data access via the Internet through an HTTP-based architecture. Architecture: multiple object servers store the data, a few proxy servers act as a proxy for all requests, and a ring handles metadata; clients send PUT or GET requests of the form PUT/GET /v1/<account>/<container>/<object>. Usage: input/output source for Big Data applications (the most common use case), software/data backup, and storage of VM/Docker images. Based on traditional TCP sockets communication.
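A minimal sketch of the PUT/GET request pattern described above, issued directly against Swift's HTTP API with plain Java; the proxy host, account, container, object name, and auth token are placeholders, and a real deployment would obtain the token from Keystone or a Swift client library instead.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SwiftPutGet {
    public static void main(String[] args) throws Exception {
        // Object URL follows the form shown above: /v1/<account>/<container>/<object>.
        String base = "http://swift-proxy:8080/v1/AUTH_demo/mycontainer/hello.txt";
        String token = "AUTH_tk_placeholder";   // placeholder auth token

        // PUT: upload the object body through the proxy server.
        HttpURLConnection put = (HttpURLConnection) new URL(base).openConnection();
        put.setRequestMethod("PUT");
        put.setRequestProperty("X-Auth-Token", token);
        put.setDoOutput(true);
        try (OutputStream out = put.getOutputStream()) {
            out.write("hello swift".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("PUT status: " + put.getResponseCode());

        // GET: download the same object back.
        HttpURLConnection get = (HttpURLConnection) new URL(base).openConnection();
        get.setRequestProperty("X-Auth-Token", token);
        try (InputStream in = get.getInputStream()) {
            System.out.println("GET body: " + new String(in.readAllBytes(),
                                                         StandardCharsets.UTF_8));
        }
    }
}
```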

58 Swift-X: Accelerating OpenStack Swift with RDMA for Building Efficient HPC Clouds. Challenges: the proxy server is a bottleneck for large-scale deployments; object upload/download operations are network intensive; can an RDMA-based approach benefit? Design: a re-designed Swift architecture for improved scalability and performance, with two proposed designs: the Client-Oblivious Design (D1), which requires no changes on the client side, and the Metadata Server-based Design (D2), which enables direct communication between client and object servers, bypassing the proxy server. The designs use an RDMA-based communication framework for accelerating networking performance and a high-performance I/O framework to provide maximum overlap between communication and I/O. S. Gugnani, X. Lu, and D. K. Panda, Swift-X: Accelerating OpenStack Swift with RDMA for Building an Efficient HPC Cloud, accepted at CCGrid 17, May 2017.

59 Swift-X: Accelerating OpenStack Swift with RDMA for Building Efficient HPC Clouds. Time breakup of GET and PUT (communication, I/O, hashsum, other) for Swift, Swift-X (D1), and Swift-X (D2): communication time reduced by up to 3.8x for PUT and up to 2.8x for GET. GET latency evaluation for object sizes from 1 MB to 4 GB: up to 66% reduction in GET latency.

60 Concluding Remarks. Discussed challenges in accelerating Big Data and Deep Learning stacks with HPC technologies. Presented basic and advanced designs that take advantage of InfiniBand/RDMA for HDFS, MapReduce, HBase, Spark, Memcached, gRPC, and TensorFlow on HPC clusters and clouds. Results are promising for Big Data processing and the associated Deep Learning tools, while many other open issues still need to be solved. This work will enable the Big Data and Deep Learning communities to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner.

61 Funding Acknowledgments. [Slide of funding agency and equipment vendor logos.]

62 Personnel Acknowledgments HPCAC-Stanford (Feb 18) 62 Current Students (Graduate) A. Awan (Ph.D.) R. Biswas (M.S.) M. Bayatpour (Ph.D.) S. Chakraborthy (Ph.D.) C.-H. Chu (Ph.D.) S. Guganani (Ph.D.) Current Students (Undergraduate) Current Research Scientists Current Research Specialist J. Hashmi (Ph.D.) N. Sarkauskas (B.S.) X. Lu J. Smith H. Javed (Ph.D.) H. Subramoni M. Arnold P. Kousha (Ph.D.) D. Shankar (Ph.D.) Current Post-doc H. Shi (Ph.D.) A. Ruhela J. Zhang (Ph.D.) Past Students A. Augustine (M.S.) P. Balaji (Ph.D.) S. Bhagvat (M.S.) A. Bhat (M.S.) D. Buntinas (Ph.D.) L. Chai (Ph.D.) B. Chandrasekharan (M.S.) N. Dandapanthula (M.S.) V. Dhanraj (M.S.) T. Gangadharappa (M.S.) K. Gopalakrishnan (M.S.) W. Huang (Ph.D.) W. Jiang (M.S.) J. Jose (Ph.D.) S. Kini (M.S.) M. Koop (Ph.D.) K. Kulkarni (M.S.) R. Kumar (M.S.) S. Krishnamoorthy (M.S.) K. Kandalla (Ph.D.) M. Li (Ph.D.) P. Lai (M.S.) J. Liu (Ph.D.) M. Luo (Ph.D.) A. Mamidala (Ph.D.) G. Marsh (M.S.) V. Meshram (M.S.) A. Moody (M.S.) S. Naravula (Ph.D.) R. Noronha (Ph.D.) X. Ouyang (Ph.D.) S. Pai (M.S.) S. Potluri (Ph.D.) R. Rajachandrasekar (Ph.D.) G. Santhanaraman (Ph.D.) A. Singh (Ph.D.) J. Sridhar (M.S.) S. Sur (Ph.D.) H. Subramoni (Ph.D.) K. Vaidyanathan (Ph.D.) A. Vishnu (Ph.D.) J. Wu (Ph.D.) W. Yu (Ph.D.) Past Research Scientist K. Hamidouche S. Sur Past Programmers D. Bureddy J. Perkins Past Post-Docs D. Banerjee J. Lin S. Marcarelli X. Besseron M. Luo J. Vienne H.-W. Jin E. Mancini H. Wang

63 The 4th International Workshop on High-Performance Big Data Computing (HPBDC). HPBDC 2018 will be held with the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2018), Vancouver, British Columbia, Canada, in May 2018. HPBDC 2017 was held in conjunction with IPDPS 17: keynote talk by Prof. Satoshi Matsuoka, Converging HPC and Big Data / AI Infrastructures at Scale with BYTES-Oriented Architectures; panel topic: Sunrise or Sunset: Exploring the Design Space of Big Data Software Stack; six regular research papers and one short research paper. HPBDC 2016 was held in conjunction with IPDPS 16: keynote talk by Dr. Chaitanya Baru, High Performance Computing and Which Big Data?; panel topic: Merge or Split: Mutual Influence between Big Data and HPC Techniques; six regular research papers and two short research papers.

64 HPCAC-Stanford (Feb 18) 64 Thank You! Network-Based Computing Laboratory The High-Performance Big Data Project


Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters Designing High Performance Heterogeneous Broadcast for Streaming Applications on Clusters 1 Ching-Hsiang Chu, 1 Khaled Hamidouche, 1 Hari Subramoni, 1 Akshay Venkatesh, 2 Bracy Elton and 1 Dhabaleswar

More information

Deep Learning Frameworks with Spark and GPUs

Deep Learning Frameworks with Spark and GPUs Deep Learning Frameworks with Spark and GPUs Abstract Spark is a powerful, scalable, real-time data analytics engine that is fast becoming the de facto hub for data science and big data. However, in parallel,

More information

Designing Convergent HPC, Deep Learning and Big Data Analytics Software Stacks for Exascale Systems

Designing Convergent HPC, Deep Learning and Big Data Analytics Software Stacks for Exascale Systems Designing Convergent HPC, Deep Learning and Big Data Analytics Software Stacks for Exascale Systems Talk at HPCAC-AI Switzerland Conference (April 19) by Dhabaleswar K. (DK) Panda The Ohio State University

More information

Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth

Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth by D.K. Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Outline Overview of MVAPICH2-GPU

More information

Designing MPI and PGAS Libraries for Exascale Systems: The MVAPICH2 Approach

Designing MPI and PGAS Libraries for Exascale Systems: The MVAPICH2 Approach Designing MPI and PGAS Libraries for Exascale Systems: The MVAPICH2 Approach Talk at OpenFabrics Workshop (April 216) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Bridging Neuroscience and HPC with MPI-LiFE Shashank Gugnani

Bridging Neuroscience and HPC with MPI-LiFE Shashank Gugnani Bridging Neuroscience and HPC with MPI-LiFE Shashank Gugnani The Ohio State University E-mail: gugnani.2@osu.edu http://web.cse.ohio-state.edu/~gugnani/ Network Based Computing Laboratory SC 17 2 Neuroscience:

More information

Chelsio Communications. Meeting Today s Datacenter Challenges. Produced by Tabor Custom Publishing in conjunction with: CUSTOM PUBLISHING

Chelsio Communications. Meeting Today s Datacenter Challenges. Produced by Tabor Custom Publishing in conjunction with: CUSTOM PUBLISHING Meeting Today s Datacenter Challenges Produced by Tabor Custom Publishing in conjunction with: 1 Introduction In this era of Big Data, today s HPC systems are faced with unprecedented growth in the complexity

More information

Building the Most Efficient Machine Learning System

Building the Most Efficient Machine Learning System Building the Most Efficient Machine Learning System Mellanox The Artificial Intelligence Interconnect Company June 2017 Mellanox Overview Company Headquarters Yokneam, Israel Sunnyvale, California Worldwide

More information

Application Acceleration Beyond Flash Storage

Application Acceleration Beyond Flash Storage Application Acceleration Beyond Flash Storage Session 303C Mellanox Technologies Flash Memory Summit July 2014 Accelerating Applications, Step-by-Step First Steps Make compute fast Moore s Law Make storage

More information

Accelerating HPL on Heterogeneous GPU Clusters

Accelerating HPL on Heterogeneous GPU Clusters Accelerating HPL on Heterogeneous GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Outline

More information

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda Department of Computer Science and Engineering

More information

Communication Frameworks for HPC and Big Data

Communication Frameworks for HPC and Big Data Communication Frameworks for HPC and Big Data Talk at HPC Advisory Council Spain Conference (215) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters

CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan and Dhabaleswar K. (DK) Panda Speaker: Sourav Chakraborty

More information

Designing High Performance Communication Middleware with Emerging Multi-core Architectures

Designing High Performance Communication Middleware with Emerging Multi-core Architectures Designing High Performance Communication Middleware with Emerging Multi-core Architectures Dhabaleswar K. (DK) Panda Department of Computer Science and Engg. The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters

Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters Hari Subramoni, Ping Lai, Sayantan Sur and Dhabhaleswar. K. Panda Department of

More information

Infiniband and RDMA Technology. Doug Ledford

Infiniband and RDMA Technology. Doug Ledford Infiniband and RDMA Technology Doug Ledford Top 500 Supercomputers Nov 2005 #5 Sandia National Labs, 4500 machines, 9000 CPUs, 38TFlops, 1 big headache Performance great...but... Adding new machines problematic

More information

High Performance Migration Framework for MPI Applications on HPC Cloud

High Performance Migration Framework for MPI Applications on HPC Cloud High Performance Migration Framework for MPI Applications on HPC Cloud Jie Zhang, Xiaoyi Lu and Dhabaleswar K. Panda {zhanjie, luxi, panda}@cse.ohio-state.edu Computer Science & Engineering Department,

More information

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning Ammar Ahmad Awan, Khaled Hamidouche, Akshay Venkatesh, and Dhabaleswar K. Panda Network-Based Computing Laboratory Department

More information

MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits

MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits Ashish Kumar Singh, Sreeram Potluri, Hao Wang, Krishna Kandalla, Sayantan Sur, and Dhabaleswar K. Panda Network-Based

More information

Intra-MIC MPI Communication using MVAPICH2: Early Experience

Intra-MIC MPI Communication using MVAPICH2: Early Experience Intra-MIC MPI Communication using MVAPICH: Early Experience Sreeram Potluri, Karen Tomko, Devendar Bureddy, and Dhabaleswar K. Panda Department of Computer Science and Engineering Ohio State University

More information

Why AI Frameworks Need (not only) RDMA?

Why AI Frameworks Need (not only) RDMA? Why AI Frameworks Need (not only) RDMA? With Design and Implementation Experience of Networking Support on TensorFlow GDR, Apache MXNet, WeChat Amber, and Tencent Angel Bairen Yi (byi@connect.ust.hk) Jingrong

More information

Scaling with PGAS Languages

Scaling with PGAS Languages Scaling with PGAS Languages Panel Presentation at OFA Developers Workshop (2013) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication

Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication Sreeram Potluri* Hao Wang* Devendar Bureddy* Ashish Kumar Singh* Carlos Rosales + Dhabaleswar K. Panda* *Network-Based

More information

MOHA: Many-Task Computing Framework on Hadoop

MOHA: Many-Task Computing Framework on Hadoop Apache: Big Data North America 2017 @ Miami MOHA: Many-Task Computing Framework on Hadoop Soonwook Hwang Korea Institute of Science and Technology Information May 18, 2017 Table of Contents Introduction

More information

Designing Software Libraries and Middleware for Exascale Systems: Opportunities and Challenges

Designing Software Libraries and Middleware for Exascale Systems: Opportunities and Challenges Designing Software Libraries and Middleware for Exascale Systems: Opportunities and Challenges Talk at Brookhaven National Laboratory (Oct 214) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures

EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures Haiyang Shi, Xiaoyi Lu, and Dhabaleswar K. (DK) Panda {shi.876, lu.932, panda.2}@osu.edu The Ohio State University

More information

MVAPICH2 and MVAPICH2-MIC: Latest Status

MVAPICH2 and MVAPICH2-MIC: Latest Status MVAPICH2 and MVAPICH2-MIC: Latest Status Presentation at IXPUG Meeting, July 214 by Dhabaleswar K. (DK) Panda and Khaled Hamidouche The Ohio State University E-mail: {panda, hamidouc}@cse.ohio-state.edu

More information

Advanced RDMA-based Admission Control for Modern Data-Centers

Advanced RDMA-based Admission Control for Modern Data-Centers Advanced RDMA-based Admission Control for Modern Data-Centers Ping Lai Sundeep Narravula Karthikeyan Vaidyanathan Dhabaleswar. K. Panda Computer Science & Engineering Department Ohio State University Outline

More information

Data Processing at the Speed of 100 Gbps using Apache Crail. Patrick Stuedi IBM Research

Data Processing at the Speed of 100 Gbps using Apache Crail. Patrick Stuedi IBM Research Data Processing at the Speed of 100 Gbps using Apache Crail Patrick Stuedi IBM Research The CRAIL Project: Overview Data Processing Framework (e.g., Spark, TensorFlow, λ Compute) Spark-IO FS Albis Streaming

More information

Software Libraries and Middleware for Exascale Systems

Software Libraries and Middleware for Exascale Systems Software Libraries and Middleware for Exascale Systems Talk at HPC Advisory Council China Workshop (212) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems?

Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems? Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems? Sayantan Sur, Abhinav Vishnu, Hyun-Wook Jin, Wei Huang and D. K. Panda {surs, vishnu, jinhy, huanwei, panda}@cse.ohio-state.edu

More information

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters Matthew Koop 1 Miao Luo D. K. Panda matthew.koop@nasa.gov {luom, panda}@cse.ohio-state.edu 1 NASA Center for Computational

More information

CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart

CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart Xiangyong Ouyang, Raghunath Rajachandrasekar, Xavier Besseron, Hao Wang, Jian Huang, Dhabaleswar K. Panda Department of Computer

More information

Data Processing at the Speed of 100 Gbps using Apache Crail. Patrick Stuedi IBM Research

Data Processing at the Speed of 100 Gbps using Apache Crail. Patrick Stuedi IBM Research Data Processing at the Speed of 100 Gbps using Apache Crail Patrick Stuedi IBM Research The CRAIL Project: Overview Data Processing Framework (e.g., Spark, TensorFlow, λ Compute) Spark-IO Albis Pocket

More information

Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications

Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications GPU Technology Conference GTC 2016 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications

Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications K. Vaidyanathan, P. Lai, S. Narravula and D. K. Panda Network Based Computing Laboratory

More information

Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries

Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Jeffrey Young, Alex Merritt, Se Hoon Shon Advisor: Sudhakar Yalamanchili 4/16/13 Sponsors: Intel, NVIDIA, NSF 2 The Problem Big

More information

BIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE

BIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE BIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE BRETT WENINGER, MANAGING DIRECTOR 10/21/2014 ADURANT APPROACH TO BIG DATA Align to Un/Semi-structured Data Instead of Big Scale out will become Big Greatest

More information

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Krishna Kandalla, Emilio P. Mancini, Sayantan Sur, and Dhabaleswar. K. Panda Department of Computer Science & Engineering,

More information