Exploiting HPC Technologies for Accelerating Big Data Processing and Storage

Size: px

Start display at page:

Download "Exploiting HPC Technologies for Accelerating Big Data Processing and Storage"

Claude Kennedy
5 years ago
Views:

1 Exploiting HPC Technologies for Accelerating Big Data Processing and Storage Talk in the 5194 class by Xiaoyi Lu The Ohio State University

Introduction to Big Data Analytics and Trends Big

harness the power of data, both in the business

the most important elements in business analytics

converging to meet large scale data processing

(HPDA) workloads in the cloud is gaining

survey, 27% of cloud deployments are running HPDA

com/blog/tag/data?currentpage=3 http://www.

2 Introduction to Big Data Analytics and Trends Big Data has changed the way people understand and harness the power of data, both in the business and research domains Big Data has become one of the most important elements in business analytics Big Data and High Performance Computing (HPC) are converging to meet large scale data processing challenges Running High Performance Data Analysis (HPDA) workloads in the cloud is gaining popularity According to the latest OpenStack survey, 27% of cloud deployments are running HPDA workloads Class at OSU 2

4V Characteristics of Big Data 5194 Class at OSU 3 Commonly accepted 3V s of Big Data Volume, Velocity, Variety Courtesy: http://api.ning.

3 4V Characteristics of Big Data 5194 Class at OSU 3 Commonly accepted 3V s of Big Data Volume, Velocity, Variety Courtesy: Xz5cxylXG004GLGJdjoPd6bVfVBwvgu*F5MwDDUCiHHdmBW- JTEz0cfJjGurJucBMTkIUNdL3jcZT8IPfNWfN9/dv1.jpg Michael Stonebraker: Big Data Means at Least Three Different Things, 4/5V s of Big Data 3V + *Veracity, *Value

By 2020, a third of the data in the digital universe (more than 13,000 exabytes) will have Big Data Value, but only if it is

4 Big Volume of Data by the End of this Decade From 2005 to 2020, the digital universe will grow by a factor of 300, from 130 exabytes to 40,000 exabytes. By 2020, a third of the data in the digital universe (more than 13,000 exabytes) will have Big Data Value, but only if it is tagged and analyzed. Courtesy: John Gantz and David Reinsel, "The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East, IDC's Digital Universe Study, sponsored by EMC, December Class at OSU 4

5 Big Velocity How Much Data Is Generated Every Minute on the Internet? The global Internet population grew 7.5% from 2016 and now represents 3.7 Billion People. Courtesy: Class at OSU 5

6 Data Management and Processing on Modern Clusters 5194 Class at OSU 6 Substantial impact on designing and utilizing data management and processing systems in multiple tiers Front-end data accessing and serving (Online) Memcached + DB (e.g. MySQL), HBase Back-end data analytics (Offline) HDFS, MapReduce, Spark

5194 Class at OSU 7 Not Only in Internet Services - Big

Analysis, and Visualization Applications examples Climate

Intensive Tasks Runs large-scale simulations on

Collect experimental / observational data Move experimental

7 5194 Class at OSU 7 Not Only in Internet Services - Big Data in Scientific Domains Scientific Data Management, Analysis, and Visualization Applications examples Climate modeling Combustion Fusion Astrophysics Bioinformatics Data Intensive Tasks Runs large-scale simulations on supercomputers Dump data on parallel storage systems Collect experimental / observational data Move experimental / observational data to analysis sites Visual analytics help understand data visually

8 Presentation Outline Overview MapReduce and RDD Programming Models Apache Hadoop, Spark, Memcached, grpc, and TensorFlow Modern Interconnects and Protocols Challenges in Accelerating Hadoop, Spark, Memcached, grpc, and TensorFlow Acceleration Case Studies and In-Depth Performance Evaluation The High-Performance Big Data (HiBD) Project and Associated Releases Conclusion and Q&A 5194 Class at OSU 8

9 5194 Class at OSU 9 WordCount Execution in MapReduce The overall execution process of WordCount in MapReduce

10 A Hadoop MapReduce Example - WordCount public class WordCount { public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(longwritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.tostring(); StringTokenizer tokenizer = new StringTokenizer(line); LOC of the Full while (tokenizer.hasmoretokens()) Productive { Example: 63 word.set(tokenizer.nexttoken()); context.write(word, one); } } } public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(text key, Iterator<IntWritable> values, Context context) throws IOException, InterruptedException { Fault- Tolerant int sum = 0; while (values.hasnext()) { sum += Scalable values.next().get(); } context.write(key, new IntWritable(sum)); } } } 5194 Class at OSU 10

11 5194 Class at OSU 11 Data Sharing Problems in MapReduce HDFS read HDFS write HDFS read HDFS write iter. 1 iter Input Slow due to replication, serialization, and disk IO In-Memory? iter. 1 iter Input faster than network and disk

12 RDD Programming Model in Spark Key idea: Resilient Distributed Datasets (RDDs) Immutable distributed collections of objects that can be cached in memory across cluster nodes Created by transforming data in stable storage using data flow operators (map, filter, groupby, ) Manipulated through various parallel operators Automatically rebuilt on failure rebuilt if a partition is lost Interface Clean language-integrated API in Scala (Python & Java) Can be used interactively from Scala console 5194 Class at OSU 12

13 RDD Operations map filter sample union groupbykey reducebykey sortbykey join Transformations (define a new RDD) Actions (return a result to driver) reduce collect count first Take countbykey saveastextfile saveassequencefile More Information: Class at OSU 13

5194 Class at OSU 14 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Base Cache 1 Transformed RDD RDD lines =

$split( \t )(2)) Driver Action cachedmsgs = messages.cache() Lazy Execution cachedmsgs.filter(_.contains( foo )).count Action cachedmsgs.filter(_.contains( bar )).$

14 5194 Class at OSU 14 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Base Cache 1 Transformed RDD RDD lines = spark.textfile( hdfs://... ) Worker results errors = lines.filter(_.startswith( ERROR )) tasks Block 1 messages = errors.map(_.split( \t )(2)) Driver Action cachedmsgs = messages.cache() Lazy Execution cachedmsgs.filter(_.contains( foo )).count Action cachedmsgs.filter(_.contains( bar )).count Cache 2 Worker... Cache 3 Worker Block 2 Full-text Result: search scaled of Wikipedia, to 1 TB data 60GB in 5-7 on sec 20 EC2 machines, (vs sec sec for vs. on-disk 20s for data) on-disk Block 3 Courtesy:

15 5194 Class at OSU 15 Lineage-based Fault Tolerance RDDs maintain lineage information that can be used to reconstruct lost partitions Example cachedmsgs = textfile(...).filter(_.contains( error )).map(_.split( \t )(2)).cache() HdfsRDD path: hdfs:// FilteredRDD func: contains(...) MappedRDD func: split( ) CachedRDD

flatmap(line => line.split(" ")).map(word => (word, 1)).

16 5194 Class at OSU 16 RDD Example: Word Count in Spark! val file = spark.textfile("hdfs://...") Productive 3 Lines! val counts = file.flatmap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _) counts.saveastextfile("hdfs://...") Scalable High- Performance Fault- Tolerant

17 Presentation Outline Overview MapReduce and RDD Programming Models Apache Hadoop, Spark, Memcached, grpc, and TensorFlow Challenges in Accelerating Hadoop, Spark, Memcached, grpc, and TensorFlow Acceleration Case Studies and In-Depth Performance Evaluation The High-Performance Big Data (HiBD) Project and Associated Releases Conclusion and Q&A 5194 Class at OSU 17

18 Overview of Hadoop, Spark, Memcached, grpc, and TensorFlow Overview of Apache Hadoop and Spark Architecture and its Components MapReduce HDFS Spark HBase Overview of Web 2.0 Architecture and Memcached Overview of grpc and TensorFlow Architecture 5194 Class at OSU 18

19 5194 Class at OSU 19 Overview of Apache Hadoop Architecture Open-source implementation of Google MapReduce, GFS, and BigTable for Big Data Analytics Hadoop Common Utilities (RPC, etc.), HDFS, MapReduce, YARN Hadoop 1.x Hadoop 2.x MapReduce (Data Processing) Other Models (Data Processing) MapReduce (Cluster Resource Management & Data Processing) YARN (Cluster Resource Management & Job Scheduling) Hadoop Distributed File System (HDFS) Hadoop Common/Core (RPC,..) Hadoop Distributed File System (HDFS) Hadoop Common/Core (RPC,..)

20 MapReduce on Hadoop 2.x -- YARN Architecture Resource Manager: coordinates the allocation of compute resources Node Manager: in charge of resource containers, monitoring resource usage, and reporting to Resource Manager Application Master: in charge of the life cycle an application, like a MapReduce Job. It negotiates with Resource Manager of cluster resources and keeps track of task progress and status Courtesy: Class at OSU 20

Data Movement in Hadoop MapReduce Bulk Data Transfer Disk Operations Map and Reduce Tasks carry out the total job execution Map tasks read from HDFS, operate on it, and write the

21 Data Movement in Hadoop MapReduce Bulk Data Transfer Disk Operations Map and Reduce Tasks carry out the total job execution Map tasks read from HDFS, operate on it, and write the intermediate data to local disk Reduce tasks get these data by shuffle from TaskTrackers, operate on it and write to HDFS Communication in shuffle phase uses HTTP over Java Sockets 5194 Class at OSU 21

Hadoop Distributed File System (HDFS) Primary storage of Hadoop; highly reliable and fault-tolerant Adopted by many reputed organizations eg: Facebook, Yahoo!

22 Hadoop Distributed File System (HDFS) Primary storage of Hadoop; highly reliable and fault-tolerant Adopted by many reputed organizations eg: Facebook, Yahoo! NameNode: stores the file system namespace DataNode: stores data blocks Developed in Java for platformindependence and portability Client RPC RPC RPC RPC Uses sockets for communication! (HDFS Architecture) 5194 Class at OSU 22

23 New Features of Apache Hadoop 3.x Architecture Hadoop Apps MR HIVE Spark Hadoop Apps TensorFlow YARN MR HIVE Spark docker docker docker docker docker YARN HDFS HDFS Erasure Coding Support for more than 2 NameNodes Intra-datanode balancer YARN Built-in support for Long Running Services MapReduce Task-level native optimization (up to 30% faster for shuffle-intensive jobs) Better resource isolation (isolation supports for disk and network) and Docker Scheduling enhancement (enhance container scheduling throughput by 6x) Re-architecture for YARN Timeline Service - ATS v Class at OSU 23

5194 Class at OSU 24 Spark Architecture Overview An in-memory data-processing framework Iterative machine learning jobs Interactive data analytics Scala based Implementation Standalone, YARN, Mesos A

24 5194 Class at OSU 24 Spark Architecture Overview An in-memory data-processing framework Iterative machine learning jobs Interactive data analytics Scala based Implementation Standalone, YARN, Mesos A unified engine to support Batch, Streaming, SQL, Graph, ML/DL workloads Scalable and communication intensive Wide dependencies between Resilient Distributed Datasets (RDDs) MapReduce-like shuffle operations to repartition RDDs Sockets based communication BlinkDB Spark SQL Spark Streaming (real-time) GraphX (graph) Spark MLlib (Machine Learning) Caffe, TensorFlow, BigDL, etc. (Deep Learning) Standalone Apache Mesos YARN

HBase Architecture Overview Apache Hadoop Database (http://hbase.apache.

datacenter applications eg: Facebook Social Inbox Developed in Java for

25 HBase Architecture Overview Apache Hadoop Database ( Semi-structured database, which is highly scalable Integral part of many datacenter applications eg: Facebook Social Inbox Developed in Java for platformindependence and portability Uses sockets for communication! (HBase Architecture) 5194 Class at OSU 25

Architecture Overview of Memcached Three-layer

0 Web Servers, Memcached Servers, Database Servers

0 architecture Distributed Caching Layer Allows to

purpose Typically used to cache database queries,

26 Architecture Overview of Memcached Three-layer architecture of Web 2.0 Web Servers, Memcached Servers, Database Servers Internet Memcached is a core component of Web 2.0 architecture Distributed Caching Layer Allows to aggregate spare memory from multiple nodes General purpose Typically used to cache database queries, results of API calls Scalable model, but typical usage very network intensive 5194 Class at OSU 26

5194 Class at OSU 27 Architecture Overview of grpc Key Features: Simple service definition Works across languages and platforms C++, Java, Python, Android Java etc Linux, Mac, Windows.

27 5194 Class at OSU 27 Architecture Overview of grpc Key Features: Simple service definition Works across languages and platforms C++, Java, Python, Android Java etc Linux, Mac, Windows. Start quickly and scale Bi-directional streaming and integrated authentication Used by Google (several of Google s cloud products and Google externally facing APIs, TensorFlow), NetFlix, Docker, Cisco, Juniper Networks etc. Uses sockets for communication! Large-scale distributed systems composed of micro services Source:

5194 Class at OSU 28 Architecture Overview of Google TensorFlow Key Features: Widely used for Deep Learning Open source software library for numerical computation using data flow graphs Graph edges

28 5194 Class at OSU 28 Architecture Overview of Google TensorFlow Key Features: Widely used for Deep Learning Open source software library for numerical computation using data flow graphs Graph edges represent the multidimensional data arrays Nodes in the graph represent mathematical operations Flexible architecture allows to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API Used by Google, Airbnb, DropBox, Snapchat, Twitter Communication and Computation intensive Architecture of TensorFlow Source:

29 5194 Class at OSU 29 Architecture Overview of grpc-based TensorFlow Worker /job:ps/task:0 CPU GPU RDMA grpc server/ client Client Master RDMA grpc server/ client Worker /job:worker/task:0 CPU GPU Worker services communicate among each other using RDMA-gRPC

30 Presentation Outline Overview MapReduce and RDD Programming Models Apache Hadoop, Spark, Memcached, grpc, and TensorFlow Challenges in Accelerating Hadoop, Spark, Memcached, grpc, and TensorFlow Acceleration Case Studies and In-Depth Performance Evaluation The High-Performance Big Data (HiBD) Project and Associated Releases Conclusion and Q&A 5194 Class at OSU 30

31 5194 Class at OSU 31 Increasing Usage of HPC, Big Data and Deep Learning HPC (MPI, RDMA, Lustre, etc.) Big Data (Hadoop, Spark, HBase, Memcached, etc.) Deep Learning (Caffe, TensorFlow, BigDL, etc.) Convergence of HPC, Big Data, and Deep Learning!!!

How Can HPC Clusters with High-Performance

Big Data and Deep Learning Applications?

alleviated with new designs by taking advantage

What are the major bottlenecks in current Big

benefit Big Data processing and Deep Learning?

32 How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data and Deep Learning Applications? 5194 Class at OSU 32 Can the bottlenecks be alleviated with new designs by taking advantage of HPC technologies? What are the major bottlenecks in current Big Data processing and Deep Learning middleware (e.g. Hadoop, Spark)? Can RDMA-enabled high-performance interconnects benefit Big Data processing and Deep Learning? Can HPC Clusters with high-performance storage systems (e.g. SSD, parallel file systems) benefit Big Data and Deep Learning applications? How much performance benefits can be achieved through enhanced designs? How to design benchmarks for evaluating the performance of Big Data and Deep Learning middleware on HPC clusters? Bring HPC, Big Data processing, and Deep Learning into a convergent trajectory!

33 5194 Class at OSU 33 Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure?

34 5194 Class at OSU 34 Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure?

35 5194 Class at OSU 35 Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure?

36 5194 Class at OSU 36 Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure?

5194 Class at OSU 37 Designing Communication and I/O Libraries for Big Data Systems: Challenges Applications Big Data Middleware Upper level (HDFS, MapReduce, HBase, Spark, grpc/tensorflow, and

37 5194 Class at OSU 37 Designing Communication and I/O Libraries for Big Data Systems: Challenges Applications Big Data Middleware Upper level (HDFS, MapReduce, HBase, Spark, grpc/tensorflow, and Memcached) Changes? Programming Models (Sockets) Benchmarks RDMA? Point-to-Point Communication Communication and I/O Library Threaded Models and Synchronization Virtualization (SR-IOV) I/O and File Systems QoS & Fault Tolerance Performance Tuning Networking Technologies (InfiniBand, 1/10/40/100 GigE and Intelligent NICs) Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators) Storage Technologies (HDD, SSD, NVM, and NVMe- SSD)

38 Presentation Outline Overview MapReduce and RDD Programming Models Apache Hadoop, Spark, Memcached, grpc, and TensorFlow Challenges in Accelerating Hadoop, Spark, Memcached, grpc, and TensorFlow Acceleration Case Studies and In-Depth Performance Evaluation The High-Performance Big Data (HiBD) Project and Associated Releases Conclusion and Q&A 5194 Class at OSU 38

5194 Class at OSU 39 Designing Communication and I/O Libraries for Big Data Systems: Challenges Applications Benchmarks Big Data Middleware (HDFS, MapReduce, HBase, Spark, grpc/tensorflow, and

39 5194 Class at OSU 39 Designing Communication and I/O Libraries for Big Data Systems: Challenges Applications Benchmarks Big Data Middleware (HDFS, MapReduce, HBase, Spark, grpc/tensorflow, and Memcached) Programming Models (Sockets) RDMA Protocols Point-to-Point Communication Communication and I/O Library Threaded Models and Synchronization Virtualization (SR-IOV) I/O and File Systems QoS & Fault Tolerance Performance Tuning Networking Technologies (InfiniBand, 1/10/40/100 GigE and Intelligent NICs) Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators) Storage Technologies (HDD, SSD, NVM, and NVMe- SSD)

40 Basic Acceleration Case Studies and In-Depth Performance Evaluation High-Performance Designs with RDMA, In-memory, SSD, Parallel Filesystems HDFS MapReduce Spark Hadoop RPC and HBase Memcached grpc and TensorFlow Kafka 5194 Class at OSU 40

41 Design Overview of HDFS with RDMA Applications Design Features RDMA-based HDFS write Others Java Socket Interface HDFS Write Java Native Interface (JNI) OSU Design RDMA-based HDFS replication Parallel replication support On-demand connection setup Verbs InfiniBand/RoCE support 1/10/40/100 GigE, IPoIB Network RDMA Capable Networks (IB, iwarp, RoCE..) N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012 N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June Class at OSU 41

Enhanced HDFS with In-Memory and Heterogeneous Storage Applications Triple-H Data Placement Policies Hybrid Replication Eviction/Promotion Heterogeneous Storage RAM Disk

RAM, SSD, HDD, Lustre Eviction/Promotion based on data usage pattern Hybrid Replication Lustre-Integrated mode: Lustre-based fault-tolerance N. Islam, X. Lu, M. W.

42 Enhanced HDFS with In-Memory and Heterogeneous Storage Applications Triple-H Data Placement Policies Hybrid Replication Eviction/Promotion Heterogeneous Storage RAM Disk SSD HDD Lustre Design Features Three modes Default (HHH) In-Memory (HHH-M) Lustre-Integrated (HHH-L) Policies to efficiently utilize the heterogeneous storage devices RAM, SSD, HDD, Lustre Eviction/Promotion based on data usage pattern Hybrid Replication Lustre-Integrated mode: Lustre-based fault-tolerance N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid 15, May Class at OSU 42

43 Execution Time (s) Evaluation with Spark on SDSC Gordon (HHH vs. Alluxio) IPoIB (QDR) Tachyon OSU-IB (QDR) Reduced by 2.4x For 200GB TeraGen on 32 nodes Spark-TeraGen: HHH has 2.4x improvement over Tachyon; 2.3x over HDFS-IPoIB (QDR) Spark-TeraSort: HHH has 25.2% improvement over Tachyon; 17% over HDFS-IPoIB (QDR) 8:50 16:100 32:200 Cluster Size : Data Size (GB) TeraGen Execution Time (s) Reduced by 25.2% 8:50 16:100 32:200 Cluster Size : Data Size (GB) TeraSort N. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In-Memory File Systems for Hadoop and Spark Applications on HPC Clusters, IEEE BigData 15, October Class at OSU 43

44 5194 Class at OSU 44 Basic Acceleration Case Studies and In-Depth Performance Evaluation High-Performance Designs with RDMA, In-memory, SSD, Parallel Filesystems HDFS MapReduce Spark Hadoop RPC and HBase Memcached grpc and TensorFlow Kafka

45 Design Overview of MapReduce with RDMA MapReduce Job Tracker Java Socket Interface 1/10/40/100 GigE, IPoIB Network Applications Task Tracker Map Reduce Java Native Interface (JNI) OSU Design Verbs RDMA Capable Networks (IB, iwarp, RoCE..) Design Features RDMA-based shuffle Prefetching and caching map output Efficient Shuffle Algorithms In-memory merge On-demand Shuffle Adjustment Advanced overlapping map, shuffle, and merge shuffle, merge, and reduce On-demand connection setup InfiniBand/RoCE support M. W. Rahman, N. S. Islam, X. Lu, J. Jose, H. Subramon, H. Wang, and D. K. Panda, High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand, HPDIC Workshop, held in conjunction with IPDPS, May 2013 M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June Class at OSU 45

46 Evaluations using PUMA Workload 10GigE IPoIB (QDR) OSU-IB (QDR) Normalized Execution Time AdjList (30GB) SelfJoin (80GB) SeqCount (30GB) WordCount (30GB) InvertIndex (30GB) Benchmarks 50% improvement in Self Join over IPoIB (QDR) for 80 GB data size 49% improvement in Sequence Count over IPoIB (QDR) for 30 GB data size 5194 Class at OSU 46

47 Optimize Hadoop YARN MapReduce over Parallel File Systems Compute Nodes App Master Map Reduce Lustre Client MetaData Servers Object Storage Servers Lustre Setup HPC Cluster Deployment Hybrid topological solution of Beowulf architecture with separate I/O nodes Lean compute nodes with light OS; more memory space; small local storage Sub-cluster of dedicated I/O nodes with parallel file systems, such as Lustre MapReduce over Lustre Local disk is used as the intermediate data directory Lustre is used as the intermediate data directory 5194 Class at OSU 47

Design Overview of Shuffle Strategies for MapReduce over Lustre Map 1 Map 2 Map 3 Design Features Two shuffle approaches Lustre read based shuffle In-memory merge/sort Intermediate Data Directory

48 Design Overview of Shuffle Strategies for MapReduce over Lustre Map 1 Map 2 Map 3 Design Features Two shuffle approaches Lustre read based shuffle In-memory merge/sort Intermediate Data Directory Lustre Lustre Read / RDMA Reduce 1 Reduce 2 reduce In-memory merge/sort reduce RDMA based shuffle Hybrid shuffle algorithm to take benefit from both shuffle approaches Dynamically adapts to the better shuffle approach for each shuffle request based on profiling values for each Lustre read operation In-memory merge and overlapping of different phases are kept similar to RDMA-enhanced MapReduce design M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, High Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA, IPDPS, May Class at OSU 48

49 5194 Class at OSU 49 Case Study - Performance Improvement of MapReduce over Lustre on SDSC-Gordon Lustre is used as the intermediate data directory Job Execution Time (sec) IPoIB (QDR) OSU-Lustre-Read (QDR) OSU-RDMA-IB (QDR) OSU-Hybrid-IB (QDR) Data Size (GB) For 80GB Sort in 8 nodes Reduced by 34% 34% improvement over IPoIB (QDR) Job Execution Time (sec) Reduced by 25% Data Size (GB) For 120GB TeraSort in 16 nodes 25% improvement over IPoIB (QDR)

50 5194 Class at OSU 50 Basic Acceleration Case Studies and In-Depth Performance Evaluation High-Performance Designs with RDMA, In-memory, SSD, Parallel Filesystems HDFS MapReduce Spark Hadoop RPC and HBase Memcached grpc and TensorFlow Kafka

51 Design Overview of Spark with RDMA Apache Spark Benchmarks/Applications/Libraries/Frameworks Netty Server Spark Core Shuffle Manager (Sort, Hash, Tungsten-Sort) Block Transfer Service (Netty, NIO, RDMA-Plugin) NIO Server Java Socket Interface RDMA Server Netty Client NIO Client RDMA Client Java Native Interface (JNI) Native RDMA-based Comm. Engine Design Features RDMA based shuffle plugin SEDA-based architecture Dynamic connection management and sharing Non-blocking data transfer Off-JVM-heap buffer management InfiniBand/RoCE support 1/10/40/100 GigE, IPoIB Network RDMA Capable Networks (IB, iwarp, RoCE..) Enables high performance RDMA communication, while supporting traditional socket interface JNI Layer bridges Scala based Spark with communication library written in native code X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014 X. Lu, D. Shankar, S. Gugnani, and D. K. Panda, High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads, IEEE BigData 16, Dec Class at OSU 51

52 Performance Evaluation on SDSC Comet HiBench PageRank Time (sec) IPoIB 37% RDMA Huge BigData Gigantic Data Size (GB) 32 Worker Nodes, 768 cores, PageRank Total Time 64 Worker Nodes, 1536 cores, PageRank Total Time InfiniBand FDR, SSD, 32/64 Worker Nodes, 768/1536 Cores, (768/1536M 768/1536R) RDMA-based design for Spark RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node. 32 nodes/768 cores: Total time reduced by 37% over IPoIB (56Gbps) 64 nodes/1536 cores: Total time reduced by 43% over IPoIB (56Gbps) 5194 Class at OSU 52 Time (sec) IPoIB 43% RDMA Huge BigData Gigantic Data Size (GB)

53 5194 Class at OSU 53 Basic Acceleration Case Studies and In-Depth Performance Evaluation High-Performance Designs with RDMA, In-memory, SSD, Parallel Filesystems HDFS MapReduce Spark Hadoop RPC and HBase Memcached grpc and TensorFlow Kafka

54 Performance Benefits for Hadoop RPC and HBase Hadoop RPC Throughput on Chameleon-Cloud HBase YCSB Workload A on SDSC-Comet Hadoop RPC Throughput on Chameleon-Cloud-FDR up to 2.6x performance speedup over IPoIB for throughput HBase YCSB Workload A (read: write=50:50) on SDSC-Comet-FDR Native designs always perform better than the IPoIB-UD transport up to 2.4x performance speedup over IPoIB for throughput X. Lu, N. Islam, M. W. Rahman, J. Jose, H. Subramoni, H. Wang, and D. K. Panda, High-Performance Design of Hadoop RPC with RDMA over InfiniBand, ICPP '13, October X. Lu, D. Shankar, S. Gugnani, H. Subramoni, and D. K. Panda, Impact of HPC Cloud Networking Technologies on Accelerating Hadoop RPC and HBase, CloudCom, Class at OSU 54

55 Accelerating Hybrid Memcached with RDMA, Non-blocking Extensions and SSDs CLIENT SERVER BLOCKING API NON-BLOCKING API REQ. NON-BLOCKING API REPLY RDMA-ENHANCED COMMUNICATION LIBRARY RDMA-ENHANCED COMMUNICATION LIBRARY Blocking API Flow LIBMEMCACHED LIBRARY HYBRID SLAB MANAGER (RAM+SSD) Non-Blocking API Flow RDMA-Accelerated Communication for Memcached Get/Set Hybrid RAM+SSD slab management for higher data retention Non-blocking API extensions memcached_(iset/iget/bset/bget/test/wait) Achieve near in-memory speeds while hiding bottlenecks of network and SSD I/O Ability to exploit communication/computation overlap Optional buffer re-use guarantees Adaptive slab manager with different I/O schemes for higher throughput. D. Shankar, X. Lu, N. S. Islam, M. W. Rahman, and D. K. Panda, High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits, IPDPS, May Class at OSU 55

56 Average Latency (us) Performance Evaluation with Non-Blocking Memcached API Set Get Set Get Set Get Set Get Set Get Set Get Data does not fit in memory: Non-blocking Memcached Set/Get API Extensions can achieve >16x latency improvement vs. blocking API over RDMA-Hybrid/RDMA-Mem w/ penalty >2.5x throughput improvement vs. blocking API over default/optimized RDMA-Hybrid Data fits in memory: Non-blocking Extensions perform similar to RDMA-Mem/RDMA-Hybrid and >3.6x improvement over IPoIB-Mem Miss Penalty (Backend DB Access Overhead) Client Wait Server Response Cache Update Cache check+load (Memory and/or SSD read) Slab Allocation (w/ SSD write on Out-of-Mem) IPoIB-Mem RDMA-Mem H-RDMA-Def H-RDMA-Opt-Block H-RDMA-Opt-NonBb H-RDMA-Opt-NonB- H = Hybrid Memcached over SATA SSD Opt = Adaptive slab manager Block = Default Blocking API NonB-i = Non-blocking iset/iget API NonB-b = Non-blocking bset/bget API w/ buffer re-use guarantee 5194 Class at OSU 56

57 Performance Benefits for RDMA-gRPC with Micro-Benchmark Latency (us) Default grpc OSU RDMA grpc Latency (us) Default grpc OSU RDMA grpc Latency (us) Default grpc OSU RDMA grpc K 8K payload (Bytes) 0 16K 32K 64K 128K 256K 512K Payload (Bytes) 100 1M 2M 4M 8M Payload (Bytes) RDMA-gRPC RPC Latency grpc-rdma Latency on SDSC-Comet-FDR Up to 2.7x performance speedup over IPoIB for Latency for small messages Up to 2.8x performance speedup over IPoIB for Latency for medium messages Up to 2.5x performance speedup over IPoIB for Latency for large messages R. Biswas, X. Lu, and D. K. Panda, Accelerating grpc and TensorFlow with RDMA for High-Performance Deep Learning over InfiniBand, Under Review Class at OSU 57

58 5194 Class at OSU 58 Performance Benefit for TensorFlow - Resnet50 Images / Second grppc MPI 35% Batch Size Verbs grpc+rdma Images / Second TensorFlow Resnet50 performance evaluation on an IB EDR cluster Up to 35% performance speedup over IPoIB for 4 nodes. Up to 41% performance speedup over IPoIB for 8 nodes grppc MPI 41% Verbs grpc+rdma Batch Size 4 Nodes 8 Nodes

59 5194 Class at OSU 59 Images / Second Performance Benefit for TensorFlow - Inception grppc Verbs MPI grpc+rdma 27% Batch Size Images / Second grppc Verbs MPI grpc+rdma 36% Batch Size 4 Nodes 8 Nodes TensorFlow Inception3 performance evaluation on an IB EDR cluster Up to 27% performance speedup over IPoIB for 4 nodes Up to 36% performance speedup over IPoIB for 8 nodes.

60 RDMA-Kafka: High-Performance Message Broker for Streaming Workloads Latency (ms) % RDMA IPoIB Record Size H. Javed, X. Lu, and D. K. Panda, Characterization of Big Data Stream Processing Pipeline: A Case Study using Flink and Kafka, BDCAT Class at OSU 60 Throughput (MB/s) Record Size Kafka Producer Benchmark (1 Broker 4 Producers) Experiments run on OSU-RI2 cluster 2.4GHz 28 cores, InfiniBand EDR, 512 GB RAM, 400GB SSD Up to 98% improvement in latency compared to IPoIB Up to 7x increase in throughput over IPoIB 7x

61 Presentation Outline Overview MapReduce and RDD Programming Models Apache Hadoop, Spark, Memcached, grpc, and TensorFlow Challenges in Accelerating Hadoop, Spark, Memcached, grpc, and TensorFlow Acceleration Case Studies and In-Depth Performance Evaluation The High-Performance Big Data (HiBD) Project and Associated Releases Conclusion and Q&A 5194 Class at OSU 61

62 5194 Class at OSU 62 The High-Performance Big Data (HiBD) Project RDMA for Apache Spark RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x) Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions RDMA for Apache HBase RDMA for Memcached (RDMA-Memcached) RDMA for Apache Hadoop 1.x (RDMA-Hadoop) OSU HiBD-Benchmarks (OHB) HDFS, Memcached, HBase, and Spark Micro-benchmarks Users Base: 290 organizations from 34 countries More than 27,850 downloads from the project site Available for InfiniBand and RoCE Also run on Ethernet Available for x86 and OpenPOWER Support for Singularity and Docker

63 HiBD Release Timeline and Downloads Number of Downloads RDMA-Hadoop 1.x RDMA-Hadoop 1.x RDMA-Hadoop 1.x RDMA-Memcached & OHB RDMA-Hadoop 2.x RDMA-Hadoop 2.x RDMA-Hadoop 2.x RDMA-Memcached RDMA-Spark RDMA-HBase RDMA-Hadoop 2.x RDMA-Spark RDMA-Hadoop 2.x RDMA-Memcached & OHB RDMA-Spark Timeline 5194 Class at OSU 63

64 RDMA for Apache Hadoop 2.x Distribution High-Performance Design of Hadoop over RDMA-enabled Interconnects High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for HDFS, MapReduce, and RPC components Enhanced HDFS with in-memory and heterogeneous storage High performance design of MapReduce over Lustre Memcached-based burst buffer for MapReduce over Lustre-integrated HDFS (HHH-L-BB mode) Plugin-based architecture supporting RDMA-based designs for Apache Hadoop, CDH and HDP Support for OpenPOWER, Singularity, and Docker Current release: Based on Apache Hadoop Compliant with Apache Hadoop 2.8.0, HDP and CDH APIs and applications Tested with Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR) RoCE support with Mellanox adapters Various multi-core platforms (x86, POWER) Different file systems with disks and SSDs and Lustre Class at OSU 64

65 Different Modes of RDMA for Apache Hadoop 2.x HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation to have better fault-tolerance as well as performance. This mode is enabled by default in the package. HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations inmemory and obtain as much performance benefit as possible. HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster. HHH-L-BB: This mode deploys a Memcached-based burst buffer system to reduce the bandwidth bottleneck of shared file system access. The burst buffer design is hosted by Memcached servers, each of which has a local SSD. MapReduce over Lustre, with/without local disks: Besides, HDFS based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks. Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in different running modes (HHH, HHH-M, HHH- L, and MapReduce over Lustre) Class at OSU 65

66 5194 Class at OSU 66 RDMA for Apache Spark Distribution High-Performance Design of Spark over RDMA-enabled Interconnects High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for Spark RDMA-based data shuffle and SEDA-based shuffle architecture Non-blocking and chunk-based data transfer Off-JVM-heap buffer management Support for OpenPOWER Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB) Current release: Based on Apache Spark Tested with Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR) RoCE support with Mellanox adapters Various multi-core platforms (x86, POWER) RAM disks, SSDs, and HDD

67 HiBD Packages on SDSC Comet and Chameleon Cloud RDMA for Apache Hadoop 2.x and RDMA for Apache Spark are installed and available on SDSC Comet. Examples for various modes of usage are available in: RDMA for Apache Hadoop 2.x: /share/apps/examples/hadoop RDMA for Apache Spark: /share/apps/examples/spark/ Please (reference Comet as the machine, and SDSC as the site) if you have any further questions about usage and configuration. RDMA for Apache Hadoop is also available on Chameleon Cloud as an appliance M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE 16, July Class at OSU 67

68 5194 Class at OSU 68 Using HiBD Packages for Big Data Processing on Existing HPC Infrastructure

69 5194 Class at OSU 69 Using HiBD Packages for Big Data Processing on Existing HPC Infrastructure

70 5194 Class at OSU 70 RDMA for Apache HBase Distribution High-Performance Design of HBase over RDMA-enabled Interconnects High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for HBase Compliant with Apache HBase APIs and applications On-demand connection setup Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB) Current release: Based on Apache HBase Tested with Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR) RoCE support with Mellanox adapters Various multi-core platforms

71 RDMA for Memcached Distribution High-Performance Design of Memcached over RDMA-enabled Interconnects High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for Memcached and libmemcached components High performance design of SSD-Assisted Hybrid Memory Non-Blocking Libmemcached Set/Get API extensions Support for burst-buffer mode in Lustre-integrated design of HDFS in RDMA for Apache Hadoop-2.x Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB) Current release: Based on Memcached and libmemcached Compliant with libmemcached APIs and applications Tested with Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR) RoCE support with Mellanox adapters Various multi-core platforms SSD Class at OSU 71

72 OSU HiBD Micro-Benchmark (OHB) Suite HDFS, Memcached, HBase, and Spark Micro-benchmarks for Hadoop Distributed File System (HDFS) Sequential Write Latency (SWL) Benchmark, Sequential Read Latency (SRL) Benchmark, Random Read Latency (RRL) Benchmark, Sequential Write Throughput (SWT) Benchmark, Sequential Read Throughput (SRT) Benchmark Support benchmarking of Apache Hadoop 1.x and 2.x HDFS, Hortonworks Data Platform (HDP) HDFS, Cloudera Distribution of Hadoop (CDH) HDFS Micro-benchmarks for Memcached Get Benchmark, Set Benchmark, and Mixed Get/Set Benchmark, Non-Blocking API Latency Benchmark, Hybrid Memory Latency Benchmark Yahoo! Cloud Serving Benchmark (YCSB) Extension for RDMA-Memcached Micro-benchmarks for HBase Get Latency Benchmark, Put Latency Benchmark Micro-benchmarks for Spark GroupBy, SortBy Current release: Class at OSU 72

73 5194 Class at OSU 73 Performance Numbers of RDMA for Apache Hadoop 2.x RandomWriter & TeraGen in OSU-RI2 (EDR) Execution Time (s) IPoIB (EDR) Reduced by 3x IPoIB (EDR) Reduced by 4x 700 OSU-IB (EDR) 600 OSU-IB (EDR) Data Size (GB) Data Size (GB) RandomWriter TeraGen Cluster with 8 Nodes with a total of 64 maps Execution Time (s) RandomWriter 3x improvement over IPoIB for GB file size TeraGen 4x improvement over IPoIB for GB file size

74 Execution Time (s) Performance Numbers of RDMA for Apache Hadoop 2.x Sort & TeraSort in OSU-RI2 (EDR) Sort IPoIB (EDR) OSU-IB (EDR) Data Size (GB) Sort Cluster with 8 Nodes with a total of 64 maps and 14 reduces 61% improvement over IPoIB for GB data Reduced by 61% Execution Time (s) TeraSort IPoIB (EDR) OSU-IB (EDR) Data Size (GB) TeraSort Cluster with 8 Nodes with a total of 64 maps and 32 reduces 18% improvement over IPoIB for GB data Reduced by 18% Class at OSU 74

75 Performance Evaluation on SDSC-Comet-IB-FDR with RDMA-HDFS 5194 Class at OSU 75 8 Worker Nodes, HiBench Sort Total Time 8 Worker Nodes, HiBench TeraSort Total Time InfiniBand FDR, SSD, YARN-Client Mode, 192 Cores Benefit of RDMA-Spark and the capability of combining with other advanced technologies, such as RDMA-HDFS A combined version of RDMA-Spark+RDMA-HDFS can achieve the best performance RDMA vs. IPoIB HiBench Sort: Total time reduced by up to 38% over IPoIB through RDMA-Spark, up to 82% through RDMA-Spark+RDMA-HDFS HiBench TeraSort: Total time reduced by up to 17% over IPoIB through RDMA-Spark, up to 29% through RDMA-Spark+RDMA-HDFS

76 Presentation Outline Overview MapReduce and RDD Programming Models Apache Hadoop, Spark, Memcached, grpc, and TensorFlow Challenges in Accelerating Hadoop, Spark, Memcached, grpc, and TensorFlow Acceleration Case Studies and In-Depth Performance Evaluation The High-Performance Big Data (HiBD) Project and Associated Releases Conclusion and Q&A 5194 Class at OSU 76

77 5194 Class at OSU 77 Concluding Remarks Presented an overview of MapReduce and RDD programming models Presented an overview of Big Data, Hadoop, Spark, Memcached, grpc, and TensorFlow Provided an overview of Networking Technologies Discussed challenges in accelerating Hadoop, Spark, Memcached, grpc, and TensorFlow Presented basic and advanced designs to take advantage of InfiniBand/RDMA for HDFS, MapReduce, HBase, Spark, Memcached, grpc, and TensorFlow on HPC clusters and clouds Results are promising for Big Data processing and the associated Deep Learning tools Many other open issues need to be solved Will enable Big Data and Deep Learning communities to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner

5194 Class at OSU 78 Thank You! luxi@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda http://www.cse.ohio-state.edu/~luxi Network-Based Computing Laboratory http://nowlab.

78 5194 Class at OSU 78 Thank You! Network-Based Computing Laboratory The High-Performance Big Data Project

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing and Management

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing and Management SigHPC BigData BoF (SC 17) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu