Spark Over RDMA: Accelerate Big Data
SC Asia 2018, Ido Shamay, Mellanox Technologies



Apache Spark - Intro

Spark within the Big Data ecosystem:
Data Sources -> Data Acquisition / ETL -> Data Storage -> Data Analysis / ML -> Serving

Apache Spark 101: Quick Facts

In a nutshell: Spark is a data analysis platform with implicit data parallelism and fault tolerance.
- Initial release: May 2014
- Originally developed at UC Berkeley's AMPLab
- Donated as open source to the Apache Software Foundation
- Most active Apache open source project
- 50% of production systems are in public clouds

Hardware Acceleration in Big Data / Machine Learning Platforms

- Hardware acceleration adoption is continuously growing
- GPU integration is now standard
- FPGA/ASIC integration is spreading fast
- RDMA is already integrated in the mainstream code of popular frameworks: TensorFlow, Caffe2, CNTK
- Now it's Spark's and Hadoop's turn to catch up

What Is RDMA?

- Stands for Remote Direct Memory Access
- Advanced transport protocol (same layer as TCP and UDP)
- Modern RDMA comes from the InfiniBand L4 transport specification
- Full hardware implementation of the transport by the HCAs
- Remote memory READ/WRITE semantics (one-sided), in addition to SEND/RECV (two-sided)
- Uses kernel bypass / direct user-space access
- Supports zero-copy

RoCE: RDMA over Converged Ethernet
- The InfiniBand transport over UDP encapsulation
- Available for all Ethernet speeds, 10-100GbE
- Growing cloud support

Performance:
- Sub-microsecond latency
- Better CPU utilization
- High bandwidth

(Diagram: the application buffer is DMA'd directly by the network adapter, bypassing the sockets / TCP/IP / kernel driver path.)
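The one-sided READ/WRITE semantics above are the key difference from socket-style messaging: the reader pulls data out of the writer's memory without the writer's CPU taking part in the transfer. As a loose single-host analogy (plain shared memory, not real RDMA, which the HCA performs across the network), the sketch below shows a "reader" accessing a "writer's" registered region directly:

```python
# Analogy only: one-sided access to a registered memory region.
from multiprocessing import shared_memory

# "Writer" side: register a memory region and place data in it.
region = shared_memory.SharedMemory(create=True, size=64)
region.buf[:5] = b"hello"

# "Reader" side: attach to the region by name and read it directly.
# The writer takes no part in this access (one-sided semantics).
reader = shared_memory.SharedMemory(name=region.name)
data = bytes(reader.buf[:5])
print(data)  # b'hello'

reader.close()
region.close()
region.unlink()
```

With real RDMA the same pattern holds across machines: the region is registered with the HCA, and a remote peer issues an RDMA READ against it with zero copies and no kernel involvement on either side.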

Spark's Shuffle Internals: Under the Hood

MapReduce vs. Spark

Spark's in-memory model completely changed how shuffle is done:
- In both Spark and MapReduce, map output is saved on the local disk (usually in the buffer cache)
- In MapReduce, map output is then copied over the network to the destined reducer's local disk
- In Spark, map output is fetched over the network, on demand, into the reducer's memory

Memory-to-network-to-memory? RDMA is a perfect fit!
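As a toy illustration (plain Python, not Spark code) of the dataflow just described: each map task buckets its output by destination reducer, and each reduce task later pulls only its own buckets from every map task's output.

```python
def map_side(records, num_reducers):
    """One map task: partition output by destination reducer (hash partitioning)."""
    buckets = {r: [] for r in range(num_reducers)}
    for key, value in records:
        buckets[hash(key) % num_reducers].append((key, value))
    return buckets  # in Spark this lands on the map node's local disk / buffer cache

def reduce_side(reducer_id, all_map_outputs):
    """One reduce task: fetch its bucket from every map task's output."""
    fetched = []
    for buckets in all_map_outputs:
        fetched.extend(buckets[reducer_id])  # in Spark, fetched over the network
    return fetched

# Two map tasks, two reduce partitions (integer keys keep hashing deterministic):
map_outputs = [map_side([(1, "x"), (2, "y")], 2),
               map_side([(1, "z"), (3, "w")], 2)]
print(reduce_side(0, map_outputs))  # [(2, 'y')]
print(reduce_side(1, map_outputs))  # [(1, 'x'), (1, 'z'), (3, 'w')]
```

The "fetched over the network" step is exactly the memory-to-network-to-memory hop that RDMA can take over.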

Spark's Shuffle Basics

(Diagram: each map task reads its input split and writes a map output file to local disk; each reduce task then fetches its blocks from every map output file, with locations coordinated through the driver.)

Shuffle Read Protocol

1. Reader: request Map Statuses from the Driver
2. Driver: send back Map Statuses
3. Reader: group block locations by writer
4. Reader: request blocks from writers
5. Writer: locate blocks, and set up as stream
6. Reader: request blocks from the stream, one by one
7. Writer: locate block, send back
8. Reader: block data is now ready
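The eight steps above can be condensed into a small simulation (hypothetical class and method names, not Spark's real API) that makes the three actors and the message order explicit:

```python
class Driver:
    def __init__(self, map_statuses):
        self.map_statuses = map_statuses       # block id -> writer location

    def request_map_statuses(self):            # steps 1-2
        return dict(self.map_statuses)

class Writer:
    def __init__(self, blocks):
        self.blocks = blocks

    def open_stream(self, block_ids):          # steps 4-5: locate blocks, set up stream
        return [self.blocks[b] for b in block_ids]

    def send_block(self, stream, i):           # steps 6-7: one block at a time
        return stream[i]

def shuffle_read(driver, writers):
    statuses = driver.request_map_statuses()
    by_writer = {}                             # step 3: group block locations by writer
    for block, writer_id in statuses.items():
        by_writer.setdefault(writer_id, []).append(block)
    data = []
    for writer_id, block_ids in by_writer.items():
        stream = writers[writer_id].open_stream(block_ids)
        for i in range(len(stream)):
            data.append(writers[writer_id].send_block(stream, i))  # step 8
    return data

writers = {0: Writer({"b0": b"x"}), 1: Writer({"b1": b"y", "b2": b"z"})}
driver = Driver({"b0": 0, "b1": 1, "b2": 1})
print(sorted(shuffle_read(driver, writers)))  # [b'x', b'y', b'z']
```

Note how every block costs the writer CPU work (steps 5 and 7); this is the part that the RDMA variant later removes.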

The Cost of Shuffling

- Shuffling is very expensive in terms of CPU, RAM, disk, and network I/O
- Spark users try to avoid shuffles as much as they can
- Speedy shuffles can relieve developers of such concerns and simplify applications

SparkRDMA Shuffle Plugin: Accelerating Shuffle with RDMA

Design Approach

- The entire shuffle-related communication is done with RDMA:
  - RPC messaging for metadata transfers
  - Block transfers
- SparkRDMA is an independent plugin:
  - Implements the ShuffleManager interface
  - No changes to Spark's code: use with any existing Spark installation
- Reuses Spark facilities:
  - Maximizes reliability
  - Minimizes impact on code
- RDMA functionality is provided by DiSNI:
  - Open-source Java interface to RDMA user libraries
  - https://github.com/zrlio/disni
- No functionality loss of any kind; SparkRDMA supports:
  - Compression
  - Spilling to disk
  - Recovery from failed map or reduce tasks

ShuffleManager Plugin

- Spark allows external implementations of ShuffleManagers to be plugged in
- Configurable per job using: spark.shuffle.manager
- The interface allows proprietary implementations of shuffle Writers and Readers, and essentially defers the entire shuffle process to the new component
- SparkRDMA utilizes this interface to introduce RDMA into the shuffle process (SortShuffleManager is replaced by RdmaShuffleManager)
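In practice, enabling the plugin per job would look roughly like the spark-defaults entries below. This is a sketch: the JAR path is a placeholder, and the manager class name is the one published in the SparkRDMA repository, so verify both against your installed version.

```properties
# Hypothetical paths; the class name should match your SparkRDMA release.
spark.driver.extraClassPath    /path/to/spark-rdma.jar
spark.executor.extraClassPath  /path/to/spark-rdma.jar
spark.shuffle.manager          org.apache.spark.shuffle.rdma.RdmaShuffleManager
```

Because spark.shuffle.manager is per-job configuration, the same settings can instead be passed with --conf on spark-submit, limiting the plugin to shuffle-intensive jobs only.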

SparkRDMA Components

- SparkRDMA reuses the main shuffle Writer implementations of mainstream Spark: Unsafe, Sort, and BypassMergeSortShuffleWriter
- Shuffle data is written and stored identically to the original implementation
- An all-new ShuffleReader and ShuffleBlockResolver provide an optimized RDMA transport when blocks are being read over the network

(Diagram: SortShuffleManager uses UnsafeShuffleWriter / SortShuffleWriter, BlockStoreShuffleReader, and IndexShuffleBlockResolver; RdmaShuffleManager wraps the same writers in RdmaWrapperShuffleWriter and replaces the read path with RdmaShuffleReader and RdmaShuffleBlockResolver.)

Shuffle Read Protocol: Standard vs. RDMA

Standard:
1. Reader: request Map Statuses from the Driver
2. Driver: send back Map Statuses
3. Reader: group block locations by writer
4. Reader: request blocks from writers
5. Writer: locate blocks, and set up as stream
6. Reader: request blocks from the stream, one by one
7. Writer: locate block, send back
8. Reader: block data is now ready

RDMA:
1. Reader: request Map Statuses from the Driver
2. Driver: send back Map Statuses
3. Reader: group block locations by writer
4. Reader: RDMA-Read blocks from writers (no-op on the writer: the HW offloads the transfers)
5. Reader: block data is now ready

Standard vs. RDMA: Takeaways

Server-side:
- 0 CPU
- Shuffle transfers are not blocked by GC in the executor
- No buffering

Client-side:
- Instant transfers
- Reduced messaging
- Direct, unblocked access to remote blocks

Benefits

- Substantial improvements in:
  - Block transfer times: latency and total transfer time
  - Memory consumption and management
  - CPU utilization
- Easy to deploy and configure:
  - Supports your current Spark installation
  - Packed into a single JAR file
  - The plugin is enabled through a simple configuration handle
  - Allows finer tuning with a set of configuration handles
- Configuration and deployment are on a per-job basis:
  - Can be deployed incrementally
  - May be limited to shuffle-intensive jobs

Results

Performance Results: TeraSort

Testbed:
- HiBench TeraSort
- Workload: 175GB
- HDFS on Hadoop 2.6.0, no replication
- Spark 2.2.0: 1 Master, 16 Workers
- 28 active Spark cores on each node, 420 total

Node info:
- Intel Xeon E5-2697 v3 @ 2.60GHz
- RoCE 100GbE
- 256GB RAM
- HDD is used for Spark local directories and HDFS

(Chart: total runtime in seconds, Standard vs. RDMA; lower is better.)

Performance Results: GroupBy

Testbed:
- GroupBy, 48M keys
- Each value: 4096 bytes
- Workload: 183GB
- Spark 2.2.0: 1 Master, 15 Workers
- 28 active Spark cores on each node, 420 total

Node info:
- Intel Xeon E5-2697 v3 @ 2.60GHz
- RoCE 100GbE
- 256GB RAM
- HDD is used for Spark local directories and HDFS

(Chart: total runtime in seconds, Standard vs. RDMA; lower is better.)

Coming Up Next: HDFS+RDMA

HDFS+RDMA

- All-new implementation of RDMA acceleration for HDFS
- Implements a new DataNode and DFSClient
- Data transfers are done in zero-copy, with RDMA
- Lower CPU, lower latency, higher throughput
- Efficient memory utilization

Initial support:
- Hadoop: HDFS 2.6
- Cloudera: CDH 5.10
- WRITE operations over RDMA; READ operations are still carried over TCP in this version

Future:
- READ operations with RDMA
- Erasure coding offloads on HDFS 3.x
- NVMf

HDFS+RDMA Performance Results: DFSIO, TCP vs. RDMA

- CDH 5.10
- 16 x DataNodes, 1 x NameNode
- Single HDD per DataNode
- RoCE 100GbE

- Up to x1.25 speedup in total runtime
- Up to x1.43 speedup in throughput

(Charts: DFSIO write throughput in MB/s, higher is better; DFSIO write runtime in seconds, lower is better.)

Roadmap: What's Next

Roadmap

SparkRDMA v1.0 GA:
- Available at https://github.com/mellanox/sparkrdma
- Quick installation guide
- Wiki pages for advanced settings

SparkRDMA v2.0 GA (April 2018):
- Significant performance improvements
- All-new robust messaging protocol
- Highly efficient memory management

HDFS+RDMA v1.0 GA (Q2 2018):
- WRITE operations with RDMA

Thank You