Data Processing at the Speed of 100 Gbps using Apache Crail. Patrick Stuedi, IBM Research


The CRAIL Project: Overview
[Architecture diagram.] Data processing frameworks (e.g., Spark, TensorFlow, serverless/λ compute) sit on top of workload-specific I/O modules and interfaces:
- Spark-IO: shuffle/broadcast acceleration
- Albis: efficient storage of relational data
- Pocket: data sharing for serverless applications
- FS, Streaming, KV interfaces: fast sharing of ephemeral data
These run over Crail Store (100 Gbps, 10 μsec; alongside HDFS), which plugs into TCP, RDMA, NVMeF, and SPDK storage tiers over a fast network (e.g., 100 Gbps RoCE) and diverse media (DRAM, NVMe, PCM, GPU).

Outline
- Why CRAIL
- Crail Store
- Workload-specific I/O processing: file format, shuffle engine, serverless
- Use cases: disaggregation
- Workloads: SQL, machine learning

#1 Performance Challenge
Example: a sorting application on a data processing framework. A reduce task fetches a chunk over the network and processes it, and the I/O path crosses the entire stack: Sorter, Serializer, Netty, and the JVM, on top of sockets, TCP/IP, Ethernet, and the NIC on the network side, and the filesystem, block layer, iSCSI, and SSD on the storage side. Software overheads are spread over the entire stack. (HotNets '16)

#2 Diversity
Diverse hardware technologies, complex programming APIs (e.g., SPDK, RDMA verbs), and many frameworks.

#3 Ephemeral Data
Both serverless (AWS Lambda) applications and Spark applications generate ephemeral data, which has unique properties (e.g., a wide range of I/O sizes).

CRAIL Approach (1)
Abstract hardware via a high-level storage interface:
- hardware-specific plugins (storage tiers)
- most I/O operations can conveniently be implemented on a storage abstraction
- what is the right API? (FS, KV, ...?)

CRAIL Approach (2)
Filesystem-like interface:
- Hierarchical namespace: helps organize data (shuffle, tmp, etc.) for different jobs
- Separate data from metadata plane: reading/writing involves a block metadata lookup, which is cheap on a low-latency network (a few μsecs)
- Flexible: data objects can be of arbitrary size
- Specific data types: key-value files (last create wins); shuffle files (efficient reading of multiple files in a directory)
- Let applications control the details: a data placement policy decides which storage node or storage tier to use (sketched below)
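To make this concrete, here is a minimal sketch of creating a file through the FS-style interface, written in Scala against Crail's Java API. The calls are modeled on the example in the Apache Crail documentation, but exact method signatures vary across versions and should be treated as assumptions, not verbatim API.

// Hedged sketch, modeled on the Crail docs example; signatures are
// assumptions and may differ across Crail versions.
import org.apache.crail._

object CrailFsSketch {
  def main(args: Array[String]): Unit = {
    val conf  = CrailConfiguration.createConfigurationFromFile() // assumption
    val store = CrailStore.newInstance(conf)

    // Hierarchical namespace: per-job shuffle data lives under its own dir.
    // The storage/location class parameters are the placement hints the
    // slide mentions (which tier, which node).
    val file = store.create("/shuffle/job-0/f0",
        CrailNodeType.DATAFILE,
        CrailStorageClass.DEFAULT,
        CrailLocationClass.DEFAULT,
        true /* enumerable */)
      .get()   // one metadata lookup: a few usec on a low-latency network
      .asFile()

    val out = file.getBufferedOutputStream(1024)
    out.write("ephemeral bytes".getBytes)
    out.close()
    store.close()
  }
}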

CRAIL Approach (3)
Careful software design:
- Leverage user-level APIs: RDMA, NVMf, DPDK, SPDK
- Separate data from control operations (memory allocation, string parsing, etc.)
- Efficient non-blocking operations: avoid an army of threads, let the hardware do the work (see the sketch below)
- Leverage byte-addressable storage: transmit no more data than what is read/written
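The non-blocking point can be illustrated with a small sketch: one thread issues many reads, does useful work while the NIC or SSD moves the data, and reaps completions later. The stream type here is a purely hypothetical stub so the sketch is self-contained; Crail's own direct streams are similarly Future-based.

import java.nio.ByteBuffer
import java.util.concurrent.Future

// Hypothetical async stream, stubbed for illustration only.
trait AsyncStream { def read(buf: ByteBuffer): Future[Integer] }

object NonBlockingSketch {
  // Issue all reads first, reap completions later: one thread keeps many
  // requests in flight instead of an "army of threads" each blocking
  // on a single operation.
  def readAll(stream: AsyncStream, bufs: Seq[ByteBuffer]): Long = {
    val pending = bufs.map(stream.read)   // phase 1: non-blocking issue
    // ... the issuing thread is free here for sorting/deserialization ...
    pending.map(_.get().toLong).sum       // phase 2: reap completions
  }
}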

Crail Store: Architecture
- hierarchical namespace, multiple data types
- distributed storage over DRAM and flash
- files may span multiple storage tiers and can be read like a single file

Crail Store: Deployment Modes
Metadata servers, DRAM storage servers, flash storage servers, and applications can be placed flexibly:
- compute/storage co-located
- compute/storage disaggregated
- flash storage disaggregation
(A hedged configuration sketch follows below.)
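The choice between these modes is largely a matter of where the storage servers run and which tiers are configured. As a minimal sketch of a crail-site.conf for a disaggregated DRAM-plus-flash deployment: the property and tier class names below follow the Crail configuration docs, but treat them as assumptions rather than verified values.

# Hedged sketch of $CRAIL_HOME/conf/crail-site.conf; names are assumptions.
crail.namenode.address   crail://metadata-server:9060
crail.storage.types      org.apache.crail.storage.rdma.RdmaStorageTier,org.apache.crail.storage.nvmf.NvmfStorageTier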

Crail Store: Read Throughput
[Plots: single-client (1 core) read throughput vs. buffer size; DRAM tier (Crail vs. Alluxio, GB/s) and NVMf tier (direct vs. buffered, Gbit/s).]
Crail reaches line speed at an I/O size of 1KB.

Crail Store: Read Latency
[Plots: remote DRAM read latency vs. value size (RAMCloud read in C and Java vs. Crail lookup & read, and Crail lookup only); remote NVMe SSD (3D XPoint) read latency via NVMf.]
Crail remote read latencies (DRAM and NVM) are very close to the hardware latencies.

Metadata Server Scalability
[Plot: metadata IOPS (millions) vs. number of clients, for 1, 2, and 4 namenodes; IOPS scale with the number of namenodes and network interfaces.]
A single metadata server can process 10M Crail lookup ops/sec.

Running Workloads: MapReduce
[Crail project overview diagram repeated; see above.]

Spark Shuffle over Crail
- map: each executor hash-partitions its output and appends it to a per-mapper file (f0, f1, ..., fn) in the directory of the target bucket (/shuffle/1/, /shuffle/2/, /shuffle/3/, ...)
- reduce: each reducer fetches its bucket by reading all files in the corresponding directory
[Plots: Spark GroupBy (80M keys, 4K values), network throughput (Gbit/s) over elapsed time for 1, 4, and 8 cores; vanilla Spark vs. Spark/Crail, with annotated gains of 2x, 2.5x, and 5x.]

val pairs = sc.parallelize(1 to tasks, tasks).flatMap { _ =>
  var values = new Array[(Long, Array[Byte])](numKeys)
  values = initValues(values)
  values
}.cache().groupByKey().count()
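In the snippet, each task materializes numKeys (Long, byte-array) pairs and caches them; groupByKey then forces an all-to-all shuffle of the 80M keys, so the end-to-end runtime is dominated by how fast shuffle data can be appended and fetched, which is exactly the path the Crail shuffler accelerates.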

Sorting 12.8 TB on 128 Nodes
[Plots: per-task sorting time (sec) and network throughput (Gb/s) vs. task ID, vanilla Spark vs. Spark/Crail, compared against a native C distributed sorting benchmark.]
The sorting rate of Spark/Crail is only 27% slower than the rate of the 2016 sort benchmark winner.

DRAM & Flash Disaggregation
[Plot: 200GB Spark sorting runtime (sec), broken down into map, reduce, and input/output/shuffle, for vanilla Spark with Alluxio vs. Crail/DRAM vs. Crail/NVMe, with storage nodes separated from compute nodes.]
Using Crail, a Spark 200GB sorting workload can be run with memory and flash disaggregated at no extra cost.

Running Workloads: SQL
[Crail project overview diagram repeated; see above.]

Reading Relational Data
[Plot: read goodput (Gbps) of common file formats against the hardware limit.]
None of the common file formats delivers performance close to the hardware speed.

Revisiting Design Principles
Traditional assumption: CPU is fast, I/O is slow
- use compression, encoding, etc.
- pack data and metadata together to avoid metadata lookups
This is a mismatch in the case of fast I/O hardware! Albis is a new file format designed for fast I/O hardware, with two design principles (see the back-of-envelope sketch below):
- avoid CPU pressure, i.e., no compression, encoding, etc.
- simple metadata management
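As a back-of-envelope illustration of the mismatch (all numbers here are assumptions chosen for the sketch, not measurements from the talk): at 100 Gbps the wire moves about 12.5 GB/s, while a single core typically decompresses on the order of 1 GB/s, so a format that trades CPU cycles for bandwidth inverts its original benefit.

// Napkin math, in Scala; illustrative assumed rates, not measurements.
object AlbisNapkin {
  val wireGBps      = 12.5 // 100 Gbps network, in GB/s
  val decompPerCore = 1.0  // assumed decompressed output per core, GB/s
  val ratio         = 2.0  // assumed compression ratio

  def main(args: Array[String]): Unit = {
    // Uncompressed: the wire is the bottleneck.
    val plain = wireGBps
    // Compressed: the wire could deliver wireGBps * ratio of logical data,
    // but one core only produces decompPerCore of decompressed output.
    val compressed = math.min(wireGBps * ratio, decompPerCore)
    println(f"uncompressed ~ $plain%.1f GB/s, compressed ~ $compressed%.1f GB/s per core")
  }
}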

Reading Relational Data with Albis
Albis on Crail delivers 2-30x performance improvements over other formats.

TPC-DS using Albis/Crail
Albis/Crail delivers up to 2.5x performance gains.

Running Workloads: Serverless
[Crail project overview diagram repeated; see above.]

Serverless Computing
Data sharing is implemented using remote storage, which enables fast and fine-grained scaling. Problem: existing storage platforms are not suitable:
- slow (e.g., S3)
- no dynamic scaling (e.g., Redis)
- designed for either small or large data sets
Can we use Crail? Not as is: most clouds don't support RDMA, NVMf, etc., and Crail lacks automatic & elastic resource management.

Pocket
An elastic distributed data store for ephemeral data sharing in serverless analytics.
[Plot: execution time (sec) vs. resource usage ($/hr) for different resource allocations.]
Pocket dynamically rightsizes storage resources (nodes, media) to find a point with a good performance/price ratio.

Pocket: Resource Utilization
Pocket cost-effectively allocates resources based on user/framework hints; a usage sketch follows below.
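To make the hint-driven usage concrete, here is a minimal client-side sketch. The trait is a hypothetical stub so nothing is claimed about the real client library; the method names loosely mirror the register/put/get API described in the Pocket paper (OSDI '18) and are assumptions.

// Hypothetical client interface, stubbed for illustration only.
trait PocketClient {
  def registerJob(name: String, hints: Map[String, String]): String // jobId
  def put(jobId: String, key: String, data: Array[Byte]): Unit
  def get(jobId: String, key: String): Array[Byte]
  def deregisterJob(jobId: String): Unit // lets the controller downsize
}

object PocketSketch {
  def run(pocket: PocketClient): Unit = {
    // Hints let the controller rightsize nodes and media up front.
    val job = pocket.registerJob("analytics-job",
      Map("latencySensitive" -> "true", "peakThroughputGbps" -> "40"))
    pocket.put(job, "shuffle/part-0", Array.fill(4096)(0: Byte)) // map lambda
    val bytes = pocket.get(job, "shuffle/part-0")                // reduce lambda
    pocket.deregisterJob(job)
  }
}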

Conclusions
Effectively using high-performance I/O hardware for data processing is challenging. Crail is an attempt to re-think how data processing systems should interact with network and storage hardware:
- user-level I/O
- storage disaggregation
- memory/flash convergence
- elastic resource provisioning

References
- Crail: A High-Performance I/O Architecture for Distributed Data Processing, IEEE Data Engineering Bulletin, 2017
- Albis: High-Performance File Format for Big Data Systems, USENIX ATC '18
- Navigating Storage for Serverless Computing, USENIX ATC '18
- Pocket: Elastic Ephemeral Storage for Serverless Analytics, OSDI '18 (to appear)
- Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash, Spark Summit '17
- Serverless Machine Learning using Crail, Spark Summit '18
- Apache Crail: http://crail.apache.org

Contributors Animesh Trivedi, Jonas Pfefferle, Bernard Metzler, Adrian Schuepbach, Ana Klimovic, Yawen Wang, Michael Kaufmann, Yuval Degani,...