Bring x3 Spark Performance Improvement with PCIe SSD. Yucai Yu (yucai.yu@intel.com), BDT/STO/SSG. January 2016

About me/us. Me: Spark contributor; previously worked on virtualization, storage, and mobile/IoT OS. Us: the Intel Spark team, working on Spark upstream development, including core, Spark SQL, SparkR, GraphX, machine learning, etc. Top 3 in contributions in 2015, with 3 committers and two publications.

Agenda: PCIe SSD Overview; Use PCIe SSD to accelerate computing; Secret of SSD acceleration in big data.

PCIe SSD Overview

Agenda: PCIe SSD Overview; Use PCIe SSD to accelerate computing; Secret of SSD acceleration in big data.

Use PCIe SSD to accelerate computing - Motivation. Customers' servers usually already have HDDs (typically 7-11), so we propose adding one PCIe SSD as a cache for hot data, with the HDDs as backing storage. Tachyon is an existing solution, but: it only supports RDD cache, not shuffle data; it is an extra software component, with extra deployment and maintenance effort; and it costs extra performance to run the Tachyon daemon and the inter-process communication.

Use PCIe SSD to accelerate computing - Implementation. When Spark core allocates files (either for RDD cache or shuffle), it takes them from the PCIe SSD first; once the PCIe SSD's usable space drops below a threshold, it starts taking files from the HDDs. YARN dynamic allocation is also supported.
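
The allocation policy can be illustrated with a short Scala sketch. This is not the actual patch; the class and parameter names (TieredDirAllocator, ssdFreeSpaceThreshold) are hypothetical, but the flow follows the description above: prefer local dirs whose path contains the SSD keyword until their usable space falls below the threshold, then fall back to the HDD dirs.

    import java.io.File

    // Hypothetical sketch only, not Spark's real internals.
    class TieredDirAllocator(localDirs: Seq[File], ssdFreeSpaceThreshold: Long) {
      // Dirs whose path contains the "ssd" keyword form the fast tier.
      private val (ssdDirs, hddDirs) = localDirs.partition(_.getPath.contains("ssd"))

      // Pick a directory for a new block file: SSD first, HDDs once SSD space is low.
      def chooseDir(fileName: String): File = {
        val ssdWithRoom = ssdDirs.filter(_.getUsableSpace > ssdFreeSpaceThreshold)
        val candidates = if (ssdWithRoom.nonEmpty) ssdWithRoom else hddDirs
        // Spread files across the chosen tier by hashing the file name.
        candidates((fileName.hashCode & Int.MaxValue) % candidates.size)
      }
    }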

Use PCIe SSD to accelerate computing - Usage. 1. Set the priority and threshold in spark-defaults.conf. 2. Configure the SSD location: just put a keyword like "ssd" in the local dir path, for example in yarn-site.xml (see the sketch below).
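
The yarn-site.xml example on the original slide is a screenshot; the fragment below is only an illustration of the idea, with hypothetical mount points, showing the standard yarn.nodemanager.local-dirs property with the "ssd" keyword embedded in the fast directory's path:

    <!-- Illustrative paths only; use the real mount points of the PCIe SSD and HDDs. -->
    <property>
      <name>yarn.nodemanager.local-dirs</name>
      <value>/mnt/pcie_ssd/yarn/local,/mnt/hdd1/yarn/local,/mnt/hdd2/yarn/local</value>
    </property>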

Real-world Spark adoptions - Benchmarking workloads. Graph analysis characteristics: 1. uses RDD cache for iterative computations; 2. involves shuffle operations heavily. Workload: NWeight. Category: Graph Analysis. Description: compute associations between two vertices that are n hops away (e.g., friend-to-friend associations, or similarities between videos for recommendation). Rationale: iterative graph-parallel algorithm, implemented with Bagel (Pregel on Spark) and/or GraphX (the new graph-parallel framework on Spark). Customer: a real CSP customer application.

NWeight Introduction. Computes associations between two vertices that are n hops away, e.g., friend-to-friend associations, or similarities between videos for recommendation. (Figure: an initial directed graph with weighted edges and the resulting 2-hop associations; e.g., one 2-hop association is computed as 0.6*0.1 + 0.3*0.2 = 0.12, summing the products of edge weights along the two-hop paths.)
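
As a concrete illustration of the computation, here is a minimal RDD-based sketch of the 2-hop case. It is not the Bagel/GraphX implementation used in the benchmark, and the edge list is a small made-up graph that simply reproduces the arithmetic of the worked example above (0.6*0.1 + 0.3*0.2 = 0.12).

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: plain RDD joins, not the Bagel/GraphX code used in the benchmark.
    object TwoHopSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("nweight-2hop-sketch"))

        // Directed weighted edges: (src, (dst, weight)).
        val edges = sc.parallelize(Seq(
          ("a", ("b", 0.6)), ("b", ("d", 0.1)),
          ("a", ("c", 0.3)), ("c", ("d", 0.2))))

        // Join tail-to-head: a->m and m->d form a 2-hop path a->d with weight w1*w2.
        val twoHop = edges.map { case (src, (mid, w1)) => (mid, (src, w1)) }
          .join(edges)                                  // (mid, ((src, w1), (dst, w2)))
          .map { case (_, ((src, w1), (dst, w2))) => ((src, dst), w1 * w2) }
          .reduceByKey(_ + _)                           // sum over intermediate vertices

        twoHop.collect().foreach(println)               // ((a,d), ~0.12)
        sc.stop()
      }
    }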

PCIe SSD hierarchy store performance report (#A; normalized execution speed, the higher the better). Pure SSD scenario: 1 PCIe SSD performs the same as 11 SATA SSDs (the SSD shifts the bottleneck to the CPU). For our hierarchy store solution: no extra overhead, i.e., best case it matches pure SSD (PCIe/SATA SSD) and worst case it matches pure HDDs. Compared with 11 HDDs, at least x1.86 improvement (CPU limitation). Compared with Tachyon, it still shows a x1.3 performance advantage: it caches both RDD and shuffle data, with no inter-process communication. (Chart: bars for 11 HDDs all in HDDs (1.0), hierarchy store with 0 GB SSD quota all in HDDs (1.0), Tachyon (1.38), hierarchy store with 300 GB SSD quota (1.78) and 500 GB SSD quota (1.82), and all-in-SSD on 1 PCIe SSD and on 11 SATA SSDs (1.85 and 1.86).)

Agenda: PCIe SSD Overview; Use PCIe SSD to accelerate computing; Secret of SSD acceleration in big data.

Deep dive into a real customer case: NWeight, x3 improvement! Per-stage data volumes and durations (11 HDDs vs PCIe SSD):
Stage 23, saveAsTextFile at BagelNWeight.scala:102: 50.1 GB / 27.6 GB; 27 s vs 20 s
Stage 17, foreach at Bagel.scala:256: 732.0 GB / 490.4 GB; 23 min vs 7.5 min
Stage 16, flatMap at Bagel.scala:96: 732.0 GB / 490.4 GB; 15 min vs 13 min
Stage 11, foreach at Bagel.scala:256: 590.2 GB / 379.5 GB; 25 min vs 11 min
Stage 10, flatMap at Bagel.scala:96: 590.2 GB / 379.6 GB; 12 min vs 10 min
Stage 6, foreach at Bagel.scala:256: 56.1 GB / 19.1 GB; 4.9 min vs 3.7 min
Stage 5, flatMap at Bagel.scala:96: 56.1 GB / 19.1 GB; 1.5 min vs 1.5 min
Stage 2, foreach at Bagel.scala:256: 15.3 GB; 38 s vs 39 s
Stage 1, parallelize at BagelNWeight.scala:97: 38 s vs 38 s
Stage 0, flatMap at BagelNWeight.scala:72: 22.6 GB / 15.3 GB; 46 s vs 46 s

5 main IO patterns. RDD: rdd_read_in_map (map stage); rdd_read_in_reduce and rdd_write_in_reduce (reduce stage). Shuffle: shuffle_write_in_map (map stage); shuffle_read_in_reduce (reduce stage).

How do we do IO characterization? We use blktrace* to monitor each IO issued to disk, for example: start writing 560 sectors at address 52090704; start reading 256 sectors at address 13637888; finish the previous read command (13637888 + 256); finish the previous write command (52090704 + 560). We parse this raw information and generate 4 kinds of charts: IO size histogram, latency histogram, seek distance histogram, and LBA timeline, from which we can identify whether the IO is sequential or random. * blktrace is a kernel block-layer IO tracing mechanism that provides detailed information about disk request-queue operations up to user space.
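
The parsing step can be sketched as follows. The sketch assumes the blktrace output has already been rendered to text (e.g. with blkparse) and reduced to simplified lines of the form "<R|W> <start_sector> <num_sectors>"; real blkparse output carries more fields, so this is an illustration of the seek-distance idea, not a drop-in parser.

    import scala.io.Source

    object SeekDistanceSketch {
      case class Io(op: Char, start: Long, sectors: Long)

      def main(args: Array[String]): Unit = {
        // Simplified trace lines: "<R|W> <start_sector> <num_sectors>".
        val ios = Source.fromFile(args(0)).getLines().map { line =>
          val Array(op, start, n) = line.trim.split("\\s+")
          Io(op.head, start.toLong, n.toLong)
        }.toSeq

        // Seek distance = gap between the end of one request and the start of the next;
        // a large share of zero distances means the access pattern is sequential.
        val seekDistances = ios.sliding(2).collect {
          case Seq(prev, cur) => cur.start - (prev.start + prev.sectors)
        }.toSeq

        val zeroRatio = seekDistances.count(_ == 0).toDouble / seekDistances.size.max(1)
        println(f"zero-seek ratio: ${zeroRatio * 100}%.1f%% (high => sequential IO)")
      }
    }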

RDD Read in Map: sequential. (Charts: red is read, green is write.) Big IO size; sequential data distribution; mostly zero seek distance; low latency. A classic hard disk has an 8-9 ms seek time and spins at 7200 RPM (about 4.2 ms average rotational latency), so a single random access costs roughly 13 ms.

Shuffle Read in Reduce: random. (Charts: red is read, green is write.) Small IO size; random data distribution; few zero seek distances; high latency.

Shuffle Write in Map: sequential. (Charts: red is read, green is write.) Big IO size; sequential data distribution; mostly zero seek distance.

RDD Read in Reduce: sequential. (Charts: red is read, green is write.) Big IO size; mostly zero seek distance; sequential data distribution; low latency.

RDD Write in Reduce: sequential write, but with frequent 4K reads. Those 4K reads are probably caused by spilling in cogroup, possibly a Spark issue. (Charts: red is read, green is write.) Sequential data location; the write IO size is big, but there are many small 4K read IOs.

Overall disk IO picture. (Chart: LBA timeline of 1 of the 11 HDDs; red is read, green is write; alternating map/reduce phases, each annotated with shuffle write, shuffle read, RDD write, and RDD read regions.) Shuffle read is very random, while the others are sequential.

Conclusion. RDD read/write and shuffle write are sequential; shuffle read is random.
rdd_read_in_map: sequential
shuffle_write_in_map: sequential
rdd_read_in_reduce: sequential
rdd_write_in_reduce: sequential
shuffle_read_in_reduce: random

Using SSD to speed up shuffle read in reduce. CPU is still the bottleneck! x2 improvement for shuffle read in reduce, x3 improvement in real shuffle, x2 improvement in E2E testing. (Charts: per-disk bandwidth when shuffle reading from HDD vs from SSD shows only about 40 MB/s per disk at most from HDD, and the SSD is much better, especially in this stage; the 11-HDD sum; shuffle read from HDD leads to high IO wait.) Per-stage shuffle volumes and durations (SSD-RDD + HDD-Shuffle vs 1 SSD):
saveAsTextFile at BagelNWeight.scala: 20 s vs 20 s
foreach at Bagel.scala (490.3 GB): 14 min vs 7.5 min
flatMap at Bagel.scala (490.4 GB): 12 min vs 13 min
foreach at Bagel.scala (379.5 GB): 13 min vs 11 min
flatMap at Bagel.scala (379.6 GB): 10 min vs 10 min
foreach at Bagel.scala (19.1 GB): 3.5 min vs 3.7 min
flatMap at Bagel.scala (19.1 GB): 1.5 min vs 1.5 min
foreach at Bagel.scala (15.3 GB): 38 s vs 39 s
parallelize at BagelNWeight.scala: 38 s vs 38 s
flatMap at BagelNWeight.scala (15.3 GB): 46 s vs 46 s

What if CPU is not the bottleneck? NWeight: x3-5 improvement for shuffle, x2 improvement for the map stage, x3 improvement in E2E testing. Per-stage data volumes and durations (11 HDDs vs PCIe SSD vs HSW; systems #A and #B are described in the backup):
Stage 23, saveAsTextFile at BagelNWeight.scala:102: 50.1 GB / 27.6 GB; 27 s vs 20 s vs 26 s
Stage 17, foreach at Bagel.scala:256: 732.0 GB / 490.4 GB; 23 min vs 7.5 min vs 4.6 min
Stage 16, flatMap at Bagel.scala:96: 732.0 GB / 490.4 GB; 15 min vs 13 min vs 6.3 min
Stage 11, foreach at Bagel.scala:256: 590.2 GB / 379.5 GB; 25 min vs 11 min vs 7.1 min
Stage 10, flatMap at Bagel.scala:96: 590.2 GB / 379.6 GB; 12 min vs 10 min vs 5.3 min
Stage 6, foreach at Bagel.scala:256: 56.1 GB / 19.1 GB; 4.9 min vs 3.7 min vs 2.8 min
Stage 5, flatMap at Bagel.scala:96: 56.1 GB / 19.1 GB; 1.5 min vs 1.5 min vs 47 s
Stage 2, foreach at Bagel.scala:256: 15.3 GB; 38 s vs 39 s vs 36 s
Stage 1, parallelize at BagelNWeight.scala:97: 38 s vs 38 s vs 35 s
Stage 0, flatMap at BagelNWeight.scala:72: 22.6 GB / 15.3 GB; 46 s vs 46 s vs 43 s

We're hiring! WeChat: 186 1658 3742 / Lex; email: yucai.yu@intel.com. Do you love the challenges of working with systems that host petabytes of data and many tens of thousands of cores? Do you want to build the next generation of Big Data technologies and tackle challenges in operating systems, file systems, data storage, databases, networking, distributed computing, machine learning, and data mining?

BACKUP

SUT #A (IVB, E5-2680 v2).
Master: CPU Intel(R) Xeon(R) E5-2680 @ 2.70GHz (16 cores); Memory 64G; Disk 2 SSDs; Network 1 Gigabit Ethernet.
Slaves: 4 nodes; CPU Intel(R) Xeon(R) E5-2680 v2 @ 2.80GHz (2 CPUs, 10 cores each, 40 threads); Memory 192G DDR3 1600MHz; Disk 11 HDDs / 11 SSDs / 1 PCIe SSD (P3600); Network 10 Gigabit Ethernet.
Software: OS Red Hat 6.2; Kernel 3.16.7 (upstream); Spark 1.4.1; Hadoop/HDFS Hadoop-2.5.0-cdh5.3.2; JDK Sun HotSpot JDK 1.8.0 (64-bit); Scala 2.10.4.

SUT #B (HSW, E5-2699 v3).
Master: CPU Intel(R) Xeon(R) X5570 @ 2.93GHz (16 cores); Memory 48G; Disk 2 SSDs; Network 1 Gigabit Ethernet.
Slaves: 4 nodes; CPU Intel(R) Xeon(R) E5-2699 v3 @ 2.30GHz (2 CPUs, 18 cores each, 72 threads); Memory 256G DDR4 2133MHz; Disk 11 SSDs; Network 10 Gigabit Ethernet.
Software: OS Ubuntu 14.04.2 LTS; Kernel 3.16.0-30-generic x86_64; Spark 1.4.1; Hadoop/HDFS Hadoop-2.5.0-cdh5.3.2; JDK Sun HotSpot JDK 1.8.0 (64-bit); Scala 2.10.4.

Test configuration. Number of executors: 32; executor memory: 18G; executor cores: 5. spark-defaults.conf:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.referenceTracking false
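
For reference, the executor settings above map to standard Spark configuration keys (equivalently, to spark-submit's --num-executors, --executor-memory and --executor-cores options); shown here as spark-defaults.conf entries:

    spark.executor.instances  32
    spark.executor.memory     18g
    spark.executor.cores      5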

HDD (Seagate ST9500620NS) SPEC

PCIe SSD (P3600) SPEC
