Bring x3 Spark Performance Improvement with PCIe SSD. Yucai, Yu BDT/STO/SSG January, 2016


1 Bring x3 Spark Performance Improvement with PCIe SSD Yucai, Yu BDT/STO/SSG January, 2016

2 About me/us
Me: Spark contributor; previously worked on virtualization, storage, and mobile/IoT OS.
Intel Spark team: works on Spark upstream development, including core, Spark SQL, SparkR, GraphX, machine learning, etc. Top-3 contributor in 2015, with 3 committers.
Two publications:

3 Agenda
1. PCIe SSD Overview
2. Use PCIe SSD to accelerate computing
3. Secret of SSD acceleration in big data

4 PCIe SSD Overview

5 Agenda
1. PCIe SSD Overview
2. Use PCIe SSD to accelerate computing
3. Secret of SSD acceleration in big data

6 Use PCIe SSD to accelerate computing - Motivation
Customer servers usually already have HDDs (typically 7-11), so we propose adding 1 PCIe SSD as a cache for hot data, with the HDDs as backing storage.

7 Use PCIe SSD to accelerate computing - Motivation
Customer servers usually already have HDDs (typically 7-11), so we propose adding 1 PCIe SSD as a cache for hot data, with the HDDs as backing storage.
Tachyon is an existing solution, but:
- It only supports the RDD cache, not shuffle data.
- It is an extra software component, with extra deployment and maintenance effort.
- Running the Tachyon daemon and its inter-process communication costs extra performance.

8 Use PCIe SSD to accelerate computing - Implementation
When Spark core allocates files (for either RDD cache or shuffle), it takes them from the PCIe SSD first. Once the PCIe SSD's usable space drops below a threshold, it allocates files on the HDDs instead.
YARN dynamic allocation is also supported.
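The allocation policy on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual Spark change: `pick_local_dir`, `min_free_bytes`, and the injected `free_bytes` callback are hypothetical names, and spreading files over a tier with a CRC of the file name is an assumption made for the sketch.

```python
import zlib
from typing import Callable, Sequence


def pick_local_dir(
    file_name: str,
    ssd_dirs: Sequence[str],
    hdd_dirs: Sequence[str],
    free_bytes: Callable[[str], int],
    min_free_bytes: int,
) -> str:
    """Choose a directory for a new RDD-cache or shuffle file.

    SSD directories are preferred. Once their free space falls below
    min_free_bytes (a hypothetical threshold), HDD directories are used.
    """
    # Keep only SSD directories that still have enough usable space.
    candidates = [d for d in ssd_dirs if free_bytes(d) >= min_free_bytes]
    if not candidates:
        # The SSD tier is (nearly) full: fall back to the HDD tier.
        candidates = list(hdd_dirs)
    if not candidates:
        raise ValueError("no local directories configured")
    # Assumption: spread files over the chosen tier with a stable hash.
    index = zlib.crc32(file_name.encode("utf-8")) % len(candidates)
    return candidates[index]
```

For a dry run, `free_bytes` can be a dictionary lookup; on a real machine it could be `lambda d: shutil.disk_usage(d).free`.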

9 Use PCIe SSD to accelerate computing - Usage
1. Set the priority and threshold in spark-defaults.conf.
2. Configure the SSD location: put a keyword such as "ssd" in the local dir path. This can be set, for example, in yarn-site.xml.

10 Real-world Spark adoptions: benchmarking workloads
Graph analysis characteristics:
1. Uses the RDD cache for iterative computations.
2. Involves shuffle operations heavily.
Workload: NWeight
Category: Graph analysis
Description: Computes associations between two vertices that are n hops away (e.g., friend-to-friend associations, or similarities between videos for recommendation).
Rationale: Iterative graph-parallel algorithm, implemented with Bagel (Pregel on Spark) and/or GraphX (new graph-parallel framework on Spark).
Customer: Real CSP customer application.

11 NWeight Introduction
Computes associations between two vertices that are n hops away, e.g., friend-to-friend associations, or similarities between videos for recommendation.
[Figure: an initial directed graph over vertices a-f with weighted edges, next to the derived 2-hop associations. Surviving labels: (f, 0.24), (e, 0.30) on the initial graph; (d, 0.6* *0.2 = 0.12), (f, 0.12), (e, 0.15) on the 2-hop graph.]
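A minimal Python sketch of the n-hop idea follows. It assumes that a path's weight is the product of its edge weights (as the surviving "0.6 ... 0.2 = 0.12" label suggests) and that parallel paths to the same vertex are summed; both the summing rule and the exclusion of paths that return to the source are assumptions, and `n_hop_associations` is a hypothetical name, not the Bagel/GraphX implementation used in the benchmark.

```python
from collections import defaultdict
from typing import Dict

Graph = Dict[str, Dict[str, float]]


def n_hop_associations(edges: Graph, n: int) -> Graph:
    """Return, per source vertex, the weights of vertices exactly n hops away."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # After zero extensions, "current" holds the 1-hop (direct) weights.
    current: Graph = {src: dict(nbrs) for src, nbrs in edges.items()}
    for _ in range(n - 1):
        extended: Graph = {}
        for src, reached in current.items():
            acc = defaultdict(float)
            for mid, weight_so_far in reached.items():
                for dst, edge_weight in edges.get(mid, {}).items():
                    if dst != src:  # assumption: ignore paths back to the source
                        # Multiply along the path, sum across parallel paths.
                        acc[dst] += weight_so_far * edge_weight
            extended[src] = dict(acc)
        current = extended
    return current


# Hypothetical weights, chosen so that one 2-hop path multiplies 0.6 by 0.2.
graph = {"a": {"b": 0.6}, "b": {"d": 0.2}}
two_hop = n_hop_associations(graph, 2)
```

Each extra hop extends every known path by one edge, which is why the real workload is iterative and shuffle-heavy: every iteration joins the current associations against the edge list.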

12 PCIe SSD hierarchy store performance report (SUT #A)
Pure SSD scenario: 1 PCIe SSD performs the same as 11 SATA SSDs (SSD shifts the bottleneck to the CPU).
For our hierarchy store solution:
- No extra overhead: the best case matches pure SSD (PCIe/SATA SSD), and the worst case matches pure HDDs.
- Compared with 11 HDDs: at least x1.86 improvement (limited by the CPU).
- Compared with Tachyon: still a x1.3 performance advantage, because it caches both RDD and shuffle data and needs no inter-process communication.
[Chart: normalized execution speed (the higher the better) for the 1 PCIe SSD + HDDs hierarchy store. Configurations shown: 11 HDDs (all in HDDs); Tachyon with an SSD and everything else in HDDs; hierarchy store with a 300 GB SSD quota; hierarchy store with a 500 GB SSD quota; 1 PCIe SSD (all in SSD); 11 SATA SSDs (all in SSD).]

13 Agenda
1. PCIe SSD Overview
2. Use PCIe SSD to accelerate computing
3. Secret of SSD acceleration in big data

14 Deep dive into a real customer case
NWeight: x3 improvement!!
Sizes are listed in the order they appear under the Input, Output, Shuffle Read, and Shuffle Write columns; [...] marks a value missing from this transcript.

Stage  Description                               Sizes                 11 HDDs   PCIe SSD
23     saveAsTextFile at BagelNWeight.scala:102  50.1 GB, 27.6 GB      27 s      20 s
17     foreach at Bagel.scala:256                [...] GB, [...] GB    23 min    7.5 min
16     flatMap at Bagel.scala:96                 [...] GB, [...] GB    15 min    13 min
11     foreach at Bagel.scala:256                [...] GB, [...] GB    25 min    11 min
10     flatMap at Bagel.scala:96                 [...] GB, [...] GB    12 min    10 min
6      foreach at Bagel.scala:256                56.1 GB, 19.1 GB      4.9 min   3.7 min
5      flatMap at Bagel.scala:96                 56.1 GB, 19.1 GB      1.5 min   1.5 min
2      foreach at Bagel.scala:256                15.3 GB               38 s      39 s
1      parallelize at BagelNWeight.scala:97      (none)                38 s      38 s
0      flatMap at BagelNWeight.scala:72          22.6 GB, 15.3 GB      46 s      46 s

15 Five main IO patterns

         Map stage             Reduce stage
RDD      rdd_read_in_map       rdd_read_in_reduce, rdd_write_in_reduce
Shuffle  shuffle_write_in_map  shuffle_read_in_reduce

16 How to do IO characterization?
We use blktrace* to monitor each IO sent to disk. A trace contains events such as:
- start writing 560 sectors at a given address
- start reading 256 sectors at a given address
- the previous read command finished
- the previous write command finished
We parse this raw information and generate 4 kinds of charts: IO size histogram, latency histogram, seek distance histogram, and LBA timeline. From these we can tell whether the IO is sequential or random.
* blktrace is a kernel block-layer IO tracing mechanism that provides detailed information about disk request queue operations up to user space.
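As a rough sketch of the seek-distance analysis described on this slide, the Python below takes already-parsed IO events and measures how far each request starts from where the previous one ended. The `(operation, start_sector, sector_count)` tuple layout and the function names are hypothetical; this is not the raw blktrace output format, and it is not the authors' tooling.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# (operation, start_sector, sector_count): a hypothetical pre-parsed form,
# not the raw blktrace/blkparse record layout.
IoEvent = Tuple[str, int, int]


def seek_distances(events: Iterable[IoEvent]) -> List[int]:
    """Distance in sectors between the end of one IO and the start of the next."""
    distances: List[int] = []
    previous_end = None
    for _op, start, count in events:
        if previous_end is not None:
            distances.append(start - previous_end)
        previous_end = start + count
    return distances


def zero_seek_ratio(events: Iterable[IoEvent]) -> float:
    """Fraction of IOs that start exactly where the previous IO ended."""
    distances = seek_distances(events)
    if not distances:
        return 0.0
    return Counter(distances)[0] / len(distances)


# Made-up traces: back-to-back reads versus reads scattered over the disk.
sequential = [("R", 0, 256), ("R", 256, 256), ("R", 512, 256)]
scattered = [("R", 0, 8), ("R", 90000, 8), ("R", 1200, 8)]
```

A ratio near 1 corresponds to the "many 0 SD" (sequential) charts, and a ratio near 0 to the "few 0 SD" (random) ones.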

17 RDD Read in Map: sequential
(Charts: red is read, green is write.)
- Big IO size
- Sequential data distribution
- Many seek distances (SD) of 0
- Low latency
A classic hard disk has a seek time of 8-9 ms and a spindle speed of 7200 rpm, which means one random access takes roughly 12-13 ms once rotational latency is included.
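The random-access figure quoted for a hard disk can be reproduced with a short calculation. The sketch below assumes the usual rule of thumb that average rotational latency is half a revolution; the function names are hypothetical and the result is an estimate, not a measurement from the test systems.

```python
def average_random_access_ms(seek_ms: float, rpm: float) -> float:
    """Seek time plus average rotational latency (half a revolution)."""
    rotation_ms = 60_000.0 / rpm  # time for one full revolution
    return seek_ms + rotation_ms / 2.0


def max_random_iops(seek_ms: float, rpm: float) -> float:
    """Upper bound on random IOs per second for a single disk."""
    return 1000.0 / average_random_access_ms(seek_ms, rpm)


# Seek times of 8 and 9 ms at 7200 rpm, as quoted on the slide.
low_ms = average_random_access_ms(8.0, 7200)
high_ms = average_random_access_ms(9.0, 7200)
```

At 7200 rpm one revolution takes about 8.3 ms, so the rotational part adds about 4.2 ms to the seek time. That caps a single disk at well under 100 random IOs per second, which is why small random shuffle reads hurt so much on HDDs.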

18 Shuffle Read in Reduce: random
(Charts: red is read, green is write.)
- Small IO size
- Random data distribution
- Few seek distances (SD) of 0
- High latency

19 Shuffle Write in Map: sequential
(Charts: red is read, green is write.)
- Big IO size
- Sequential data distribution
- Many seek distances (SD) of 0

20 RDD Read in Reduce: sequential
(Charts: red is read, green is write.)
- Big IO size
- Sequential data distribution
- Many seek distances (SD) of 0
- Low latency

21 RDD Write in Reduce: sequential write, but with frequent 4K reads
(Charts: red is read, green is write.)
- Sequential data location
- Write IO size is big, but there are many small 4K read IOs
These 4K reads are probably caused by spilling in cogroup, which may be a Spark issue.

22 Overall Disk IO Picture
LBA timeline for 1 of the 11 HDDs (red is read, green is write).
Shuffle read is very random, while the other patterns are sequential.
[Chart: LBA timeline across alternating map and reduce stages, with phases labeled RDD Read, Shuffle Write, Shuffle Read, and RDD Write.]

23 Conclusion
RDD read/write and shuffle write are sequential. Shuffle read is random.

Type                    IO characterization
rdd_read_in_map         Sequential
shuffle_write_in_map    Sequential
rdd_read_in_reduce      Sequential
rdd_write_in_reduce     Sequential
shuffle_read_in_reduce  Random

24 Using SSD to speed up shuffle read in reduce
CPU is still the bottleneck!
- x2 improvement for shuffle read in reduce
- x3 improvement in real shuffle
- x2 improvement in E2E testing
[Charts: per-disk bandwidth when shuffle is read from HDD (only 40 MB per disk at most; reading shuffle from HDD leads to high IO wait) versus bandwidth when shuffle is read from SSD (SSD is much better than the sum of 11 HDDs, especially in this stage).]
The size column lists the Shuffle Read / Shuffle Write values as they appear; [...] marks a value missing from this transcript.

Description                        Shuffle Read / Write  SSD-RDD + HDD-Shuffle  1 SSD
saveAsTextFile at BagelNWeight.scala  (none)             20 s                   20 s
foreach at Bagel.scala             [...] GB              14 min                 7.5 min
flatMap at Bagel.scala             [...] GB              12 min                 13 min
foreach at Bagel.scala             [...] GB              13 min                 11 min
flatMap at Bagel.scala             [...] GB              10 min                 10 min
foreach at Bagel.scala             19.1 GB               3.5 min                3.7 min
flatMap at Bagel.scala             19.1 GB               1.5 min                1.5 min
foreach at Bagel.scala             15.3 GB               38 s                   39 s
parallelize at BagelNWeight.scala  (none)                38 s                   38 s
flatMap at BagelNWeight.scala      15.3 GB               46 s                   46 s

25 If CPU is not the bottleneck?
NWeight, SUT #A versus SUT #B:
- x3-5 improvement for shuffle
- x2 improvement for map stage
- x3 improvement in E2E testing
Sizes are listed in the order they appear under the Input, Output, Shuffle Read, and Shuffle Write columns; [...] marks a value missing from this transcript.

Stage  Description                               Sizes                 11 HDDs   PCIe SSD  HSW
23     saveAsTextFile at BagelNWeight.scala:102  50.1 GB, 27.6 GB      27 s      20 s      26 s
17     foreach at Bagel.scala:256                732.0 GB, 490.4 GB    23 min    7.5 min   4.6 min
16     flatMap at Bagel.scala:96                 [...] GB, [...] GB    15 min    13 min    6.3 min
11     foreach at Bagel.scala:256                [...] GB, [...] GB    25 min    11 min    7.1 min
10     flatMap at Bagel.scala:96                 [...] GB, [...] GB    12 min    10 min    5.3 min
6      foreach at Bagel.scala:256                56.1 GB, 19.1 GB      4.9 min   3.7 min   2.8 min
5      flatMap at Bagel.scala:96                 56.1 GB, 19.1 GB      1.5 min   1.5 min   47 s
2      foreach at Bagel.scala:256                15.3 GB               38 s      39 s      36 s
1      parallelize at BagelNWeight.scala:97      (none)                38 s      38 s      35 s
0      flatMap at BagelNWeight.scala:72          22.6 GB, 15.3 GB      46 s      46 s      43 s
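The stage durations listed on this slide can be checked against the headline ratios with a short calculation. The sketch below copies the three duration columns and divides them; summing stage durations is used here only as a rough proxy for end-to-end time (an assumption), and the variable names are hypothetical.

```python
MIN = 60.0  # seconds per minute

# (stage id, 11 HDDs, PCIe SSD, HSW) durations in seconds, from the slide.
STAGES = [
    (23, 27, 20, 26),
    (17, 23 * MIN, 7.5 * MIN, 4.6 * MIN),
    (16, 15 * MIN, 13 * MIN, 6.3 * MIN),
    (11, 25 * MIN, 11 * MIN, 7.1 * MIN),
    (10, 12 * MIN, 10 * MIN, 5.3 * MIN),
    (6, 4.9 * MIN, 3.7 * MIN, 2.8 * MIN),
    (5, 1.5 * MIN, 1.5 * MIN, 47),
    (2, 38, 39, 36),
    (1, 38, 38, 35),
    (0, 46, 46, 43),
]

# Per-stage speedup of the HSW column over the 11-HDD baseline.
speedup_hsw = {sid: hdd / hsw for sid, hdd, _ssd, hsw in STAGES}

# Summed stage time per configuration, as a rough end-to-end proxy.
total_hdd = sum(hdd for _sid, hdd, _ssd, _hsw in STAGES)
total_ssd = sum(ssd for _sid, _hdd, ssd, _hsw in STAGES)
total_hsw = sum(hsw for _sid, _hdd, _ssd, hsw in STAGES)

overall_ssd = total_hdd / total_ssd
overall_hsw = total_hdd / total_hsw
```

With these numbers the two long `foreach` stages (17 and 11) come out at roughly x5 and x3.5, the two long `flatMap` stages (16 and 10) at a little over x2, and the summed HSW time at a little under x3 of the 11-HDD baseline, which lines up with the slide's stated improvements.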

26 We're hiring!
WeChat: 186 1658 3742 / Lex
Email: yucai.yu@intel.com
Do you love the challenges of working with systems that host petabytes of data and many tens of thousands of cores? Do you want to build the next generation of Big Data technologies? Do you want to tackle challenges in operating systems, file systems, data storage, databases, networking, distributed computing, machine learning, and data mining?

27 BACKUP

28 SUT #A (IVB)
Master:
- CPU: Intel(R) Xeon(R) CPU [...] 2.70GHz (16 cores)
- Memory: 64G
- Disk: 2 SSDs
- Network: 1 Gigabit Ethernet
Slaves (4 nodes):
- CPU: Intel(R) Xeon(R) CPU E[...] GHz (2 CPUs, 10 cores, 40 threads)
- Memory: 192G DDR3 1600MHz
- Disk: 11 HDDs / 11 SSDs / 1 PCIe SSD (P3600)
- Network: 10 Gigabit Ethernet
Software:
- OS: Red Hat 6.2
- Kernel: [...] upstream
- Spark: Spark [...]
- Hadoop/HDFS: Hadoop [...] cdh5.3.2
- JDK: Sun HotSpot JDK [...] (64 bits)
- Scala: scala [...]

29 SUT #B (HSW)
Master:
- CPU: Intel(R) Xeon(R) CPU [...] 2.93GHz (16 cores)
- Memory: 48G
- Disk: 2 SSDs
- Network: 1 Gigabit Ethernet
Slaves (4 nodes):
- CPU: Intel(R) Xeon(R) CPU E[...] GHz (2 CPUs, 18 cores, 72 threads)
- Memory: 256G DDR4 2133MHz
- Disk: 11 SSDs
- Network: 10 Gigabit Ethernet
Software:
- OS: Ubuntu [...] LTS
- Kernel: [...] generic.x86_64
- Spark: Spark [...]
- Hadoop/HDFS: Hadoop [...] cdh5.3.2
- JDK: Sun HotSpot JDK [...] (64 bits)
- Scala: scala [...]

30 Test Configuration
- Number of executors: 32
- Executor memory: 18G
- executor-cores: 5
spark-defaults.conf:
    spark.serializer              org.apache.spark.serializer.KryoSerializer
    spark.kryo.referenceTracking  false

31 HDD (Seagate ST NS) SPEC

32 HDD (Seagate ST NS) SPEC

33 PCIe SSD (P3600) SPEC

34 PCIe SSD (P3600) SPEC
