Workload Characterization and Optimization of TPC-H Queries on Apache Spark


Workload Characterization and Optimization of TPC-H Queries on Apache Spark
Tatsuhiro Chiba and Tamiya Onodera, IBM Research - Tokyo
April 17-19, 2016, IEEE ISPASS 2016 @ Uppsala, Sweden

Overview
- Introduction: motivation, goal and result
- Workload characterization: how Spark and Spark SQL work, environments, application-level analysis, system-level analysis (GC analysis, PMU analysis)
- Problem assessments and approach
- Result
- Summary and future work

Motivation
- Apache Spark is an in-memory data processing framework that runs on the JVM.
- Spark executes Hadoop-like workloads, but the optimization points are not the same, and the fundamental bottleneck is harder to find: the I/O bottleneck shifts toward CPU, memory, and network bottlenecks.
- The JVM handles many worker threads. What is the best practice for achieving high performance?
- Managing a large Java heap causes high GC overhead: Spark keeps as much data in memory as possible and generates many short-lived, immutable Java objects. How can we reduce the garbage collection overhead?
- From scale-out to scale-up: Spark utilizes many worker threads with SMT on multiple NUMA nodes, so the micro-architectural effects need to be understood. How can we exploit the underlying hardware features?
(Software/hardware stack: Spark application, Spark runtime, JVM, OS, hardware.)

Goal and Result
Goal
- Characterize Spark performance through SQL workloads
- Find optimization best practices for Spark
- Investigate potential problems in the current Spark and JVM when scaling up
Result
- Achieved up to 30% improvement by reducing GC overhead
- Achieved up to 10% improvement by utilizing more SMT
- Achieved up to 3% improvement by NUMA awareness
- Achieved 30-40% improvement on average by applying all optimizations

How Spark and Spark SQL work
- A Spark job is described as a data transformation chain and is divided into multiple stages.
- Each stage includes multiple tasks; tasks are processed concurrently by worker threads on the executor JVMs.
- Data shuffling occurs between stages.
- Spark SQL: Catalyst, the query optimization framework for Spark, generates optimized code and provides compatibility with Hive queries.
(Figure cited from Armbrust et al., "Spark SQL: Relational Data Processing in Spark," SIGMOD '15.)
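
As a minimal illustration of the job/stage/task model (not taken from the talk; the input path and the word-count logic are arbitrary assumptions), the following Spark 1.x Scala snippet produces a two-stage job: the map side runs as parallel tasks in the first stage, reduceByKey introduces a shuffle boundary, and the aggregation runs as the second stage.

    // Hypothetical two-stage Spark job (assumes a running SparkContext `sc`).
    val lines  = sc.textFile("hdfs:///path/to/input")           // stage 0: read and map-side tasks
    val pairs  = lines.flatMap(_.split(" ")).map(w => (w, 1L))
    val counts = pairs.reduceByKey(_ + _)                       // shuffle boundary between stages
    counts.count()                                              // stage 1: reduce-side tasks; triggers the job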

How SQL code translates into Spark - TPC-H Q13
- stage 0: load the CUSTOMER table (aliased as c)
- stage 1: load the ORDERS table (aliased as o)
- stage 2: join c and o where c.custkey = o.custkey
- stage 3: select c_count, count(1) ... group by c_count
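
To make the stage mapping above concrete, here is a hedged Spark 1.4-era Scala/Spark SQL sketch of the simplified Q13; the Parquet paths, the table registration calls, and the exact SQL text are assumptions, not code shown in the talk.

    // Hypothetical setup: Parquet copies of the TPC-H tables on HDFS.
    val customer = sqlContext.read.parquet("hdfs:///tpch/customer")  // stage 0: load CUSTOMER as c
    val orders   = sqlContext.read.parquet("hdfs:///tpch/orders")    // stage 1: load ORDERS as o
    customer.registerTempTable("customer")
    orders.registerTempTable("orders")

    // stage 2: join c and o on custkey; stage 3: group the per-customer counts by c_count
    val q13 = sqlContext.sql("""
      SELECT c_count, COUNT(1) AS custdist
      FROM (SELECT c.c_custkey, COUNT(o.o_orderkey) AS c_count
            FROM customer c LEFT OUTER JOIN orders o ON c.c_custkey = o.o_custkey
            GROUP BY c.c_custkey) c_orders
      GROUP BY c_count
    """)
    q13.collect()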

Machine & Software Spec and Spark Settings

Hardware
- Processor: POWER8 3.3 GHz * 2 (2 sockets * 12 cores = 24 cores)
- SMT: 8 (192 hardware threads in total)
- Memory: 1TB
- OS: Ubuntu 14.10 (kernel 3.16.0-31)

Software versions
- Spark 1.4.1
- Hadoop (HDFS) 2.6.0
- Java 1.8.0 (IBM J9 VM SR1 FP1)
- Scala 2.10.4

Baseline Spark settings
- Number of Executor JVMs: 1
- Number of worker threads: 48
- Executor heap size: 192GB (nursery = 48g, tenure = 144g)

Other relevant Spark configurations (spark-defaults.conf)
- spark.shuffle.compress = true
- spark.sql.parquet.compression.codec = snappy
- spark.sql.parquet.filterPushdown = true
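
A minimal spark-defaults.conf sketch of this baseline, assuming the corrected property names above; the heap and nursery sizes are passed through spark.executor.memory and spark.executor.extraJavaOptions, and the cluster manager and launch scripts are not shown.

    # Baseline sketch (Spark 1.4 standalone; values from the slide, layout assumed)
    spark.executor.memory                  192g
    spark.executor.extraJavaOptions        -Xmn48g
    spark.cores.max                        48
    spark.shuffle.compress                 true
    spark.sql.parquet.compression.codec    snappy
    spark.sql.parquet.filterPushdown       true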

Workload Characterization - Spark job level (selected queries)

Query | SQL characteristics                | Converted Spark operations (# of stages)     | Input (GB) | Shuffle (GB) | Stages / Tasks | Time (sec)
Q1    | 1 GroupBy; loads 1 table           | 1 Load, 1 Aggregate                          | 4.8        | 0.2          | 2 / 793        | 48.7
Q3    | 1 GroupBy, 2 Join; loads 3 tables  | 3 Load, 2 HashJoin, 1 Aggregate              | 7.3        | 5.0          | 6 / 1345       | 64.6
Q5    | 1 GroupBy, 5 Join; loads 6 tables  | 3 Load, 3 HashJoin, 1 BcastJoin, 1 Aggregate | 8.8        | 14.1         | 8 / 1547       | 125
Q6    | 1 Select, 1 Where; loads 1 table   | 1 Load, 1 Aggregate                          | 4.8        | 0.0          | 2 / 594        | 15.1
Q9    | 1 GroupBy, 5 Join; loads 6 tables  | 4 Load, 4 HashJoin, 1 BcastJoin, 1 Aggregate | 11.8       | 34.4         | 10 / 1838      | 37
Q18   | 3 Join, 1 UnionAll; loads 3 tables | 6 Load, 3 HashJoin, 1 Union, 1 Limit         | 7.7        | 13.8         | 11 / 3725      | 22
Q19   | 3 Join, 1 UnionAll; loads 2 tables | 6 Load, 3 HashJoin, 1 Union, 1 Aggregate     | 19.8       | 0.4          | 8 / 2437       | 8.8

- Shuffle-light queries: Q1, Q6, Q19
- Shuffle-heavy queries: Q5, Q9, Q18

Workload Characterization - oprofile
- Shuffle-heavy queries (e.g. Q5 and Q8): over 30% of cycles are spent in GC
- Shuffle-light queries (e.g. Q1 and Q6): low SerDes cost
- Unexpected JIT and thread spin-lock overheads exist in Q16 and Q21

Workload Characterization - garbage collection

Shuffle-light query (Q1)
- GC: many nursery GCs with small pause times (~0.2 sec), few global GCs
- Java heap: low usage level, stays within the nursery space

Shuffle-heavy query (Q5)
- GC: many nursery GCs with large pause times (3-5 sec), global GC occurs during execution
- Java heap: objects overflow into the tenure space

(Charts: GC pause time and nursery/tenure/total heap usage over elapsed time for Q1 and Q5.)
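
The GC pause and heap-occupancy curves above can be collected from the IBM J9 verbose GC log; this is one plausible way to gather them (the talk does not state its measurement setup, so the options below are assumptions).

    # Enable IBM J9 verbose GC logging on the executor JVM (spark-defaults.conf)
    spark.executor.extraJavaOptions    -Xmn48g -verbose:gc -Xverbosegclog:/tmp/spark-executor-gc.xml
    # The resulting log records each nursery (scavenge) and global collection with its
    # pause time and nursery/tenure occupancy, which can then be plotted over elapsed time.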

Workload Characterization - PMU profile
Approach
- Observed performance counters with perf
- Categorized the counters based on the CPI breakdown model [*]
Result
- Backend stalls are quite large
- Many L3 misses, caused by distant memory accesses
- CPU migration occurs frequently

[*] https://www.ibm.com/support/knowledgecenter/linuxonibm/liaal/iplsdkcpieventspower8.htm
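
A hedged example of collecting such counters with perf on a running executor; the event list and measurement window are assumptions, and the POWER8-specific stall events of the CPI breakdown model would be added by name from the reference above.

    # Attach to the executor JVM and read basic cycle/instruction counters
    # (<executor-pid> is a placeholder).
    perf stat -e cycles,instructions,cache-misses -p <executor-pid> -- sleep 60
    # CPI = cycles / instructions; the backend-stall and distant-memory-access events
    # are POWER8-specific and are listed in the CPI breakdown reference [*].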

Problem Assessments and Optimization Strategies

How can we reduce GC overhead?
1. Heap sizing
2. JVM option tuning
3. Changing the number of Spark Executor JVMs
4. GC algorithm tuning

How can we reduce backend stall cycles?
1. NUMA awareness
2. Adding more hardware threads to the Executor JVMs
3. Changing the SMT level (SMT2, SMT4, SMT8)

Effects of Heap Space Sizing
- A bigger nursery space achieves up to 30% improvement.
- A small tenure space may be harmful: it can run out because of RDDs cached in memory and objects leaked from the nursery space.
(Chart: execution time and relative improvement for Q1, Q5, Q6, Q8, Q9, Q16, and Q19 with -Xmn48g, -Xmn96g, and -Xmn128g.)

TPC-H Q9 case:
Nursery space (-Xmn) | Execution time (sec) | GC ratio (%) | Nursery GC avg. pause (sec) | # Nursery GC | # Global GC
48g (default)        | 316                  | 20           | 2.1                         | 39           | 1
96g                  | 310                  | 18           | 3.4                         | 22           | 1
144g                 | 292                  | 14           | 3.6                         | 14           | 0
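
A sketch of how this nursery sweep maps onto the executor options with the IBM J9 gencon collector; the property layout is an assumption, while the -Xmn values are the ones from the slide.

    # 192GB executor heap with an enlarged nursery (try 48g / 96g / 128g / 144g)
    spark.executor.memory              192g
    spark.executor.extraJavaOptions    -Xmn96g
    # A larger nursery keeps short-lived shuffle objects out of the tenure space,
    # trading slightly longer individual scavenges for far fewer collections overall.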

Effects of Other JVM Options
- JVM options explored: monitor thread tuning, number of GC threads, Java thread tuning, JIT tuning, etc.
- Options were applied cumulatively: GC threads, monitor threads, disable the Large Object Area, reduce thread lock cost, stop compaction, stop a JIT feature, stop System.gc().
- Result: improved by over 20%.
(Chart: execution time and relative improvement for Q1 and Q5 as options 1-7 are added.)
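
The following IBM J9 flags correspond to several of the categories named above; the flags themselves exist in J9, but the exact combination and values used in the talk are assumptions.

    #   -Xgcthreads48        tune the number of parallel GC helper threads
    #   -Xnoloa              disable the Large Object Area in the tenure space
    #   -XlockReservation    reserve locks to reduce thread lock (monitor) cost
    #   -Xnocompactgc        stop heap compaction
    #   -Xdisableexplicitgc  ignore explicit System.gc() calls
    spark.executor.extraJavaOptions    -Xmn96g -Xgcthreads48 -Xnoloa -XlockReservation -Xnocompactgc -Xdisableexplicitgc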

Effects of Changing the Number of Executor JVMs
(Chart: execution time with 1, 2, 4, and 8 executor JVMs for Q1, Q5, Q6, Q8, Q9, Q16, and Q19, with best/worst relative change.)
Result
- Up to 70% improvement in Q16 (by avoiding the unexpected thread lock activity)
- Reduced the heavy GC overhead
- Slight drawback for shuffle-light queries
- A single JVM causes task execution failures more frequently than 4 JVMs
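
One way to run four executor JVMs instead of one on the same node is to start multiple standalone workers; a spark-env.sh sketch under that assumption (the talk does not show its launch scripts).

    # spark-env.sh: split the single 192GB executor into 4 smaller JVMs
    export SPARK_WORKER_INSTANCES=4    # 4 workers -> 4 executor JVMs on the node
    export SPARK_WORKER_CORES=12       # 12 worker threads per JVM (48 in total)
    export SPARK_WORKER_MEMORY=48g     # pair with spark.executor.memory=48g and a smaller -Xmn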

Effects of NUMA-aware Process Affinity
(Diagram: each of the 4 Executor JVMs, 12 worker threads each, is bound with numactl to one of the four NUMA nodes across the two sockets and their local DRAM.)
- Setting NUMA-aware process affinity for each Executor JVM speeds up execution by reducing scheduling overhead, cache misses, and stall cycles.
- Result: achieved 3-3.5% improvement in all benchmarks without any negative effects.
(Chart: execution time with NUMA affinity off/on and speedup for Q1, Q5, Q9, and Q19.)
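
A hedged illustration of the affinity setting: each executor JVM is launched under numactl so that its threads and memory allocations stay on one NUMA node. The slide's own numactl arguments are partly garbled in this transcript, so the flags below are assumptions.

    # Executor JVM 0 pinned to NUMA node 0; repeat with nodes 1..3 for JVMs 1..3.
    # "<executor launch command>" stands for whatever command starts that worker/executor JVM.
    numactl --cpunodebind=0 --membind=0 <executor launch command>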

Effects of Increasing Worker Threads
Settings
- 2 WT/core: each Executor JVM handles 12 worker threads on 6 cores (48 worker threads in total)
- 4 WT/core: each Executor JVM handles 24 worker threads on 6 cores (96 worker threads in total)
Result
- Some queries gain over 10% improvement regardless of shuffle data size
- Q5 and Q8 showed a drawback
(Chart: execution time and relative change for Q1, Q5, Q6, Q8, Q9, Q16, and Q19 at 2 vs. 4 worker threads per core, 192GB heap.)
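
On Linux on POWER the SMT mode can be switched at run time with ppc64_cpu, and the per-core worker-thread count is then raised on the Spark side; a sketch under the four-JVM layout above (the exact commands are assumptions).

    # Select the SMT mode (SMT2 / SMT4 / SMT8) for all cores.
    sudo ppc64_cpu --smt=8
    # Then give each executor JVM more worker threads, e.g. 4 per core on its 6 cores:
    export SPARK_WORKER_CORES=24    # 24 worker threads per JVM, 96 in total across 4 JVMs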

Summary of Applying All Optimizations
(Chart: default vs. optimized execution time and relative improvement for Q1-Q22.)
- Shuffle-light queries: 10-20% improvement
- Shuffle-heavy queries: 30-40% improvement
- Eliminated the unexpected JVM behavior in Q16 and Q21: Q21 now takes 543 sec, versus over 3,000 sec before tuning

Summary and Future Work
Summary
- Reduced GC overhead from 30% to 10% or less by heap sizing, JVM counts, and JVM options
- Reduced distant memory accesses from 66.5% to 58.9% by NUMA awareness
- In summary, achieved 30-40% improvement on average
- All experiment code is available at https://github.com/tatsuhirochiba/tpch-on-spark
Future work
- Comparison between x86_64 and POWER8
- Other Spark workloads