Don't Get Caught in the Cold, Warm-up Your JVM: Understand and Eliminate JVM Warm-up Overhead in Data-parallel Systems


Don't Get Caught in the Cold, Warm-up Your JVM: Understand and Eliminate JVM Warm-up Overhead in Data-parallel Systems. David Lion, Adrian Chiu, Hailong Sun*, Xin Zhuang, Nikola Grcevski, Ding Yuan. University of Toronto, *Beihang University, Vena Solutions

The JVM is Popular
- Systems are increasingly built on the JVM
- Popular for big data applications
- Increasingly used for latency-sensitive queries

JVM Performance is Mysterious
- "JVM performance has come a long way, but it will never match native code." - Quora user
- "The most obvious outdated Java performance fallacy is that it is slow." - InfoQ user
- "Most work performed by HDFS and MapReduce is I/O, so Java is acceptable." - Hypertable developer
- "If you scale your application Java is likely fast enough for you." - StackOverflow user

An Analysis of JVM Overhead
- Surprisingly, warm-up overhead is the bottleneck
  - It bottlenecks even I/O intensive workloads (33% of a 1GB HDFS read)
  - Warm-up time stays constant (21s on average for a Spark query)
- There is a contradiction between parallelizing long-running jobs into short tasks and amortizing JVM warm-up overhead through long tasks
- Multi-layer systems aggravate the problem

HotTub: Eliminate Warm-up Overhead
- A new JVM that can be reused across jobs
[Charts: runtime (s) of OpenJDK 8 vs. HotTub: HDFS 1MB read (3.8X speedup), Spark query 1GB (1.8X), Hive query 1GB (1.79X)]

Outline
- JVM analysis
  - Warm-up overhead bottlenecks I/O intensive workloads
  - Warm-up overhead is constant
  - Warm-up overhead increases in multi-layer systems
- HotTub: eliminate warm-up with reuse
- Demo

Methodology
- Study HDFS, Spark, and Hive on Tez
  - Queries from BigBench [Ghazal 13] (TPC-DS)
  - Server components are fully warmed up
  - Cluster nodes: 16 virtual cores, 128GB RAM, 1GbE
- Instrument OpenJDK 8 to measure warm-up overhead (a rough in-process approximation is sketched below)
- Understanding the overall slowdown is complicated
  - Blocked time analysis [Ousterhout 15]
  - Subtract warm-up overhead and simulate scheduling
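
The talk's measurements come from instrumenting OpenJDK itself. For a rough, in-process approximation you can read the JVM's standard management beans; the minimal sketch below is my own illustration, not the authors' instrumentation, and it only captures class-loading counts and accumulated JIT time, not time spent interpreting:

    import java.lang.management.ClassLoadingMXBean;
    import java.lang.management.CompilationMXBean;
    import java.lang.management.ManagementFactory;

    public class WarmupProbe {
        public static void main(String[] args) {
            ClassLoadingMXBean cl = ManagementFactory.getClassLoadingMXBean();
            CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
            boolean jitTimed = jit != null && jit.isCompilationTimeMonitoringSupported();

            long classesBefore = cl.getTotalLoadedClassCount();
            long jitMsBefore = jitTimed ? jit.getTotalCompilationTime() : 0;

            runWorkload(); // hypothetical stand-in for the job under study

            System.out.printf("classes loaded: %d%n",
                    cl.getTotalLoadedClassCount() - classesBefore);
            if (jitTimed)
                System.out.printf("JIT compilation time: %d ms%n",
                        jit.getTotalCompilationTime() - jitMsBefore);
        }

        private static void runWorkload() { /* e.g., an HDFS sequential read */ }
    }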

HDFS: I/O Bottleneck is Warm-up
- Warm-up time remains relatively constant across data sizes
  - Same classes loaded, same methods JIT-compiled
- Warm-up is more than 33% of execution for 1GB
[Chart: HDFS sequential read, 1-10 GB; time (s) spent in the interpreter and in class loading, with warm-up as a % of runtime]

Zooming into HDFS Read Overhead
- A 1GB file read is bottlenecked by client-initialization warm-up overhead
- 21% of the entire execution is class loading during initialization
- Warm-up overhead dwarfs disk I/O
[Chart: HDFS 1GB sequential read timeline, split into compiled/native, interpreter, and class loading time across the client-init and read phases]

HDFS Packet Overhead
- The datanode first sends an acknowledgement so the read can begin
[Timeline, in ms: datanode I/O vs. client activity, colored by compiled/native, interpreter, and class loading; "datanode ack" marked]

HDFS Packet Overhead
- Client processing of the acknowledgement is very slow
- The bottleneck is class loading
[Timeline: the client's "parse DN ack" span is dominated by class loading]

HDFS Packet Overhead
- Meanwhile, the datanode has already started to send packets
[Timeline: datanode "sendfile 1-38" overlaps the client's ack parsing]

HDFS Packet Overhead
- The client spends all of its time in warm-up
- Interpreted CRC checksum computation is the bottleneck
[Timeline: the client's "read pckt. 1" span is dominated by interpreter time]

HDFS Packet Overhead
- The datanode must wait because the sendfile buffer is full
[Timeline: datanode "wait" span follows "sendfile 1-38"]

HDFS Packet Overhead
- Warm-up overhead slows even the actual I/O path
- 190 packets are sent before the client finishes processing packet 3
[Timeline: datanode "sendfile 39-190" while the client works through "read pckt. 1", "read pckt. 2", "read pckt. 3"]
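
Because the checksum path is a tight bytecode loop, the client runs it in the interpreter until the JIT compiles it. The toy benchmark below is my own illustration, not from the talk; CRC32 stands in for HDFS's checksum code, and the exact numbers depend heavily on the JDK. It makes the effect visible: the first few "packets" are checksummed far more slowly than later ones in the same process:

    import java.util.zip.CRC32;

    public class ChecksumWarmup {
        public static void main(String[] args) {
            byte[] packet = new byte[64 * 1024]; // stand-in for an HDFS packet
            for (int i = 0; i < packet.length; i++) packet[i] = (byte) i;

            // Time each "packet" individually; early iterations also pay for
            // interpretation and JIT compilation, later ones run compiled code.
            for (int i = 1; i <= 20; i++) {
                long start = System.nanoTime();
                CRC32 crc = new CRC32();
                crc.update(packet, 0, packet.length);
                long micros = (System.nanoTime() - start) / 1_000;
                System.out.printf("packet %2d: %5d us (crc=%08x)%n",
                        i, micros, crc.getValue());
            }
        }
    }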

Spark and Hive Warm-up Overhead
- Warm-up overhead is constant
- Average: Spark 21s, Hive 13s
[Chart: interpreter and class loading time (s) for Spark and Hive queries across BigBench scale factors (GB)]

Spark and Hive Warm-up Overhead
- Up to 40% of runtime is spent in warm-up overhead
- Spark queries run faster, but have more warm-up overhead
[Chart: as above, with warm-up also shown as a % of runtime]

More Layers, More Overhead
- The Spark client loads 3 times more classes than Hive (over 19,000 classes)
- More classes generally leads to more interpretation time, because more unique methods are invoked
[Chart: classes loaded by the Spark and Hive clients, with Spark package classes shown separately]

JVM Reuse is Hard at the Application Layer
- Spark and Tez already try to reuse JVMs, but only within a job
- Many challenges exist when reusing JVMs:
  - Need to ensure the same classes are used
  - Need to reset all static data (sketched below); 3rd-party libraries make this even harder
  - Close file descriptors
  - Kill threads
  - Signals
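
To see why resetting static data from the application layer is so hard, here is a best-effort sketch (my own illustration; HotTub instead does this inside the modified JVM) of a reflection-based reset. Final fields, caches hidden inside 3rd-party libraries, and the fact that static initializers do not re-run all defeat it:

    import java.lang.reflect.Field;
    import java.lang.reflect.Modifier;

    public class StaticReset {
        // Best-effort reset of one class's static fields to their type defaults.
        static void resetStatics(Class<?> cls) throws IllegalAccessException {
            for (Field f : cls.getDeclaredFields()) {
                if (!Modifier.isStatic(f.getModifiers())) continue;
                if (Modifier.isFinal(f.getModifiers())) continue; // cannot legally reset
                f.setAccessible(true); // may fail under module/security restrictions
                Class<?> t = f.getType();
                if (t == boolean.class)     f.setBoolean(null, false);
                else if (t == byte.class)   f.setByte(null, (byte) 0);
                else if (t == char.class)   f.setChar(null, (char) 0);
                else if (t == short.class)  f.setShort(null, (short) 0);
                else if (t == int.class)    f.setInt(null, 0);
                else if (t == long.class)   f.setLong(null, 0L);
                else if (t == float.class)  f.setFloat(null, 0f);
                else if (t == double.class) f.setDouble(null, 0d);
                else                        f.set(null, null);
            }
            // Even when this succeeds, the class's static initializer does NOT
            // re-run, so the class is left blank rather than freshly initialized;
            // this is exactly why HotTub reinitializes classes inside the VM.
        }
    }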

Outline
- JVM analysis
  - Warm-up overhead bottlenecks I/O intensive workloads
  - Warm-up overhead is constant
  - Warm-up overhead increases in multi-layer systems
- HotTub: eliminate warm-up with reuse
- Demo

HotTub: Reuse JVMs
- A modified OpenJDK 8 that reuses warm JVMs
- Keeps a pool of warm JVM processes
- Data-parallel systems have lots of opportunity for reuse
- Drop-in replacement

HotTub: Initial Run
- If no reusable JVM exists, create a new one
[Flow: $java -> reusable JVM exists?]

HotTub: Initial Run
- If no reusable JVM exists, create a new one
- Run the application normally
[Flow: $java -> reusable JVM exists? -> false -> exec new JVM -> run app]

HotTub: Initial Run
- If no reusable JVM exists, create a new one
- Run the application normally
- Reset the JVM before adding it to the pool:
  - Clean up any threads
  - Reset static data to type defaults
  - Close file descriptors
[Flow: $java -> reusable JVM exists? -> false -> exec new JVM -> run app -> reset JVM -> JVM pool]

HotTub: Reuse Run
- Choose an existing JVM from the pool
- Ensure the loaded classes are correct
- Reinitialize all static data
[Flow: $java -> reusable JVM exists? -> true -> reinitialize JVM -> run app; on false, exec a new JVM as before]
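
The flow above can be demonstrated at the process level with a toy pool of pre-started JVM workers. This runnable sketch is my own illustration of the idea, not HotTub's code: HotTub does the pooling transparently behind the ordinary java command, inside the JVM, and also performs the reset and reinitialization steps that this toy skips:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.PrintWriter;
    import java.util.ArrayDeque;

    // Toy, process-level demonstration of the HotTub idea: keep already-started
    // JVMs around and hand them new work, instead of paying JVM start-up and
    // warm-up for every job.
    public class WarmPool {
        static final class WarmJvm {
            final Process proc;
            final PrintWriter in;     // worker's stdin
            final BufferedReader out; // worker's stdout

            WarmJvm() throws IOException {
                String javaBin = System.getProperty("java.home") + "/bin/java";
                proc = new ProcessBuilder(javaBin, "-cp",
                        System.getProperty("java.class.path"), "WarmPool", "worker")
                        .start();
                in = new PrintWriter(new OutputStreamWriter(proc.getOutputStream()), true);
                out = new BufferedReader(new InputStreamReader(proc.getInputStream()));
            }

            String run(String job) throws IOException {
                in.println(job);       // submit work to the warm JVM
                return out.readLine(); // wait for its result
            }
        }

        public static void main(String[] args) throws Exception {
            if (args.length > 0 && args[0].equals("worker")) { workerLoop(); return; }

            ArrayDeque<WarmJvm> pool = new ArrayDeque<>();
            for (int i = 0; i < 2; i++) pool.add(new WarmJvm()); // pre-warm two JVMs

            for (int job = 1; job <= 3; job++) {
                WarmJvm jvm = pool.poll(); // take a warm JVM
                System.out.println("result: " + jvm.run("job-" + job));
                pool.add(jvm);             // return it to the pool
            }
            for (WarmJvm jvm : pool) jvm.proc.destroy();
        }

        static void workerLoop() throws IOException {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            for (String job; (job = in.readLine()) != null; )
                System.out.println("done " + job); // classes stay loaded across jobs
        }
    }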

HotTub Demo

HotTub Evaluation
- Consistent improvement across different workloads on a system
- Reusing a JVM from a different query gives similar results
[Charts: OpenJDK 8 vs. HotTub runtimes (s): HDFS reads by file size (1MB, 1GB) and best/worst BigBench query speedups for Spark 1GB and Hive 1GB; speedups shown include 3.8X, 1.8X, 1.79X, 1.66X, 1.36X, and 1.1X]

Limitations
- Security: limit reuse to the same Linux user
  - A reused JVM could otherwise expose loaded classes and compiled methods, similar to a timing channel
- Not useful for long-running JVMs
- Breaks the JVM specification for class initialization in edge cases

Related Work
- Garbage collection overhead in data-parallel systems: Yak [Nguyen'16], Taurus [Maas'16], Broom [Gog'15], etc. (not warm-up overhead)
- Studies on data-parallel systems: [Ousterhout'15], [Pavlo'09], etc. (not targeting the JVM)
- Studies on the cost of scalability: [McSherry'15]
- Work on the JVM unrelated to data-parallel systems

Conclusions
- Warm-up overhead is the bottleneck
  - It bottlenecks even I/O intensive workloads
  - Warm-up time stays constant, which is bad when parallelizing with JVMs
  - Multi-layer systems aggravate warm-up overhead
- HotTub: eliminate warm-up through transparent JVM reuse
- Open sourced at: https://github.com/dsrg-uoft/hottub
Thank You

Extra Slides

HotTub Overheads
- Memory: in our tests an idle JVM took around 1GB of memory
  - Pool size is configurable (ideally the pool is rarely idle)
- Garbage collection: ~2ms
  - Few roots, and most objects are dead: all stacks have ended, and static data is set to type defaults (null)
- Class reinitialization: Spark executor 4ms, Spark client 72ms, Hive container 3ms
  - Not reuse overhead as such, but it cannot be skipped on reuse

Spark and Hive Study
- Complete results
- GC is not a factor for these short queries

HotTub Iterations
- A 1MB HDFS read performed repeatedly
- Run 0 is a new JVM

HotTub Cross-Query
- Spark, 1GB
- Run the training query 4 times, then run the testing query

Custom Class Loaders
- The loader instance is re-created on each run, so there are no consistency issues
- But its classes cannot be reused

    class CustomClassLoader extends ClassLoader {
        ...
    }

    class Test {
        public static void foo() throws ClassNotFoundException {
            CustomClassLoader ccl = new CustomClassLoader();
            Class<?> clas = ccl.loadClass("ClassName");
            ...
        }
    }

HotTub Consistency Limitations
- Timing dependencies: an unpredictable and dangerous practice
- HotTub initializes classes before runtime

    class ClassA {
        static int a = 0;
        void foo() { a++; }
    }

    class ClassB {
        static int b = ClassA.a;
    }

[Timeline: ClassA loaded (ClassA.a = 0); foo() runs twice (ClassA.a = 1, then 2); ClassB loaded (ClassB.b = 2)]

HotTub Consistency Limitations
- Timing dependencies: foo() could be called multiple times before ClassB is initialized
- Class dependence cycles: a problem for normal JVMs too

    class ClassA {
        static int a = 0;
        void foo() { a++; }
    }

    class ClassB {
        static int b = ClassA.a;
    }

    class ClassC {
        static int c0 = 1;
        static int c1 = ClassD.d0;
    }

    class ClassD {
        // c1 = 1 in HotSpot
        // c1 = 0 in HotTub
        static int d0 = ClassC.c0;
        static int d1 = 12;
    }

(Under lazy HotSpot initialization, ClassC is initialized first: reading ClassD.d0 triggers ClassD's initializer, which reads the already-set c0 = 1, so c1 ends up 1. If HotTub eagerly initializes ClassD first, ClassC's initializer reads ClassD.d0 before it has been set, so c1 ends up 0.)

Instrumentation
- Goal: track mode changes between the interpreter and compiled code
- Returns are hard to track
[Diagram: JVM stack holding parameters and a return address, plus a thread-local "return address stack"; ret_handler sits at a fixed address]

    ret_handler:
        push %rax
        ...
        push %rdx
        callq _record_mode_change
        callq _pop_ret_addr       # original return address -> %rax
        movq %rax, %r11
        pop %rdx
        ...
        pop %rax
        jmp %r11                  # jump to the original return address

Instrumentation
- Copy the original return address to a thread-local stack
[Diagram: the return address stack now records the stack slot and its original return address]

Instrumentation
- Overwrite the return address on the JVM stack with the address of our ret_handler
[Diagram: the JVM stack's return address now points at ret_handler]

Instrumentation
- The return now calls into our ret_handler
- Record the mode change and pop the original return address
[Diagram: ret_handler runs with the saved return address still on the thread-local stack]

Instrumentation
- Jump to the original return address
[Diagram: control returns to the caller as if the return address had never been replaced]

Spark and Hive Parallelization
- Parallelization: split long-running jobs into short tasks
- JVM warm-up overhead is amortized only when tasks run long
- Jobs are smaller and faster than ever
  - 90% of Facebook's analytics jobs have <1GB input
  - The majority of Hadoop workloads read and write <1GB per task
  - [Ousterhout 13] shows a trend toward increasingly short-running jobs
- Hive on Tez: 1 JVM per task; Spark: 1 JVM per node