Trace-based JIT Compilation


Trace-based JIT Compilation. Hiroshi Inoue, IBM Research - Tokyo

Trace JIT vs. Method JIT. https://twitter.com/yukihiro_matz/status/533775624486133762

Background: Trace-based Compilation. Using a trace, a hot path identified at runtime, as the basic unit of compilation. (Figure: control-flow graph of a method f entered at the top; inside the "while (!end)" loop, one side of "if (x != 0)" is frequently executed and the other is rarely executed.)

Background: Trace-based Compilation. Using a trace, a hot path identified at runtime, as the basic unit of compilation. Trace selection: how to form a good compilation scope. (Figure: the same control-flow graph next to the selected trace A, which includes only the hot path; the rarely executed branch and the loop exit become trace exits (side exits).)
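As a concrete illustration (hypothetical code, not taken from the slides), a method like the following has the shape sketched in the figure: a loop whose body contains a frequently executed path and a rarely taken branch. A selected trace would cover only the hot loop body, with the cold branch and the loop exit becoming side exits.

```java
// Hypothetical example of the kind of method sketched in the figure:
// the loop body has a frequently executed path and a rarely executed branch.
public class HotPathExample {
    static int f(int[] data, int x) {
        int sum = 0;
        for (int i = 0; i < data.length; i++) { // "while (!end)" in the figure
            if (x != 0) {                       // rarely true in this workload
                sum -= x;                       // cold path: becomes a side exit
            }
            sum += data[i];                     // hot path: stays on the trace
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[1000];
        java.util.Arrays.fill(data, 1);
        // With x == 0 the "x != 0" branch is never taken, so a trace selected
        // at runtime would contain only the hot loop body.
        System.out.println(f(data, 0));
    }
}
```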

A Brief History of Trace-based Compilation (1). Trace-based compilation was first introduced by binary translators and optimizers, because method structures are not available in binaries, e.g. Dynamo (PLDI 00) and BOA/DAISY (MICRO 99 etc.). Dynamo demonstrated the optimization potential: an average 10% speedup over binaries compiled at the -O level, with improvements coming mostly from better code layout and simple folding. Trace-based compilation is still commonly used in binary instrumentation tools and translators today, e.g. DynamoRIO, Strata, and Intel Pin.

A Brief History of Trace-based Compilation (2). Recently, trace-based compilation has gained popularity in dynamic scripting languages because it can potentially provide more opportunities for type specialization, e.g. TraceMonkey (PLDI 08), PyPy (ICOOOLPS 09), LuaJIT, and SPUR (OOPSLA 10). TraceMonkey (used in Firefox 3.5-10) is the first trace-JIT for JavaScript; it demonstrated 2x to 20x speedups on loop-intensive scripts. PyPy is the first Python trace-JIT; it uses trace compilation for aggressive specialization of generic operations and data.

A Brief History of Trace-based Compilation (3). HotpathVM (VEE 06), YETI (VEE 07), Maxpath (PPPJ 10), a HotSpot-based trace-JIT (PPPJ 11), and the Dalvik JIT of Android 2.2+ are trace-JITs targeting Java and its variants. HotpathVM emphasizes its efficiency in resource-constrained systems where full-blown JIT compilation is not practical. YETI showed that trace-based compilation eases the development of a JIT compiler by allowing incremental implementation. The HotSpot-based trace-JIT showed performance improvements over the (method-based) HotSpot client compiler (closer to a region-based compilation approach).

Outline: Background; Overview of our trace-JIT; Trace Selection and Performance; Summary.

Our Trace-based Java JIT. Problem statement: how to optimize large-scale Java applications with deep (>100) call chains and a flat execution profile? Why trace compilation? Tracing may create larger compilation scopes than conventional inlining, especially across method boundaries, and tracing may provide context-sensitive profiling information. Our approach: develop a trace-JIT based on an existing method-based Java JIT; a 2-year effort by 3 members: Hiroshi Inoue and Hiroshige Hayashizaki (IBM Research - Tokyo) and Peng Wu (IBM Research - Watson).

Motivating Example. A trace can span multiple methods, free from method boundaries. In large server workloads there are deep (>100) layers of methods. (Figure: a call chain through a Web Container, JavaServer Pages, Enterprise JavaBeans, and a JDBC Driver; each layer repeats "ok? error!" and "log? print" checks around its work (search for a handler, search for a bean, create an SQL statement, send the SQL to the DB), but the hot path runs straight through the layers.)
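A hypothetical sketch of such layered code (all names invented for illustration): each layer repeats the same error and logging checks, but at runtime only the straight-through path is hot, so a single trace can follow it across all the layers regardless of method boundaries.

```java
// Hypothetical sketch of deeply layered middleware code. Each layer checks
// for errors and logging, but on the hot path neither branch is taken, so one
// trace can run straight through webContainer -> jsp -> ejb -> jdbc.
public class LayeredCallChain {
    static boolean errorOccurred = false;   // almost always false at runtime
    static boolean loggingEnabled = false;  // almost always false at runtime

    static String webContainer(String request) {
        if (errorOccurred) return handleError("web container");
        if (loggingEnabled) log("web container");
        return jsp(request);
    }
    static String jsp(String request) {
        if (errorOccurred) return handleError("JSP");
        if (loggingEnabled) log("JSP");
        return ejb(request);
    }
    static String ejb(String request) {
        if (errorOccurred) return handleError("EJB");
        if (loggingEnabled) log("EJB");
        return jdbc("SELECT * FROM t WHERE k = '" + request + "'");
    }
    static String jdbc(String sql) {
        if (errorOccurred) return handleError("JDBC");
        if (loggingEnabled) log("JDBC");
        return "result of: " + sql;          // "send the SQL to DB" in the figure
    }

    static String handleError(String layer) { return "error in " + layer; }
    static void log(String layer) { System.out.println("log: " + layer); }

    public static void main(String[] args) {
        System.out.println(webContainer("42"));
    }
}
```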

Our Questions. 1. Can a trace-JIT break method boundaries more effectively? 2. Can a trace-JIT produce better code? 3. Can a trace-JIT compile more efficiently (i.e., compile time and code size)? 4. Can a Java trace-JIT beat a Java method-JIT?

Trace Selection vs. Method Inlining. Assumption: the call graph is too big to be fully inlined into the root node. Method inlining forms hierarchical regions; trace selection forms contiguous regions. (Figure: a call graph whose nodes are methods with and without loops, with the inlined region and the selected traces highlighted in different colors.)

Baseline Method-JIT Components. (Figure: the Java VM (interpreter, compiled-method dispatcher, garbage collector, class libraries) issues compilation requests for frequently executed methods to the method-JIT (IR generator, optimizers, code generator), which places compiled code into the code cache.)

Our Trace-JIT Architecture. (Figure: the interpreter feeds execution events to the tracing runtime's trace selection engine; selected traces (Java bytecode) pass through trace side-exit elimination, the IR generator, the optimizers, and the code generator into the code cache; the trace dispatcher uses a hash map (e.g. mapping to compiled code addresses) to jump into compiled traces. Boxes are marked as new, modified, or unmodified components relative to the baseline.)
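A hedged sketch of the dispatch side of this diagram (invented names and threshold, not the real J9 code): before interpreting a bytecode that can start a trace, the dispatcher consults a hash map from bytecode address to compiled trace; on a hit it jumps into the code cache, on a miss it lets the interpreter continue and updates a hotness counter.

```java
// Hypothetical sketch of the trace dispatcher's hash-map lookup shown in the
// figure: bytecode address -> compiled trace entry point. All names and the
// threshold are invented for illustration.
import java.util.HashMap;
import java.util.Map;

public class TraceDispatcherSketch {
    interface CompiledTrace { int run(); }            // stands in for a code-cache entry point

    private final Map<Integer, CompiledTrace> compiledTraces = new HashMap<>();
    private final Map<Integer, Integer> hotness = new HashMap<>();
    private static final int HOT_THRESHOLD = 100;     // assumed value

    /** Called before interpreting the bytecode at this address; returns true if compiled code ran. */
    boolean dispatch(int bytecodeAddress) {
        CompiledTrace trace = compiledTraces.get(bytecodeAddress);
        if (trace != null) {
            trace.run();                              // jump into the code cache
            return true;
        }
        int count = hotness.merge(bytecodeAddress, 1, Integer::sum);
        if (count == HOT_THRESHOLD) {
            // the trace selection engine would start recording a trace here
        }
        return false;                                 // keep interpreting
    }

    public static void main(String[] args) {
        TraceDispatcherSketch d = new TraceDispatcherSketch();
        d.compiledTraces.put(42, () -> 0);            // pretend a trace at address 42 is compiled
        System.out.println(d.dispatch(42));           // true: ran compiled code
        System.out.println(d.dispatch(7));            // false: stays in the interpreter
    }
}
```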

Our Trace-JIT Architecture. The interpreter is modified to call a hook at control-flow events: branches, method invocations, method returns, and exceptions. (Figure: the same architecture diagram, highlighting the execution-event hook between the interpreter and the trace selection engine.)
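A minimal sketch of what "a hook at control-flow events" could look like, assuming a toy bytecode interpreter written in Java with invented names and opcodes (the real J9 interpreter is not shown in the slides):

```java
// Minimal, hypothetical sketch of an interpreter dispatch loop that calls
// into the tracing runtime at every control-flow event, as described on the
// slide. TracingRuntime, the opcode encoding, and the program are invented.
import java.util.ArrayDeque;
import java.util.Deque;

class TracingRuntime {
    // Called at branches, invocations, returns, and exceptions; this is where
    // hot-path detection and trace recording would hook in.
    void onControlFlowEvent(String kind, int bytecodeIndex) { }
}

public class InterpreterSketch {
    private final TracingRuntime tracing = new TracingRuntime();

    void interpret(int[] bytecodes) {
        Deque<Integer> returnStack = new ArrayDeque<>();
        int pc = 0;
        while (pc < bytecodes.length) {
            switch (bytecodes[pc]) {
                case 1:  // conditional branch taken
                    tracing.onControlFlowEvent("branch", pc);
                    pc = bytecodes[pc + 1];
                    break;
                case 2:  // method invocation
                    tracing.onControlFlowEvent("invoke", pc);
                    returnStack.push(pc + 2);
                    pc = bytecodes[pc + 1];
                    break;
                case 3:  // method return
                    tracing.onControlFlowEvent("return", pc);
                    pc = returnStack.isEmpty() ? bytecodes.length : returnStack.pop();
                    break;
                case 4:  // throw
                    tracing.onControlFlowEvent("exception", pc);
                    pc = bytecodes.length;  // unwinding elided
                    break;
                default: // straight-line bytecode: no hook needed
                    pc += 1;
            }
        }
    }

    public static void main(String[] args) {
        // tiny "program": invoke(target = 3), one plain bytecode, return
        new InterpreterSketch().interpret(new int[] {2, 3, 0, 3});
    }
}
```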

Our Trace-JIT Architecture. The trace selection engine identifies two types of hot paths: linear traces and cyclic traces. (Figure: the same architecture diagram, highlighting a linear trace A→B that ends in an exit stub and a cyclic trace A→B that loops back to its head, each with side exits into exit stubs.)

Trace Selection. 1. Identify a hot trace head: a taken target of a backward branch, or a bytecode that follows an exit point of an existing trace. 2. Record the next execution path starting from the trace head. 3. Stop recording when the trace being recorded forms a cycle (loop), reaches a pre-defined maximum length, calls or returns to a JNI (native) method, or throws an exception.
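A hedged sketch of the recording step and its stop conditions, with invented names, a simplified event stream, and an assumed length limit; the real engine works on Java bytecode inside the VM:

```java
// Hypothetical sketch of the trace recording described on this slide:
// start at a hot trace head and stop on a cycle, the maximum length,
// a JNI call/return, or an exception.
import java.util.ArrayList;
import java.util.List;

public class TraceRecorderSketch {
    static final int MAX_TRACE_LENGTH = 128;   // pre-defined maximum length (assumed value)

    enum EventKind { BYTECODE, JNI_CALL_OR_RETURN, EXCEPTION }

    record Event(EventKind kind, int bytecodeAddress) {}

    /** Records one trace starting at traceHead from a stream of executed events. */
    static List<Event> record(int traceHead, Iterable<Event> execution) {
        List<Event> trace = new ArrayList<>();
        for (Event e : execution) {
            if (e.kind() == EventKind.JNI_CALL_OR_RETURN) break;          // stop: native boundary
            if (e.kind() == EventKind.EXCEPTION) break;                   // stop: exception thrown
            if (!trace.isEmpty() && e.bytecodeAddress() == traceHead) break; // stop: cycle (loop)
            trace.add(e);
            if (trace.size() >= MAX_TRACE_LENGTH) break;                  // stop: maximum length
        }
        return trace;
    }

    public static void main(String[] args) {
        List<Event> exec = List.of(
            new Event(EventKind.BYTECODE, 10),
            new Event(EventKind.BYTECODE, 14),
            new Event(EventKind.BYTECODE, 10));  // back at the head: cycle
        System.out.println("recorded " + record(10, exec).size() + " events"); // prints 2
    }
}
```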

Our Trace-JIT Architecture. Trace side-exit elimination applies a simple one-pass value propagation that exploits the simple topology of traces; it removes potential side exits and reduces IR tree size and compilation time. (Figure: the same architecture diagram, highlighting the trace side-exit elimination stage between the trace selection engine and the IR generator.)
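A minimal sketch of what a one-pass value propagation over a linear trace might look like, assuming a toy IR with guard operations (all names invented): a single forward walk is enough because a trace has no join points, so constants proven earlier on the trace let later guards, and their side exits, be dropped.

```java
// Hypothetical sketch of one-pass value propagation over a linear trace.
// The trace IR, guard ops, and constant map are invented; the point is that
// one forward pass over a join-free trace can prove some guards redundant
// and delete their side exits.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SideExitElimination {
    // Toy IR: "const x = v" (CONST) or "guard x == v else side-exit" (GUARD_EQ).
    record Op(String kind, String var, int value) {}

    static List<Op> eliminateSideExits(List<Op> trace) {
        Map<String, Integer> known = new HashMap<>();   // constants proven so far on this trace
        List<Op> optimized = new ArrayList<>();
        for (Op op : trace) {                           // single forward pass: no joins on a trace
            if (op.kind().equals("CONST")) {
                known.put(op.var(), op.value());
                optimized.add(op);
            } else if (op.kind().equals("GUARD_EQ")) {
                Integer v = known.get(op.var());
                if (v != null && v == op.value()) {
                    continue;                           // guard always holds: side exit removed
                }
                optimized.add(op);                      // otherwise keep the side exit
            } else {
                optimized.add(op);
            }
        }
        return optimized;
    }

    public static void main(String[] args) {
        List<Op> trace = List.of(
            new Op("CONST", "x", 0),
            new Op("GUARD_EQ", "x", 0),   // removable: x is known to be 0 on this trace
            new Op("GUARD_EQ", "y", 1));  // kept: nothing is known about y
        System.out.println(eliminateSideExits(trace).size() + " ops remain"); // prints 2
    }
}
```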

Peak Performance (DaCapo-9.12). (Chart: performance relative to the method-based JIT, higher is faster, for avrora, batik, eclipse, fop, h2, jython, luindex, lusearch, pmd, sunflow, tomcat, tradebeans, xalan, and the geomean.) The trace-JIT was almost comparable to the method-JIT on average, ranging from 21% slower to 19% faster.

Outline: Background; Overview of our trace-JIT; Trace Selection and Performance; Summary.

Trace Selection and Performance. Block duplication is inherent to any trace selection algorithm; e.g., most blocks following any join node are duplicated across traces. Generating a larger compilation scope by allowing more duplication is key to achieving higher peak performance: it gives the compiler more optimization opportunities and reduces trace-transitioning overhead. But it may also yield longer compilation time (costlier code analysis) and more duplicated code among traces. We observed a lot of duplication that does not help performance.
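As a small, hypothetical illustration of why duplication is inherent: the code after a join point appears on every trace that reaches it, so the trace through the then-path and the trace through the else-path each carry their own copy of the common tail.

```java
// Hypothetical example of tail duplication at a join node. Both paths through
// the loop body (cond true and cond false) reach the same join, so each trace
// through the loop carries its own copy of the code after the if/else.
public class JoinDuplication {
    static long run(int[] data, boolean[] cond) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            int v;
            if (cond[i]) {          // two paths reach the join below
                v = data[i] * 2;
            } else {
                v = data[i] + 1;
            }
            // join node: this tail is duplicated on every trace through the loop
            sum += v;
            sum ^= (sum >>> 3);
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};
        boolean[] cond = {true, false, true, false};
        System.out.println(run(data, cond));
    }
}
```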

Understanding the Causes (I): Short-Lived Traces. SYMPTOM: trace A is formed first and trace B is formed later; afterwards, A is no longer entered. ROOT CAUSE: 1. trace A is formed before trace B, but node B dominates node A; 2. node A is part of trace B. On average, 40% of the traces in DaCapo 9.12 are short-lived. (Chart: percentage of traces selected by the baseline algorithm that are executed fewer than 500 times, for DayTrader and each DaCapo benchmark.)

Compiled code size reduced by 70%. (Chart: relative JITted code size, shorter is better, for the baseline and the +exact-bb, +head-opt, +clear-counter, +struct-trunk, +prof, and +prof w/ trunk techniques, on DayTrader 2.0 running on WebSphere 7 and on DaCapo 9.12.)

Startup time reduced by 45%. (Chart: relative startup time, shorter is better, for the same set of techniques, on DayTrader and the DaCapo 9.12 benchmarks.)

Peak performance was also improved in DayTrader! (Chart: relative peak performance, higher is faster, for the same set of techniques, on DayTrader and the DaCapo 9.12 benchmarks.)

Trace Selection and Performance for Large-scale Applications. Generating a larger compilation scope by allowing more duplication is key to achieving higher peak performance: more optimization opportunities for the compiler and less trace-transitioning overhead. But it may hurt startup performance (longer compilation time, more duplicated code among traces), and it may also hurt peak performance for large applications: larger applications tend to cause more instruction cache misses, and ~20% of CPU cycles were wasted on I-cache misses in DayTrader.

Generating longer traces also does not necessarily work. Supporting a loop inside a trace (an extended-form trace instead of separate linear and cyclic traces) gives fewer trace transitions (L1 I-cache misses: -10%) and potentially more optimization opportunities, but more code duplication among traces (total code size: +20%, L3 instruction misses: +6%). No improvement was observed: performance was -1% in DayTrader and +1% on average for DaCapo-9.12 (up to +2.5%). (Figure: separate linear and cyclic traces next to an extended-form trace that contains the loop, each with their trace exits.)

Outline: Background; Overview of our trace-JIT; Trace Selection and Performance; A large-scale Java application with the trace-JIT; Summary.

Questions and Our Answers. 1. Can a trace-JIT break method boundaries more effectively? Workload dependent: e.g., the trace-JIT produces a 2x larger scope than the method-JIT for Jython, but a 9% smaller scope for DayTrader. But simply enlarging the compilation scope does not help performance. 2. Can a trace-JIT produce better code? Retrofitting method-JIT optimizers for the trace-JIT does not yield significantly better code beyond the benefit of the larger scope, but opportunities may exist in new trace-specific optimizations; how to generate trace exit code is an interesting challenge unique to a trace-JIT. 3. Can a trace-JIT compile more efficiently (i.e., compile time and code size)? Yes: the simple topology of linear and cyclic traces can be compiled much more efficiently. 4. Can a Java trace-JIT beat a Java method-JIT? It is not easy to beat a mature method-JIT. We feel that specific types of workloads, such as Jython, respond better to a trace-JIT than to a method-JIT.

How about other languages? TraceMonkey (JavaScript), http://hacks.mozilla.org/2010/03/improving-javascript-performance-with-jagermonkey/ : "... the approach that we've taken with tracing tends to interact poorly with certain styles of code. ... when we're able to stay on trace (more on this later) TraceMonkey wins against every other engine." Pyston (Python), https://tech.dropbox.com/2014/04/introducing-pyston-an-upcoming-jit-based-python-implementation/ : "Whether or not the same performance advantage holds for Python is an open question, but since the two approaches are fundamentally incompatible, the only way to start answering the question is to build a new method-at-a-time JIT."

Lessons Learned. The trace selection algorithm has a big impact on performance and code size; it is more flexible than method inlining and hence is an interesting tool for evaluating the effect of code duplication. What did not work for us: extending the trace scope by allowing non-linear structures (e.g., trace grouping, trace trees) did not yield any performance improvement for DayTrader. Possible future steps: opportunities may exist in new trace-specific optimizations, e.g., allocation removal [Bolz 10] and aggressive redundancy elimination, and in improving profile accuracy; profile accuracy is more important in a trace-JIT than in a method-JIT.

Our Publications on the trace-JIT.
Hiroshi Inoue, Hiroshige Hayashizaki, Peng Wu, and Toshio Nakatani, "A Trace-based Java JIT Compiler Retrofitted from a Method-based Compiler", CGO 2011. Focuses on the trace-optimization aspects of the JIT and discusses the scope mismatch problem.
Hiroshige Hayashizaki, Peng Wu, Hiroshi Inoue, Mauricio Serrano, and Toshio Nakatani, "Improving the Performance of Trace-based Systems by False Loop Filtering", ASPLOS 2011. Focuses on the trace selection algorithm and the fragmentation of traces due to the false loop problem.
Peng Wu, Hiroshige Hayashizaki, Hiroshi Inoue, and Toshio Nakatani, "Reducing Trace Selection Footprint for Large-scale Java Applications without Performance Loss", OOPSLA 2011. Focuses on the trace selection algorithm and the code duplication problem.
Hiroshi Inoue, Hiroshige Hayashizaki, Peng Wu, and Toshio Nakatani, "Adaptive Multi-Level Compilation in a Trace-based Java JIT Compiler", OOPSLA 2012. Focuses on adaptive compilation in the trace-JIT.