Just-In-Time Compilers & Runtime Optimizers


COMP 412, Fall 2017: Just-In-Time Compilers & Runtime Optimizers

[Diagram: source code → Front End → IR → Optimizer → IR → Back End → target code]

Copyright 2017, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in COMP 412 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.

Not in EaC2e; might be in EaC3e.

Runtime Compilation

Many modern languages have features that make it difficult to produce high-quality code at compile time:
- Late binding
- Dynamic loading
- Polymorphism

One approach to improving performance in these languages is to compile, or to optimize, at runtime:
- Just-in-time compilers
- Runtime optimizers (a distinction with very little difference)

We will talk about one of each. The idea is simple: optimize at runtime, when more facts are known.

This Lecture

We will look at two examples: Dynamo & the Hotspot Server Compiler. They have very different use cases:
- Dynamo sits between a normal executable & the hardware; it finds hot traces & uses local optimization to improve them.
- Hotspot sits inside the JVM & compiles hot methods to native code; it spends more compile time to achieve more improvement.

They are good examples of JIT compilation:
- Different granularities & levels of aggression
- Different runtime environments enable different techniques

Aside: the grandfather of Hotspot may well be Deutsch & Schiffman's system for Smalltalk-80. The Hotspot authors studied that paper in COMP 512; Vasanth Bala got his PhD before COMP 512 existed.

Runtime Optimizers: Dynamo

Dynamo and DynamoRIO are classic examples of a runtime optimizer:
- Attach to a running executable
- Capture hot traces, optimize them, & link them into the running code

Dynamo ran on an HP PA-8000:
- Worked with, essentially, any executable
- The programmer adds a call to start the system

DynamoRIO re-implemented Dynamo for the x86 under Windows & Linux.

[Diagram: normal execution — the code (static & global storage, runtime stack, free space, heap) running directly on the processor]

Runtime Optimizers: Dynamo

Dynamo sits between the code & the hardware:
- Interprets the code & builds up traces
- Collects profile data
- When a trace's count exceeds 50, it compiles & optimizes the trace
- Future executions run the compiled version of the trace

[Diagram: execution with Dynamo — Dynamo interposed between the program's code and the processor]

Bala, Duesterwald, & Banerjia, "Dynamo: A transparent dynamic optimization system," Proceedings of PLDI 2000, ACM, 2000.

How does it work?

[Flowchart of Dynamo's main loop:]
1. Interpret until a taken branch; look up the branch target in the fragment cache.
2. Hit ⇒ jump to the cached fragment.
3. Miss ⇒ if the target meets the start-of-trace condition, increment the counter for the branch target.
4. When the counter exceeds the hot threshold, switch modes: interpret + generate code until a taken branch, repeating until the end-of-trace condition holds.
5. Create & optimize the new fragment; emit it, link it, & recycle the counter.
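A minimal sketch of that loop follows, in Java. All of the names (Fragment, interpretUntilTakenBranch, recordAndOptimizeTrace, ...) are hypothetical stand-ins for Dynamo's machine-code-level machinery; only the threshold of 50 and the control structure come from the slides.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of Dynamo's main loop; the real system works on native machine
// code, not Java objects, so every helper here is a hypothetical stub.
class DynamoSketch {
    static final int HOT_THRESHOLD = 50;                // Dynamo's default

    interface Fragment { int execute(); }               // optimized native trace

    final Map<Integer, Fragment> fragmentCache = new HashMap<>();
    final Map<Integer, Integer> counters = new HashMap<>();

    void run(int pc) {
        while (pc >= 0) {
            int target = interpretUntilTakenBranch(pc);     // slow path
            Fragment frag = fragmentCache.get(target);
            if (frag != null) {                             // cache hit
                pc = frag.execute();                        // fast path
            } else {
                if (isStartOfTrace(target)) {
                    int count = counters.merge(target, 1, Integer::sum);
                    if (count > HOT_THRESHOLD) {
                        fragmentCache.put(target, recordAndOptimizeTrace(target));
                        counters.remove(target);            // recycle the counter
                    }
                }
                pc = target;
            }
        }
    }

    // Stubs standing in for the interpreter and the trace compiler.
    int interpretUntilTakenBranch(int pc) { return -1; }
    boolean isStartOfTrace(int target) { return true; }
    Fragment recordAndOptimizeTrace(int startPc) { return () -> -1; }
}
```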

How does it work?

[Flowchart repeated from the previous slide, annotated:]
- Interpret the cold code until the code takes a branch.
- Is the target in the fragment cache? Yes ⇒ jump to the fragment. No ⇒ decide whether or not to start a new fragment.
- Cold code is interpreted (slow); hot code is optimized & run (fast).
- Hot must make up for cold, if the system is to be profitable.

How does it work?

Start of trace ⇒ target of a backward branch (a loop header), or ⇒ target of a fragment-cache exit branch.

If the start-of-trace condition holds, bump the trace's counter. If the counter exceeds some threshold value (50), move into the code generation & optimization phase to convert the code into an optimized fragment in the fragment cache. Otherwise, go back to interpreting.

The counter forces a trace to be hot before Dynamo spends effort to improve it.
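Both conditions are cheap tests, which matters because they run on every interpreted branch. A sketch, with hypothetical names; the backward-branch test here uses address order as a stand-in for Dynamo's actual loop-header heuristic:

```java
// Sketch of the start-of-trace test; both conditions are cheap checks.
class TraceHeuristics {
    // A backward branch usually targets a loop header; an exit from the
    // fragment cache marks code that follows an already-hot region.
    static boolean isStartOfTrace(int branchPc, int targetPc,
                                  boolean cameFromFragmentExit) {
        boolean backwardBranch = targetPc <= branchPc;   // likely loop header
        return backwardBranch || cameFromFragmentExit;
    }
}
```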

How does it work?

End of trace ⇒ a taken backward branch (the bottom of a loop), or ⇒ a fragment-cache entry label.

To build an optimized fragment:
- Interpret each operation & generate low-level IR code.
- Encounter a branch? If the end-of-trace condition holds, create & optimize the new fragment; emit the fragment, link it to other fragments, & free the counter. Otherwise, keep interpreting.
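The record-and-compile mode might look like the following sketch. The Op and Fragment types and every helper are hypothetical; the loop structure (interpret + generate IR until the end-of-trace condition holds, then optimize & emit) is what the slide describes.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the record-and-compile mode (hypothetical Op/Fragment types).
class TraceRecorder {
    interface Op { boolean isTakenBranch(); }
    interface Fragment { int execute(); }

    Fragment recordAndOptimizeTrace(int startPc) {
        List<Op> trace = new ArrayList<>();
        int pc = startPc;
        while (true) {
            Op op = decode(pc);
            trace.add(op);                    // emit low-level IR as we interpret
            pc = interpret(op);
            if (op.isTakenBranch() && isEndOfTrace(pc, startPc)) break;
        }
        optimizeLocally(trace);               // two linear passes (later slides)
        return emitLinkAndStub(trace);        // native code + exit stubs
    }

    // End of trace: a taken backward branch (back to the trace head)
    // or a branch into an existing fragment-cache entry.
    boolean isEndOfTrace(int nextPc, int startPc) {
        return nextPc == startPc || inFragmentCache(nextPc);
    }

    // Stubs standing in for the interpreter and code generator.
    Op decode(int pc) { return () -> true; }
    int interpret(Op op) { return 0; }
    void optimizeLocally(List<Op> trace) { }
    Fragment emitLinkAndStub(List<Op> trace) { return () -> -1; }
    boolean inFragmentCache(int pc) { return false; }
}
```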

How does it work?

Moving the code fragment into the cache: the end-of-fragment branch jumps to a stub that jumps back to the interpreter.

Fragment Construction

[CFG diagram: blocks A through J, with a call between G and I and a return after J]

- The profile mechanism identifies A as the start of a hot path.
- After threshold trips through A, the next path is compiled.
- The speculative construction method adds C, D, G, I, J, & E, then the backward branch to A.
- The runtime compiler builds a fragment for ACDGIJE, with exits for B, H, & F, and builds the branch for E → A.

The construction is speculative because it assumes that the 51st execution will take the hot path.

Effects of Fragment Construction

[Diagram: the linear fragment ACDGIJE, entered from the interpreter, with exit stubs F*, B*, & H* that return to the interpreter]

- The hot path is linearized: eliminates branches, creates a superblock, applies local optimization.
- Cold-path branches remain: their targets are stubs that restart the interpreter at the target op.
- The path can include call & return: they become jumps, not branches, so the fragment captures interprocedural effects.
- Indirect branches: speculate on the target; fall back on a hash table of branch targets.
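The last point deserves a sketch. Dynamo emits this logic as machine code; the Java below, with hypothetical names throughout, shows the shape of the guard-plus-fallback idea, similar in spirit to an inline cache:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: a linearized fragment handles an indirect branch by guarding
// on the hot target observed while recording, else falling back to a
// hash-table lookup, else exiting to the interpreter.
class IndirectBranchSketch {
    interface Fragment { int execute(); }

    final int speculatedTarget;                 // hot target seen while recording
    final Fragment onTracePath;                 // rest of the linearized trace
    final Map<Integer, Fragment> targetTable = new HashMap<>();

    IndirectBranchSketch(int speculatedTarget, Fragment onTracePath) {
        this.speculatedTarget = speculatedTarget;
        this.onTracePath = onTracePath;
    }

    int dispatch(int actualTarget) {
        if (actualTarget == speculatedTarget) {
            return onTracePath.execute();             // stay on the fast trace
        }
        Fragment f = targetTable.get(actualTarget);   // fall back: hash table
        return (f != null) ? f.execute()
                           : exitToInterpreter(actualTarget);
    }

    int exitToInterpreter(int pc) { return pc; }      // stub
}
```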

Sources of Improvement

Many small things contribute to make Dynamo profitable:
- Linearization eliminates branches & improves TLB & I-cache behavior.
- Two passes of local optimization: redundancy elimination, copy propagation, constant folding, simple strength reduction, loop unrolling, loop-invariant code motion, & redundant-load removal; one forward pass, one backward pass.
- Linear code with premature exits: Dynamo appears to split traces at an intermediate entry point. Fragment linking should make execution fast, while splitting stops code motion across the intermediate entry point.
- Keep in mind that "local" in a trace captures interprocedural effects.
- Engineering detail makes a difference: fragment cache management sparked lots of work in the noughts.

Redundant Computations

Some on-trace redundancies are easily detected. Suppose the trace defines some register, say r3. That definition may be partially dead:
- Live on exit but not in the trace ⇒ move it to the exit stub.
- Live in the trace but not at an early exit ⇒ move it below the exit.

This implies that we have LIVE information for the code:
- Collect LIVE sets during the backward pass.
- Move partially dead definitions during the forward pass.
- Store a summary LIVE set for each fragment; that allows interfragment optimization.

Can we know this? Only if the exit is to a fragment rather than to the interpreter. Otherwise, we must assume that the definition is LIVE on each exit.
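The backward LIVE pass over a linear trace is simple enough to sketch. The Op type is hypothetical; the pessimistic assumption for exits that leave the fragment cache is modeled by merging a conservative liveAtExit set at every side exit:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the backward LIVE pass over a linear trace.
// LIVE before an op = (LIVE after it - defs) ∪ uses; at each exit to the
// interpreter we must assume everything the program names is live.
class TraceLiveness {
    static class Op {
        Set<String> uses = new HashSet<>();
        Set<String> defs = new HashSet<>();
        boolean hasSideExit;                 // branch out of the trace?
    }

    // Returns LIVE-in for each op; liveAtExit is the conservative set
    // assumed on exits whose target is not a known fragment.
    static List<Set<String>> computeLive(List<Op> trace, Set<String> liveAtExit) {
        List<Set<String>> liveIn = new ArrayList<>(trace.size());
        for (int i = 0; i < trace.size(); i++) liveIn.add(new HashSet<>());

        Set<String> live = new HashSet<>(liveAtExit);    // after the last op
        for (int i = trace.size() - 1; i >= 0; i--) {
            Op op = trace.get(i);
            if (op.hasSideExit) live.addAll(liveAtExit); // merge the exit's LIVE set
            live.removeAll(op.defs);
            live.addAll(op.uses);
            liveIn.get(i).addAll(live);
        }
        return liveIn;
    }
}
```

The forward pass would then use these sets to move partially dead definitions, as the slide describes.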

Fragment Linking

What happens if another path becomes hot? Say ABDGIJE: block B meets the start-of-trace condition, because it is the target of an exit from the existing fragment.

[Diagrams: the original CFG and the speculative trace ACDGIJE with exit stubs F*, B*, & H*]

Fragment Linking

When the counter in B reaches hot, Dynamo builds a fragment.

[Diagram: fragment ACDGIJE with exit stubs F*, B*, & H*, entered from the interpreter]

Fragment Linking

When the counter in B reaches hot, Dynamo builds a second fragment, BDGIJE, with its own exit stubs.

[Diagram: fragments ACDGIJE (stubs F*, B*, & H*) and BDGIJE (stubs F*, A*, & H*)]

Fragment Linking

When the counter reaches hot, Dynamo:
- Builds the fragment
- Links the exit A → B to the new fragment
- Links the exit E → A to the old fragment

[Diagram: the two fragments with the A → B and E → A exits linked directly, bypassing the interpreter]

Fragment Linking

When the counter reaches hot, Dynamo builds the fragment, links the exit A → B to the new fragment, & links the exit E → A to the old fragment.

What if stub B* held a redundant op? We have LIVE on entry to B, so we can test the LIVE sets for both the exit from A & the entry to B; the test may show that the op is dead.

[Diagram: the linked fragments ACDGIJE & BDGIJE, each with remaining exit stubs F* & H*]
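A sketch of that test, assuming the summary LIVE sets described two slides back are available for both fragments (all names hypothetical):

```java
import java.util.Set;

// Once A's exit is linked directly to fragment B, a definition sitting in
// the exit stub is dead if neither the exit path from A nor fragment B
// reads it (sketch; in practice only the linked-exit case can prove this).
class LinkTimeDeadness {
    static boolean stubDefIsDead(String reg,
                                 Set<String> liveAtExitFromA,
                                 Set<String> liveOnEntryToB) {
        return !liveAtExitFromA.contains(reg)
            && !liveOnEntryToB.contains(reg);
    }
}
```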

Results

They measured performance on codes from SPEC95.

[Graphic from the Ars Technica report on Dynamo: http://www.arstechnica.com/reviews/1q00/dynamo/dynamo-1.html]

Now, recall that Dynamo operated in competition with the bare hardware, the lowest-overhead situation that we can imagine. What about a Java JIT, where the JVM has much higher overhead?

JITs: The Java Hotspot™ Server Compiler

Java's execution model is defined by the Java Virtual Machine (JVM):
- Code is compiled into bytecode for the JVM.
- At runtime, the JVM interprets the bytecode.
- Advantages for portability & security; a disadvantage for execution speed.

[Diagram of the JVM: the class loader feeding bytecodes into the method area & heap; per-thread PC registers, JVM stacks, & native stacks; the execution engine; the native method interface & native method libraries]

JITs: The Java Hotspot™ Server Compiler

Java's execution model is defined by the Java Virtual Machine (JVM). The classic example of a Java JIT is the Hotspot Server Compiler. A JIT adds another execution mode: native methods for user code.

[Diagram: the same JVM, now with a JIT beside the execution engine & a code cache holding native code for compiled methods]

Hotspot

How does all of this fit together?
- The execution engine identifies a hot method, typically using a pre-set threshold (the Hotspot default is 10,000).
- The execution engine invokes the JIT to compile the hot method.
- The JIT compiles bytecode into native code that works within the JVM's structures.
- The JIT & execution engine arrange to link the new code into the running program.
- The class loader triggers selective de-optimization.

Runtime compilation places a premium on JIT efficiency. What does the JIT do?
- Translation to its internal IR
- Fast optimization, with emphasis on fast, which often means local
- Translation to native code & linking to other native code

Much of the speedup comes from eliminating the cost of interpretation.
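The hot-method test itself is just counter arithmetic. A sketch in the style the slides describe (the 10,000 default matches the slide; the field and method names are hypothetical, not HotSpot's actual internals):

```java
// Sketch of counter-based hot-method detection. A later slide notes that
// Hotspot counts both method entries and backward branches and compares
// their sum against the threshold.
class MethodProfile {
    static final int COMPILE_THRESHOLD = 10_000;

    int invocationCount;   // bumped at method entry
    int backedgeCount;     // bumped on backward branches (loops)
    boolean compileRequested;

    // Called by the interpreter on entry & on each backward branch.
    void onEntry()    { invocationCount++; maybeCompile(); }
    void onBackedge() { backedgeCount++;   maybeCompile(); }

    private void maybeCompile() {
        if (!compileRequested
                && invocationCount + backedgeCount > COMPILE_THRESHOLD) {
            compileRequested = true;      // hand the method to the JIT
        }
    }
}
```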

Java: JVM versus JIT

Interpreted code in the JVM:
- The JVM interprets each bytecode: fetch, decode, & execute in software¹.
- Multiple hardware operations per bytecode, almost certainly sequential.
- Each bytecode is interpreted every time it is encountered.

Code produced by the JIT in the JVM:
- Execution mixes interpreted & compiled code.
- JIT-compiled code runs as native operations: fetch, decode, & execute in hardware; the three phases run in parallel; fewer operations per bytecode.
- JIT-compiled code uses the JVM's runtime structures, so there is little or no translation & efficient switching from JVM to JIT and back.
- A method is translated once.

The slope difference between a Python lab 1 and a Java lab 1 is, in large part, the JIT.

¹ See page 6ff in "The ILOC Virtual Machine" (Lecture 5) on the course web site's Lectures page.
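To make "fetch, decode, & execute in software" concrete, here is a toy stack-machine interpreter loop. The opcode values match the real JVM's iconst_1, iadd, & ireturn, but the loop is a sketch, not HotSpot's interpreter:

```java
// Every bytecode pays a software fetch/decode/dispatch on every execution;
// that per-op overhead is what JIT compilation eliminates.
class ToyInterpreter {
    static final byte ICONST_1 = 0x04;
    static final byte IADD     = 0x60;
    static final byte IRETURN  = (byte) 0xAC;

    static int run(byte[] code) {
        int[] stack = new int[16];
        int pc = 0, sp = 0;
        while (true) {
            byte op = code[pc++];                       // fetch (software)
            switch (op) {                               // decode & dispatch
                case ICONST_1: stack[sp++] = 1; break;  // execute
                case IADD:     sp--; stack[sp - 1] += stack[sp]; break;
                case IRETURN:  return stack[--sp];
                default: throw new IllegalStateException("bad opcode " + op);
            }
        }
    }

    public static void main(String[] args) {
        // Computes 1 + 1 by interpreting four bytecodes.
        System.out.println(run(new byte[]{ ICONST_1, ICONST_1, IADD, IRETURN }));
    }
}
```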

Java: JVM versus JIT

[Plots: execution time in seconds versus lines of ILOC in the input file, for a good lab 1 in Python (which is interpreted) and a good lab 1 in Java. In the marked region of the Java curve, you see the switchover from JVM-interpreted code to native code.]

Hotspot

Hotspot looks very different from Dynamo. It compiles single methods, or inlined chains of methods:
- It identifies performance-critical methods and works backward from them to decide how many levels of methods (in the call graph) it should compile.
- Counters sit at method entry and on backward branches (DOM); when the sum of these counters exceeds the threshold, Hotspot compiles the method and, perhaps, the chain that calls it.

Hotspot parses Java bytecodes into a low-level, graphical IR:
- The first pass identifies basic blocks; the second pass generates the graph.
- Hotspot performs local optimization during this translation.
- The parser identifies loop headers and creates a list of headers.

The IR is a variant of Ferrante, Ottenstein, & Warren's data-flow graphs. See Click & Paleczny, "A Simple Graph-Based Intermediate Representation," or Click's thesis for details.
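The first pass is the classic leader-finding algorithm. A sketch over a simplified, hypothetical Insn type that stands in for real bytecode decoding:

```java
import java.util.List;
import java.util.TreeSet;

// Sketch of the first parsing pass: find basic-block leaders. A leader is
// the method entry, any branch target, & the instruction that follows a
// branch; block i then spans [leader_i, leader_{i+1}).
class BlockFinder {
    static class Insn {
        int pc;            // offset of this instruction
        int nextPc;        // offset of the following instruction
        boolean isBranch;  // transfers control?
        int targetPc;      // branch target, if isBranch
    }

    static TreeSet<Integer> findLeaders(List<Insn> method) {
        TreeSet<Integer> leaders = new TreeSet<>();
        if (!method.isEmpty()) leaders.add(method.get(0).pc);  // entry
        for (Insn i : method) {
            if (i.isBranch) {
                leaders.add(i.targetPc);   // the target starts a block
                leaders.add(i.nextPc);     // so does the fall-through point
            }
        }
        return leaders;
    }
}
```

The second pass would walk the bytecode block by block, building the graph between these leaders.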

Hotspot

Hotspot's optimizer uses classical optimization techniques, adapted to a JIT:
- Full-fledged class hierarchy analysis (CHA) to infer types
- Inlining based on the results of CHA
- Fast-path / slow-path optimization (allocation, instanceof, ...)
- Optimistic constant propagation (Wegman-Zadeck)
- Iterative global value numbering (iterate around loops)
- BURS-based instruction selection (locally optimal)
- Global instruction scheduling (Click's algorithm)
- A sparse Chaitin-Briggs, graph-coloring register allocator
- A final pass of peephole optimization

Global value numbering starts from the list of loop headers; that's where the opportunity lies. GVN includes a bunch of transformations, including cloning & loop peeling, constant propagation, dead-code elimination, & value numbering.

Paleczny, Vick, & Click, "The Java Hotspot Server Compiler," Proceedings of JVM '01, April 2001.
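The hashing core of value numbering is worth sketching: two expressions with the same opcode & the same operand value numbers get the same value number, so the second computation is redundant. This is only the core; Hotspot's GVN, as described above, wraps it in an iteration over the graph with folding, peeling, & dead-code elimination. All names here are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Hash-based value numbering over (opcode, operand-VN) keys.
class ValueNumbering {
    private final Map<String, Integer> table = new HashMap<>();
    private int nextVN = 0;

    int freshVN() { return nextVN++; }   // for leaves: constants, parameters

    // Sketch assumes a commutative opcode, so operand order is
    // canonicalized and a + b hashes to the same entry as b + a.
    int valueNumber(String opcode, int leftVN, int rightVN) {
        String key = opcode + ":" + Math.min(leftVN, rightVN)
                            + ":" + Math.max(leftVN, rightVN);
        return table.computeIfAbsent(key, k -> nextVN++);
    }
}
```

If valueNumber returns a number the compiler has already assigned to some node, the new expression can reuse that node's result instead of recomputing it.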

Hotspot Results

[Figure from the JVM '01 paper: SPECjvm98 (test mode) on IA32™, relative performance of mtrt, jess, compress, db, mpegaudio, jack, & javac under four configurations: no inlining, simple inline, no CHA, & FCS 2.0]

These results are for total execution time, including Hotspot execution.

Dynamo versus Hotspot

Dynamo:
- Input is running machine code; granularity is a runtime trace.
- Performs 2 linear local passes.
- Benefit comes from optimization and from linearization of the trace.
- Threshold is low (50); overhead is low.
- Dynamo is profitable running on the bare hardware (surprising).

Hotspot:
- Input is running Java bytecode; granularity is one method.
- Parses bytecode, builds an IR, performs classical optimizations, uses a BURS selector, applies a graph-coloring allocator.
- Threshold is high (sum > 10,000); overhead is higher than Dynamo's.
- Hotspot is profitable running inside the JVM.

Both systems worked quite well & inspired further work. Their success inspired systems that are in widespread use, from DynamoRIO through JITs for JavaScript (V8), PHP, and many others.