Just-In-Time Compilers & Runtime Optimizers


COMP 412, Fall 2017: Just-In-Time Compilers & Runtime Optimizers

[Diagram: source code → Front End → IR → Optimizer → IR → Back End → target code]

Copyright 2017, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in COMP 412 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.

Not in EaC2e; might be in EaC3e.

Runtime Compilation

Many modern languages have features that make it difficult to produce high-quality code at compile time:
- Late binding
- Dynamic loading
- Polymorphism

One approach to improving performance in these languages is to compile, or to optimize, at runtime:
- Just-in-time compilers
- Runtime optimizers (a distinction with very little difference)

We will talk about one of each. The idea is simple: optimize at runtime, when more facts are known.

This Lecture

We will look at two examples: Dynamo & the Hotspot Server Compiler. They have very different use cases:
- Dynamo sits between a normal executable & the hardware; it finds hot traces & uses local optimization to improve them.
- Hotspot sits inside the JVM & compiles hot methods to native code; it spends more compile time to achieve more improvement.

They are good examples of JIT compilation:
- Different granularities & levels of aggression
- Different runtime environments enable different techniques

Aside: the grandfather of Hotspot may well be Deutsch & Schiffman's system for Smalltalk-80. The Hotspot authors studied that paper in COMP 512; Vasanth Bala got his PhD before COMP 512 existed.

Runtime Optimizers: Dynamo

Dynamo and DynamoRIO are classic examples of a runtime optimizer:
- Attach to a running executable
- Capture hot traces, optimize them, & link them into the running code

Dynamo ran on an HP PA-8000:
- Worked with, essentially, any executable
- The programmer adds a call to start the system

DynamoRIO re-implemented Dynamo for the x86 under Windows & Linux.

[Diagram: normal execution — the code (static & global storage, runtime stack, free space, heap) running directly on the processor]

Runtime Optimizers: Dynamo

Dynamo sits between the code & the hardware:
- Interprets the code & builds up traces
- Collects profile data
- When a trace's count exceeds 50, it compiles & optimizes the trace
- Future executions run the compiled version of the trace

[Diagram: execution with Dynamo — Dynamo interposed between the program's code and the processor]

Bala, Duesterwald, & Banerjia, "Dynamo: A transparent dynamic optimization system," Proceedings of PLDI 2000, ACM, 2000.

How does it work?

[Flowchart of Dynamo's main loop:]
1. Interpret until a taken branch; look up the branch target in the fragment cache.
2. Hit ⇒ jump to the cached fragment.
3. Miss ⇒ if the target meets the start-of-trace condition, increment the counter for the branch target.
4. When the counter exceeds the hot threshold, switch modes: interpret + generate code until a taken branch, repeating until the end-of-trace condition holds.
5. Create & optimize the new fragment; emit it, link it, & recycle the counter.
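A minimal sketch of that loop follows, in Java. All of the names (Fragment, interpretUntilTakenBranch, recordAndOptimizeTrace, ...) are hypothetical stand-ins for Dynamo's machine-code-level machinery; only the threshold of 50 and the control structure come from the slides.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of Dynamo's main loop; the real system works on native machine
// code, not Java objects, so every helper here is a hypothetical stub.
class DynamoSketch {
    static final int HOT_THRESHOLD = 50;                // Dynamo's default

    interface Fragment { int execute(); }               // optimized native trace

    final Map<Integer, Fragment> fragmentCache = new HashMap<>();
    final Map<Integer, Integer> counters = new HashMap<>();

    void run(int pc) {
        while (pc >= 0) {
            int target = interpretUntilTakenBranch(pc);     // slow path
            Fragment frag = fragmentCache.get(target);
            if (frag != null) {                             // cache hit
                pc = frag.execute();                        // fast path
            } else {
                if (isStartOfTrace(target)) {
                    int count = counters.merge(target, 1, Integer::sum);
                    if (count > HOT_THRESHOLD) {
                        fragmentCache.put(target, recordAndOptimizeTrace(target));
                        counters.remove(target);            // recycle the counter
                    }
                }
                pc = target;
            }
        }
    }

    // Stubs standing in for the interpreter and the trace compiler.
    int interpretUntilTakenBranch(int pc) { return -1; }
    boolean isStartOfTrace(int target) { return true; }
    Fragment recordAndOptimizeTrace(int startPc) { return () -> -1; }
}
```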

How does it work?

[Flowchart repeated from the previous slide, annotated:]
- Interpret the cold code until the code takes a branch.
- Is the target in the fragment cache? Yes ⇒ jump to the fragment. No ⇒ decide whether or not to start a new fragment.
- Cold code is interpreted (slow); hot code is optimized & run (fast).
- Hot must make up for cold, if the system is to be profitable.

How does it work?

Start of trace ⇒ target of a backward branch (a loop header), or ⇒ target of a fragment-cache exit branch.

If the start-of-trace condition holds, bump the trace's counter. If the counter exceeds some threshold value (50), move into the code generation & optimization phase to convert the code into an optimized fragment in the fragment cache. Otherwise, go back to interpreting.

The counter forces a trace to be hot before Dynamo spends effort to improve it.
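Both conditions are cheap tests, which matters because they run on every interpreted branch. A sketch, with hypothetical names; the backward-branch test here uses address order as a stand-in for Dynamo's actual loop-header heuristic:

```java
// Sketch of the start-of-trace test; both conditions are cheap checks.
class TraceHeuristics {
    // A backward branch usually targets a loop header; an exit from the
    // fragment cache marks code that follows an already-hot region.
    static boolean isStartOfTrace(int branchPc, int targetPc,
                                  boolean cameFromFragmentExit) {
        boolean backwardBranch = targetPc <= branchPc;   // likely loop header
        return backwardBranch || cameFromFragmentExit;
    }
}
```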

How does it work?

End of trace ⇒ a taken backward branch (the bottom of a loop), or ⇒ a fragment-cache entry label.

To build an optimized fragment:
- Interpret each operation & generate low-level IR code.
- Encounter a branch? If the end-of-trace condition holds, create & optimize the new fragment; emit the fragment, link it to other fragments, & free the counter. Otherwise, keep interpreting.
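The record-and-compile mode might look like the following sketch. The Op and Fragment types and every helper are hypothetical; the loop structure (interpret + generate IR until the end-of-trace condition holds, then optimize & emit) is what the slide describes.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the record-and-compile mode (hypothetical Op/Fragment types).
class TraceRecorder {
    interface Op { boolean isTakenBranch(); }
    interface Fragment { int execute(); }

    Fragment recordAndOptimizeTrace(int startPc) {
        List<Op> trace = new ArrayList<>();
        int pc = startPc;
        while (true) {
            Op op = decode(pc);
            trace.add(op);                    // emit low-level IR as we interpret
            pc = interpret(op);
            if (op.isTakenBranch() && isEndOfTrace(pc, startPc)) break;
        }
        optimizeLocally(trace);               // two linear passes (later slides)
        return emitLinkAndStub(trace);        // native code + exit stubs
    }

    // End of trace: a taken backward branch (back to the trace head)
    // or a branch into an existing fragment-cache entry.
    boolean isEndOfTrace(int nextPc, int startPc) {
        return nextPc == startPc || inFragmentCache(nextPc);
    }

    // Stubs standing in for the interpreter and code generator.
    Op decode(int pc) { return () -> true; }
    int interpret(Op op) { return 0; }
    void optimizeLocally(List<Op> trace) { }
    Fragment emitLinkAndStub(List<Op> trace) { return () -> -1; }
    boolean inFragmentCache(int pc) { return false; }
}
```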

How does it work?

Moving the code fragment into the cache: the end-of-fragment branch jumps to a stub that jumps back to the interpreter.

Fragment Construction

[CFG diagram: blocks A through J, with a call between G and I and a return after J]

- The profile mechanism identifies A as the start of a hot path.
- After threshold trips through A, the next path is compiled.
- The speculative construction method adds C, D, G, I, J, & E, then the backward branch to A.
- The runtime compiler builds a fragment for ACDGIJE, with exits for B, H, & F, and builds the branch for E → A.

The construction is speculative because it assumes that the 51st execution will take the hot path.

Effects of Fragment Construction

[Diagram: the linear fragment ACDGIJE, entered from the interpreter, with exit stubs F*, B*, & H* that return to the interpreter]

- The hot path is linearized: eliminates branches, creates a superblock, applies local optimization.
- Cold-path branches remain: their targets are stubs that restart the interpreter at the target op.
- The path can include call & return: they become jumps, not branches, so the fragment captures interprocedural effects.
- Indirect branches: speculate on the target; fall back on a hash table of branch targets.
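The last point deserves a sketch. Dynamo emits this logic as machine code; the Java below, with hypothetical names throughout, shows the shape of the guard-plus-fallback idea, similar in spirit to an inline cache:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: a linearized fragment handles an indirect branch by guarding
// on the hot target observed while recording, else falling back to a
// hash-table lookup, else exiting to the interpreter.
class IndirectBranchSketch {
    interface Fragment { int execute(); }

    final int speculatedTarget;                 // hot target seen while recording
    final Fragment onTracePath;                 // rest of the linearized trace
    final Map<Integer, Fragment> targetTable = new HashMap<>();

    IndirectBranchSketch(int speculatedTarget, Fragment onTracePath) {
        this.speculatedTarget = speculatedTarget;
        this.onTracePath = onTracePath;
    }

    int dispatch(int actualTarget) {
        if (actualTarget == speculatedTarget) {
            return onTracePath.execute();             // stay on the fast trace
        }
        Fragment f = targetTable.get(actualTarget);   // fall back: hash table
        return (f != null) ? f.execute()
                           : exitToInterpreter(actualTarget);
    }

    int exitToInterpreter(int pc) { return pc; }      // stub
}
```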

Sources of Improvement

Many small things contribute to make Dynamo profitable:
- Linearization eliminates branches & improves TLB & I-cache behavior.
- Two passes of local optimization: redundancy elimination, copy propagation, constant folding, simple strength reduction, loop unrolling, loop-invariant code motion, & redundant-load removal; one forward pass, one backward pass.
- Linear code with premature exits: Dynamo appears to split traces at an intermediate entry point. Fragment linking should make execution fast, while splitting stops code motion across the intermediate entry point.
- Keep in mind that "local" in a trace captures interprocedural effects.
- Engineering detail makes a difference: fragment cache management sparked lots of work in the noughts.

Redundant Computations

Some on-trace redundancies are easily detected. Suppose the trace defines some register, say r3. That definition may be partially dead:
- Live on exit but not in the trace ⇒ move it to the exit stub.
- Live in the trace but not at an early exit ⇒ move it below the exit.

This implies that we have LIVE information for the code:
- Collect LIVE sets during the backward pass.
- Move partially dead definitions during the forward pass.
- Store a summary LIVE set for each fragment; that allows interfragment optimization.

Can we know this? Only if the exit is to a fragment rather than to the interpreter. Otherwise, we must assume that the definition is LIVE on each exit.
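The backward LIVE pass over a linear trace is simple enough to sketch. The Op type is hypothetical; the pessimistic assumption for exits that leave the fragment cache is modeled by merging a conservative liveAtExit set at every side exit:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the backward LIVE pass over a linear trace.
// LIVE before an op = (LIVE after it - defs) ∪ uses; at each exit to the
// interpreter we must assume everything the program names is live.
class TraceLiveness {
    static class Op {
        Set<String> uses = new HashSet<>();
        Set<String> defs = new HashSet<>();
        boolean hasSideExit;                 // branch out of the trace?
    }

    // Returns LIVE-in for each op; liveAtExit is the conservative set
    // assumed on exits whose target is not a known fragment.
    static List<Set<String>> computeLive(List<Op> trace, Set<String> liveAtExit) {
        List<Set<String>> liveIn = new ArrayList<>(trace.size());
        for (int i = 0; i < trace.size(); i++) liveIn.add(new HashSet<>());

        Set<String> live = new HashSet<>(liveAtExit);    // after the last op
        for (int i = trace.size() - 1; i >= 0; i--) {
            Op op = trace.get(i);
            if (op.hasSideExit) live.addAll(liveAtExit); // merge the exit's LIVE set
            live.removeAll(op.defs);
            live.addAll(op.uses);
            liveIn.get(i).addAll(live);
        }
        return liveIn;
    }
}
```

The forward pass would then use these sets to move partially dead definitions, as the slide describes.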

Fragment Linking

What happens if another path becomes hot? Say ABDGIJE: block B meets the start-of-trace condition, because it is the target of an exit from the existing fragment.

[Diagrams: the original CFG and the speculative trace ACDGIJE with exit stubs F*, B*, & H*]

Fragment Linking

When the counter in B reaches hot, Dynamo builds a fragment.

[Diagram: fragment ACDGIJE with exit stubs F*, B*, & H*, entered from the interpreter]

Fragment Linking

When the counter in B reaches hot, Dynamo builds a second fragment, BDGIJE, with its own exit stubs.

[Diagram: fragments ACDGIJE (stubs F*, B*, & H*) and BDGIJE (stubs F*, A*, & H*)]

Fragment Linking

When the counter reaches hot, Dynamo:
- Builds the fragment
- Links the exit A → B to the new fragment
- Links the exit E → A to the old fragment

[Diagram: the two fragments with the A → B and E → A exits linked directly, bypassing the interpreter]

Fragment Linking

When the counter reaches hot, Dynamo builds the fragment, links the exit A → B to the new fragment, & links the exit E → A to the old fragment.

What if stub B* held a redundant op? We have LIVE on entry to B, so we can test the LIVE sets for both the exit from A & the entry to B; the test may show that the op is dead.

[Diagram: the linked fragments ACDGIJE & BDGIJE, each with remaining exit stubs F* & H*]
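A sketch of that test, assuming the summary LIVE sets described two slides back are available for both fragments (all names hypothetical):

```java
import java.util.Set;

// Once A's exit is linked directly to fragment B, a definition sitting in
// the exit stub is dead if neither the exit path from A nor fragment B
// reads it (sketch; in practice only the linked-exit case can prove this).
class LinkTimeDeadness {
    static boolean stubDefIsDead(String reg,
                                 Set<String> liveAtExitFromA,
                                 Set<String> liveOnEntryToB) {
        return !liveAtExitFromA.contains(reg)
            && !liveOnEntryToB.contains(reg);
    }
}
```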

Results

They measured performance on codes from SPEC95.

[Graphic from the Ars Technica report on Dynamo: http://www.arstechnica.com/reviews/1q00/dynamo/dynamo-1.html]

Now, recall that Dynamo operated in competition with the bare hardware, the lowest-overhead situation that we can imagine. What about a Java JIT, where the JVM has much higher overhead?

JITs: The Java Hotspot™ Server Compiler

Java's execution model is defined by the Java Virtual Machine (JVM):
- Code is compiled into bytecode for the JVM.
- At runtime, the JVM interprets the bytecode.
- Advantages for portability & security; a disadvantage for execution speed.

[Diagram of the JVM: the class loader feeding bytecodes into the method area & heap; per-thread PC registers, JVM stacks, & native stacks; the execution engine; the native method interface & native method libraries]

JITs: The Java Hotspot™ Server Compiler

Java's execution model is defined by the Java Virtual Machine (JVM). The classic example of a Java JIT is the Hotspot Server Compiler. A JIT adds another execution mode: native methods for user code.

[Diagram: the same JVM, now with a JIT beside the execution engine & a code cache holding native code for compiled methods]

Hotspot

How does all of this fit together?
- The execution engine identifies a hot method, typically using a pre-set threshold (the Hotspot default is 10,000).
- The execution engine invokes the JIT to compile the hot method.
- The JIT compiles bytecode into native code that works within the JVM's structures.
- The JIT & execution engine arrange to link the new code into the running program.
- The class loader triggers selective de-optimization.

Runtime compilation places a premium on JIT efficiency. What does the JIT do?
- Translation to its internal IR
- Fast optimization, with emphasis on fast, which often means local
- Translation to native code & linking to other native code

Much of the speedup comes from eliminating the cost of interpretation.
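The hot-method test itself is just counter arithmetic. A sketch in the style the slides describe (the 10,000 default matches the slide; the field and method names are hypothetical, not HotSpot's actual internals):

```java
// Sketch of counter-based hot-method detection. A later slide notes that
// Hotspot counts both method entries and backward branches and compares
// their sum against the threshold.
class MethodProfile {
    static final int COMPILE_THRESHOLD = 10_000;

    int invocationCount;   // bumped at method entry
    int backedgeCount;     // bumped on backward branches (loops)
    boolean compileRequested;

    // Called by the interpreter on entry & on each backward branch.
    void onEntry()    { invocationCount++; maybeCompile(); }
    void onBackedge() { backedgeCount++;   maybeCompile(); }

    private void maybeCompile() {
        if (!compileRequested
                && invocationCount + backedgeCount > COMPILE_THRESHOLD) {
            compileRequested = true;      // hand the method to the JIT
        }
    }
}
```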

Java: JVM versus JIT

Interpreted code in the JVM:
- The JVM interprets each bytecode: fetch, decode, & execute in software¹.
- Multiple hardware operations per bytecode, almost certainly sequential.
- Each bytecode is interpreted every time it is encountered.

Code produced by the JIT in the JVM:
- Execution mixes interpreted & compiled code.
- JIT-compiled code runs as native operations: fetch, decode, & execute in hardware; the three phases run in parallel; fewer operations per bytecode.
- JIT-compiled code uses the JVM's runtime structures, so there is little or no translation & efficient switching from JVM to JIT and back.
- A method is translated once.

The slope difference between a Python lab 1 and a Java lab 1 is, in large part, the JIT.

¹ See page 6ff in "The ILOC Virtual Machine" (Lecture 5) on the course web site's Lectures page.
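To make "fetch, decode, & execute in software" concrete, here is a toy stack-machine interpreter loop. The opcode values match the real JVM's iconst_1, iadd, & ireturn, but the loop is a sketch, not HotSpot's interpreter:

```java
// Every bytecode pays a software fetch/decode/dispatch on every execution;
// that per-op overhead is what JIT compilation eliminates.
class ToyInterpreter {
    static final byte ICONST_1 = 0x04;
    static final byte IADD     = 0x60;
    static final byte IRETURN  = (byte) 0xAC;

    static int run(byte[] code) {
        int[] stack = new int[16];
        int pc = 0, sp = 0;
        while (true) {
            byte op = code[pc++];                       // fetch (software)
            switch (op) {                               // decode & dispatch
                case ICONST_1: stack[sp++] = 1; break;  // execute
                case IADD:     sp--; stack[sp - 1] += stack[sp]; break;
                case IRETURN:  return stack[--sp];
                default: throw new IllegalStateException("bad opcode " + op);
            }
        }
    }

    public static void main(String[] args) {
        // Computes 1 + 1 by interpreting four bytecodes.
        System.out.println(run(new byte[]{ ICONST_1, ICONST_1, IADD, IRETURN }));
    }
}
```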

Java: JVM versus JIT

[Plots: execution time in seconds versus lines of ILOC in the input file, for a good lab 1 in Python (which is interpreted) and a good lab 1 in Java. In the marked region of the Java curve, you see the switchover from JVM-interpreted code to native code.]

Hotspot

Hotspot looks very different from Dynamo. It compiles single methods, or inlined chains of methods:
- It identifies performance-critical methods and works backward from them to decide how many levels of methods (in the call graph) it should compile.
- Counters sit at method entry and on backward branches (DOM); when the sum of these counters exceeds the threshold, Hotspot compiles the method and, perhaps, the chain that calls it.

Hotspot parses Java bytecodes into a low-level, graphical IR:
- The first pass identifies basic blocks; the second pass generates the graph.
- Hotspot performs local optimization during this translation.
- The parser identifies loop headers and creates a list of headers.

The IR is a variant of Ferrante, Ottenstein, & Warren's data-flow graphs. See Click & Paleczny, "A Simple Graph-Based Intermediate Representation," or Click's thesis for details.
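The first pass is the classic leader-finding algorithm. A sketch over a simplified, hypothetical Insn type that stands in for real bytecode decoding:

```java
import java.util.List;
import java.util.TreeSet;

// Sketch of the first parsing pass: find basic-block leaders. A leader is
// the method entry, any branch target, & the instruction that follows a
// branch; block i then spans [leader_i, leader_{i+1}).
class BlockFinder {
    static class Insn {
        int pc;            // offset of this instruction
        int nextPc;        // offset of the following instruction
        boolean isBranch;  // transfers control?
        int targetPc;      // branch target, if isBranch
    }

    static TreeSet<Integer> findLeaders(List<Insn> method) {
        TreeSet<Integer> leaders = new TreeSet<>();
        if (!method.isEmpty()) leaders.add(method.get(0).pc);  // entry
        for (Insn i : method) {
            if (i.isBranch) {
                leaders.add(i.targetPc);   // the target starts a block
                leaders.add(i.nextPc);     // so does the fall-through point
            }
        }
        return leaders;
    }
}
```

The second pass would walk the bytecode block by block, building the graph between these leaders.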

Hotspot

Hotspot's optimizer uses classical optimization techniques, adapted to a JIT:
- Full-fledged class hierarchy analysis (CHA) to infer types
- Inlining based on the results of CHA
- Fast-path / slow-path optimization (allocation, instanceof, ...)
- Optimistic constant propagation (Wegman-Zadeck)
- Iterative global value numbering (iterate around loops)
- BURS-based instruction selection (locally optimal)
- Global instruction scheduling (Click's algorithm)
- A sparse Chaitin-Briggs, graph-coloring register allocator
- A final pass of peephole optimization

Global value numbering starts from the list of loop headers; that's where the opportunity lies. GVN includes a bunch of transformations, including cloning & loop peeling, constant propagation, dead-code elimination, & value numbering.

Paleczny, Vick, & Click, "The Java Hotspot Server Compiler," Proceedings of JVM '01, April 2001.
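The hashing core of value numbering is worth sketching: two expressions with the same opcode & the same operand value numbers get the same value number, so the second computation is redundant. This is only the core; Hotspot's GVN, as described above, wraps it in an iteration over the graph with folding, peeling, & dead-code elimination. All names here are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Hash-based value numbering over (opcode, operand-VN) keys.
class ValueNumbering {
    private final Map<String, Integer> table = new HashMap<>();
    private int nextVN = 0;

    int freshVN() { return nextVN++; }   // for leaves: constants, parameters

    // Sketch assumes a commutative opcode, so operand order is
    // canonicalized and a + b hashes to the same entry as b + a.
    int valueNumber(String opcode, int leftVN, int rightVN) {
        String key = opcode + ":" + Math.min(leftVN, rightVN)
                            + ":" + Math.max(leftVN, rightVN);
        return table.computeIfAbsent(key, k -> nextVN++);
    }
}
```

If valueNumber returns a number the compiler has already assigned to some node, the new expression can reuse that node's result instead of recomputing it.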

Hotspot Results

[Figure from the JVM '01 paper: SPECjvm98 (test mode) on IA32™, relative performance of mtrt, jess, compress, db, mpegaudio, jack, & javac under four configurations: no inlining, simple inline, no CHA, & FCS 2.0]

These results are for total execution time, including Hotspot execution.

Dynamo versus Hotspot

Dynamo:
- Input is running machine code; granularity is a runtime trace.
- Performs 2 linear local passes.
- Benefit comes from optimization and from linearization of the trace.
- Threshold is low (50); overhead is low.
- Dynamo is profitable running on the bare hardware (surprising).

Hotspot:
- Input is running Java bytecode; granularity is one method.
- Parses bytecode, builds an IR, performs classical optimizations, uses a BURS selector, applies a graph-coloring allocator.
- Threshold is high (sum > 10,000); overhead is higher than Dynamo's.
- Hotspot is profitable running inside the JVM.

Both systems worked quite well & inspired further work. Their success inspired systems that are in widespread use, from DynamoRIO through JITs for JavaScript (V8), PHP, and many others.