Stack Based Allocation in the Azul JVM
Dr. Cliff Click, cliffc@azulsystems.com
2005 Azul Systems, Inc.

Background
- The Azul JVM is based on Sun HotSpot, a state-of-the-art Java VM
- Java is a GC'd language
- HotSpot uses a fast generational GC with test-and-bump allocation
- Test-and-bump allocation streams through memory
- Streaming through memory is hard on caches
- Azul is looking at using Stack Based Allocation

Cost of using a Generational GC
- Typical allocation is a pointer compare & bump
- Deallocation is free: collection only requires scanning the live set, which is much smaller than what was allocated
- Allocation streams through the young generation, similar to streaming array access in Fortran
- Young generation >> L1 cache size, which implies caches are flushed repeatedly
- The cache-miss cost is spread out and hard to spot
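A minimal sketch of the pointer-compare-and-bump allocation described above, assuming a hypothetical buffer class with top and end fields; the names and the slow-path call are illustrative, not HotSpot's actual code:

    // Hypothetical model of test-and-bump allocation in a thread-local
    // allocation buffer: compare the bump pointer against the limit,
    // then advance it. Names (AllocBuffer, slowPathAllocate) are illustrative.
    final class AllocBuffer {
        private long top;   // next free address in the buffer
        private long end;   // one past the last usable address

        AllocBuffer(long start, long end) { this.top = start; this.end = end; }

        long allocate(int size) {
            long obj = top;
            if (obj + size > end) {            // the "test"
                return slowPathAllocate(size); // refill buffer or trigger GC
            }
            top = obj + size;                  // the "bump"
            return obj;                        // address of the new object
        }

        private long slowPathAllocate(int size) {
            throw new UnsupportedOperationException("refill/GC not modeled");
        }
    }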

Cost of using a Generational GC
- Allocation streams through the young generation
- Cache miss for each new object; can cover it with prefetch
- The data read in is dead: the object must be initialized anyway, so the read can be prevented with cache-line-zero ops
- Must evict a line for each new object; the evicted data is mostly dead, with the occasional live object
- Result: wasted reads AND wasted writes

[Figure sequence: Generational GC memory usage over time. The allocation pointer streams from the old, cold, mostly dead heap through the fresh, hot cached heap into free, cold uncached memory; each step reads dead memory to fill L1 cache lines and writes back mostly dead cache lines.]

Stack Based Allocation
- Objects are allocated on the stack; stack space is reserved on function entry
- Deallocation is free on function return
- Has direct costs similar to a generational GC, and is orthogonal to it
- Recycles memory instead of streaming through it
- Smaller cache footprint: fewer misses and stalls
- Requires knowledge about object lifetime: it is an error if an object escapes the function return
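For illustration, a hedged Java example of the lifetime distinction this relies on; the class names are made up. The first method's temporary never leaves the frame, so it is a candidate for stack allocation, while the second returns its allocation, so the object escapes and must live in the heap (or be fixed up later):

    final class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    final class Geometry {
        // 'p' never leaves this frame: stack lifetime, eligible for stack allocation.
        static double distance(double x, double y) {
            Point p = new Point(x, y);
            return Math.sqrt(p.x * p.x + p.y * p.y);
        }

        // The allocation is returned to the caller: it escapes the frame.
        static Point makePoint(double x, double y) {
            return new Point(x, y);
        }
    }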

Stack Based Allocation: worked example

The walkthrough uses three functions:

    Sam { a = new; b = new; Ted(); bad = Uli(); bad.crash; }
    Ted { c = new; return; }
    Uli { d = new; return d; }

[Figure sequence: each step shows the stack frames, the stack pointer (SP), the free stack above it, and the L1 cache contents.]
- Call Sam
- Allocate object a
- Allocate object b
- Call Ted
- Allocate object c
- Exit Ted, free c
- Call Uli
- Allocate object d
- Exit Uli, return d

Stack Based Allocation
- Fast allocate and free, better cache behavior
- Only works for objects with stack lifetime
- How do we know object lifetime?
  - Escape Analysis: precise, conservative, expensive at compile time
  - Escape Detection: imprecise, optimistic, expensive at run time (requires a write barrier)

Escape Detection
- Escape Analysis
  - Conservatively correct: only objects known to never escape
  - Allows objects to be en-registered
  - Fairly expensive (for a JIT) except in trivial cases
- Escape Detection
  - Optimistic; some objects may escape
  - Requires a complex write barrier and fix-up
  - The write barrier can be done in hardware, requiring a minor hardware change
  - Azul is implementing a Java VM with custom hardware
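A small, hypothetical Java illustration of the difference: escape analysis can prove the first allocation never escapes (and could even en-register its fields), but must be conservative about the second because escape depends on a runtime condition; optimistic escape detection would stack-allocate both and rely on the write barrier to catch the rare escape:

    final class Box { int value; Box(int v) { value = v; } }

    final class EscapeExamples {
        static Box sink;   // a global the object can escape into

        // Provably non-escaping: escape analysis wins here.
        static int provablyLocal(int v) {
            Box b = new Box(v);
            return b.value + 1;
        }

        // Escape depends on data: analysis must be conservative,
        // detection can stack-allocate and fix up when 'publish' is true.
        static int maybeEscapes(int v, boolean publish) {
            Box b = new Box(v);
            if (publish) {
                sink = b;      // the store the write barrier would catch
            }
            return b.value;
        }
    }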

Escape Detection Hardware
- Allocate objects in stack frames
- Steal some address bits for a frame depth ID (FID); the FID bits are stored in the object pointer
- Hardware ignores the FID on a load (getfield)
- Hardware does a range check on a store (putfield)
  - Storing an old-FID value into a new-FID object is OK
  - Storing a new-FID value into an old-FID object TRAPS
- A trap requires expensive fixup: crawl only the new frames on this thread's stack, move the object, adjust FIDs
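A software model of the putfield range check, purely for illustration; in the real design the FID lives in spare address bits of the pointer and the check is done by hardware. In this sketch a larger FID means a deeper, younger frame, and FID 0 stands for the heap:

    // Hypothetical model of the FID range check on a store (putfield).
    // target.field = value is only safe if the value lives at least as
    // long as the target, i.e. the value's frame is not deeper (younger).
    final class FidBarrier {
        static final int HEAP_FID = 0;

        static void checkStore(int targetFid, int valueFid) {
            // Loads (getfield) ignore FIDs entirely.
            if (valueFid > targetFid) {
                // Storing a new-FID object into an old-FID object: escape.
                trap();   // fixup: crawl newer frames, move object, adjust FIDs
            }
        }

        private static void trap() {
            throw new IllegalStateException("escape detected; run fixup");
        }
    }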

Fixup is Expensive
- Fixup must be rare, yet object factories always escape their object!
- So: tag allocation sites with a relative allocation depth
  - Allocate the object in a frame with a longer lifetime, or directly in the heap if needed
  - Start small; adjust the tag upwards with each failure
- Rapidly converges in practice, with no expensive analysis
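One way to read the tagging scheme, sketched in Java; the class, field names, and the specific cut-over to heap allocation are invented. Each allocation site remembers how many frames up the stack its objects should be placed, starting at the current frame and moving toward longer-lived frames (or the heap) after each escape trap:

    // Hypothetical per-allocation-site depth tag. 0 = allocate in the
    // current frame; larger values move the allocation into caller frames;
    // past an arbitrary limit we give up and allocate directly in the heap.
    final class AllocationSite {
        private int depthTag = 0;            // start small
        private static final int HEAP = Integer.MAX_VALUE;

        // Called from the escape-trap fixup path: this site escaped again.
        void onEscapeTrap() {
            depthTag = (depthTag >= 3) ? HEAP : depthTag + 1;  // adjust upwards
        }

        // Pick the frame (by depth on the current stack) to allocate in.
        int targetFrame(int currentDepth) {
            if (depthTag == HEAP || depthTag >= currentDepth) {
                return -1;                    // -1 = allocate in the heap
            }
            return currentDepth - depthTag;   // an older, longer-lived frame
        }
    }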

How Big are Frames?
- Check for frame overflow on allocation; same cost as test-and-bump allocation
- If it fits, fine. If not...
  - Allocate an overflow area on the side and stamp the same FID into those objects (the hardware checks the FID, not the actual stack address)
  - Adjust the call prolog code to make a larger frame for next time
- Rapidly converges in practice!
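A sketch of the overflow handling as described here, with invented names and an illustrative bit position for the FID: the fast path is the same test-and-bump as heap allocation, the slow path uses a side overflow area stamped with the same FID, and the prolog learns a bigger frame size for next time:

    // Hypothetical frame-local allocation with an overflow area.
    final class FrameAllocator {
        private long top, end;        // reserved allocation area in this frame
        private final int fid;        // frame ID stamped into objects
        private long overflowNeeded;  // feedback for the next prolog

        FrameAllocator(long start, long end, int fid) {
            this.top = start; this.end = end; this.fid = fid;
        }

        long allocate(int size) {
            long obj = top;
            if (obj + size <= end) {       // same check as test-and-bump
                top = obj + size;
                return stamp(obj);
            }
            overflowNeeded += size;        // grow the frame next time around
            long side = allocateOverflowArea(size);
            return stamp(side);            // same FID: hardware checks FID, not address
        }

        private long stamp(long addr) {
            return addr | ((long) fid << 48);   // spare high bits; position illustrative
        }

        private long allocateOverflowArea(int size) {
            throw new UnsupportedOperationException("side area not modeled");
        }
    }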

Never-Exit Frames
- Some top-level frames never exit
- Also loops, and the JIT may inline
- But they will accumulate dead objects, since objects are only freed on a frame exit!
- Periodically need a thread-local GC
  - Toss out dead objects
  - Compact overflow areas
  - Can adjust frame sizes

Inlining helps Site Tagging
- Allocation-site behavior may depend on the call path: the same site may want different depth tags depending on the caller
- Hot code gets compiled, with inlining
- Inlining clones allocation sites, so each inlined clone gets its own depth tag
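A hedged illustration of why inlining helps (the classes and methods are made up): one factory method serves two call paths whose objects have very different lifetimes, so the single allocation site wants two different depth tags; once the JIT inlines the factory into each caller, each clone of the allocation can carry its own tag:

    final class Pair { Object a, b; }

    final class SiteCloning {
        static Pair longLived = new Pair();

        // One bytecode-level allocation site, two very different lifetimes.
        static Pair newPair() { return new Pair(); }

        static int useLocally() {
            Pair p = newPair();       // after inlining: a clone that never escapes
            return System.identityHashCode(p);
        }

        static void publish() {
            longLived = newPair();    // after inlining: a clone that always escapes
        }
    }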

Reverse: Tags Guide Inlining
- Depth tags give the empirical lifetime of the objects allocated at a site
- Likely we can prove actual = empirical with Escape Analysis
  - Inline up to the tagged depth, then run Escape Analysis on the inlined code
  - E.A. is cheap within a single compilation unit
- E.A., if successful, is better than Detection: it can en-register whole objects
- Avoid E.A. if the allocation site escapes

Preliminary Results
- Modest-sized runs of SpecJVM98, SpecJBB, SJAS
- About half of all objects stack allocated
- With 4 MB overflow areas, a typical stack GC is needed every 6-8 sec and takes <1 msec
- Some anomalies observed
  - Lots of stack-allocated 2K and 4K arrays; these may be faster to heap allocate because of Azul's fast zeroing hardware
  - Frames forced too large too quickly when the JIT inlines allocation into loops

Software Escape Barrier
- Azul has precise per-frame E.D. hardware, but it is possible to build a software E.D. barrier
- Compare raw stack addresses & trap; requires all stacks to be on the same side of the heap
- Store the FID just prior to the object header in the stack
- The trap handler checks the FID before declaring an escape: load, load, compare, branch
- The trap can fire repeatedly for non-escaping stores, e.g. two objects in the same frame but in the wrong order
- No good overflow solution
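A rough Java model of the software barrier described above; everything here is illustrative, addresses are plain long values since Java cannot take raw addresses, and the direction of the address compare simply assumes younger frames sit at higher addresses in this model. The fast path compares stack addresses, and only a suspected escape loads the FID word kept just before the object header to confirm before trapping:

    // Hypothetical software escape-detection barrier.
    // Assumes all thread stacks sit on the same side of the heap, so a single
    // address compare can tell "certainly older" from "possibly escaping".
    final class SoftwareBarrier {
        static void storeBarrier(long targetAddr, int targetFid,
                                 long valueAddr, int valueFid) {
            // Fast path: load, load, compare, branch.
            if (valueAddr > targetAddr) {          // value may be in a younger frame
                // Slow check: consult the FID stored just before the object header.
                if (valueFid > targetFid) {
                    trap();                        // genuine escape: run fixup
                }
                // Otherwise a false alarm, e.g. two objects in the same frame
                // but allocated in the "wrong" order; this can repeat.
            }
        }

        private static void trap() {
            throw new IllegalStateException("escape detected; run fixup");
        }
    }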

Summary
- First GC-specific hardware in a long time
- Looks good on paper; looks good in preliminary results
- The Real Thing is a little ways off yet
  - Needs tuning
  - Needs performance proof in large runs