Efficient Runtime Tracking of Allocation Sites in Java

Similar documents
Efficient Runtime Tracking of Allocation Sites in Java

Identifying the Sources of Cache Misses in Java Programs Without Relying on Hardware Counters. Hiroshi Inoue and Toshio Nakatani IBM Research - Tokyo

Continuous Object Access Profiling and Optimizations to Overcome the Memory Wall and Bloat

A Trace-based Java JIT Compiler Retrofitted from a Method-based Compiler

Adaptive Multi-Level Compilation in a Trace-based Java JIT Compiler

Trace-based JIT Compilation

String Deduplication for Java-based Middleware in Virtualized Environments

Probabilistic Calling Context

CGO:U:Auto-tuning the HotSpot JVM

IBM Research - Tokyo 数理 計算科学特論 C プログラミング言語処理系の最先端実装技術. Trace Compilation IBM Corporation

A Preliminary Workload Analysis of SPECjvm2008

Compact and Efficient Strings for Java

Kazunori Ogata, Dai Mikurube, Kiyokuni Kawachiya and Tamiya Onodera

JIT Compilation Policy for Modern Machines

Lu Fang, University of California, Irvine Liang Dou, East China Normal University Harry Xu, University of California, Irvine

Phase-based Adaptive Recompilation in a JVM

Shenandoah: Theory and Practice. Christine Flood Roman Kennke Principal Software Engineers Red Hat

Efficient and Viable Handling of Large Object Traces

SPEC* Java Platform Benchmarks and Their Role in the Java Technology Ecosystem. *Other names and brands may be claimed as the property of others.

Adaptive Optimization using Hardware Performance Monitors. Master Thesis by Mathias Payer

Chronicler: Lightweight Recording to Reproduce Field Failures

NG2C: Pretenuring Garbage Collection with Dynamic Generations for HotSpot Big Data Applications

Towards Parallel, Scalable VM Services

Workload Characterization and Optimization of TPC-H Queries on Apache Spark

Azul Systems, Inc.

Method-Level Phase Behavior in Java Workloads

Cross-Layer Memory Management to Reduce DRAM Power Consumption

Dynamic Selection of Application-Specific Garbage Collectors

Optimising Multicore JVMs. Khaled Alnowaiser

CSc 453 Interpreters & Interpretation

The Potentials and Challenges of Trace Compilation:

Managed runtimes & garbage collection. CSE 6341 Some slides by Kathryn McKinley

A Side-channel Attack on HotSpot Heap Management. Xiaofeng Wu, Kun Suo, Yong Zhao, Jia Rao The University of Texas at Arlington

Towards Garbage Collection Modeling

Managed runtimes & garbage collection

The Z Garbage Collector An Introduction

Just-In-Time Compilers & Runtime Optimizers

Complex, concurrent software. Precision (no false positives) Find real bugs in real executions

Exploiting FIFO Scheduler to Improve Parallel Garbage Collection Performance

Cross-Layer Memory Management for Managed Language Applications

New Algorithms for Static Analysis via Dyck Reachability

Running class Timing on Java HotSpot VM, 1

Trace Compilation. Christian Wimmer September 2009

Efficient Context Sensitivity for Dynamic Analyses via Calling Context Uptrees and Customized Memory Management

ORDER: Object centric DEterministic Replay for Java

2011 Oracle Corporation and Affiliates. Do not re-distribute!

Agenda. CSE P 501 Compilers. Java Implementation Overview. JVM Architecture. JVM Runtime Data Areas (1) JVM Data Types. CSE P 501 Su04 T-1

Exploiting Prolific Types for Memory Management and Optimizations

Replicating Real-Time Garbage Collector

Eliminating Global Interpreter Locks in Ruby through Hardware Transactional Memory

Automatic Object Colocation Based on Read Barriers

Program Calling Context. Motivation

SABLEJIT: A Retargetable Just-In-Time Compiler for a Portable Virtual Machine p. 1

Deriving Code Coverage Information from Profiling Data Recorded for a Trace-based Just-in-time Compiler

Compiler construction 2009

High Performance Managed Languages. Martin Thompson

JAVA PERFORMANCE. PR SW2 S18 Dr. Prähofer DI Leopoldseder

Part I Garbage Collection Modeling: Progress Report

High Performance Managed Languages. Martin Thompson

A Dynamic Evaluation of the Precision of Static Heap Abstractions

Sista: Improving Cog s JIT performance. Clément Béra

Cross-Layer Memory Management for Managed Language Applications

The Z Garbage Collector Low Latency GC for OpenJDK

Elfen Scheduling. Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang

Bugs in Deployed Software

Phosphor: Illuminating Dynamic. Data Flow in Commodity JVMs

The DaCapo Benchmarks: Java Benchmarking Development and Analysis

Experiences with Multi-threading and Dynamic Class Loading in a Java Just-In-Time Compiler

Exploiting the Behavior of Generational Garbage Collector

MicroPhase: An Approach to Proactively Invoking Garbage Collection for Improved Performance

Run Anywhere. The Hardware Platform Perspective. Ben Pollan, AMD Java Labs October 28, 2008

Tolerating Memory Leaks

Z-Rays: Divide Arrays and Conquer Speed and Flexibility

New Java performance developments: compilation and garbage collection

02/03/15. Compile, execute, debugging THE ECLIPSE PLATFORM. Blanks'distribu.on' Ques+ons'with'no'answer' 10" 9" 8" No."of"students"vs."no.

Configuring the Heap and Garbage Collector for Real- Time Programming.

Continuously Measuring Critical Section Pressure with the Free-Lunch Profiler

Myths and Realities: The Performance Impact of Garbage Collection

Java Performance: The Definitive Guide

Pause-Less GC for Improving Java Responsiveness. Charlie Gracie IBM Senior Software charliegracie

Toward Efficient Strong Memory Model Support for the Java Platform via Hybrid Synchronization

Free-Me: A Static Analysis for Automatic Individual Object Reclamation

Swift: A Register-based JIT Compiler for Embedded JVMs

Intelligent Compilation

Advances in Memory Management and Symbol Lookup in pqr

Acknowledgements These slides are based on Kathryn McKinley s slides on garbage collection as well as E Christopher Lewis s slides

Programming Language Implementation

PennBench: A Benchmark Suite for Embedded Java

Java Performance Evaluation through Rigorous Replay Compilation

VIProf: A Vertically Integrated Full-System Profiler

NumaGiC: a Garbage Collector for Big Data on Big NUMA Machines

Defining a High-Level Programming Model for Emerging NVRAM Technologies

CSE P 501 Compilers. Java Implementation JVMs, JITs &c Hal Perkins Winter /11/ Hal Perkins & UW CSE V-1

NUMA in High-Level Languages. Patrick Siegler Non-Uniform Memory Architectures Hasso-Plattner-Institut

Cachetor: Detecting Cacheable Data to Remove Bloat

Shenandoah An ultra-low pause time Garbage Collector for OpenJDK. Christine H. Flood Roman Kennke

From Java Code to Java Heap Understanding the Memory Usage of Your Application

<Insert Picture Here> Maxine: A JVM Written in Java

Untyped Memory in the Java Virtual Machine

The Z Garbage Collector Scalable Low-Latency GC in JDK 11

Transcription:

Efficient Runtime Tracking of Allocation Sites in Java Rei Odaira, Kazunori Ogata, Kiyokuni Kawachiya, Tamiya Onodera, Toshio Nakatani IBM Research - Tokyo

Why Do You Need Allocation Site Information? How to fix a memory leak after two weeks of execution? > java com.example.server started...running for one week...running for two weeks server crashed due to java.lang.outofmemoryerror! > 2

Allocation Site Tracker Helps! Tracker tells you where each object was allocated. Bytes Class Allocation site (Method name and bytecode index) 900,278,800 String com.example.property.putproperty()#191 98,148,020 LinkedList com.example.datatable.putinteger()#187 20,352,384 String com.example.property.prepare()#35 void putproperty(element elem) { String attribute = new String(elem.getString()); } Allocation site is a good starting point for fixing the leak. Also, optimizations in JVM can benefit from the tracker. 3

Tracker Should Be Always Enabled. For fixing memory leaks > java com.example.server started...running for one week...running for two weeks server crashed due to java.lang.outofmemoryerror! > java enabletracker com.example.server started...running for one week...running for two weeks should have always enabled the tracker. And also for JVM optimizations. 4

Challenge: Performance Overhead Adding allocation site information to each object hits performance [Hauswirth et al., 2004]. Reducing effective CPU cache size, increasing GC frequency and overhead, etc. Java object layout Header Instance fields Allocation site info 5 Performance overhead (%) 16 14 12 10 8 6 4 2 0-2 compiler.sunflow compiler.compiler IBM Java 6 SR3 64-bit compressed pointer / Linux 2.6.18 / 8x x86-64 Intel Xeon 1.86GHz compress derby mpegaudio Not good for production environments serial sunflow xml.transform xml.validation antlr chart eclipse fop hsqldb luindex lusearch pmd SPECjbb2005 Geo. mean

Minimal-Overhead Allocation Site Trackers Never increase per-object space. Allocation-Site-as-a-Hash-code (ASH) Tracker Performance overhead: ~0% on average, 1.4% at maximum. Some JVMs do not always have a hash code field. Allocation-Site-via-a-Class-pointer (ASC) Tracker Performance overhead: 1% on average, 2% at maximum. Java object layout Class pointer Hash code Flags Instance fields Almost all JVMs have a class pointer field. 6

Outline Introduction ASH Tracker ASC Tracker Experiments Applications of ASH/ASC Trackers Conclusions 7

Allocation-Site-as-a-Hash-code (ASH) Tracker Embed an allocation-site ID into a hash code field. Embed at allocation time. ID is a unique integer of the site. Assigned by an interpreter or a JIT compiler. Original layout Class pointer Hash code Flags w/ ASH Tracker Class pointer Flags Instance fields Instance fields Object-specific random value, used as a hash table index, for example. Allocation site ID 8

How to Deal with Hash Code Collisions? Hash code values should be as distinct as possible from all others [Java API Spec]. Collisions slow down hash-table access, for example. How about appending a site ID field when hash code is first referred to? Some programs often refer to hash code. Allocation time First time hash code is referred to. Site ID Hash code Site ID 9

Collision Avoidance without Increasing Space Our solution: Embed a shorter dynamic ID and a random value. Hash code value = dynamic ID + random value Random value helps avoid collisions. Allocation time First time hash code is referred to. Object X Object X Object Y Other objects are not affected. 1111010111 0001100101 1111010111 Static ID Dynamic ID Hash code value Object-specific random value 10

How to Deal with Hash Code Collisions? (2 nd Round) All objects allocated at the same site have the same high-order bits in their hash code values. Need a long random-value field to avoid collisions. Hash code field: Dynamic ID Random Need a long ID field to track allocation sites accurately. 11

Variable-Length Dynamic ID Shorter IDs for hot allocation sites. Allocate many objects. Need a long random value. Not so many hot allocation sites in a program. Short site IDs suffice. Longer IDs for cold allocation sites. Allocate few objects. Short random values suffice. Many cold allocations sites in a program. Need long site IDs. for (i = 0; i < BIG_NUMBER; i++) { obj = new Object(); use(obj.hashcode()); } Dyn. ID if (error1) { obj = new Object(); use(obj.hashcode()); }... if (error2) { obj = new Object(); use(obj.hashcode()); } Dynamic ID Random Random 12

Dynamic Shrinking of ID Field It is not known in advance How many objects will be allocated at each site; or How many of their hash code values will be referred to. Our solution: Make the IDs of a site shorter and shorter As more and more hash code values are referred to. Hash code of objects allocated at putproperty()#191 Object 1 Object 2 Object 3 Object 4 Object 5 Object 6 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 0 1 0 1 1 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 0 0 1 0 0 0 1) Assign a long dynamic ID at first. 2) Occasionally, a random value becomes all zero. 3) Assign a new shorter ID. Maintain a mapping table. Dynamic ID Allocation site 000110 putproperty()#191 00101 putproperty()#191 13

Outline Introduction ASH Tracker ASC Tracker Experiments Applications of ASH/ASC Trackers Conclusions 14

What If There Is No Hash Code Field? Some JVMs do not always have a hash code field. Almost all JVMs have a class pointer field. To access class meta data. Class pointer Flags Instance fields Objects allocated at putproperty()#191 Class pointer Flags Class structure Virtual function table Instance size Other class meta data Class pointer Flags Instance fields Objects allocated at prepare()#35 Class pointer Flags Instance fields Instance fields 15

Allocation-Site-via-a-Class-pointer (ASC) Tracker Replace the class pointer with a pointer to its allocation site structure. This is possible because each allocation site always allocates objects of the same class. Alloc site ptr Objects allocated at putproperty()#191 Alloc site ptr Class pointer putproperty ()#191 Virtual function table Instance size Other class meta data Class pointer prepare() #35 Alloc site ptr Objects allocated at prepare()#35 Alloc site ptr 16

Mitigating Indirection Overhead (1) Duplicate frequently-accessed constant class fields. Need to choose carefully which fields to duplicate. Not to increase cache misses. Not to increase space overhead. Alloc site ptr Class pointer Instance size putproperty ()#191 Class pointer Instance size prepare() #35 Alloc site ptr Alloc site ptr Virtual function table Instance size Other class meta data Alloc site ptr 17

Mitigating Indirection Overhead (2) Profiling-based devirtualization by a JIT compiler. if (object->class_ptr!= HOT_CLASS) goto SlowPath; /* Inlined method, etc. */ Another load needed by ASC Tracker. if (object->alloc_site alloc_site->class_ptr!= HOT_CLASS) goto SlowPath; /* Inlined method, etc. */ Emit an allocation-site equality check where possible. if (object->alloc_site!= HOT_ALLOCATION_SITE) goto SlowPath; /* Inlined method, etc. */ 18

Outline Introduction ASH Tracker ASC Tracker Experiments Applications of ASH/ASC Trackers Conclusions 19

Evaluation IBM Research - Tokyo Environment 64-bit IBM J9/TR JVM 1.6.0 SR3 3-word header (1 word = 32 bits) Generational GC (copying young + mark-and-sweep old GC) 4x minimum Java heap 8-core 1.8GHz Intel Xeon Linux 2.6.18 Benchmarks SPECjvm2008, DaCapo, and SPECjbb2005 ASH Tracker Uses a 15-bit hash code field. Shrinks a dynamic ID field from 11 bits to 2 bits. ASC Tracker Duplicates an instance-size field and an array-element-class field. 20

Performance Overhead ASH Tracker: on average ~0% and at most 1.4% overhead. Up to 1.73x hash code collisions compared with the baseline. ASC Tracker: on average 1.0% and at most 2.0% overhead. 21 16 14 12 10 8 6 4 2 0-2 Per-object 4-byte field ASH Tracker ASC Tracker compiler.compiler compiler.sunflow compress derby mpegaudio serial sunflow xml.transform xml.validation antlr chart eclipse fop hsqldb luindex lusearch pmd SPECjbb2005 Geo. mean Performance overhead (%)

Performance Overhead ASH Tracker: on average ~0% and at most 1.4% overhead. Up to 1.73x hash code collisions compared with the baseline. ASC Tracker: on average 1.0% and at most 2.0% overhead. 22 Performance overhead (%) 16 14 12 10 8 6 4 2 0-2 compiler.compiler compiler.sunflow compress derby mpegaudio serial sunflow xml.transform Per-object 4-byte field ASH Tracker ASC Tracker Serial and pmd refer to the hash code of: more than 20% of allocated objects, and 5-11% of average live objects. xml.validation antlr chart eclipse fop hsqldb luindex lusearch pmd SPECjbb2005 Geo. mean

Outline Introduction ASH Tracker ASC Tracker Experiments Applications of ASH/ASC Trackers Conclusions 23

Minimal-Overhead Memory Leak Detector Scan the entire Java heap at each global GC time. Count the numbers of live objects for each allocation site. Save the numbers in per-allocation-site histories. (Inform users of possible memory leaks.) Java heap obj obj obj obj obj obj obj obj obj obj obj Per-allocation-site counters 6 3 2 24 #live objects history Allocation site 6, 5, 4, Property.putProperty()#191 3, 3, 3, DataTable.putInteger()#187 2, 1, 1, Property.prepare()#35

Performance Overhead of the Leak Detector ASH Tracker: on average 0.3% and at most 1.7% overhead ASC Tracker: on average 1.1% and at most 2.4% overhead 3 2.5 2 1.5 1 0.5 0-0.5-1 -1.5-2 ASH Tracker ASC Tracker compiler.compiler compiler.sunflow compress derby mpegaudio serial sunflow xml.transform xml.validation antlr chart eclipse fop hsqldb luindex lusearch pmd Performance overhead (%) SPECjbb2005 Geo. mean 25

Allocation-Site-Based Object Pretenuring for Generational GC Refer to our paper for more details. 4x min Java heap: at maximum 11% speed-up 2x min Java heap: at maximum 15% speed-up 22 20 18 16 14 12 10 8 6 4 2x min Java heap 4x min Java heap Speed-up (%) -2 02-4 compiler.compiler compiler.sunflow compress derby mpegaudio serial sunflow xml.transform xml.validation antlr chart eclipse fop hsqldb luindex lusearch pmd SPECjbb2005 Geo. mean 26

Related Work Bit-Encoding Leak Location (Bell) [Bond, et al., ASPLOS 2006] Probabilistic allocation site tracker Requires a sufficient number of samples. Cannot identify the allocation site of each object. Not suitable for JVM optimizations. Techniques for object header compression [Bacon, et al., ECOOP 2002] Remove the hash code field. Use ASC Tracker. Steal several bits of the class pointer. Steal those bits of the allocation site pointer. 27

Conclusion Minimal-overhead allocation site trackers ASH Tracker Embeds an allocation site ID into the hash code field. Performance overhead: ~0% on average, 1.4% at maximum. ASC Tracker Makes the class pointer field point to an allocation site structure. Performance overhead: 1% on average, 2% at maximum. Useful for both reliability and optimization. Reliability: minimal-overhead memory leak detector. Optimization: allocation-site-based object pretenuring. 28

Thank you! Questions? 29

Back-up 30

Space Overhead Compared with Physical Memory Usage ASH Tracker: on average 0.4% overhead ASC Tracker: on average 0.2% overhead 10 9 8 7 6 5 4 3 2 1 0 Per-object 4-byte field ASH Tracker ASC Tracker Space overhead (%) compiler.compiler compiler.sunflow compress derby mpegaudio serial sunflow xml.transform xml.validation antlr chart eclipse fop hsqldb luindex lusearch pmd SPECjbb2005 Eclipse/manual Geo. mean 31

Using ASH Tracker for Object Pretenuring Copy likely-to-be-long-lived objects directly to a tenured space. Not copying such objects multiple times within a young space. 1. Online profiling Compute the ratio of tenured objects (#t/#s) for each allocation site at young GC time. 2. Pretenuring Enable pretenuring for an allocation site if its ratio exceeds a certain threshold. Young space Young space Allocation space Object #s Survivor space Allocation space Object Survivor space Tenured space #t Tenured space 32