Identifying the Sources of Cache Misses in Java Programs Without Relying on Hardware Counters. Hiroshi Inoue and Toshio Nakatani IBM Research - Tokyo


Identifying the Sources of Cache Misses in Java Programs Without Relying on Hardware Counters. Hiroshi Inoue and Toshio Nakatani, IBM Research - Tokyo. June 15, 2012, ISMM 2012, Beijing, China.

Motivation and Goal

Cache miss information from the Hardware Performance Monitor (HPM) is useful for runtime optimizations:
- Prefetch injection (Adl-Tabatabai et al., 2004)
- Object placement optimization (Schneider et al., 2007), etc.

However, the HPM is difficult to use in the real world:
- It may require a special device driver and root privilege
- Only one process can use the hardware at a time

Our goal: enable runtime optimization by identifying the sources of cache misses without using the HPM.

Presentation Overview
- Motivation and Goal
- Analysis
  - Key Observation
  - Our Technique
  - Evaluation of Coverages
- Optimization
  - Object Alignment and Collocation Optimizations
  - Our Techniques
  - Performance Evaluation
- Summary

Key Observation

Many cache misses in Java programs are caused by a simple idiomatic code pattern: load a reference and touch the referenced object in a hot loop. We can heuristically identify the instructions and classes that may cause frequent cache misses by matching hot loops against this idiomatic pattern.

A Simple Basic Code Pattern Tends to Cause Frequent Cache Misses

Within a hot loop detected by software-based profiling:

    ClassA obja;
    while (!end) {
        obja = ...;        // load a reference from a field of an object,
                           // or from the return value of a method call
        access to obja;    // this access tends to cause frequent cache misses
    }

The access to the referenced object is a field access (load/store) or a metadata access (checkcast, monitorenter).
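As an illustration of this idiom (not the authors' code; the Order and Customer classes are invented for this sketch), a loop that loads a reference field and then touches the separately allocated object it points to:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical classes for illustration: each Order holds a reference
// to a separately allocated Customer object.
class Customer {
    int discount;
    Customer(int discount) { this.discount = discount; }
}

class Order {
    Customer customer;   // reference field
    int amount;
    Order(Customer c, int amount) { this.customer = c; this.amount = amount; }
}

public class PointerChasing {
    // The idiomatic pattern: in a hot loop, load a reference from a field
    // (order.customer) and then access the referenced object
    // (c.discount). The second access lands on a different object, so it
    // tends to miss in the cache when the objects are scattered in the heap.
    static int totalDiscount(List<Order> orders) {
        int total = 0;
        for (Order order : orders) {
            Customer c = order.customer;  // 1) load a reference (loop variant)
            total += c.discount;          // 2) touch the referenced object
        }
        return total;
    }

    public static void main(String[] args) {
        List<Order> orders = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            orders.add(new Order(new Customer(i % 10), i));
        }
        System.out.println(totalDiscount(orders));
    }
}
```

The access to `c.discount` is the instruction such an analysis would flag, since `c` is reloaded on every iteration.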

An Anti-Pattern That Rarely Causes Cache Misses

Within a hot loop detected by software-based profiling, if the first load is loop invariant:

    ClassA obja;
    while (!end) {
        obja = ...;        // loop-invariant load
        access to obja;    // this access does NOT cause frequent cache misses
    }

The access to the referenced object is again a field access (load/store) or a metadata access (checkcast, monitorenter), but it touches the same object on every iteration.
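A minimal sketch of the anti-pattern (again invented classes, not the paper's code): the reference load is loop invariant, so the same object is touched on every iteration and stays cache-resident after the first access:

```java
// Hypothetical classes for illustration of the loop-invariant anti-pattern.
class Config {
    int scale;
    Config(int scale) { this.scale = scale; }
}

class Worker {
    Config config;
    Worker(Config c) { this.config = c; }

    // this.config refers to the same object on every iteration; after the
    // JIT hoists the load out of the loop, the repeated field accesses hit
    // the same cache line and rarely miss.
    int scaleAll(int[] values) {
        int sum = 0;
        for (int v : values) {
            Config c = this.config;  // loop-invariant load
            sum += v * c.scale;      // touches the same object every time
        }
        return sum;
    }
}

public class AntiPattern {
    public static void main(String[] args) {
        Worker w = new Worker(new Config(2));
        int[] values = {1, 2, 3, 4};
        System.out.println(w.scaleAll(values));  // 20
    }
}
```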

Implementation

We implemented the analysis in the 32-bit IBM J9/TR JVM (Java 6 SR2). We execute the pattern matching in the JIT compiler:
- after applying optimizations including method inlining and loop-invariant code motion
- using execution frequency information obtained by software-based profiling to identify the hot loops
- only for hot methods that are recompiled at higher optimization levels than the initial level

Evaluation

We show the coverages achieved by the instructions identified by our technique for:
- static count of load and store instructions
- L1D cache misses
- L2 cache data misses
- memory accesses

Environment:
- Processor: POWER6 4.0 GHz (64-KB L1D cache, 64-KB L1I cache, 4-MB unified L2 cache, 128-byte cache line for both L1 and L2)
- OS: Red Hat Enterprise Linux 5.2
- Benchmarks: SPECpower_ssj2008, SPECjbb2005, SPECjvm2008, DaCapo-9.12

Coverages by Instructions Selected by Our Technique

[Chart: static count of load/store instructions, L1D misses, and L2 data misses, averaged over all benchmarks, for three categories: selected by our technique, total in hot methods (the upper bound for us), and total in JIT-compiled code; see the paper for detail.]

Compared to the total in JIT-compiled code, our technique selects only 2.8% of the load and store instructions, yet they cover about 50% of the total cache misses.

[Same chart as the previous slide.]

Compared to the total in hot methods, our technique selects 14% of the load and store instructions, and they cover 64% of the L1 misses and 69% of the L2 misses.

Not Only Accessed Frequently

[Chart: memory accesses, L1D misses, and L2 data misses, averaged over all benchmarks, for the same three categories; see the paper for detail.]

The instructions selected by our technique cause about 2x more cache misses per execution than the instructions not selected.

Presentation Overview
- Motivation and Goal
- Analysis
  - Key Observation
  - Our Technique
  - Evaluation of Coverages
- Optimization (this part)
  - Object Alignment and Collocation Optimizations
  - Our Techniques
  - Performance Evaluation
- Summary

Application in Runtime Optimization

We implemented two object placement optimizations to reduce cache misses.

Object alignment: insert padding so that an object's hot fields are aligned to a 128-byte cache line.

Object collocation: place two objects, where one holds a reference to the other, within one 128-byte cache line.

Both techniques can reduce the cache misses per access from two to one.
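The alignment optimization boils down to a small piece of address arithmetic. A minimal sketch (illustrative only, not the J9 allocator code), assuming the 128-byte POWER6 cache line from the evaluation setup:

```java
// Sketch of the padding arithmetic behind the alignment optimization:
// given an object's would-be start address and the cache line size,
// compute how many padding bytes must precede it so the object (and thus
// its hot fields) starts on a cache line boundary.
public class AlignmentMath {
    static final long CACHE_LINE = 128;  // line size used in the paper's setup

    // Bytes of padding needed so that addr + padding is line-aligned.
    static long paddingFor(long addr) {
        long misalignment = addr % CACHE_LINE;
        return (misalignment == 0) ? 0 : CACHE_LINE - misalignment;
    }

    public static void main(String[] args) {
        System.out.println(paddingFor(256));  // already aligned -> 0
        System.out.println(paddingFor(300));  // 300 % 128 = 44 -> 84
    }
}
```

An object aligned this way spans the minimum number of cache lines, so two hot fields that previously straddled a line boundary can be served by a single miss.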

Our Approach for Optimizations

We identify the target classes for optimization based on our analysis in the JIT compiler:
- If two accesses to distinct fields of an object are selected in a hot loop, select the class as a target of the alignment optimization.
- If two accesses to objects, where one has a reference to the other, are selected in a hot loop, select the pair of classes as targets of the collocation optimization.

We apply the optimizations both in the garbage collector (for objects in the tenure space) and at allocation time (for objects in the nursery space).
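The first selection rule above can be sketched as follows. This is a hypothetical encoding, not the paper's implementation; FieldAccess and alignmentTargets are invented names:

```java
import java.util.*;

// One selected access inside a hot loop: the class of the accessed object
// and the field being read or written.
class FieldAccess {
    final String objectClass;
    final String field;
    FieldAccess(String objectClass, String field) {
        this.objectClass = objectClass;
        this.field = field;
    }
}

public class TargetSelector {
    // Rule: two selected accesses to distinct fields of the same class in
    // one hot loop -> that class becomes an alignment target.
    static Set<String> alignmentTargets(List<FieldAccess> loopAccesses) {
        Map<String, Set<String>> fieldsByClass = new HashMap<>();
        for (FieldAccess a : loopAccesses) {
            fieldsByClass.computeIfAbsent(a.objectClass, k -> new HashSet<>())
                         .add(a.field);
        }
        Set<String> targets = new TreeSet<>();  // sorted for stable output
        for (Map.Entry<String, Set<String>> e : fieldsByClass.entrySet()) {
            if (e.getValue().size() >= 2) targets.add(e.getKey());
        }
        return targets;
    }

    public static void main(String[] args) {
        List<FieldAccess> loop = Arrays.asList(
            new FieldAccess("Order", "amount"),
            new FieldAccess("Order", "customer"),
            new FieldAccess("Customer", "discount"));
        System.out.println(alignmentTargets(loop));  // [Order]
    }
}
```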

Pattern for Alignment Optimization

Derived from the basic pattern:

    ClassA obja;
    while (!end) {                    // in a hot loop
        // 1) first, load a reference to a ClassA instance (loop-variant load)
        obja = ...;
        // 2) then, access at least two different fields of obja
        access to obja.field1;
        access to obja.field2;
    }

ClassA is selected as a target of the alignment optimization.

Pattern for Collocation Optimization

    ClassA obja;
    ClassB objb;
    while (!end) {                    // in a hot loop
        // 1) first, load a reference to a ClassA instance (loop-variant load)
        obja = ...;
        // 2) next, load a reference to a ClassB from obja
        objb = obja.referenceToClassB;
        // 3) then, access at least one field of objb
        access to objb.field1;
    }

The pair of ClassA and ClassB is selected as a target of the collocation optimization.
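The collocation rule can be sketched in the same style as the alignment rule. Again a hypothetical encoding with invented names (RefLoad, collocationPairs), not the paper's code:

```java
import java.util.*;

// One selected reference load inside a hot loop: the class whose reference
// field is loaded, and the class of the referenced object.
class RefLoad {
    final String fromClass;
    final String toClass;
    RefLoad(String fromClass, String toClass) {
        this.fromClass = fromClass;
        this.toClass = toClass;
    }
}

public class CollocationSelector {
    // Rule: a selected load of a reference from class A to class B, plus a
    // selected access to the loaded B object in the same hot loop, makes
    // (A, B) a collocation target pair.
    static Set<String> collocationPairs(List<RefLoad> loads,
                                        Set<String> dereferencedClasses) {
        Set<String> pairs = new TreeSet<>();  // sorted for stable output
        for (RefLoad l : loads) {
            // the referenced object must itself be accessed in the loop
            if (dereferencedClasses.contains(l.toClass)) {
                pairs.add(l.fromClass + "+" + l.toClass);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<RefLoad> loads = Arrays.asList(new RefLoad("Order", "Customer"));
        Set<String> derefed = new HashSet<>(Arrays.asList("Customer"));
        System.out.println(collocationPairs(loads, derefed));  // [Order+Customer]
    }
}
```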

Special Handling for the checkcast Operation

    ClassA obja;
    ClassS objs;                      // ClassS is a superclass of ClassA
    while (!end) {                    // in a hot loop
        // 1) first, load a reference of a superclass of ClassA (loop-variant load)
        objs = ...;
        // 2) next, cast objs to ClassA (accesses the object header)
        obja = (ClassA) objs;
        // 3) then, access at least one field of obja
        access to obja.field1;
    }

ClassA (not ClassS) is selected as a target of the alignment optimization. This is a common pattern in HashMap or TreeMap accesses in Java.
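A concrete, runnable instance of this shape (illustrative only; the Account class is invented): iterating a raw map and casting its values produces a checkcast (an object header access) followed by a field access on the same object:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical value class for illustration.
class Account {
    int balance;
    Account(int balance) { this.balance = balance; }
}

public class CheckcastPattern {
    // A raw Map forces an explicit cast, mirroring the checkcast in the
    // slide's pattern: first the cast touches the object header, then the
    // field access touches the body of the same object.
    @SuppressWarnings("rawtypes")
    static int totalBalance(Map accounts) {
        int total = 0;
        for (Object v : accounts.values()) {
            Account a = (Account) v;  // checkcast: reads the object header
            total += a.balance;       // then a field access on the same object
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Account> m = new HashMap<>();
        m.put("x", new Account(10));
        m.put("y", new Account(32));
        System.out.println(totalBalance(m));  // 42
    }
}
```

Aligning Account (the cast target), rather than the static type of the loaded reference, puts both the header and the hot field on the same cache line.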

Performance Improvements by Optimizations

[Charts: relative throughput (higher is faster; non-zero origin, 0.9 to 1.15) for the object alignment and object collocation optimizations, comparing the baseline (without the optimization), our software-only optimization, and the HPM-based optimization, over SPECpower_ssj2008 and SPECjbb2005 (cache-miss intensive) and SPECjvm2008 and DaCapo-9.12 (non-intensive).]

Our technique achieved performance gains comparable to the HPM-based optimization in cache-miss-intensive programs, without relying on hardware help.

Remaining Challenges

For the optimizations:
- Pattern matching in the compiler cannot tell us the location of the objects in the Java heap (e.g., tenure or nursery space).
- An instance of a subclass of the identified target may cause cache misses.
- More detailed software-based profiling can help, in trade for additional overhead.

For the analysis:
- The current pattern matching cannot identify frequent cache misses caused by conflicting writes from multiple threads.
- Different patterns and profiling information are required to achieve higher coverage.

Summary

We presented a technique to identify the instructions and objects that frequently cause cache misses in Java programs without relying on the HPM. Matching hot loops against simple idiomatic patterns worked very well for many Java programs, and we showed the effectiveness of our approach using two types of optimizations.

Backup

Coverage for Each Benchmark (L1D Cache Misses)

[Chart: per-benchmark coverage, as the ratio to the total number of events caused in JIT-compiled code, for our technique with an aggressive threshold, all in hot methods (the upper bound for us), and all in JIT-compiled code. Benchmarks: SPECpower_ssj2008, SPECjbb2005, the SPECjvm2008 sub-benchmarks (compiler.compiler, compiler.sunflow, compress, derby, mpegaudio, serial, sunflow, xml.transform, xml.validation), the DaCapo-9.12 sub-benchmarks (avrora, batik, eclipse, fop, h2, jython, luindex, lusearch, pmd, sunflow, tomcat, tradebeans, tradesoap, xalan), and the average.]