Limits of Parallel Marking Garbage Collection....how parallel can a GC become?

Similar documents
Hard Real-Time Garbage Collection in Java Virtual Machines

Java without the Coffee Breaks: A Nonintrusive Multiprocessor Garbage Collector

Towards Parallel, Scalable VM Services

The Segregated Binary Trees: Decoupling Memory Manager

Eliminating External Fragmentation in a Non-Moving Garbage Collector for Java

JamaicaVM Java for Embedded Realtime Systems

Method-Level Phase Behavior in Java Workloads

Swift: A Register-based JIT Compiler for Embedded JVMs

Configuring the Heap and Garbage Collector for Real- Time Programming.

Garbage Collection. Hwansoo Han

Heuristics for Profile-driven Method- level Speculative Parallelization

On the Effectiveness of GC in Java

Mostly Concurrent Garbage Collection Revisited

Dynamic Selection of Application-Specific Garbage Collectors

Hierarchical Real-time Garbage Collection

Java Memory Allocation with Lazy Worst Fit for Small Objects

Schism: Fragmentation-Tolerant Real-Time Garbage Collection. Fil Pizlo *

Mixed Mode Execution with Context Threading

SABLEJIT: A Retargetable Just-In-Time Compiler for a Portable Virtual Machine p. 1

Real-Time Garbage Collection Panel JTRES 2007

A Quantitative Evaluation of the Contribution of Native Code to Java Workloads

Cycle Tracing. Presented by: Siddharth Tiwary

Free-Me: A Static Analysis for Automatic Individual Object Reclamation

Garbage Collection Algorithms. Ganesh Bikshandi

JVM Memory Model and GC

An Efficient Memory Management Technique That Improves Localities

Write Barrier Elision for Concurrent Garbage Collectors

Phase-based Adaptive Recompilation in a JVM

Adaptive Optimization using Hardware Performance Monitors. Master Thesis by Mathias Payer

JOVE. An Optimizing Compiler for Java. Allen Wirfs-Brock Instantiations Inc.

IBM Research Report. Efficient Memory Management for Long-Lived Objects

Approximation of the Worst-Case Execution Time Using Structural Analysis. Matteo Corti and Thomas Gross Zürich

Context Threading: A flexible and efficient dispatch technique for virtual machine interpreters

Managed runtimes & garbage collection. CSE 6341 Some slides by Kathryn McKinley

Exploiting the Behavior of Generational Garbage Collector

Managed runtimes & garbage collection

JVM Performance Study Comparing Java HotSpot to Azul Zing Using Red Hat JBoss Data Grid

YETI. GraduallY Extensible Trace Interpreter VEE Mathew Zaleski, Angela Demke Brown (University of Toronto) Kevin Stoodley (IBM Toronto)

Reference Counting. Reference counting: a way to know whether a record has other users

Memory Organization and Optimization for Java Workloads

Kodewerk. Java Performance Services. The War on Latency. Reducing Dead Time Kirk Pepperdine Principle Kodewerk Ltd.

How s the Parallel Computing Revolution Going? Towards Parallel, Scalable VM Services

Experiences with Multi-threading and Dynamic Class Loading in a Java Just-In-Time Compiler

Do Your GC Logs Speak To You

Contaminated Garbage Collection

A High Integrity Distributed Deterministic Java Environment. WORDS 2002 January 7, San Diego CA

Exploiting Prolific Types for Memory Management and Optimizations

Accurate Garbage Collection in Uncooperative Environments with Lazy Pointer Stacks

Using Prefetching to Improve Reference-Counting Garbage Collectors

Scheduling Hard Real-time Garbage Collection

MicroPhase: An Approach to Proactively Invoking Garbage Collection for Improved Performance

Understanding Parallelism-Inhibiting Dependences in Sequential Java

Heap Defragmentation in Bounded Time

Eliminating Exception Constraints of Java Programs for IA-64

Using Prefetching to Improve Reference-Counting Garbage Collectors

Identifying the Sources of Cache Misses in Java Programs Without Relying on Hardware Counters. Hiroshi Inoue and Toshio Nakatani IBM Research - Tokyo

Hard Real-Time Garbage Collection in the Jamaica Virtual Machine

Phases in Branch Targets of Java Programs

Replicating Real-Time Garbage Collector

CS842: Automatic Memory Management and Garbage Collection. Mark and sweep

Chip-Multithreading Systems Need A New Operating Systems Scheduler

Jazz: A Tool for Demand-Driven Structural Testing

Ulterior Reference Counting: Fast Garbage Collection without a Long Wait

The VMKit project: Java (and.net) on top of LLVM

Heap Compression for Memory-Constrained Java

An Experimental Study of Rapidly Alternating Bottleneck in n-tier Applications

Reducing the Overhead of Dynamic Compilation

Vertical Profiling: Understanding the Behavior of Object-Oriented Applications

Automatic Object Colocation Based on Read Barriers

Dynamic Profiling & Comparison of Sun Microsystems JDK1.3.1 versus the Kaffe VM APIs

DNWSH - Version: 2.3..NET Performance and Debugging Workshop

Efficient Object Placement including Node Selection in a Distributed Virtual Machine

Reducing the Overhead of Dynamic Compilation

Dynamic SimpleScalar: Simulating Java Virtual Machines

Go GC: Prioritizing Low Latency and Simplicity

Java Performance Evaluation through Rigorous Replay Compilation

Garbage-First Garbage Collection

SUB SUB+BI SUB+BI+AR TINY. nucleic. genlex kb. Ocaml benchmark. (a) Pentium 4 Mispredicted Taken Branches. genlex. nucleic.

Shenandoah: An ultra-low pause time garbage collector for OpenJDK. Christine Flood Roman Kennke Principal Software Engineers Red Hat

Lecture 15 Advanced Garbage Collection

Understanding Application Hiccups

Hardware-Supported Pointer Detection for common Garbage Collections

JAVA PERFORMANCE. PR SW2 S18 Dr. Prähofer DI Leopoldseder

Reference Analyses. VTA - Variable Type Analysis

Memory Management 3/29/14 21:38

Static Java Program Features for Intelligent Squash Prediction

Estimating the Impact of Heap Liveness Information on Space Consumption in Java

Java Performance Tuning and Optimization Student Guide

CS577 Modern Language Processors. Spring 2018 Lecture Garbage Collection

Reducing Generational Copy Reserve Overhead with Fallback Compaction

Reducing Pause Time of Conservative Collectors

CS229 Project: TLS, using Learning to Speculate

MEMORY MANAGEMENT HEAP, STACK AND GARBAGE COLLECTION

A Side-channel Attack on HotSpot Heap Management. Xiaofeng Wu, Kun Suo, Yong Zhao, Jia Rao The University of Texas at Arlington

Garbage Collection. Weiyuan Li

Lecture Notes on Advanced Garbage Collection

Garbage Collection. CS 351: Systems Programming Michael Saelee

Computer Languages, Systems & Structures

Parallel Memory Defragmentation on a GPU

Towards High Performance Processing in Modern Java-based Control Systems. Marek Misiowiec Wojciech Buczak, Mark Buttner CERN ICalepcs 2011

Transcription:

Limits of Parallel Marking Garbage Collection...how parallel can a GC become? Dr. Fridtjof Siebert CTO, aicas ISMM 2008, Tucson, 7. June 2008

Introduction Parallel Hardware is becoming the norm even for embedded computers even for real time systems We need parallel garbage collection That is not only optimized for max. throughput But that gives guarantees on its performance The worst case GC timing must predictable and fast 2

Terminology blocking GC 3

Terminology blocking GC Incremental GC 4

Terminology blocking GC Incremental GC Concurrent GC CPU 1: Application CPU 2: GC CPU 3: Application 5

Terminology blocking GC Incremental GC Concurrent GC parallel GC CPU 1: Application CPU 1 CPU 2: GC CPU 2 CPU 3: Application CPU 3 6

Terminology blocking GC Incremental GC Concurrent GC parallel GC CPU 1: Application CPU 1 CPU 2: GC CPU 2 CPU 3: Application Parallel & Concurrent CPU 3 CPU 1: Application CPU 2: GC 7 CPU 3: GC

Terminology blocking GC Incremental GC Concurrent GC parallel GC CPU 1: Application CPU 1 CPU 2: GC CPU 2 CPU 3: Application CPU 3 Parallel & Concurrent CPU 1: Application Parallel & Concurrent CPU 1 CPU 2: GC CPU 2 8 CPU 3: GC CPU 3

Terminology blocking GC Incremental GC Concurrent GC parallel GC CPU 1: Application CPU 1 CPU 2: GC CPU 2 CPU 3: Application CPU 3 Parallel & Concurrent CPU 1: Application Parallel & Concurrent CPU 1 CPU 2: GC CPU 2 9 CPU 3: GC CPU 3

Parallel Mark & Sweep Incremental Mark & Sweep uses three color marking: white, grey and black mark phase step is find take grey object o mark all white objects referenced by o grey mark o black sweep phase step is take white object free its memory 10

Parallel Mark & Sweep Limits of Parallel Marking Garbage Collection Parallel Sweep Steps not addressed here sweeping can be performed fully in parallel by sweeping different regions of the heap by different CPUs need parallel access to the free lists 11

Parallel Mark & Sweep Parallel Mark several threads may scan grey objects in parallel new color anthracite for grey object that is being scanned by one CPU stalls possible if grey set temporarily empty! 12

Worst Case: Linked List Limits of Parallel Marking Garbage Collection root 13

Worst Case: Linked List Limits of Parallel Marking Garbage Collection root CPU1 CPU2 CPU3 14

Worst Case: Linked List Limits of Parallel Marking Garbage Collection root CPU1 starts mark step CPU1 CPU2 CPU3 15

Worst Case: Linked List Limits of Parallel Marking Garbage Collection root CPU1 CPU1 CPU2 no grey object, stalls! CPU3 16

Worst Case: Linked List Limits of Parallel Marking Garbage Collection root CPU1 CPU1 CPU2 CPU3 no grey object, stalls! 17

Worst Case: Linked List Limits of Parallel Marking Garbage Collection root CPU1 mark step finished CPU1 CPU2 CPU3 18

Worst Case: Linked List Limits of Parallel Marking Garbage Collection root CPU1 CPU2 all CPUs compete for one grey object! CPU1 CPU3 19

Worst Case: Linked List Limits of Parallel Marking Garbage Collection root CPU1 CPU2 eg., CPU2 successful, CPU1 + CPU3 stall! CPU1 CPU2 CPU3 20

Worst Case: Linked List With n CPUs performing mark in parallel there might be n 1 stalls for each mark step only one CPU is performing a mark step at any time Worst case performance equal to non parallel GC! 21

Can we find a better limit for real applications? First, look at two processor parallel mark only what if memory graph consists of two linked lists? 22

Two Linked Lists with two CPUs root CPU1 CPU2 we might be lucky and see no stalls 23

Two Linked Lists with two CPUs root CPU1 CPU2 but we might have bad luck: one list is scanned first, there is a single linked list left! 24

Limit on stalls depends on object depth root 1 2 3 4 13 CPU1 2 11 10 5 12 CPU2 3 12 9 6 11 4 13 8 7 10 5 6 7 8 9 25

Limit on stalls depends on object depth (2 processors) after 1 st stall, all objects with depth 1 are black after 2 nd stall, all objects with depth 2 are black etc. after n th stall, all objects with depth n are black 26

Limit on stalls depends on object depth (2 processors) # of stalls s on two processor parallel mark is limited by max. depth of the memory graph H: 27

Generalization for more processors # of stalls s on p processor parallel mark is limited by: 28

Analysis and Measurements Instrumented JamaicaVM Java implementation to measure the maximum depth of the heap graph, make samples of the current heap graph all 10,000 reference store operations, and output the maximum depths and the maximum ratios depth / heap size in # of objects The instrumented VM was then used to run the SPECjvm98 benchmark suite 29

Measurements Maximum depths of SPECjvm98 benchmarks 1500 1250 1000 750 500 250 0 check jess compress raytrace db mpegaudio javac mtrt jack 30

Measurements Maximum relative depths of SPECjvm98 benchmarks 4,00% 3,50% 3,00% 2,50% 2,00% 1,50% 1,00% 0,50% 0,00% check jess db compress raytrace mpegaudio javac mtrt jack 31

100 10 Measurements Worst case scalability of SPECjvm98 benchmarks 1,0 0,9 0,8 ideal 0,7 check compress 0,6 jess raytrace 0,5 db 0,4 javac mpegaudio 0,3 mtrt jack 0,2 0,1 ideal check compress jess ray trace db jav ac mpegaudio mtrt jack non-parallel 32 1 1 2 4 8 16 32 64 128 256 0,0 1 2 4 8 16 32 64 128 256

Conclusions Limits of Parallel Marking Garbage Collection In the general case, parallel marking garbage collection can not be parallelized. However, if the depth of the memory graph is limited, then parallel mark phase generally works well. To be able to give realtime guarantees on the performance of the mark phase, we need a guarantee from the application on its maximum heap depth. 33