Short-term Memory for Self-collecting Mutators. Martin Aigner, Andreas Haas, Christoph Kirsch, Ana Sokolova Universität Salzburg


Short-term Memory for Self-collecting Mutators
Martin Aigner, Andreas Haas, Christoph Kirsch, Ana Sokolova
Universität Salzburg
CHESS Seminar, UC Berkeley, September 2010

Heap Management
[Diagram of the heap: needed objects are a subset of reachable objects.]
Explicit memory management deallocates not-needed objects. Errors: memory leaks, dangling pointers.
Garbage collectors (tracing, reference-counting) deallocate unreachable objects. Errors: reachable memory leaks.

Short-term Memory

Traditional (Persistent) Memory Model
Allocated memory objects are guaranteed to exist until deallocation.
Explicit deallocation is not safe (dangling pointers) and can be space-unbounded (memory leaks).
Implicit deallocation (of unreachable objects) is safe but may be slow or space-consuming (proportional to the size of live memory) and can still be space-unbounded (memory leaks).

Short-term Memory
Memory objects are only guaranteed to exist for a finite amount of time.
Memory objects are allocated with a given expiration date.
Memory objects are neither explicitly nor implicitly deallocated but may be refreshed to extend their expiration date.

With short-term memory, programmers or algorithms specify which memory objects are still needed, not which memory objects are no longer needed!

Full Compile-Time Knowledge
[Figure 2. Allocation with known expiration date.]

Maximal Memory Consumption
[Figure 3. All objects are allocated for one time unit.]

Trading-off Compile-Time, Runtime, Memory
[Figure 4. Allocation with estimated expiration date. If the object is needed longer, it is refreshed.]

Heap Management
[Diagram of the heap with the regions needed, not-expired, and reachable, annotated with "conservative expiration" and "conservative refresh".]

Sources of Errors:
1. not-needed objects are continuously refreshed or time does not advance (memory leaks)
2. needed objects expire (dangling pointers)

Explicit Programming Model
Each thread advances a thread-local clock by invoking an explicit tick() call.
Upon allocation, each object receives an expiration date that is initialized to the thread-local time.
An explicit refresh(object, extension) call sets the expiration date of the object to the current thread-local time plus the given extension.
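The explicit model can be sketched in a few lines of Python. This is a toy model, not the libscm or Jikes RVM interface: the names ThreadClock, Obj, refresh, and is_expired are hypothetical, and the expiration test (date strictly less than the thread-local time) is an assumption borrowed from the expiration rule stated later in the talk.

```python
class ThreadClock:
    """Thread-local clock that only advances on explicit tick() calls."""

    def __init__(self):
        self.time = 0

    def tick(self):
        self.time += 1


class Obj:
    """An object whose expiration date starts at the allocating thread's time."""

    def __init__(self, clock):
        self.expiration = clock.time


def refresh(obj, clock, extension):
    # Expiration date becomes current thread-local time plus the extension.
    obj.expiration = clock.time + extension


def is_expired(obj, clock):
    # Assumption: an object is expired once its date lies strictly in the past.
    return obj.expiration < clock.time


clock = ThreadClock()
short_lived = Obj(clock)        # expiration date 0
long_lived = Obj(clock)
refresh(long_lived, clock, 2)   # expiration date 0 + 2 = 2
clock.tick()                    # thread-local time is now 1
```

After the single tick, short_lived has expired while long_lived survives until the thread-local time exceeds 2.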

Explicit, Concurrent Programming Model
Each object (logically!) receives expiration dates for all threads that are initialized to the respective thread-local times.
Refreshing an object (logically!) sets its already expired expiration dates to the respective thread-local times.
All threads must tick() before a newly allocated or refreshed object can expire!
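A small Python model may help to see why every thread has to tick before an object can expire. It is only a sketch under stated assumptions: the class SharedObj and its methods are hypothetical, thread-local clocks are represented as a plain dictionary, an object counts as expired only when its dates for all threads are in the past, and the refreshing thread's own date is set as in the explicit single-threaded model.

```python
class SharedObj:
    """Object with one (logical) expiration date per thread."""

    def __init__(self, clocks):
        # Each date is initialized to the respective thread-local time.
        self.dates = dict(clocks)

    def refresh(self, clocks, thread, extension=0):
        # Assumption: the refreshing thread's date follows the explicit model.
        self.dates[thread] = clocks[thread] + extension
        # Already expired dates are reset to the respective thread-local times.
        for t, now in clocks.items():
            if self.dates[t] < now:
                self.dates[t] = now

    def is_expired(self, clocks):
        # Assumption: expired only if expired with respect to every thread.
        return all(self.dates[t] < clocks[t] for t in clocks)


clocks = {"A": 0, "B": 0}            # thread-local times of two threads
obj = SharedObj(clocks)
clocks["A"] += 1                     # only thread A ticks
after_one_tick = obj.is_expired(clocks)
clocks["B"] += 1                     # now thread B has ticked as well
after_all_ticked = obj.is_expired(clocks)
```

In this model the object is still alive after only thread A has ticked and expires once thread B has ticked too.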

Our Conjecture: It is easier to say which objects are still needed than which objects are not needed anymore!

Use Cases

benchmark     LoC     tick   refresh   free    aux   total
mpg123        16043   1      0         (-)43   0     44
JLayer        8247    1      6         0       2     9
Monte Carlo   1450    1      3         0       2     6
LuIndex       74584   2      15        0       3     20

Table 2. Use cases of short-term memory: lines of code of the benchmark, number of tick-calls, number of refresh-calls, number of free-calls, number of auxiliary lines of code, and total number of modified lines of code.

Self-collecting Mutators

Goals
Explicit, thread-safe memory management system
Constant time complexity for all operations: predictable execution times, incrementality
Constant space consumption by all operations: small, bounded space overhead
No additional threads and no read/write barriers: self-collecting mutators!

Implementations
Java patch under EPL: based on Jikes RVM, GNU Classpath class library
Dynamic C library (libscm) under GPL: based on POSIX threads, ptmalloc2 allocator; works with any legacy code (1-word space overhead per memory block)
Available at: tiptoe.cs.uni-salzburg.at/short-term-memory

Two Approximations
Single-expiration-date approximation (for Java): one expiration date for all threads; recursive refresh is easy but blocking threads are a problem.
Multiple-expiration-date approximation (for C): expiration dates for all threads that refreshed an object; recursive refresh is difficult but blocking threads can be handled.

Global Time

step                     1    2    3    4    5
ticked-threads counter   3    2    2    1    3
thread-local time 1      4    5    6    6    6
thread-local time 2      17   17   17   18   18
thread-local time 3      2    2    2    2    3
global time              2    2    2    2    3

(Thread 1 ticks before steps 2 and 3, thread 2 before step 4, thread 3 before step 5.)
Figure 5. Global time calculation.
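One reading of Figure 5 is that global time advances only after every thread has ticked at least once since the last advance, with the ticked-threads counter tracking how many threads still have to tick. The Python sketch below follows that reading; the class GlobalClock and its fields are hypothetical names, and the real implementations may keep this state differently. The demo replays the tick order of the figure (thread 1 twice, then thread 2, then thread 3) and reproduces the counter values 3, 2, 2, 1, 3 and the global times 2, 2, 2, 2, 3.

```python
class GlobalClock:
    """Global time that advances once all threads have ticked."""

    def __init__(self, thread_ids, start=0):
        self.global_time = start
        self.num_threads = len(thread_ids)
        # Which threads have already ticked in the current global period.
        self.ticked = {t: False for t in thread_ids}
        # Number of threads that still have to tick (assumed counter meaning).
        self.counter = self.num_threads

    def tick(self, thread_id):
        if not self.ticked[thread_id]:
            self.ticked[thread_id] = True
            self.counter -= 1
            if self.counter == 0:
                # Every thread has ticked: advance global time and reset.
                self.global_time += 1
                self.counter = self.num_threads
                for t in self.ticked:
                    self.ticked[t] = False
        return self.global_time


clock = GlobalClock(["t1", "t2", "t3"], start=2)
trace = [(clock.counter, clock.global_time)]
for thread in ["t1", "t1", "t2", "t3"]:
    clock.tick(thread)
    trace.append((clock.counter, clock.global_time))
```

A repeated tick by the same thread within one period leaves the counter unchanged, which is why a fast thread cannot advance global time on its own.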

Single Expiration Date
Allocation: expiration date = global time + 1
Refresh: expiration date = global time + 1 + extension, unless the result is less than the old date
Expiration: expiration date < global time
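The three rules translate directly into arithmetic on integers. The following Python functions are a minimal sketch with hypothetical names; they only restate the rules above and say nothing about how the Java implementation stores the date.

```python
def allocation_date(global_time):
    # Allocation: expiration date = global time + 1
    return global_time + 1


def refreshed_date(old_date, global_time, extension):
    # Refresh: global time + 1 + extension, unless that is less than the old date
    return max(old_date, global_time + 1 + extension)


def is_expired(expiration_date, global_time):
    # Expiration: expiration date < global time
    return expiration_date < global_time


date = allocation_date(5)                    # allocated at global time 5
alive_next_period = not is_expired(date, 6)  # still alive at global time 6
date_after_refresh = refreshed_date(date, 5, 2)
```

The extra "+ 1" means an object allocated at global time 5 is still alive throughout global time 6 and can expire only at global time 7.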

Thread-Global Time
Threads are partitioned into active and passive.
Global time is computed over active threads.
[Figure 7. Thread-global times: ticked-threads counter, two thread-global times (with block, tick, and unblock events), and the resulting global time.]

Multiple Expiration Dates
Allocation: first expiration date = thread-global time + 2
Refresh: new expiration date = thread-global time + 2 + extension
Expiration: for all threads t and expiration dates d of t: expiration date d < thread-global time of t
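As a sketch of these rules, a memory block can be modeled as a list of (thread, expiration date) pairs that grows with every refresh. The class MemoryBlock and the dictionary of thread-global times are hypothetical; this Python model only illustrates the rules above and is not how libscm represents dates.

```python
class MemoryBlock:
    """Memory block carrying one expiration date per allocation or refresh."""

    def __init__(self, thread, thread_global_time):
        # Allocation: first expiration date = thread-global time + 2
        self.dates = [(thread, thread_global_time[thread] + 2)]

    def refresh(self, thread, thread_global_time, extension):
        # Refresh: new expiration date = thread-global time + 2 + extension
        self.dates.append((thread, thread_global_time[thread] + 2 + extension))

    def is_expired(self, thread_global_time):
        # Expired only if every date d of every thread t satisfies
        # d < thread-global time of t.
        return all(d < thread_global_time[t] for t, d in self.dates)


times = {"A": 4, "B": 10}          # thread-global times of threads A and B
block = MemoryBlock("A", times)    # date (A, 6)
block.refresh("B", times, 1)       # adds date (B, 13)
times["A"] = 7                     # A's date has passed, B's has not
expired_for_a_only = block.is_expired(times)
times["B"] = 14                    # now B's date has passed as well
expired_for_all = block.is_expired(times)
```

Each date is compared against the time of the thread that created it, so a thread that is far ahead cannot expire a block another thread has refreshed.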


Implementation

Java Object Model
Jikes objects are extended by a 3-word object header:
16-bit integer for expiration date
2 references for doubly-linked list of objects sorted by expiration dates
16-bit allocation-site identifier
Three list operations: insert, remove, select-expired

Complexity Trade-off

                            insert     delete   select expired
Singly-linked list          O(1)       O(m)     O(m)
Doubly-linked list          O(1)       O(1)     O(m)
Sorted doubly-linked list   O(m)       O(1)     O(1)
Insert-pointer buffer       O(log n)   O(1)     O(1)
Segregated buffer           O(1)       O(1)     O(log n)

Table 2. Comparison of buffer implementations. The number of objects in a buffer is m, the maximal expiration extension is n.

Segregated buffer (with bounded expiration extension n=3 at time 5)
[Figure 7. Segregated buffer implementation.]

C Memory Block Model
An expiration date for a given memory block is represented by a descriptor, which is a pointer to the block.
Memory blocks are extended by a 1-word descriptor counter, which counts the descriptors pointing to a given block.
Descriptors representing a given expiration date are gathered in a per-thread descriptor list.

Descriptor List
[Figure 8. The design of the descriptor list.]

Descriptor Buffer
A descriptor buffer is an array of size n+3 of descriptor lists, where n is a compile-time bound on the maximal extension for refreshing.
Two (constant-time) buffer operations: insert, move-expired
Two buffers per thread: locally-clocked and globally-clocked
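One way to obtain constant-time insert and move-expired on an array of n+3 lists is to index the array cyclically by expiration date. The Python sketch below assumes exactly that: the class DescriptorBuffer, the modulo indexing, and the assumption that all live dates fall into the window from the current time to the current time plus n+2 are illustrative choices, not taken from libscm. Python lists stand in for the linked descriptor lists; moving a whole list to the expired collection is a single reference move.

```python
class DescriptorBuffer:
    """Cyclic array of n+3 descriptor lists, indexed by expiration date."""

    def __init__(self, max_extension, now=0):
        self.size = max_extension + 3
        self.slots = [[] for _ in range(self.size)]
        self.expired = []          # collection of whole expired lists
        self.now = now

    def insert(self, descriptor, date):
        # Assumption: live dates lie in the window [now, now + n + 2].
        if not (self.now <= date < self.now + self.size):
            raise ValueError("expiration date outside the buffer window")
        self.slots[date % self.size].append(descriptor)

    def move_expired(self):
        # Advancing time by one expires exactly the list for the old time.
        index = self.now % self.size
        self.expired.append(self.slots[index])   # move the list as a whole
        self.slots[index] = []                   # slot is reused for now + n + 3
        self.now += 1


buf = DescriptorBuffer(max_extension=1)   # n = 1, so four slots
buf.insert("a", buf.now + 2)              # date of a fresh allocation
buf.insert("b", buf.now + 2 + 1)          # date after a refresh with extension 1
for _ in range(3):
    buf.move_expired()                    # time advances from 0 to 3
expired_so_far = [d for lst in buf.expired for d in lst]
```

With dates of the form time + 2 + extension and extension at most n, the window never holds more than n+3 distinct dates, which is consistent with the array size given above.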

Memory Operations (are all constant-time modulo the underlying allocator)
malloc(s) returns a pointer to a memory block of size s plus one word for the descriptor counter, which is set to zero
free(block) frees the given block if its descriptor counter is zero
local_refresh(block, extension)
global_refresh(block, extension)
tick()
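The role of the descriptor counter can be illustrated with a toy Python model. All names here (Block, model_malloc, model_free, model_refresh, model_expire) are hypothetical and deliberately differ from the libscm calls listed above. The model also assumes that a block is deallocated when its last descriptor expires, which the slide does not state explicitly.

```python
class Block:
    """Memory block extended by a descriptor counter."""

    def __init__(self, size):
        self.size = size
        self.descriptor_counter = 0   # set to zero on allocation
        self.freed = False


def model_malloc(size):
    return Block(size)


def model_free(block):
    # Frees the block only if no descriptor points to it.
    if block.descriptor_counter == 0:
        block.freed = True
    return block.freed


def model_refresh(block):
    # A refresh creates a descriptor, modeled as a reference to the block.
    block.descriptor_counter += 1
    return block


def model_expire(descriptor):
    # Assumption: when the last descriptor expires, the block is deallocated.
    descriptor.descriptor_counter -= 1
    if descriptor.descriptor_counter == 0:
        descriptor.freed = True


block = model_malloc(64)
descriptor = model_refresh(block)       # block now has one descriptor
freed_while_refreshed = model_free(block)
model_expire(descriptor)                # last descriptor expires
```

In this model an explicit free on a refreshed block is a no-op, so persistent and short-term use of the same allocator can coexist.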

Experiments

Setup

CPU           2x AMD Opteron DualCore, 2.0 GHz
RAM           4GB
OS            Linux 2.6.32-21-generic
Java VM       Jikes RVM 3.1.0
C compiler    gcc version 4.4.3
C allocator   ptmalloc2-20011215 (glibc-2.10.1)

Table 3. System configuration.

Java: Memory

                      MC leaky   MC fixed   4 MC fixed   JLayer   LuIndex
SCM(1,1)              40MB       40MB       60MB         95MB     370MB
SCM(50,20)            50MB       40MB       70MB         /        /
aggressive SCM(1,1)   /          /          /            90MB     250MB
GEN                   95MB       40MB       70MB         95MB     370MB
MS                    100MB      40MB       70MB         95MB     370MB

Table 4. Heap size for the different system configurations. SCM(n, k) stands for self-collecting mutators with a maximal expiration extension of n. A tick-call is executed every k-th round of the periodic behavior of the benchmark.

Java: Throughput
[Figure 9. Total execution time of the Monte Carlo benchmarks in percentage of the total execution time of the benchmark using self-collecting mutators. Y-axis: total runtime in % of the runtime of SCM (lower is better). Benchmarks: MC leaky, MC fixed, 4xMC fixed. Configurations: SCM(50,20), GEN, MS, each also with double memory.]

Java: Throughput
[Figure 10. Total execution time of the JLayer and the LuIndex benchmarks in percentage of the total execution time of the benchmark using self-collecting mutators. Y-axis: total runtime in % of the runtime of SCM (lower is better). Configurations: SCM(1,1), aggressive SCM(1,1), GEN, MS, GEN double memory, MS double memory.]

Java: Latency & Memory
[Figure 11. Free memory and loop execution time of the fixed Monte Carlo benchmark. X-axis: loop iteration. Y-axes: free memory in MB (higher is better); loop execution time in microseconds, logarithmic (lower is better). Series: GEN, MS, and SCM, each for free memory and loop execution time.]

Java: Latency w/ Refreshing
[Figure 13. Loop execution time of the Monte Carlo benchmark with different tick frequencies. Self-collecting mutators is used. X-axis: loop iteration. Y-axis: loop execution time in microseconds, logarithmic (lower is better). Series: 1 tick/1 iteration, 1 tick/50 iterations, 1 tick/200 iterations.]

Java: Memory w/ Refreshing
[Figure 14. Free memory of the Monte Carlo benchmark with different tick frequencies. Self-collecting mutators is used. X-axis: loop iteration (logarithmic). Y-axis: free memory in MB (higher is better). Series: 1 tick/1 iteration, 1 tick/50 iterations, 1 tick/200 iterations.]

C: Overhead

                            persistent MM      short-term MM
malloc of ptmalloc2         166 (78 / 199k)    /
free of ptmalloc2           86 (14 / 169k)     /
malloc of SCM               172 (82 / 267k)    138 (75 / 271k)
free of SCM                 91 (10 / 157k)     /
local-refresh(1, 256B)      /                  227 (131 / 548k)
local-refresh(10, 256B)     /                  225 (131 / 548k)
local-refresh(1, 4KB)       /                  228 (131 / 548k)
local-refresh(10, 4KB)      /                  230 (131 / 548k)
global-refresh(1, 256B)     /                  226 (116 / 551k)
global-refresh(10, 256B)    /                  224 (116 / 551k)
global-refresh(1, 4KB)      /                  227 (116 / 551k)
global-refresh(10, 4KB)     /                  228 (116 / 551k)
local-tick(1, 256B)         /                  378 (277 / 164k)
local-tick(10, 256B)        /                  359 (277 / 71k)
local-tick(1, 4KB)          /                  375 (277 / 164k)
local-tick(10, 4KB)         /                  366 (277 / 164k)
global-tick(1, 256B)        /                  367 (229 / 169k)
global-tick(10, 256B)       /                  352 (229 / 151k)
global-tick(1, 4KB)         /                  365 (229 / 169k)
global-tick(10, 4KB)        /                  361 (229 / 169k)

Table 5. Average (min/max) execution time in CPU clock cycles of the memory management operations in the mpg123 benchmark. Here, e.g., local-refresh(n, m) stands for the local-refresh-call with a maximal expiration extension of n and descriptor page size m. When local/global-refresh is used, the tick-call is denoted by local/global-tick.

C: Throughput

ptmalloc2                 895.25ms   100.00%
ptmalloc2 through SCM     899.43ms   100.47%
local-scm(1, 256B)        890.18ms   99.43%
local-scm(10, 256B)       898.28ms   100.34%
local-scm(1, 4KB)         892.18ms   99.66%
local-scm(10, 4KB)        892.28ms   99.67%
global-scm(1, 256B)       893.76ms   99.83%

Table 6. Total execution times of the mpg123 benchmark averaged over 100 repetitions. Here, local/global-scm(n, m) stands for self-collecting mutators with a maximal expiration extension of n and descriptor page size m, using local/global-refresh.

C: Memory
[Figure 15. Memory overhead and consumption of the mpg123 benchmark. X-axis: number of allocations, with tick-calls marked. Y-axis: memory consumption in KB (lower is better). Series: (1) ptmalloc2, (2) local-scm(1, 256B), (3) space-overhead(1, 256B), (4) global-scm(1, 256B), (5) local-scm(10, 256B), (6) space-overhead(10, 256B), (7) local-scm(1, 4KB), (8) space-overhead(1, 4KB), (9) local-scm(10, 4KB), (10) space-overhead(10, 4KB).]
Again, local/global-scm(n, m) stands for self-collecting mutators with a maximal expiration extension of n and descriptor page size m, using local/global-refresh. We write space-overhead(n, m) to denote the memory overhead of the local-SCM(n, m) configurations for storing descriptors and descriptor counters.

Thank you Check out: eurosys2011.cs.uni-salzburg.at