Limits of Parallel Marking Garbage Collection...how parallel can a GC become? Dr. Fridtjof Siebert CTO, aicas ISMM 2008, Tucson, 7. June 2008
Introduction Parallel Hardware is becoming the norm even for embedded computers even for real time systems We need parallel garbage collection That is not only optimized for max. throughput But that gives guarantees on its performance The worst case GC timing must predictable and fast 2
Terminology blocking GC 3
Terminology blocking GC Incremental GC 4
Terminology blocking GC Incremental GC Concurrent GC CPU 1: Application CPU 2: GC CPU 3: Application 5
Terminology blocking GC Incremental GC Concurrent GC parallel GC CPU 1: Application CPU 1 CPU 2: GC CPU 2 CPU 3: Application CPU 3 6
Terminology blocking GC Incremental GC Concurrent GC parallel GC CPU 1: Application CPU 1 CPU 2: GC CPU 2 CPU 3: Application Parallel & Concurrent CPU 3 CPU 1: Application CPU 2: GC 7 CPU 3: GC
Terminology blocking GC Incremental GC Concurrent GC parallel GC CPU 1: Application CPU 1 CPU 2: GC CPU 2 CPU 3: Application CPU 3 Parallel & Concurrent CPU 1: Application Parallel & Concurrent CPU 1 CPU 2: GC CPU 2 8 CPU 3: GC CPU 3
Terminology blocking GC Incremental GC Concurrent GC parallel GC CPU 1: Application CPU 1 CPU 2: GC CPU 2 CPU 3: Application CPU 3 Parallel & Concurrent CPU 1: Application Parallel & Concurrent CPU 1 CPU 2: GC CPU 2 9 CPU 3: GC CPU 3
Parallel Mark & Sweep Incremental Mark & Sweep uses three color marking: white, grey and black mark phase step is find take grey object o mark all white objects referenced by o grey mark o black sweep phase step is take white object free its memory 10
Parallel Mark & Sweep Limits of Parallel Marking Garbage Collection Parallel Sweep Steps not addressed here sweeping can be performed fully in parallel by sweeping different regions of the heap by different CPUs need parallel access to the free lists 11
Parallel Mark & Sweep Parallel Mark several threads may scan grey objects in parallel new color anthracite for grey object that is being scanned by one CPU stalls possible if grey set temporarily empty! 12
Worst Case: Linked List Limits of Parallel Marking Garbage Collection root 13
Worst Case: Linked List Limits of Parallel Marking Garbage Collection root CPU1 CPU2 CPU3 14
Worst Case: Linked List Limits of Parallel Marking Garbage Collection root CPU1 starts mark step CPU1 CPU2 CPU3 15
Worst Case: Linked List Limits of Parallel Marking Garbage Collection root CPU1 CPU1 CPU2 no grey object, stalls! CPU3 16
Worst Case: Linked List Limits of Parallel Marking Garbage Collection root CPU1 CPU1 CPU2 CPU3 no grey object, stalls! 17
Worst Case: Linked List Limits of Parallel Marking Garbage Collection root CPU1 mark step finished CPU1 CPU2 CPU3 18
Worst Case: Linked List Limits of Parallel Marking Garbage Collection root CPU1 CPU2 all CPUs compete for one grey object! CPU1 CPU3 19
Worst Case: Linked List Limits of Parallel Marking Garbage Collection root CPU1 CPU2 eg., CPU2 successful, CPU1 + CPU3 stall! CPU1 CPU2 CPU3 20
Worst Case: Linked List With n CPUs performing mark in parallel there might be n 1 stalls for each mark step only one CPU is performing a mark step at any time Worst case performance equal to non parallel GC! 21
Can we find a better limit for real applications? First, look at two processor parallel mark only what if memory graph consists of two linked lists? 22
Two Linked Lists with two CPUs root CPU1 CPU2 we might be lucky and see no stalls 23
Two Linked Lists with two CPUs root CPU1 CPU2 but we might have bad luck: one list is scanned first, there is a single linked list left! 24
Limit on stalls depends on object depth root 1 2 3 4 13 CPU1 2 11 10 5 12 CPU2 3 12 9 6 11 4 13 8 7 10 5 6 7 8 9 25
Limit on stalls depends on object depth (2 processors) after 1 st stall, all objects with depth 1 are black after 2 nd stall, all objects with depth 2 are black etc. after n th stall, all objects with depth n are black 26
Limit on stalls depends on object depth (2 processors) # of stalls s on two processor parallel mark is limited by max. depth of the memory graph H: 27
Generalization for more processors # of stalls s on p processor parallel mark is limited by: 28
Analysis and Measurements Instrumented JamaicaVM Java implementation to measure the maximum depth of the heap graph, make samples of the current heap graph all 10,000 reference store operations, and output the maximum depths and the maximum ratios depth / heap size in # of objects The instrumented VM was then used to run the SPECjvm98 benchmark suite 29
Measurements Maximum depths of SPECjvm98 benchmarks 1500 1250 1000 750 500 250 0 check jess compress raytrace db mpegaudio javac mtrt jack 30
Measurements Maximum relative depths of SPECjvm98 benchmarks 4,00% 3,50% 3,00% 2,50% 2,00% 1,50% 1,00% 0,50% 0,00% check jess db compress raytrace mpegaudio javac mtrt jack 31
100 10 Measurements Worst case scalability of SPECjvm98 benchmarks 1,0 0,9 0,8 ideal 0,7 check compress 0,6 jess raytrace 0,5 db 0,4 javac mpegaudio 0,3 mtrt jack 0,2 0,1 ideal check compress jess ray trace db jav ac mpegaudio mtrt jack non-parallel 32 1 1 2 4 8 16 32 64 128 256 0,0 1 2 4 8 16 32 64 128 256
Conclusions Limits of Parallel Marking Garbage Collection In the general case, parallel marking garbage collection can not be parallelized. However, if the depth of the memory graph is limited, then parallel mark phase generally works well. To be able to give realtime guarantees on the performance of the mark phase, we need a guarantee from the application on its maximum heap depth. 33