Towards Parallel, Scalable VM Services

Size: px

Start display at page:

Download "Towards Parallel, Scalable VM Services"

Jasmine Adele Blair
5 years ago
Views:

1 Towards Parallel, Scalable VM Services Kathryn S McKinley The University of Texas at Austin Kathryn McKinley Towards Parallel, Scalable VM Services 1 20 th Century Simplistic Hardware View Faster Processors Frequency Scaling Speculation, OO programs do not change they just run faster Kathryn McKinley Towards Parallel, Scalable VM Services 2 1

20 th Century Simplistic Software View Larger, More Capable Software Managed Languages hardware does not change it just runs faster Kathryn McKinley Towards Parallel, Scalable VM

2 20 th Century Simplistic Software View Larger, More Capable Software Managed Languages hardware does not change it just runs faster Kathryn McKinley Towards Parallel, Scalable VM Services 3 The 20 th Century Virtuous Cycle Faster Single Processor Frequency Scaling Larger, More Capable Software Managed Languages Kathryn McKinley Towards Parallel, Scalable VM Services 4 2

3 The 21 st Century Virtuous Cycle? More Cores Chip Multiproccesors CMP? Scalable Software Scalable Apps + Scalable Runtime Kathryn McKinley Towards Parallel, Scalable VM Services 5 How is this new virtuous cycle going? Kathryn McKinley Towards Parallel, Scalable VM Services 6 3

4 Measured Power vs Performance 50 Power (W) (log) ?? Pentium 4 (130nm) Core 2 Duo (65nm) i7 (45nm) Core 2 Duo (45nm) i5 (32nm) Speedup (v Atom 230) (log) SPEC CPU 2006, DaCapo, SPEC jvm98 How is this new virtuous cycle going for single threaded Java Kathryn McKinley Towards Parallel, Scalable VM Services 8 4

5 Performance Scaling Single Threaded Java Benchmarks Core i7: 4 cores, 2 way SMT Speedup Hardware Contexts antlr bloat compress db fop jack javac jess mpegaudio pmd raytrace geomean Kathryn McKinley Towards Parallel, Scalable VM Services 9 How is this new virtuous cycle going for multi-threaded Java Kathryn McKinley Towards Parallel, Scalable VM Services 10 5

6 Performance Scaling Multi-Threaded Java Benchmarks Core i7: 4 cores, 2 way SMT 4.0 Speedup Hardware Contexts avrora batik eclipse h2 jython luindex lusearch mtrt pjbb2005 sunflow tomcat tradebeans tradesoap xalan geomean Kathryn McKinley Towards Parallel, Scalable VM Services 11 Is there hope? Kathryn McKinley Towards Parallel, Scalable VM Services 12 6

7 Managed Languages Challenges & Opportunities Kathryn McKinley Towards Parallel, Scalable VM Services 13 Must Start with a Scalable Managed Runtime Kathryn McKinley Towards Parallel, Scalable VM Services 14 7

Sequential Managed Programs time Single Core Managed Runtime Profiling

Kathryn McKinley Towards Parallel, Scalable VM Services 15 Steps towards

Parallel application Core 0 Core 1 Core 2 Core 3 Threads Core 4 Core 5

8 Sequential Managed Programs time Single Core Managed Runtime Profiling Dynamic Analysis Compilation Garbage Collection Other Helper Threads Kathryn McKinley Towards Parallel, Scalable VM Services 15 Steps towards scalability Step 1. Parallel application Core 0 Core 1 Core 2 Core 3 Threads Core 4 Core 5 Core 6 Unused cores Core 7 time Each thread has different running time Kathryn McKinley Towards Parallel, Scalable VM Services 16 8

9 Steps towards scalability Step 2. Parallel runtime Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7 Threads Managed Runtime Threads time Runtime waits for all application threads to pause Kathryn McKinley Towards Parallel, Scalable VM Services 17 Steps towards scalability Step 3. Parallel & concurrent runtime Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7 Threads Managed Runtime Threads time Managed runtime on application s critical path may perturb its performance Kathryn McKinley Towards Parallel, Scalable VM Services 18 9

10 Steps towards scalability Ideal model Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7 Step 4. Minimize perturbation Threads Threads Whole runtime task taken off critical path time offloads work to concurrent runtime threads Kathryn McKinley Towards Parallel, Scalable VM Services 19 Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7 Steps towards scalability Ideal model Step 4. Minimize perturbation Threads Threads time Worst case is parallel & concurrent Kathryn McKinley Towards Parallel, Scalable VM Services 20 10

11 Vision Scalable Runtimes Runtime & application parallelism & concurrency CMP aware runtime improves application scalability Communication Cache coherency is expensive and performance sensitive Memory bandwidth scaling is problematic Heterogeneity Move non-critical path off power-hungry cores Smarter, more aggressive analysis Specialization? Tuned cores? Special purpose cores? Kathryn McKinley Towards Parallel, Scalable VM Services 21 Approach Profiling (feedback directed optimization) Concurrent analysis More invasive analysis on low-power cores GC High performance concurrent GC High performance non-moving GC Reduced synchronization overheads Distributed & scratchpad GC JIT Concurrent, parallel JIT Cost-benefit shift as low-power cores used Architecture Tuned and/or specialized cores for runtime services Coherence tailed for restricted, common case of GC Kathryn McKinley Towards Parallel, Scalable VM Services 22 11

12 Today Profiling (feedback directed optimization) Concurrent analysis More invasive analysis on low-power cores GC High performance concurrent GC High performance non-moving GC Reduced synchronization overheads Distributed & scratchpad GC JIT Concurrent, parallel JIT Cost-benefit shift as low-power cores used Architecture Tuned and/or specialized cores for runtime services Coherence tailed for restricted, common case of GC Kathryn McKinley Towards Parallel, Scalable VM Services 23 A Concurrent Dynamic Analysis Framework For CMP Hardware Jungwoo Ha U. Texas & UCS/ICI-East Matthew Arnold IBM Research Stephen M. Blackburn Australian National University Kathryn S. McKinley University of Texas at Austin Kathryn McKinley Towards Parallel, Scalable VM Services 24 12

13 Generic Sequential Analysis instrumented code (== overhead)! time data collection! analysis! Difficult to optimize instrumented code Trade accuracy for overhead (sampling) Kathryn McKinley Towards Parallel, Scalable VM Services 25 Generic Concurrent Analysis instrumented code (reduced overhead)! time data collection! dequeue! enqueue! buffering! analysis! (producer) Analysis (consumer) Lower overhead & higher accuracy Must deal with microarchitectural side-effects Kathryn McKinley Towards Parallel, Scalable VM Services 26 13

14 Side-effects to Avoid Core A L1 lower level cache(s) L1 Core B false & true sharing (Producer) High latency memory operation Analysis (Consumer) Cache line ping-ponging Kathryn McKinley Towards Parallel, Scalable VM Services 27 Cache-friendly Asymmetric Buffering Lock-free communication channel between application and analysis thread Cache-friendly asymmetric buffering Actively avoids microarchitectural side-effects Enqueue light-weight instrumentation produces one record at time Dequeue consumes one chunk (fraction of a buffer) at a time Kathryn McKinley Towards Parallel, Scalable VM Services 28 14

Cache-friendly Asymmetric Buffering Core A L1 lower level cache(s) L1 Core B (Producer) Analysis (Consumer) application application!

16 slots on the buffer 4 chunks, 4 slot on each chunk L1 size == chunk size chunk buffer Kathryn McKinley Towards Parallel, Scalable VM

application 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 buffer analyzer chunk Delay consumer dequeue operation until cache line is flushed 2 chunks

15 Cache-friendly Asymmetric Buffering Core A L1 lower level cache(s) L1 Core B (Producer) Analysis (Consumer) application application! writes here! analyzer waits for application here.! analyzer analyzer! reads here! 16 slots on the buffer 4 chunks, 4 slot on each chunk L1 size == chunk size chunk buffer Kathryn McKinley Towards Parallel, Scalable VM Services 29 Cache-friendly Asymmetric Buffering Core A L1 lower level cache(s) (Producer) L1 Core B Analysis (Consumer) application buffer analyzer chunk Delay consumer dequeue operation until cache line is flushed 2 chunks away (smiley location) Analyzer operates one chunk at a time chunk_size > L1 size In practice, chunk_size >= 2 * L1 works well. Kathryn McKinley Towards Parallel, Scalable VM Services 30 15

Cache-friendly Asymmetric Buffering Core A L1 lower level cache(s) (Producer) 4 5 6 7 0 1 2 3 L1 Core B Analysis (Consumer) application 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 analyzer application

$= 0) block(); } *bufptr++ = data; Consumer (sees chunk) while (app_is_running) { index = index_of(chunk_num+2); while (buffer[index] == 0) spin_or_sleep(); consume(chunk_num); chunk_num =$

16 Cache-friendly Asymmetric Buffering Core A L1 lower level cache(s) (Producer) L1 Core B Analysis (Consumer) application analyzer application blocks only when buffer is full waiting until two more chunks are available Kathryn McKinley Towards Parallel, Scalable VM Services 31 Cache-friendly Asymmetric Buffering Producer (sees buffer) while (*bufptr!= 0) { if (*bufptr == MAGIC) bufptr = buffer; if (*bufptr!= 0) block(); } *bufptr++ = data; Consumer (sees chunk) while (app_is_running) { index = index_of(chunk_num+2); while (buffer[index] == 0) spin_or_sleep(); consume(chunk_num); chunk_num = NEXT(chunk_num); } MAGIC Consumer only spins here producer may spin on bufptr, while consumer may spin on buffer Producer code common case is 6 instructions in x86. Kathryn McKinley Towards Parallel, Scalable VM Services 32 16

17 Framework Provides Cache-friendly Asymmetric Buffering (CAB) Minimizes microarchitectural side-effects Quickly offloads event data from application s critical path Configurable parameters for optimization buffer size & chunk size Various collection mode Exhaustive mode Sampling mode Works on various threading model N:M (green) threading model native threading model Kathryn McKinley Towards Parallel, Scalable VM Services 33 Evaluation 3 different CMP processors 8KB L1 32KB 32KB 32KB 32KB 32KB 32KB 32KB 32KB 512KB L2 4MB L2 4MB L2 256K B 256K B 256K B 256K B system bus 8MB L3 Pentium 4 w/ hyperthreading Core 2 Quad Core i7 Kathryn McKinley Towards Parallel, Scalable VM Services 34 17

18 Evaluation Jikes RVM (2 different threading models) N:M threading (Jikes RVM 2.9.2) Native threading (Jikes RVM 3.0.1) Reference Dynamic Analysis Implementation Method counting Call graph Call tree profiling Path profiling Cache simulator using load/store events Benchmarks DaCapo, SPEC JVM 98 benchmark suites Parameters buffer size = 2MB, chunk size = 128KB Kathryn McKinley Towards Parallel, Scalable VM Services 35 GeoMean Overhead (%) Call Graph Profiling Instrumentation Enqueue Enqueue+Dequeue Concurrent (N:M) Sequential Core-i7 Core-2 Pentium-4 Instrumentation Overhead Bar 1 Bar1 Collect event data and write into a single word. No analysis thread Kathryn McKinley Towards Parallel, Scalable VM Services 36 18

Call Graph Profiling GeoMean Overhead (%) 30 25 20 15 10 5 Instrumentation Enqueue Enqueue+Dequeue Concurrent (N:M) Sequential 0 Core-i7 Core-2 Pentium-4 Enqueueing Overhead (Bar2 Bar 1) Bar2 Collect

19 Call Graph Profiling GeoMean Overhead (%) Instrumentation Enqueue Enqueue+Dequeue Concurrent (N:M) Sequential 0 Core-i7 Core-2 Pentium-4 Enqueueing Overhead (Bar2 Bar 1) Bar2 Collect event data and write into the buffer. No analysis thread Kathryn McKinley Towards Parallel, Scalable VM Services 37 Call Graph Profiling 30 GeoMean Overhead (%) Instrumentation Enqueue Enqueue+Dequeue Concurrent (N:M) Sequential 0 Core-i7 Core-2 Pentium-4 Communication Overhead (Bar3 Bar 2) Bar3 Analysis thread dequeues and write it into a single word. Kathryn McKinley Towards Parallel, Scalable VM Services 38 19

20 Call Graph Profiling GeoMean Overhead (%) Instrumentation Enqueue Enqueue+Dequeue Concurrent (N:M) Sequential 0 Core-i7 Core-2 Pentium-4 Analysis (data processing) Overhead (Bar 5 Bar 1) Bar4 Concurrent Analysis Bar5 Sequential Analysis Kathryn McKinley Towards Parallel, Scalable VM Services 39 Call Graph Profiling 30 GeoMean Overhead (%) Instrumentation Enqueue Enqueue+Dequeue Concurrent (N:M) Sequential 0 Core-i7 Core-2 Pentium-4 Overhead reduction with Concurrent Analysis (Bar 5 Bar 4) Bar4 Concurrent Analysis Bar5 Sequential Analysis Kathryn McKinley Towards Parallel, Scalable VM Services 40 20

GeoMean Overhead (%) 160 140 120 100 80 60 40 20 0 Path Profiling Core-i7 Core-2 Pentium-4 Instrumentation Enqueue Enqueue+Dequeue Concurrent (N:M) Sequential Overhead reduction with Concurrent

21 GeoMean Overhead (%) Path Profiling Core-i7 Core-2 Pentium-4 Instrumentation Enqueue Enqueue+Dequeue Concurrent (N:M) Sequential Overhead reduction with Concurrent Analysis (Bar 5 Bar 4) More data & computation than call graph Kathryn McKinley Towards Parallel, Scalable VM Services 41 Path Profiling Multithreaded Benchmarks Overhead (%) Instrumentation Enqueue Enqueue+Dequeue Concurrent (N:M) Sequential 0.00 mtrt hsqldb lusearch xalan Core 2 Multi-threaded benchmarks Kathryn McKinley Towards Parallel, Scalable VM Services 42 21

22 Concurrent Dynamic Analysis Framework Conclusions Framework eases implementation of many client analyses CAB efficiently transfers application data to analysis thread avoiding microarchitectural side-effects Framework efficiently utilizes extra cycles to perform dynamic analysis concurrently Kathryn McKinley Towards Parallel, Scalable VM Services 43 Related Work Concurrent Lock-free Queue FastForward single-producer & singleconsumer [Giacomoni et al. 09] Concurrent analysis for specific clients PiPA cache simulator [Zhao et al. 08] Shadow process approach Shadow profiling [Moseley et al. 07] SuperPin [Wallace et al. 07] Kathryn McKinley Towards Parallel, Scalable VM Services 44 22

23 Are we finished with CMP efficient buffering? Kathryn McKinley Towards Parallel, Scalable VM Services 45 Are we finished with CMP efficient buffering? Not yet Kathryn McKinley Towards Parallel, Scalable VM Services 46 23

24 Are we finished with CMP efficient buffering? Not yet parallel analysis memory scalable self-tuning parameters Kathryn McKinley Towards Parallel, Scalable VM Services 47 There is some hope but we need many such base mechanisms Kathryn McKinley Towards Parallel, Scalable VM Services 48 24

25 Software Challenges and Opportunities Communication (efficient coherency) Analysis (off critical path, new analyses) GC (concurrent, parallel, high throughput) JIT (concurrent, parallel, more aggressive) Heterogeneity (exploit it) Memory (PCM, bandwidth limits) Kathryn McKinley Towards Parallel, Scalable VM Services 49 Hardware Challenges and Opportunities Heterogeneity Tune cores to ubiquitous loads? Specialize for ubiquitous loads? Coherency Support special (weak) case for concurrent GC? Memory/Cache Exploit access behavior of managed languages Kathryn McKinley Towards Parallel, Scalable VM Services 50 25

26 The 21 st Century Virtuous Cycle? Questions? More Cores Chip Multiproccesors CMP? Scalable Software Scalable Apps + Scalable Runtime Kathryn McKinley Towards Parallel, Scalable VM Services 51 Why I love my job Because when I fail, I get better Kathryn McKinley Towards Parallel, Scalable VM Services 52 26

27 Failures Rejected: jobs (all) Failed: my Rice PhD qualifying exam Rejected: jobs (some) Rejected: my first three grant applications Bad teaching evaluations Rejected 2 times: my most cited paper Rejected: jobs (some) Rejected: papers, grants, papers, grants,... Bad teaching evaluations Kathryn McKinley Towards Parallel, Scalable VM Services 53 27

How s the Parallel Computing Revolution Going? Towards Parallel, Scalable VM Services

How s the Parallel Computing Revolution Going? Towards Parallel, Scalable VM Services Kathryn S McKinley The University of Texas at Austin Kathryn McKinley Towards Parallel, Scalable VM Services 1 20 th