Towards Parallel, Scalable VM Services
|
|
- Jasmine Adele Blair
- 5 years ago
- Views:
Transcription
1 Towards Parallel, Scalable VM Services Kathryn S McKinley The University of Texas at Austin Kathryn McKinley Towards Parallel, Scalable VM Services 1 20 th Century Simplistic Hardware View Faster Processors Frequency Scaling Speculation, OO programs do not change they just run faster Kathryn McKinley Towards Parallel, Scalable VM Services 2 1
2 20 th Century Simplistic Software View Larger, More Capable Software Managed Languages hardware does not change it just runs faster Kathryn McKinley Towards Parallel, Scalable VM Services 3 The 20 th Century Virtuous Cycle Faster Single Processor Frequency Scaling Larger, More Capable Software Managed Languages Kathryn McKinley Towards Parallel, Scalable VM Services 4 2
3 The 21 st Century Virtuous Cycle? More Cores Chip Multiproccesors CMP? Scalable Software Scalable Apps + Scalable Runtime Kathryn McKinley Towards Parallel, Scalable VM Services 5 How is this new virtuous cycle going? Kathryn McKinley Towards Parallel, Scalable VM Services 6 3
4 Measured Power vs Performance 50 Power (W) (log) ?? Pentium 4 (130nm) Core 2 Duo (65nm) i7 (45nm) Core 2 Duo (45nm) i5 (32nm) Speedup (v Atom 230) (log) SPEC CPU 2006, DaCapo, SPEC jvm98 How is this new virtuous cycle going for single threaded Java Kathryn McKinley Towards Parallel, Scalable VM Services 8 4
5 Performance Scaling Single Threaded Java Benchmarks Core i7: 4 cores, 2 way SMT Speedup Hardware Contexts antlr bloat compress db fop jack javac jess mpegaudio pmd raytrace geomean Kathryn McKinley Towards Parallel, Scalable VM Services 9 How is this new virtuous cycle going for multi-threaded Java Kathryn McKinley Towards Parallel, Scalable VM Services 10 5
6 Performance Scaling Multi-Threaded Java Benchmarks Core i7: 4 cores, 2 way SMT 4.0 Speedup Hardware Contexts avrora batik eclipse h2 jython luindex lusearch mtrt pjbb2005 sunflow tomcat tradebeans tradesoap xalan geomean Kathryn McKinley Towards Parallel, Scalable VM Services 11 Is there hope? Kathryn McKinley Towards Parallel, Scalable VM Services 12 6
7 Managed Languages Challenges & Opportunities Kathryn McKinley Towards Parallel, Scalable VM Services 13 Must Start with a Scalable Managed Runtime Kathryn McKinley Towards Parallel, Scalable VM Services 14 7
8 Sequential Managed Programs time Single Core Managed Runtime Profiling Dynamic Analysis Compilation Garbage Collection Other Helper Threads Kathryn McKinley Towards Parallel, Scalable VM Services 15 Steps towards scalability Step 1. Parallel application Core 0 Core 1 Core 2 Core 3 Threads Core 4 Core 5 Core 6 Unused cores Core 7 time Each thread has different running time Kathryn McKinley Towards Parallel, Scalable VM Services 16 8
9 Steps towards scalability Step 2. Parallel runtime Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7 Threads Managed Runtime Threads time Runtime waits for all application threads to pause Kathryn McKinley Towards Parallel, Scalable VM Services 17 Steps towards scalability Step 3. Parallel & concurrent runtime Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7 Threads Managed Runtime Threads time Managed runtime on application s critical path may perturb its performance Kathryn McKinley Towards Parallel, Scalable VM Services 18 9
10 Steps towards scalability Ideal model Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7 Step 4. Minimize perturbation Threads Threads Whole runtime task taken off critical path time offloads work to concurrent runtime threads Kathryn McKinley Towards Parallel, Scalable VM Services 19 Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7 Steps towards scalability Ideal model Step 4. Minimize perturbation Threads Threads time Worst case is parallel & concurrent Kathryn McKinley Towards Parallel, Scalable VM Services 20 10
11 Vision Scalable Runtimes Runtime & application parallelism & concurrency CMP aware runtime improves application scalability Communication Cache coherency is expensive and performance sensitive Memory bandwidth scaling is problematic Heterogeneity Move non-critical path off power-hungry cores Smarter, more aggressive analysis Specialization? Tuned cores? Special purpose cores? Kathryn McKinley Towards Parallel, Scalable VM Services 21 Approach Profiling (feedback directed optimization) Concurrent analysis More invasive analysis on low-power cores GC High performance concurrent GC High performance non-moving GC Reduced synchronization overheads Distributed & scratchpad GC JIT Concurrent, parallel JIT Cost-benefit shift as low-power cores used Architecture Tuned and/or specialized cores for runtime services Coherence tailed for restricted, common case of GC Kathryn McKinley Towards Parallel, Scalable VM Services 22 11
12 Today Profiling (feedback directed optimization) Concurrent analysis More invasive analysis on low-power cores GC High performance concurrent GC High performance non-moving GC Reduced synchronization overheads Distributed & scratchpad GC JIT Concurrent, parallel JIT Cost-benefit shift as low-power cores used Architecture Tuned and/or specialized cores for runtime services Coherence tailed for restricted, common case of GC Kathryn McKinley Towards Parallel, Scalable VM Services 23 A Concurrent Dynamic Analysis Framework For CMP Hardware Jungwoo Ha U. Texas & UCS/ICI-East Matthew Arnold IBM Research Stephen M. Blackburn Australian National University Kathryn S. McKinley University of Texas at Austin Kathryn McKinley Towards Parallel, Scalable VM Services 24 12
13 Generic Sequential Analysis instrumented code (== overhead)! time data collection! analysis! Difficult to optimize instrumented code Trade accuracy for overhead (sampling) Kathryn McKinley Towards Parallel, Scalable VM Services 25 Generic Concurrent Analysis instrumented code (reduced overhead)! time data collection! dequeue! enqueue! buffering! analysis! (producer) Analysis (consumer) Lower overhead & higher accuracy Must deal with microarchitectural side-effects Kathryn McKinley Towards Parallel, Scalable VM Services 26 13
14 Side-effects to Avoid Core A L1 lower level cache(s) L1 Core B false & true sharing (Producer) High latency memory operation Analysis (Consumer) Cache line ping-ponging Kathryn McKinley Towards Parallel, Scalable VM Services 27 Cache-friendly Asymmetric Buffering Lock-free communication channel between application and analysis thread Cache-friendly asymmetric buffering Actively avoids microarchitectural side-effects Enqueue light-weight instrumentation produces one record at time Dequeue consumes one chunk (fraction of a buffer) at a time Kathryn McKinley Towards Parallel, Scalable VM Services 28 14
15 Cache-friendly Asymmetric Buffering Core A L1 lower level cache(s) L1 Core B (Producer) Analysis (Consumer) application application! writes here! analyzer waits for application here.! analyzer analyzer! reads here! 16 slots on the buffer 4 chunks, 4 slot on each chunk L1 size == chunk size chunk buffer Kathryn McKinley Towards Parallel, Scalable VM Services 29 Cache-friendly Asymmetric Buffering Core A L1 lower level cache(s) (Producer) L1 Core B Analysis (Consumer) application buffer analyzer chunk Delay consumer dequeue operation until cache line is flushed 2 chunks away (smiley location) Analyzer operates one chunk at a time chunk_size > L1 size In practice, chunk_size >= 2 * L1 works well. Kathryn McKinley Towards Parallel, Scalable VM Services 30 15
16 Cache-friendly Asymmetric Buffering Core A L1 lower level cache(s) (Producer) L1 Core B Analysis (Consumer) application analyzer application blocks only when buffer is full waiting until two more chunks are available Kathryn McKinley Towards Parallel, Scalable VM Services 31 Cache-friendly Asymmetric Buffering Producer (sees buffer) while (*bufptr!= 0) { if (*bufptr == MAGIC) bufptr = buffer; if (*bufptr!= 0) block(); } *bufptr++ = data; Consumer (sees chunk) while (app_is_running) { index = index_of(chunk_num+2); while (buffer[index] == 0) spin_or_sleep(); consume(chunk_num); chunk_num = NEXT(chunk_num); } MAGIC Consumer only spins here producer may spin on bufptr, while consumer may spin on buffer Producer code common case is 6 instructions in x86. Kathryn McKinley Towards Parallel, Scalable VM Services 32 16
17 Framework Provides Cache-friendly Asymmetric Buffering (CAB) Minimizes microarchitectural side-effects Quickly offloads event data from application s critical path Configurable parameters for optimization buffer size & chunk size Various collection mode Exhaustive mode Sampling mode Works on various threading model N:M (green) threading model native threading model Kathryn McKinley Towards Parallel, Scalable VM Services 33 Evaluation 3 different CMP processors 8KB L1 32KB 32KB 32KB 32KB 32KB 32KB 32KB 32KB 512KB L2 4MB L2 4MB L2 256K B 256K B 256K B 256K B system bus 8MB L3 Pentium 4 w/ hyperthreading Core 2 Quad Core i7 Kathryn McKinley Towards Parallel, Scalable VM Services 34 17
18 Evaluation Jikes RVM (2 different threading models) N:M threading (Jikes RVM 2.9.2) Native threading (Jikes RVM 3.0.1) Reference Dynamic Analysis Implementation Method counting Call graph Call tree profiling Path profiling Cache simulator using load/store events Benchmarks DaCapo, SPEC JVM 98 benchmark suites Parameters buffer size = 2MB, chunk size = 128KB Kathryn McKinley Towards Parallel, Scalable VM Services 35 GeoMean Overhead (%) Call Graph Profiling Instrumentation Enqueue Enqueue+Dequeue Concurrent (N:M) Sequential Core-i7 Core-2 Pentium-4 Instrumentation Overhead Bar 1 Bar1 Collect event data and write into a single word. No analysis thread Kathryn McKinley Towards Parallel, Scalable VM Services 36 18
19 Call Graph Profiling GeoMean Overhead (%) Instrumentation Enqueue Enqueue+Dequeue Concurrent (N:M) Sequential 0 Core-i7 Core-2 Pentium-4 Enqueueing Overhead (Bar2 Bar 1) Bar2 Collect event data and write into the buffer. No analysis thread Kathryn McKinley Towards Parallel, Scalable VM Services 37 Call Graph Profiling 30 GeoMean Overhead (%) Instrumentation Enqueue Enqueue+Dequeue Concurrent (N:M) Sequential 0 Core-i7 Core-2 Pentium-4 Communication Overhead (Bar3 Bar 2) Bar3 Analysis thread dequeues and write it into a single word. Kathryn McKinley Towards Parallel, Scalable VM Services 38 19
20 Call Graph Profiling GeoMean Overhead (%) Instrumentation Enqueue Enqueue+Dequeue Concurrent (N:M) Sequential 0 Core-i7 Core-2 Pentium-4 Analysis (data processing) Overhead (Bar 5 Bar 1) Bar4 Concurrent Analysis Bar5 Sequential Analysis Kathryn McKinley Towards Parallel, Scalable VM Services 39 Call Graph Profiling 30 GeoMean Overhead (%) Instrumentation Enqueue Enqueue+Dequeue Concurrent (N:M) Sequential 0 Core-i7 Core-2 Pentium-4 Overhead reduction with Concurrent Analysis (Bar 5 Bar 4) Bar4 Concurrent Analysis Bar5 Sequential Analysis Kathryn McKinley Towards Parallel, Scalable VM Services 40 20
21 GeoMean Overhead (%) Path Profiling Core-i7 Core-2 Pentium-4 Instrumentation Enqueue Enqueue+Dequeue Concurrent (N:M) Sequential Overhead reduction with Concurrent Analysis (Bar 5 Bar 4) More data & computation than call graph Kathryn McKinley Towards Parallel, Scalable VM Services 41 Path Profiling Multithreaded Benchmarks Overhead (%) Instrumentation Enqueue Enqueue+Dequeue Concurrent (N:M) Sequential 0.00 mtrt hsqldb lusearch xalan Core 2 Multi-threaded benchmarks Kathryn McKinley Towards Parallel, Scalable VM Services 42 21
22 Concurrent Dynamic Analysis Framework Conclusions Framework eases implementation of many client analyses CAB efficiently transfers application data to analysis thread avoiding microarchitectural side-effects Framework efficiently utilizes extra cycles to perform dynamic analysis concurrently Kathryn McKinley Towards Parallel, Scalable VM Services 43 Related Work Concurrent Lock-free Queue FastForward single-producer & singleconsumer [Giacomoni et al. 09] Concurrent analysis for specific clients PiPA cache simulator [Zhao et al. 08] Shadow process approach Shadow profiling [Moseley et al. 07] SuperPin [Wallace et al. 07] Kathryn McKinley Towards Parallel, Scalable VM Services 44 22
23 Are we finished with CMP efficient buffering? Kathryn McKinley Towards Parallel, Scalable VM Services 45 Are we finished with CMP efficient buffering? Not yet Kathryn McKinley Towards Parallel, Scalable VM Services 46 23
24 Are we finished with CMP efficient buffering? Not yet parallel analysis memory scalable self-tuning parameters Kathryn McKinley Towards Parallel, Scalable VM Services 47 There is some hope but we need many such base mechanisms Kathryn McKinley Towards Parallel, Scalable VM Services 48 24
25 Software Challenges and Opportunities Communication (efficient coherency) Analysis (off critical path, new analyses) GC (concurrent, parallel, high throughput) JIT (concurrent, parallel, more aggressive) Heterogeneity (exploit it) Memory (PCM, bandwidth limits) Kathryn McKinley Towards Parallel, Scalable VM Services 49 Hardware Challenges and Opportunities Heterogeneity Tune cores to ubiquitous loads? Specialize for ubiquitous loads? Coherency Support special (weak) case for concurrent GC? Memory/Cache Exploit access behavior of managed languages Kathryn McKinley Towards Parallel, Scalable VM Services 50 25
26 The 21 st Century Virtuous Cycle? Questions? More Cores Chip Multiproccesors CMP? Scalable Software Scalable Apps + Scalable Runtime Kathryn McKinley Towards Parallel, Scalable VM Services 51 Why I love my job Because when I fail, I get better Kathryn McKinley Towards Parallel, Scalable VM Services 52 26
27 Failures Rejected: jobs (all) Failed: my Rice PhD qualifying exam Rejected: jobs (some) Rejected: my first three grant applications Bad teaching evaluations Rejected 2 times: my most cited paper Rejected: jobs (some) Rejected: papers, grants, papers, grants,... Bad teaching evaluations Kathryn McKinley Towards Parallel, Scalable VM Services 53 27
How s the Parallel Computing Revolution Going? Towards Parallel, Scalable VM Services
How s the Parallel Computing Revolution Going? Towards Parallel, Scalable VM Services Kathryn S McKinley The University of Texas at Austin Kathryn McKinley Towards Parallel, Scalable VM Services 1 20 th
More informationA Trace-based Java JIT Compiler Retrofitted from a Method-based Compiler
A Trace-based Java JIT Compiler Retrofitted from a Method-based Compiler Hiroshi Inoue, Hiroshige Hayashizaki, Peng Wu and Toshio Nakatani IBM Research Tokyo IBM Research T.J. Watson Research Center April
More informationAdaptive Multi-Level Compilation in a Trace-based Java JIT Compiler
Adaptive Multi-Level Compilation in a Trace-based Java JIT Compiler Hiroshi Inoue, Hiroshige Hayashizaki, Peng Wu and Toshio Nakatani IBM Research Tokyo IBM Research T.J. Watson Research Center October
More informationProbabilistic Calling Context
Probabilistic Calling Context Michael D. Bond Kathryn S. McKinley University of Texas at Austin Why Context Sensitivity? Static program location not enough at com.mckoi.db.jdbcserver.jdbcinterface.execquery():213
More informationIdentifying the Sources of Cache Misses in Java Programs Without Relying on Hardware Counters. Hiroshi Inoue and Toshio Nakatani IBM Research - Tokyo
Identifying the Sources of Cache Misses in Java Programs Without Relying on Hardware Counters Hiroshi Inoue and Toshio Nakatani IBM Research - Tokyo June 15, 2012 ISMM 2012 at Beijing, China Motivation
More informationPhase-based Adaptive Recompilation in a JVM
Phase-based Adaptive Recompilation in a JVM Dayong Gu Clark Verbrugge Sable Research Group, School of Computer Science McGill University, Montréal, Canada {dgu1, clump}@cs.mcgill.ca April 7, 2008 Sable
More informationLimits of Parallel Marking Garbage Collection....how parallel can a GC become?
Limits of Parallel Marking Garbage Collection...how parallel can a GC become? Dr. Fridtjof Siebert CTO, aicas ISMM 2008, Tucson, 7. June 2008 Introduction Parallel Hardware is becoming the norm even for
More informationContinuous Object Access Profiling and Optimizations to Overcome the Memory Wall and Bloat
Continuous Object Access Profiling and Optimizations to Overcome the Memory Wall and Bloat Rei Odaira, Toshio Nakatani IBM Research Tokyo ASPLOS 2012 March 5, 2012 Many Wasteful Objects Hurt Performance.
More informationTrace-based JIT Compilation
Trace-based JIT Compilation Hiroshi Inoue, IBM Research - Tokyo 1 Trace JIT vs. Method JIT https://twitter.com/yukihiro_matz/status/533775624486133762 2 Background: Trace-based Compilation Using a Trace,
More informationElfen Scheduling. Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang
Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University Stephen M Blackburn Australian National University
More informationToward Efficient Strong Memory Model Support for the Java Platform via Hybrid Synchronization
Toward Efficient Strong Memory Model Support for the Java Platform via Hybrid Synchronization Aritra Sengupta, Man Cao, Michael D. Bond and Milind Kulkarni 1 PPPJ 2015, Melbourne, Florida, USA Programming
More informationAdaptive Optimization using Hardware Performance Monitors. Master Thesis by Mathias Payer
Adaptive Optimization using Hardware Performance Monitors Master Thesis by Mathias Payer Supervising Professor: Thomas Gross Supervising Assistant: Florian Schneider Adaptive Optimization using HPM 1/21
More informationMicroprocessor Trends and Implications for the Future
Microprocessor Trends and Implications for the Future John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 522 Lecture 4 1 September 2016 Context Last two classes: from
More informationJIT Compilation Policy for Modern Machines
JIT Compilation Policy for Modern Machines Prasad A. Kulkarni Department of Electrical Engineering and Computer Science, University of Kansas prasadk@ku.edu Abstract Dynamic or Just-in-Time (JIT) compilation
More informationLightweight Data Race Detection for Production Runs
Lightweight Data Race Detection for Production Runs Swarnendu Biswas, UT Austin Man Cao, Ohio State University Minjia Zhang, Microsoft Research Michael D. Bond, Ohio State University Benjamin P. Wood,
More informationPhosphor: Illuminating Dynamic. Data Flow in Commodity JVMs
Phosphor: Illuminating Dynamic Fork me on Github Data Flow in Commodity JVMs Jonathan Bell and Gail Kaiser Columbia University, New York, NY USA Dynamic Data Flow Analysis: Taint Tracking Output that is
More informationChronicler: Lightweight Recording to Reproduce Field Failures
Chronicler: Lightweight Recording to Reproduce Field Failures Jonathan Bell, Nikhil Sarda and Gail Kaiser Columbia University, New York, NY USA What happens when software crashes? Stack Traces Aren
More informationLu Fang, University of California, Irvine Liang Dou, East China Normal University Harry Xu, University of California, Irvine
Lu Fang, University of California, Irvine Liang Dou, East China Normal University Harry Xu, University of California, Irvine 2015-07-09 Inefficient code regions [G. Jin et al. PLDI 2012] Inefficient code
More informationFree-Me: A Static Analysis for Automatic Individual Object Reclamation
Free-Me: A Static Analysis for Automatic Individual Object Reclamation Samuel Z. Guyer, Kathryn McKinley, Daniel Frampton Presented by: Jason VanFickell Thanks to Dimitris Prountzos for slides adapted
More informationDynamic Vertical Memory Scalability for OpenJDK Cloud Applications
Dynamic Vertical Memory Scalability for OpenJDK Cloud Applications Rodrigo Bruno, Paulo Ferreira: INESC-ID / Instituto Superior Técnico, University of Lisbon Ruslan Synytsky, Tetiana Fydorenchyk: Jelastic
More informationUnderstanding Parallelism-Inhibiting Dependences in Sequential Java
Understanding Parallelism-Inhibiting Dependences in Sequential Java Programs Atanas(Nasko) Rountev Kevin Van Valkenburgh Dacong Yan P. Sadayappan Ohio State University Overview and Motivation Multi-core
More informationJava Performance Evaluation through Rigorous Replay Compilation
Java Performance Evaluation through Rigorous Replay Compilation Andy Georges Lieven Eeckhout Dries Buytaert Department Electronics and Information Systems, Ghent University, Belgium {ageorges,leeckhou}@elis.ugent.be,
More informationIntelligent Compilation
Intelligent Compilation John Cavazos Department of Computer and Information Sciences University of Delaware Autotuning and Compilers Proposition: Autotuning is a component of an Intelligent Compiler. Code
More informationTolerating Memory Leaks
Tolerating Memory Leaks UT Austin Technical Report TR-07-64 December 7, 2007 Michael D. Bond Kathryn S. McKinley Department of Computer Sciences The University of Texas at Austin {mikebond,mckinley}@cs.utexas.edu
More informationMethod-Level Phase Behavior in Java Workloads
Method-Level Phase Behavior in Java Workloads Andy Georges, Dries Buytaert, Lieven Eeckhout and Koen De Bosschere Ghent University Presented by Bruno Dufour dufour@cs.rutgers.edu Rutgers University DCS
More informationJinn: Synthesizing Dynamic Bug Detectors for Foreign Language Interfaces
Jinn: Synthesizing Dynamic Bug Detectors for Foreign Language Interfaces Byeongcheol Lee Ben Wiedermann Mar>n Hirzel Robert Grimm Kathryn S. McKinley B. Lee, B. Wiedermann, M. Hirzel, R. Grimm, and K.
More informationImmix: A Mark-Region Garbage Collector with Space Efficiency, Fast Collection, and Mutator Performance
Immix: A Mark-Region Garbage Collector with Space Efficiency, Fast Collection, and Mutator Performance Stephen M. Blackburn Australian National University Steve.Blackburn@anu.edu.au Kathryn S. McKinley
More informationTowards Garbage Collection Modeling
Towards Garbage Collection Modeling Peter Libič Petr Tůma Department of Distributed and Dependable Systems Faculty of Mathematics and Physics, Charles University Malostranské nám. 25, 118 00 Prague, Czech
More informationA common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...
OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.
More informationJava without the Coffee Breaks: A Nonintrusive Multiprocessor Garbage Collector
Java without the Coffee Breaks: A Nonintrusive Multiprocessor Garbage Collector David F. Bacon IBM T.J. Watson Research Center Joint work with C.R. Attanasio, Han Lee, V.T. Rajan, and Steve Smith ACM Conference
More informationNew Challenges in Microarchitecture and Compiler Design
New Challenges in Microarchitecture and Compiler Design Contributors: Jesse Fang Tin-Fook Ngai Fred Pollack Intel Fellow Director of Microprocessor Research Labs Intel Corporation fred.pollack@intel.com
More informationExperiences with Multi-threading and Dynamic Class Loading in a Java Just-In-Time Compiler
, Compilation Technology Experiences with Multi-threading and Dynamic Class Loading in a Java Just-In-Time Compiler Daryl Maier, Pramod Ramarao, Mark Stoodley, Vijay Sundaresan TestaRossa JIT compiler
More informationIBM Research - Tokyo 数理 計算科学特論 C プログラミング言語処理系の最先端実装技術. Trace Compilation IBM Corporation
数理 計算科学特論 C プログラミング言語処理系の最先端実装技術 Trace Compilation Trace JIT vs. Method JIT https://twitter.com/yukihiro_matz/status/533775624486133762 2 Background: Trace-based Compilation Using a Trace, a hot path identified
More informationSwift: A Register-based JIT Compiler for Embedded JVMs
Swift: A Register-based JIT Compiler for Embedded JVMs Yuan Zhang, Min Yang, Bo Zhou, Zhemin Yang, Weihua Zhang, Binyu Zang Fudan University Eighth Conference on Virtual Execution Environment (VEE 2012)
More informationWorkload Characterization and Optimization of TPC-H Queries on Apache Spark
Workload Characterization and Optimization of TPC-H Queries on Apache Spark Tatsuhiro Chiba and Tamiya Onodera IBM Research - Tokyo April. 17-19, 216 IEEE ISPASS 216 @ Uppsala, Sweden Overview IBM Research
More informationA Black-box Approach to Understanding Concurrency in DaCapo
A Black-box Approach to Understanding Concurrency in DaCapo Tomas Kalibera Matthew Mole Richard Jones Jan Vitek University of Kent, Canterbury Purdue University Abstract Increasing levels of hardware parallelism
More informationThe DaCapo Benchmarks: Java Benchmarking Development and Analysis
The DaCapo Benchmarks: Java Benchmarking Development and Analysis Stephen M Blackburn αβ, Robin Garner β, Chris Hoffmann γ, Asjad M Khan γ, Kathryn S McKinley δ, Rotem Bentzur ε, Amer Diwan ζ, Daniel Feinberg
More information2 TEST: A Tracer for Extracting Speculative Threads
EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath
More informationCopyright by Ivan Jibaja 2015
Copyright by Ivan Jibaja 2015 The Dissertation Committee for Ivan Jibaja certifies that this is the approved version of the following dissertation: Exploiting Hardware Heterogeneity and Parallelism for
More informationCGO:U:Auto-tuning the HotSpot JVM
CGO:U:Auto-tuning the HotSpot JVM Milinda Fernando, Tharindu Rusira, Chalitha Perera, Chamara Philips Department of Computer Science and Engineering University of Moratuwa Sri Lanka {milinda.10, tharindurusira.10,
More informationMultithreading: Exploiting Thread-Level Parallelism within a Processor
Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced
More informationHybrid Static-Dynamic Analysis for Statically Bounded Region Serializability
Hybrid Static-Dynamic Analysis for Statically Bounded Region Serializability Aritra Sengupta, Swarnendu Biswas, Minjia Zhang, Michael D. Bond and Milind Kulkarni ASPLOS 2015, ISTANBUL, TURKEY Programming
More informationThe Segregated Binary Trees: Decoupling Memory Manager
The Segregated Binary Trees: Decoupling Memory Manager Mehran Rezaei Dept. of Electrical and Computer Engineering University of Alabama in Huntsville Ron K. Cytron Department of Computer Science Washington
More informationClustered Communication for Efficient Pipelined Multithreading on Commodity MCPs
Clustered Communication for Efficient Pipelined Multithreading on Commodity MCPs Yuanming Zhang, Kanemitsu Ootsu, Takashi Yokota, and Takanobu Baba Abstract Low inter-core communication overheads are critical
More informationZ-Rays: Divide Arrays and Conquer Speed and Flexibility
Z-Rays: Divide Arrays and Conquer Speed and Flexibility Jennifer B. Sartor Stephen M. Blackburn Daniel Frampton Martin Hirzel Kathryn S. McKinley University of Texas at Austin Australian National University
More informationThe Potentials and Challenges of Trace Compilation:
Peng Wu IBM Research January 26, 2011 The Potentials and Challenges of Trace Compilation: Lessons learned from building a trace-jit on top of J9 JVM (Joint work with Hiroshige Hayashizaki and Hiroshi Inoue,
More informationProgram Calling Context. Motivation
Program alling ontext Motivation alling context enhances program understanding and dynamic analyses by providing a rich representation of program location ollecting calling context can be expensive The
More informationDebunking Dynamic Optimization Myths
Debunking Dynamic Optimization Myths Michael Hind (Material borrowed from 3 hr PLDI 04 Tutorial by Stephen Fink, David Grove, and Michael Hind. See those slides for more details.) September 15, 2004 Future
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationMulticore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor.
CS 320 Ch. 18 Multicore Computers Multicore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor. Definitions: Hyper-threading Intel's proprietary simultaneous
More informationString Deduplication for Java-based Middleware in Virtualized Environments
String Deduplication for Java-based Middleware in Virtualized Environments Michihiro Horie, Kazunori Ogata, Kiyokuni Kawachiya, Tamiya Onodera IBM Research - Tokyo Duplicated strings found on Web Application
More informationPerformance Potential of Optimization Phase Selection During Dynamic JIT Compilation
Performance Potential of Optimization Phase Selection During Dynamic JIT Compilation Michael R. Jantz Prasad A. Kulkarni Electrical Engineering and Computer Science, University of Kansas {mjantz,kulkarni}@ittc.ku.edu
More informationEfficient Runtime Tracking of Allocation Sites in Java
Efficient Runtime Tracking of Allocation Sites in Java Rei Odaira, Kazunori Ogata, Kiyokuni Kawachiya, Tamiya Onodera, Toshio Nakatani IBM Research - Tokyo Why Do You Need Allocation Site Information?
More informationJOVE. An Optimizing Compiler for Java. Allen Wirfs-Brock Instantiations Inc.
An Optimizing Compiler for Java Allen Wirfs-Brock Instantiations Inc. Object-Orient Languages Provide a Breakthrough in Programmer Productivity Reusable software components Higher level abstractions Yield
More informationGC Assertions: Using the Garbage Collector to Check Heap Properties
GC Assertions: Using the Garbage Collector to Check Heap Properties Edward E. Aftandilian Samuel Z. Guyer Department of Computer Science Tufts University {eaftan,sguyer}@cs.tufts.edu Abstract This paper
More informationBeyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji
Beyond ILP Hemanth M Bharathan Balaji Multiscalar Processors Gurindar S Sohi Scott E Breach T N Vijaykumar Control Flow Graph (CFG) Each node is a basic block in graph CFG divided into a collection of
More informationReference Object Processing in On-The-Fly Garbage Collection
Reference Object Processing in On-The-Fly Garbage Collection Tomoharu Ugawa, Kochi University of Technology Richard Jones, Carl Ritson, University of Kent Weak Pointers Weak pointers are a mechanism to
More informationAcknowledgements These slides are based on Kathryn McKinley s slides on garbage collection as well as E Christopher Lewis s slides
Garbage Collection Last time Compiling Object-Oriented Languages Today Motivation behind garbage collection Garbage collection basics Garbage collection performance Specific example of using GC in C++
More informationPERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES
PERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES Anish Athalye and Patrick Long Mentors: Austin Clements and Stephen Tu 3 rd annual MIT PRIMES Conference Sequential
More informationThe VMKit project: Java (and.net) on top of LLVM
The VMKit project: Java (and.net) on top of LLVM Nicolas Geoffray Université Pierre et Marie Curie, France nicolas.geoffray@lip6.fr What is VMKit? Glue between existing VM components LLVM, GNU Classpath,
More informationExploiting FIFO Scheduler to Improve Parallel Garbage Collection Performance
Exploiting FIFO Scheduler to Improve Parallel Garbage Collection Performance Junjie Qian, Witawas Srisa-an, Sharad Seth, Hong Jiang, Du Li, Pan Yi University of Nebraska Lincoln, University of Texas Arlington,
More informationThe Design, Implementation, and Evaluation of Adaptive Code Unloading for Resource-Constrained Devices
The Design, Implementation, and Evaluation of Adaptive Code Unloading for Resource-Constrained Devices LINGLI ZHANG and CHANDRA KRINTZ University of California, Santa Barbara Java Virtual Machines (JVMs)
More informationOptimising Multicore JVMs. Khaled Alnowaiser
Optimising Multicore JVMs Khaled Alnowaiser Outline JVM structure and overhead analysis Multithreaded JVM services JVM on multicore An observational study Potential JVM optimisations Basic JVM Services
More informationComputer Architecture
Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,
More informationMicroPhase: An Approach to Proactively Invoking Garbage Collection for Improved Performance
MicroPhase: An Approach to Proactively Invoking Garbage Collection for Improved Performance Feng Xian, Witawas Srisa-an, and Hong Jiang Department of Computer Science & Engineering University of Nebraska-Lincoln
More informationStatic Java Program Features for Intelligent Squash Prediction
Static Java Program Features for Intelligent Squash Prediction Jeremy Singer,, Adam Pocock, Mikel Lujan, Gavin Brown, Nikolas Ioannou, Marcelo Cintra Thread-level Speculation... Aim to use parallel multi-core
More informationDynamic SimpleScalar: Simulating Java Virtual Machines
Dynamic SimpleScalar: Simulating Java Virtual Machines Xianglong Huang J. Eliot B. Moss Kathryn S. McKinley Steve Blackburn Doug Burger Department of Computer Sciences Department of Computer Science Department
More informationLecture 14: Multithreading
CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw
More informationJava On Steroids: Sun s High-Performance Java Implementation. History
Java On Steroids: Sun s High-Performance Java Implementation Urs Hölzle Lars Bak Steffen Grarup Robert Griesemer Srdjan Mitrovic Sun Microsystems History First Java implementations: interpreters compact
More informationPhases in Branch Targets of Java Programs
Phases in Branch Targets of Java Programs Technical Report CU-CS-983-04 ABSTRACT Matthias Hauswirth Computer Science University of Colorado Boulder, CO 80309 hauswirt@cs.colorado.edu Recent work on phase
More informationAverroes: Letting go of the library!
Workshop on WALA - PLDI 15 June 13th, 2015 public class HelloWorld { public static void main(string[] args) { System.out.println("Hello, World!"); 2 public class HelloWorld { public static void main(string[]
More informationAmdahl's Law in the Multicore Era
Amdahl's Law in the Multicore Era Explain intuitively why in the asymmetric model, the speedup actually decreases past a certain point of increasing r. The limiting factor of these improved equations and
More informationReplicating Real-Time Garbage Collector
Replicating Real-Time Garbage Collector Tomas Kalibera Purdue University, West Lafayette, IN 47907, USA; Charles University, Prague, 147 00, Czech Republic SUMMARY Real-time Java is becoming a viable platform
More informationA Quest for Predictable Latency Adventures in Java Concurrency. Martin Thompson
A Quest for Predictable Latency Adventures in Java Concurrency Martin Thompson - @mjpt777 If a system does not respond in a timely manner then it is effectively unavailable 1. It s all about the Blocking
More informationA Dynamic Evaluation of the Precision of Static Heap Abstractions
A Dynamic Evaluation of the Precision of Static Heap Abstractions OOSPLA - Reno, NV October 20, 2010 Percy Liang Omer Tripp Mayur Naik Mooly Sagiv UC Berkeley Tel-Aviv Univ. Intel Labs Berkeley Tel-Aviv
More informationAN 831: Intel FPGA SDK for OpenCL
AN 831: Intel FPGA SDK for OpenCL Host Pipelined Multithread Subscribe Send Feedback Latest document on the web: PDF HTML Contents Contents 1 Intel FPGA SDK for OpenCL Host Pipelined Multithread...3 1.1
More informationMyths and Realities: The Performance Impact of Garbage Collection
Myths and Realities: The Performance Impact of Garbage Collection Tapasya Patki February 17, 2011 1 Motivation Automatic memory management has numerous software engineering benefits from the developer
More informationA common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...
OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.
More informationBoosting the Priority of Garbage: Scheduling Collection on Heterogeneous Multicore Processors
Boosting the Priority of Garbage: Scheduling Collection on Heterogeneous Multicore Processors SHOAIB AKRAM, Ghent University JENNIFER B. SARTOR, Ghent University and Vrije Universiteit Brussel KENZO VAN
More informationManaged runtimes & garbage collection
Managed runtimes Advantages? Managed runtimes & garbage collection CSE 631 Some slides by Kathryn McKinley Disadvantages? 1 2 Managed runtimes Portability (& performance) Advantages? Reliability Security
More informationComplex, concurrent software. Precision (no false positives) Find real bugs in real executions
Harry Xu May 2012 Complex, concurrent software Precision (no false positives) Find real bugs in real executions Need to modify JVM (e.g., object layout, GC, or ISA-level code) Need to demonstrate realism
More informationMoneta: A High-Performance Storage Architecture for Next-generation, Non-volatile Memories
Moneta: A High-Performance Storage Architecture for Next-generation, Non-volatile Memories Adrian M. Caulfield Arup De, Joel Coburn, Todor I. Mollov, Rajesh K. Gupta, Steven Swanson Non-Volatile Systems
More informationDynamic Selection of Application-Specific Garbage Collectors
Dynamic Selection of Application-Specific Garbage Collectors Sunil V. Soman Chandra Krintz University of California, Santa Barbara David F. Bacon IBM T.J. Watson Research Center Background VMs/managed
More informationTail Latency: Beyond Queuing Theory. Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini
Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini Microsoft s Quincy datacenter 3 Servers in US datacenters 2 Servers in
More informationThe Implications of Multi-core
The Implications of Multi- What I want to do today Given that everyone is heralding Multi- Is it really the Holy Grail? Will it cure cancer? A lot of misinformation has surfaced What multi- is and what
More informationManaged runtimes & garbage collection. CSE 6341 Some slides by Kathryn McKinley
Managed runtimes & garbage collection CSE 6341 Some slides by Kathryn McKinley 1 Managed runtimes Advantages? Disadvantages? 2 Managed runtimes Advantages? Reliability Security Portability Performance?
More informationChapter 3 Instruction-Level Parallelism and its Exploitation (Part 5)
Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3, 3.9, and Appendix C) Hardware
More informationROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING 16 MARKS CS 2354 ADVANCE COMPUTER ARCHITECTURE 1. Explain the concepts and challenges of Instruction-Level Parallelism. Define
More informationUnderstanding Hardware Transactional Memory
Understanding Hardware Transactional Memory Gil Tene, CTO & co-founder, Azul Systems @giltene 2015 Azul Systems, Inc. Agenda Brief introduction What is Hardware Transactional Memory (HTM)? Cache coherence
More informationMultiJav: A Distributed Shared Memory System Based on Multiple Java Virtual Machines. MultiJav: Introduction
: A Distributed Shared Memory System Based on Multiple Java Virtual Machines X. Chen and V.H. Allan Computer Science Department, Utah State University 1998 : Introduction Built on concurrency supported
More informationKazunori Ogata, Dai Mikurube, Kiyokuni Kawachiya and Tamiya Onodera
RT0854 Computer Science 13 pages Research Report April 27, 2009 A Study of Java s non-java Memory Kazunori Ogata, Dai Mikurube, Kiyokuni Kawachiya and Tamiya Onodera IBM Research, Tokyo Research Laboratory
More informationFiji VM Safety Critical Java
Fiji VM Safety Critical Java Filip Pizlo, President Fiji Systems Inc. Introduction Java is a modern, portable programming language with wide-spread adoption. Goal: streamlining debugging and certification.
More informationSpeculative Synchronization: Applying Thread Level Speculation to Parallel Applications. University of Illinois
Speculative Synchronization: Applying Thread Level Speculation to Parallel Applications José éf. Martínez * and Josep Torrellas University of Illinois ASPLOS 2002 * Now at Cornell University Overview Allow
More informationMultiprocessor Systems. Chapter 8, 8.1
Multiprocessor Systems Chapter 8, 8.1 1 Learning Outcomes An understanding of the structure and limits of multiprocessor hardware. An appreciation of approaches to operating system support for multiprocessor
More informationHyperthreading Technology
Hyperthreading Technology Aleksandar Milenkovic Electrical and Computer Engineering Department University of Alabama in Huntsville milenka@ece.uah.edu www.ece.uah.edu/~milenka/ Outline What is hyperthreading?
More informationWhy Nothing Matters: The Impact of Zeroing
Why Nothing Matters: The Impact of Zeroing Xi Yang, Stephen M. Blackburn, Daniel Frampton, Jennifer B. Sartor, Kathryn S. McKinley Australian National University EPFL Microsoft Research University of Texas
More informationHeuristics for Profile-driven Method- level Speculative Parallelization
Heuristics for Profile-driven Method- level John Whaley and Christos Kozyrakis Stanford University Speculative Multithreading Speculatively parallelize an application Uses speculation to overcome ambiguous
More informationAssessing the Scalability of Garbage Collectors on Many Cores
Assessing the Scalability of Garbage Collectors on Many Cores Lokesh Gidra Gaël Thomas Julien Sopena Marc Shapiro Regal-LIP6/INRIA Université Pierre et Marie Curie, 4 place Jussieu, Paris, France firstname.lastname@lip6.fr
More informationA Quantitative Evaluation of the Contribution of Native Code to Java Workloads
A Quantitative Evaluation of the Contribution of Native Code to Java Workloads Walter Binder University of Lugano Switzerland walter.binder@unisi.ch Jarle Hulaas, Philippe Moret EPFL Switzerland {jarle.hulaas,philippe.moret}@epfl.ch
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More information