MELD: Merging Execution- and Language-level Determinism. Joe Devietti, Dan Grossman, Luis Ceze University of Washington
|
|
- Hugo Mason
- 6 years ago
- Views:
Transcription
1 MELD: Merging Execution- and Language-level Determinism Joe Devietti, Dan Grossman, Luis Ceze University of Washington
2 CoreDet results overhead normalized to nondet 800% 600% 400% 200% 0% barnes blacksch lu radix streamcl Bergan et al., ASPLOS 10; Devietti et al. ASPLOS 11 2
3 CoreDet results overhead normalized to nondet 800% 600% 400% 200% 0% barnes blacksch lu radix streamcl Bergan et al., ASPLOS 10; Devietti et al. ASPLOS 11 2
4 determinism recipe isolate threads updates merge updates i. in a deterministic way ii. at deterministic times 3
5 determinism recipe use store isolate threads updates buffers to buffer updates merge updates i. in a deterministic way ii. at deterministic times 3
6 determinism recipe use store isolate threads updates buffers to buffer updates merge updates i. in a deterministic way parallel merge algorithm ii. at deterministic times 3
7 determinism recipe use store isolate threads updates buffers to buffer updates merge updates i. in a deterministic way parallel merge algorithm ii. at deterministic times count fixed # of instructions 3
8 determinism via store buffers parallel T 1 T 2 T 3 time 4
9 determinism via store buffers parallel parallel mode: buffer all stores, T 1 sync via [Olszewski et al, ASPLOS 09] T 2 T 3 time 4
10 determinism via store buffers parallel T 1 T 2 wr A rd A parallel mode: buffer all stores, sync via [Olszewski et al, ASPLOS 09] T 3 time 4
11 determinism via store buffers parallel T 1 T 2 wr A rd A parallel mode: buffer all stores, sync via [Olszewski et al, ASPLOS 09] T 3 time 4
12 determinism via store buffers parallel parallel mode: buffer all stores, T 1 wr A rd A sync via [Olszewski et al, ASPLOS 09] T 2 rd A T 3 time 4
13 determinism via store buffers parallel parallel mode: buffer all stores, T 1 wr A rd A sync via [Olszewski et al, ASPLOS 09] T 2 T 3 rd A wr A time 4
14 determinism via store buffers parallel parallel mode: buffer all stores, T 1 wr A rd A sync via [Olszewski et al, ASPLOS 09] T 2 rd A T 3 time 4
15 determinism via store buffers parallel parallel mode: buffer all stores, T 1 T 2 sync via [Olszewski et al, ASPLOS 09] commit mode: deterministically publish buffers T 3 time 4
16 determinism via store buffers parallel commit parallel mode: buffer all stores, T 1 T 2 sync via [Olszewski et al, ASPLOS 09] commit mode: deterministically publish buffers T 3 time 4
17 sources of overhead parallel commit T 1 T 2 T 3 time 5
18 sources of overhead parallel commit store buffer T 1 instrumentation T 2 T 3 time 5
19 sources of overhead parallel commit store buffer T 1 instrumentation T 2 imbalance T 3 time 5
20 sources of overhead parallel commit store buffer T 1 instrumentation T 2 imbalance T 3 time 5
21 MELD Merging Execution- and Language-level Determinism languagelevel executionlevel examples Jade, DPJ DMP, Kendo 6
22 MELD Merging Execution- and Language-level Determinism languagelevel executionlevel examples Jade, DPJ DMP, Kendo runtime overhead? none moderate-high 6
23 MELD Merging Execution- and Language-level Determinism languagelevel executionlevel examples Jade, DPJ DMP, Kendo runtime overhead? supports all code? none no moderate-high yes 6
24 MELD Merging Execution- and Language-level Determinism languagelevel executionlevel examples Jade, DPJ DMP, Kendo runtime overhead? supports all code? sequential semantics? none no yes moderate-high yes no 6
25 MELD Merging Execution- and Language-level Determinism languagelevel executionlevel hybrid examples Jade, DPJ DMP, Kendo MELD runtime overhead? none moderate-high low supports all code? no yes yes sequential semantics? yes no no 6
26 90% of execution time is spent in 10% of the code 7
27 90% of execution time is spent in 10% of the code often data-parallel 7
28 program composition 8
29 program composition locks condition variables queues flags pointers privatization 8
30 program composition locks condition variables queues flags pointers privatization regular data parallel computation 8
31 program composition locks condition variables queues flags pointers privatization regular data parallel computation 8
32 program composition execution-level locks condition variables determinism queues flags pointers privatization regular data parallel computation 8
33 program composition execution-level locks condition variables determinism queues flags language-level determinism pointers privatization regular data parallel computation 8
34 program composition execution-level locks condition variables determinism queues flags language-level determinism pointers privatization regular data parallel computation 8
35 what could possibly go wrong? mergesort(int* array) { // verified by det lang } 9
36 what could possibly go wrong? what other threads call mergesort concurrently? mergesort(int* array) { // verified by det lang } what aliases array? can other threads concurrently access array? 9
37 what could possibly go wrong? what other threads call mergesort concurrently? langdet mergesort(int* array) { // verified by det lang } what aliases array? can other threads concurrently access array? 9
38 compilation flow 10
39 compilation flow lightweight type qualifier system 10
40 compilation flow lightweight type qualifier system store buffer instrumentation 10
41 compilation flow lightweight type qualifier system store buffer instrumentation verifier (done manually) 10
42 compilation flow lightweight type qualifier system store buffer instrumentation verifier insn counting instrumentation (done manually) 10
43 example: radix int dest =...; // implicitly exdet 11
44 example: radix int dest =...; // implicitly exdet langdet int langdet _source = cast(source); 11
45 example: radix int dest =...; // implicitly exdet langdet int langdet _source = cast(source); BARRIER(); for (int i =...) { dest[complicated] = _source[i]; } BARRIER(); 11
46 experimental setup 8-core 2.4GHz Intel Nehalem, 10GB RAM C benchmarks from SPLASH2, PARSEC CoreDet compiler with consistency optimizations from [Devietti et al., ASPLOS 11] 12
47 MELD results overhead normalized to nondet 800% 600% 400% 200% 0% CoreDet MELD barnes blacksch lu radix streamcl 13
48 MELD results overhead normalized to nondet 800% 600% 400% 200% 0% CoreDet MELD barnes blacksch lu radix streamcl 13
49 MELD results overhead normalized to nondet 800% 600% 400% 200% 0% CoreDet MELD barnes blacksch lu radix streamcl 13
50 characterization workload LOC explicit sync ops (static) barnes blackscholes lu radix streamcluster
51 usability workload annotations casts barnes 6 7 blackscholes 8 0 lu 10 3 radix 2 2 streamcluster
52 future work build fully integrated system supporting nondeterminism via information flow tracking type system find gainful employment 16
53 Questions?
Deterministic Shared Memory Multiprocessing
Deterministic Shared Memory Multiprocessing Luis Ceze, University of Washington joint work with Owen Anderson, Tom Bergan, Joe Devietti, Brandon Lucia, Karin Strauss, Dan Grossman, Mark Oskin. Safe MultiProcessing
More informationExplicitly Parallel Programming with Shared Memory is Insane: At Least Make it Deterministic!
Explicitly Parallel Programming with Shared Memory is Insane: At Least Make it Deterministic! Joe Devietti, Brandon Lucia, Luis Ceze and Mark Oskin University of Washington Parallel Programming is Hard
More informationDMP Deterministic Shared Memory Multiprocessing
DMP Deterministic Shared Memory Multiprocessing University of Washington Joe Devietti, Brandon Lucia, Luis Ceze, Mark Oskin A multithreaded voting machine 2 thread 0 thread 1 while (more_votes) { load
More informationDeterministic Scaling
Deterministic Scaling Gabriel Southern Madan Das Jose Renau Dept. of Computer Engineering, University of California Santa Cruz {gsouther, madandas, renau}@soe.ucsc.edu http://masc.soe.ucsc.edu ABSTRACT
More informationRCDC: A Relaxed Consistency Deterministic Computer
RCDC: A Relaxed Consistency Deterministic Computer Joseph Devietti Jacob Nelson Tom Bergan Luis Ceze Dan Grossman University of Washington, Department of Computer Science & Engineering {devietti,nelson,tbergan,luisceze,djg}@cs.washington.edu
More informationDeterministic Process Groups in
Deterministic Process Groups in Tom Bergan Nicholas Hunt, Luis Ceze, Steven D. Gribble University of Washington A Nondeterministic Program global x=0 Thread 1 Thread 2 t := x x := t + 1 t := x x := t +
More informationDMP: DETERMINISTIC SHARED- MEMORY MULTIPROCESSING
... DMP: DETERMINISTIC SHARED- MEMORY MULTIPROCESSING... SHARED-MEMORY MULTICORE AND MULTIPROCESSOR SYSTEMS ARE NONDETERMINISTIC, WHICH FRUSTRATES DEBUGGING AND COMPLICATES TESTING OF MULTITHREADED CODE,
More informationIdentifying Ad-hoc Synchronization for Enhanced Race Detection
Identifying Ad-hoc Synchronization for Enhanced Race Detection IPD Tichy Lehrstuhl für Programmiersysteme IPDPS 20 April, 2010 Ali Jannesari / Walter F. Tichy KIT die Kooperation von Forschungszentrum
More informationDeterministic OpenMP for Race-Free Parallelism
Deterministic OpenMP for Race-Free Parallelism Amittai Aviram and Bryan Ford Yale University {amittai.aviram, bryan.ford}@yale.edu Recent deterministic parallel programming models show promise for their
More informationSpeculative Synchronization: Applying Thread Level Speculation to Parallel Applications. University of Illinois
Speculative Synchronization: Applying Thread Level Speculation to Parallel Applications José éf. Martínez * and Josep Torrellas University of Illinois ASPLOS 2002 * Now at Cornell University Overview Allow
More informationTERN: Stable Deterministic Multithreading through Schedule Memoization
TERN: Stable Deterministic Multithreading through Schedule Memoization Heming Cui, Jingyue Wu, Chia-che Tsai, Junfeng Yang Columbia University Appeared in OSDI 10 Nondeterministic Execution One input many
More informationEfficient and highly portable deterministic multithreading (DetLock)
Computing (2014) 96:1131 1147 DOI 10.1007/s00607-013-0370-9 Efficient and highly portable deterministic multithreading (DetLock) Hamid Mushtaq Zaid Al-Ars Koen Bertels Received: 22 March 2013 / Accepted:
More informationHybrid Static-Dynamic Analysis for Statically Bounded Region Serializability
Hybrid Static-Dynamic Analysis for Statically Bounded Region Serializability Aritra Sengupta, Swarnendu Biswas, Minjia Zhang, Michael D. Bond and Milind Kulkarni ASPLOS 2015, ISTANBUL, TURKEY Programming
More informationThe benefits and costs of writing a POSIX kernel in a high-level language
1 / 38 The benefits and costs of writing a POSIX kernel in a high-level language Cody Cutler, M. Frans Kaashoek, Robert T. Morris MIT CSAIL Should we use high-level languages to build OS kernels? 2 / 38
More informationDeterministic Parallel Programming
Deterministic Parallel Programming Concepts and Practices 04/2011 1 How hard is parallel programming What s the result of this program? What is data race? Should data races be allowed? Initially x = 0
More informationEfficient Data Race Detection for Unified Parallel C
P A R A L L E L C O M P U T I N G L A B O R A T O R Y Efficient Data Race Detection for Unified Parallel C ParLab Winter Retreat 1/14/2011" Costin Iancu, LBL" Nick Jalbert, UC Berkeley" Chang-Seo Park,
More informationEffective Data-Race Detection for the Kernel
Effective Data-Race Detection for the Kernel John Erickson, Madanlal Musuvathi, Sebastian Burckhardt, Kirk Olynyk Microsoft Research Presented by Thaddeus Czauski 06 Aug 2011 CS 5204 2 How do we prevent
More informationA Case for System Support for Concurrency Exceptions
A Case for System Support for Concurrency Exceptions Luis Ceze, Joseph Devietti, Brandon Lucia and Shaz Qadeer University of Washington {luisceze, devietti, blucia0a}@cs.washington.edu Microsoft Research
More informationEfficient Deterministic Multithreading Without Global Barriers
Efficient Deterministic Multithreading Without Global Barriers Kai Lu 1,2 Xu Zhou 1,2 Tom Bergan 3 Xiaoping Wang 1,2 1.Science and Technology on Parallel and Distributed Processing Laboratory, National
More informationChimera: Hybrid Program Analysis for Determinism
Chimera: Hybrid Program Analysis for Determinism Dongyoon Lee, Peter Chen, Jason Flinn, Satish Narayanasamy University of Michigan, Ann Arbor - 1 - * Chimera image from http://superpunch.blogspot.com/2009/02/chimera-sketch.html
More informationOptimize HPC - Application Efficiency on Many Core Systems
Meet the experts Optimize HPC - Application Efficiency on Many Core Systems 2018 Arm Limited Florent Lebeau 27 March 2018 2 2018 Arm Limited Speedup Multithreading and scalability I wrote my program to
More informationWritten Presentation: JoCaml, a Language for Concurrent Distributed and Mobile Programming
Written Presentation: JoCaml, a Language for Concurrent Distributed and Mobile Programming Nicolas Bettenburg 1 Universitaet des Saarlandes, D-66041 Saarbruecken, nicbet@studcs.uni-sb.de Abstract. As traditional
More informationNVthreads: Practical Persistence for Multi-threaded Applications
NVthreads: Practical Persistence for Multi-threaded Applications Terry Hsu*, Purdue University Helge Brügner*, TU München Indrajit Roy*, Google Inc. Kimberly Keeton, Hewlett Packard Labs Patrick Eugster,
More informationSpeculative Synchronization
Speculative Synchronization José F. Martínez Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu/martinez Problem 1: Conservative Parallelization No parallelization
More informationWhat is the Cost of Determinism?
What is the Cost of Determinism? Abstract Cedomir Segulja We analyze the fundamental performance impact of enforcing a fixed thread communication order to achieve deterministic execution. Our analysis
More informationSiloed Reference Analysis
Siloed Reference Analysis Xing Zhou 1. Objectives: Traditional compiler optimizations must be conservative for multithreaded programs in order to ensure correctness, since the global variables or memory
More informationDTHREADS: Efficient Deterministic Multithreading
DTHREADS: Efficient Deterministic Multithreading Tongping Liu Charlie Curtsinger Emery D. Berger Dept. of Computer Science University of Massachusetts, Amherst Amherst, MA 01003 {tonyliu,charlie,emery}@cs.umass.edu
More informationIntroduction Contech s Task Graph Representation Parallel Program Instrumentation (Break) Analysis and Usage of a Contech Task Graph Hands-on
Introduction Contech s Task Graph Representation Parallel Program Instrumentation (Break) Analysis and Usage of a Contech Task Graph Hands-on Exercises 2 Contech is An LLVM compiler pass to instrument
More informationDTHREADS: Efficient Deterministic
DTHREADS: Efficient Deterministic Multithreading Tongping Liu, Charlie Curtsinger, and Emery D. Berger Department of Computer Science University of Massachusetts, Amherst Amherst, MA 01003 {tonyliu,charlie,emery}@cs.umass.edu
More informationfor Task-Based Parallelism
for Task-Based Parallelism School of EEECS, Queen s University of Belfast July Table of Contents 2 / 24 Thread-Based Programming Hard for programmer to reason about thread interleavings and concurrent
More informationIntroduction to parallel Computing
Introduction to parallel Computing VI-SEEM Training Paschalis Paschalis Korosoglou Korosoglou (pkoro@.gr) (pkoro@.gr) Outline Serial vs Parallel programming Hardware trends Why HPC matters HPC Concepts
More informationRelaxed Memory Consistency
Relaxed Memory Consistency Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)
More informationCost of Concurrency in Hybrid Transactional Memory. Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University)
Cost of Concurrency in Hybrid Transactional Memory Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University) 1 Transactional Memory: a history Hardware TM Software TM Hybrid TM 1993 1995-today
More informationVulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically
Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically Abdullah Muzahid, Shanxiang Qi, and University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu MICRO December
More informationOPERATING SYSTEM TRANSACTIONS
OPERATING SYSTEM TRANSACTIONS Donald E. Porter, Owen S. Hofmann, Christopher J. Rossbach, Alexander Benn, and Emmett Witchel The University of Texas at Austin OS APIs don t handle concurrency 2 OS is weak
More informationAccelerating Dynamic Data Race Detection Using Static Thread Interference Analysis
Accelerating Dynamic Data Race Detection Using Static Thread Interference Peng Di and Yulei Sui School of Computer Science and Engineering The University of New South Wales 2052 Sydney Australia March
More informationChí Cao Minh 28 May 2008
Chí Cao Minh 28 May 2008 Uniprocessor systems hitting limits Design complexity overwhelming Power consumption increasing dramatically Instruction-level parallelism exhausted Solution is multiprocessor
More informationDeterministic Replay and Data Race Detection for Multithreaded Programs
Deterministic Replay and Data Race Detection for Multithreaded Programs Dongyoon Lee Computer Science Department - 1 - The Shift to Multicore Systems 100+ cores Desktop/Server 8+ cores Smartphones 2+ cores
More informationRecovering Disk Storage Metrics from low level Trace events
Recovering Disk Storage Metrics from low level Trace events Progress Report Meeting May 05, 2016 Houssem Daoud Michel Dagenais École Polytechnique de Montréal Laboratoire DORSAL Agenda Introduction and
More informationScalable Software Transactional Memory for Chapel High-Productivity Language
Scalable Software Transactional Memory for Chapel High-Productivity Language Srinivas Sridharan and Peter Kogge, U. Notre Dame Brad Chamberlain, Cray Inc Jeffrey Vetter, Future Technologies Group, ORNL
More informationDBT Tool. DBT Framework
Thread-Safe Dynamic Binary Translation using Transactional Memory JaeWoong Chung,, Michael Dalton, Hari Kannan, Christos Kozyrakis Computer Systems Laboratory Stanford University http://csl.stanford.edu
More informationExplicitly Parallel Programming with Shared-Memory is Insane: At Least Make it Deterministic!
Explicitly Parallel Programming with Shared-Memory is Insane: At Least Make it Deterministic! Joe Devietti, Brandon Lucia, Luis Ceze and Mark Oskin Department of Computer Science and Engineering University
More informationNDSeq: Runtime Checking for Nondeterministic Sequential Specs of Parallel Correctness
EECS Electrical Engineering and Computer Sciences P A R A L L E L C O M P U T I N G L A B O R A T O R Y NDSeq: Runtime Checking for Nondeterministic Sequential Specs of Parallel Correctness Jacob Burnim,
More informationNon-Blocking Inter-Partition Communication with Wait-Free Pair Transactions
Non-Blocking Inter-Partition Communication with Wait-Free Pair Transactions Ethan Blanton and Lukasz Ziarek Fiji Systems, Inc. October 10 th, 2013 WFPT Overview Wait-Free Pair Transactions A communication
More informationCSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors
More informationThe Bulk Multicore Architecture for Programmability
The Bulk Multicore Architecture for Programmability Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu Acknowledgments Key contributors: Luis Ceze Calin
More informationBlurred Persistence in Transactional Persistent Memory
Blurred Persistence in Transactional Persistent Memory Youyou Lu, Jiwu Shu, Long Sun Tsinghua University Overview Problem: high performance overhead in ensuring storage consistency of persistent memory
More informationLeveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures. Joy Arulraj, Guoliang Jin and Shan Lu
Leveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures Joy Arulraj, Guoliang Jin and Shan Lu Production-Run Failure Diagnosis Goal Figure out root cause of failure on
More informationproceedings (OOPSLA 2009 and POPL 2011). As to that work only, the following notice applies: Copyright
c 2010 Robert L. Bocchino Jr. Chapters 3 and 5 are derived from work published in ACM conference proceedings (OOPSLA 2009 and POPL 2011). As to that work only, the following notice applies: Copyright c
More informationNePaLTM: Design and Implementation of Nested Parallelism for Transactional Memory Systems
NePaLTM: Design and Implementation of Nested Parallelism for Transactional Memory Systems Haris Volos 1, Adam Welc 2, Ali-Reza Adl-Tabatabai 2, Tatiana Shpeisman 2, Xinmin Tian 2, and Ravi Narayanaswamy
More informationImproving the Practicality of Transactional Memory
Improving the Practicality of Transactional Memory Woongki Baek Electrical Engineering Stanford University Programming Multiprocessors Multiprocessor systems are now everywhere From embedded to datacenter
More informationSoftware-Controlled Multithreading Using Informing Memory Operations
Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University
More informationSamsara: Efficient Deterministic Replay in Multiprocessor. Environments with Hardware Virtualization Extensions
Samsara: Efficient Deterministic Replay in Multiprocessor Environments with Hardware Virtualization Extensions Shiru Ren, Le Tan, Chunqi Li, Zhen Xiao, and Weijia Song June 24, 2016 Table of Contents 1
More informationDynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle
Dynamic Fine Grain Scheduling of Pipeline Parallelism Presented by: Ram Manohar Oruganti and Michael TeWinkle Overview Introduction Motivation Scheduling Approaches GRAMPS scheduling method Evaluation
More informationMajor Project, CSD411
Major Project, CSD411 Deterministic Execution In Multithreaded Applications Supervisor Prof. Sorav Bansal Ashwin Kumar 2010CS10211 Himanshu Gupta 2010CS10220 1 Deterministic Execution In Multithreaded
More informationFlexBulk: Intelligently Forming Atomic Blocks in Blocked-Execution Multiprocessors to Minimize Squashes
FlexBulk: Intelligently Forming Atomic Blocks in Blocked-Execution Multiprocessors to Minimize Squashes Rishi Agarwal and Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu
More informationEnforcing Textual Alignment of
Parallel Hardware Parallel Applications IT industry (Silicon Valley) Parallel Software Users Enforcing Textual Alignment of Collectives using Dynamic Checks and Katherine Yelick UC Berkeley Parallel Computing
More informationPipelining. CS701 High Performance Computing
Pipelining CS701 High Performance Computing Student Presentation 1 Two 20 minute presentations Burks, Goldstine, von Neumann. Preliminary Discussion of the Logical Design of an Electronic Computing Instrument.
More informationCache-Aware Lock-Free Queues for Multiple Producers/Consumers and Weak Memory Consistency
Cache-Aware Lock-Free Queues for Multiple Producers/Consumers and Weak Memory Consistency Anders Gidenstam Håkan Sundell Philippas Tsigas School of business and informatics University of Borås Distributed
More informationLightweight Fault Detection in Parallelized Programs
Lightweight Fault Detection in Parallelized Programs Li Tan UC Riverside Min Feng NEC Labs Rajiv Gupta UC Riverside CGO 13, Shenzhen, China Feb. 25, 2013 Program Parallelization Parallelism can be achieved
More informationLDetector: A low overhead data race detector for GPU programs
LDetector: A low overhead data race detector for GPU programs 1 PENGCHENG LI CHEN DING XIAOYU HU TOLGA SOYATA UNIVERSITY OF ROCHESTER 1 Data races in GPU Introduction & Contribution Impact correctness
More informationSoftware Transactions: A Programming-Languages Perspective
Software Transactions: A Programming-Languages Perspective Dan Grossman University of Washington 5 December 2006 A big deal Research on software transactions broad Programming languages PLDI, POPL, ICFP,
More informationReview: Creating a Parallel Program. Programming for Performance
Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)
More informationTowards Fair and Efficient SMP Virtual Machine Scheduling
Towards Fair and Efficient SMP Virtual Machine Scheduling Jia Rao and Xiaobo Zhou University of Colorado, Colorado Springs http://cs.uccs.edu/~jrao/ Executive Summary Problem: unfairness and inefficiency
More informationOPERATING SYSTEM SUPPORT FOR REDUNDANT MULTITHREADING. Björn Döbel (TU Dresden)
OPERATING SYSTEM SUPPORT FOR REDUNDANT MULTITHREADING Björn Döbel (TU Dresden) Brussels, 02.02.2013 Hardware Faults Radiation-induced soft errors Mainly an issue in avionics+space 1 DRAM errors in large
More informationOutline 1 Motivation 2 Theory of a non-blocking benchmark 3 The benchmark and results 4 Future work
Using Non-blocking Operations in HPC to Reduce Execution Times David Buettner, Julian Kunkel, Thomas Ludwig Euro PVM/MPI September 8th, 2009 Outline 1 Motivation 2 Theory of a non-blocking benchmark 3
More informationParallel Computer Architecture Spring Memory Consistency. Nikos Bellas
Parallel Computer Architecture Spring 2018 Memory Consistency Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture 1 Coherence vs Consistency
More informationDeLorean: Recording and Deterministically Replaying Shared Memory Multiprocessor Execution Efficiently
: Recording and Deterministically Replaying Shared Memory Multiprocessor Execution Efficiently, Luis Ceze* and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign
More informationTransaction Memory for Existing Programs Michael M. Swift Haris Volos, Andres Tack, Shan Lu, Adam Welc * University of Wisconsin-Madison, *Intel
Transaction Memory for Existing Programs Michael M. Swift Haris Volos, Andres Tack, Shan Lu, Adam Welc * University of Wisconsin-Madison, *Intel Where do TM programs ome from?! Parallel benchmarks replacing
More informationHeckaton. SQL Server's Memory Optimized OLTP Engine
Heckaton SQL Server's Memory Optimized OLTP Engine Agenda Introduction to Hekaton Design Consideration High Level Architecture Storage and Indexing Query Processing Transaction Management Transaction Durability
More informationSupport Tools for Porting Legacy Applications to Multicore. Natsuki Kawai, Yuri Ardila, Takashi Nakamura, Yosuke Tamura
Support Tools for Porting Legacy Applications to Multicore Natsuki Kawai, Yuri Ardila, Takashi Nakamura, Yosuke Tamura Agenda Introduction PEMAP: Performance Estimator for MAny core Processors The overview
More informationTransactional Memory. Concurrency unlocked Programming. Bingsheng Wang TM Operating Systems
Concurrency unlocked Programming Bingsheng Wang TM Operating Systems 1 Outline Background Motivation Database Transaction Transactional Memory History Transactional Memory Example Mechanisms Software Transactional
More informationInstantCheck: Checking the Determinism of Parallel Programs Using On-the-fly Incremental Hashing
InstantCheck: Checking the Determinism of Parallel Programs Using On-the-fly Incremental Hashing Adrian Nistor University of Illinois at Urbana-Champaign, USA nistor1@illinois.edu Darko Marinov University
More information11/19/2013. Imperative programs
if (flag) 1 2 From my perspective, parallelism is the biggest challenge since high level programming languages. It s the biggest thing in 50 years because industry is betting its future that parallel programming
More informationPotential violations of Serializability: Example 1
CSCE 6610:Advanced Computer Architecture Review New Amdahl s law A possible idea for a term project Explore my idea about changing frequency based on serial fraction to maintain fixed energy or keep same
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationTyped Assembly Language for Implementing OS Kernels in SMP/Multi-Core Environments with Interrupts
Typed Assembly Language for Implementing OS Kernels in SMP/Multi-Core Environments with Interrupts Toshiyuki Maeda and Akinori Yonezawa University of Tokyo Quiz [Environment] CPU: Intel Xeon X5570 (2.93GHz)
More informationThread Tailor Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications
Thread Tailor Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications Janghaeng Lee, Haicheng Wu, Madhumitha Ravichandran, Nathan Clark Motivation Hardware Trends Put more cores
More informationGPUfs: Integrating a file system with GPUs
GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Building systems with GPUs is hard. Why? 2 Goal of
More informationRace Analysis for SystemC using Model Checking
Race Analysis for SystemC using Model Checking Nicolas Blanc, Daniel Kroening www.cprover.org/scoot Supported by Intel and SRC Department of Computer Science ETH Zürich Wednesday, 19 November 2008 This
More informationKey-value store with eventual consistency without trusting individual nodes
basementdb Key-value store with eventual consistency without trusting individual nodes https://github.com/spferical/basementdb 1. Abstract basementdb is an eventually-consistent key-value store, composed
More informationApproximate Program Synthesis
Approximate Program Synthesis James Bornholt Emina Torlak Luis Ceze Dan Grossman University of Washington Writing approximate programs is hard Precise Implementation Writing approximate programs is hard
More informationApplication Programming
Multicore Application Programming For Windows, Linux, and Oracle Solaris Darryl Gove AAddison-Wesley Upper Saddle River, NJ Boston Indianapolis San Francisco New York Toronto Montreal London Munich Paris
More informationShared Memory Programming with OpenMP. Lecture 8: Memory model, flush and atomics
Shared Memory Programming with OpenMP Lecture 8: Memory model, flush and atomics Why do we need a memory model? On modern computers code is rarely executed in the same order as it was specified in the
More informationArrakis: The Operating System is the Control Plane
Arrakis: The Operating System is the Control Plane Simon Peter, Jialin Li, Irene Zhang, Dan Ports, Doug Woos, Arvind Krishnamurthy, Tom Anderson University of Washington Timothy Roscoe ETH Zurich Building
More informationBei Wang, Dmitry Prohorov and Carlos Rosales
Bei Wang, Dmitry Prohorov and Carlos Rosales Aspects of Application Performance What are the Aspects of Performance Intel Hardware Features Omni-Path Architecture MCDRAM 3D XPoint Many-core Xeon Phi AVX-512
More informationOur Many-core Benchmarks Do Not Use That Many Cores Paul D. Bryan Jesse G. Beu Thomas M. Conte Paolo Faraboschi Daniel Ortega
Our Many-core Benchmarks Do Not Use That Many Cores Paul D. Bryan Jesse G. Beu Thomas M. Conte Paolo Faraboschi Daniel Ortega Georgia Institute of Technology HP Labs, Exascale Computing Lab {paul.bryan,
More informationEffective Performance Measurement and Analysis of Multithreaded Applications
Effective Performance Measurement and Analysis of Multithreaded Applications Nathan Tallent John Mellor-Crummey Rice University CSCaDS hpctoolkit.org Wanted: Multicore Programming Models Simple well-defined
More informationLock Prediction. Brandon Lucia Joseph Devietti Tom Bergan Luis Ceze Dan Grossman
Lock Prediction Brandon Lucia Joseph Devietti Tom Bergan Luis Ceze Dan Grossman University of Washington, Computer Science and Engineering http://sampa.cs.washington.edu Abstract This paper makes the case
More informationHPMMAP: Lightweight Memory Management for Commodity Operating Systems. University of Pittsburgh
HPMMAP: Lightweight Memory Management for Commodity Operating Systems Brian Kocoloski Jack Lange University of Pittsburgh Lightweight Experience in a Consolidated Environment HPC applications need lightweight
More informationParallelization Libraries: Characterizing and Reducing Overheads
Parallelization Libraries: Characterizing and Reducing Overheads ABHISHEK BHATTACHARJEE, RutgersUniversity GILBERTO CONTRERAS, Nvidia Corporation and MARGARET MARTONOSI, Princeton University Creating efficient,
More informationMohamed M. Saad & Binoy Ravindran
Mohamed M. Saad & Binoy Ravindran VT-MENA Program Electrical & Computer Engineering Department Virginia Polytechnic Institute and State University TRANSACT 11 San Jose, CA An operation (or set of operations)
More informationGPUfs: Integrating a file system with GPUs
GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications OS CPU
More informationCS 179: GPU Programming LECTURE 5: GPU COMPUTE ARCHITECTURE FOR THE LAST TIME
CS 179: GPU Programming LECTURE 5: GPU COMPUTE ARCHITECTURE FOR THE LAST TIME 1 Last time... GPU Memory System Different kinds of memory pools, caches, etc Different optimization techniques 2 Warp Schedulers
More informationNON-SPECULATIVE LOAD LOAD REORDERING IN TSO 1
NON-SPECULATIVE LOAD LOAD REORDERING IN TSO 1 Alberto Ros Universidad de Murcia October 17th, 2017 1 A. Ros, T. E. Carlson, M. Alipour, and S. Kaxiras, "Non-Speculative Load-Load Reordering in TSO". ISCA,
More informationMain-Memory Databases 1 / 25
1 / 25 Motivation Hardware trends Huge main memory capacity with complex access characteristics (Caches, NUMA) Many-core CPUs SIMD support in CPUs New CPU features (HTM) Also: Graphic cards, FPGAs, low
More informationGPUfs: Integrating a file system with GPUs
ASPLOS 2013 GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications
More informationParallel SimOS: Scalability and Performance for Large System Simulation
Parallel SimOS: Scalability and Performance for Large System Simulation Ph.D. Oral Defense Robert E. Lantz Computer Systems Laboratory Stanford University 1 Overview This work develops methods to simulate
More informationSTEPS Towards Cache-Resident Transaction Processing
STEPS Towards Cache-Resident Transaction Processing Stavros Harizopoulos joint work with Anastassia Ailamaki VLDB 2004 Carnegie ellon CPI OLTP workloads on modern CPUs 6 4 2 L2-I stalls L2-D stalls L1-I
More informationRAMP-White / FAST-MP
RAMP-White / FAST-MP Hari Angepat and Derek Chiou Electrical and Computer Engineering University of Texas at Austin Supported in part by DOE, NSF, SRC,Bluespec, Intel, Xilinx, IBM, and Freescale RAMP-White
More information