The Bulk Multicore Architecture for Programmability


The Bulk Multicore Architecture for Programmability Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu

Acknowledgments Key contributors: Luis Ceze Calin Cascaval James Tuck Pablo Montesinos Wonsun Ahn Milos Prvulovic Pin Zhou YY Zhou Jose Martinez 2

Arch Challenge: Enable a Programmable Environment Able to attain high efficiency while relieving the programmer from low-level tasks Helps minimize the chance of (parallel) programming errors 3

The Bulk Multicore [CommACM 09] General-purpose multicore for programmability http://iacoma.cs.uiuc.edu/bulkmulticore.pdf Novel scalable cache-coherent shared memory (signatures & chunks) Relieves the programmer/runtime from managing shared data High-performance sequential memory consistency Provides a more SW-friendly environment HW primitives for a low-overhead program development & debugging environment (data-race detection, deterministic replay, address disambiguation) Helps reduce the chance of parallel programming errors Overhead low enough to be on during production runs 4

The Bulk Multicore Idea: Eliminate committing individual instructions one at a time Mechanism: By default, processors commit chunks of instructions at a time (e.g. 2,000 dynamic instrs) Chunks execute atomically and in isolation (using buffering and undo) Memory effects of chunks summarized in HW signatures Chunks invisible to SW Advantages over current designs: Higher programmability Higher performance Simpler processor hardware The Bulk Multicore 5

Rest of the Talk The Bulk Multicore How it improves programmability 6

Hardware Mechanism: Signatures [ISCA06] Hardware accumulates the addresses read/written in signatures Read and Write signatures Summarize the footprint of a Chunk of code 7

Signature Operations In Hardware 8 Inexpensive Operations on Groups of Addresses
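A signature can be pictured as a Bloom filter over the addresses a chunk touches: inserts set bits, and intersection/union become bitwise AND/OR. A toy Python sketch follows; the filter size and hash functions are illustrative, not the hardware's actual encoding.

```python
class Signature:
    """Toy Bloom-filter signature: may report false positives
    (addresses that were never inserted) but never false negatives --
    exactly the conservative property chunk conflict checks need."""
    SIZE = 1024  # number of bits in the filter (illustrative)

    def __init__(self):
        self.bits = 0

    def insert(self, addr):
        # Two simple hashes stand in for the hardware's permute/hash stages.
        for h in (addr % self.SIZE, (addr * 2654435761) % self.SIZE):
            self.bits |= 1 << h

    def intersects(self, other):
        # Nonzero AND means the two footprints MAY overlap.
        return (self.bits & other.bits) != 0

    def union(self, other):
        s = Signature()
        s.bits = self.bits | other.bits
        return s
```

For example, a write signature built from addresses {0xB, 0xC} intersects a read signature built from {0xB}, but not one built from a disjoint address, so group-of-address operations reduce to single bitwise instructions.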

Executing Chunks Atomically & In Isolation: Simple! Thread 0 runs a chunk (ld X, st B, st C, ld Y) with R0 = sig(x,y) and W0 = sig(b,c); Thread 1 runs a chunk (ld B, st T, ld C) with R1 = sig(b,c) and W1 = sig(t). When Thread 0's chunk commits, the hardware checks the intersections (W0 ∩ R1) and (W0 ∩ W1); a nonempty intersection squashes the conflicting chunk 9

Chunk Operation + Signatures: Bulk [ISCA07] Execute each chunk atomically and in isolation A (distributed) arbiter ensures a total order of chunk commits (Logical picture: each processor P1...PN issues its loads and stores in chunks, and the chunks interleave into memory in a single total order) Supports Sequential Consistency [Lamport79]: Low hardware complexity: need not snoop the load buffer for consistency High performance: instructions are fully reordered by the HW; loads and stores can reach the signatures in any order Fences are no-ops 10
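The commit-time conflict check can be sketched in a few lines of Python, with exact address sets standing in for signatures (a hypothetical simplification: real signatures are lossy and may flag extra, false-positive conflicts).

```python
def commit_check(w_commit, running_chunks):
    """Given the write set of a committing chunk and a list of still-running
    chunks (each a dict with 'id', read set 'R', write set 'W'), return the
    ids of chunks that must be squashed and re-executed: those whose reads
    or writes overlap the committing chunk's writes."""
    return [c["id"] for c in running_chunks
            if w_commit & (c["R"] | c["W"])]
```

Using the slide's example, a committing chunk with W0 = {B, C} squashes a running chunk with R1 = {B, C}, while a chunk with a disjoint footprint commits untouched.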

Summary: Benefits of Bulk Multicore Gains in HW simplicity, performance, and programmability Hardware simplicity: Memory consistency support moved away from the core Toward commodity cores Easy to plug in accelerators High performance: HW reorders accesses heavily (intra- and inter-chunk) 11

Benefits of Bulk Multicore (II) High programmability: Invisible to the programming model/language Supports Sequential Consistency (SC) * Software correctness tools assume SC Enables novel always-on debugging techniques * Only keep per-chunk state, not per-load/store state * Deterministic replay of parallel programs with no log * Data race detection at production-run speed 12

Benefits of Bulk Multicore (III) Extension: Signatures visible to SW through ISA Enables pervasive monitoring Enables novel compiler opts Many novel programming/compiler/tool opportunities 13

Rest of the Talk The Bulk Multicore How it improves programmability 14

Supports Sequential Consistency (SC) Correctness tools assume SC: Verification tools that prove software correctness Under SC, semantics for data races are clear: Easy specifications for safe languages Much easier to debug parallel codes (and design debuggers) Works with hand-crafted synchronization 15

Deterministic Replay of MP Execution During Execution: HW records into a log the order of dependences between threads The log has captured the interleaving of threads During Replay: Re-run the program Enforcing the dependence orders in the log 16

Conventional Schemes: each processor logs individual cross-thread dependences. For example, P2's log records that P1's write Wa (at instruction count n1) precedes P2's read Ra (at count m1), and that P1's Wb (n2) precedes P2's Wb (m2): entries (P1, n1, m1) and (P1, n2, m2). Potentially large logs 17

Bulk: Necessary Log is Minuscule [ISCA08] During Execution: Commit the instructions in chunks, not individually (e.g., chunk 1 on P1: Wa, Wb, Rc; chunk 2 on P2: Wa, Rb, Wc) The combined log of all procs only records the chunk commit order: P1, P2, ... Pi If we fix the chunk commit interleaving: Combined Log = NIL 18

Data Race Detection at Production-Run Speed [ISCA03] Synchronization (e.g., Lock L ... Unlock L in each thread) orders some chunks. If we detect communication between: Ordered chunks: not a data race Unordered chunks: data race 19

Different Synchronization Ops The same check handles locks (Lock L / Unlock L), flags (Set F / Wait F), and barriers 20
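The chunk-level race check can be sketched as follows, with exact address sets standing in for signatures and an explicit happens-before relation standing in for the ordering that locks, flags, and barriers impose (names are illustrative).

```python
def communicates(c1, c2):
    """Two chunks communicate if their footprints overlap
    with at least one write involved."""
    return bool(c1["W"] & (c2["R"] | c2["W"]) or c2["W"] & c1["R"])

def is_data_race(c1, c2, happens_before):
    """Communication between chunks is a data race only if no
    synchronization orders one chunk before the other."""
    ordered = (c1["id"], c2["id"]) in happens_before or \
              (c2["id"], c1["id"]) in happens_before
    return communicates(c1, c2) and not ordered
```

For example, a chunk that writes x and a chunk that reads x race when unordered, but the same communication inside properly nested lock regions is ordered and therefore benign.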

Benefits of Bulk Multicore (III) Extension: Signatures visible to SW through ISA Enables pervasive monitoring [ISCA04] Support numerous watchpoints for free Enables novel compiler opts [ASPLOS08] Function memoization Loop-invariant code motion 21

Pervasive Monitoring: Attaching a Monitor Function to an Address Watch a memory location; trigger the monitoring function when it is accessed. The main program executes Watch(addr, usr_monitor); later, when an access such as *p = ... touches addr, the monitoring function usr_monitor(addr) {...} runs alongside the rest of the program 22
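A toy software model of the mechanism: in the iWatcher-style hardware, the watch set lives in a signature checked on every access for free; here a plain dict stands in for it (all names are illustrative).

```python
watchpoints = {}  # addr -> user-level monitor function

def watch(addr, monitor):
    """Register a monitor function on a memory location."""
    watchpoints[addr] = monitor

def access(addr):
    """Model one load/store: the hardware checks the watch set on every
    access; on a hit, the user-level monitor function runs."""
    if addr in watchpoints:
        return watchpoints[addr](addr)
    return None
```

Because membership is checked in hardware on every access, numerous watchpoints come at no slowdown, which is what makes the monitoring "pervasive".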

Enabling Novel Compiler Optimizations New instruction: Begin/End collecting addresses into sig 23


Instruction: Begin/End Disambiguation Against Sig 25

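The intent of these instructions can be sketched as: between a begin/end pair, collect the addresses a code region touches into a signature; afterwards, ask whether a later access may alias anything collected. A minimal Python sketch, with an exact set standing in for the lossy signature (class and method names are hypothetical):

```python
class CollectedSig:
    """Models a signature filled between begin/end collection instructions."""
    def __init__(self):
        self.addrs = set()  # exact set stands in for a lossy signature

    def collect(self, addr):
        # Executed for every access inside the begin/end region.
        self.addrs.add(addr)

    def may_alias(self, addr):
        # Disambiguate a later (possibly speculative) access against
        # the region's footprint; a hit means a possible alias.
        return addr in self.addrs
```

A real signature would answer `may_alias` conservatively (occasional false positives), which is safe: the compiler only needs a guaranteed "no" to keep a speculative optimization.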

Instruction: Begin/End Remote Disambiguation 27


Example Opt: Function Memoization Goal: skip the execution of functions whose outputs are known 29
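A toy sketch of signature-backed memoization: a call's cached result can be reused only while no store since the last execution has hit the addresses the function read (all names and structures are illustrative, not the paper's mechanism).

```python
memo = {}  # (fn_name, args) -> [result, read_set, valid_flag]

def call_memoized(name, args, execute):
    """execute() runs the real function and returns (result, read_set),
    where read_set models the read signature collected during the call."""
    entry = memo.get((name, args))
    if entry and entry[2]:
        return entry[0]  # skip execution: same args, inputs unchanged
    result, reads = execute()
    memo[(name, args)] = [result, set(reads), True]
    return result

def on_store(addr):
    # Every store disambiguates against each cached call's read signature;
    # a hit invalidates that cached result.
    for entry in memo.values():
        if addr in entry[1]:
            entry[2] = False
```

The signature makes the hard part, noticing when a memory input of the function changed, a cheap hardware membership test instead of a whole-program alias analysis.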

Example Opt: Loop-Invariant Code Motion 31

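Loop-invariant code motion under may-alias uncertainty can be sketched the same way: hoist the load out of the loop, watch the loop body's store addresses, and re-load only when a store may have aliased the hoisted address. A toy Python sketch (function names and the recovery policy are illustrative):

```python
def hoisted_loop(load, addr, body, n):
    """Hoist load(addr) out of an n-iteration loop under speculation.
    body(i, value) runs one iteration and returns the set of addresses it
    stored (modeling the collected write signature). On a possible alias,
    recover by re-loading the supposedly invariant value."""
    value = load(addr)  # hoisted load, presumed loop-invariant
    for i in range(n):
        stores = body(i, value)
        if addr in stores:   # signature hit: speculation failed
            value = load(addr)
    return value
```

In the common case the signature never hits and the load executes once instead of n times; the occasional false positive only costs a redundant re-load, never correctness.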

Summary: The Bulk Multicore for Years 2015-2018 128+ cores/chip, coherent shared memory (perhaps in groups) Simple HW with commodity cores Memory consistency checks moved away from the core High-performance shared-memory programming model Execution in programmer-transparent chunks Signatures for disambiguation, cache coherence, and compiler opts High programmability: Sequential consistency Sophisticated always-on development support Deterministic replay of parallel programs with no log (DeLorean) Data race detection for production runs (ReEnact) Pervasive program monitoring (iWatcher) 33
