Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

Similar documents
Speculative Lock Elision:: Enabling Highly Concurrent Multithreaded Execution

Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II. Prof. Onur Mutlu Carnegie Mellon University 10/12/2012

Portland State University ECE 588/688. Cray-1 and Cray T3E

Fall 2012 Parallel Computer Architecture Lecture 15: Speculation I. Prof. Onur Mutlu Carnegie Mellon University 10/10/2012

Portland State University ECE 588/688. IBM Power4 System Microarchitecture

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

Lecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University

Transactional Lock-Free Execution of Lock-Based Programs

Implicit Transactional Memory in Chip Multiprocessors

SGI Challenge Overview

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems

LogTM: Log-Based Transactional Memory

Shared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network

EECS 570 Final Exam - SOLUTIONS Winter 2015

Multiprocessors and Locking

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

Speculative Synchronization

Lecture 21: Transactional Memory. Topics: Hardware TM basics, different implementations

Computer Science 146. Computer Architecture

Lecture 7: Implementing Cache Coherence. Topics: implementation details

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

CS533: Speculative Parallelization (Thread-Level Speculation)

Execution-based Prediction Using Speculative Slices

November 7, 2014 Prediction

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Understanding Hardware Transactional Memory

ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Computer Architecture

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

15-740/ Computer Architecture

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

Snoop-Based Multiprocessor Design III: Case Studies

Lecture: Consistency Models, TM. Topics: consistency models, TM intro (Section 5.6)

SELECTED TOPICS IN COHERENCE AND CONSISTENCY

Reducing Data Cache Energy Consumption via Cached Load/Store Queue

4 Chip Multiprocessors (I) Chip Multiprocessors (ACS MPhil) Robert Mullins

Superscalar Processors

Goldibear and the 3 Locks. Programming With Locks Is Tricky. More Lock Madness. And To Make It Worse. Transactional Memory: The Big Idea

10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

15-740/ Computer Architecture Lecture 14: Runahead Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/12/2011

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

Distributed Shared Memory and Memory Consistency Models

Hyperthreading Technology

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

5 Solutions. Solution a. no solution provided. b. no solution provided

Processor (IV) - advanced ILP. Hwansoo Han

Computer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014

Lecture: Consistency Models, TM

University of Toronto Faculty of Applied Science and Engineering

Speculative Versioning Cache: Unifying Speculation and Coherence

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero

Department of Electrical and Computer Engineering University of Wisconsin - Madison. ECE/CS 752 Advanced Computer Architecture I

Wide Instruction Fetch

GLocks: Efficient Support for Highly- Contended Locks in Many-Core CMPs

NOW Handout Page 1. Memory Consistency Model. Background for Debate on Memory Consistency Models. Multiprogrammed Uniprocessor Mem.

Like scalar processor Processes individual data items Item may be single integer or floating point number. - 1 of 15 - Superscalar Architectures

EE382 Processor Design. Processor Issues for MP

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Lecture 4: Directory Protocols and TM. Topics: corner cases in directory protocols, lazy TM

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

Chapter 5. Multiprocessors and Thread-Level Parallelism

Shared Memory Multiprocessors

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)

Speculative Synchronization: Applying Thread Level Speculation to Parallel Applications. University of Illinois

Lecture: Transactional Memory. Topics: TM implementations

Token Coherence. Milo M. K. Martin Dissertation Defense

CSCI 4717 Computer Architecture

Complexity Analysis of A Cache Controller for Speculative Multithreading Chip Multiprocessors

Transactional Memory. Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech

Chapter 18 Parallel Processing

Consistency Models. ECE/CS 757: Advanced Computer Architecture II. Readings. ECE 752: Advanced Computer Architecture I 1. Instructor:Mikko H Lipasti

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

CS 152, Spring 2011 Section 8

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011

CSE502: Computer Architecture CSE 502: Computer Architecture

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy

Handout 2 ILP: Part B

The Processor: Instruction-Level Parallelism

ECE 485/585 Microprocessor System Design

CHECKPOINT PROCESSING AND RECOVERY: AN EFFICIENT, SCALABLE ALTERNATIVE TO REORDER BUFFERS

Speculative Locks. Dept. of Computer Science

Temporally Silent Stores

CSC 631: High-Performance Computer Architecture

Chapter 4 The Processor (Part 4)

6 Transactional Memory. Robert Mullins

Lecture: Transactional Memory, Networks. Topics: TM implementations, on-chip networks

Handout 3 Multiprocessor and thread level parallelism

Lecture 8: Snooping and Directory Protocols. Topics: split-transaction implementation details, directory implementations (memory- and cache-based)

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors

Lecture 8: Transactional Memory TCC. Topics: lazy implementation (TCC)

Multicast Snooping: A Multicast Address Network. A New Coherence Method Using. With sponsorship and/or participation from. Mark Hill & David Wood

Outline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA

TRANSACTIONAL EXECUTION: TOWARD RELIABLE, HIGH- PERFORMANCE MULTITHREADING

Transcription:

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding Ack.: NSF grant CCR-9810114 and Intel Corporation The problem Locks Contention limits performance Contention limits scalability Cause unnecessary memory traffic Involve long latency memory accesses Programmers can use finer-grained locks Time-consuming Difficult and Error-prone Elision:the act or an instance of dropping out or omitting something Elide : to leave out of consideration 2001 Ravi Rajwar 2 The question The solution Can speculative hardware provide? Higher performance More scalability Without software changes Speculative Lock Elision Idea 1. Get lock without writing lock variable 2. Execute critical speculatively 3. Track data (blocks) read/written 4. Rollback when data conflicts occur Features Implement with small changes to processor Transparent to software Use existing coherence protocol for tracking data Non-conflicting critical s execute in parallel! 3 4 1

Outline Conservative lock use: thread safe hash table Speculative Thread-safe Lock hash Elision table concept Speculative Lock contention Lock Elision vs. data details conflict Speculative Lock Elision concept Concluding Speculative remarks Lock Elision details T i m e Thread 1 Thread 2 Lock: Free T1 T2 Hash Table X Y 5 6 Lock contention vs. Data conflict Data conflicts limit concurrency, not lock contention Ideal case: concurrent critical s Thread 1 Thread 2 Lock contention present No data conflicts Thread 1 and 2 Thread 1 Thread 2 Lock X Y T i m e Hash Table Lock: Free X Y Focus on data conflicts, not lock contention 7 If no data conflicts 8 2

Outline Speculative Lock Elision concept s Conceptual view of SLE Speculative Lock Elision details time s Provide atomicity Implemented using locks T1 T2 T1 T2 T1 T2 9 Correct Correct Conflicting overlap??? Don t know in advance, hence need a lock 10 time Conceptual view of SLE T1 T2 T1 T2 T1 T2 Identify candidate instruction sequences for atomic execution Execute critical speculatively Commit, if no data conflicts detected Outline Speculative Lock Elision concept Speculative Lock Elision details SLE details Identifying regions for atomic execution Buffering speculative state Detecting conflicts/recovering Committing speculative state Conditions for speculating Microarchitectural datapaths SLE execution 11 12 3

Identifying regions for atomic execution Key: Lock variables exhibit temporal silence Look for instruction sequences to execute atomically Locks provide a good hint Protect critical s Hardware normally cannot identify lock operations Can predict these operations Predict candidate instruction sequence for execution Exploit property of lock variables Exploit property of lock variables Lock variables exhibit temporal silence Lock release restores value of lock prior to lock acquire Acquire Release If (lock == UNHELD) lock := HELD lock := UNHELD Temporally Silent-pairs Speculating thread sees acquired lock Other threads see a free lock 13 14 SLE details SLE details SLE relies on patterns in instruction stream, not semantics Two local predictions Predict silent store-pairs (acquire and release) Identify regions for executing atomically Predict no data conflicts Locally verified Region may not be critical but simply match the pattern That s ok, correct execution is still guaranteed 1. Get lock without writing lock variable Save processor state: registers Elide write to lock 2. Execute critical speculatively Use speculative execution 3. Track data (blocks) read/written and detect conflicts Use cache coherence protocol 4. Rollback when data conflicts occur Restore processor state: registers Conceptually similar to database OCC Program order semantics guaranteed No dependence on semantics/software information 15 16 4

Buffering speculative state Register state No need to track inter-instruction dependences One checkpoint already available for restoring Speculatively retire instructions Memory state Augment write buffer to store speculative data Any other buffering technique ok Speculative data never exposed until commit Detecting data conflicts Data conflicts For any two threads, at least one is writing the data block Extend cache with speculative access bits Track data block accesses Coherence protocol provides functionality Writes to shared locations generate invalidations Zero cost detection Lock kept in shared state Automatically notified if anyone acquires lock Misspeculation may result in explicit lock acquisition 17 18 Recovering from misspeculation Recovery Treat like a branch misprediction Roll-back processor state Squash speculative entries in write buffer Coherence protocol does not care Restart threshold N Can re-attempt speculation until N failures Committing speculation Register state Simply discard checkpoint Memory state Write permissions for modified blocks already obtained Write requests (not data) exposed to outside world Write buffer is marked as having latest data Alternatively, can use cache to buffer data Coherence state transitions support present Speculative loads, exclusive prefetches Provides illusion of atomic commit of all writes 19 20 5

SLE summary Microarchitecture data-paths Conditions for starting speculation Appropriate starting instruction/value pattern detected Lock filter Silent pair/misspeculation detector Register checkpoint Speculative data Conditions for ending speculation Restoring store to lock detected Non-restoring store to lock detected Some operation cannot be performed speculatively Appropriate terminating pattern not detected Data conflict detected All operations between start and end appear atomic Front end Load/store queue write-buffer Data cache External memory interface Memory consistency Always correct to insert an atomic set of memory operations into a global order of operations under any memory model Speculative access bits 21 22 SLE execution Outline T i m e Thread 1 Thread 2 Hash Table Lock: Free X Speculative Lock Elision concept Speculative Lock Elision details Y 23 24 6

Micro-benchmark result CMP, N counters, N processors (one per counter), 1 lock 2 16 /N increments per processor Without SLE With SLE Application results CMP/SMP/DSM configurations for 8 and 16 processors Sun Gigaplane-type bus and SGI Origin 2000-type directory Splash/Splash2 applications used Detailed parameters in paper Execution cycle (mi 0 2 4 6 8 10 12 Processor Count 25 26 Outline Speculative Lock Elision concept Speculative Lock Elision details SLE advantages/limitations SLE summary SLE advantages Implementation + Uses well understood speculation techniques + Much functionality already present: coherence protocol + Transparent to programmers No software or instruction set support + No system level changes Completely in the micro-architecture No coherence protocol changes 27 Performance + Lock never written to No serialization on lock Kept locally in shared state + Concurrent execution + Reduced observed memory latencies + Reduced memory traffic 28 7

SLE limitations Extra data paths and hardware Lock acquired for misspeculation conditions other than conflicts Resource constraints, I/O, certain types of system calls Get same performance and behavior as without SLE Sensitive to restart threshold Coherence protocol interference SLE contributions Enables highly concurrent multithreaded execution Concurrently execute non-conflict critical s Validation without writing to/acquiring lock Simplifies correct multithreaded code development Programmers do not have to learn new techniques Easy implementation Does not require software or instruction set support Minor changes to modern processors Opens up optimization opportunities Can provide strong performance guarantees even with conflicts? 29 30 Simulation parameters for SMP Application performance Processor speed Reorder buffer Issue mechanism Branch predictor Write buffer L1 instruction cache L1 data cache L2 unified cache Line size Coherence protocol Address network Data network DRAM memory module Memory consistency 1 GHz (1 ns clock) 128 entry with a 64 entry LSQ Out-of-order issue/commit of 8 ops. per cycle 8-K entry combining predictor, 8-K 4-way BTB 64-entry 64-KB, 2-way, 1-cycle access, 8 pend. misses 128-KB, 4-way, write-back, 3 ports, 1-cycle access, 8 pen. misses 4-MB, 4-way, write-back, 12 cycle access, 16 pend. misses 64 bytes Sun Gigaplane-type MOESI protocol b/w L2s. Split transaction. Broadcast snooping, 20 ns per snoop, 120 outstanding transactions. Point-to-Point, pipelined, 70 ns transfer latency 8-bytes wide, ~70ns access time for 64 byte line Total Store Ordering More configurations in paper Sources of performance: Concurrent execution Reduced memory latencies Reduced memory traffic Normalized execution time 1.0 0.8 0.6 0.4 0.2 0.0 ocean radiosity water barnes cholesky mp3d non-lock portion lock accesses backup 31 backup 32 8

Related work Concurrency techniques Lamport Transactional memory Load-Linked/Store-conditional Database techniques Optimistic concurrency control Predicting critical s Need only detect atomic TEST&SET operation Load-locked/store-conditional sequence No need to detect TEST operations Track stores to address of LL/SC operand to detect lock release PC of store-conditional used to identify critical Similar approach for other primitives Speculative buffering ARB (Franklin & Sohi) SVC (Gopal et al.) Speculative Retirement (Ranganathan et al., Lai & Falsafi) Multiversion Memory (Sorin et al.) backup 33 backup 34 Nested critical s Does not matter Automatically handled Speculation un-aware of critical s Commit a sequence of instructions atomically The sequence can have nested critical s too Speculation does not care about locks or nesting Speculation simply looks for sequence to commit atomically For a dynamic instance of speculation All critical s nested within are executed normally backup 35 9