CS533: Speculative Parallelization (Thread-Level Speculation)


Josep Torrellas, University of Illinois at Urbana-Champaign
March 5, 2015 (CS533, Lecture 14)

Concepts
- Main idea: execute unsafe threads in parallel
- Typically, the threads come from a sequential application
- Hardware monitors for data dependence violations
  - If there is a violation: the offending thread is squashed and restarted
  - If there is no violation: the thread completes
- When all predecessor threads have completed, the thread becomes non-speculative (safe)
- When all predecessor threads have committed, the thread commits
- Note: threads commit one by one, in program order
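As a minimal software sketch of this lifecycle, the loop below executes a thread body speculatively, restarts it on a violation, and commits only once all predecessors have committed. All names (SpecThread, run_speculative, the commented-out commit step) are hypothetical; real TLS systems implement these steps in hardware, not as a library.

```cpp
// Minimal sketch of the per-thread TLS lifecycle: execute speculatively, restart
// on a violation, commit in order. All names are hypothetical; real systems
// implement this in hardware.
#include <atomic>

struct SpecThread {
    int  id;                 // position in sequential (program) order
    bool violated = false;   // set by the violation-detection mechanism
};

std::atomic<int> next_to_commit{0};   // enforces one-by-one, in-order commits

void run_speculative(SpecThread& t, void (*body)(SpecThread&)) {
    for (;;) {
        t.violated = false;
        body(t);                            // execute the thread's work speculatively
        if (t.violated) continue;           // squash: discard state and restart

        // Wait until all predecessors have committed; the thread is then
        // non-speculative (safe) and can merge its buffered state with memory.
        while (next_to_commit.load() != t.id) { /* spin */ }
        // commit_buffered_state(t);        // hypothetical commit of buffered writes
        next_to_commit.store(t.id + 1);     // let the next thread in order commit
        return;
    }
}
```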

Communication between Threads
- Threads in a CMP communicate with each other through:
  - Registers
  - Memory
- Register communication:
  - Needs hardware support between the processors
  - Dependences between threads are known by the compiler
  - Can be producer-initiated or consumer-initiated
  - If the consumer arrives first: the consumer stalls until the producer forwards the value
  - If the producer arrives first: the producer writes and continues; the consumer reads the value later
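The two forwarding cases can be mimicked in software with a promise/future pair for a register. This is only a loose analogy (reg_r1, producer_task, consumer_task are hypothetical names); the real mechanism is dedicated register-forwarding hardware between the processors.

```cpp
// Loose software analogy for the two register-forwarding cases, using a
// promise/future pair for register r1.
#include <future>

std::promise<int> reg_r1;                            // producer's side of register r1
std::future<int>  reg_r1_val = reg_r1.get_future();  // consumer's side of register r1

void producer_task() {
    int v = 42;             // last write to r1 within this task
    reg_r1.set_value(v);    // producer-first: write, forward, and keep going
}

void consumer_task() {
    int v = reg_r1_val.get();   // consumer-first: stalls here until the value arrives
    (void)v;                    // use the forwarded value
}
```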

Communication (II)
- Memory communication:
  - May cause thread squashing
  - Dependences are not known by the compiler
- Loads are done speculatively:
  - Get the data from the closest predecessor
  - Keep a record that the thread read the data (in the L1 cache or another structure)
- Stores are done speculatively:
  - Buffer the update while speculative (in a write buffer or the L1)
  - Check successors for premature reads
  - If a successor read prematurely: squash (typically the offending thread and all its successors)
- When a thread commits, its state is merged with main memory (the safe state): write back, or ask for ownership
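A rough software model of this bookkeeping is sketched below, assuming per-thread read sets and write buffers indexed by program order. All structures and names (ThreadState, spec_load, spec_store, memory_value) are hypothetical stand-ins for state that real designs keep in the L1 caches or per-thread write buffers.

```cpp
// Rough software model of the speculative load/store bookkeeping above.
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

struct ThreadState {
    std::unordered_set<uint64_t>      read_set;     // addresses read speculatively
    std::unordered_map<uint64_t, int> write_buffer; // buffered speculative stores
    bool violated = false;
};

std::vector<ThreadState> threads;   // index = position in program order
int memory_value(uint64_t addr);    // value in safe (committed) memory; placeholder

// Speculative load: take the value from the closest predecessor (or this thread)
// that wrote the address, else from safe memory, and record the read.
int spec_load(int tid, uint64_t addr) {
    threads[tid].read_set.insert(addr);
    for (int p = tid; p >= 0; --p) {
        auto it = threads[p].write_buffer.find(addr);
        if (it != threads[p].write_buffer.end()) return it->second;
    }
    return memory_value(addr);
}

// Speculative store: buffer the update, then mark as violated any successor that
// has already (prematurely) read the address, so it will be squashed.
void spec_store(int tid, uint64_t addr, int value) {
    threads[tid].write_buffer[addr] = value;
    for (int s = tid + 1; s < (int)threads.size(); ++s)
        if (threads[s].read_set.count(addr)) threads[s].violated = true;
}
```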

Dependence Violations
- Types of dependence violations (accesses performed out of order):
  - LD, then ST: name (anti) dependence; hardware may handle it
  - ST, then ST: name (output) dependence; hardware may handle it
  - ST, then LD: true dependence; causes a squash
- To handle name dependences, use multiple-version support: every speculative or non-speculative store to a location creates a new version
- Keeping and checking speculative versions:
  - Centralized scheme: ARB (paper by Sohi et al.)
  - Decentralized schemes: L1 caches or write buffers (paper by Krishnan et al.)

Centralized
[Figure: the non-speculative, Spec1, Spec2, and Spec3 roles rotate among the processors; speculative state is kept centrally in the ARB, next to the L1 cache]

Decentralized
[Figure: the non-speculative, Spec1, Spec2, and Spec3 roles rotate among the processors; speculative state is kept in the per-processor L1 caches, backed by a shared L2 cache]

Multiscalar Overview
- G. Sohi et al., "Multiscalar Processors," ISCA 1995
- Exploit implicit thread-level parallelism within a serial program
- The compiler divides the program into tasks
- Tasks are scheduled on independent processing resources
- Hardware handles register dependences between tasks
- Memory speculation handles memory dependences

Expandable Split-Window Paradigm
- Superscalar: a single centralized instruction window
- Multiscalar: multiple distributed windows
[Figure: several PCs, each pointing at a task carved out of the dynamic instruction stream]

Tasks
- A task is a subgraph of the control flow graph (CFG): e.g., a basic block, several basic blocks, a loop body, a function, ...
- Tasks are selected by the compiler and conveyed to the hardware
- Tasks are predicted and scheduled by the processor
- Tasks may have data and/or control dependences
[Figure: two tasks, Task A and Task B, carved out of the CFG]

Multiscalar Processor
[Figure 1: A possible microarchitecture of a Multiscalar processor: a sequencer (with head and tail pointers) and an icache feed processing units connected by a unidirectional ring; each processing element has its own register file; the data banks, the ARB, and the dcache sit behind an interconnect]

Compiler's Job
- Task selection: partition the CFG into tasks
  - Load balance
  - Minimize inter-task data dependences
  - Minimize inter-task control dependences, by embedding hard-to-predict branches within tasks
- Convey information in the executable
  - Task headers:
    - Create mask (1 bit per register): indicates all the registers that may be modified (created) by the task, so that instances received from prior tasks are not forwarded
    - PCs of the successor tasks
  - Insert special instructions and opcodes
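A hypothetical encoding of this task-header information is sketched below (the TaskHeader struct and must_wait_for helper are illustrative, not the actual Multiscalar binary format).

```cpp
// Hypothetical encoding of a task header: a create mask (one bit per
// architectural register the task may write) plus the PCs of the possible
// successor tasks.
#include <cstdint>
#include <vector>

struct TaskHeader {
    uint32_t create_mask = 0;          // bit r set => the task may create register r
    std::vector<uint64_t> successors;  // PCs of the possible successor tasks
};

// A later task uses a predecessor's create mask to decide where register `reg`
// comes from: if the predecessor may create it, wait for that forwarded instance;
// otherwise the instance received from earlier tasks is already the right one.
bool must_wait_for(const TaskHeader& predecessor, int reg) {
    return (predecessor.create_mask >> reg) & 1u;
}
```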

Special Opcode Bits and Instructions
- The compiler must identify the last instance of a register within a task
- Opcodes that write a register have an additional forward bit, indicating that the instance should be forwarded
- Stop bits indicate the end of a task
- A release instruction tells the PE to forward the (already written) value
[Figure: three example tasks; the last write to r1 in each task either carries the forward bit (LD.F r1) or is followed by a Rel r1 instruction (F: forward bit, Rel: release)]
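A small sketch of the resulting forwarding decision at a PE, under the assumption that an instance is forwarded exactly when the writing opcode carries the forward bit or a release instruction names the register (the DecodedInst fields are hypothetical):

```cpp
// Sketch of the PE's register-forwarding decision implied by the rules above.
#include <cstdint>

struct DecodedInst {
    bool writes_register = false;
    bool forward_bit     = false;   // F bit on an opcode that writes a register
    bool is_release      = false;   // Rel instruction
    int  reg             = 0;       // register named by the instruction
};

// True if executing `inst` should forward the current value of `inst.reg`
// to the successor tasks on the ring.
bool should_forward(const DecodedInst& inst) {
    return (inst.writes_register && inst.forward_bit) || inst.is_release;
}
```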

Task Sequencer
- Task prediction is analogous to branch prediction: predict the inter-task control flow
[Figure: three cases of control flow out of Task A: the branches a and b inside Task A are predicted implicitly by predicting the successor task; the successor Task B is control independent of those branches; a highly predictable inter-task branch selects between Task B and Task C]
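By analogy with a BTB-style branch predictor, a task successor predictor might look like the sketch below; the table size, indexing, and single-successor prediction are assumptions for illustration, not the actual Multiscalar sequencer.

```cpp
// A task successor predictor sketched by analogy with a BTB-style branch predictor.
#include <array>
#include <cstdint>

struct TaskPredictorEntry {
    uint64_t task_pc        = 0;   // PC of the task whose successor is predicted
    uint64_t predicted_succ = 0;   // last observed successor task PC
};

class TaskSequencer {
    std::array<TaskPredictorEntry, 1024> table{};
    static size_t index(uint64_t pc) { return (pc >> 2) & 1023; }
public:
    // Returns the predicted successor task PC, or 0 if there is no prediction.
    uint64_t predict(uint64_t task_pc) const {
        const TaskPredictorEntry& e = table[index(task_pc)];
        return (e.task_pc == task_pc) ? e.predicted_succ : 0;
    }
    // On task completion, record the successor that was actually taken.
    void update(uint64_t task_pc, uint64_t actual_succ) {
        table[index(task_pc)] = {task_pc, actual_succ};
    }
};
```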

Inter-task Dependences
- Control dependences:
  - Predict
  - Squash subsequent tasks on a misprediction
- Data dependences:
  - Register file: mask bits and forwarding (stall until the value is available)
  - Memory: address resolution buffer (speculative loads, squash on a violation)

Address Resolution Buffer (ARB)
- Multiscalar issues loads to the ARB/D-cache as soon as the address is computed
- Optimistic speculation: assume no prior unresolved stores to the same address
- The ARB is organized like a cache, maintaining state for all outstanding load/store addresses
- An ARB entry (one stage per task/PE; L: load performed, S: store performed, Data: store data):

  Tag | L S Data  | L S Data  | L S Data  | L S Data
      | (Stage 0) | (Stage 1) | (Stage 2) | (Stage 3)

ARB Loads and Stores
- Loads:
  - ARB miss: the data comes from the D-cache (no prior stores yet)
  - ARB hit: get the most recent data for the load, which may come from the D-cache or from the nearest prior stage with S = 1
- Stores:
  - The ARB buffers speculative stores
  - When the head pointer moves to PE i, commit all the stores in stage i to the D-cache
- The ARB performs memory renaming
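A software sketch of the ARB behavior on the last two slides: one entry with per-stage load/store bits and buffered data, plus the load lookup, the premature-read check on a store, and the commit of a stage. The D-cache and squash functions are hypothetical placeholders, and the field widths and four-stage count are illustrative assumptions.

```cpp
// Software sketch of an ARB entry and its load/store/commit behavior.
#include <array>
#include <cstdint>

struct ArbStage {
    bool     load_performed  = false;  // L bit: the task in this stage loaded the address
    bool     store_performed = false;  // S bit: the task in this stage stored to the address
    uint64_t data            = 0;      // buffered store data (valid if S is set)
};

struct ArbEntry {
    uint64_t tag = 0;                  // address tag, as in a cache
    std::array<ArbStage, 4> stages;    // Stage 0..3: one per task/PE
};

uint64_t dcache_read(uint64_t addr);    // placeholders for the D-cache interface
void     dcache_write(uint64_t addr, uint64_t data);
void     squash_from_stage(int stage);  // squash the task in `stage` and all its successors

// Load by the task in `stage`: take the data from the nearest prior (or own) stage
// with S = 1; on a miss, or if no prior store exists, the data comes from the D-cache.
uint64_t arb_load(ArbEntry& e, int stage, uint64_t addr) {
    e.stages[stage].load_performed = true;          // remember the speculative read
    for (int s = stage; s >= 0; --s)
        if (e.stages[s].store_performed) return e.stages[s].data;
    return dcache_read(addr);
}

// Store by the task in `stage`: buffer the data, then squash any later stage that
// already loaded this address (a premature read of a true dependence).
void arb_store(ArbEntry& e, int stage, uint64_t addr, uint64_t data) {
    (void)addr;                                     // the entry is already selected by tag
    e.stages[stage].store_performed = true;
    e.stages[stage].data = data;
    for (int s = stage + 1; s < (int)e.stages.size(); ++s)
        if (e.stages[s].load_performed) { squash_from_stage(s); break; }
}

// When the head pointer reaches stage i, its task is non-speculative: write the
// buffered store to the D-cache and clear the stage.
void arb_commit_stage(ArbEntry& e, int stage, uint64_t addr) {
    if (e.stages[stage].store_performed) dcache_write(addr, e.stages[stage].data);
    e.stages[stage] = ArbStage{};
}
```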

Summary: Multiscalar
- Software and hardware cooperate to extract parallelism
- Divide the CFG into tasks
- Speculatively execute the tasks
- Inter-task control dependences: prediction
- Inter-task data dependences:
  - Registers: create masks and forwarding
  - Memory: speculative loads, disambiguation by the ARB

A Chip Multiprocessor Architecture with Speculative Multithreading
V. Krishnan and J. Torrellas

Speculative Versioning
- A load must eventually read the value created by its most recent preceding store:
  - The load must be squashed and re-executed if it was performed before that store
  - All the stores that follow must remain buffered until the load is done
- A memory location must eventually hold the correct version: speculative versions must be committed in program order
- Information on the data read/written speculatively can be kept in:
  - The tags of the caches
  - A directory-like structure: the MDT

Memory Disambiguation Table (MDT)
- Sits on the bus in the chip multiprocessor; multibanked
- One entry per memory line, with per-processor load bits (L0-L3) and store bits (S0-S3):

  Valid | Line Address | L0 L1 L2 L3 | S0 S1 S2 S3
    1   | 0x1234740    |  0  0  1  0 |  0  1  0  0
    0   | 0x3348210    |             |
    1   | 0x4321760    |  0  0  1  1 |  0  1  0  0

- When a thread becomes non-speculative, its bits are cleared
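A sketch of the ST-to-LD violation check that these bits support, with one entry per line and per-processor load/store bits. The squash policy shown (stop at the first successor that has its own version; squash the first premature reader and its successors) and the four-processor count are illustrative assumptions, not the exact MDT algorithm.

```cpp
// Sketch of an MDT entry and the store-time violation check it supports.
#include <array>
#include <cstdint>

constexpr int kProcs = 4;

struct MdtEntry {
    bool     valid = false;
    uint64_t line_address = 0;
    std::array<bool, kProcs> load_bit{};    // L0..L3: processor read the line speculatively
    std::array<bool, kProcs> store_bit{};   // S0..S3: processor created a speculative version
};

void squash_from(int proc);  // placeholder: squashes `proc` and all its successors

// Speculative store by processor `p` (p = position in program order): record the new
// version, then squash the first successor that read the line prematurely, unless a
// successor already has its own version (which shields it and everyone after it).
void mdt_store(MdtEntry& e, int p) {
    e.store_bit[p] = true;
    for (int succ = p + 1; succ < kProcs; ++succ) {
        if (e.store_bit[succ]) break;                    // successor uses its own version
        if (e.load_bit[succ]) { squash_from(succ); break; }
    }
}

// When a thread becomes non-speculative, its bits in the entry are cleared.
void mdt_commit(MdtEntry& e, int p) {
    e.load_bit[p] = false;
    e.store_bit[p] = false;
}
```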