Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)


Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)
Hydra is a 4-core Chip Multiprocessor (CMP) microarchitecture/compiler effort at Stanford that provides hardware/software support for Data/Thread Level Speculation (TLS), extracting parallel speculated threads from sequential code (a single thread) augmented with software thread-speculation handlers. Stanford Hydra, discussed here, is one example of a TLS architecture. Other TLS architectures include:
- Wisconsin Multiscalar
- Carnegie-Mellon Stampede
- MIT M-machine
(Primary Hydra papers: 4, 6)
(Slide 1)

Motivation for Chip Multiprocessors (CMPs)
Chip Multiprocessors (CMPs) offer implementation benefits:
- High-speed signals are localized within individual CPUs.
- A proven CPU design is replicated across the die (including SMT processors, e.g., IBM Power 5).
- CMPs overcome the diminishing performance/transistor return problem (limited ILP) of single-threaded superscalar processors (a similar motivation drives SMT). Transistors today are used mostly for ILP extraction; CMPs instead use them to run multiple threads, exploiting thread-level parallelism (TLP):
  - On parallelized (multi-threaded) programs.
  - With multi-programmed workloads (multi-tasking): a number of single-threaded applications executing on different CPUs.
- Fast inter-processor communication (through the shared L2 cache) eases parallelization of code, although it is slower than communication between logical processors in SMT.
Potential drawback of CMPs: high power/heat using current VLSI processes, which results in a reduced CMP clock rate.
(Slide 2)

Stanford Hydra CMP Approach Goals
- Exploit all levels of program parallelism:
  - Coarse grain: on multiple CPU cores within a single CMP, or on multiple CMPs.
  - Fine grain: on multiple CPU cores within a single CMP using Thread Level Speculation (TLS), and within a single CPU core.
- Develop a single-chip multiprocessor architecture that simplifies microprocessor design and achieves high performance.
- Make the multiprocessor transparent to the average user.
- Integrate parallelizing compiler technology into the design of a microarchitecture that supports data/thread level speculation (TLS).
(Slide 3)

Hydra Prototype Overview
- 4 CPU cores with modified, private L1 caches.
- A speculative coprocessor for each processor core:
  - Speculative memory reference controller.
  - Speculative interrupt screening mechanism.
  - Statistics mechanisms for performance evaluation and for feedback during code tuning.
- Memory system:
  - Read and write buses.
  - Controllers for all resources.
  - On-chip shared L2 cache.
  - L2 speculation write buffers.
  - Simple off-chip main memory controller.
  - I/O and debugging interface.
(Slide 4)

The Basic Hydra CMP
- 4 processor cores and a shared secondary (L2) cache on one chip.
- 2 buses connect the processors and memory.
- Cache coherence: writes are broadcast on the write bus.
(Slide 5)

Hydra Memory Hierarchy Characteristics
(Table of L1/L2/main memory characteristics not reproduced here.)
(Slide 6)

Hydra Prototype Layout
- Private split L1 caches per core (I-L1 and D-L1).
- Shared L2 cache, with the L2 speculation write buffers.
- 250 MHz clock rate target, circa 1999.
(Slide 7)

CMP Parallel Performance
Varying levels of performance (results given here are without Thread Level Speculation, TLS):
1. Multiprogrammed workloads work well.
2. Very parallel applications (matrix-based FP and multimedia, with high data parallelism/LLP) are excellent.
3. Performance is acceptable only on a few of the less parallel (i.e., integer) general applications; these are the target applications for Thread Level Speculation (TLS).
(Slide 8)

The Parallelization Problem
- Current automated parallelization software (parallelizing compilers) is limited. Parallelizing compilers are generally successful for scientific applications with statically known dependencies (e.g., dense matrix computations with high data parallelism/LLP).
- Automated parallelization of general-purpose applications provides poor parallel performance, especially for integer applications, due to ambiguous dependencies resulting from:
  - Significant pointer use: pointer aliasing (the pointer disambiguation problem; see the sketch after this slide).
  - Dynamic loop limits.
  - Complex control flow.
  - Irregular array accesses.
  - Inter-procedural dependencies.
- Ambiguous dependencies limit the extracted parallelism/performance: they complicate static dependency analysis, introduce imprecision into dependence relations, and force conservative, performance-degrading synchronization to safely handle potential dependencies. Parallelism may exist in the algorithm, but the code hides it.
- Manual parallelization can provide good performance on a much wider range of applications, but it requires a different initial program design, data structures, and algorithms, as well as programmers with additional skills. Handling the ambiguous dependencies present in general-purpose applications may still force conservative synchronization, greatly limiting parallel performance.
- Can hardware help the situation? Yes: hardware-supported Thread Level Speculation.
(Slide 9)
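
To make the pointer-aliasing problem concrete, here is a minimal C sketch (the function and its names are illustrative, not taken from the Hydra papers). The compiler cannot prove that dst and src never overlap, so it must assume a loop-carried dependence and keep the loop sequential:

    /* A loop a parallelizing compiler cannot safely parallelize: if the
       caller passes dst == src + 1, iteration i reads the value written
       by iteration i-1, a loop-carried RAW dependence that is invisible
       at compile time.  Without proof of non-aliasing, the compiler
       must conservatively serialize the loop. */
    void scale(int *dst, int *src, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * 2;
    }

TLS sidesteps this static analysis entirely: the iterations run in parallel, and the hardware catches the rare case where the pointers actually overlap.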

Possible Limited Parallel Software Solution: Data Speculation & Thread Level Speculation (TLS)
- Data speculation and Thread Level Speculation (TLS) enable parallelization without regard for data dependencies:
  - The normal sequential program is broken up into speculative threads.
  - The speculative threads are then run in parallel on multiple physical CPUs (e.g., a CMP) and/or logical CPUs (e.g., SMT).
  - The speculation hardware (TLS processor) architecture ensures correctness.
- Parallel software implications:
  - Loop parallelization is now easily automated.
  - Ambiguous dependencies are resolved dynamically, without conservative synchronization.
  - More arbitrary threads are possible (e.g., subroutines).
  - Synchronization is added only for performance.
- Thread Level Speculation (TLS) hardware support mechanisms:
  - A speculative thread control mechanism (e.g., speculative thread creation, restart, termination).
  - Five basic speculation hardware/memory system requirements for correct data/thread speculation (given later, in slide 21).
(Slide 10)

Subroutine Thread Speculation
(Figure: thread speculation across a subroutine call.) Speculated threads communicate results through shared memory locations.
(Slide 11)

Loop Iteration Speculative Threads
A simple example of a speculatively executed loop using Data/Thread Level Speculation (TLS): the original sequential (single-thread) loop is decomposed so that each speculated thread runs one iteration. This is the most common application of TLS. Speculated threads communicate results through shared memory locations. (A sketch of the kind of loop TLS targets follows below.)
(Slide 12)
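
As a concrete illustration (mine, not from the slides), consider a loop whose iterations are usually independent but occasionally collide. A static compiler must serialize it for the worst case; TLS instead runs one iteration per speculated thread and restarts a thread only when a collision actually occurs:

    /* The kind of loop TLS targets: iterations are independent unless
       two keys repeat, in which case the later iteration truly depends
       on the earlier colliding one (a RAW dependence through bins[]).
       TLS speculates that keys do not repeat and recovers when they do. */
    void histogram(int *bins, const int *key, int n) {
        for (int i = 0; i < n; i++)
            bins[key[i]]++;
    }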

Speculative Thread Creation in Hydra
(Figure.) Register state is passed to the newly created speculative thread through the Register Passing Buffer (RPB).
(Slide 13)

Hydra's Data & Thread Speculation Operations
(Figure.) Speculated threads must commit their results in original program order.
(Slide 14)

Hydra Loop Compiling for Speculation
(Figure: the compiler transforms the loop to create the speculated threads.)
(Slide 15)

Overview of Loop-Iteration Thread Speculation
- Parallel regions (loop iterations) are annotated by the compiler, e.g., Begin_Speculation ... End_Speculation (a sketch follows after this slide).
- The hardware uses these annotations to run loop iterations in parallel, as speculated threads, on a number of CPU cores. Each CPU core knows which loop iteration it is running; a later iteration is assigned a more speculated thread.
- The CPUs dynamically prevent data/name dependency violations:
  - Later iterations cannot use data before it is written by earlier iterations (this prevents data dependency violations, i.e., RAW hazards).
  - Earlier iterations never see writes by later iterations (WAW and WAR name hazards are prevented by memory renaming). How? The TLS hardware creates multiple views of memory.
- If a later iteration (more speculated thread) has used data that an earlier iteration (less speculated thread) writes, before the data is actually written (a data dependency violation, i.e., a RAW hazard), then the later iteration is restarted:
  - All following iterations are halted and restarted as well.
  - All writes by the later iteration are discarded (the speculated work is undone).
- Speculated threads communicate results through shared memory locations.
(Slide 16)
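
A minimal sketch of what the annotations might look like at the source level, using the slide's Begin_Speculation / End_Speculation markers. Spelling them as empty C macros is my illustration, not Hydra's actual toolchain syntax:

    /* Hypothetical spelling of the compiler annotations: the macros are
       placeholders that mark the speculative region for the hardware. */
    #define Begin_Speculation
    #define End_Speculation

    void daxpy(double *y, const double *x, double a, int n) {
        Begin_Speculation
        for (int i = 0; i < n; i++)   /* each iteration may be run as a */
            y[i] += a * x[i];         /* separate speculated thread     */
        End_Speculation
    }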

Loop Execution with Thread Speculation
Data dependency violation (RAW hazard) handling example. (Figure: a value is read too early by a later, more speculative thread before the earlier, less speculative thread writes it.) If a later iteration (more speculated thread) has used data that an earlier iteration (less speculated thread) writes, before the data is actually written (a data dependency violation, i.e., a RAW hazard), the later iteration is restarted; all following iterations are halted and restarted as well, and all writes by the later iteration are discarded (the speculated work is undone). Speculated threads communicate results through shared memory locations. (A sketch of the restart rule follows below.)
(Slide 17)
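
The restart rule fits in a few lines of C; discard_speculative_state and restart_thread are hypothetical stand-ins for Hydra's software speculation handlers, shown only to make the cascade explicit:

    /* When thread v suffers a RAW violation, v and every more
       speculative thread discard their buffered writes and re-execute;
       threads less speculative than v are unaffected. */
    void discard_speculative_state(int t);   /* hypothetical handler */
    void restart_thread(int t);              /* hypothetical handler */

    void handle_violation(int v, int nthreads) {
        for (int t = v; t < nthreads; t++) {
            discard_speculative_state(t);
            restart_thread(t);
        }
    }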

Data Hazard/Dependence Classification
I precedes J in program order; the shared name is a register or memory location (here, speculated threads communicate results through shared memory locations).
- True data dependence: I writes (e.g., S.D. F4, 0(R1)), then J reads (e.g., L.D. F6, 0(R1)) the shared name. Read After Write (RAW) hazard if the data dependence is violated.
- Name dependence, output dependence: I writes (e.g., S.D. F4, 0(R1)), then J writes (e.g., S.D. F6, 0(R1)) the shared name. Write After Write (WAW) hazard if the output dependence is violated.
- Name dependence, antidependence: I reads (e.g., L.D. F6, 0(R1)), then J writes (e.g., S.D. F4, 0(R1)) the shared name. Write After Read (WAR) hazard if the antidependence is violated.
- No dependence: I reads (e.g., L.D. F6, 0(R1)), then J reads (e.g., L.D. F4, 0(R1)) the shared name. Read After Read (RAR): not a hazard.
(The same four cases, written as C statements, appear below.)
(Slide 18)
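
The same classification in C form (my illustration; x is the shared name, and each function body executes statement I, then statement J, in program order):

    int x, y, z;

    void raw(void) { x = y; z = x; }  /* I writes, J reads:  true dependence, RAW hazard   */
    void waw(void) { x = y; x = z; }  /* I writes, J writes: output dependence, WAW hazard */
    void war(void) { y = x; x = z; }  /* I reads,  J writes: antidependence, WAR hazard    */
    void rar(void) { y = x; z = x; }  /* I reads,  J reads:  no dependence, not a hazard   */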

Speculative Data Access in Speculated Threads
(Figure, for threads i, less speculated, and i+1, more speculated, accessing the same memory location.)
- Access in correct program order (i before i+1): a write by i+1 must not be seen by i (WAR, handled by memory renaming).
- Reversed access order (i+1 before i): a read by i+1 before the write by i is a data dependency violation (RAW hazard), and writes must still retire in original order (WAW).
Speculated threads communicate results through shared memory locations.
(Slide 19)

Speculative Data Access in Speculated Threads
To provide the desired memory behavior, the data/thread speculation hardware must provide:
1. A method for detecting true memory dependencies, in order to determine when a dependency has been violated (a RAW hazard).
2. A method for backing up and re-executing speculative loads, and any instructions that may be dependent upon them, when the load causes a violation (i.e., a RAW hazard).
3. A method for buffering any data written during a speculative region of a program, so that it may be discarded when a violation occurs, or permanently committed at the right time (i.e., when the thread is no longer speculative, in correct program order).
(Slide 20)

Five Basic Speculation Hardware Requirements for Correct Data/Thread Speculation
(A toy software model covering all five requirements follows after this list.)
1. Forward data between parallel threads (prevent RAW). A speculative system must be able to forward shared data quickly and efficiently from an earlier thread running on one processor to a later thread running on another.
2. Detect when reads occur too early (a RAW hazard has occurred). If a data value is read by a later thread and subsequently written by an earlier thread, the hardware must notice that the read retrieved incorrect data, since a true dependence violation (RAW hazard) has occurred.
3. Safely discard speculative state after violations (RAW hazards). All speculative changes to the machine state must be discarded after a violation, while no permanent machine state may be lost in the process.
4. Retire speculative writes in the correct order (prevent WAW hazards). Once speculative threads have completed successfully (i.e., they are no longer speculative), their state must be added (committed) to the permanent state of the machine in the correct program order, considering the original sequencing of the threads.
5. Provide memory renaming (prevent WAR hazards). The speculative hardware must ensure that an older thread cannot see any changes made by later threads, as these would not yet have occurred in the original sequential program (i.e., multiple views of memory).
(Slide 21)
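
To make the five requirements concrete, here is a toy, purely software model (my sketch, not Hydra's mechanism: Hydra implements these requirements with L1 tag bits and per-core L2 speculation write buffers, as the following slides describe). Four speculative threads share a tiny word-addressed memory; each thread has a private write buffer and per-address read bits:

    #include <stdio.h>
    #include <string.h>

    #define NTHREADS 4
    #define MEMSIZE  16

    static int  mem[MEMSIZE];               /* permanent state (Hydra: the L2 cache) */
    static int  buf[NTHREADS][MEMSIZE];     /* per-thread speculative write buffers  */
    static char written[NTHREADS][MEMSIZE]; /* buffer valid bits                     */
    static char readbit[NTHREADS][MEMSIZE]; /* req. 2: marks reads that may violate  */
    static char violated[NTHREADS];

    /* Req. 1 and 5: a load sees its own writes, then earlier threads'
       buffered writes, then permanent state; later threads' writes
       stay invisible (memory renaming). */
    static int spec_load(int t, int a) {
        readbit[t][a] = 1;
        for (int e = t; e >= 0; e--)
            if (written[e][a]) return buf[e][a];
        return mem[a];
    }

    /* Req. 2: a store by thread t flags every later thread that already
       read address a; such a thread read too early (RAW violation). */
    static void spec_store(int t, int a, int v) {
        buf[t][a] = v;
        written[t][a] = 1;
        for (int l = t + 1; l < NTHREADS; l++)
            if (readbit[l][a]) violated[l] = 1;
    }

    /* Req. 3: discard a violated thread's state; permanent state is safe. */
    static void discard(int t) {
        memset(written[t], 0, MEMSIZE);
        memset(readbit[t], 0, MEMSIZE);
        violated[t] = 0;
    }

    /* Req. 4: commit a thread's buffer into permanent state; calling this
       in original thread order (0, 1, 2, ...) preserves WAW ordering. */
    static void commit(int t) {
        for (int a = 0; a < MEMSIZE; a++)
            if (written[t][a]) mem[a] = buf[t][a];
        discard(t); /* clear the bookkeeping after the commit */
    }

    int main(void) {
        spec_store(0, 3, 42);                        /* earlier thread writes addr 3   */
        printf("forwarded: %d\n", spec_load(1, 3));  /* req. 1: later thread sees 42   */
        spec_load(2, 5);                             /* thread 2 reads addr 5 early    */
        spec_store(1, 5, 7);                         /* req. 2: thread 2 is violated   */
        printf("thread 2 violated: %d\n", violated[2]);
        discard(2);                                  /* req. 3: thread 2 would restart */
        commit(0); commit(1);                        /* req. 4: commit in order        */
        printf("mem[3]=%d mem[5]=%d\n", mem[3], mem[5]);
        return 0;
    }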

Speculative Hardware/Memory Requirements 1-2
(Figure, for a less speculated thread i and a more speculated thread i+1.) Requirement 1: forward data from the earlier thread to the later thread (prevent RAW). Requirement 2: detect when a read occurs too early (a RAW hazard, i.e., a violation).
(Slide 22)

Speculative Hardware/Memory Requirements 3-4
(Figure.) Requirement 3: when a RAW hazard occurs and is detected, the more speculated thread is restarted and its speculative writes are discarded. Requirement 4: commit speculative writes in correct program order (prevent WAW hazards).
(Slide 23)

Speculative Hardware/Memory Requirement 5: Memory Renaming
(Figure, for a less speculated thread i, a more speculated thread i+1, and an even more speculated thread i+2.) A write to X by thread i+1 is not visible to less speculated threads (thread i here), i.e., no WAR hazard, while remaining visible to more speculated threads. Memory renaming prevents WAR hazards.
(Slide 24)

Hydra Thread Level Speculation (TLS) Hardware
(Figure: a speculation coprocessor per core, plus the L2 cache speculation write buffers, one per core.)
(Slide 25)

Hydra Thread Level Speculation (TLS) Support
(Figure: multiple memory views, i.e., memory renaming.)
(Slide 26)

L1 Cache Tag Details
(Figure: the tag bits record, among other things, writes by more speculated threads.)
(Slide 27)

L2 Speculation Write Buffer Details
(Figure.) The L2 speculation write buffers are committed to the L2 cache (which holds the permanent, nonspeculative state) in correct program order, once the thread is no longer speculative. This prevents WAW hazards (basic requirement 4).
(Slide 28)

The Operation of Speculative Loads
(Figure.) On a local data L1 miss, a thread first checks its own L2 speculation write buffer, then the less speculated (earlier) threads' buffers, and the L2 cache last. It does not check more speculated (later) threads' buffers: later writes must not be visible (otherwise WAR hazards would occur). This operation of speculative loads provides multiple memory views (memory renaming), in which more speculative writes are not visible to less speculative threads; this prevents WAR hazards (satisfying basic speculative hardware/memory requirement 5) and satisfies data dependencies by forwarding earlier writes (requirement 1).
(Slide 29)

Speculative Load Operation: Reading the L2 Cache Speculation Buffers
(Figure, similar to the previous slide.) On a local data L1 miss: first check the thread's own and then the less speculated (earlier) threads' write buffers, then the L2 cache; more speculated (later) threads' buffers are not checked, since later writes must not be visible (otherwise WAR).
(Slide 30)

The Operation of Speculative Stores
(Figure, for less speculated and more speculated threads.) A speculative store writes to the local data L1 cache and to the thread's own L2 speculation write buffer. RAW detection is similar to an invalidation-based cache coherence protocol (this satisfies basic speculative hardware/memory requirements 2-3). The L2 speculation write buffers are committed to the L2 cache (which holds the permanent, nonspeculative state) in correct program order, once the thread is no longer speculative (this satisfies basic speculative hardware/memory requirement 4).
(Slide 31)

Hydra's Handling of the Five Basic Speculation Hardware Requirements for Correct Data/Thread Speculation
1. Forward data between parallel threads (RAW), via speculative loads, as seen earlier in slides 29-30. When a speculative thread writes data over the write bus, all more-speculative threads that may need the data have their current copy of that cache line invalidated in their primary (data L1) cache. This is similar to the way the system works during nonspeculative operation (an invalidation-based cache coherence protocol). If any of the threads subsequently need the new speculative data forwarded to them, they will miss in their primary cache and access the secondary cache, along with their own L2 write buffer and the less speculated threads' L2 buffers. The speculative data contained in the write buffers of the current or older threads replaces the data returned from the secondary cache on a byte-by-byte basis, just before the composite line is returned to the processor and primary cache (a sketch of this merge follows below).
(Slide 32)
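
A sketch of that byte-by-byte merge (my illustration: the line size, the array layout, and all names are assumptions, not Hydra's actual buffer organization):

    #define LINE_BYTES 32  /* illustrative cache line size */

    /* Compose the line returned to thread t on an L1 miss: start from
       the L2 copy, then overlay buffered speculative bytes from threads
       0..t, oldest first, so the closest earlier write wins per byte.
       bufmask[e][b] != 0 means thread e has written byte b of the line. */
    void compose_line(unsigned char *out, const unsigned char *l2_line,
                      unsigned char bufdata[][LINE_BYTES],
                      unsigned char bufmask[][LINE_BYTES], int t) {
        for (int b = 0; b < LINE_BYTES; b++) {
            out[b] = l2_line[b];
            for (int e = 0; e <= t; e++)
                if (bufmask[e][b])
                    out[b] = bufdata[e][b];
        }
    }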

Hydra's Handling of the Five Basic Speculation Hardware Requirements (continued)
2. Detect when reads occur too early (detect RAW hazards). Primary cache (data L1) read bits are set to mark any reads that may cause violations. Subsequently, if a write to that address from an earlier (less speculated) thread invalidates the address, a violation is detected and the thread is restarted.
3. Safely discard speculative state after violations (RAW hazards). Since all permanent machine state in Hydra is always maintained within the secondary cache, anything in the primary caches and the secondary cache speculation buffers may be invalidated at any time without risking a loss of permanent state. As a result, any lines in the primary cache containing speculative data (marked with a special modified bit) may simply be invalidated all at once to clear any speculative state from a primary cache. In parallel with this operation, the secondary cache buffer for the thread may be emptied to discard any speculative data written by the thread.
(Slide 33)

Hydra's Handling of the Five Basic Speculation Hardware Requirements (continued)
4. Retire speculative writes in the correct order (prevent WAW hazards). Separate secondary cache speculation buffers are maintained for each thread. As long as these are drained (committed) into the secondary cache in the original program sequence of the threads, they will reorder the speculative memory references correctly.
5. Provide memory renaming (prevent WAR hazards), i.e., multiple memory views. As seen earlier in the speculative load operation, each processor can only read data written by itself or by earlier (less speculated) threads when reading its own primary cache or the secondary cache speculation buffers. Writes from later (more speculative) threads do not cause immediate invalidations in the primary cache, since those writes should not be visible to earlier (less speculative) threads. However, these ignored invalidations are recorded using an additional pre-invalidate primary cache bit associated with each line, because they must be processed before a different speculative or nonspeculative thread executes on this processor. If future threads have written to a particular line in the primary cache, the pre-invalidate bit for that line is set. When the current thread completes, these bits allow the processor to quickly simulate the effect of all the stored invalidations caused by writes from later processors, all at once, before a new thread begins execution on this processor (a sketch follows below).
(Slide 34)
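
A toy model of the pre-invalidate bit (my sketch: one bit per line, as the text describes, with smaller thread IDs meaning less speculative):

    #define LINES 8  /* illustrative number of L1 lines */

    struct l1 {
        unsigned char valid[LINES];
        unsigned char preinv[LINES]; /* set by writes from more speculative threads */
    };

    /* Thread t's L1 cache snoops a write by thread w to the given line. */
    void snoop_write(struct l1 *c, int line, int w, int t) {
        if (w < t)
            c->valid[line] = 0;   /* earlier thread's write: invalidate now */
        else if (w > t)
            c->preinv[line] = 1;  /* later thread's write: only remember it */
    }

    /* When the current thread completes, apply all deferred invalidations
       at once, before a new thread begins execution on this processor. */
    void end_of_thread(struct l1 *c) {
        for (int i = 0; i < LINES; i++)
            if (c->preinv[i]) {
                c->valid[i] = 0;
                c->preinv[i] = 0;
            }
    }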

Thread Speculation Performance
(Figure.) Results are representative of entire uniprocessor applications, simulated with accurate modeling of Hydra's memory system and hardware speculation support; the comparison baseline is execution without TLS.
(Slide 35)

Hydra Conclusions
Hydra offers a number of advantages:
- Good performance on parallel applications.
- Promising performance on difficult-to-parallelize sequential (single-threaded) applications, using its data/thread level speculation (TLS) mechanisms.
- A scalable, modular design.
- Low hardware overhead for supporting speculative thread parallelism, which nevertheless greatly increases the number of applications that can be parallelized.
(Slide 36)

Other Thread Level Speculation (TLS) Efforts: Wisconsin Multiscalar (1995)
- This CMP-based design proposed the first reasonable hardware to implement TLS.
- Unlike Hydra, Multiscalar implements a ring-like network between all of the processors to allow direct register-to-register communication. Along with hardware-based thread sequencing, this type of communication allows much smaller threads to be exploited, at the expense of more complex processor cores.
- The designers proposed two different speculative memory systems to support the Multiscalar core. The first was a unified primary cache, the address resolution buffer (ARB); unfortunately, the ARB has most of the complexity of Hydra's secondary cache buffers at the primary cache level, making it difficult to implement. Later, they proposed the speculative versioning cache (SVC), which uses write-back primary caches to buffer speculative writes in the primary caches, using a sophisticated coherence scheme.
(Slide 37)

Other Thread Level Speculation (TLS) Efforts: Carnegie-Mellon Stampede
- This CMP-with-TLS proposal is very similar to Hydra, including the use of software speculation handlers. However, the hardware is simpler than Hydra's.
- The design uses write-back primary caches to buffer writes, similar to those in the SVC, and sophisticated compiler technology to explicitly mark all memory references that require forwarding to another speculative thread.
- Their simplified SVC must drain its speculative contents as each thread completes, unfortunately resulting in heavy bursts of bus activity.
(Slide 38)

Other Thread Level Speculation (TLS) Efforts: MIT M-machine
- This CMP design has three processors that share a primary cache and can communicate register-to-register through a crossbar. Each processor can also switch dynamically among several threads (TLS and SMT?). As a result, the hardware connecting the processors together is quite complex and slow.
- However, programs executed on the M-machine can be parallelized using very fine-grain mechanisms that are impossible on an architecture, like Hydra, that shares memory only outside of the processor cores.
- Performance results show that, on typical applications, extremely fine-grained parallelization is often not as effective as parallelism at the levels Hydra can exploit; the overhead incurred by frequent synchronization reduces its effectiveness.
(Slide 39)