Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Size: px
Start display at page:

Download "Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)"

Transcription

1 Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) A 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software support for Data/Thread Level Speculation (TLS) to extract parallel speculated threads from sequential code (single thread) augmented with software thread speculation handlers (Primary papers: 4, 6) #1 lec # 10 Fall

2 Motivation for Chip Multiprocessors (CMPs) A CMP offers implementation benefits High-speed signals are localized in individual CPUs A proven CPU design is replicated across the die (including SMT processors, e.g IBM Power 5) Overcomes diminishing performance/transistor return problem in uniprocessors (similar motivation for SMT) Transistors are used today mostly for ILP extraction MPs use transistors to run multiple threads (exploit thread level parallelism, TLP): On parallelized programs With multiprogrammed workloads A number of single-threaded applications executing of different CPUs Fast inter-processor communication eases parallelization of code (Using shared L2 cache) Potential Drawback of CMPs: High power/heat issues using current VLSI processes. #2 lec # 10 Fall

3 Stanford Hydra CMP Approach Goals Exploit all levels of program parallelism. Develop a single-chip multiprocessor architecture that simplifies microprocessor design and achieves high performance. Make the multiprocessor transparent to the average user. Integrate use of parallelizing compiler technology in the design of microarchitecture that supports data/thread level speculation (TLS). On multiple CPU cores within a single CMP or multiple CMPs On multiple CPU cores within a single CMP using Thread Level Speculation (TLS) Within a single CPU core #3 lec # 10 Fall

4 Hydra Prototype Overview 4 CPU cores with modified private L1 caches. Speculative coprocessor (for each processor core) Speculative memory reference controller Speculative interrupt screening mechanism Statistics mechanisms for performance evaluation and to allow feedback for code tuning Memory system Read and write buses Controllers for all resources On-chip shared L2 cache L2 Speculation write buffers. Simple off-chip main memory controller I/O and debugging interface #4 lec # 10 Fall

5 The Basic Hydra CMP 4 processors and secondary cache on a chip 2 buses connect processors and memory Coherence: writes are broadcast on write bus #5 lec # 10 Fall

6 Hydra Memory Hierarchy Characteristics #6 lec # 10 Fall

7 Hydra Prototype Layout Shared L2 L2 Speculation Write Buffers 250 MHz clock rate target #7 lec # 10 Fall

8 CMP Parallel Performance Varying levels of performance Multiprogrammed workloads work well Very parallel apps (matrix-based FP and multimedia) are excellent Acceptable only with a few less parallel (i.e. integer) applications Thread Level Speculation (TLS) Target Applications Without Thread Level Speculation (TLS) #8 lec # 10 Fall

9 The Parallelization Problem Current automated parallelization software (parallel compilers) is limited Parallel compilers are generally successful for scientific applications with statically known dependencies (e.g dense matrix computations). Automated parallization of general-purpose applications provides poor parallel performance especially for integer applications due to ambiguous dependencies resulting from: Significant pointer use: Pointer aliasing (Pointer disambiguation problem) Dynamic loop limits Complex control flow Irregular array accesses Inter-procedural dependencies Ambiguous dependencies limit extracted parallelism/performance: Complicate static dependency analysis Introduce imprecision into dependence relations Force conservative performance-degrading synchronization to safely handle potential dependencies. Parallelism may exist in algorithm, but code hides it. Manual parallelization can provide good performance on a much wider range of applications: Requires different initial program design/data structures/algorithms Programmers with additional skills. Handling ambiguous dependencies present in general-purpose applications may still force conservative synchronization greatly limiting parallel performance Can hardware help the situation? #9 lec # 10 Fall

10 Possible Limited Parallel Software Solution: Data Speculation & Thread Level Speculation (TLS) Data speculation and Thread Level Speculation (TLS) enable parallelization without regard for data dependencies Normal sequential program is broken up into speculative threads Speculative threads are now run in parallel on multiple physical CPUs (e.g. CMP) and/or logical CPUs (e.g. SMT). Speculation hardware (TLS processor) architecture ensures correctness Parallel software implications Loop parallelization is now easily automated Ambiguous dependencies resolved dynamically without conservative synchronization More arbitrary threads are possible (subroutines) Add synchronization only for performance Thread Level Speculation (TLS) hardware support mechanisms Speculative thread control mechanism Five basic speculation hardware/memory system requirements for correct data/thread speculation #10 lec # 10 Fall

11 Subroutine Thread Speculation Speculated Thread #11 lec # 10 Fall

12 Loop Iteration Speculative Threads A Simple example of a speculatively executed loop using Data/Thread Level Speculation (TLS) Speculated Threads Original Sequential (Single Thread) Loop Most common Application of TLS #12 lec # 10 Fall

13 Overview of Loop-Iteration Thread Speculation Parallel regions (loop iterations) are annotated by the compiler. e.g. Begin_Speculation End_Speculation The hardware uses these annotations to run loop iterations in parallel as speculated threads on a number of CPUs. Each CPU knows which loop iteration it is running CPUs dynamically prevent data/name dependency violations later iterations can t use data before write by earlier iterations (RAW) earlier iterations never see writes by later iterations (WAW, WAR hazards prevented): Multiple views of memory are created by TLS hardware If a later iteration has used data that an earlier iteration writes (RAW hazard), it is restarted All following iterations are halted and restarted, also All writes by the later iteration are discarded (undo speculated work). #13 lec # 10 Fall

14 Hydra s Data & Thread Speculation Operations #14 lec # 10 Fall

15 Hydra Loop Compiling for Speculation Speculated Threads #15 lec # 10 Fall

16 Loop Execution with Thread Speculation Data Dependency Violation (RAW hazard) #16 lec # 10 Fall

17 Speculative Thread Creation in Hydra Register Passing Buffer (RPB) #17 lec # 10 Fall

18 Speculative Data Access in Speculated Threads i Less Speculated thread i+1 More speculated thread i WAR i+1 RAW WAW Write by i+1 Not seen by i #18 lec # 10 Fall

19 Speculative Data Access in Speculated Threads To provide the desired memory behavior, the data/thread speculation hardware must provide: 1. A method for detecting true memory dependencies, in order to determine when a dependency has been violated (RAW hazard). 2. A method for backing up and re-executing speculative loads and any instructions that may be dependent upon them when the load causes a violation. 3. A method for buffering any data written during a speculative region of a program so that it may be discarded when a violation occurs or permanently committed at the right time. #19 lec # 10 Fall

20 Five Basic Speculation Hardware Requirements For Correct Data/Thread Speculation 1. Forward data between parallel threads (RAW). A speculative system must be able to forward shared data quickly and efficiently from an earlier thread running on one processor to a later thread running on another. 2. Detect when reads occur too early (RAW hazards). If a data value is read by a later thread and subsequently written by an earlier thread, the hardware must notice that the read retrieved incorrect data since a true dependence violation has occurred. 3. Safely discard speculative state after violations. All speculative changes to the machine state must be discarded after a violation, while no permanent machine state may be lost in the process. 4. Retire speculative writes in the correct order (WAW hazards). Once speculative threads have completed successfully, their state must be added to the permanent state of the machine in the correct program order, considering the original sequencing of the threads. 5. Provide memory renaming (WAR hazards). The speculative hardware must ensure that the older thread cannot see any changes made by later threads, as these would not have occurred yet in the original sequential program. (i.g. Multiple views of memory) #20 lec # 10 Fall

21 Speculative Hardware/Memory Requirements More Speculated Thread (RAW) (RAW hazard or violation) #21 lec # 10 Fall

22 Speculative Hardware/Memory Requirements 3-4 More Speculated Thread Restart 3 4 (RAW hazard). (prevent WAW hazards) #22 lec # 10 Fall

23 Speculative Hardware/Memory Requirement 5 Less speculated Thread i More Speculated Thread i + 1 Even more Speculated Thread i + 2 Not visible to less speculated thread i 5 Write X by i+1 not visible to less speculated threads (thread i here) (i.e. no WAR hazard) Memory Renaming to prevent WAR hazards. #23 lec # 10 Fall

24 Hydra Thread Level Speculation (TLS) Hardware #24 lec # 10 Fall

25 Hydra Thread Level Speculation (TLS) Support #25 lec # 10 Fall

26 L1 Cache Tag Details - Record writes of more speculated threads #26 lec # 10 Fall

27 L2 Speculation Buffer Details #27 lec # 10 Fall

28 The Operation of Speculative Loads Less Speculative More Speculative Check Last Do Not Check: More Speculated Later writes not visible (otherwise WAR) Check First #28 lec # 10 Fall

29 Reading L2 Cache Speculative Buffers Similar to last slide #29 lec # 10 Fall

30 Less Speculated More Speculated The Operation of Speculative Stores RAW Detection Similar to invalidate cache coherency protocols #30 lec # 10 Fall

31 Hydra s Handling of Five Basic Speculation Hardware Requirements For Correct Data/Thread Speculation 1. Forward data between parallel threads (RAW). Speculative Load When a speculative thread writes data over the write bus, all more-speculative threads that may need the data have their current copy of that cache line invalidated. This is similar to the way the system works during nonspeculative operation (invalidate cache coherency protocol). If any of the threads subsequently need the new speculative data forwarded to them, they will miss in their primary cache and access the secondary cache. The speculative data contained in the write buffers of the current or older threads replaces data returned from the secondary cache on a byte-by-byte basis just before the composite line is returned to the processor and primary cache. #31 lec # 10 Fall

32 Hydra s Handling of Five Basic Speculation Hardware Requirements For Correct Data/Thread Speculation 2. Detect when reads occur too early (RAW hazards). Primary cache bits are set to mark any reads that may cause violations. Subsequently, if a write to that address from an earlier thread (less speculated) invalidates the address, a violation is detected, and the thread is restarted. 3. Safely discard speculative state after violations. Since all permanent machine state in Hydra is always maintained within the secondary cache, anything in the primary caches and secondary cache speculation buffers may be invalidated at any time without risking a loss of permanent state. As a result, any lines in the primary cache containing speculative data (marked with a special modified bit) may simply be invalidated all at once to clear any speculative state from a primary cache. In parallel with this operation, the secondary cache buffer for the thread may be emptied to discard any speculative data written by the thread. #32 lec # 10 Fall

33 Hydra s Handling of Five Basic Speculation Hardware Requirements For Correct Data/Thread Speculation 4. Retire speculative writes in the correct order (WAW hazards). Separate secondary cache speculation buffers are maintained for each thread. As long as these are drained into the secondary cache in the original program sequence of the threads, they will reorder speculative memory references correctly. 5. Provide memory renaming (WAR hazards). Each processor can only read data written by itself or earlier threads (less speculated threads) when reading its own primary cache or the secondary cache speculation buffers. Writes from later threads don t cause immediate invalidations in the primary cache, since these writes should not be visible to earlier (less speculative) threads. However, these ignored invalidations are recorded using an additional preinvalidate primary cache bit associated with each line. This is because they must be processed before a different speculative or non-speculative thread executes on this processor. If future threads have written to a particular line in the primary cache, the preinvalidate bit for that line is set. When the current thread completes, these bits allow the processor to quickly simulate the effect of all stored invalidations caused by all writes from later processors all at once, before a new thread begins execution on this processor. More speculative writes not visible #33 lec # 10 Fall

34 Thread Speculation Performance Results representative of entire uniprocessor applications Simulated with accurate modeling of Hydra s memory and hardware speculation support. #34 lec # 10 Fall

35 Hydra Conclusions Hydra offers a number of advantages Good performance on parallel applications Promising performance on difficult to parallelize sequential (single-threaded) applications using data/thread Level Speculation (TLS) mechanisms. Scalable, modular design Low hardware overhead support for speculative thread parallelism, yet greatly increases the number of parallel applications. #35 lec # 10 Fall

36 Other Thread Level Speculation (TLS) Efforts: Wisconsin Multiscalar (1995) This CMP-based design proposed the first reasonable hardware to implement TLS. Unlike Hydra, Multiscalar implements a ring-like network between all of the processors to allow direct register-to-register communication. Along with hardware-based thread sequencing, this type of communication allows much smaller threads to be exploited at the expense of more complex processor cores. The designers proposed two different speculative memory systems to support the Multiscalar core. The first was a unified primary cache, or address resolution buffer (ARB). Unfortunately, the ARB has most of the complexity of Hydra s secondary cache buffers at the primary cache level, making it difficult to implement. Later, they proposed the speculative versioning cache (SVC). The SVC uses write-back primary caches to buffer speculative writes in the primary caches, using a sophisticated coherence scheme. #36 lec # 10 Fall

37 Other Thread Level Speculation (TLS) Efforts: Carnegie-Mellon Stampede This CMP-with-TLS proposal is very similar to Hydra, Including the use of software speculation handlers. However, the hardware is simpler than Hydra s. The design uses write-back primary caches to buffer writes similar to those in the SVC and sophisticated compiler technology to explicitly mark all memory references that require forwarding to another speculative thread. Their simplified SVC must drain its speculative contents as each thread completes, unfortunately resulting in heavy bursts of bus activity. #37 lec # 10 Fall

38 Other Thread Level Speculation (TLS) Efforts: MIT M-machine This CMP design has three processors that share a primary cache and can communicate register-to-register through a crossbar. Each processor can also switch dynamically among several threads. (TLS & SMT??) As a result, the hardware connecting processors together is quite complex and slow. However, programs executed on the M-machine can be parallelized using very fine-grain mechanisms that are impossible on an architecture that shares outside of the processor cores, like Hydra. Performance results show that on typical applications extremely fine-grained parallelization is often not as effective as parallelism at the levels that Hydra can exploit. The overhead incurred by frequent synchronizations reduces the effectiveness. #38 lec # 10 Fall

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra ia a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra is a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

The Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun

The Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun The Stanford Hydra CMP Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun Computer Systems Laboratory Stanford University http://www-hydra.stanford.edu

More information

The Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Mark Willey, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun

The Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Mark Willey, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun The Stanford Hydra CMP Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Mark Willey, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun Computer Systems Laboratory Stanford University http://www-hydra.stanford.edu

More information

Outline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA

Outline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA CS 258 Parallel Computer Architecture Data Speculation Support for a Chip Multiprocessor (Hydra CMP) Lance Hammond, Mark Willey and Kunle Olukotun Presented: May 7 th, 2008 Ankit Jain Outline The Hydra

More information

Data Speculation Support for a Chip Multiprocessor Lance Hammond, Mark Willey, and Kunle Olukotun

Data Speculation Support for a Chip Multiprocessor Lance Hammond, Mark Willey, and Kunle Olukotun Data Speculation Support for a Chip Multiprocessor Lance Hammond, Mark Willey, and Kunle Olukotun Computer Systems Laboratory Stanford University http://www-hydra.stanford.edu A Chip Multiprocessor Implementation

More information

Data Speculation Support for a Chip Multiprocessor

Data Speculation Support for a Chip Multiprocessor Data Speculation Support for a Chip Multiprocessor Lance Hammond, Mark Willey and Kunle Olukotun Computer Systems Laboratory Stanford University Stanford, CA 94305-4070 http://www-hydra.stanford.edu/ Abstract

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

THE STANFORD HYDRA CMP

THE STANFORD HYDRA CMP THE STANFORD HYDRA CMP CHIP MULTIPROCESSORS OFFER AN ECONOMICAL, SCALABLE ARCHITECTURE FOR FUTURE MICROPROCESSORS. THREAD-LEVEL SPECULATION SUPPORT ALLOWS THEM TO SPEED UP PAST SOFTWARE. Lance Hammond

More information

Handout 2 ILP: Part B

Handout 2 ILP: Part B Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP

More information

CSE502 Graduate Computer Architecture. Lec 22 Goodbye to Computer Architecture and Review

CSE502 Graduate Computer Architecture. Lec 22 Goodbye to Computer Architecture and Review CSE502 Graduate Computer Architecture Lec 22 Goodbye to Computer Architecture and Review Larry Wittie Computer Science, StonyBrook University http://www.cs.sunysb.edu/~cse502 and ~lw Slides adapted from

More information

Speculative Synchronization

Speculative Synchronization Speculative Synchronization José F. Martínez Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu/martinez Problem 1: Conservative Parallelization No parallelization

More information

Transactional Memory. Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech

Transactional Memory. Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech Transactional Memory Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech (Adapted from Stanford TCC group and MIT SuperTech Group) Motivation Uniprocessor Systems Frequency

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer

More information

SPECULATIVE MULTITHREADED ARCHITECTURES

SPECULATIVE MULTITHREADED ARCHITECTURES 2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions

More information

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to

More information

Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II. Prof. Onur Mutlu Carnegie Mellon University 10/12/2012

Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II. Prof. Onur Mutlu Carnegie Mellon University 10/12/2012 18-742 Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II Prof. Onur Mutlu Carnegie Mellon University 10/12/2012 Past Due: Review Assignments Was Due: Tuesday, October 9, 11:59pm. Sohi

More information

Portland State University ECE 588/688. Cray-1 and Cray T3E

Portland State University ECE 588/688. Cray-1 and Cray T3E Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector

More information

RECAP. B649 Parallel Architectures and Programming

RECAP. B649 Parallel Architectures and Programming RECAP B649 Parallel Architectures and Programming RECAP 2 Recap ILP Exploiting ILP Dynamic scheduling Thread-level Parallelism Memory Hierarchy Other topics through student presentations Virtual Machines

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Hydra: A Chip Multiprocessor with Support for Speculative Thread-Level Parallelization

Hydra: A Chip Multiprocessor with Support for Speculative Thread-Level Parallelization Hydra: A Chip Multiprocessor with Support for Speculative Thread-Level Parallelization A DISSERTATION SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD

More information

Software and Hardware for Exploiting Speculative Parallelism with a Multiprocessor

Software and Hardware for Exploiting Speculative Parallelism with a Multiprocessor Software and Hardware for Exploiting Speculative Parallelism with a Multiprocessor Jeffrey Oplinger, David Heine, Shih-Wei Liao, Basem A. Nayfeh, Monica S. Lam and Kunle Olukotun Computer Systems Laboratory

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

Complexity Analysis of A Cache Controller for Speculative Multithreading Chip Multiprocessors

Complexity Analysis of A Cache Controller for Speculative Multithreading Chip Multiprocessors Complexity Analysis of A Cache Controller for Speculative Multithreading Chip Multiprocessors Yoshimitsu Yanagawa, Luong Dinh Hung, Chitaka Iwama, Niko Demus Barli, Shuichi Sakai and Hidehiko Tanaka Although

More information

Abstract. 1 Introduction. 2 The Hydra CMP. Computer Systems Laboratory Stanford University Stanford, CA

Abstract. 1 Introduction. 2 The Hydra CMP. Computer Systems Laboratory Stanford University Stanford, CA Data Speculation Support for a Chip Multiprocessor Lance Hammond, Mark Willey and Kunle Olukotun Computer Systems Laboratory Stanford University Stanford, CA 94305-4070 http://www-hydra.stanford.edu/ Abstract

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

A Chip-Multiprocessor Architecture with Speculative Multithreading

A Chip-Multiprocessor Architecture with Speculative Multithreading 866 IEEE TRANSACTIONS ON COMPUTERS, VOL. 48, NO. 9, SEPTEMBER 1999 A Chip-Multiprocessor Architecture with Speculative Multithreading Venkata Krishnan, Member, IEEE, and Josep Torrellas AbstractÐMuch emphasis

More information

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor 1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic

More information

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers) physical register file that is the same size as the architectural registers

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls

More information

ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING 16 MARKS CS 2354 ADVANCE COMPUTER ARCHITECTURE 1. Explain the concepts and challenges of Instruction-Level Parallelism. Define

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction

More information

5008: Computer Architecture

5008: Computer Architecture 5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage

More information

Complexity Analysis of Cache Mechanisms for Speculative Multithreading Chip Multiprocessors

Complexity Analysis of Cache Mechanisms for Speculative Multithreading Chip Multiprocessors Complexity Analysis of Cache Mechanisms for Speculative Multithreading Chip Multiprocessors Yoshimitsu Yanagawa 1 Introduction 1.1 Backgrounds 1.1.1 Chip Multiprocessors With rapidly improving technology

More information

Multi-core Architectures. Dr. Yingwu Zhu

Multi-core Architectures. Dr. Yingwu Zhu Multi-core Architectures Dr. Yingwu Zhu What is parallel computing? Using multiple processors in parallel to solve problems more quickly than with a single processor Examples of parallel computing A cluster

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Multiplex: Unifying Conventional and Speculative Thread-Level Parallelism on a Chip Multiprocessor

Multiplex: Unifying Conventional and Speculative Thread-Level Parallelism on a Chip Multiprocessor Multiplex: Unifying Conventional and Speculative Thread-Level Parallelism on a Chip Multiprocessor Seon Wook Kim, Chong-Liang Ooi, Il Park, Rudolf Eigenmann, Babak Falsafi, and T. N. Vijaykumar School

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue

More information

Portland State University ECE 588/688. IBM Power4 System Microarchitecture

Portland State University ECE 588/688. IBM Power4 System Microarchitecture Portland State University ECE 588/688 IBM Power4 System Microarchitecture Copyright by Alaa Alameldeen 2018 IBM Power4 Design Principles SMP optimization Designed for high-throughput multi-tasking environments

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

Multi-Version Caches for Multiscalar Processors. Manoj Franklin. Clemson University. 221-C Riggs Hall, Clemson, SC , USA

Multi-Version Caches for Multiscalar Processors. Manoj Franklin. Clemson University. 221-C Riggs Hall, Clemson, SC , USA Multi-Version Caches for Multiscalar Processors Manoj Franklin Department of Electrical and Computer Engineering Clemson University 22-C Riggs Hall, Clemson, SC 29634-095, USA Email: mfrankl@blessing.eng.clemson.edu

More information

Exploring different level of parallelism Instruction-level parallelism (ILP): how many of the operations/instructions in a computer program can be performed simultaneously 1. e = a + b 2. f = c + d 3.

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Computer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014

Computer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014 18-447 Computer Architecture Lecture 15: Load/Store Handling and Data Flow Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014 Lab 4 Heads Up Lab 4a out Branch handling and branch predictors

More information

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism

More information

EECS 452 Lecture 9 TLP Thread-Level Parallelism

EECS 452 Lecture 9 TLP Thread-Level Parallelism EECS 452 Lecture 9 TLP Thread-Level Parallelism Instructor: Gokhan Memik EECS Dept., Northwestern University The lecture is adapted from slides by Iris Bahar (Brown), James Hoe (CMU), and John Shen (CMU

More information

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Computer Sciences Department University of Wisconsin Madison http://www.cs.wisc.edu/~ericro/ericro.html ericro@cs.wisc.edu High-Performance

More information

Lecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University

Lecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University Lecture 16: Checkpointed Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 18-1 Announcements Reading for today: class notes Your main focus:

More information

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1

More information

Processor Architecture and Interconnect

Processor Architecture and Interconnect Processor Architecture and Interconnect What is Parallelism? Parallel processing is a term used to denote simultaneous computation in CPU for the purpose of measuring its computation speeds. Parallel Processing

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 Computer Architecture Spring 2016 Final Review Shuai Wang Department of Computer Science and Technology Nanjing University Computer Architecture Computer architecture, like other architecture, is the art

More information

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown

More information

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University EE382A Lecture 7: Dynamic Scheduling Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 7-1 Announcements Project proposal due on Wed 10/14 2-3 pages submitted

More information

Memory Consistency and Multiprocessor Performance

Memory Consistency and Multiprocessor Performance Memory Consistency Model Memory Consistency and Multiprocessor Performance Define memory correctness for parallel execution Execution appears to the that of some correct execution of some theoretical parallel

More information

CMSC 611: Advanced Computer Architecture

CMSC 611: Advanced Computer Architecture CMSC 611: Advanced Computer Architecture Instruction Level Parallelism Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson /

More information

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections ) Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target

More information

Multi-core Architectures. Dr. Yingwu Zhu

Multi-core Architectures. Dr. Yingwu Zhu Multi-core Architectures Dr. Yingwu Zhu Outline Parallel computing? Multi-core architectures Memory hierarchy Vs. SMT Cache coherence What is parallel computing? Using multiple processors in parallel to

More information

Memory Consistency and Multiprocessor Performance. Adapted from UCB CS252 S01, Copyright 2001 USB

Memory Consistency and Multiprocessor Performance. Adapted from UCB CS252 S01, Copyright 2001 USB Memory Consistency and Multiprocessor Performance Adapted from UCB CS252 S01, Copyright 2001 USB 1 Memory Consistency Model Define memory correctness for parallel execution Execution appears to the that

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering

More information

Improving the Performance of Speculatively Parallel Applications on the Hydra CMP

Improving the Performance of Speculatively Parallel Applications on the Hydra CMP Improving the Performance of Speculatively Parallel Applications on the Hydra CMP Kunle Olukotun, Lance Hammond and Mark Willey Computer Systems Laboratory Stanford University Stanford, CA 94305-4070 http://www-hydra.stanford.edu/

More information

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Increasing pipeline length eventually leads to diminishing returns longer pipelines take longer to re-fill data and control hazards lead to increased overheads, removing any a performance

More information

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5)

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3, 3.9, and Appendix C) Hardware

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?) Evolution of Processor Performance So far we examined static & dynamic techniques to improve the performance of single-issue (scalar) pipelined CPU designs including: static & dynamic scheduling, static

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

Instruction Level Parallelism (ILP)

Instruction Level Parallelism (ILP) 1 / 26 Instruction Level Parallelism (ILP) ILP: The simultaneous execution of multiple instructions from a program. While pipelining is a form of ILP, the general application of ILP goes much further into

More information

Chapter 4 The Processor 1. Chapter 4D. The Processor

Chapter 4 The Processor 1. Chapter 4D. The Processor Chapter 4 The Processor 1 Chapter 4D The Processor Chapter 4 The Processor 2 Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline

More information

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3) Hardware Speculation and Precise

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

CS533: Speculative Parallelization (Thread-Level Speculation)

CS533: Speculative Parallelization (Thread-Level Speculation) CS533: Speculative Parallelization (Thread-Level Speculation) Josep Torrellas University of Illinois in Urbana-Champaign March 5, 2015 Josep Torrellas (UIUC) CS533: Lecture 14 March 5, 2015 1 / 21 Concepts

More information

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring

More information

EITF20: Computer Architecture Part3.2.1: Pipeline - 3

EITF20: Computer Architecture Part3.2.1: Pipeline - 3 EITF20: Computer Architecture Part3.2.1: Pipeline - 3 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Dynamic scheduling - Tomasulo Superscalar, VLIW Speculation ILP limitations What we have done

More information

Speculation and Future-Generation Computer Architecture

Speculation and Future-Generation Computer Architecture Speculation and Future-Generation Computer Architecture University of Wisconsin Madison URL: http://www.cs.wisc.edu/~sohi Outline Computer architecture and speculation control, dependence, value speculation

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

Lecture 13: March 25

Lecture 13: March 25 CISC 879 Software Support for Multicore Architectures Spring 2007 Lecture 13: March 25 Lecturer: John Cavazos Scribe: Ying Yu 13.1. Bryan Youse-Optimization of Sparse Matrix-Vector Multiplication on Emerging

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

CMP Support for Large and Dependent Speculative Threads

CMP Support for Large and Dependent Speculative Threads TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS 1 CMP Support for Large and Dependent Speculative Threads Christopher B. Colohan, Anastassia Ailamaki, J. Gregory Steffan, Member, IEEE, and Todd C. Mowry

More information

Transactional Memory Coherence and Consistency

Transactional Memory Coherence and Consistency Transactional emory Coherence and Consistency all transactions, all the time Lance Hammond, Vicky Wong, ike Chen, rian D. Carlstrom, ohn D. Davis, en Hertzberg, anohar K. Prabhu, Honggo Wijaya, Christos

More information

CMSC 611: Advanced. Parallel Systems

CMSC 611: Advanced. Parallel Systems CMSC 611: Advanced Computer Architecture Parallel Systems Parallel Computers Definition: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems

More information

The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization

The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization The Potential for Using Thread-Level Data peculation to acilitate Automatic Parallelization J. Gregory teffan and Todd C. Mowry Department of Computer cience University http://www.cs.cmu.edu/~{steffan,tcm}

More information

EE382 Processor Design. Processor Issues for MP

EE382 Processor Design. Processor Issues for MP EE382 Processor Design Winter 1998 Chapter 8 Lectures Multiprocessors, Part I EE 382 Processor Design Winter 98/99 Michael Flynn 1 Processor Issues for MP Initialization Interrupts Virtual Memory TLB Coherency

More information

Computer Architecture: Multithreading (III) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multithreading (III) Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multithreading (III) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 13:

More information

Tutorial 11. Final Exam Review

Tutorial 11. Final Exam Review Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache

More information

Advanced issues in pipelining

Advanced issues in pipelining Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

EECS 470. Lecture 18. Simultaneous Multithreading. Fall 2018 Jon Beaumont

EECS 470. Lecture 18. Simultaneous Multithreading. Fall 2018 Jon Beaumont Lecture 18 Simultaneous Multithreading Fall 2018 Jon Beaumont http://www.eecs.umich.edu/courses/eecs470 Slides developed in part by Profs. Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi,

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Computer Architecture: Out-of-Order Execution II. Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Out-of-Order Execution II. Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Out-of-Order Execution II Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 15 Video

More information

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation TDT4255 Lecture 9: ILP and speculation Donn Morrison Department of Computer Science 2 Outline Textbook: Computer Architecture: A Quantitative Approach, 4th ed Section 2.6: Speculation Section 2.7: Multiple

More information

Last 2 Classes: Introduction to Operating Systems & C++ tutorial. Today: OS and Computer Architecture

Last 2 Classes: Introduction to Operating Systems & C++ tutorial. Today: OS and Computer Architecture Last 2 Classes: Introduction to Operating Systems & C++ tutorial User apps OS Virtual machine interface hardware physical machine interface An operating system is the interface between the user and the

More information

2 TEST: A Tracer for Extracting Speculative Threads

2 TEST: A Tracer for Extracting Speculative Threads EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath

More information