Speculation and Future-Generation Computer Architecture


University of Wisconsin-Madison
URL: http://www.cs.wisc.edu/~sohi

Outline
- Computer architecture and speculation: control, dependence, and value speculation
- Multiscalar: a next-generation microarchitecture

Processing: Big Picture
- Sequence through the static representation of a program to create a dynamic stream of operations
- Execute the operations in the dynamic stream:
  - determine dependence relationships
  - schedule operations for execution
  - execute operations
  - communicate values

Role of the Computer Architect
- Use available technology to perform processing tasks
- Match processing tasks to hardware blocks constructed from the available technology
- Do so in a manner that is easy to design and verify

Constraints for the Architect
- Program ordering constraints, a.k.a. dependences:
  - control
  - artificial (i.e., name or storage)
  - ambiguous
  - true
- Increased latencies manifest as performance-degrading stalls through program dependences
- The architect needs to develop techniques to overcome dependences

Speculation and Computer Architecture
- Speculation: "to assume a business risk in hope of gain" (Webster)
- Speculation in computer architecture is used to try to overcome constraining conditions

Speculation and Computer Architecture
- Speculate on the outcome of an event rather than waiting for the outcome to be known
  - mis-speculation if wrong
  - mis-speculation can have a penalty
- Develop techniques to speculate better

Memory Hierarchies
- Technology trend: small and fast, or large and slow
- Can we get the speed of small with the appearance of large?
- Speculate: we are going to use the referenced data item again (temporal locality)
- Speculate: we are going to reference data items in proximity to the referenced item (spatial locality)
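
A minimal sketch of the bet a cache makes, assuming a direct-mapped cache with 32-byte blocks (the sizes and names here are illustrative, not from the talk): fetching a whole block on a miss is the spatial-locality speculation, and keeping the block resident for later reuse is the temporal-locality speculation.

    /* Direct-mapped cache lookup: a sketch of locality speculation. */
    #include <stdint.h>
    #include <string.h>

    #define BLOCK_BYTES 32u   /* whole-block fetch: spatial-locality bet    */
    #define NUM_SETS    256u  /* block kept resident: temporal-locality bet */

    struct line { uint32_t tag; int valid; uint8_t data[BLOCK_BYTES]; };
    static struct line cache[NUM_SETS];

    /* Returns 1 on a hit, 0 on a miss (after filling the block). */
    int cache_read(uint32_t addr, const uint8_t *memory, uint8_t *out)
    {
        uint32_t block  = addr / BLOCK_BYTES;
        uint32_t set    = block % NUM_SETS;
        uint32_t tag    = block / NUM_SETS;
        uint32_t offset = addr % BLOCK_BYTES;

        if (cache[set].valid && cache[set].tag == tag) {
            *out = cache[set].data[offset];   /* the speculation paid off */
            return 1;
        }
        /* Miss: bring in the whole surrounding block anyway. */
        memcpy(cache[set].data, memory + (size_t)block * BLOCK_BYTES, BLOCK_BYTES);
        cache[set].tag   = tag;
        cache[set].valid = 1;
        *out = cache[set].data[offset];
        return 0;
    }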

Processor Generations
- Generation 1, Generation 2, Generation 3, Generation 4, ...

Superscalar (Generation 3)
- Technology allows large execution engines to be built
- Feeding the execution engines is a problem:
  - branch instructions constrain program sequencing and instruction fetching
  - waiting for the branch outcome is wasteful
- Speculate on the outcome of a branch (control speculation)
- Use mechanisms to improve the accuracy of speculation: branch predictors
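
A minimal sketch of one such mechanism, the classic 2-bit saturating-counter branch predictor, indexed by low bits of the branch PC; the table size and function names are assumptions for illustration, not details from the talk.

    /* 2-bit saturating counters: 0,1 predict not-taken; 2,3 predict taken. */
    #include <stdint.h>

    #define PRED_ENTRIES 4096u

    static uint8_t counters[PRED_ENTRIES];

    int predict_taken(uint32_t pc)
    {
        return counters[(pc >> 2) % PRED_ENTRIES] >= 2;
    }

    void train(uint32_t pc, int taken)     /* called once the outcome is known */
    {
        uint8_t *c = &counters[(pc >> 2) % PRED_ENTRIES];
        if (taken  && *c < 3) (*c)++;      /* saturate at strongly taken     */
        if (!taken && *c > 0) (*c)--;      /* saturate at strongly not-taken */
    }

The hysteresis of the two-bit counter means a single atypical outcome (e.g., a loop exit) does not flip the prediction for the next iteration.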

Software or Hardware?
- Speculation can be done in both hardware and software
- Software speculation requires instruction set (ISA) support
  - problematic if an ISA change is not practical
  + requires little hardware support
- Hardware speculation does not require ISA support
  + better if business reasons preclude changing the ISA
  - requires more hardware

Hardware Speculation Mechanisms
- Need to give the appearance of sequential execution
- Need to recover from mis-speculations:
  - history buffers: keep a checkpoint, and back up to the checkpoint
  - reorder buffers: do not update (architectural) state until the speculation outcome is known
- Speculation improvement mechanisms: branch predictors
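
A minimal sketch of the reorder-buffer flavor of recovery (one destination register per entry, ring-buffer full/empty handling omitted, all names illustrative): results wait in the buffer, only in-order commit updates architectural state, so squashing the young end of the buffer recovers from a mis-speculation.

    /* Reorder buffer: speculative results are held until in-order commit. */
    #include <stdint.h>

    #define ROB_SIZE 64

    struct rob_entry {
        int dest_reg; int64_t value;
        int done;         /* result has been produced                       */
        int speculative;  /* cleared when all older speculation resolves
                             (resolution logic not modeled here)            */
    };

    static struct rob_entry rob[ROB_SIZE];
    static int head, tail;            /* commit from head, allocate at tail */
    static int64_t arch_regs[32];     /* architectural state */

    void commit(void)
    {
        /* Retire completed, no-longer-speculative entries in program order. */
        while (head != tail && rob[head].done && !rob[head].speculative) {
            arch_regs[rob[head].dest_reg] = rob[head].value;
            head = (head + 1) % ROB_SIZE;
        }
    }

    void squash_from(int bad)         /* entry 'bad' mis-speculated */
    {
        /* Discard 'bad' and everything younger; architectural state was
           never touched by them, so sequential appearance is preserved. */
        tail = bad;
    }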

Observation
- These mechanisms rely on the ability to give the appearance of sequential execution
- The appearance of sequential execution implies the appearance that no ordering constraints were violated
- The basic (recovery) mechanisms for supporting control speculation can be used to support arbitrary forms of speculative execution!

Out-of-Order Processing
[Figure: processing phases and the program forms they produce]
- static program -> (instruction fetch & branch prediction) -> dynamic instruction stream
- dynamic instruction stream -> (dependence checking & dispatch) -> execution window
- execution window -> (instruction issue, instruction execution) -> execution wavefront
- execution wavefront -> (instruction reorder & commit) -> completed instructions

Generic Superscalar Processor
[Figure: predecode, instruction cache, instruction buffer, register rename & dispatch; integer/address and floating-point instruction buffers; integer and floating-point register files; functional units and data cache; memory interface; re-order buffer]

Technology Trends
- Wires are used to pass values
- Wires are getting relatively slower
- Short wires are needed for a fast clock
- Short wires imply localized communication

Alpha 21264
[Figure: reservation stations feed two execution clusters, each with its own register-file copy (Reg File 0, Reg File 1) and two execution units; forwarding a value between clusters incurs one clock-period (CP) of delay]

Decentralized Scheduling
- The scheduler in one cluster needs dependence information for instructions in the other cluster
- Problematic for memory operations:
  - knowledge of the memory operations is needed
  - address calculations are needed
- Can we work without knowledge of memory operations/addresses, i.e., with ambiguous dependences?
- Use data dependence speculation

Scheduling Memory Operations
- Data dependence speculation is the default: predict no dependences
- Improving the accuracy of data dependence prediction is akin to branch prediction for control dependences
- Track the history of dependence mis-speculations:
  - a small number of static dependence pairs exhibit temporal locality
- Use the history for future data dependence speculation/synchronization decisions
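
A minimal sketch of history-based memory dependence prediction (table size and names are assumptions; a real design such as store sets also records which store a load must wait for): by default a load issues speculatively past earlier stores, and a load PC that has caused a violation is synchronized thereafter.

    /* Memory dependence prediction: "no dependence" until proven otherwise. */
    #include <stdint.h>

    #define DEP_ENTRIES 1024u

    static uint8_t caused_violation[DEP_ENTRIES];   /* indexed by load PC */

    int speculate_load(uint32_t load_pc)    /* 1 = issue past earlier stores */
    {
        return !caused_violation[(load_pc >> 2) % DEP_ENTRIES];
    }

    void report_violation(uint32_t load_pc) /* called on a mis-speculation */
    {
        caused_violation[(load_pc >> 2) % DEP_ENTRIES] = 1;
    }

Because the few static load/store pairs that conflict tend to do so repeatedly, even a tiny table like this captures most future violations.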

Overcoming True Dependences
- Break a true dependence using value speculation: predict the outcome of an operation (e.g., 32 bits)
- The mechanisms that support other forms of speculation (e.g., control speculation) can be used to recover
- Subject of significant current research
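
A minimal sketch of the simplest value predictor, a last-value predictor (sizes and names are illustrative): predict that an instruction produces the same value it produced last time, let consumers proceed, and rely on the usual recovery machinery if the prediction was wrong.

    /* Last-value prediction: break a true dependence by guessing the result. */
    #include <stdint.h>

    #define VP_ENTRIES 1024u

    static struct { uint32_t pc; int64_t last; int valid; } vp[VP_ENTRIES];

    int predict_value(uint32_t pc, int64_t *value)  /* 1 = prediction made */
    {
        unsigned i = (pc >> 2) % VP_ENTRIES;
        if (vp[i].valid && vp[i].pc == pc) { *value = vp[i].last; return 1; }
        return 0;
    }

    void update_value(uint32_t pc, int64_t actual)  /* train on the real result */
    {
        unsigned i = (pc >> 2) % VP_ENTRIES;
        vp[i].pc = pc; vp[i].last = actual; vp[i].valid = 1;
    }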

Multiscalar Processors
- A paradigm for Generation 4 processors
- Proposed in the early 1990s for the processors of the 2000s
- Employs thread-level speculation
- First commercial example of the concept: Space Time Computing (STC) in the Sun MAJC

Processing Hardware: Big Picture
- Start with a static representation of a program (figure: program divided into regions A, B, C)
- Sequence through the program to generate the dynamic stream of operations
  - use a single PC to walk through the static representation?
- Execute the operations in the dynamic stream:
  - schedule operations for execution
  - execute operations
  - communicate values

Basic Issues
- Sequencing
- Scheduling
- Operation execution
- Operand communication
How do we sequence, schedule, execute, and communicate in a more powerful manner?

Trends
- Lots of transistors available
- Wires getting slower
- Design complexity getting unwieldy
- Verification complexity getting unwieldy
- Software complexity getting unwieldy

Hardware Wish List
- Simplify engineering (design, verification, testing)
- Use simple, regular hardware structures
  - clock speeds comparable to single-issue processors
- Locality of interconnect
- Easy growth path from one generation to the next
  - reuse existing processing cores
- No centralized bottlenecks
- Still build a high-performance processor: speed up the execution of a single program

The Hardware-Influenced Solution
- Take the current-generation processor
- Replicate some parts, share others
- Result: the next-generation processor
- Different units can sequence, schedule, etc. in parallel
- BUT, the software problem...

The Software Problem
- Can't always break up a single program into independent chunks (i.e., multiple sequencers) statically:
  - control dependences
  - data dependences (especially ambiguous ones)
  - also load balance
- Can't map a program onto a rigid hardware model

Hardware/Software Cooperation
- Take a mostly sequential static program
- Use speculation to overcome dependence limitations: when in doubt, speculate
- Break up the program into potentially independent chunks dynamically

Sequencing
- Unraveling the operations to be executed dynamically
- Use two-level sequencing:
  - sequence at a high level in task-sized steps
  - sequence within a task (task == thread)
  - vectors?
- Use control flow speculation to increase sequencing power and overcome stalls

Scheduling
- Use multiple schedulers to improve scheduling power
- Use data dependence speculation to overcome the scheduling limitations of ambiguous dependences
- Use value speculation to overcome the scheduling limitations of true dependences

Operand Communication
- Values are bound to registers and memory
- Values are created speculatively
- Storage: where should values be buffered?
- Synchronization: an operation uses the value of the latest producer
- Communication: forwarding a created value to (future) consumers
- Create and exploit localities to reduce/simplify the interconnect!

Multiscalar Paradigm
- Break the sequencing process into two steps:
  - sequence through the static representation in task-sized steps
  - sequence through each task in the conventional manner
- Split the large instruction window into ordered tasks
- Assign a task to a simple execution engine; exploit parallelism by overlapping the execution of multiple tasks
- Use separate PCs to sequence through separate tasks
- Maintain the appearance of a single PC sequencing through the static representation
- Use control and data dependence speculation
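
A conceptual sketch of this two-level sequencing in C (this is not the hardware interface from the talk; the unit count, names, and squash policy are all illustrative): the sequencer predicts the next task and hands it to a free processing unit, tasks commit strictly in program order, and a violation squashes the offending task and everything younger.

    /* Multiscalar-style task sequencing, heavily simplified. */
    #define NUM_UNITS 4

    struct task { int id; int speculative; /* buffered state would live here */ };

    static struct task units[NUM_UNITS];
    static int head_task;   /* oldest, non-speculative task        */
    static int next_task;   /* next task to predict and assign     */

    void sequencer_step(void)
    {
        /* Control speculation: assign the predicted next task to a unit. */
        if (next_task - head_task < NUM_UNITS) {
            struct task *u = &units[next_task % NUM_UNITS];
            u->id = next_task++;
            u->speculative = 1;
        }
    }

    void commit_head(void)  /* head task done, all its speculation proved right */
    {
        /* Its buffered register/memory state becomes architectural. */
        units[head_task % NUM_UNITS].speculative = 0;
        head_task++;
    }

    void violation(int offending_task)  /* control or data mis-speculation */
    {
        /* Squash the offending task and all younger ones; the sequencer
           will re-predict and re-assign from that point. */
        next_task = offending_task;
    }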

What is a Task/Thread?
- A portion of the static representation that results in a contiguous portion of the dynamic instruction stream:
  - part of a basic block
  - a basic block
  - multiple basic blocks
  - a loop iteration
  - an entire loop
  - a procedure call, etc.

Multiscalar Big Picture: Basics
[Figure: the program is divided into tasks A, B, and C; task A runs on processing unit 1, the sequencer predicts B and assigns it to unit 2, then predicts C and assigns it to unit 3]

Multiscalar: Logical Hardware
[Figure: a sequencer feeds several processing units, each a pipeline with its own register file; the units share a memory disambiguation unit and the cache hierarchy]

Register Values
- Each core works out of its local register file
- The multiple register files act like separate renamed files
- Each register file contains the register state at a particular time in the (speculative) execution of the program

Memory Values
- Storage
- Synchronization
- Communication
- Versions

Scheduling Memory Operations
- Data dependence speculation is the default: predict no dependences
- Improving the accuracy of data dependence prediction is akin to branch prediction for control dependences
- Track the history of dependence mis-speculations:
  - a small number of static dependence pairs exhibit temporal locality
- Use the history for future data dependence speculation/synchronization decisions

Example: Problem
- Process a stream of tokens
- Create an entry in a list for each new token
- Use the information in the list to process each token

Example: C Code

    for (indx = 0; indx < BUFSIZE; indx++) {
        /* get the symbol for which to search */
        symbol = SYMVAL(buffer[indx]);

        /* do a linear search for the symbol in the list */
        for (list = listhd; list; list = LNEXT(list)) {
            /* if the symbol is already present, process the entry */
            if (symbol == LELE(list)) {
                process(list);
                break;
            }
        }

        /* if the symbol was not found, add it to the tail */
        if (!list) {
            addlist(symbol);
        }
    }

Example
- Each task is a complete list search
- Searches are usually independent and parallel
- Multiscalar can assume they are always independent
- The branches that separate tasks are predictable
- Branches within a task are unlikely to be 100% predictable
- A superscalar/VLIW is unlikely to be able to overlap the processing of different tokens

Example: Executable

    Targ spec: branch, branch; Targ1 = OUTER, Targ2 = OUTERFALLOUT
    Create mask: $4, $8, $17, $20, $23

    OUTER:          addu    $20, $20, 16
                    ld      $23, SYMVAL-16($20)      [forward bit]
                    move    $17, $21                 [forward bit]
                    beq     $17, $0, SKIPINNER
    INNER:          ld      $8, LELE($17)
                    bne     $8, $23, SKIPCALL
                    move    $4, $17
                    jal     process
                    j       INNERFALLOUT
    SKIPCALL:       ld      $17, NEXTLIST($17)
                    bne     $17, $0, INNER
    INNERFALLOUT:   release $8, $17
                    bne     $17, $0, SKIPINNER
                    move    $4, $23
                    jal     addlist
                    release $4
    SKIPINNER:      bne     $20, $16, OUTER          [stop bit: always]
    OUTERFALLOUT:

The task header gives the two possible successor tasks (both branch targets), the create mask lists the registers the task may produce, forward bits mark the last update of a forwarded register, release instructions indicate that registers in the create mask will not be written again on this path, and stop bits mark where the task ends. Going from one generation to another could leave the binary untouched!

Concluding Remarks - 1
- Speculation is a very important tool in the computer architect's toolset:
  - used to overcome performance-limiting constraints
  - control, dependence, value, other forms, ...
  - heavy use likely in future computer systems
- Wire delays will cause future microarchitectures to be distributed
- Distributed microarchitectures have many pluses:
  - natural support for threads
  - easier to design, verify, test?
  - fault tolerance

Concluding Remarks - 2
- The multiscalar model enables distributed execution of a sequential (or parallel) program
  - makes heavy use of dependence speculation
- The beginning of a new generation of microarchitectures
  - much work remains to be done
More info at www.cs.wisc.edu/~mscalar