
Spring 2010 Prof. Hyesoon Kim Thanks to Prof. Loh & Prof. Prvulovic

C/C++ program -> Compiler -> Assembly code (binary: 0010101010101011110) -> Processor

[Figure: basic machine organization from Introduction to Computing Systems (Patt & Patel): Memory (MAR, MDR), Processing Unit (ALU, TEMP), Control Unit (PC, IR), Input, Output.]

http://www.youtube.com/watch?v=_lm7acr5ysY&feature=related

[Figure: non-pipelined vs. pipelined execution of a three-instruction loop (0x0001: LD R1, MEM[R0]; 0x0002: ADD R2, R2, #1; 0x0003: BRZERO 0x0001) through the I-$, DEC, RF, D-$, WB stages. Non-pipelined: 15 cycles; pipelined: 7 cycles.]
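Assuming one cycle per stage and the five stages shown, the counts work out as: non-pipelined, 3 instructions x 5 cycles each = 15 cycles; pipelined, 5 cycles for the first instruction plus 1 cycle for each of the remaining 2, i.e. 5 + 2 = 7 cycles.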

Data Dependencies
RAW: Read-After-Write (true dependence)
WAR: Write-After-Read (anti-dependence)
WAW: Write-After-Write (output dependence)
Control dependence: when following instructions depend on the outcome of a previous branch/jump.

[Pipeline diagram, in program order: add r1,r2,r3; sub r4,r1,r3; and r6,r2,r7, each flowing through Ifetch, Reg, ALU, DMem, Reg. The and does not depend on the earlier instructions, so all of its sources are ready. Why not execute them early?]

Read-After-Write: A: R1 = R2 + R3; B: R4 = R1 * R4. [Value tables: in program order, A writes R1 = 7 and B then computes R4 = 21; if B runs before A, B reads the stale R1 = 5 and leaves R4 = 15, which is wrong.]
Write-After-Read: A: R1 = R3 / R4; B: R3 = R2 * R4. [If B runs before A, B overwrites R3 (with -6) before A reads it, so A computes the wrong R1.]
Write-After-Write: A: R1 = R2 + R3; B: R1 = R3 * R4. [In program order R1 ends up 27 (B's result); if B runs before A, A's stale result 7 is left in R1.]

WAR dependencies are from reusing registers: A: R1 = R3 / R4; B: R3 = R2 * R4. Rename B's destination from R3 to R5: A: R1 = R3 / R4; B: R5 = R2 * R4. [Value tables: with the rename there are no dependencies, and reordering still produces the correct results.]

WAW dependencies are also from reusing registers: A: R1 = R2 + R3; B: R1 = R3 * R4. Rename A's destination from R1 to R5: A: R5 = R2 + R3; B: R1 = R3 * R4. [Value tables: the same solution works; after renaming, reordering leaves the correct value in R1.]

Register renaming: give the processor more registers than specified by the ISA, and temporarily map the ISA registers (the "logical" or "architected" registers) to physical registers to avoid overwrites. Components: a mapping mechanism; physical registers (allocated vs. free); an allocation/deallocation mechanism.

Example: I3 cannot execute before I2, because I3 will overwrite R5, which I2 still needs to read. I5 cannot go before I2, because I2, when it goes, will overwrite R2 with a stale value.
Program code:
I1: ADD R1, R2, R3
I2: SUB R2, R1, R5
I3: AND R5, R11, R7
I4: OR R8, R5, R2
I5: XOR R2, R4, R11
(RAW: I2 on I1's R1. WAR: I3 on I2's read of R5. WAW: I5 on I2's write of R2.)

Solution: let's give I3 a temporary name/location (e.g., S) for the value it produces. But I4 uses that value, so we must also change that to S. In fact, all uses of R5 from I3 up to the next instruction that writes to R5 must now be changed to S! We remove WAW dependences in the same way: change R2 in I5 (and subsequent instructions) to T.
I1: ADD R1, R2, R3
I2: SUB R2, R1, R5
I3: AND S, R11, R7 (destination was R5)
I4: OR R8, S, R2 (source was R5)
I5: XOR T, R4, R11 (destination was R2)

Implementation: we need space for S, T, etc., and we need to know when to rename a register. Simple solution: do renaming for every instruction. Change the name of a register each time we decode an instruction that will write to it, and remember what name we gave it.
Program code:
I1: ADD R1, R2, R3
I2: SUB R2, R1, R5
I3: AND S, R11, R7
I4: OR R8, S, R2
I5: XOR T, R4, R11
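In code form, a minimal sketch of this rename-for-every-instruction idea might look like the C fragment below (the array sizes and names are illustrative, not the lecture's design): a RAT maps each architected register to its current physical tag, and a free list hands out a fresh tag for every destination write.

```c
#include <assert.h>

#define NUM_ARCH_REGS 32
#define NUM_PHYS_REGS 64

static int rat[NUM_ARCH_REGS];        /* arch reg -> current physical tag   */
static int free_list[NUM_PHYS_REGS];  /* stack of unallocated physical tags */
static int free_top;

void rename_init(void)
{
    /* Initially, arch reg i lives in physical reg i; the rest are free. */
    for (int i = 0; i < NUM_ARCH_REGS; i++) rat[i] = i;
    free_top = 0;
    for (int p = NUM_ARCH_REGS; p < NUM_PHYS_REGS; p++)
        free_list[free_top++] = p;
}

/* Rename one instruction "dst = src1 op src2". */
void rename_inst(int dst, int src1, int src2,
                 int *pdst, int *psrc1, int *psrc2)
{
    *psrc1 = rat[src1];               /* sources read the current mapping   */
    *psrc2 = rat[src2];
    assert(free_top > 0);             /* real hardware would stall here     */
    *pdst = free_list[--free_top];    /* every write gets a fresh tag       */
    rat[dst] = *pdst;                 /* (this is the S, T renaming above)  */
}
```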

We need some physical structures to store the register values and the mapping:
ARF (Architected Register File): the outside world sees the ARF.
RAT (Register Alias Table): tracks where each architected register's latest value lives.
PRF (Physical Register File): one PREG per instruction in-flight.

The reorder buffer (ROB) separates architected vs. physical registers, tracks the program order of all in-flight instructions, and enables in-order completion (commit).

[Figure: machine organization. Instruction buffers feed dispatch, which reads/updates the RAT and the Architected Register File; reservation stations (fields: op, Qj, Qk, Vj, Vk) feed the Add and Mult units; the ROB keeps one entry per in-flight instruction (fields: type, dest, value, fin) with a head pointer.]

Dispatch: read the instruction from the instruction buffer. Check if resources are available: an appropriate RS entry and a ROB entry. Read the RAT, read the (available) sources, and update the RAT. Then write to the RS and the ROB.
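Below is a rough C sketch of this dispatch step, under the assumption that the ROB entry index doubles as the rename tag (all structure and field names are illustrative, loosely following the op/Qj/Qk/Vj/Vk fields in the figure):

```c
#define N_ARCH 32
#define N_RS   6
#define N_ROB  16

typedef struct {
    int busy, op;
    int Qj, Qk;     /* ROB tags being waited on; -1 means value is in Vj/Vk */
    long Vj, Vk;    /* operand values once available                        */
    int dest_rob;   /* ROB entry that will receive this result              */
} RS;

typedef struct { int valid, type, dest, fin; long value; } ROBEntry;

RS rs[N_RS];
ROBEntry rob[N_ROB];
int rob_tail, rob_count;

long arf[N_ARCH];   /* official architected state                           */
int  rat[N_ARCH];   /* ROB tag of in-flight producer, or -1 (value in ARF)  */

/* Source read: if the newest writer is still in flight, return its tag;
 * otherwise deliver the value (from the ROB if finished, else the ARF). */
static int read_src(int areg, long *v)
{
    int t = rat[areg];
    if (t < 0)      { *v = arf[areg];    return -1; }
    if (rob[t].fin) { *v = rob[t].value; return -1; }
    return t;                       /* still in flight: wait on the CDB */
}

/* Dispatch: returns 0 (stall) unless both an RS and a ROB entry are free. */
int dispatch(int op, int type, int dst, int src1, int src2)
{
    int r;
    for (r = 0; r < N_RS && rs[r].busy; r++) ;
    if (r == N_RS || rob_count == N_ROB) return 0;   /* no resources */

    int tag = rob_tail;                              /* allocate ROB entry */
    rob[tag] = (ROBEntry){ 1, type, dst, 0, 0 };
    rob_tail = (rob_tail + 1) % N_ROB;
    rob_count++;

    RS *e = &rs[r];                                  /* fill the RS entry */
    e->busy = 1; e->op = op; e->dest_rob = tag;
    e->Qj = read_src(src1, &e->Vj);
    e->Qk = read_src(src2, &e->Vk);
    rat[dst] = tag;                                  /* update the RAT */
    return 1;
}
```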

Execute: same as before. Wait for all operands to arrive, compete to use the functional unit, and execute!

Writeback: broadcast the result on the CDB (any dependents will grab the value) and write the result back to your ROB entry; the ARF holds the official register state, which we will only update in program order. Mark the ready/finished bit in the ROB (noting that this instruction has completed execution). The reservation station can now be freed.
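Continuing that sketch, the CDB broadcast could look like this (again illustrative, reusing the rs[] and rob[] structures from the dispatch sketch):

```c
/* Broadcast (tag, value) on the CDB: every waiting RS grabs the value,
 * the producer's ROB entry is written and marked finished, and the
 * producing RS entry is freed. Note we do NOT touch the ARF here. */
void cdb_broadcast(int tag, long value)
{
    for (int i = 0; i < N_RS; i++) {
        if (!rs[i].busy) continue;
        if (rs[i].Qj == tag) { rs[i].Vj = value; rs[i].Qj = -1; }
        if (rs[i].Qk == tag) { rs[i].Vk = value; rs[i].Qk = -1; }
        if (rs[i].dest_rob == tag) rs[i].busy = 0;  /* free producer RS */
    }
    rob[tag].value = value;   /* write back to your own ROB entry        */
    rob[tag].fin   = 1;       /* ready/finished: now eligible to commit  */
}
```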

Commit: when an instruction is the oldest in the ROB, i.e., the ROB head points to it, write its result (if the ready/finished bit is set): if it is a register-producing instruction, write to the architected register file; if it is a store, write to memory. (Q: what about a load?) Then advance the ROB head to the next instruction. This is what the outside world sees, and it's all in-order.
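And commit, in the same illustrative style (TYPE_REG/TYPE_STORE and store_to_memory are assumed stand-ins, not names from the lecture):

```c
int rob_head;   /* index of the oldest in-flight instruction */

enum { TYPE_REG = 0, TYPE_STORE = 1 };   /* assumed 'type' encodings */

static void store_to_memory(int addr, long value)  /* stand-in stub */
{
    (void)addr; (void)value;
}

/* Retire at most the oldest instruction, and only once it has finished,
 * so the ARF and memory are updated strictly in program order. */
void try_commit(void)
{
    ROBEntry *e = &rob[rob_head];
    if (rob_count == 0 || !e->fin) return;        /* oldest not done yet */
    if (e->type == TYPE_REG) {
        arf[e->dest] = e->value;                  /* architected update  */
        if (rat[e->dest] == rob_head)
            rat[e->dest] = -1;                    /* latest value in ARF */
    } else if (e->type == TYPE_STORE) {
        store_to_memory(e->dest, e->value);
    }
    e->valid = 0;
    rob_head = (rob_head + 1) % N_ROB;            /* advance the ROB head */
    rob_count--;
}
```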

Commit makes instruction execution visible to the outside world by committing the changes to the architected state. [Figure: a ROB holding entries A through K; results are written back to the ARF in order, so the outside world sees A executed, B executed, C executed, D executed, E executed.] Instructions execute out of program order, but the outside world still believes it's in-order.

Single thread in superscalar execution: dependences cause most of the stalls. Idea: when one thread is stalled, another can go. Different granularities of multithreading:
Coarse MT: can change thread every few cycles.
Fine MT: can change thread every cycle.
Simultaneous Multithreading (SMT): instructions from different threads can issue even in the same cycle. AKA Hyperthreading.

Uni-processor: 4-6 wide, and you are lucky if you get 1 IPC, so utilization is poor. SMP: 2-4 CPUs, but you need independent tasks, else utilization is poor as well. SMT: the idea is to use a single large uni-processor as a multi-processor.

[Figure: a regular CPU, an SMT (4 threads) at approximately 1x the hardware cost, and a CMP at 2x the hardware cost.]

For an N-way (N-thread) SMT, we need: the ability to fetch from N threads; N sets of architectural registers (including PCs); N rename tables (RATs); and N virtual memory spaces. Front-end: branch predictor? No. RAS? Yes. But we don't need to replicate the entire OOO execution engine (schedulers, execution units, bypass networks, ROBs, etc.).
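As a sketch of what that replication means in practice, here is one way the per-thread state could be laid out in C (sizes and names are illustrative):

```c
#define N_THREADS 4
#define N_ARCH    32

struct ThreadState {                  /* replicated once per thread        */
    unsigned long pc;                 /* per-thread PC                     */
    long arch_regs[N_ARCH];           /* per-thread architectural regs     */
    int  rat[N_ARCH];                 /* per-thread rename table (RAT)     */
    unsigned long ras[16];            /* per-thread return-address stack   */
    int ras_top;
};

struct ThreadState threads[N_THREADS];

/* The schedulers, execution units, bypass networks, physical register
 * file, and ROB are NOT replicated: all threads share them. */
```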

Multiplex the fetch logic: PC0, PC1, PC2, ... share one I$ and decoder, with a thread selected each cycle (cycle % N). We can do simple round-robin between active threads, or favor some over the others based on how much each is stalling relative to the others.
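A minimal sketch of that selection policy, assuming a per-thread stall flag (names are hypothetical):

```c
/* Pick which thread fetches this cycle: start from cycle % N (pure
 * round-robin) and skip any thread that is currently stalled. */
int pick_fetch_thread(unsigned long cycle, const int stalled[], int n)
{
    for (int i = 0; i < n; i++) {
        int t = (int)((cycle + i) % (unsigned long)n);
        if (!stalled[t]) return t;    /* first non-stalled thread wins */
    }
    return -1;                        /* everyone stalled: fetch nothing */
}
```

A fancier policy would bias the choice toward the thread that has been stalling least, as the slide suggests.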

Thread #1's R12 != Thread #2's R12: the threads have separate name spaces, and we need to disambiguate them. Each thread's register number indexes its own rename table (RAT 0 for thread 0, RAT 1 for thread 1), and both tables map into the shared PRF.
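In code, the disambiguation is just one extra index (a sketch reusing the hypothetical threads[] layout above):

```c
/* Thread 0's R12 and thread 1's R12 look up different RATs, so they map
 * to different physical registers even though the register number matches. */
int rename_lookup(int thread, int arch_reg)
{
    return threads[thread].rat[arch_reg];   /* RAT 0 vs. RAT 1 */
}
```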

No change is needed to the renaming hardware. Both threads run the same code:
Add R1 = R2 + R3; Sub R4 = R1 - R5; Xor R3 = R1 ^ R4; Load R2 = 0[R3]
After renaming:
Thread 0: Add T12 = T20 + T8; Sub T13 = T12 - T16; Xor T14 = T12 ^ T13; Load T23 = 0[T14]
Thread 1: Add T17 = T2 + T3; Sub T5 = T17 - T2; Xor T31 = T17 ^ T5; Load T25 = 0[T31]
Shared RS entries (instructions from both threads intermixed): Sub T5 = T17 - T2; Add T12 = T20 + T8; Load T25 = 0[T31]; Xor T14 = T12 ^ T13; Load T23 = 0[T14]; Sub T13 = T12 - T16; Xor T31 = T17 ^ T5; Add T17 = T2 + T3.

Register file management: with an ARF/PRF organization, we need one ARF per thread. We also need to maintain interrupts, exceptions, and faults on a per-thread basis: just as OOO needs to appear to the outside world to be in-order, SMT needs to appear as if it were actually N CPUs.