CS 152, Spring 2012 Section 8

Christopher Celio, University of California, Berkeley

Agenda
More Out-of-Order

Intel Core 2 Duo (Penryn) vs. NVidia GTX 280
Intel Core 2 Duo (Penryn): dual-core; 2007+; 45 nm; 410 million transistors; ~2 GHz; 3 or 6 MB of cache; 10-35 Watts; 107 mm² (each core is 22 mm², L2 SRAM is 6 mm²/MB)
NVidia GTX 280: 10 core(?) (240 stream processors); 2008; 65 nm; 1.4 billion transistors; 576 mm²; 602 MHz (core clock); 236 Watts!!!
http://3dimensionaljigsaw.wordpress.com/2008/06/18/physics-based-games-the-new-genre/

Quiz 2
Will be returned this Tuesday.

Out-of-Order Control Complexity: MIPS R10000
[Figure: die photo with the control logic highlighted. SGI/MIPS Technologies Inc., 1995]
(March 14, 2011, CS152, Spring 2011)

Out of Order Processors
Yeager. The MIPS R10000 Superscalar Microprocessor. IEEE Micro, 1996.

BOOM: A Single Issue Slot
Question 1: collect CPI with the BHT and without it; compare to the 5-stage in-order pipeline.
Question 2: probe the Instruction Window for the potential benefit of dual issue.
Question 3: probe the Instruction Window for dual issue of ALU/Mem ops.
[Figure: one issue slot. Fields: Val (issue slot is valid), UOP Code, BrMask, Ctrl... (control signals), RDst (physical destination register), RS1 and RS2 (physical source registers) with ready bits p1 and p2. Br Logic resolves or kills the slot. WDest0 and WDest1 (from the register file's two write ports) feed the "= =" comparators. Each slot sends request to the Issue Select Logic and receives issue; the selected uop is issued to the Register Read stage.]

BOOM: A Single Issue Slot
[Figure: one issue slot, annotated]
Val: issue slot is valid.
UOP Code: the uop holds the micro-op code (is it a LD, an ADD, etc.).
BrMask: each instruction gets a branch mask, which allows us to kill instructions (Br Logic: Resolve or Kill).
Ctrl...: control signals.
RDst: physical destination register.
RS1, RS2: physical source registers, each with a ready bit (p1, p2).
WDest0, WDest1 (from the register file's two write ports): the register file has two write ports, so watch both ports' write addresses ("= =" comparators on RS1 and RS2).
request: each slot asserts request when ready to fire.
issue: one slot gets the issue from the Issue Select Logic and is issued to the Register Read stage.
(Note: I show a bus implementation, but it's actually implemented with a bunch of muxes.)
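The slot behavior annotated above can be sketched in a few lines of Python. This is a rough behavioral model, not BOOM's actual hardware description: the class name `IssueSlot`, the method names, and the fixed-priority `select` policy are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class IssueSlot:
    """Behavioral model of one issue slot (field names follow the slide)."""
    valid: bool = False   # Val: slot holds a live uop
    uop: str = ""         # UOP Code: "LD", "ADD", ...
    br_mask: int = 0      # BrMask: one bit per unresolved older branch
    rdst: int = 0         # physical destination register
    rs1: int = 0          # physical source register 1
    rs2: int = 0          # physical source register 2
    p1: bool = False      # source 1 ready
    p2: bool = False      # source 2 ready

    def wakeup(self, wdest0, wdest1):
        """Watch both register-file write ports and mark matching sources ready."""
        for wdest in (wdest0, wdest1):
            if wdest is None:          # port idle this cycle
                continue
            if wdest == self.rs1:
                self.p1 = True
            if wdest == self.rs2:
                self.p2 = True

    def request(self):
        """A slot requests issue once it is valid and both sources are ready."""
        return self.valid and self.p1 and self.p2

    def resolve_branch(self, branch_bit, mispredicted):
        """Kill the slot on a mispredict it depends on; otherwise clear the bit."""
        if not (self.br_mask & branch_bit):
            return
        if mispredicted:
            self.valid = False
        else:
            self.br_mask &= ~branch_bit


def select(slots):
    """Assumed policy: grant issue to the lowest-numbered requesting slot."""
    for i, slot in enumerate(slots):
        if slot.request():
            slot.valid = False         # the uop leaves the window when issued
            return i
    return None
```

With two valid slots where only the second has both sources ready, `select` grants the second; after `wakeup` sees the first slot's missing source tag on a write port, the first slot requests and is granted on the next call.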

OOO Styles

Data-in-ROB Design (HP PA8000, Intel Pentium Pro, Core2 Duo & Nehalem)
Register file holds only committed state.
Reorder buffer entries (t1, t2, ..., tn) hold: Ins#, use, exec, op, p1, src1, p2, src2, pd, dest, data.
[Figure: the reorder buffer feeds the Load Unit, the FUs, and the Store Unit; results return as <t, result>; Commit writes the register file.]
On dispatch into the ROB, ready sources can be in the regfile or in a ROB dest field (copied into src1/src2 if ready before dispatch).
On completion, write to the dest field and broadcast to the src fields.
On issue, read from the ROB src fields.
(March 9, 2011, CS152, Spring 2011)

Unified Physical Register File (MIPS R10K, Alpha 21264, Intel Pentium 4 & Sandy Bridge)
Rename all architectural registers into a single physical register file during decode; no register values are read.
Functional units read and write a single unified register file that holds both committed and temporary registers during execute.
Commit only updates the mapping of architectural register to physical register; no data movement.
[Figure: Decode Stage Register Mapping; read operands at issue; Unified Physical Register File; Functional Units; write results at completion; Committed Register Mapping.]
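A minimal sketch of the unified-register-file bookkeeping described above, assuming a simple FIFO free list. The class name `Renamer` and its methods are hypothetical, and a real decoder would stall rather than raise when the free list is empty.

```python
from collections import deque


class Renamer:
    """Toy model: a decode-stage map, a committed map, and a free list."""

    def __init__(self, num_arch, num_phys):
        # Initially architectural register i lives in physical register i.
        self.spec_map = list(range(num_arch))        # decode-stage mapping
        self.committed_map = list(range(num_arch))   # committed mapping
        self.free_list = deque(range(num_arch, num_phys))

    def rename(self, dst, src1, src2):
        """Rename one instruction; only tags are looked up, no values are read."""
        p1 = self.spec_map[src1]
        p2 = self.spec_map[src2]
        pd = self.free_list.popleft()   # IndexError here stands in for a stall
        self.spec_map[dst] = pd
        return pd, p1, p2

    def commit(self, dst, pd):
        """Commit only updates the mapping; the old physical register is freed."""
        stale = self.committed_map[dst]
        self.committed_map[dst] = pd
        self.free_list.append(stale)
```

Note that `commit` never copies a data value: the result already sits in physical register `pd`, so retiring the instruction is just a pointer update plus returning the stale register to the free list.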

21264 Instruction Reordering
As mentioned earlier, the 21264 uses explicit renaming, as opposed to a data-in-ROB design.
What does the ROB hold?

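BOOM
Stages: Fetch, Decode, Rename, Dispatch, Issue, RegisterRead, Execute, Memory, WB
[Figure: pipeline diagram. Branch Prediction and Fetch feed the Fetch Buffer, then Decode, Register Rename, the Issue Window, the Unified Register File (2R, 2W), and the ALU. Br Logic resolves branches. The LAQ, SAQ, and SDQ connect to Data Mem (addr, wdata, rdata). The ROB drives Commit.]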

DEC Alpha 21264
1996/1997
single-core
4-way out-of-order
highly speculative
7-stage
up to 80 instructions in flight
tournament branch predictor
15.2M transistors (6M for logic; the rest is caching and history tables)
350 nm
600 MHz
64 KB I$, 64 KB D$ (on-chip)
1 to 16 MB L2$ (off-chip)
314 mm² die (fairly large)

21264 Register Renaming
Registers are renamed, then instructions are inserted into the issue queue (window).
The map table is backed up on every in-flight instruction.
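One way to picture "backed up on every in-flight instruction" is a snapshot of the map table taken as each instruction renames its destination. The sketch below is only an illustration under stated assumptions: the `MapTable` class, integer tags that increase in program order, and full-copy snapshots are all hypothetical, and the slide does not say how the 21264 stores its backups.

```python
class MapTable:
    """Toy rename map that snapshots itself before every rename."""

    def __init__(self, num_arch):
        self.table = list(range(num_arch))
        self.snapshots = {}   # tag -> copy of the table before that instruction

    def rename_dest(self, tag, arch_dst, phys_dst):
        """Save the pre-rename table under this instruction's tag, then update."""
        self.snapshots[tag] = list(self.table)
        self.table[arch_dst] = phys_dst

    def rollback(self, tag):
        """Squash instruction `tag` and everything younger (assumes ordered tags)."""
        self.table = self.snapshots[tag]
        self.snapshots = {t: s for t, s in self.snapshots.items() if t < tag}

    def retire(self, tag):
        """A retired instruction can no longer be squashed; drop its backup."""
        self.snapshots.pop(tag, None)
```

Rolling back to a tag restores the mapping in one step, which is the point of keeping a backup per instruction instead of walking the ROB to undo renames one at a time.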

21264 Register Renaming
What hazards does renaming obviate?
WAR, WAW
In what situations is renaming useful?
Code with ILP and name dependencies: loops
If you had to choose between branch prediction and renaming, which would you pick?
There is not much ILP within a basic block, so renaming isn't too useful without branch prediction.
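To make the WAR/WAW answer concrete, the sketch below compares register names pairwise before and after renaming. It is a simplified illustration: instructions are `(dst, (src, src))` tuples, the helper names `hazards` and `rename` are hypothetical, the hazard check is a naive name comparison, and the renamer assumes an unlimited supply of physical registers.

```python
def hazards(instrs):
    """Return {(kind, earlier, later)} for every pair that shares a register name."""
    found = set()
    for j, (dj, sj) in enumerate(instrs):
        for i in range(j):
            di, si = instrs[i]
            if di in sj:
                found.add(("RAW", i, j))   # later reads what earlier writes
            if dj in si:
                found.add(("WAR", i, j))   # later writes what earlier reads
            if dj == di:
                found.add(("WAW", i, j))   # both write the same name
    return found


def rename(instrs, num_arch=32):
    """Give every destination a fresh physical register (assumed unlimited)."""
    mapping = {r: r for r in range(num_arch)}
    next_phys = num_arch
    renamed = []
    for dst, srcs in instrs:
        new_srcs = tuple(mapping[s] for s in srcs)   # read current mappings first
        mapping[dst] = next_phys
        renamed.append((next_phys, new_srcs))
        next_phys += 1
    return renamed


# r1 = r2 + r3 ; r4 = r1 + r5 ; r1 = r6 + r7 ; r8 = r1 + r4
program = [(1, (2, 3)), (4, (1, 5)), (1, (6, 7)), (8, (1, 4))]
kinds_before = {kind for kind, _, _ in hazards(program)}
kinds_after = {kind for kind, _, _ in hazards(rename(program))}
```

In this four-instruction example the third instruction reuses `r1`, which creates a WAW with the first instruction and a WAR with the second. After renaming, `kinds_after` contains only `"RAW"`: the true data dependences remain and the name dependences are gone.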

21264 Superscalar Execution
The 21264 couldn't fit full bypassing into one clock cycle.
Instead, it fully bypasses within each of two clusters; an inter-cluster bypass takes another cycle.

Question: Stores
When are stores sent to memory?
At commit time.
Why are stores saved in a store buffer before commit time?
So they can be forwarded to dependent loads.
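The two answers above can be sketched as a tiny store buffer. This is an illustrative model under simplifying assumptions: every buffered store is older than the load being executed, all accesses are full aligned words, and store addresses are already known. The class name `StoreBuffer` and its methods are hypothetical.

```python
class StoreBuffer:
    """Toy store buffer: stores wait here until commit and forward to loads."""

    def __init__(self, memory):
        self.memory = memory    # dict: address -> word
        self.entries = []       # (addr, data), oldest first

    def execute_store(self, addr, data):
        """An executed store is buffered; memory is untouched until commit."""
        self.entries.append((addr, data))

    def execute_load(self, addr):
        """Forward from the youngest matching store, else read memory."""
        for st_addr, st_data in reversed(self.entries):
            if st_addr == addr:
                return st_data
        return self.memory.get(addr, 0)

    def commit_oldest_store(self):
        """At commit time the oldest store is finally sent to memory."""
        addr, data = self.entries.pop(0)
        self.memory[addr] = data
```

Because a speculative store never touches memory before commit, squashing it only requires dropping its buffer entry, while a dependent load still sees the new value through forwarding.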

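BOOM: LD/ST Unit
[Figure: LD/ST compare logic, only showing the comparison logic for one load. The SDQ holds val and data; the SAQ holds val and addr; the LAQ holds val, addr, and st_mask. Each SAQ address is compared ("= =") against the load address to produce st_addr_eq, which is combined with sta_val, std_val, ld_val, and st_mask in the LD/ST Compare block. Data Mem has addr, wdata, and rdata ports; rdata goes to the RF.]
ld_is_rdy: load is ready to fire.
ld_is_byp: load can be bypassed out of the SDQ.
byp_idx: location in the SDQ to get the load data from.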

BOOM
Stages: Fetch, Decode, Rename, Dispatch, Issue, RegisterRead, Execute, Memory, WB
[Figure: pipeline diagram with Branch Prediction, Br Logic (Resolve Branch), Fetch, Fetch Buffer, Decode, Register Rename, Issue Window, Unified Register File (2R, 2W), ALU, LAQ, SAQ, SDQ, Data Mem (addr, wdata, rdata), ROB, and Commit.]
single issue
6-stage
full branch speculation (BHT)
magic, 1-cycle memory (no caches)
no bypasses
no floating point
no exceptions

Memory Ordering in the 21264
To execute the critical instruction path quickly, we want to execute loads ASAP.
Initially, loads speculatively bypass stores.
On a misspeculation, set a wait bit for that load's PC, so it will behave conservatively from then on.
Clear the wait bits periodically.
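The wait-bit policy above can be sketched as a small PC-indexed table. The table size, the index function, and the names `StoreWaitTable`, `may_bypass_stores`, `report_misspeculation`, and `periodic_clear` are assumptions for illustration; the slide does not give the 21264's actual organization.

```python
class StoreWaitTable:
    """Toy wait-bit table: one bit per (hashed) load PC."""

    def __init__(self, size=1024):
        self.size = size                 # assumed table size
        self.wait = [False] * size

    def _index(self, pc):
        # Assumed hash: drop the byte offset of a 4-byte instruction, then wrap.
        return (pc >> 2) % self.size

    def may_bypass_stores(self, pc):
        """Loads speculate past older stores unless their wait bit is set."""
        return not self.wait[self._index(pc)]

    def report_misspeculation(self, pc):
        """A load that bypassed a conflicting store behaves conservatively next time."""
        self.wait[self._index(pc)] = True

    def periodic_clear(self):
        """Clear all bits so stale conservatism does not persist forever."""
        self.wait = [False] * self.size
```

The periodic clear matters because a load that conflicted once may never conflict again; without it, the table would gradually make every load wait.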

Speculation in the 21264
What does the 21264 speculate on?
Next I$ line/way
Branches, indirect jumps
Exceptions
Load/Store ordering
Load hit/miss (shortens hit time by a cycle)
Anything else?

Pentium
http://www.cs.clemson.edu/~mark/330/p6.html
[Figure: Pentium processor]

Questions?