CS252 Spring 2017 Graduate Computer Architecture. Lecture 6: Out-of-Order Processors

1 CS252 Spring 2017 Graduate Computer Architecture
Lecture 6: Out-of-Order Processors
Lisa Wu, Krste Asanovic
WU UCB CS252 SP17

3 Last Time in Lecture 5
- In-order completion vs. out-of-order completion
- Precise exception handling and precise interrupts
- ROB, History Buffer, Future File
- IBM Stretch
- Compiler scheduling instructions to eliminate control and data hazards
- Simple decoupled machine
- Amdahl's Law

4 Supercomputers
Definitions of a supercomputer:
- Fastest machine in world at given task
- A device to turn a compute-bound problem into an I/O-bound problem
- Any machine costing $30M+
- Any machine designed by Seymour Cray
CDC6600 (Cray, 1964) regarded as first supercomputer
Krste Asanovic,

5 CDC 6600 Seymour Cray, 1963
- A fast pipelined machine with 60-bit words
  - 128 Kword main memory capacity, 32 banks
- Ten functional units (parallel, unpipelined)
  - Floating Point: adder, 2 multipliers, divider
  - Integer: adder, 2 incrementers, ...
- Hardwired control (no microcoding)
- Scoreboard for dynamic scheduling of instructions
- Ten Peripheral Processors for Input/Output
  - a fast multi-threaded 12-bit integer ALU
- Very fast clock, 10 MHz (FP add in 4 clocks)
- >400,000 transistors, 750 sq. ft., 5 tons, 150 kW, novel freon-based technology for cooling
- Fastest machine in world for 5 years (until 7600)
  - over 100 sold ($7-10M each)
3/10/2009 Krste Asanovic,

6 CDC 6600: A Load/Store Architecture
- Separate instructions to manipulate three types of registers
  - 8 x 60-bit data registers (X)
  - 8 x 18-bit address registers (A)
  - 8 x 18-bit index registers (B)
- All arithmetic and logic instructions are register-to-register
    opcode i j k       Ri ← Rj op Rk
- Only Load and Store instructions refer to memory!
    opcode i j disp    Ri ← M[Rj + disp]
- Touching address registers 1 to 5 initiates a load, 6 to 7 initiates a store
  - very useful for vector operations
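
The address-register side effect on slide 6 can be illustrated with a small sketch. This is a toy model, not the lecture's material: the class name `CDC6600Regs`, the method `set_address`, and the dictionary standing in for central memory are all hypothetical, and the model assumes that writing A0 has no memory side effect.

```python
class CDC6600Regs:
    """Toy model of the CDC 6600 X (data) and A (address) registers."""

    def __init__(self, memory):
        self.X = [0] * 8      # 8 x 60-bit data registers
        self.A = [0] * 8      # 8 x 18-bit address registers
        self.memory = memory  # dict: address -> word (stand-in for central memory)

    def set_address(self, n, addr):
        """Write address register An and apply its load/store side effect."""
        self.A[n] = addr & 0x3FFFF            # keep 18 bits
        if 1 <= n <= 5:                       # A1..A5: load Xn from memory
            self.X[n] = self.memory.get(self.A[n], 0)
        elif n in (6, 7):                     # A6, A7: store Xn to memory
            self.memory[self.A[n]] = self.X[n]
        # A0: assumed to have no memory side effect


mem = {100: 7, 101: 35}
regs = CDC6600Regs(mem)
regs.set_address(1, 100)            # X1 <- M[100]
regs.set_address(2, 101)            # X2 <- M[101]
regs.X[6] = regs.X[1] + regs.X[2]   # register-to-register arithmetic
regs.set_address(6, 200)            # M[200] <- X6
```

In this model no explicit load or store opcode appears: memory traffic is triggered only by writing an address register, which is why software can set An early and use Xn later.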

7 CDC 6600: Datapath
[Figure: datapath diagram. Labeled blocks: Operand Regs (8 x 60-bit); Address Regs (8 x 18-bit); Index Regs (8 x 18-bit); 10 Functional Units; IR; Inst. Stack (8 x 60-bit); Central Memory (128K words, 32 banks, 1 µs cycle). Labeled paths: operand addr, result addr, operand, result.]

8 CDC6600 ISA designed to simplify high-performance implementation
- Use of three-address, register-register ALU instructions simplifies pipelined implementation
  - Only 3-bit register specifier fields checked for dependencies
  - No implicit dependencies between inputs and outputs
- Decoupling setting of address register (Ar) from retrieving value from data register (Xr) simplifies providing multiple outstanding memory accesses
  - Software can schedule load of address register before use of value
  - Can interleave independent instructions in-between
- CDC6600 has multiple parallel but unpipelined functional units
  - E.g., 2 separate multipliers
- Follow-on machine CDC7600 used pipelined functional units
  - Foreshadows later RISC designs

9 CDC6600 Scoreboard
- Instructions dispatched in-order to functional units provided no structural hazard or WAW
  - Stall on structural hazard, no functional units available
  - Only one pending write to any register
- Instructions wait for input operands (RAW hazards) before execution
  - Can execute out-of-order
- Instructions wait for output register to be read by preceding instructions (WAR)
  - Result held in functional unit until register free
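
The three waiting conditions on slide 9 can be written as three predicates. This is a minimal sketch, not the actual CDC 6600 scoreboard logic: the names `Instr`, `can_dispatch`, `can_execute`, and `can_write_result` are hypothetical, and the sets passed in are assumed to describe only earlier (older) instructions still in flight.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    unit: str     # functional unit needed, e.g. "fp_mul"
    dest: str     # destination register
    srcs: tuple   # source registers


def can_dispatch(instr, busy_units, pending_writes):
    """In-order dispatch: stall on a structural hazard or a WAW hazard."""
    if instr.unit in busy_units:        # structural: unit already in use
        return False
    if instr.dest in pending_writes:    # WAW: only one pending write per register
        return False
    return True


def can_execute(instr, pending_writes):
    """RAW: every source must have no pending write from an earlier instruction."""
    return all(src not in pending_writes for src in instr.srcs)


def can_write_result(instr, pending_reads):
    """WAR: earlier instructions must already have read the old destination value."""
    return instr.dest not in pending_reads


mul = Instr("fp_mul", "X3", ("X1", "X2"))
add = Instr("fp_add", "X4", ("X3", "X5"))

busy, pending = {mul.unit}, {mul.dest}            # state after mul has dispatched
add_dispatches = can_dispatch(add, busy, pending)  # different unit and destination
add_executes = can_execute(add, pending)           # X3 is still being produced
```

Here `add` can be dispatched right behind `mul` but cannot begin execution until `mul` has written X3, which is the RAW wait described above.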

10 [Figure not recovered; source credit: IBM]

11 Dynamic Scheduling
- Dynamic scheduling implies:
  - Out-of-order execution
  - Out-of-order completion
- Creates the possibility for WAR and WAW hazards
- Tomasulo's Approach
  - Tracks when operands are available
  - Introduces register renaming in hardware
    - Minimizes WAW and WAR hazards
Copyright 2012, Elsevier Inc. All rights reserved.

12 Register Renaming
Example:
    DIV.D F0,F2,F4
    ADD.D F6,F0,F8
    S.D   F6,0(R1)
    SUB.D F8,F10,F14
    MUL.D F6,F10,F8
Annotations on the code: antidependence (SUB.D writes F8, which ADD.D reads); antidependence (MUL.D writes F6, which S.D reads); plus a name dependence on F6 (ADD.D and MUL.D both write it)

13 Register Renaming
Example:
    DIV.D F0,F2,F4
    ADD.D S,F0,F8
    S.D   S,0(R1)
    SUB.D T,F10,F14
    MUL.D F6,F10,T
Now only RAW hazards remain, which can be strictly ordered
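
The renaming step on slides 12 and 13 can be sketched in software. This is an illustrative sketch only: the function `rename` and the names `P0`, `P1`, ... are hypothetical, and unlike the slide (which renames just two destinations to S and T) it gives every destination a fresh name, which is closer to what renaming hardware does.

```python
from itertools import count


def rename(program):
    """Give each destination a fresh name; sources read the latest mapping."""
    fresh = (f"P{i}" for i in count())
    mapping = {}                       # architectural register -> current name
    renamed = []
    for op, dest, srcs in program:
        # Look up sources before remapping the destination.
        new_srcs = tuple(mapping.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)
            mapping[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


# The example from the slide; S.D has no register destination.
program = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D",   None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]

renamed = rename(program)
for instr in renamed:
    print(instr)
```

After renaming, ADD.D still reads the original F8 while SUB.D writes a new name, and ADD.D and MUL.D write different names, so the WAR and WAW hazards are gone and only the RAW chains (DIV.D to ADD.D to S.D, and SUB.D to MUL.D) remain.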

14 Dynamic Scheduling
- Rearrange order of instructions to reduce stalls while maintaining data flow
- Advantages:
  - Compiler doesn't need to have knowledge of microarchitecture
  - Handles cases where dependencies are unknown at compile time
- Disadvantage:
  - Substantial increase in hardware complexity
  - Complicates exceptions

15 Out-of-Order Fades into Background
Out-of-order processing implemented commercially in 1960s, but disappeared again until 1990s as two major problems had to be solved:
- Precise traps
  - Imprecise traps complicate debugging and OS code
  - Note, precise interrupts are relatively easy to provide
- Branch prediction
  - Amount of exploitable instruction-level parallelism (ILP) limited by control hazards
Also, simpler machine designs in new technology beat complicated machines in old technology
  - Big advantage to fit processor & caches on one chip
  - Microprocessors had era of 1%/week performance scaling

16 Separating Completion from Commit
- Re-order buffer holds register results from completion until commit
  - Entries allocated in program order during decode
  - Buffers completed values and exception state until in-order commit point
  - Completed values can be used by dependents before committed (bypassing)
  - Each entry holds program counter, instruction type, destination register specifier and value if any, and exception status (info often compressed to save hardware)
- Memory reordering needs special data structures
  - Speculative store address and data buffers
  - Speculative load address and data buffers

17 In-Order Commit for Precise Traps
[Figure: pipeline diagram. Region labels: In-order, Out-of-order, In-order. Blocks: Fetch, Decode, Reorder Buffer, Execute, Commit. Other labels: Kill (three places), Inject handler PC, Trap?]
- In-order instruction fetch and decode, and dispatch to reservation stations inside reorder buffer
- Instructions issue from reservation stations out-of-order
- Out-of-order completion, values stored in temporary buffers
- Commit is in-order, checks for traps, and if none updates architectural state
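
The split between out-of-order completion and in-order commit can be sketched as a small reorder buffer model. This is a simplified illustration under stated assumptions, not the lecture's design: the names `ROBEntry`, `ReorderBuffer`, `allocate`, `complete`, and `commit` are hypothetical, instruction type and bypassing are left out, and a trap is modeled as simply discarding every younger entry.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class ROBEntry:
    pc: int
    dest: Optional[str]               # destination register specifier, if any
    value: Optional[int] = None       # result, filled in at completion
    done: bool = False                # has execution completed?
    exception: Optional[str] = None   # exception status


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()        # oldest instruction at the left
        self.arch_regs = {}           # committed architectural state

    def allocate(self, pc, dest):
        """Called in program order at decode."""
        entry = ROBEntry(pc, dest)
        self.entries.append(entry)
        return entry

    def complete(self, entry, value=None, exception=None):
        """Called out of order when execution finishes."""
        entry.value, entry.exception, entry.done = value, exception, True

    def commit(self):
        """Retire from the head in order; return the trapping PC, or None."""
        while self.entries and self.entries[0].done:
            head = self.entries.popleft()
            if head.exception is not None:
                self.entries.clear()   # kill all younger instructions
                return head.pc         # a real machine would inject the handler PC
            if head.dest is not None:
                self.arch_regs[head.dest] = head.value
        return None


rob = ReorderBuffer()
a = rob.allocate(0x100, "F0")
b = rob.allocate(0x104, "F6")
c = rob.allocate(0x108, "F8")

rob.complete(c, value=3)               # youngest instruction finishes first
first = rob.commit()                   # head not done yet: nothing commits

rob.complete(b, exception="overflow")
rob.complete(a, value=1)
trap_pc = rob.commit()                 # commits a, then traps at b and kills c
```

Although `c` completed first, its result never reaches `arch_regs`: the architectural state reflects exactly the instructions before the trapping one, which is what makes the trap precise.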

19 Acknowledgements
This course is partly inspired by previous MIT and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
- Arvind (MIT)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- David Patterson (UCB)
