

[Figure: block diagram of a simple pipeline. Stages: Update Instruction Address (IA), Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (ME), Writeback Results (WB). Components: Program Counter, PC Update, Instruction Cache, Instruction Register, Decoder, Register File, Execution Units, Data Cache, MUXes, Forwarding Paths.]

r1 ← r2 + r3
r3 ← mem[r1 + 8]
r5 ← r3 - r4

Figure 2. Three instructions flow down a pipeline; the first forwards data to the second (as shown with an arrow). The second instruction first stalls for one cycle, then forwards data to the third.

[Figure: VLIW processor block diagram. Components: Instr. Fetch, Decode, Functional Units, Memory Access units, Register File.]

[Figure: out-of-order superscalar processor block diagram. Components: Branch Predictor, I-cache, Fetch buffer, Decode Pipeline, Register Rename, Issue Buffer, Exec Units, Physical Register File(s), Load Queue, Store Queue, L1 Data Cache, MSHRs, L2 Cache (to main memory), Reorder Buffer (Window).]

[Figure: reorder buffer organization. Each entry holds Exceptions, Reg Mapping, Prog Counter, Store, and Complete fields. Annotations: reserve entry at tail when dispatched; place exceptions in entry; mark entry when complete; remove from head if complete; STOP if exception present.]

Figure 5. Reorder buffer; instructions are dispatched into the tail and exit from the head only after they have completed.
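The annotations in Figure 5 can be read as a small queue discipline. The following is a minimal sketch of that discipline only, assuming a software queue stands in for the hardware buffer; the names `RobEntry`, `ReorderBuffer`, `dispatch`, and `commit` are hypothetical and do not come from the source.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RobEntry:
    pc: int
    complete: bool = False   # marked when the instruction finishes executing
    exception: bool = False  # exceptions are recorded in the entry, not taken at once


class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()  # left end is the head, right end is the tail

    def dispatch(self, pc):
        """Reserve an entry at the tail; return None if the buffer is full."""
        if len(self.entries) >= self.size:
            return None
        entry = RobEntry(pc)
        self.entries.append(entry)
        return entry

    def commit(self):
        """Remove completed entries from the head, in order.

        Stops at the first incomplete entry, or at a completed entry that
        carries an exception (which stays at the head for recovery).
        """
        committed = []
        while self.entries and self.entries[0].complete:
            if self.entries[0].exception:
                break
            committed.append(self.entries.popleft().pc)
        return committed


rob = ReorderBuffer(size=4)
a, b, c = rob.dispatch(0x50), rob.dispatch(0x54), rob.dispatch(0x58)
c.complete = True          # completes out of order
first = rob.commit()       # nothing leaves: the head is not complete yet
a.complete = True
b.complete = True
second = rob.commit()      # all three leave in program order
```

In this sketch the youngest instruction completes first but cannot leave until the two older ones are done, which is the in-order commit property the caption describes.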

Static instructions:

loop: r3 ← mem(r4+r2)
      r7 ← mem(r5+r2)
      r7 ← r7 * r3
      r1 ← r1 - 1
      mem(r6+r2) ← r7
      r2 ← r2 + 8
      PC ← loop; r1 != 0

[Figure: after Branch Predict & Fetch, the dynamic instruction stream repeats the loop body above once per iteration.]

Figure 6. A block of instructions (on the left) is fetched (with the benefit of branch prediction) to form the dynamic instruction stream shown at right. A branch instruction appears as an assignment to the program counter (PC).

Instructions to rename:
  r3 ← mem(r4+r2)
  r7 ← mem(r5+r2)

(a) Register map: r1→p3, r2→p4, r3→p6, r4→p1, r5→p2, r6→p7, r7→p5
    After source lookup: r3 ← mem(p1+p4)
(b) Register map: r1→p3, r2→p4, r3→p8, r4→p1, r5→p2, r6→p7, r7→p5
    Renamed: p8 ← mem(p1+p4); p9 ← mem(p2+p4)
    Free pool: p8, p9, p10, p11, p12, p13, p14, p15, p16, p17, p18, p19, p20, p21, p22, p23, p24

Figure 7. The register renaming process: (a) first, source registers access the logical-to-physical register map to find their current mappings; (b) then the first physical register in the free pool is assigned to the result register and the register map table is updated.
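The two renaming steps of Figure 7 can be sketched in a few lines. This is an illustrative model only: the function name `rename` and the tuple encoding of an instruction are assumptions made here, while the register map and free pool contents are taken from the figure.

```python
from collections import deque


def rename(instr, reg_map, free_pool):
    """Rename one instruction given as (destination, [source registers])."""
    dest, sources = instr
    # Step (a): sources read their current logical-to-physical mappings.
    renamed_sources = [reg_map[s] for s in sources]
    # Step (b): the destination takes the first free physical register,
    # and the map is updated so later readers see the new mapping.
    new_phys = free_pool.popleft()
    reg_map[dest] = new_phys
    return new_phys, renamed_sources


reg_map = {"r1": "p3", "r2": "p4", "r3": "p6", "r4": "p1",
           "r5": "p2", "r6": "p7", "r7": "p5"}
free_pool = deque(f"p{i}" for i in range(8, 25))  # p8 .. p24

first = rename(("r3", ["r4", "r2"]), reg_map, free_pool)   # r3 <- mem(r4+r2)
second = rename(("r7", ["r5", "r2"]), reg_map, free_pool)  # r7 <- mem(r5+r2)
```

Sources are looked up before the destination is remapped, so an instruction such as r7 ← r7 * r3 would read the old mapping of r7 and write a new one.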

[Table: renamed instruction stream for three iterations of the loop (destination registers p8 through p22), with the dispatch, issue, complete, and commit cycle of each instruction.]

Figure 8. Three iterations of the example instruction stream after renaming. Dispatch, issue, complete, and commit cycles illustrate out-of-order instruction issue and in-order instruction commit.

Register map before recovery: r1→p21, r2→p22, r3→p18, r4→p1, r5→p2, r6→p7, r7→p20
Register map after recovery:  r1→p16, r2→p17, r3→p18, r4→p1, r5→p2, r6→p7, r7→p14
ROB entries, tail to head (Reg Mapping, Prog Counter): r2 p17 at 6C; r1 p16; r7 p19 at 60; r7 p14 at 5C (head). A pending store also sits in the ROB.
Restored program counter: 5C

Figure 9. Example of ROB restoring architected state after an exception. The instruction at the head of the ROB has an exception. The register mapping and PC are backed up, and the pending store instruction is flushed.
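The recovery in Figure 9 can be sketched as a walk over the ROB from tail to head, assuming each entry records the previous physical mapping of its destination register. The function name `recover` is hypothetical, and the program counters 0x64 and 0x68 are assumed values filled in for illustration (only 5C, 60, and 6C are legible in the figure).

```python
def recover(reg_map, rob_entries):
    """Undo speculative renames by walking the ROB from tail back to head.

    rob_entries is ordered head to tail; each entry is
    (logical register or None, previous physical register, pc).
    """
    restored = dict(reg_map)
    for logical, previous_phys, _pc in reversed(rob_entries):
        if logical is not None:        # a store has no destination register
            restored[logical] = previous_phys
    restart_pc = rob_entries[0][2]     # PC of the excepting head instruction
    return restored, restart_pc


current_map = {"r1": "p21", "r2": "p22", "r3": "p18", "r4": "p1",
               "r5": "p2", "r6": "p7", "r7": "p20"}
rob = [
    ("r7", "p14", 0x5C),   # head: instruction with the exception
    ("r7", "p19", 0x60),
    ("r1", "p16", 0x64),   # PC assumed for illustration
    (None, None, 0x68),    # pending store, PC assumed for illustration
    ("r2", "p17", 0x6C),   # tail
]
restored_map, restart_pc = recover(current_map, rob)
```

Walking from the tail matters when one logical register was renamed more than once: r7 is first set back to p19 and then to p14, which is the mapping shown in the restored register map.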

[Figure: L1 data cache and buffering subsystem. Components: Instruction Issue, Address Generation, TLB, Load Address Buffer, Address Compare, Store Queue (Pending, Complete, Commit; store addresses and store data; Store Commit from ROB), Store Buffers (Coalescing), L1 Data Cache (hit/miss; Data on hit), MSHRs (Miss to Memory; Data from Memory), MUXes (Store-to-Load Forward Data; Data to Processor).]

Figure 10. L1 data cache and buffering subsystem that allow load/store reordering with forwarding of load data.

[Figure: load/store buffering and comparison logic. The Load Address Queue holds each load address with a Store Queue (SQ) tag and a pending bit; Store Queue entries hold address, data, and a valid bit. Compare 1: tag match & pending (Enable Forward 1, Store-to-Load Forward Data 1). Compare 2: address match & valid (Enable Forward 2, Store-to-Load Forward Data 2). Inputs: addresses from Address Generation Logic, data tag from pipeline control logic, store data from execution units, Commit from ROB.]

Figure 11. Detailed drawing of load/store buffering and comparison logic.
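The address comparison behind store-to-load forwarding in Figures 10 and 11 can be sketched as a search over older stores. This is a simplified software model under stated assumptions: the names `StoreEntry` and `issue_load` are hypothetical, the store list holds only stores older than the load, and a load waits whenever an older store address is still unknown (a conservative choice; it does not model speculative load issue).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoreEntry:
    address: Optional[int]  # None until address generation has finished
    data: Optional[int]     # None until the store data is valid


def issue_load(address, older_stores, cache):
    """Return (value, source) for a load; older_stores is oldest to youngest."""
    for store in reversed(older_stores):   # check the youngest older store first
        if store.address is None:
            return None, "stall"           # unknown address: wait
        if store.address == address:
            if store.data is None:
                return None, "stall"       # matching store, data not ready yet
            return store.data, "forward"   # address match and valid data
    return cache.get(address, 0), "cache"  # no match: read the data cache


store_queue = [StoreEntry(0x100, 7), StoreEntry(0x200, 9), StoreEntry(0x100, 11)]
forwarded = issue_load(0x100, store_queue, {})
from_cache = issue_load(0x300, store_queue, {0x300: 5})
stalled = issue_load(0x100, [StoreEntry(None, None)], {0x100: 1})
```

When two older stores match the load address, the sketch forwards from the younger of the two, since that is the value the load would observe in program order.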

[Figure: speculative load issue logic. The Load Address Queue holds each load address with an SQ tag and a forwarded bit; Store Queue entries hold address, data, and a valid bit. Compare 3: address match & not forwarded, which triggers flush/restart. Inputs: addresses from Address Generation Logic, data tag from pipeline control logic, store data from execution units, Commit from ROB.]

Figure 12. Portion of load/store unit that implements speculative issuing of load instructions before prior store addresses are known.

[Figure: scaling relationships among superscalar resources. I-Fetch Resources (achieved width), Commit Width, and Numbers of Functional Units: linear relationship with Issue Width. ROB Size: approximately quadratic relationship with Issue Width. Issue Buffer Size, number of Rename Registers, and Load/Store Buffer Sizes: linear relationships with ROB Size.]

Table 1. The relationship between window size (ROB) and issue width for some real processors. Columns: Processor, Reorder Buffer Size, Issue Buffer Size, Issue Width. Processors listed: Intel Core, IBM Power4, MIPS R10000, Intel PentiumPro, Alpha, AMD Opteron, HP PA-8000, Intel Pentium 4.

[Figure: plots of log2(ROB) versus log2(Issue Width) and of Issue Buffer Size versus ROB Size.]

THE 6600 BARREL AND SLOT

[Figure: I/O Programs in Barrel. Each barrel position holds a PC and registers reg0 through regn-1. The SLOT provides time-shared instruction control and the ALU. Memory Latency = 1 Barrel Rotation.]

Figure 14. CDC 6600 Barrel and Slot multi-threading.

DENELCOR HEP

[Figure: components: PSW Queue, Instruction Memory, Register Memory (reg addresses, operands), Function Units (opcode), Increment, Delay, Scheduler Function Unit, Main Memory, PSW Buffer. Paths: nonmemory insts, memory insts, pending memory, results.]

Figure 15. Block diagram of the Denelcor HEP.

[Figure: pipeline stages: I-Fetch, Uop Queue, Rename, Queue, Sched, Register Read, Execute, L1 Cache, Register Write, Commit. Resources: Prog Counters, Trace Cache, Allocate, Registers, Data Cache, Store Buffer, Reorder Buffer.]

Figure 16. Intel Pentium 4 hyperthreading.

[Figure: the lookup address consists of TId, tag, and offset; each entry holds V (valid), TId, tag, and data; a comparator (==) on TId and tag produces hit/miss.]

Figure 17. A thread identifier (TId) separates the entries belonging to different threads in a shared buffer or memory.
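The tag comparison of Figure 17 can be sketched as a direct-mapped lookup in which the thread identifier is compared together with the tag. The class name `SharedBuffer`, the direct-mapped organization, and the way the address is split into index and tag are assumptions made for this illustration.

```python
class SharedBuffer:
    """Direct-mapped buffer shared by several threads, tagged with a thread id."""

    def __init__(self, num_sets):
        self.num_sets = num_sets
        # Each slot is None (invalid) or a (tid, tag, data) tuple.
        self.slots = [None] * num_sets

    def _split(self, address):
        return address % self.num_sets, address // self.num_sets  # index, tag

    def fill(self, tid, address, data):
        index, tag = self._split(address)
        self.slots[index] = (tid, tag, data)

    def lookup(self, tid, address):
        """Return the data on a hit, or None on a miss."""
        index, tag = self._split(address)
        entry = self.slots[index]
        # Hit only if the entry is valid and both the TId and the tag match.
        if entry is not None and entry[0] == tid and entry[1] == tag:
            return entry[2]
        return None


buf = SharedBuffer(num_sets=8)
buf.fill(tid=0, address=0x40, data="A")
same_thread = buf.lookup(tid=0, address=0x40)
other_thread = buf.lookup(tid=1, address=0x40)
```

The second lookup misses even though the address matches, because the stored thread identifier differs; this is how entries of different threads stay separated in one physical structure.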

[Figure: Objectives drive Policies, which control Mechanisms; the mechanisms act on a sequence of capacity resources and bandwidth resources.]

Figure 18. Objectives, policies, and mechanisms.

PERFORMANCE

FAIRNESS

ISOLATION

IMPLEMENTING OBJECTIVES

BANDWIDTH SHARING

CAPACITY SHARING

PRE-EMPTION

[Figure: pipeline stages: Instruction Fetch, Instruction Dispatch, Instruction Issue, Read Registers, Execute, Memory Access, Write-Back, Commit. Mechanisms (each marked partitioned or shared): Program Counters, trace cache, Uop queue, Rename/Allocation Tables, Issue Buffers, Registers, Execution, Ld/St Buffers, Data Cache, ROB. Policies: Round-Robin, FR-FCFS.]

Figure 19. Pentium 4 hyper-threading mechanisms and policies.

FEEDBACK MECHANISMS

POLICY COORDINATION

[Figure: a chain of alternating bandwidth and capacity resources, each managed by a Local Policy; all local policies are coordinated by one Global Policy.]

Figure 20. Local policies manage local resources in accordance with a global policy.

[Figure: a chain of alternating bandwidth and capacity resources; a Policy at an early stage receives Feedback from a Monitor Status block at a later stage.]

Figure 21. A policy may incorporate a feedback mechanism that monitors the status at a later pipeline stage.

SCHEDULING GRANULARITY

[Figure: three panels showing issue width (horizontal) against cycles (vertical): (a) coarse-grain, (b) fine-grain, (c) simultaneous.]

Figure 22. Multi-threaded scheduling policies.

THREAD SELECTION

WORK CONSERVATION

CAPACITY POLICIES

[Figure: timelines for thread 1 and thread 2 showing cache misses and pipeline stalls, followed by the combined fine-grain MT and coarse-grain MT timelines; horizontal axis: time (clock cycles).]

Figure 23. Scheduling of two threads with fine-grain round-robin scheduling and coarse-grain switch-on-event scheduling.
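The two policies named in Figure 23 can be compared with a toy single-issue model. This is a sketch under simplifying assumptions that are not from the source: one instruction issues per cycle, each thread is described only by the stall that follows each of its instructions, and a thread switch costs no cycles. The function name `simulate` and the policy labels are hypothetical.

```python
def simulate(threads, policy):
    """Return, per cycle, the thread that issued (or None for an idle cycle).

    threads[t][k] is the number of cycles thread t is stalled after
    issuing its k-th instruction (0 means no stall).
    policy is "fine" (round-robin every cycle) or
    "coarse" (stay on a thread until it stalls or finishes).
    """
    n = len(threads)
    next_instr = [0] * n   # index of the next instruction of each thread
    ready_at = [0] * n     # first cycle at which each thread may issue again
    timeline = []
    current = 0
    cycle = 0
    while any(next_instr[t] < len(threads[t]) for t in range(n)):
        chosen = None
        for k in range(n):                       # search starting at 'current'
            t = (current + k) % n
            if next_instr[t] < len(threads[t]) and ready_at[t] <= cycle:
                chosen = t
                break
        if chosen is not None:
            stall = threads[chosen][next_instr[chosen]]
            next_instr[chosen] += 1
            ready_at[chosen] = cycle + 1 + stall
            # Fine-grain rotates after every issue; coarse-grain stays put.
            current = (chosen + 1) % n if policy == "fine" else chosen
        timeline.append(chosen)
        cycle += 1
    return timeline


# Thread 0 stalls for two cycles after its second instruction.
threads = [[0, 2, 0], [0, 0, 0]]
fine = simulate(threads, "fine")
coarse = simulate(threads, "coarse")
```

With these inputs both policies keep the pipeline busy in every cycle; they differ in how the instructions of the two threads interleave. A more faithful model of coarse-grain switching would also charge a switch penalty.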

[Figure: timelines for thread 1 and thread 2 and the combined fine-grain MT timeline; horizontal axis: time (clock cycles).]

Figure 24. Fine-grain multithreading of pipelines without forwarding hardware. There are more gaps due to stalls in the individual threads. However, fine-grain multi-threading is able to fill in most of the gaps.

SCHEDULING GRANULARITY

[Figure: Processor Instructions per Cycle over time, with branch misprediction and cache miss events marked.]

Figure 25. Superscalar processor instruction execution is interspersed with miss events (branch mispredictions and cache/TLB misses).

SINGLE THREAD POLICIES

[Figure: Total Issue Buffer Size plotted against Active Threads.]

Figure 26. Relationship between the number of active threads and the aggregate issue buffer size.

FETCH UNIT MECHANISMS AND POLICIES

INSTRUCTION ISSUE

[Figure: Instructions per Cycle plotted against Threads for the Round Robin and ICOUNT fetch policies.]

Figure 27. Performance comparison of Round-Robin and ICOUNT fetch policies in an 8-way SMT processor (from [reference]).

RETIREMENT POLICIES

FAIRNESS POLICIES

EXPLICIT PRIORITY POLICIES

[Figure: pipeline stages: Update Instruction Address (IA), Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (ME), Writeback Results (WB). Components: Program Counter, PC Update, Instruction Cache, Instruction Buffer (16), Branch Target Buffer (8), Thread Switch Buffer (8), Decoder, Register File, Execution Units, Data Cache, MUXes, Forwarding Paths.]

Figure 28. The IBM RS64 IV pipeline has conventional in-order pipeline stages. It is a 4-way superscalar processor and has instruction buffers to reduce branch misprediction and thread switch delays.

[Figure: timing diagram. Load 1: IA, IF, inst buffer, ID, EX, ME, WB. Inst 1: IA, IF, inst buffer, ID, EX, ME. Inst 2 (from the other thread): IA, IF, thread switch buffer, ID, EX, ME, WB. Annotation: miss => flush; thread-switch; 3 cycle switch penalty.]

Figure 29. Thread switch timing on a data cache miss. The processor is 4-way superscalar, but to simplify the figure this is not shown.

[Figure: chart of thread switch causes: L1 Cache Misses, IERAT Misses, TLB Misses, L2 Cache Misses, Timeout, Priority, Miscellaneous.]

Figure 30. Causes for thread switches. The IERAT serves as an instruction TLB.

[Figure: pipeline stages: Instruction Fetch (IF), Thread Select (TS), Instruction Decode (ID), Execute (EX), Memory Access (ME), Writeback Results (WB). Components: PCs (4 threads), PC Select Logic, Instruction Cache, Instruction Buffers, Thread Select Logic, Decoder, Register File (4 threads), Execution Units, Data Cache, Store Buffers (partitioned), MUXes, Forwarding Paths. Thread Select Policy inputs: Instruction types, Misses, Traps, Resource conflicts.]

Figure 31. Block diagram of SUN Niagara multi-threaded processor pipeline.

[Figure: timing diagram. load 0: TS, ID, EX, ME, WB. add 1: TS, ID, EX, ME, WB. load 1: TS, ID, EX, ME, WB. add 0: TS, ID, EX, ME, WB. Annotations: hit/miss?; forward.]

Figure 32. Example of Niagara thread scheduling. The add instruction from thread 0 is issued speculatively (assuming thread 0's load hits in the data cache).

[Figure: pipeline stages: Instruction Fetch, Instruction Dispatch, Instruction Issue, Read Registers, Execute, Memory Access, Write-Back, Commit. Mechanisms (each marked pooled or partitioned): Branch Predictors, Program Counters, I-cache, Inst Buffers, Rename Tables, Issue Buffers, Registers, Execution, Ld/St Buffers, Data Cache, GCT. Policies: Round-Robin, Priority, Dispatch Policy, FCFS, with feedback from Load Miss Queue occupancy and GCT occupancy.]

Figure 33. Resource management diagram for IBM Power5.

Figure 34. Thread performance for different settings of thread priorities in the IBM Power5.


Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr

More information

Handout 2 ILP: Part B

Handout 2 ILP: Part B Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

Lecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ

Lecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ 1 An Out-of-Order Processor Implementation Reorder Buffer (ROB)

More information

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley Computer Systems Architecture I CSE 560M Lecture 10 Prof. Patrick Crowley Plan for Today Questions Dynamic Execution III discussion Multiple Issue Static multiple issue (+ examples) Dynamic multiple issue

More information

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2. Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 0 Consider the following LSQ and when operands are

More information

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2. Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 1 Consider the following LSQ and when operands are

More information

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor 1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

Lecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1)

Lecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) Lecture: SMT, Cache Hierarchies Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized

More information

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

Chapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,

Chapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007, Chapter 3 (CONT II) Instructor: Josep Torrellas CS433 Copyright J. Torrellas 1999,2001,2002,2007, 2013 1 Hardware-Based Speculation (Section 3.6) In multiple issue processors, stalls due to branches would

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Simultaneous Multithreading Processor

Simultaneous Multithreading Processor Simultaneous Multithreading Processor Paper presented: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor James Lue Some slides are modified from http://hassan.shojania.com/pdf/smt_presentation.pdf

More information

Lecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections )

Lecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections ) Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction (Sections 2.3-2.6) 1 Correlating Predictors Basic branch prediction: maintain a 2-bit saturating counter for each

More information

E0-243: Computer Architecture

E0-243: Computer Architecture E0-243: Computer Architecture L1 ILP Processors RG:E0243:L1-ILP Processors 1 ILP Architectures Superscalar Architecture VLIW Architecture EPIC, Subword Parallelism, RG:E0243:L1-ILP Processors 2 Motivation

More information

Processor Architecture

Processor Architecture Processor Architecture Advanced Dynamic Scheduling Techniques M. Schölzel Content Tomasulo with speculative execution Introducing superscalarity into the instruction pipeline Multithreading Content Tomasulo

More information

TDT 4260 lecture 7 spring semester 2015

TDT 4260 lecture 7 spring semester 2015 1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding

More information

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?) Evolution of Processor Performance So far we examined static & dynamic techniques to improve the performance of single-issue (scalar) pipelined CPU designs including: static & dynamic scheduling, static

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

LSU EE 4720 Dynamic Scheduling Study Guide Fall David M. Koppelman. 1.1 Introduction. 1.2 Summary of Dynamic Scheduling Method 3

LSU EE 4720 Dynamic Scheduling Study Guide Fall David M. Koppelman. 1.1 Introduction. 1.2 Summary of Dynamic Scheduling Method 3 PR 0,0 ID:incmb PR ID:St: C,X LSU EE 4720 Dynamic Scheduling Study Guide Fall 2005 1.1 Introduction David M. Koppelman The material on dynamic scheduling is not covered in detail in the text, which is

More information

Announcements. ECE4750/CS4420 Computer Architecture L11: Speculative Execution I. Edward Suh Computer Systems Laboratory

Announcements. ECE4750/CS4420 Computer Architecture L11: Speculative Execution I. Edward Suh Computer Systems Laboratory ECE4750/CS4420 Computer Architecture L11: Speculative Execution I Edward Suh Computer Systems Laboratory suh@csl.cornell.edu Announcements Lab3 due today 2 1 Overview Branch penalties limit performance

More information

Advanced issues in pipelining

Advanced issues in pipelining Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one

More information

Lecture 9: Dynamic ILP. Topics: out-of-order processors (Sections )

Lecture 9: Dynamic ILP. Topics: out-of-order processors (Sections ) Lecture 9: Dynamic ILP Topics: out-of-order processors (Sections 2.3-2.6) 1 An Out-of-Order Processor Implementation Reorder Buffer (ROB) Branch prediction and instr fetch R1 R1+R2 R2 R1+R3 BEQZ R2 R3

More information

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines CS450/650 Notes Winter 2013 A Morton Superscalar Pipelines 1 Scalar Pipeline Limitations (Shen + Lipasti 4.1) 1. Bounded Performance P = 1 T = IC CPI 1 cycletime = IPC frequency IC IPC = instructions per

More information

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics (Sections B.1-B.3, 2.1)

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics (Sections B.1-B.3, 2.1) Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics (Sections B.1-B.3, 2.1) 1 Problem 3 Consider the following LSQ and when operands are available. Estimate

More information

CS 2410 Mid term (fall 2015) Indicate which of the following statements is true and which is false.

CS 2410 Mid term (fall 2015) Indicate which of the following statements is true and which is false. CS 2410 Mid term (fall 2015) Name: Question 1 (10 points) Indicate which of the following statements is true and which is false. (1) SMT architectures reduces the thread context switch time by saving in

More information

ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti

ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith Pipelining to Superscalar Forecast Real

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls

More information

5008: Computer Architecture

5008: Computer Architecture 5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage

More information

Lecture 14: Multithreading

Lecture 14: Multithreading CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw

More information

Superscalar Processor Design

Superscalar Processor Design Superscalar Processor Design Superscalar Organization Virendra Singh Indian Institute of Science Bangalore virendra@computer.org Lecture 26 SE-273: Processor Design Super-scalar Organization Fetch Instruction

More information

Lecture 26: Parallel Processing. Spring 2018 Jason Tang

Lecture 26: Parallel Processing. Spring 2018 Jason Tang Lecture 26: Parallel Processing Spring 2018 Jason Tang 1 Topics Static multiple issue pipelines Dynamic multiple issue pipelines Hardware multithreading 2 Taxonomy of Parallel Architectures Flynn categories:

More information

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2. Instruction-Level Parallelism and its Exploitation: PART 2 Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.8)

More information

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University EE382A Lecture 7: Dynamic Scheduling Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 7-1 Announcements Project proposal due on Wed 10/14 2-3 pages submitted

More information

Four Steps of Speculative Tomasulo cycle 0

Four Steps of Speculative Tomasulo cycle 0 HW support for More ILP Hardware Speculative Execution Speculation: allow an instruction to issue that is dependent on branch, without any consequences (including exceptions) if branch is predicted incorrectly

More information

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 http://inst.eecs.berkeley.edu/~cs152/sp08 The problem

More information

Lecture: Out-of-order Processors

Lecture: Out-of-order Processors Lecture: Out-of-order Processors Topics: branch predictor wrap-up, a basic out-of-order processor with issue queue, register renaming, and reorder buffer 1 Amdahl s Law Architecture design is very bottleneck-driven

More information

Super Scalar. Kalyan Basu March 21,

Super Scalar. Kalyan Basu March 21, Super Scalar Kalyan Basu basu@cse.uta.edu March 21, 2007 1 Super scalar Pipelines A pipeline that can complete more than 1 instruction per cycle is called a super scalar pipeline. We know how to build

Lecture 11: Out-of-order Processors. Topics: more ooo design details, timing, load-store queue. Problem 0: Show the renamed version of the following code: Assume that you have 36 physical registers and

Pentium IV-XEON. Computer architectures M. Pentium IV block scheme: 4 32 bytes parallel; four access ports to the EU. Pentium IV block scheme: Address Generation Unit, BTB (Branch Target Buffer), I-TLB Instruction

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading. Krste Asanovic, Electrical Engineering and Computer Sciences, University of California, Berkeley. http://www.eecs.berkeley.edu/~krste

Chapter 4D: The Processor. Instruction-Level Parallelism (ILP). Pipelining: executing multiple instructions in parallel. To increase ILP: deeper pipeline

Pipelining to Superscalar. ECE/CS 752 Fall 207, Prof. Mikko H. Lipasti, University of Wisconsin-Madison. Forecast: limits of pipelining; the case for superscalar; instruction-level parallel

Multithreading Processors and Static Optimization Review. Adapted from Bhuyan, Patterson, Eggers, probably others. Schedule of things to do: By Wednesday the 9th at 9pm, please send a milestone report (as

Static vs. Dynamic Scheduling. Dynamic scheduling: fast; requires complex hardware; more power consumption; may result in a slower clock. Static scheduling: done in S/W (compiler); maybe not as fast; simpler processor

Instruction-Level Parallelism and Its Exploitation (Part III). ECE 154B, Dmitri Strukov. Dealing with control hazards: the simplest solution is to stall the pipeline until the branch is resolved and the target address is calculated

CISC 662 Graduate Computer Architecture. Lecture 13 - CPI < 1. Michela Taufer, http://www.cis.udel.edu/~taufer/teaching/cis662f07 PowerPoint lecture notes from John Hennessy and David Patterson's: Computer

CISC 662 Graduate Computer Architecture. Lecture 11 - Hardware Speculation, Branch Predictions. Michela Taufer, http://www.cis.udel.edu/~taufer/teaching/cis6627 PowerPoint lecture notes from John Hennessy

CS252 Graduate Computer Architecture. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls. Recall from Pipelining Review. March 16, 2001, Prof. David A. Patterson, Computer Science 252, Spring

Metodologie di Progettazione Hardware-Software. Advanced Pipelining and Instruction-Level Parallelism. LS Ing. Informatica. ILP: Instruction-level Parallelism

Review and Advanced Concepts. [Adapted from instructor's supplementary material from Computer Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK] Pipelining review: PC, IF/ID, ID/EX, EX/M

EITF20: Computer Architecture. Part 3.2.1: Pipeline - 3. Liang Liu, liang.liu@eit.lth.se. Outline: reiteration; dynamic scheduling (Tomasulo); superscalar, VLIW; speculation; ILP limitations; what we have done

Reorder Buffer Implementation (Pentium Pro). Hardware data structures: retirement register file (RRF) (~ IBM 360/91 physical registers), a physical register file that is the same size as the architectural registers

EE382A Lecture 6: Register Renaming. Lecture 6 outline: Dynamic Branch Prediction Using History; 1. Branch Prediction (epilog). Announcements: Project proposal due on Wed 10/14, 2-3 pages submitted through email; list the group members; describe the topic including why it is important and your thesis

CS425 Computer Systems Architecture, Fall 2017. Multiple Issue: Superscalar and VLIW. Vassilis Papaefstathiou. Example: dynamic scheduling in PowerPC 604 and Pentium Pro. In-order issue, out-of-order

CS 152 Computer Architecture and Engineering. Lecture 18: Advanced Processors II, 2006-10-31. John Lazzaro (www.cs.berkeley.edu/~lazzaro). Thanks to Krste Asanovic... TAs: Udam Saini and Jue Sun. www-inst.eecs.berkeley.edu/~cs152/

CS 152, Spring 2011, Section 8. Christopher Celio, University of California, Berkeley. Agenda: grades; upcoming Quiz 3 and what it covers (OOO processors, VLIW, branch prediction); Intel Core 2 Duo (Penryn) vs. NVidia

Complex Pipelining. COE 501 Computer Architecture, Prof. Muhamed Mudawar, Computer Engineering Department, King Fahd University of Petroleum and Minerals. Presentation outline: Diversified Pipeline; Detecting

Hiroaki Kobayashi. As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor. Branches will arrive up to n times faster in an n-issue processor, and providing an instruction

Computer Architecture: Multithreading (I). Prof. Onur Mutlu, Carnegie Mellon University. A note on this lecture: these slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 9: Multithreading

CS 152 Computer Architecture and Engineering. Lecture 14: Multithreading. Krste Asanovic, Electrical Engineering and Computer Sciences, University of California, Berkeley. http://www.eecs.berkeley.edu/~krste

Computer Architecture. Lecture 4: Instruction-Level Parallelism II. Chao Li, PhD. SJTU-SE346, Spring 2018. Review: hazards (data/name/control); RAW, WAR, WAW hazards; different types

TDT 4260. ILP, Chap 2, App. C. Intro. Ian Bratt (ianbra@idi.ntnu.no). Instruction level parallelism (ILP): a program is a sequence of instructions typically written to be executed one after the other

Instruction-Level Parallelism and its Exploitation: PART 1. ILP concepts (2.1); basic compiler techniques (2.2); reducing branch costs with prediction (2.3); dynamic scheduling (2.4 and 2.5). Project and Case

Lecture 12: Branch Prediction and Advanced Out-of-Order Superscalars. CS 152 Computer Architecture and Engineering / CS252 Graduate Computer Architecture. Krste Asanovic, Electrical Engineering and Computer

Computer System Architecture 6.823, Quiz #2, April 5th, 2019. Name: This is a closed book, closed notes exam. 80 minutes, 16 pages (+2 scratch). Notes: Not all questions are of equal difficulty, so look over

Out of Order Processing. Manu Awasthi, July 3rd 2018, Computer Architecture Summer School 2018. Slide deck acknowledgements: Rajeev Balasubramonian (University of Utah), Computer Architecture: A Quantitative

Instruction Level Parallelism. Software View of Computer Architecture, COMP2, Godfrey van der Linden, 200-0-0. Introduction: definition of Instruction Level Parallelism (ILP); pipelining; hazards & solutions; dynamic

Lecture 18: Instruction Level Parallelism -- Dynamic Superscalar, Advanced Techniques, ARM Cortex-A53, and Intel Core i7. CSCE 513 Computer Architecture, Department of Computer Science and Engineering. Yonghong

Hardware-based Speculation. M. Sonza Reorda, Politecnico di Torino, Dipartimento di Automatica e Informatica. Introduction: hardware-based speculation is a technique for reducing the effects of control dependences

Simultaneous Multithreading (SMT). An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue

TDT4255 Lecture 9: ILP and speculation. Donn Morrison, Department of Computer Science. Outline. Textbook: Computer Architecture: A Quantitative Approach, 4th ed. Section 2.6: Speculation; Section 2.7: Multiple

Chapter: Out of order Execution. Long EX instruction stages: We have assumed that all stages. There is a problem with the EX stage: multiply (MUL) takes more time than ADD. We can clearly delay the execution of the ADD until

Spring 2010, Prof. Hyesoon Kim. Thanks to Prof. Loh & Prof. Prvulovic. [Figure: C/C++ program, compiler, assembly code (binary), processor; memory (MAR, MDR), input, output, processing unit (ALU, TEMP), PC, control]

CS 152 Computer Architecture and Engineering. Lecture 17: Advanced Processors I, 2005-10-27. John Lazzaro (www.cs.berkeley.edu/~lazzaro). TAs: David Marquardt and Udam Saini. www-inst.eecs.berkeley.edu/~cs152/

Advanced Processor Architecture. Jin-Soo Kim (jinsookim@skku.edu), Computer Systems Laboratory, Sungkyunkwan University. http://csl.skku.edu Modern microprocessors: more than just GHz. CPU clock speed, SPECint2000

Multiple Instruction Issue and Hardware Based Speculation. Soner Önder, Michigan Technological University, Houghton MI. www.cs.mtu.edu/~soner Hardware based speculation: exploiting more ILP requires that we

Lecture 6: Multi-cycle Instructions in the Pipeline (Floating Point). Introduction to instruction level parallelism. Recap: support of multi-cycle instructions in a pipeline (App A.5). Recap: superpipelining

Hyperthreading Technology. Aleksandar Milenkovic, Electrical and Computer Engineering Department, University of Alabama in Huntsville. milenka@ece.uah.edu, www.ece.uah.edu/~milenka/ Outline: What is hyperthreading?

Chapter 4: The Processor. Introduction. CPU performance factors: instruction count, determined by ISA and compiler; CPI and cycle time, determined by CPU hardware. We will examine two MIPS implementations: a simplified

CS252 Spring 2017 Graduate Computer Architecture. Lecture 8: Advanced Out-of-Order Superscalar Designs, Part II. Lisa Wu, Krste Asanovic. http://inst.eecs.berkeley.edu/~cs252/sp17 Last time

Lecture 16: Core Design. Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue. The Alpha 21264 out-of-order implementation: reorder buffer (ROB), branch prediction

CS 152, Spring 2012, Section 8. Christopher Celio, University of California, Berkeley. Agenda: more out-of-order; Intel Core 2 Duo (Penryn) vs. NVidia GTX 280. Intel Core 2 Duo (Penryn): dual-core, 2007+, 45nm

CS252 Graduate Computer Architecture. Lecture 15: Instruction Level Parallelism and Dynamic Execution. Recall from Pipelining Review. March 11, 2002, Prof. David E. Culler, Computer Science 252, Spring 2002

Current Microprocessors. Pipeline: efficient utilization of hardware blocks. Execution steps for an instruction: send instruction address; instruction fetch; store instruction; decode instruction, fetch

EEC 581 Computer Architecture. Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW). Chansu Yu, Electrical and Computer Engineering, Cleveland State University. Overview

Advanced Processor Architecture. Jin-Soo Kim (jinsookim@skku.edu), Computer Systems Laboratory, Sungkyunkwan University. http://csl.skku.edu Modern microprocessors: more than just GHz. CPU clock speed, SPECint2000

Beyond ILP. Hemanth M & Bharathan Balaji. Multiscalar Processors (Gurindar S Sohi, Scott E Breach, T N Vijaykumar). Control Flow Graph (CFG): each node is a basic block in the graph; the CFG is divided into a collection of

ESE 545 Computer Architecture. Instruction-Level Parallelism (ILP): Speculation, Reorder Buffer, Exceptions, Superscalar Processors, VLIW. Review from last lecture: Leverage implicit

Case Study: IBM PowerPC 620. Year shipped: 1995. Allows out-of-order execution (dynamic scheduling) and in-order commit (hardware speculation), using a reorder buffer to track when an instruction can commit,

Superscalar Processor Design: Superscalar Architecture. Virendra Singh, Indian Institute of Science, Bangalore. virendra@computer.org Lecture 20, SE-273: Processor Design. Superscalar pipelines: IF ID RD ALU
