[Pipeline diagram: Update Instruction Address (IA), Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (ME), Writeback Results (WB); program counter, PC update logic, instruction cache, instruction register, decoder, register file, execution units, data cache, and forwarding paths.]

Example instruction sequence:

    r1 <- r2 + r3
    r3 <- mem[r1 + 8]
    r5 <- r3 - r4

[Timing chart: each instruction passes through IA IF ID EX ME WB on successive cycles.]

Figure 2. Three instructions flow down a pipeline; the first forwards data to the second (as shown with an arrow). The second instruction first stalls for one cycle, then forwards data to the third.
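The stall-and-forward behavior in Figure 2 can be sketched as a toy timing model. The assumptions here are my own simplification of the figure: ALU results forward EX-to-EX with no stall, while a load's consumer stalls one cycle and receives its value over the ME-to-EX forwarding path.

```python
# Toy timing model for the six-stage pipeline (IA IF ID EX ME WB).
# Assumption (illustrative): ALU results forward EX->EX with no stall;
# a load's data is available after ME, so a dependent instruction
# stalls one cycle and gets the value via ME->EX forwarding.

def schedule(instrs):
    """instrs: list of (dest, srcs, is_load). Returns EX-stage cycle per instr."""
    ex_cycle = []
    for i, (_dest, srcs, _is_load) in enumerate(instrs):
        ex = 3 if i == 0 else ex_cycle[-1] + 1   # EX is cycle 3 for instr 0
        if i:
            prev_dest, _, prev_is_load = instrs[i - 1]
            if prev_is_load and prev_dest in srcs:
                ex += 1                           # one-cycle load-use stall
        ex_cycle.append(ex)
    return ex_cycle

# The three instructions of Figure 2:
prog = [("r1", ("r2", "r3"), False),   # r1 <- r2 + r3
        ("r3", ("r1",), True),         # r3 <- mem[r1 + 8]
        ("r5", ("r3", "r4"), False)]   # r5 <- r3 - r4

print(schedule(prog))  # [3, 4, 6]: the third instruction stalls one cycle
```

With forwarding only between ALU instructions, the second instruction enters EX one cycle behind the first, while the third must wait an extra cycle for the load.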
[Figure 3 (VLIW pipeline diagram): instruction fetch, decode, multiple functional units, memory access ports, and a shared register file.]
[Figure 4 (out-of-order superscalar diagram): branch predictor, I-cache, fetch buffer, decode pipeline, register rename, issue buffers, execution units, physical register files, load queue, store queue, L1 data cache with MSHRs, L2 cache to main memory, and a reorder buffer (window).]
[Reorder buffer diagram: an entry is reserved at the tail when an instruction is dispatched and marked when it completes; each entry holds exception, register-mapping (e.g., r7 -> p5, r3 -> p6), and program-counter fields. Entries are removed from the head if complete; removal stops if an exception is present.]

Figure 5. Reorder buffer; instructions are dispatched into the tail and exit from the head only after they have completed.
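The tail-dispatch, out-of-order-complete, in-order-commit discipline of Figure 5 can be captured in a few lines. This is a minimal sketch, not a hardware-accurate model; field names are my own.

```python
# Minimal reorder-buffer sketch: reserve an entry at the tail on dispatch,
# mark it when it completes (possibly out of program order), and retire
# only completed entries from the head, stopping if the head has an exception.
from collections import deque

class ROB:
    def __init__(self):
        self.entries = deque()              # head = entries[0]; dispatch at tail

    def dispatch(self, pc):
        entry = {"pc": pc, "done": False, "exc": False}
        self.entries.append(entry)          # reserve entry at tail
        return entry

    def complete(self, entry, exc=False):   # may happen out of program order
        entry["done"], entry["exc"] = True, exc

    def commit(self):
        """Retire completed entries from the head, in order; stop on exception."""
        retired = []
        while self.entries and self.entries[0]["done"] and not self.entries[0]["exc"]:
            retired.append(self.entries.popleft()["pc"])
        return retired

rob = ROB()
first, second = rob.dispatch(0x5C), rob.dispatch(0x60)
rob.complete(second)                        # younger instruction finishes first...
print(rob.commit())                         # [] -- head is not yet done
rob.complete(first)
print(rob.commit())                         # now both retire, in program order
```

Even though the younger instruction completes first, nothing leaves the head until the older one is done, which is exactly what keeps commit in program order.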
Static instructions:

    loop: r3 <- mem(r4+r2)
          r7 <- mem(r5+r2)
          r7 <- r7 * r3
          r1 <- r1 - 1
          mem(r6+r2) <- r7
          r2 <- r2 + 8
          PC <- loop; r1 != 0

Branch prediction and fetch replicate this block to form the dynamic instruction stream (three consecutive copies of the block in the figure).

Figure 6. A block of instructions (on the left) is fetched (with the benefit of branch prediction) to form the dynamic instruction stream shown at right. A branch instruction appears as an assignment to the program counter (PC).
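The fetch behavior in Figure 6 amounts to unrolling the static block once per predicted-taken branch. A sketch, with the iteration count chosen for illustration:

```python
# Sketch of Figure 6: branch prediction lets the fetch unit replicate the
# static loop body into a dynamic instruction stream before the branch
# outcomes are actually known.

static_block = [
    "r3 <- mem(r4+r2)", "r7 <- mem(r5+r2)", "r7 <- r7 * r3",
    "r1 <- r1 - 1",     "mem(r6+r2) <- r7", "r2 <- r2 + 8",
    "PC <- loop; r1 != 0",                 # branch: assignment to the PC
]

def fetch(block, predicted_taken):
    """Replicate the block once per predicted branch outcome; stop on not-taken."""
    stream = []
    for taken in predicted_taken:
        stream.extend(block)
        if not taken:
            break
    return stream

dynamic = fetch(static_block, [True, True, False])
print(len(dynamic))   # 21: three copies of the 7-instruction body
```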
    r3 <- mem(r4+r2)
    r7 <- mem(r5+r2)

Register map (a): r1 -> p3, r2 -> p4, r3 -> p6, r4 -> p1, r5 -> p2, r6 -> p7, r7 -> p5
Register map (b): r1 -> p3, r2 -> p4, r3 -> p8, r4 -> p1, r5 -> p2, r6 -> p7, r7 -> p5

Renamed instructions:

    p8 <- mem(p1+p4)
    p9 <- mem(p2+p4)

Free pool: p8, p9, p10, p11, p12, p13, p14, p15, p16, p17, p18, p19, p20, p21, p22, p23, p24

Figure 7. The register renaming process: (a) first, source registers access the logical-to-physical register map to find their current mappings; (b) then the first physical register in the free pool is assigned to the result register, and the register map table is updated.
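The two-step process in Figure 7 translates directly into code: look up the sources in the map first, then allocate the destination from the free pool and update the map. The initial map and pool below follow the figure; the function name is my own.

```python
# Register renaming as in Figure 7: (a) map the source registers, then
# (b) allocate the next free physical register for the destination and
# update the logical-to-physical map.

reg_map = {"r1": "p3", "r2": "p4", "r3": "p6",
           "r4": "p1", "r5": "p2", "r6": "p7", "r7": "p5"}
free_pool = [f"p{i}" for i in range(8, 25)]   # p8 .. p24

def rename(dest, srcs):
    phys_srcs = [reg_map[s] for s in srcs]    # (a) map the sources first
    new_dest = free_pool.pop(0)               # (b) allocate from the free pool
    reg_map[dest] = new_dest                  # ... and update the map
    return new_dest, phys_srcs

print(rename("r3", ["r4", "r2"]))   # ('p8', ['p1', 'p4'])
print(rename("r7", ["r5", "r2"]))   # ('p9', ['p2', 'p4'])
```

Note that the sources must be mapped before the destination is remapped; otherwise an instruction like r7 <- r7 * r3 would read its own new, not-yet-written register.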
Renamed stream (the original figure tabulates dispatch, issue, complete, and commit cycles for each instruction):

    p8  <- mem(p1+p4)
    p9  <- mem(p2+p4)
    p10 <- p9 * p8
    p11 <- p3 - 1
    mem(p7+p4) <- p10
    p12 <- p4 + 8
    PC <- loop; p11 != 0
    p13 <- mem(p1+p12)
    p14 <- mem(p2+p12)
    p15 <- p14 * p13
    p16 <- p11 - 1
    mem(p7+p12) <- p15
    p17 <- p12 + 8
    PC <- loop; p16 != 0
    p18 <- mem(p1+p17)
    p19 <- mem(p2+p17)
    p20 <- p19 * p18
    p21 <- p16 - 1
    mem(p7+p17) <- p20
    p22 <- p17 + 8

Figure 8. Three iterations of the example instruction stream after renaming. Dispatch, issue, complete, and commit cycles illustrate out-of-order instruction issue and in-order instruction commit.
[ROB recovery diagram: the register map before recovery (r1 -> p21, r2 -> p22, r3 -> p18, r4 -> p1, r5 -> p2, r6 -> p7, r7 -> p20) is restored to (r1 -> p16, r2 -> p17, r3 -> p18, r4 -> p1, r5 -> p2, r6 -> p7, r7 -> p14) using the exception, register-mapping, and program-counter fields of the ROB entries (tail to head: r2 -> p17 at PC 6C; r1 -> p16, r7 -> p19 at PC 60; r7 -> p14 at PC 5C). The program counter is restored to 5C.]

Figure 9. Example of the ROB restoring architected state after an exception. The instruction at the head of the ROB has an exception. The register mapping and PC are backed up, and the pending store instruction is flushed.
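The recovery walk in Figure 9 can be sketched as follows. The key idea is that each ROB entry records the destination's previous physical mapping, so undoing renames from the tail back to the excepting head entry restores the architected map. The entry values below are illustrative, in the spirit of the figure.

```python
# Exception recovery in the spirit of Figure 9: walk the ROB from the tail
# back to the excepting head entry, undoing each rename by restoring the
# destination's previous physical-register mapping.

def recover(reg_map, rob_entries):
    """rob_entries: list head->tail of (logical_reg, old_phys, pc)."""
    for logical, old_phys, _pc in reversed(rob_entries):
        reg_map[logical] = old_phys           # undo renames, youngest first
    return rob_entries[0][2]                  # restart PC = head entry's PC

# Illustrative values (one destination per entry, unlike the figure):
reg_map = {"r1": "p21", "r2": "p22", "r7": "p20"}
rob = [("r7", "p14", 0x5C), ("r1", "p16", 0x60), ("r2", "p17", 0x6C)]
print(recover(reg_map, rob), reg_map)
```

The head entry's own rename is also undone, since the excepting instruction never committed; execution then restarts at the head entry's PC.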
[Figure 10 diagram labels: instruction issue, address generation, TLB, L1 data cache with MSHRs (miss to memory, data from memory), load address buffer, address compare enabling store-to-load forwarding, pending/complete store queue (store addresses and store data), store commit from the ROB, and (coalescing) store buffers that commit stores to the cache; data returns to the processor on a hit.]

Figure 10. L1 data cache and buffering subsystem that allows load/store reordering with forwarding of load data.
[Figure 11 diagram labels: addresses from the address-generation logic enter the load address queue (address, SQ tag, pending) and the store queue (valid, address, data); Compare 1 (tag match and pending) and Compare 2 (address match and valid) enable the two store-to-load forwarding paths; store data arrives from the execution units, data tags from the pipeline control logic, and commit signals from the ROB.]

Figure 11. Detailed drawing of load/store buffering and comparison logic.
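The address-compare logic of Figures 10 and 11 boils down to: a load checks the store queue for an older, uncommitted store to the same address, and the youngest match supplies the data. A functional sketch (full-address match only, no partial overlaps):

```python
# Store-to-load forwarding sketch: a load's address is compared against
# pending stores in the store queue; the youngest matching store forwards
# its data, otherwise the load reads the data cache.

store_queue = []   # pending stores in program order: (address, data)

def execute_store(addr, data):
    store_queue.append((addr, data))

def execute_load(addr, memory):
    for st_addr, st_data in reversed(store_queue):   # youngest first
        if st_addr == addr:
            return st_data                           # forwarded from the SQ
    return memory.get(addr, 0)                       # otherwise from the cache

memory = {0x100: 7}
execute_store(0x200, 42)
print(execute_load(0x200, memory))   # 42 (forwarded from the store queue)
print(execute_load(0x100, memory))   # 7  (from the data cache)
```

Real hardware does this comparison in parallel with a CAM across all valid store-queue entries; the sequential search here is only for clarity.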
[Figure 12 diagram labels: load addresses from the address-generation logic enter the load address queue (address, SQ tag, pending, forwarded); Compare 3 (address match and not forwarded) triggers a flush/restart when a committing store from the ROB matches a load that has already executed.]

Figure 12. Portion of the load/store unit that implements speculative issuing of load instructions before prior store addresses are known.
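The violation check behind Figure 12 can be sketched functionally: a load may issue before an older store's address is known; when that store's address resolves, it is compared against younger loads that already executed, and an un-forwarded match forces a flush. Names and sequence numbers below are illustrative.

```python
# Sketch of speculative load issue with violation detection: when an older
# store's address resolves, compare it against younger loads that already
# executed; an address match that was not forwarded means the load read
# stale data and must be squashed (flush/restart).

issued_loads = []   # (seq, addr, was_forwarded) for speculatively issued loads

def issue_load(seq, addr, was_forwarded=False):
    issued_loads.append((seq, addr, was_forwarded))

def resolve_store(store_seq, store_addr):
    """Return seq of the oldest younger load to squash, or None if no violation."""
    victims = [seq for seq, addr, fwd in issued_loads
               if seq > store_seq and addr == store_addr and not fwd]
    return min(victims) if victims else None

issue_load(seq=5, addr=0x200)
issue_load(seq=7, addr=0x300)
print(resolve_store(store_seq=3, store_addr=0x200))   # 5 -> flush/restart there
print(resolve_store(store_seq=3, store_addr=0x400))   # None -> no violation
```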
[Figure 13 (resource scaling diagram): relationships between issue width and other resource requirements: I-fetch resources (achieved width), commit width, number of functional units, ROB size, issue buffer size, number of rename registers, and load/store buffer sizes. Most of the relationships are linear in issue width; one is approximately quadratic.]
[Table 1 (numeric values lost): columns Processor | Reorder Buffer Size | Issue Buffer Size | Issue Width, with derived columns log2(ROB) and log2(Issue Width), for the Intel Core, IBM Power4, MIPS R10000, Intel PentiumPro, Alpha, AMD Opteron, HP PA-8000, and Intel Pentium 4.]

Table 1. The relationship between window size (ROB) and issue width for some real processors.
THE 6600 BARREL AND SLOT
[Figure 14 diagram labels: I/O programs in the barrel, each with its own PC and registers reg0 .. regn-1; memory latency = one barrel rotation; the SLOT holds the time-shared instruction control and ALU.]

Figure 14. CDC 6600 Barrel and Slot multi-threading.

DENELCOR HEP

[Figure 15 diagram labels: PSW queue, instruction memory (opcode, register addresses), register memory (operands), scheduler function unit, function units, main memory, increment/delay paths for non-memory and memory instructions, and a PSW buffer holding PSWs pending memory results.]

Figure 15. Block diagram of the Denelcor HEP.
[Figure 16 diagram labels: trace cache and I-fetch with per-thread program counters, uop queue, rename/allocate, queues, schedulers, register read, execute, L1 data cache, store buffer, register write, reorder buffer, and commit.]

Figure 16. Intel Pentium 4 hyperthreading.

[Figure 17 diagram: an address is split into (TId, tag, offset) and compared against entries of the form (V, TId, tag, data); the comparison (==) produces hit/miss.]

Figure 17. A thread identifier (TId) separates the entries belonging to different threads in a shared buffer or memory.
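The mechanism in Figure 17 is simply tag extension: the thread identifier is concatenated with the address tag in the comparison, so one thread can never hit on another thread's entries. A functional sketch:

```python
# Figure 17's mechanism: in a buffer shared between threads, the thread
# identifier (TId) participates in the tag comparison, so entries belonging
# to different threads never match each other.

def lookup(entries, tid, tag):
    """entries: list of (valid, tid, tag, data). Returns (hit, data)."""
    for valid, e_tid, e_tag, data in entries:
        if valid and e_tid == tid and e_tag == tag:   # compare == on TId + tag
            return True, data
    return False, None

entries = [(True, 0, 0x1A, "A"), (True, 1, 0x1A, "B")]
print(lookup(entries, 0, 0x1A))   # (True, 'A')  -- thread 0's own entry
print(lookup(entries, 1, 0x1A))   # (True, 'B')  -- same tag, other thread
print(lookup(entries, 0, 0x2B))   # (False, None) -- miss
```

The same address tag held by two threads resolves to two distinct entries, which is what makes the buffer safely shareable.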
[Figure 18 diagram labels: objectives, policies, mechanisms; capacity resources and bandwidth resources.]

Figure 18. Objectives, policies, and mechanisms.

PERFORMANCE
FAIRNESS

ISOLATION
IMPLEMENTING OBJECTIVES
BANDWIDTH SHARING

CAPACITY SHARING
[Figure 19 diagram: pipeline stages (instruction fetch, instruction dispatch, instruction issue, read registers, execute, memory access, write-back, commit) annotated with mechanisms (program counters, uop queue, trace cache, rename/allocation tables, issue buffers, registers, execution, ROB, load/store buffers, data cache; each marked partitioned ("part") or shared) and policies (round-robin, FR-FCFS).]

Figure 19. Pentium 4 hyper-threading mechanisms and policies.

FEEDBACK MECHANISMS

POLICY COORDINATION
[Figure 20 diagram labels: alternating bandwidth and capacity resources, each managed by a local policy, all under a global policy.]

Figure 20. Local policies manage local resources in accordance with a global policy.

[Figure 21 diagram labels: alternating bandwidth and capacity resources; a policy with a feedback path that monitors status at a later stage.]

Figure 21. A policy may incorporate a feedback mechanism that monitors the status at a later pipeline stage.

SCHEDULING GRANULARITY
[Figure 22 diagram: issue slots (issue width) over cycles for (a) coarse-grain, (b) fine-grain, and (c) simultaneous multithreading.]

Figure 22. Multi-threaded scheduling policies.

THREAD SELECTION
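The difference between Figure 22's (b) and (c) can be sketched as two slot-filling rules over a 2-wide issue window: fine-grain gives all of a cycle's slots to one thread in round-robin order, while simultaneous multithreading fills slots from any thread with ready instructions. The per-thread ready counts below are illustrative.

```python
# Slot-filling sketch for Figure 22's fine-grain (b) and simultaneous (c)
# policies. ready[t] = number of ready instructions thread t offers this cycle.

def fine_grain(ready, cycle, width=2):
    t = cycle % len(ready)                 # strict round-robin: one thread/cycle
    return {t: min(width, ready[t])}

def simultaneous(ready, width=2):
    slots, out = width, {}
    for t in sorted(range(len(ready)), key=lambda t: -ready[t]):
        take = min(slots, ready[t])
        if take:
            out[t] = take
        slots -= take
    return out

ready = [1, 2]
print(fine_grain(ready, cycle=0))          # {0: 1} -- thread 0 owns the cycle
print(fine_grain(ready, cycle=1))          # {1: 2} -- thread 1 owns the next
print(simultaneous([1, 1]))                # {0: 1, 1: 1} -- slots mixed in one cycle
```

Coarse-grain scheduling, case (a), is different again: one thread keeps the whole window until a long-latency event forces a switch.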
WORK CONSERVATION

CAPACITY POLICIES
[Figure 23 diagram: two threads scheduled over time (clock cycles); cache misses and pipeline stalls create gaps under both fine-grain MT and coarse-grain MT.]

Figure 23. Scheduling of two threads with fine-grain round-robin scheduling and coarse-grain switch-on-event scheduling.
[Figure 24 diagram: two threads interleaved over time (clock cycles) under fine-grain MT.]

Figure 24. Fine-grain multithreading of pipelines without forwarding hardware. There are more gaps due to stalls in the individual threads; however, fine-grain multi-threading is able to fill in most of the gaps.

SCHEDULING GRANULARITY

[Figure 25 diagram: a superscalar processor's instructions per cycle over time, punctuated by branch mispredictions and cache misses.]

Figure 25. In superscalar processors, instruction execution is interspersed with miss events (branch mispredictions and cache/TLB misses).
SINGLE THREAD POLICIES
[Figure 26 diagram: total issue buffer size plotted against the number of active threads.]

Figure 26. Relationship between the number of active threads and the aggregate issue buffer size.

FETCH UNIT MECHANISMS AND POLICIES
INSTRUCTION ISSUE
[Figure 27 chart: instructions per cycle versus number of threads for the round-robin and ICOUNT fetch policies.]

Figure 27. Performance comparison of round-robin and ICOUNT fetch policies in an 8-way SMT processor (from [reference]).

RETIREMENT POLICIES

FAIRNESS POLICIES
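The ICOUNT policy compared against round-robin in Figure 27 can be sketched in a couple of lines: each cycle, fetch for the thread with the fewest instructions in the pre-issue stages of the pipeline, which steers fetch bandwidth away from threads that are clogging the issue buffers.

```python
# Fetch-policy sketch: ICOUNT picks the thread with the fewest instructions
# in the front of the pipeline; round-robin ignores pipeline occupancy.

def icount_pick(in_flight):
    """in_flight: per-thread count of fetched-but-unissued instructions."""
    return min(range(len(in_flight)), key=lambda t: in_flight[t])

def round_robin_pick(num_threads, cycle):
    return cycle % num_threads

in_flight = [12, 3, 7, 9]
print(icount_pick(in_flight))        # 1: the least-represented thread
print(round_robin_pick(4, cycle=6))  # 2: chosen regardless of occupancy
```

A thread stalled on a long cache miss accumulates instructions in the issue buffers, so ICOUNT automatically stops fetching for it; round-robin keeps feeding it anyway.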
EXPLICIT PRIORITY POLICIES
[Figure 28 diagram: pipeline stages Update Instruction Address (IA), Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (ME), Writeback Results (WB); program counter, PC update, instruction cache, instruction buffer (16), branch target buffer (8), thread switch buffer (8), decoder, register file, execution units, data cache, and forwarding paths.]

Figure 28. The IBM RS64 IV pipeline has conventional in-order pipeline stages. It is a 4-way superscalar processor and has instruction buffers to reduce branch misprediction and thread switch delays.
[Figure 29 timing:]

    cycles:  Load 1   IA IF inst-buffer ID EX ME WB
             Inst 1   IA IF inst-buffer ID EX ME          (miss => flush; thread switch)
             Inst 2   IA IF thread-switch-buffer ID EX ME WB   (3-cycle switch penalty)

Figure 29. Thread switch timing on a data cache miss. The processor is 4-way superscalar, but to simplify the figure this is not shown.
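The switch-on-miss behavior in Figure 29 can be sketched as a tiny two-thread simulation: the active thread runs until a load misses, the pipeline is flushed, and after a fixed switch penalty (3 cycles here, matching the figure) the other thread resumes. The event encoding is my own.

```python
# Coarse-grain switch-on-miss timing sketch: one thread owns the pipeline
# until a cache miss; the switch costs a fixed number of dead cycles.

SWITCH_PENALTY = 3

def run(events, cycles):
    """events: {cycle: 'miss'}. Returns per-cycle owner (None while switching)."""
    owner, timeline, stall = 0, [], 0
    for c in range(cycles):
        if stall:
            timeline.append(None)      # switch penalty: no thread issues
            stall -= 1
            continue
        timeline.append(owner)
        if events.get(c) == "miss":    # miss detected -> flush and switch
            owner = 1 - owner
            stall = SWITCH_PENALTY
    return timeline

print(run({2: "miss"}, 8))   # [0, 0, 0, None, None, None, 1, 1]
```

The three None cycles are the dead time the thread-switch buffer of Figure 28 is there to shorten.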
[Figure 30 chart: causes for thread switches: L1 cache misses, IERAT misses, TLB misses, L2 cache misses, timeout, priority, and miscellaneous.]

Figure 30. Causes for thread switches. The IERAT serves as an instruction TLB.

[Figure 31 diagram: pipeline stages Instruction Fetch (IF), Thread Select (TS), Instruction Decode (ID), Execute (EX), Memory Access (ME), Writeback Results (WB); per-thread PCs (4 threads), instruction cache, instruction buffers, select logic driven by a thread-select policy (instruction types, misses, traps, resource conflicts), decoder, register file (4 threads), execution units, data cache, partitioned store buffers, and forwarding paths.]

Figure 31. Block diagram of the Sun Niagara multi-threaded processor pipeline.
[Figure 32 timing:]

    cycles:  load 0   TS ID EX ME WB
             add 1    TS ID EX ME WB
             load 1   TS ID EX ME WB
             add 0    TS ID EX ME WB   (issued at the load's hit/miss point; data forwarded)

Figure 32. Example of Niagara thread scheduling. The add instruction from thread 0 is issued speculatively (assuming thread 0's load hits in the data cache).
[Figure 33 diagram: pipeline stages (instruction fetch, instruction dispatch, instruction issue, read registers, execute, memory access, write-back, commit) annotated with mechanisms (branch predictors, program counters, I-cache, instruction buffers, rename tables, issue buffers, registers, execution, load miss queue, load/store buffers, data cache, GCT; each marked pooled or partitioned ("part")) and policies (round-robin, priority, a dispatch policy keyed to GCT occupancy, FCFS).]

Figure 33. Resource management diagram for the IBM Power5.

[Figure 34 chart: per-thread performance under different priority settings.]

Figure 34. Thread performance for different settings of thread priorities in the IBM Power5.
More informationComputer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 9: Multithreading
More informationCS 152 Computer Architecture and Engineering. Lecture 14: Multithreading
CS 152 Computer Architecture and Engineering Lecture 14: Multithreading Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste
More informationComputer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士
Computer Architecture 计算机体系结构 Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II Chao Li, PhD. 李超博士 SJTU-SE346, Spring 2018 Review Hazards (data/name/control) RAW, WAR, WAW hazards Different types
More informationTDT 4260 TDT ILP Chap 2, App. C
TDT 4260 ILP Chap 2, App. C Intro Ian Bratt (ianbra@idi.ntnu.no) ntnu no) Instruction level parallelism (ILP) A program is sequence of instructions typically written to be executed one after the other
More informationILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)
Instruction-Level Parallelism and its Exploitation: PART 1 ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5) Project and Case
More informationLecture 12 Branch Prediction and Advanced Out-of-Order Superscalars
CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 12 Branch Prediction and Advanced Out-of-Order Superscalars Krste Asanovic Electrical Engineering and Computer
More informationComputer System Architecture Quiz #2 April 5th, 2019
Computer System Architecture 6.823 Quiz #2 April 5th, 2019 Name: This is a closed book, closed notes exam. 80 Minutes 16 Pages (+2 Scratch) Notes: Not all questions are of equal difficulty, so look over
More informationOut of Order Processing
Out of Order Processing Manu Awasthi July 3 rd 2018 Computer Architecture Summer School 2018 Slide deck acknowledgements : Rajeev Balasubramonian (University of Utah), Computer Architecture: A Quantitative
More informationInstruction Level Parallelism
Instruction Level Parallelism Software View of Computer Architecture COMP2 Godfrey van der Linden 200-0-0 Introduction Definition of Instruction Level Parallelism(ILP) Pipelining Hazards & Solutions Dynamic
More informationLecture 18: Instruction Level Parallelism -- Dynamic Superscalar, Advanced Techniques,
Lecture 18: Instruction Level Parallelism -- Dynamic Superscalar, Advanced Techniques, ARM Cortex-A53, and Intel Core i7 CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong
More informationHardware-based Speculation
Hardware-based Speculation M. Sonza Reorda Politecnico di Torino Dipartimento di Automatica e Informatica 1 Introduction Hardware-based speculation is a technique for reducing the effects of control dependences
More informationSimultaneous Multithreading (SMT)
Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue
More informationDonn Morrison Department of Computer Science. TDT4255 ILP and speculation
TDT4255 Lecture 9: ILP and speculation Donn Morrison Department of Computer Science 2 Outline Textbook: Computer Architecture: A Quantitative Approach, 4th ed Section 2.6: Speculation Section 2.7: Multiple
More informationChapter. Out of order Execution
Chapter Long EX Instruction stages We have assumed that all stages. There is a problem with the EX stage multiply (MUL) takes more time than ADD MUL ADD We can clearly delay the execution of the ADD until
More informationSpring 2010 Prof. Hyesoon Kim. Thanks to Prof. Loh & Prof. Prvulovic
Spring 2010 Prof. Hyesoon Kim Thanks to Prof. Loh & Prof. Prvulovic C/C++ program Compiler Assembly Code (binary) Processor 0010101010101011110 Memory MAR MDR INPUT Processing Unit OUTPUT ALU TEMP PC Control
More informationCS 152 Computer Architecture and Engineering
CS 152 Computer Architecture and Engineering Lecture 17 Advanced Processors I 2005-10-27 John Lazzaro (www.cs.berkeley.edu/~lazzaro) TAs: David Marquardt and Udam Saini www-inst.eecs.berkeley.edu/~cs152/
More informationAdvanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University
Advanced d Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000
More informationMultiple Instruction Issue and Hardware Based Speculation
Multiple Instruction Issue and Hardware Based Speculation Soner Önder Michigan Technological University, Houghton MI www.cs.mtu.edu/~soner Hardware Based Speculation Exploiting more ILP requires that we
More informationMulti-cycle Instructions in the Pipeline (Floating Point)
Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining
More informationHyperthreading Technology
Hyperthreading Technology Aleksandar Milenkovic Electrical and Computer Engineering Department University of Alabama in Huntsville milenka@ece.uah.edu www.ece.uah.edu/~milenka/ Outline What is hyperthreading?
More informationChapter 4. The Processor
Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More informationCS252 Spring 2017 Graduate Computer Architecture. Lecture 8: Advanced Out-of-Order Superscalar Designs Part II
CS252 Spring 2017 Graduate Computer Architecture Lecture 8: Advanced Out-of-Order Superscalar Designs Part II Lisa Wu, Krste Asanovic http://inst.eecs.berkeley.edu/~cs252/sp17 WU UCB CS252 SP17 Last Time
More informationLecture 16: Core Design. Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue
Lecture 16: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue 1 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction
More informationCS 152, Spring 2012 Section 8
CS 152, Spring 2012 Section 8 Christopher Celio University of California, Berkeley Agenda More Out- of- Order Intel Core 2 Duo (Penryn) Vs. NVidia GTX 280 Intel Core 2 Duo (Penryn) dual- core 2007+ 45nm
More informationPage 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution
CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 15: Instruction Level Parallelism and Dynamic Execution March 11, 2002 Prof. David E. Culler Computer Science 252 Spring 2002
More informationCurrent Microprocessors. Efficient Utilization of Hardware Blocks. Efficient Utilization of Hardware Blocks. Pipeline
Current Microprocessors Pipeline Efficient Utilization of Hardware Blocks Execution steps for an instruction:.send instruction address ().Instruction Fetch ().Store instruction ().Decode Instruction, fetch
More informationEEC 581 Computer Architecture. Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW)
1 EEC 581 Computer Architecture Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW) Chansu Yu Electrical and Computer Engineering Cleveland State University Overview
More informationAdvanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University
Advanced Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000
More informationBeyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji
Beyond ILP Hemanth M Bharathan Balaji Multiscalar Processors Gurindar S Sohi Scott E Breach T N Vijaykumar Control Flow Graph (CFG) Each node is a basic block in graph CFG divided into a collection of
More informationESE 545 Computer Architecture Instruction-Level Parallelism (ILP): Speculation, Reorder Buffer, Exceptions, Superscalar Processors, VLIW
Computer Architecture ESE 545 Computer Architecture Instruction-Level Parallelism (ILP): Speculation, Reorder Buffer, Exceptions, Superscalar Processors, VLIW 1 Review from Last Lecture Leverage Implicit
More informationCase Study IBM PowerPC 620
Case Study IBM PowerPC 620 year shipped: 1995 allowing out-of-order execution (dynamic scheduling) and in-order commit (hardware speculation). using a reorder buffer to track when instruction can commit,
More informationSuperscalar Processor
Superscalar Processor Design Superscalar Architecture Virendra Singh Indian Institute of Science Bangalore virendra@computer.orgorg Lecture 20 SE-273: Processor Design Superscalar Pipelines IF ID RD ALU
More information