T a s k B C D. O r d e r

Size: px

Start display at page:

Download "T a s k B C D. O r d e r"

Elinor Wilson
5 years ago
Views:

1 an Jose tate University EE176 - Fall 1998 Computer rchitecture and Engineering Lecture 12: Introduction to Pipelining (Part II) Instructor: Christopher H. Pham T a s k O r d e r Recap: equential Laundry C 6 P Time equential laundry takes 8 hours for 4 loads If they learned pipelining, how long would laundry take EE176-JU Lec11 PipelinePart2.1 EE176-JU Lec11 PipelinePart2.2 T a s k O r d e r Recap: Pipelining Lessons (its intuitive!) C 6 P Time Pipelining doesn t help latency of single task, it helps throughput of entire workload ultiple tasks operating simultaneously using different resources Potential speedup Number pipe stages Pipeline rate limited by slowest pipeline stage Unbalanced lengths of pipe stages reduces speedup Time to fill pipeline and time to drain it reduces speedup tall for ependences Recap: Ideal Pipelining IF C EX E W IF C EX E W IF C EX E W IF C EX E W aximum peedup Number of stages peedup Time for unpipelined operation Time for longest stage ssume instructions are completely independent! IF C EX E W Example: 40ns data path, 5 stages, Longest stage is 10 ns, peedup 4 EE176-JU Lec11 PipelinePart2.3 EE176-JU Lec11 PipelinePart2.4

2 Recap: Graphically Representing Pipelines Can help with answering questions like: how many cycles does it take to execute this code what is the LU doing during cycle 4 use this representation to help understand datapaths Recap: Can pipelining get us into trouble Yes: Pipeline Hazards structural hazards: attempt to use the same resource two different ways at the same time - e.g., multiple memory accesses, multiple register writes - solutions: multiple memories, stretch pipeline control hazards: attempt to make a decision before condition is evaulated - e.g., any conditional branch - solutions: prediction, delayed branch data hazards: attempt to use item before it is ready - e.g., add r1,r2,r3; sub r4, r1,r5; lw r6, 0(r7); or r8, r6,r9 - solutions: foarding/bypassing, stall/bubble EE176-JU Lec11 PipelinePart2.5 EE176-JU Lec11 PipelinePart2.6 Recap: Pipelined atapath with ata tationary Control s alu mem m s IU npc I mem lw $2,20($5) im n op Operand ister elects LU Op E Op Just like Time-tate! < immed Result elect and Enable Recap Pipelining is a fundamental concept multiple steps using distinct resources Utilize capabilities of the atapath by pipelined instruction processing start next instruction while working on the current one limited by length of longest stage (plus fill/flush) detect and resolve hazards What makes it easy all instructions are the same length just a few instruction formats memory operands appear only in loads and stores Hazards make it hard We ll build a simple pipeline and look at these issues EE176-JU Lec11 PipelinePart2.7 EE176-JU Lec11 PipelinePart2.8

3 The ig Picture: Where are We Now Recap: Control iagram The Five Classic Components of a Computer Processor Input Control ory atapath Output < + ; < or ZX; <- []; < +4; <- R[rs]; < R[rt] < + X; < + X; If Cond < +X; Today s Topics: Recap last lecture Pipelined Control/ o it yourself Pipelined Control dministrivia Hazards/Foarding Exceptions Review IP R3000 pipeline dvanced Pipelining < R[rd] < ; Next < R[rt] < ; Inst. < [] R[rd] < ; Equal [] <- ata. EE176-JU Lec11 PipelinePart2.9 EE176-JU Lec11 PipelinePart2.10 ut recall use of ata tationary Control atapath + ata tationary Control The ain Control generates the control signals during /ec Control signals for (ExtOp, LUrc,...) are used 1 cycle later Control signals for (Wr ranch) are used 2 cycles later Control signals for Wr (to Wr) are used 3 cycles later /ec Wr Inst. fun rt rs op ecode rs rt v wb me ex im v wb me v wb W IF/I ister ain Control ExtOp LUrc LUOp st Wr ranch to Wr I/Ex ister ExtOp LUrc LUOp st Wr ranch to Wr Ex/ ister Wr ranch to Wr /Wr ister to Wr Next ata. EE176-JU Lec11 PipelinePart2.11 EE176-JU Lec11 PipelinePart2.12

4 Let s Try it Out tart: Fetch 10 n n n n 14 addi r2, r2, 3 20 sub r3, r4, r5 these addresses are octal Inst. ecode rs rt Next im 10 ata IF W. 14 addi r2, r2, 3 20 sub r3, r4, r5 EE176-JU Lec11 PipelinePart2.13 EE176-JU Lec11 PipelinePart2.14 Fetch 14, ecode 10 Fetch 20, ecode 14, 10 Inst. lw r1, r2(35) ecode 2 rt n n n im W Inst. addi r2, r2, 3 ecode 2 rt lw r1 35 n n W Next 14 ata I IF. 14 addi r2, r2, 3 20 sub r3, r4, r5 Next r2 20 ata. EX I 14 addi r2, r2, 3 IF 20 sub r3, r4, r5 EE176-JU Lec11 PipelinePart2.15 EE176-JU Lec11 PipelinePart2.16

5 Fetch 24, ecode 20, 14, 10 Inst. sub r3, r4, r5 ecode EE176-JU Lec11 PipelinePart Next addi r2, r2, 3 3 r2 24 lw r1 r2+35 ata n W. EX 14 addi r2, r2, 3 I 20 sub r3, r4, r5 IF Fetch 30, cd 24, Ex 20, 14, W 10 Inst. beq r6, r7 100 EE176-JU Lec11 PipelinePart2.18 ecode 6 7 Next sub r3 r4 r5 30 addi r2 r2+3 lw r1 [r2+35] ata Note elayed ranch: always execute ori after beq W. W 14 addi r2, r2, 3 EX 20 sub r3, r4, r5 I IF Fetch 100, cd 30, Ex 24, 20, W 14 Fetch 104, cd 100, Ex 30, 24, W 20 Inst. ori r8, r9 17 EE176-JU Lec11 PipelinePart2.19 ecode 9 xx Next beq 100 r6 r7 100 sub r3 r4-r5 addi r2 ata r2+3 W. r1[r2+35] W 14 addi r2, r2, 3 20 sub r3, r4, r5 EX I IF Inst. EE176-JU Lec11 PipelinePart2.20 ecode Next Fill it in yourself! ata W. 14 addi r2, r2, 3 20 sub r3, r4, r5 EX W I

6 Fetch 110, cd 104, Ex 100, 30, W 24 Fetch 114, cd 110, Ex 104, 100, W 30 Inst. ecode W Inst. ecode W.. ata 14 addi r2, r2, 3 ata 14 addi r2, r2, 3 Next W 20 sub r3, r4, r5 Next W 20 sub r3, r4, r5 Fill it in yourself! EE176-JU Lec11 PipelinePart2.21 EX Fill it in yourself! EE176-JU Lec11 PipelinePart2.22 Pipeline Hazards gain I-Fet ch C OpFetch OpFetch tore tructural Hazard IFetch C I-Fet ch C OpFetch Jump Control Hazard ata Hazards void some by design eliminate WR by always fetching operands early (C) in pipe eleminate WW by doing all Ws in order (last stage, static) etect and resolve remaining ones stall or foard (if possible) IFetch C IF C EX W RW (read after write) ata Hazard IF C EX W WW ata Hazard IF C EX W (write after write) IF C OF Ex IF C OF Ex R WR ata Hazard (write after read) IF C EX W RW ata Hazard IF C EX W WW ata Hazard IF C EX W IF C OF Ex IF C OF Ex R RW ata Hazard EE176-JU Lec11 PipelinePart2.23 EE176-JU Lec11 PipelinePart2.24

7 Hazard etection Record of Pending Writes uppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline. RW hazard exists on register ρ if ρ Rregs( i ) Wregs( j ) Keep a record of pending writes (for inst's in the pipe) and compare with operand regs of current instruction. When instruction issues, reserve its result register. When on operation completes, remove its write reservation. WW hazard exists on register ρ if ρ Wregs( i ) Wregs( j ) WR hazard exists on register ρ if ρ Wregs( i ) Rregs( j ) s alu mem m IU npc I mem op rs rt im n op n op n op Current operand registers Pending writes hazard < ((rs ex) & regw ex ) OR ((rs mem) & regw me ) OR ((rs & regw wb) wb ) OR ((rt & regw ex) ex ) OR ((rt & regw mem) me ) OR ((rt wb ) & regw wb ) s EE176-JU Lec11 PipelinePart2.25 EE176-JU Lec11 PipelinePart2.26 Resolve RW by foarding What about memory operations s Foard mux alu mem m IU npc I mem op rs rt im n op n op n op etect nearest valid write op operand register and foard into op latches, bypassing remainder of the pipe Increase muxes to add paths from pipeline registers ata Foarding ata ypassing If instructions are initiated in order and operations always occur in the same stage, there can be no hazards between memory operations! What does delaying W on arithmetic operations cost cycles hardware What about data dependence on loads R1 <- R4 + R5 R2 <- [ R2 + I ] R3 <- R2 + R1 > "elayed Loads" op Rd Ra Rb op Rd Ra Rb Rd Rd to reg file R T s EE176-JU Lec11 PipelinePart2.27 EE176-JU Lec11 PipelinePart2.28

8 Compiler voiding Load talls: What about Interrupts, Traps, Faults External Interrupts: llow pipeline to drain, scheduled unscheduled Load with interupt address Faults (within instruction, restartable) gcc spice 14% 31% 42% 54% Force trap instruction into IF disable writes till trap hits W must save multiple s or + state tex 25% 65% 0% 20% 40% 60% 80% % loads stalling pipeline Refer to IP solution EE176-JU Lec11 PipelinePart2.29 EE176-JU Lec11 PipelinePart2.30 Exception Handling Exception Problem s IU npc I mem lw $2,20($5) im n op detect bad instruction address detect bad instruction Exceptions/Interrupts: 5 instructions executing in 5 stage pipeline How to stop the pipeline Restart Who caused the interrupt tage Problem interrupts occurring IF Page fault on instruction fetch; misaligned memory access; memory-protection violation I Undefined or illegal opcode alu mem detect overflow detect bad data address EX E rithmetic exception Page fault on data fetch; misaligned memory access; memory-protection violation; memory error Load with data page fault, dd with instruction page fault olution 1: interrupt vector/instruction 2: interrupt P, restart everything incomplete m s llow exception to take effect EE176-JU Lec11 PipelinePart2.31 EE176-JU Lec11 PipelinePart2.32

9 Resolution: Freeze above & ubble elow FYI: IP R3000 clocking discipline s IU npc I mem op rs rt bubble freeze phi1 phi2 2-phase non-overlapping clocks Pipeline stage is two (level sensitive) latches im n op alu n op mem m n op Edge-triggered phi1 phi2 phi1 s EE176-JU Lec11 PipelinePart2.33 EE176-JU Lec11 PipelinePart2.34 IP R3000 Instruction Pipeline Inst Fetch ecode. Read LU / E. ory Write TL I-Cache RF Operation W Resource Usage TL I-cache RF TL LULU E.. TL -Cache -Cache W I n s t r. O r d e r Recall: ata Hazard on r1 Time (clock cycles) IF I/RF EX E W add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11 LU Im m LU Im m LU Im m Im LU m LU Im m Write in phase 1, read in phase 2 > eliminates bypass from W With IP R3000 pipeline, no need to foard from W stage EE176-JU Lec11 PipelinePart2.35 EE176-JU Lec11 PipelinePart2.36

10 IP R3000 ulticycle Operations Issues in Pipelined design op Rd Ra Rb mul Rd Ra Rb Rd Rd to reg file R T Ex: ultiply, ivide, Cache iss tall all stages above multicycle operation in the pipeline rain (bubble) stages below it Use control word of local stage state to step through multicycle operation Pipelining uper-pipeline - Issue one instruction per (fast) cycle - LU takes multiple cycles uper-scalar - Issue multiple scalar instructions per cycle VLIW ( EPIC ) - Each instruction specifies multiple scalar operations - Compiler determines parallelism Vector operations - Each instruction specifies series of identical operations Ex W Ex W Ex W Ex W Ex W Ex W Limitation Issue rate, FU stalls, FU depth Clock skew, FU stalls, FU depth Hazard resolution Packing pplicability EE176-JU Lec11 PipelinePart2.37 EE176-JU Lec11 PipelinePart2.38 Historical Perspective Technology Perspective early 90's RIC uperscalars Today Load/tore I (cdc 6600,7600, Cray-1,...) 1966 vector proc. 60ns hardwired 8x16b bus 780ns mem EE176-JU Lec11 PipelinePart 's RIC pipelines (mips,sparc,...) ynamic Inst. cheduling with extensive pipelining (ibm 360/91) 25x basic model Inst. Pipelining Inst. uffering (tretch - 100x ibm Cache (ibm 360/85,...) Virtual ory (multics, ge-645, ibm 360/67,...) TL icroprogramming 80ns, 2Kb. t 4x16b bus 960ns mem 32K cache ns EE176-JU Lec11 PipelinePart2.40 i4004 i8080 i8086 i80286 i Year 4 bit 8 bit 16 bit 32 bit 64 bit uperscalar Pentium i80486

11 Partitioned Instruction Issue (simple uperscalar) Example: XPY independent int and FP issue to separate pipelines I-Cache asic Loop:<- Rm+Ry Int Inst Issue and ypass FP Operand / Result usses Int Unit Load / tore Unit FP dd FP ul -Cache ingle Issue Total Time Int Time + FP Time ax peedup: Total Time X(Int Time, FP Time) Total ingle Issue Cycles: 19 ( 7 integer, 12 floating point) inimum with ual Issue: 12 Potential peedup: 1.6!!! ctual Cycles: 18 EE176-JU Lec11 PipelinePart2.41 EE176-JU Lec11 PipelinePart2.42 Unrolling oftware Pipelining asic Loop: load a <- i load y <- Yi mult m <- a*s add r <- m+y store i <- r inc i inc Yi dec i branch about 9 inst. per 2 FP ops EE176-JU Lec11 PipelinePart2.43 Unrolled Loop: load,load, mult, add, store load,load mult, add, store load,load mult, add,store load,load, mult, add, store inc,inc, dec, branch about 6 inst. per 2 FP ops dependencies between instructions remain. Reordered Unrolled Loop: load, load, load,... mult, mult, mult, mult, add, add, add, add, store, store, store, store inc, inc, dec, branch schedule 24 inst basic block relative to pipeline - delay slots - function unit stalls - multiple function units - pipeline depth load a <- 1 load y <- Y1 load a' <- 2 mult m <- a*s add r <- m+y load y' <- Y2 load a''<- 3 inc, dec mult m' <- a'*s store i <- r add r' <- m'+y' load y''<- Yi+2 load '''<-i+3 branch inc, dec mult m''<-a''*s store i+1 <- r' add r''<-m''+y'' inc Pipelined Loop: load a''' <- i+3 load y'' <- Yi+2 mult m'' <- a''*s add r' <- m'+y' store i <- r inc i+3 inc Yi dec i a''<- a'''; Y'<- y''; m'<- m'';r<-r' branch EE176-JU Lec11 PipelinePart2.44

12 ultiple Pipes/ Harder uperscalar ranch penalties in superscalar $ R 0 1 Issues:. ports ister R $ etecting ata ependences ypassing RW Hazard WR Hazard ultiple load/store ops Example: resolved in op-fetch stage, single exposed delay (ala IP, parc) I-fetch I-fetch ranch delay ranch delay quash 2 quash 1 T T ranches EE176-JU Lec11 PipelinePart2.45 EE176-JU Lec11 PipelinePart2.46 ummary Pipelines pass control information down the pipe just as data moves down pipe Foarding/talls handled by local control Exceptions stop the pipeline IP I instruction set architecture made pipeline visible (delayed branch, delayed load) ore performance from deeper pipelines, parallelism EE176-JU Lec11 PipelinePart2.47

Recap: Summary of Pipelining Basics

Recap: Summary of Pipelining Basics Recap: ummary of Pipelining asics C152 Computer rchitecture and Engineering Lecture 14 Pipelining Control Continued Introduction to dvanced Pipelining arch 8, 2001 John Kubiatowicz (http.cs.berkeley.edu/~kubitron)