COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor Advanced Issues


Prudence Wilcox
5 years ago


COMPUTER ORGANIZATION AND DESIGN: The Hardware/Software Interface, 5th Edition
Chapter 4: The Processor, Advanced Issues
Cptr350 Chapter 4 The Processor Advanced Material

Review: Pipeline Hazards

Structural hazards
- Design the pipeline to eliminate structural hazards.

Data hazards (read before write)
- Use data forwarding inside the pipeline.
- For the cases that forwarding won't solve (e.g., load-use), include hazard hardware to insert stalls in the instruction stream.

Control hazards (beq, bne, j, jr, jal)
- Stall: hurts performance.
- Move the decision point as early in the pipeline as possible: reduces the number of stalls at the cost of additional hardware.
- Delay the decision (requires compiler support): not feasible for deeper pipelines that require more than one delay slot to be filled.
- Predict: with even more hardware, can reduce the impact of control hazard stalls even further if the branch prediction (Branch History Table) is correct and if the branched-to instruction is cached (Branch Target Buffer).
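The load-use case mentioned above can be sketched as a small check. This is a minimal illustration, not the deck's hardware: the struct and its field names (`id_ex_mem_read`, `id_ex_rt`, `if_id_rs`, `if_id_rt`) are assumed names for the pipeline-register fields a hazard unit would inspect in a classic five-stage MIPS pipeline.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the pipeline-register fields the hazard unit reads. */
typedef struct {
    bool id_ex_mem_read; /* instruction currently in EX is a load */
    int  id_ex_rt;       /* register that load will write */
    int  if_id_rs;       /* first source register of the instruction in ID */
    int  if_id_rt;       /* second source register of the instruction in ID */
} HazardInputs;

/* Returns true when a one-cycle stall (bubble) must be inserted because the
   instruction in ID needs a value that the load in EX has not fetched yet. */
bool must_stall_load_use(HazardInputs h) {
    assert(h.id_ex_rt >= 0 && h.id_ex_rt < 32); /* MIPS has 32 registers */
    if (!h.id_ex_mem_read) {
        return false; /* not a load: forwarding alone covers the dependence */
    }
    return h.id_ex_rt == h.if_id_rs || h.id_ex_rt == h.if_id_rt;
}
```

Every other read-before-write case in this simple pipeline is handled by forwarding, which is why the check fires only when the older instruction is a load.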

Exceptions and Interrupts
- Unexpected events requiring attention. Different ISAs use the terms differently.
- Exceptions (sometimes called traps): arise within the CPU, e.g., undefined opcode, overflow, syscall, divide by zero.
- Interrupts: come from an external I/O controller.
- Dealing with them without sacrificing performance is difficult.

Handling Exceptions
- In MIPS, exceptions are managed by a System Control Coprocessor (CP0).
- Save the PC of the offending (or interrupted) instruction in the Exception Program Counter (EPC).
- Save an indication of the problem in the Cause register.
- Jump to the handler at a hard-wired address.
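The three handling steps can be sketched in a few lines of C. This is only an illustration under stated assumptions: the `Cpu` struct, the cause encodings, and `HANDLER_ADDR` are hypothetical placeholders, not the values used by MIPS or by the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cause encodings and handler address, chosen for illustration. */
enum { CAUSE_UNDEFINED_OPCODE = 1, CAUSE_OVERFLOW = 2 };
#define HANDLER_ADDR 0x1000u

typedef struct {
    uint32_t pc;    /* program counter */
    uint32_t epc;   /* Exception Program Counter */
    uint32_t cause; /* Cause register */
} Cpu;

/* Models the three steps: save the offending PC, record the cause,
   then redirect fetch to the single fixed handler. */
void raise_exception(Cpu *cpu, uint32_t offending_pc, uint32_t cause) {
    assert(cpu != NULL);
    cpu->epc = offending_pc;
    cpu->cause = cause;
    cpu->pc = HANDLER_ADDR;
}
```

The handler would then read the Cause register to decide what to do and, if the program can be resumed, use the EPC to return to it.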

An Alternate Mechanism: Vectored Interrupts
- The handler address is determined by the cause.
- This is very common in embedded processors.
- Example: each cause has its own handler address (undefined opcode: C...; overflow: C...; ...: C...).
- The instructions at that address either:
  - deal with the interrupt,
  - jump to the real handler, or
  - pass control to the OS.

Multiple Exceptions
- Pipelining overlaps multiple instructions, so there could be multiple exceptions at once.
- Simple approach: deal with the exception from the earliest instruction and flush subsequent instructions.
- Precise vs. imprecise exception approach.
- In complex pipelines:
  - Multiple instructions are issued per cycle.
  - Instructions complete out of order.
  - Maintaining precise exceptions is difficult.
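The "earliest instruction first" rule can be sketched as follows, assuming a simple in-order five-stage pipeline with one instruction per stage. Under that assumption, the instruction in the deepest stage is the earliest in program order. The stage names and the function name are illustrative, not taken from the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stages listed from youngest instruction (IF) to oldest instruction (WB). */
enum Stage { IF_STAGE, ID_STAGE, EX_STAGE, MEM_STAGE, WB_STAGE, NUM_STAGES };

/* Returns the stage holding the earliest (oldest) instruction that raised an
   exception this cycle, or -1 if no exception was raised. */
int earliest_exception_stage(const bool raised[NUM_STAGES]) {
    assert(raised != NULL);
    for (int s = NUM_STAGES - 1; s >= 0; s--) {
        if (raised[s]) {
            return s;
        }
    }
    return -1;
}
```

Once that stage is known, every instruction in a shallower stage is younger and would be flushed.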

Precise vs. Imprecise Exceptions
An interrupt that leaves the machine in a well-defined state is called a precise interrupt. Such an interrupt has four properties:
1. The Program Counter (PC) is saved in a known place.
2. All instructions before the one pointed to by the PC have fully executed.
3. No instruction beyond the one pointed to by the PC has been executed.
4. The execution state of the instruction pointed to by the PC is known.
An interrupt that does not meet these requirements is called an imprecise interrupt.

Where in the Pipeline Exceptions Occur
(Figure: pipeline stages IM, Reg, ALU, DM, Reg)

Exception               Stage(s)   Synchronous?
Arithmetic overflow     EX         yes
Undefined instruction   ID         yes
TLB or page fault       IF, MEM    yes
I/O service request     any        no
Hardware malfunction    any        no

Multiple exceptions can occur simultaneously in a single clock cycle.

Extracting Yet More Performance
- Superpipelining: increasing the depth of the pipeline to increase the clock rate.
  - The more stages in the pipeline, the more forwarding/hazard hardware is needed and the greater the pipeline latch overhead (i.e., the pipeline latch accounts for a larger and larger percentage of the clock cycle time).
- Multiple issue: fetching and executing more than one instruction at a time (expand every pipeline stage to accommodate multiple instructions).
  - The CPI (cycles per instruction) will be less than 1, so instead we use IPC: instructions per clock cycle.
  - E.g., a 6 GHz, four-way multiple-issue processor can execute at a peak rate of 24 billion instructions per second, with a best-case CPI of 0.25 or a best-case IPC of 4.

Types of Parallelism
- Instruction-level parallelism (ILP): a measure of the average number of instructions in a program that a processor might be able to execute at the same time.
  - Mostly determined by the number of data dependencies and control dependencies in relation to the number of other instructions.
- Machine-level parallelism: a measure of the ability of the processor to take advantage of the ILP of the program.
  - Determined by the number of instructions that can be fetched and executed at the same time.
- To achieve high performance, we need both ILP and machine-level parallelism.
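The peak-rate arithmetic in the 6 GHz example can be checked with two one-line functions. The function names are illustrative. Both compute best-case figures, which assume every issue slot is filled on every cycle.

```c
#include <assert.h>

/* Peak instructions per second = clock rate (Hz) * issue width (best-case IPC). */
double peak_instructions_per_second(double clock_hz, int issue_width) {
    assert(clock_hz > 0.0 && issue_width > 0);
    return clock_hz * issue_width;
}

/* Best-case CPI is the reciprocal of the best-case IPC. */
double best_case_cpi(int issue_width) {
    assert(issue_width > 0);
    return 1.0 / issue_width;
}
```

For the slide's processor, `peak_instructions_per_second(6e9, 4)` gives 24 billion and `best_case_cpi(4)` gives 0.25. Real programs fall short of both because of dependencies and stalls.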

Instruction-Level Parallelism
- Pipelining: executing multiple instructions in parallel.
- To increase ILP:
  - Deeper pipeline: less work per stage ⇒ shorter clock cycle.
  - Multiple issue: replicate pipeline stages ⇒ multiple pipelines; start multiple instructions per clock cycle.
  - But dependencies reduce this considerably in practice.

Multiple Issue
- Static multiple issue
  - The compiler groups instructions into issue packets.
  - The compiler must detect and avoid hazards.
  - E.g., Intel Itanium and Itanium 2 for the IA-64 ISA: EPIC (Explicitly Parallel Instruction Computing), with 128-bit bundles containing three instructions of 41 bits each plus a 5-bit template field, which specifies which functional unit each instruction needs.
- Dynamic multiple issue
  - The CPU examines the instruction stream and chooses instructions to issue each cycle.
  - The compiler can help by reordering instructions.
  - The CPU must resolve hazards at runtime.

Multiple-Issue Datapath Responsibilities
Must handle, with a combination of hardware and software fixes, the fundamental limitations of:
- How many instructions to issue in one clock cycle.
- Data hazards: the limitation is more severe in a superscalar (SS)/VLIW processor due to a (usually) lower ILP.
- Control hazards: must lean heavily on dynamic branch prediction to help resolve the ILP issue.
- Structural hazards: an SS/VLIW processor has a much larger number of potential resource conflicts.
  - Functional units may have to arbitrate for result buses and register-file write ports.
  - Resource conflicts can be eliminated by duplicating the resource or by pipelining the resource.

Static Multiple Issue Machines (VLIW)
Static multiple-issue processors (also known as Very Long Instruction Word, VLIW) use the compiler to statically decide which instructions to issue and execute simultaneously.
- Issue packet: the set of instructions that are bundled together and issued in one clock cycle; think of it as one large instruction with multiple operations.
- The mix of instructions in the packet is usually restricted: a single instruction with several predefined fields.
- The compiler does static branch prediction and code scheduling to reduce control hazards or eliminate data hazards.
- VLIWs have:
  - Multiple functional units.
  - Multi-ported register files.
  - Wide program buses.

Loop Unrolling
Loop unrolling is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size (a space-time tradeoff). The transformation can be undertaken manually by the programmer or by an optimizing compiler.
The goal of loop unrolling is to increase a program's speed by reducing (or eliminating) the instructions that control the loop, such as pointer arithmetic and "end of loop" tests on each iteration; by reducing branch penalties; and by "hiding latencies, in particular, the delay in reading data from memory". To eliminate this overhead, loops can be rewritten as a repeated sequence of similar independent statements.

Loop Unrolling Example
A procedure in a computer program is to delete 100 items from a collection. This is normally accomplished by means of a for-loop that calls the function delete(item_number).

Normal loop:

    int x;
    for (x = 0; x < 100; x++) {
        delete(x);
    }

After loop unrolling:

    int x;
    for (x = 0; x < 100; x += 5) {
        delete(x);
        delete(x + 1);
        delete(x + 2);
        delete(x + 3);
        delete(x + 4);
    }
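The example above works because 100 is a multiple of 5. When the trip count is not a multiple of the unroll factor, a cleanup loop handles the leftover iterations. A minimal sketch, using an array sum unrolled by four (the function name and the factor are illustrative choices, not from the slides):

```c
#include <assert.h>
#include <stddef.h>

/* Sums n integers with the loop body unrolled four times. */
long sum_unrolled(const int *a, size_t n) {
    assert(a != NULL || n == 0);
    long total = 0;
    size_t i = 0;

    /* Main unrolled loop: one end-of-loop test per four elements. */
    for (; i + 4 <= n; i += 4) {
        total += a[i];
        total += a[i + 1];
        total += a[i + 2];
        total += a[i + 3];
    }

    /* Cleanup loop: the 0 to 3 elements left when n is not a multiple of 4. */
    for (; i < n; i++) {
        total += a[i];
    }
    return total;
}
```

The unrolled body performs a quarter as many loop tests and branches as the plain loop. The cost is a larger binary, which is the space-time tradeoff described above.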


Dynamic Pipeline Scheduling
- Allow the CPU to execute instructions out of order to avoid stalls.
- But commit results to registers in order.
- Example:

    lw   $t0, 20($s2)
    addu $t1, $t0, $t2
    sub  $s4, $s4, $t3
    slti $t5, $s4, 20

  The CPU can start sub while addu is waiting for lw.

Dynamically Scheduled CPU
- In-order issue preserves dependencies.
- Reservation stations hold pending operands.
- Results are also sent to any waiting reservation stations.
- The reorder buffer holds register writes and can supply operands for issued instructions.

Register Renaming
Reservation stations and the reorder buffer effectively provide register renaming.
- On instruction issue to a reservation station:
  - If the operand is available in the register file or reorder buffer:
    - It is copied to the reservation station.
    - It is no longer required in the register, which can be overwritten.
  - If the operand is not yet available:
    - It will be provided to the reservation station by a functional unit.
    - A register update may not be required.

In-Order vs. Out-of-Order
- The instruction fetch and decode units are required to issue instructions in order so that dependencies can be tracked.
- The commit unit is required to write results to registers and memory in program fetch order so that:
  - If exceptions occur, the only registers updated will be those written by instructions before the one causing the exception.
  - If branches are mispredicted, the instructions executed after the mispredicted branch don't change the machine state (i.e., we use the commit unit to correct incorrect speculation).
- Although the front end (fetch, decode, and issue) and back end (commit) of the pipeline run in order, the functional units (FUs) are free to initiate execution whenever the data they need is available, which can lead to out-of-order execution.
- Allowing out-of-order execution increases the amount of ILP that can be exploited.
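The effect of renaming can be illustrated with a map table, which is a different mechanism from the reservation-station and reorder-buffer scheme described above but removes the same false dependencies. This is a sketch under stated assumptions: the names (`RenameTable`, `rename_instr`) are hypothetical, and physical registers are handed out from an unbounded counter with no reclamation.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_ARCH_REGS 32

/* An instruction as register numbers; a source of -1 means "unused". */
typedef struct { int dest, src1, src2; } Instr;

/* Maps each architectural register to the physical register holding its
   latest value. Assumption: physical registers are never reclaimed. */
typedef struct {
    int map[NUM_ARCH_REGS];
    int next_free;
} RenameTable;

void rename_init(RenameTable *t) {
    assert(t != NULL);
    for (int r = 0; r < NUM_ARCH_REGS; r++) {
        t->map[r] = r; /* initially, register r lives in physical register r */
    }
    t->next_free = NUM_ARCH_REGS;
}

/* Rewrites one instruction in program order. Sources are looked up BEFORE the
   destination is remapped, so "R3 := R3 * R5" reads the old R3. */
Instr rename_instr(RenameTable *t, Instr in) {
    assert(t != NULL && in.dest >= 0 && in.dest < NUM_ARCH_REGS);
    Instr out;
    out.src1 = (in.src1 >= 0) ? t->map[in.src1] : -1;
    out.src2 = (in.src2 >= 0) ? t->map[in.src2] : -1;
    out.dest = t->next_free++;   /* every write gets a fresh register */
    t->map[in.dest] = out.dest;
    return out;
}
```

Because every write lands in a fresh physical register, two writes to the same architectural register no longer collide (no WAW hazard), and a later write cannot destroy a value an earlier instruction still has to read (no WAR hazard). True RAW dependencies survive, since readers are pointed at the producer's physical register.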

Speculation
Speculation is used to allow execution of future instructions that (may) depend on the speculated instruction:
- Speculate on the outcome of a conditional branch (branch prediction).
- Speculate that a store (for which we don't yet know the address) that precedes a load does not refer to the same address, allowing the load to be scheduled before the store (load speculation).
Must have (hardware and/or software) mechanisms for:
- Checking to see if the guess was correct.
- Recovering from the effects of the instructions that were executed speculatively if the guess was incorrect.
- Ignoring and/or buffering exceptions created by speculatively executed instructions until it is clear that they should really occur.

Predication
Predication can be used to eliminate branches by making the execution of an instruction dependent on a predicate. For example,

    if (p) {statement 1} else {statement 2}

would normally compile using two branches. With predication, it would compile as:

    (p)  statement 1
    (~p) statement 2

Predication can be used to speculate as well as to eliminate branches.
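Plain C has no predicated instructions, but the idea can be approximated in software: compute both outcomes and let the predicate choose which one takes effect, with no branch in the source. The following is an analogy only, with an illustrative function name. It is not how a predicated ISA encodes the guard.

```c
#include <assert.h>
#include <stdint.h>

/* Branch-free select: returns if_true when p is nonzero, else if_false.
   The mask is all ones when p holds and all zeros otherwise. */
int32_t predicated_select(int p, int32_t if_true, int32_t if_false) {
    uint32_t mask = (uint32_t)0 - (uint32_t)(p != 0);
    assert(mask == 0u || mask == UINT32_MAX);
    uint32_t bits = ((uint32_t)if_true & mask) | ((uint32_t)if_false & ~mask);
    return (int32_t)bits;
}
```

As with hardware predication, both alternatives are evaluated and only one result is kept. This removes the control hazard at the cost of doing some work that is thrown away.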

Dependencies Review
When more than one instruction references a particular location for an operand, either reading it (as an input) or writing it (as an output), executing those instructions in an order different from the original program order can lead to three kinds of data hazards:
- Read-after-write (RAW): A read from a register or memory location must return the value placed there by the last write in program order, not some other write. This is referred to as a true dependency or flow dependency, and requires the instructions to execute in program order.
- Write-after-write (WAW): Successive writes to a particular register or memory location must leave that location containing the result of the second write. This can be resolved by squashing (synonyms: cancelling, annulling, mooting) the first write if necessary. WAW dependencies are also known as output dependencies.
- Write-after-read (WAR): A read from a register or memory location must return the last prior value written to that location, and not one written programmatically after the read. This is the sort of false dependency that can be resolved by renaming. WAR dependencies are also known as anti-dependencies.

Dependency Example
With out-of-order execution, a later instruction may execute before a previous instruction, so the hardware needs to resolve both read-before-write and write-before-write data hazards.

    lw   $t0, 0($s1)
    addu $t0, $t1, $s2
    ...
    sub  $t2, $t0, $s2

- If the lw write to $t0 is executed after the addu write, then the sub gets an incorrect value for $t0.
- The addu has an output dependency on the lw (write before write).
- The issuing of the addu might have to be stalled if its result could later be overwritten by a previous instruction that takes longer to complete.
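The three hazard kinds can be told apart by comparing the registers of an earlier and a later instruction. The sketch below is illustrative: the `Instr` layout, the flag names, and the function name are assumptions, and memory locations are ignored.

```c
#include <assert.h>

/* An instruction as register numbers; a source of -1 means "unused". */
typedef struct { int dest, src1, src2; } Instr;

/* Bit flags so that one pair of instructions can report several hazards. */
enum { DEP_RAW = 1, DEP_WAW = 2, DEP_WAR = 4 };

/* Classifies the register dependencies of `later` on `earlier`,
   where `earlier` precedes `later` in program order. */
int classify_dependencies(Instr earlier, Instr later) {
    assert(earlier.dest >= 0 && later.dest >= 0);
    int deps = 0;
    if (later.src1 == earlier.dest || later.src2 == earlier.dest) {
        deps |= DEP_RAW; /* later reads what earlier writes */
    }
    if (later.dest == earlier.dest) {
        deps |= DEP_WAW; /* both write the same register */
    }
    if (later.dest == earlier.src1 || later.dest == earlier.src2) {
        deps |= DEP_WAR; /* later overwrites what earlier reads */
    }
    return deps;
}
```

With the MIPS register numbers $t0 = 8, $t1 = 9, $t2 = 10, $s1 = 17, and $s2 = 18, the lw/addu pair from the example reports `DEP_WAW` and the addu/sub pair reports `DEP_RAW`, matching the output and true dependencies described above.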

Antidependencies
We also must deal with antidependencies: when a later instruction (that executes earlier) produces a data value that destroys a data value used as a source in an earlier instruction (that executes later).

    I1: R3 := R3 * R5
    I2: R4 := R3 + 1
    I3: R3 := R5 + 1

- True data dependency: I2 reads the R3 written by I1.
- Antidependency: I3 writes R3, which I2 reads.
- Output dependency: I1 and I3 both write R3.

The constraint is similar to that of true data dependencies, except reversed: instead of the later instruction using a value (not yet) produced by an earlier instruction (read before write), the later instruction produces a value that destroys a value that the earlier instruction (has not yet) used (write before read).

Does Multiple Issue Work?
- Yes, but not as much as we'd like.
- Programs have real dependencies that limit ILP.
- Some dependencies are hard to eliminate.
- Some parallelism is hard to expose: limited window size during instruction issue.
- Memory delays and limited bandwidth make it hard to keep pipelines full.
- Speculation can help if done well.

Fallacies
- Pipelining is easy.
  - The basic idea is easy.
  - The devil is in the details, e.g., detecting data hazards.
- Pipelining is independent of technology.
  - So why haven't we always done pipelining?
  - More transistors make more advanced techniques feasible.
  - Pipeline-related ISA design needs to take account of technology trends.

Concluding Remarks
- The ISA influences the design of the datapath and control.
  - Poor ISA design can make pipelining harder.
- The datapath and control influence the design of the ISA.
- Pipelining improves instruction throughput using parallelism.
  - More instructions are completed per second.
  - The latency of each instruction is not reduced.
- Hazards: structural, data, control.
- Multiple issue and dynamic scheduling (ILP):
  - Dependencies limit achievable parallelism.
  - Complexity leads to the power wall.
