6.004 Computation Structures Spring 2009


MIT OpenCourseWare. 6.004 Computation Structures, Spring 2009. For information about citing these materials or our Terms of Use, visit:


L22: Pipelining the Beta (modified 4/27/09)

Pipelining the Beta

bet·ta ('be-t&) n. Any of various species of small, brightly colored, long-finned freshwater fishes of the genus Betta, found in southeast Asia.
be·ta ('bA-t&, be-) n. 1. The second letter of the Greek alphabet. 2. The exemplary computer system used in 6.004.

"I don't think they mean the fish..."
"...maybe they'll give me partial credit..."

Lab #7 due Tonight!

CPU Performance

We've got a working Beta... can we make it fast?

MIPS = Freq / CPI

MIPS = Millions of Instructions/second
Freq = Clock Frequency, MHz
CPI = Clocks per Instruction

To Increase MIPS:
1. DECREASE CPI.
- RISC simplicity reduces CPI to 1.0.
- CPI below 1.0? Tough... you'll see multiple instruction issue machines in 6.823.
2. INCREASE Freq.
- Freq limited by delay along longest combinational path; hence
- PIPELINING is the key to improved performance through fast clocks.

Beta Timing

[Figure: precedence graph of the Beta's combinational paths within one clock period, running from New PC (PC+4, +OFFSET) through Fetch Inst., Control Logic, the select muxes, Read Regs and Fetch data to PC setup and Mem setup.]

Wanted: longest paths

Complications:
- some apparent paths aren't possible
- operations have variable execution times (e.g., ALU)
- time axis is not to scale (e.g., t_PD,MEM is very big!)

Why isn't this a 20-minute lecture?

We've learned how to pipeline combinational circuits. What's the big deal?

1. The Beta isn't combinational...
- Explicit state in register file, memory;
- Hidden state in PC.
2. Consecutive operations (instruction executions) interact:
- Jumps, branches dynamically change instruction sequence
- Communication through registers, memory

Our goals:
- Move slow components into separate pipeline stages, running clock faster
- Maintain instruction semantics of unpipelined Beta as far as possible
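The performance relation on the CPU Performance slide, MIPS = Freq / CPI with Freq in MHz, can be tried out numerically. The sketch below is only an illustration: the function name and the clock and CPI values are hypothetical, chosen to show why a faster clock can win even when CPI creeps slightly above 1.0.

```python
def mips(freq_mhz: float, cpi: float) -> float:
    """Millions of instructions per second for a clock in MHz and a given CPI."""
    if cpi <= 0:
        raise ValueError("CPI must be positive")
    return freq_mhz / cpi


# Hypothetical numbers, chosen only to show the trend.
unpipelined = mips(freq_mhz=100.0, cpi=1.0)
pipelined = mips(freq_mhz=400.0, cpi=1.1)  # faster clock, CPI slightly above 1.0
speedup = pipelined / unpipelined
print(f"unpipelined: {unpipelined:.1f} MIPS, pipelined: {pipelined:.1f} MIPS")
print(f"speedup: {speedup:.2f}x")
```

The point of the comparison is that the clock-rate gain dominates as long as pipelining keeps CPI close to 1.0.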

Ultimate Goal: 5-Stage Pipeline

GOAL: Maintain (nearly) 1.0 CPI, but increase clock speed to barely include slowest components (mems, regfile, ALU).

APPROACH: structure processor as 5-stage pipeline:
- IF, Instruction Fetch stage: Maintains PC, fetches one instruction per cycle and passes it to
- RF, Register File stage: Reads source operands from register file, passes them to
- ALU, ALU stage: Performs indicated operation, passes result to
- MEM, Memory stage: If it's a LD, use ALU result as an address, pass mem data (or ALU result if not LD) to
- WB, Write-Back stage: writes result back into register file.

First Steps: Simple 2-Stage Pipeline

[Figure: Beta datapath split into two stages, with PC and IR pipeline registers between instruction fetch and execution.]

2-Stage Pipelined Beta Operation

Consider a sequence of instructions:
    ...
    ADDC(r1, 1, r2)
    SUBC(r1, 1, r3)
    XOR(r1, r5, r1)
    MUL(r2, r6, r0)
    ...

Executed on our 2-stage pipeline:

[Figure: pipeline diagram, pipeline stage vs. TIME (cycles); each instruction is fetched in one cycle and executed in the next, so the fetch of one instruction overlaps the execution of its predecessor.]

Pipeline Control Hazards

BUT consider instead:
    LOOP: ADD(r1, r3, r3)
          CMPLEC(r3, 100, r0)
          BT(r0, LOOP)
          XOR(r3, -1, r3)
          MUL(r1, r2, r2)
          ...

[Figure: pipeline diagram for the loop, with a "?" in the fetch stage during the cycle in which the branch executes.]

This is the cycle where the branch decision is made... but we've already fetched the following instruction, which should be executed only if the branch is not taken!
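The overlap that creates the control hazard can be made concrete with a toy timeline. This is a minimal sketch, assuming an idealized 2-stage pipeline in which the instruction fetched in cycle t executes in cycle t+1; the function name and the placeholder instruction labels are hypothetical and are not taken from the slides.

```python
def two_stage_timeline(instructions):
    """Return (cycle, in_fetch, in_execute) rows for an idealized 2-stage pipeline."""
    rows = []
    n = len(instructions)
    for cycle in range(n + 1):
        fetch = instructions[cycle] if cycle < n else None
        execute = instructions[cycle - 1] if cycle >= 1 else None
        rows.append((cycle, fetch, execute))
    return rows


# Placeholder labels: "branch" stands for any branch, "next" for its successor in memory.
program = ["i0", "i1", "branch", "next"]
timeline = two_stage_timeline(program)

# Find the cycle in which the branch is in the execute stage.
branch_cycle = next(row for row in timeline if row[2] == "branch")
for row in timeline:
    print(row)
```

In the row where `"branch"` is executing, `"next"` already occupies the fetch stage, which is exactly the instruction that must be annulled or tolerated if the branch is taken.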

Branch Delay Slots

PROBLEM: One (or more) following instructions have been prefetched by the time a branch is taken.

POSSIBLE SOLUTIONS:
1. Make hardware annul instructions following branches which are taken, e.g., by disabling WERF and WR.
2. Program around it. Either
a) Follow each BR with a NOP instruction; or
b) Make compiler clever enough to move USEFUL instructions into the branch delay slots
i. Always execute instructions in delay slots
ii. Conditionally execute instructions in delay slots

NOP = no-operation, e.g. ADD(R31, R31, R31)

Branch Alternative 1

Make the hardware annul instructions in the branch delay slots of a taken branch.

    LOOP: ADD(r1, r3, r3)
          CMPLEC(r3, 100, r0)
          BT(r0, LOOP)
          XOR(r3, -1, r3)
          MUL(r1, r2, r2)

[Figure: pipeline diagram; when the branch is taken, the instruction fetched behind it is replaced by a NOP.]

Pros: same program runs on both unpipelined and pipelined hardware
Cons: in SPEC benchmarks 14% of instructions are taken branches, so 12% of total cycles are annulled

Branch Annulment Hardware

[Figure: 2-stage Beta datapath extended with an ANNUL control signal.]

Branch Alternative 2a

Fill branch delay slots with NOP instructions (i.e., the software equivalent of alternative 1).

    LOOP: ADD(r1, r3, r3)
          CMPLEC(r3, 100, r0)
          BT(r0, LOOP)
          NOP()
          XOR(r3, -1, r3)
          MUL(r1, r2, r2)

[Figure: pipeline diagram; the NOP occupies the delay slot whenever the branch is taken.]

Pros: same as alternative 1
Cons: NOPs make code longer; 12% of cycles spent executing NOPs
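The cost quoted for annulment follows from a short calculation. The sketch below assumes every instruction takes one cycle and every taken branch wastes one extra cycle per delay slot; it also takes the taken-branch fraction to be 0.14, which is an assumed reading of the SPEC figure on the slide. The function name is hypothetical.

```python
def annulled_cycle_fraction(taken_branch_fraction: float, delay_slots: int = 1) -> float:
    """Fraction of all cycles spent on annulled (or NOP) delay-slot instructions.

    Assumes one cycle per instruction plus `delay_slots` wasted cycles
    for every taken branch.
    """
    wasted_per_instruction = taken_branch_fraction * delay_slots
    return wasted_per_instruction / (1.0 + wasted_per_instruction)


one_slot = annulled_cycle_fraction(0.14, delay_slots=1)
two_slots = annulled_cycle_fraction(0.14, delay_slots=2)
print(f"1 delay slot:  {one_slot:.1%} of cycles wasted")
print(f"2 delay slots: {two_slots:.1%} of cycles wasted")
```

With one delay slot, 14 wasted cycles per 114 total comes out at roughly 12%, and the same arithmetic applies whether the slot is annulled in hardware or filled with a NOP in software.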

Branch Alternative 2b(i)

Put USEFUL instructions in the branch delay slots; remember they will be executed whether the branch is taken or not.

    LOOP:  ADD(r1, r3, r3)
    LOOPx: CMPLEC(r3, 100, r0)
           BT(r0, LOOPx)
           ADD(r1, r3, r3)
           SUB(r3, r1, r3)
           XOR(r3, -1, r3)
           MUL(r1, r2, r2)

We need to add this silly instruction (the SUB) to UNDO the effects of that last ADD.

Pros: only two extra instructions are executed (on last iteration)
Cons: finding useful instructions that are always executed is difficult; clever rewrite may be required. Program executes differently on naïve unpipelined implementation.

Branch Alternative 2b(ii)

Put USEFUL instructions in the branch delay slots; annul them if branch doesn't behave as predicted.

    LOOP:  ADD(r1, r3, r3)
    LOOPx: CMPLEC(r3, 100, r0)
           BT.taken(r0, LOOPx)
           ADD(r1, r3, r3)
           XOR(r3, -1, r3)
           MUL(r1, r2, r2)

Pros: only one instruction is annulled (on last iteration); about 70% of branch delay slots can be filled with useful instructions
Cons: Program executes differently on naïve unpipelined implementation; not really useful with more than one delay slot.

Architectural Issue: Branch Decision Timing

BETA approach: SIMPLE branch condition logic... Test for Reg[Ra] = 0!
ADVANTAGE: early decision, single delay slot

ALTERNATIVES: Compare-and-branch (e.g., if Reg[Ra] > Reg[Rb])
MORE powerful, but LATER decision (hence more delay slots)

"Wow! I guess those guys really were thinking when they made up all those instructions."

[Figure: pipeline stages in order: Instruction Fetch, Register File (read), ALU, Write Back (register file write).]

Suppose decision were made in the ALU stage... then there would be 2 branch delay slots (and instructions to annul!)

4-Stage Beta Pipeline

[Figure: 4-stage Beta datapath with PC and IR pipeline registers carried through the register-file, ALU and write-back stages.]

Treat register file as two separate devices: combinational READ, clocked WRITE at end of pipe.

What other information do we have to pass down pipeline?
- PC (return addresses)
- instruction fields (decoding)

What sort of improvement should we expect in cycle time?
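The link between where the branch decision is made and how many delay slots result can be written down directly. This is a minimal sketch, assuming one instruction enters the pipeline per cycle and that the stages are named IF, RF, ALU, WB (names assumed here for illustration); the function names and the 0.14 taken-branch fraction are likewise assumptions.

```python
STAGES = ["IF", "RF", "ALU", "WB"]  # assumed stage names, in pipeline order


def branch_delay_slots(decision_stage: str, stages=STAGES) -> int:
    """Instructions fetched behind a branch before its outcome is known.

    A branch resolved in the k-th stage after fetch has k younger
    instructions already in the pipeline.
    """
    return stages.index(decision_stage)


def effective_cpi(taken_branch_fraction: float, decision_stage: str) -> float:
    """CPI if every delay slot of a taken branch is annulled."""
    return 1.0 + taken_branch_fraction * branch_delay_slots(decision_stage)


for stage in ("RF", "ALU"):
    slots = branch_delay_slots(stage)
    print(stage, slots, round(effective_cpi(0.14, stage), 2))
```

An early, simple test resolved in the register-file stage costs one slot, while a compare-and-branch resolved in the ALU stage costs two, which is the trade-off the slide describes.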

4-Stage Beta Operation

Consider a sequence of instructions:
    ...
    ADDC(r1, 1, r2)
    SUBC(r1, 1, r3)
    XOR(r1, r5, r1)
    MUL(r2, r6, r0)
    ...

Executed on our 4-stage pipeline:

[Figure: pipeline diagram, pipeline stage vs. TIME (cycles); each instruction advances one stage per cycle, ending in WB.]

Pipeline Data Hazard

BUT consider instead:
    ADD(r1, r2, r3)
    CMPLEC(r3, 100, r0)
    MULC(r1, 100, r4)
    SUB(r1, r2, r5)

[Figure: pipeline diagram for this sequence, marking the cycle in which r3 is fetched and the later cycle in which r3 becomes available.]

Oops! CMPLEC is trying to read Reg[R3] during cycle i+2 but ADD doesn't write its result into Reg[R3] until the end of cycle i+3!

Data Hazard Solution 1

Program around it... document weirdo semantics, declare it a software problem.
- Breaks sequential semantics!
- Costs code efficiency.

EXAMPLE: Rewrite
    ADD(r1, r2, r3)
    CMPLEC(r3, 100, r0)
    MULC(r1, 100, r4)
    SUB(r1, r2, r5)
as
    ADD(r1, r2, r3)
    MULC(r1, 100, r4)
    SUB(r1, r2, r5)
    CMPLEC(r3, 100, r0)

How often can we do this?

Programmer's fallback: Insert NOPs (sigh!)

Data Hazard Solution 2

Stall the pipeline: Freeze IF, RF stages for 2 cycles, inserting NOPs into the ALU-stage instruction register.

[Figure: pipeline diagram with the dependent instruction held in the RF stage for two extra cycles while NOPs flow into the ALU stage.]

Drawback: NOPs mean wasted cycles
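The stall count for a read-after-write hazard can be checked with a small scheduler. This is a sketch under stated assumptions: a 4-stage pipeline in which an instruction that reads its operands in cycle c writes its result at the end of cycle c+2, no bypass paths, and an always-zero register named r31. The function name, the tuple encoding of instructions, and the register numbers are all illustrative.

```python
def rf_schedule(program, zero_reg="r31"):
    """Return (rf_cycles, stall_cycles) for a list of (dest, sources) tuples.

    Model: an instruction reading registers in cycle c passes through the
    ALU in c+1 and writes back at the end of c+2, so its result can first
    be read from the register file in cycle c+3. No bypassing.
    """
    readable_from = {}  # register -> first cycle its new value can be read
    rf_cycles = []
    stalls = 0
    prev_rf = 0
    for dest, sources in program:
        earliest = prev_rf + 1  # one instruction enters the read stage per cycle
        needed = [readable_from.get(s, 0) for s in sources if s != zero_reg]
        rf = max([earliest] + needed)
        stalls += rf - earliest
        rf_cycles.append(rf)
        prev_rf = rf
        if dest is not None and dest != zero_reg:
            readable_from[dest] = rf + 3
    return rf_cycles, stalls


# Illustrative register usage: the second instruction reads what the first writes.
hazard_order = [("r3", ["r1", "r2"]), ("r0", ["r3"]), ("r4", ["r1"]), ("r5", ["r1", "r2"])]
# Same instructions with the dependent one moved to the end.
reordered = [hazard_order[0], hazard_order[2], hazard_order[3], hazard_order[1]]

print(rf_schedule(hazard_order))
print(rf_schedule(reordered))
```

Under this model the back-to-back dependency costs two stall cycles, and moving two independent instructions between producer and consumer removes the stalls entirely, matching the reordering in Solution 1.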


Data Hazard Solution 3

Bypass (aka forwarding) Paths: Add extra data paths & control logic to re-route data in problem cases.

[Figure: pipeline diagram for the hazard sequence, marking the cycle in which r3 is available at the ALU output.]

Bypass Paths (I)

[Figure: datapath with bypass muxes on the register-file read ports; the RF-stage IR holds CMPLEC(r3, 100, r0) and the ALU-stage IR holds ADD(r1, r2, r3).]

SELECT this BYPASS path if
    OpCode_RF = reads Ra
    and OpCode_ALU = OP, OPC (i.e., instructions which use the ALU to compute their result)
    and Ra_RF = Rc_ALU
    and Ra_RF != R31

Idea: the result from the ADD, which will be written into the register file at the end of cycle I+3, is actually available at the output of the ALU during cycle I+2, just in time for it to be used by CMPLEC in the RF stage!

Bypass Paths (II)

[Figure: datapath with a second bypass path from the write-back data into the bypass muxes.]

SELECT this BYPASS path if
    OpCode_RF = reads Ra
    and Ra_RF != R31
    and not using ALU bypass
    and WERF = 1
    and Ra_RF = WA

But why can't we get it from the register file? It's being written this cycle!

Next Time

More Beta. Bypasses ahead.
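The two SELECT conditions can be combined into one priority decision for the Ra operand. The sketch below is an interpretation rather than the actual Beta control logic: the function name, parameter names, and returned labels are hypothetical, registers are plain integers, and register 31 is assumed to be the always-zero register that never needs bypassing.

```python
ZERO_REG = 31  # assumed always-zero register; never bypassed


def select_ra_source(ra_rf, reads_ra, alu_computes_result, rc_alu, werf_wb, wa_wb):
    """Choose where the RF-stage instruction should take its Ra operand from.

    ra_rf:               register number read by the RF-stage instruction
    reads_ra:            True if that instruction actually reads Ra
    alu_computes_result: True if the ALU-stage instruction is an OP/OPC
    rc_alu:              destination register of the ALU-stage instruction
    werf_wb, wa_wb:      write enable and write address in the write-back stage
    """
    if not reads_ra or ra_rf == ZERO_REG:
        return "regfile"
    if alu_computes_result and ra_rf == rc_alu:
        return "alu_bypass"  # newest value, so it takes priority
    if werf_wb and ra_rf == wa_wb:
        return "wb_bypass"   # value is being written this cycle
    return "regfile"


# RF-stage instruction reads r3 while the ALU-stage instruction is producing r3.
print(select_ra_source(3, True, True, 3, False, 0))
# RF-stage instruction reads r1 while r1 is being written back.
print(select_ra_source(1, True, True, 5, True, 1))
```

Checking the ALU path first mirrors the "not using ALU bypass" clause: if both older instructions target the same register, the younger result in the ALU stage is the one the reader should see.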
