Chapter 3. Pipelining. EE511 In-Cheol Park, KAIST


Terminology
  - Pipeline stage
  - Throughput
  - Pipeline register
Ideal speedup
  - Assume the stages are perfectly balanced and the pipeline registers add no overhead
  - Speedup = number of stages
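The ideal-speedup claim can be checked with a small calculation. The following is a minimal sketch, assuming the pipelined clock period is set by the slowest stage plus a fixed register overhead; the function name and the delay values are hypothetical, chosen only for illustration.

```python
def pipeline_speedup(stage_delays_ns, register_overhead_ns=0.0):
    """Speedup of a pipelined datapath over the same logic left unpipelined."""
    # Unpipelined: one instruction passes through all the logic in sequence.
    unpipelined_time = sum(stage_delays_ns)
    # Pipelined: the clock must fit the slowest stage plus register overhead.
    pipelined_cycle = max(stage_delays_ns) + register_overhead_ns
    return unpipelined_time / pipelined_cycle


# Balanced stages, no overhead: speedup equals the number of stages.
ideal = pipeline_speedup([10, 10, 10, 10, 10])

# Unbalanced stages with register overhead: speedup falls below 5.
realistic = pipeline_speedup([10, 8, 10, 10, 7], register_overhead_ns=1)

print(ideal, realistic)
```

With five balanced 10 ns stages the result is exactly 5; any imbalance or register overhead pushes it lower.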

Five pipeline stages
  - IF: Instruction fetch
  - ID: Instruction decode / Register fetch
  - EX: Execution / Effective address
  - MEM: Memory access / Branch completion
  - WB: Write-back
Pipeline registers between stages: IF/ID, ID/EX, EX/MEM, MEM/WB
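How instructions overlap in the five stages can be sketched as a cycle-by-cycle table. This is an illustrative sketch only, assuming one instruction enters the pipeline per cycle and no hazards occur; the helper names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(num_instructions):
    """Return one row per instruction; row[c] is the stage occupied in cycle c."""
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        # Instruction i enters IF in cycle i and advances one stage per cycle.
        for s, stage in enumerate(STAGES):
            row[i + s] = stage
        rows.append(row)
    return rows


def format_diagram(rows):
    """Render the rows as fixed-width text columns."""
    return "\n".join(
        " ".join(f"{cell:<4}" for cell in row).rstrip() for row in rows
    )


print(format_diagram(pipeline_diagram(3)))
```

Three instructions finish in 3 + 5 - 1 = 7 cycles, which is where the throughput gain comes from.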

Hazards
  - Prevent the next instruction from executing during its designated clock cycle, causing a pipeline stall
  - Earlier instructions must continue while later instructions are stalled
  - Classes
      - Structural hazards: resource conflicts
      - Data hazards: data dependencies
      - Control hazards: caused by instructions that change the PC, such as branches

Functional unit conflict
  - Example: a functional unit (FU) that is not fully pipelined
Memory access conflict
  - Example: MEM and IF access memory in the same cycle
  - Solutions: separate I$ and D$, dual-port memory, on-chip I$, instruction queue
Register file access conflict
  - Example: ID and WB access the register file in the same cycle
  - Solutions: multi-port register file, time-multiplexed read/write access

Data hazard classification
  - RAW (true data dependency)
      - Internal data forwarding
          - To reduce forwarding logic, the register file write is done before the read within the same cycle
      - Instruction scheduling: rearranges the code sequence
      - Delayed load: insert a NOP if no suitable instruction can be placed in the delay slot
  - WAR (anti-dependency), WAW (output dependency)
      - Register renaming
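The three dependency classes can be told apart by comparing the destination and source registers of two instructions in program order. The sketch below is illustrative: the `Instr` record and `classify_hazards` function are hypothetical names, and it reports dependencies only. Whether a dependency becomes an actual hazard depends on the pipeline organization.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    name: str
    dest: Optional[str]          # register written, or None
    srcs: Tuple[str, ...] = ()   # registers read


def classify_hazards(earlier, later):
    """Return the set of dependency classes from `earlier` to `later`."""
    hazards = set()
    # RAW: the later instruction reads what the earlier one writes.
    if earlier.dest is not None and earlier.dest in later.srcs:
        hazards.add("RAW")
    # WAR: the later instruction writes what the earlier one reads.
    if later.dest is not None and later.dest in earlier.srcs:
        hazards.add("WAR")
    # WAW: both instructions write the same register.
    if earlier.dest is not None and earlier.dest == later.dest:
        hazards.add("WAW")
    return hazards


add = Instr("ADD", "R1", ("R2", "R3"))
sub = Instr("SUB", "R4", ("R1", "R5"))   # reads R1 produced by ADD
mul = Instr("MUL", "R2", ("R6", "R7"))   # overwrites R2 read by ADD
load = Instr("LD", "R1", ("R8",))        # overwrites R1 written by ADD

print(classify_hazards(add, sub), classify_hazards(add, mul),
      classify_hazards(add, load))
```

Register renaming removes the WAR and WAW cases by giving the later write a different physical register; the RAW case remains because it reflects real data flow.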

Pipeline stalls until the new PC is available
  - Branch delay
  - Turns into a branch penalty
Solutions
  - Stall the pipeline when a branch instruction is found
      - Fill with NOPs
  - Rearrange the code sequence
      - Delayed branch
To reduce the branch penalty
  - Resolve branch instructions as early as possible
      - Target, taken/not-taken
  - Delayed branch / squashed branch / annulled branch
  - Branch prediction
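The cost of the branch penalty can be estimated with a simple CPI model. This is a rough sketch under stated assumptions: the base CPI is 1, only branches stall, and the branch frequency, taken fraction, and penalty below are hypothetical numbers rather than measurements.

```python
def cpi_with_branches(base_cpi, branch_fraction, penalty_cycles,
                      penalized_fraction=1.0):
    """Average CPI when a fraction of branches pays a fixed cycle penalty."""
    return base_cpi + branch_fraction * penalized_fraction * penalty_cycles


# Stall on every branch: all branches pay the full penalty.
stall_always = cpi_with_branches(1.0, 0.2, 3)

# Predict not-taken: only taken branches (assumed 60%) pay the penalty.
predict_not_taken = cpi_with_branches(1.0, 0.2, 3, penalized_fraction=0.6)

print(stall_always, predict_not_taken)
```

Under these assumed numbers, stalling on every branch costs roughly 0.6 extra cycles per instruction, while predict-not-taken costs roughly 0.36.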

Static prediction (at compile time)
  - Predict taken or predict not-taken as a whole
  - Predict on the basis of branch direction
      - Backward-going: taken
      - Forward-going: not taken
  - Profile-based prediction: a prediction for each individual branch instruction
      - Individual branch instructions are highly biased
      - Introduce a prediction bit in the instruction format
Dynamic prediction (at run time)
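The two per-branch static schemes can be sketched in a few lines. This is an illustrative sketch: the function names, the addresses, and the profile counts are hypothetical, and the profile format (taken and not-taken counts per branch address) is an assumption.

```python
def predict_by_direction(branch_pc, target_pc):
    """Backward-going branches are predicted taken, forward-going not taken."""
    return target_pc < branch_pc


def prediction_bits_from_profile(profile):
    """Derive one prediction bit per branch from profiled outcome counts.

    `profile` maps a branch address to (taken_count, not_taken_count).
    """
    return {
        pc: taken >= not_taken
        for pc, (taken, not_taken) in profile.items()
    }


# A loop-closing branch jumps backward, so it is predicted taken.
loop_branch = predict_by_direction(branch_pc=0x120, target_pc=0x100)

# A forward branch (for example, skipping an error path) is predicted not taken.
skip_branch = predict_by_direction(branch_pc=0x200, target_pc=0x240)

bits = prediction_bits_from_profile({0x120: (95, 5), 0x200: (3, 97)})
print(loop_branch, skip_branch, bits)
```

The compiler would store each resulting bit in the prediction field of the corresponding branch instruction.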

Exception / Interrupt
  - Synchronous / asynchronous
  - User requested / coerced
  - User maskable / nonmaskable
  - Within / between instructions
  - Resume / terminate
Restartability
  - Supported by almost all machines
Precise exception
Restarting execution
  1. Force a trap instruction into the pipeline on the next IF
  2. Turn off all writes for the faulting instruction and all following instructions in the pipeline, but not for the preceding instructions
  3. Save the PC of the faulting instruction

Initiation interval (= repeat interval)
  - The number of cycles that must elapse between issuing two operations of a given type
Latency
  - The number of intervening cycles between an instruction that produces a result and an instruction that uses the result
  - Latency = (number of EX stages) - 1
Problems in longer-latency pipelines
  - Structural hazards
  - Multiple register writes in the same cycle
      - Stall the instruction before it issues, or
      - Stall a conflicting instruction when it tries to enter the MEM stage
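The two definitions translate directly into stall counts. The sketch below assumes that latency is one less than the number of EX stages and that full forwarding is available; the function names and the unit parameters (a 7-stage multiplier, a divider with an initiation interval of 25) are illustrative assumptions.

```python
def latency_from_ex_stages(ex_stages):
    """Latency in intervening cycles, assuming latency = EX stages - 1."""
    return ex_stages - 1


def raw_stall_cycles(producer_latency, distance):
    """Stalls seen by a consumer issued `distance` instructions after its producer.

    distance = 1 means the consumer immediately follows the producer.
    """
    intervening = distance - 1
    return max(0, producer_latency - intervening)


def earliest_issue_cycle(last_issue_cycle, initiation_interval):
    """Earliest cycle at which the same unit can accept another operation."""
    return last_issue_cycle + initiation_interval


mul_latency = latency_from_ex_stages(7)          # 6 intervening cycles
back_to_back = raw_stall_cycles(mul_latency, 1)  # consumer directly follows
scheduled = raw_stall_cycles(mul_latency, 4)     # three instructions in between
next_divide = earliest_issue_cycle(3, 25)        # unpipelined divider

print(mul_latency, back_to_back, scheduled, next_divide)
```

Instruction scheduling reduces the RAW stalls by increasing `distance`; the initiation interval is a structural limit that scheduling can only work around by spacing out operations of the same type.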

WAW hazards
  - Instructions no longer reach WB in order
  - Delay the issue of the later instruction, or
  - Stamp out the result of the earlier instruction so that it does not write its result
Out-of-order completion
  - Instructions can complete in an order different from the order in which they were issued
  - Leads to imprecise exceptions
RAW hazards are more frequent
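Out-of-order completion and the resulting WAW hazards can be detected by comparing completion cycles. This is a simplified sketch, assuming an instruction completes `ex_stages` cycles after it issues and ignoring all other stages; the tuple layout, function names, and the example instruction sequence are hypothetical.

```python
def completion_cycle(issue_cycle, ex_stages):
    """Simplified model: the result is written ex_stages cycles after issue."""
    return issue_cycle + ex_stages


def find_waw_hazards(instrs):
    """Find pairs where a later instruction writes the same register first.

    `instrs` is a list of (name, dest, issue_cycle, ex_stages) in program order.
    Equal completion cycles would instead be a write-port (structural) conflict.
    """
    hazards = []
    for i, (name1, dest1, issue1, ex1) in enumerate(instrs):
        for name2, dest2, issue2, ex2 in instrs[i + 1:]:
            if dest1 == dest2 and (completion_cycle(issue2, ex2)
                                   < completion_cycle(issue1, ex1)):
                hazards.append((name1, name2, dest1))
    return hazards


def completes_out_of_order(instrs):
    """True if any instruction completes before one that precedes it."""
    finish = [completion_cycle(issue, ex) for _, _, issue, ex in instrs]
    return any(later < earlier
               for i, earlier in enumerate(finish)
               for later in finish[i + 1:])


program = [
    ("MULTD", "F0", 1, 7),   # long-latency multiply writes F0 at cycle 8
    ("ADDD", "F2", 2, 4),    # completes at cycle 6, ahead of MULTD
    ("LD", "F0", 3, 1),      # writes F0 at cycle 4, before MULTD does
]

print(find_waw_hazards(program), completes_out_of_order(program))
```

In this example the load would leave a stale value in F0 unless its issue is delayed or the multiply's write is suppressed, which are the two remedies listed above.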

Precise / imprecise exceptions
Precise exceptions
  - Exceptions are checked at the WB stage
  - Hardware posts all exceptions in a status vector that is carried along as the instruction goes down the pipeline
  - Once an exception indication is set in the status vector, all writes are turned off
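The status-vector scheme can be sketched as follows. This is a minimal illustrative model, not a description of any particular machine: the `InFlight` record, the register-file dictionary, and the example exception causes are all assumptions. The point it shows is that exceptions are posted whenever they occur but acted on only at WB, in program order.

```python
from dataclasses import dataclass, field


@dataclass
class InFlight:
    name: str
    dest: str
    value: int
    # Status vector: exceptions posted by any stage, carried with the instruction.
    status: list = field(default_factory=list)


def post_exception(instr, stage, cause):
    """Record an exception without acting on it yet."""
    instr.status.append((stage, cause))


def write_back_in_order(instrs, regs):
    """Check status vectors at WB in program order.

    Returns the name of the first faulting instruction, or None. Writes of the
    faulting instruction and of every later instruction are suppressed.
    """
    for instr in instrs:
        if instr.status:
            return instr.name
        regs[instr.dest] = instr.value
    return None


regs = {"R1": 0, "R2": 0, "R3": 0}
i1 = InFlight("ADD", "R1", 11)
i2 = InFlight("LW", "R2", 22)
i3 = InFlight("BAD", "R3", 33)

# The later instruction faults first in time (during ID), the earlier one
# faults afterward (during MEM); program order still decides which is reported.
post_exception(i3, "ID", "illegal opcode")
post_exception(i2, "MEM", "page fault")

faulting = write_back_in_order([i1, i2, i3], regs)
print(faulting, regs)
```

Because ADD precedes the faulting load, its write completes, while the writes of LW and BAD are turned off, leaving a precise state at the load.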

- Ignore the problem and settle for imprecise exceptions
    - Two operating modes: fast but imprecise / slower but precise
- Buffer the results of an instruction until all the instructions that were issued earlier are complete
    - History file / future file
    - Smith and Pleszkun, "Implementing precise interrupts in pipelined processors," IEEE Trans. Computers, 1988
- Keep enough information so that the trap-handling routines can create a precise sequence for the exception
    - Hwu and Patt, "Checkpoint repair for out-of-order execution machines," ISCA 1987
- Allow an instruction to issue only if it is certain that all the instructions before it will complete without causing an exception

- Variable instruction lengths and running times can lead to imbalance among pipeline stages
    - The benefit sometimes justifies the added complexity, as with caches
- Sophisticated addressing modes can complicate pipeline control and make it difficult to keep the pipeline flowing
- Writes into instruction space (self-modifying code) can cause trouble for pipelining
- Implicitly set condition codes make it harder to find when a branch has been decided and to schedule branch delays

Deeper integer pipeline: 8 stages (IF IS RF EX DF DS TC WB)
  - IF: first half of instruction fetch
  - IS: second half of instruction fetch, completion of I$ access
  - DF: first half of data fetch (D$ access)
  - DS: second half of data fetch, completion of D$ access
  - TC: tag check, determines whether the D$ access hit
Two-cycle load delay
Three-cycle branch delay
  - Single-cycle delayed branch
  - Predict-not-taken for the remaining two cycles
  - If taken, two idle cycles
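The load and branch delays of this 8-stage pipeline can be turned into per-instruction stall counts. The following is a rough sketch using only the delays stated above; the function names and the `delay_slot_filled` parameter are hypothetical, and the model assumes an unfilled delay slot holds a NOP that counts as one lost cycle.

```python
LOAD_DELAY = 2           # two-cycle load delay
PREDICTED_CYCLES = 2     # branch delay cycles covered by predict-not-taken


def load_use_stalls(distance):
    """Stalls for a consumer issued `distance` instructions after a load."""
    return max(0, LOAD_DELAY - (distance - 1))


def branch_idle_cycles(taken, delay_slot_filled=True):
    """Cycles lost to one branch under the 1 + 2 cycle branch delay scheme."""
    # A taken branch squashes the two predicted-not-taken cycles.
    idle = PREDICTED_CYCLES if taken else 0
    # An unfilled delay slot wastes one more cycle on a NOP.
    if not delay_slot_filled:
        idle += 1
    return idle


print(load_use_stalls(1), load_use_stalls(3),
      branch_idle_cycles(True), branch_idle_cycles(False))
```

A not-taken branch with a usefully filled delay slot costs nothing, while a taken branch with a NOP in the slot loses the full three cycles.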

- Pitfall: Unexpected execution sequences may cause unexpected hazards
- Pitfall: Extensive pipelining can impact other aspects of a design, leading to overall worse cost/performance
- Fallacy: Increasing the number of pipeline stages always increases performance
- Pitfall: Evaluating a compile-time scheduler on the basis of unoptimized code