COSC 6385 Computer Architecture - Pipelining, Fall 2006. Some of the slides are based on a lecture by David Culler.
Instruction Set Architecture
Relevant features for distinguishing ISAs:
- Internal storage
- Memory addressing
- Type and size of operands
- Operations
- Instructions for flow control
- Encoding of the ISA
Pipelining
Pipelining is an implementation technique whereby multiple instructions are overlapped in execution:
- Split an expensive operation into several sub-operations
- Execute the sub-operations in a staggered manner
Real-world analogy: the assembly line in car manufacturing
- Each station is doing something different
- Each station is working on a separate car
Pipelining increases throughput, but does not reduce the latency of an individual operation.
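The throughput-vs-latency point can be made numerically. A minimal sketch under an assumed idealized timing model (k equal stages of duration t, no hazards; not from the slides):

```python
# Toy timing model (assumed, for illustration): k equal stages of
# duration t. The latency of one operation stays k*t, but N overlapped
# operations finish in (k + N - 1)*t instead of N*k*t.

def unpipelined_time(n_ops, k, t=1):
    return n_ops * k * t        # operations run strictly back to back

def pipelined_time(n_ops, k, t=1):
    return (k + n_ops - 1) * t  # one completion per cycle once the pipe is full

print(unpipelined_time(100, 5))  # 500
print(pipelined_time(100, 5))    # 104
print(pipelined_time(1, 5))      # 5 -- latency of a single operation is unchanged
```

For large N the pipelined time approaches N*t, i.e. roughly a k-fold throughput gain, while a single operation still takes all k stages.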
Classes of instructions
ALU instructions
- Take either 2 registers as operands, or 1 register and one 16-bit immediate offset
- Results are stored in a 3rd register
Load and store instructions
Branches and jumps
Typical implementation of an instruction (I)
1. Instruction fetch cycle (IF)
- Send PC to memory
- Fetch current instruction
- Update PC to next sequential PC (+4 bytes)
2. Instruction decode/register fetch cycle (ID)
- Decode instruction
- Read registers corresponding to register source specifiers from the register file
- Sign-extend offset fields if needed
- Compute possible branch target address
Typical implementation of an instruction (II)
3. Execution/effective address cycle (EX)
- Adds base register and offset to form the effective address, or
- Performs the operation on the values read from the register file, or
- Performs the operation on a value read from a register and the sign-extended immediate
4. Memory access cycle (MEM)
- If the instruction is a load, read memory using the effective address computed in step 3
- If the instruction is a store, write the data from the second register read of the register file to the effective address
5. Write-back cycle (WB)
- Write the result into the register file
- From memory for a load instruction
- From the ALU for an ALU instruction
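The five steps can be traced in miniature for a single load. A toy Python walk-through (the register file and memories are plain dicts, and the instruction tuple format is invented for illustration, not a real ISA encoding):

```python
regs = {2: 100}              # R2 holds base address 100
mem  = {104: 42}             # data memory
imem = {0: ("lw", 1, 4, 2)}  # "lw R1, 4(R2)" stored at PC 0
pc = 0

# 1. IF: send PC to memory, fetch the instruction, PC := PC + 4
inst = imem[pc]
pc += 4
# 2. ID: decode the instruction, read the base register
op, rd, offset, base = inst
base_val = regs[base]
# 3. EX: effective address = base register + offset
addr = base_val + offset
# 4. MEM: the instruction is a load, so read data memory at the address
data = mem[addr]
# 5. WB: write the loaded value into the register file
regs[rd] = data

print(regs[1])  # 42
```

Each numbered comment corresponds to one cycle of the five-step scheme above; for an ALU instruction, step 3 would compute the result and step 4 would do nothing.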
Typical implementation of an instruction (III)
[Figure: five-stage datapath - Instruction Fetch, Instruction Decode/Register Fetch, Execute/Address Calculation, Memory Access, Write Back - showing PC, adder (Next PC / Next SEQ PC), instruction memory, register file (RS1, RS2, RD), sign-extended immediate, ALU with zero test, data memory (LMD), and the muxes feeding the WB data path.]
Datapath (I)
Fetching instructions and incrementing the program counter (PC)
[Figure: the PC feeds the instruction memory's read address; an adder computes PC + 4.]
Datapath (II)
ALU instructions, e.g. add R1, R2, R3
- Register number inputs are 5 bits wide if you have 32 (= 2^5) registers
- ALU operation control signal (4 bits)
[Figure: register file with two read ports (Read register 1/2 -> Read data 1/2) and one write port (Write register, Write data), feeding an ALU with a 4-bit ALU operation control, a Zero output, and a RegWrite control signal.]
Datapath (III)
Load/store instructions, e.g. LW R1, offset(R2)
Basic steps for a load/store operation:
- Sign-extend the offset from 16 to 32 bits
- Add the sign-extended offset to R2
- Load the content of the resulting address into R1, or store the data from R1 to the resulting memory address
[Figure: sign-extend unit (16 -> 32 bits) and data memory with MemRead/MemWrite control signals.]
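The sign-extension step can be written out directly. A minimal sketch in Python, treating the 16-bit offset as a two's-complement number:

```python
def sign_extend16(v):
    # Keep the low 16 bits, then interpret bit 15 as the sign bit:
    # values with bit 15 set represent negatives in two's complement.
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v

print(sign_extend16(0x0004))  # 4
print(sign_extend16(0xFFFC))  # -4 (0xFFFC is -4 in 16-bit two's complement)
```

In hardware this is just replicating bit 15 into the upper 16 bits of the 32-bit value; the arithmetic above produces the same signed result.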
Datapath (IV)
Combining load/store and ALU instructions
[Figure: combined datapath - an ALUSrc mux selects between Read data 2 and the sign-extended immediate as the second ALU input, and a MemtoReg mux selects between the ALU result and the data-memory read data for write-back to the register file; MemRead/MemWrite control the data memory.]
Datapath (V)
Branches, e.g. beq R1, R2, offset
Basic steps for a branch-equal instruction:
- Compute the branch target address:
  - Sign-extend the offset field
  - Shift the offset field left by 2 bits to ensure a word offset
  - Add the shifted, sign-extended offset to the PC
- Compare registers R1 and R2
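Those steps translate directly into code. A sketch in Python, with one assumption made explicit: the shifted offset is added to the already incremented PC (PC + 4), as in MIPS-style hardware:

```python
def sign_extend16(v):
    # 16-bit two's-complement interpretation of the offset field
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v

def beq_next_pc(pc, r1_val, r2_val, offset16):
    # shift left by 2 turns the word offset into a byte offset
    target = (pc + 4) + (sign_extend16(offset16) << 2)
    return target if r1_val == r2_val else pc + 4

print(beq_next_pc(100, 7, 7, 4))  # 120 -- taken: 104 + (4 << 2)
print(beq_next_pc(100, 7, 8, 4))  # 104 -- not taken: fall through to PC + 4
```

A negative offset branches backward: `beq_next_pc(100, 1, 1, 0xFFFF)` returns 100, since the sign-extended offset is -1 word.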
Datapath (VI)
Implementation of branches, e.g. beq R1, R2, offset
[Figure: PC+4 from the instruction datapath is added to the sign-extended offset, shifted left by 2, to form the branch target; the ALU compares the two register values and sends its Zero output to the branch control logic.]
Visualizing pipelining
Time (clock cycles) ->

         Cycle: 1    2    3    4    5    6    7
Instr 1:        IF   ID   EX   MEM  WB
Instr 2:             IF   ID   EX   MEM  WB
Instr 3:                  IF   ID   EX   MEM  WB
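A staggered diagram like the one above can be generated mechanically from one rule: instruction i enters IF in cycle i+1 and advances one stage per cycle. A small sketch (column widths and the "." filler are arbitrary choices):

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_in_cycle(instr_index, cycle):
    # Instruction i (0-based) enters IF in cycle i+1 and moves one
    # stage per cycle; outside its active window it occupies no stage.
    k = cycle - 1 - instr_index
    return STAGES[k] if 0 <= k < len(STAGES) else "."

for i in range(3):
    row = " ".join(f"{stage_in_cycle(i, c):>3}" for c in range(1, 8))
    print(f"Instr {i + 1}: {row}")
```

The printed rows reproduce the diagonal pattern: each instruction's row is the previous row shifted right by one cycle.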
Effects of pipelining
- A pipeline of depth n requires n times the memory bandwidth of a non-pipelined processor at the same clock rate
  - Separate data and instruction caches eliminate some memory conflicts
- The register file is used in stage ID and in stage WB
  - Usually not a conflict, since writes are executed in the first half of the clock cycle and reads in the second half
- Instructions in the pipeline should not attempt to use the same hardware resources at the same time
  - Introduce pipeline registers between successive stages of the pipeline
  - Registers are named after the stages they connect (e.g. IF/ID, ID/EX, etc.)
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB between the Instruction Fetch, Instruction Decode/Register Fetch, Execute/Address Calculation, Memory Access, and Write Back stages; the destination register specifier (RD) is carried forward through each pipeline register to the write-back stage.]
Pipeline Hazards
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.
- Structural hazards: the hardware cannot support this combination of instructions
- Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
- Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
One Memory Port / Structural Hazards
[Figure: pipeline diagram over cycles 1-7 for Load, Instr 1, Instr 2, Instr 3, Instr 4 - in cycle 4 the Load's MEM stage and Instr 3's IF stage both need the single memory port.]
One Memory Port / Structural Hazards
[Figure: the same sequence with the conflict resolved by a stall - Instr 3 is delayed one cycle, inserting a bubble that propagates through all stages of the pipeline.]
Speed Up Equation for Pipelining

CPI_pipelined = Ideal CPI + Average stall cycles per instruction

Speedup = (Ideal CPI x Pipeline depth) / (Ideal CPI + Pipeline stall CPI)
          x (Cycle time unpipelined / Cycle time pipelined)

For a simple RISC pipeline with Ideal CPI = 1:

Speedup = Pipeline depth / (1 + Pipeline stall CPI)
          x (Cycle time unpipelined / Cycle time pipelined)
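The speedup formula translates directly into a helper. A sketch with the cycle-time ratio as an explicit parameter (the function name is invented):

```python
def pipeline_speedup(depth, stall_cpi, ideal_cpi=1.0, cycle_time_ratio=1.0):
    # cycle_time_ratio = unpipelined cycle time / pipelined cycle time
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * cycle_time_ratio

print(pipeline_speedup(5, 0.0))  # 5.0 -- ideal 5-stage pipeline, no stalls
print(pipeline_speedup(5, 1.0))  # 2.5 -- one stall cycle per instruction halves it
```

The second call illustrates how quickly stalls erode the ideal depth-fold speedup.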
Example: Dual-port vs. Single-port
- Machine A: dual-ported memory ("Harvard Architecture")
- Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
- Ideal CPI = 1 for both
- Loads are 40% of the instructions executed

SpeedUp_A = Pipeline depth / (1 + 0) x (clock_unpipe / clock_pipe) = Pipeline depth
SpeedUp_B = Pipeline depth / (1 + 0.4 x 1) x (clock_unpipe / (clock_unpipe / 1.05))
          = (Pipeline depth / 1.4) x 1.05 = 0.75 x Pipeline depth
SpeedUp_A / SpeedUp_B = Pipeline depth / (0.75 x Pipeline depth) = 1.33

Machine A is 1.33 times faster.
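Plugging the example's numbers into the speedup formula checks the arithmetic; pipeline depth cancels in the ratio, so it is set to 1 here:

```python
# Machine A: dual-ported memory, so loads cause no structural stalls
speedup_a = 1 / (1 + 0.0)
# Machine B: 40% loads, each stalling 1 cycle, but a 1.05x faster clock
speedup_b = 1 / (1 + 0.4 * 1) * 1.05

print(speedup_b)              # ~0.75 (per unit of pipeline depth)
print(speedup_a / speedup_b)  # ~1.333 -> Machine A is 1.33x faster
```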
Data Hazard on R1
[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) for the sequence below - the following instructions read r1 before the add has written it back.]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11
Three Generic Data Hazards
Read After Write (RAW): Instr J tries to read an operand before Instr I writes it.
I: add r1,r2,r3
J: sub r4,r1,r3
Caused by a "dependence" (in compiler nomenclature). This hazard results from an actual need for communication.
Three Generic Data Hazards
Write After Read (WAR): Instr J writes an operand before Instr I reads it.
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an "anti-dependence" by compiler writers; it results from reuse of the name r1.
Can't happen in our 5-stage pipeline because:
- All instructions take 5 stages,
- Reads are always in stage 2, and
- Writes are always in stage 5
Three Generic Data Hazards
Write After Write (WAW): Instr J writes an operand before Instr I writes it.
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an "output dependence" by compiler writers; it also results from reuse of the name r1.
Can't happen in the DLX 5-stage pipeline because:
- All instructions take 5 stages, and
- Writes are always in stage 5
We will see WAR and WAW hazards in more complicated pipelines.
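The three cases can be checked mechanically from the register sets of two instructions. A sketch (the set-based instruction representation is invented for illustration):

```python
def classify_hazards(i_writes, i_reads, j_writes, j_reads):
    # i precedes j in program order; each argument is a set of register names
    found = []
    if i_writes & j_reads:
        found.append("RAW")  # j reads something i writes: true dependence
    if i_reads & j_writes:
        found.append("WAR")  # j overwrites something i reads: anti-dependence
    if i_writes & j_writes:
        found.append("WAW")  # j overwrites something i writes: output dependence
    return found

# I: add r1,r2,r3  followed by  J: sub r4,r1,r3  -> RAW on r1
print(classify_hazards({"r1"}, {"r2", "r3"}, {"r4"}, {"r1", "r3"}))  # ['RAW']
```

Only RAW corresponds to a real flow of data; WAR and WAW are name conflicts that register renaming can remove.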
Forwarding to Avoid Data Hazard
[Figure: the same instruction sequence - forwarding paths route the ALU result of the add directly to the ALU inputs of the following instructions, avoiding the stalls.]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11
Data Hazard Even with Forwarding
[Figure: pipeline diagram - the lw produces r1 only at the end of its MEM stage, too late to forward to the EX stage of the immediately following sub.]
lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9
Data Hazard Even with Forwarding
[Figure: the same sequence with a one-cycle bubble inserted after the lw, so the loaded value can be forwarded from MEM to the EX stage of the sub.]
lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9
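The load-use case is the one forwarding cannot hide, so a hazard-detection unit must insert the bubble. A sketch of the check (the tuple-based instruction representation is invented):

```python
def load_use_stall(prev, cur):
    # prev and cur are (opcode, destination, source registers).
    # A load followed immediately by an instruction reading its
    # destination needs one bubble: the loaded value only exists
    # at the end of MEM, one cycle after the consumer's EX begins.
    op, dest, _ = prev
    return op == "lw" and dest in cur[2]

print(load_use_stall(("lw", "r1", ["r2"]), ("sub", "r4", ["r1", "r6"])))       # True
print(load_use_stall(("add", "r1", ["r2", "r3"]), ("sub", "r4", ["r1", "r6"])))  # False
```

The second case needs no stall because an ALU result is available at the end of EX and can be forwarded directly into the next instruction's EX stage.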
Branches: Pipelined Datapath
[Figure: pipelined datapath with an extra adder in the Execute/Address Calculation stage that adds the sign-extended immediate to the incremented PC to compute the branch target, and a Zero test on the ALU output to decide the branch.]
Four Branch Hazard Alternatives
#1: Stall until the branch direction is clear
#2: Predict branch not taken
- 47% of branches are not taken on average
- Execute successor instructions in sequence; PC+4 is already calculated, so use it to fetch the next instruction
- Squash instructions in the pipeline if the branch is actually taken
- Advantage of late pipeline state update
#3: Predict branch taken
- 53% of branches are taken on average
- But the branch target address hasn't been calculated yet, so this still incurs a 1-cycle branch penalty
- On other machines the branch target is known before the branch outcome
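The cost of a prediction scheme can be expressed as extra stall CPI. A sketch, where the 20% branch frequency is an assumed value for illustration (the 53%/47% taken split is from the slide):

```python
def branch_stall_cpi(branch_freq, taken_frac, taken_penalty, not_taken_penalty):
    # Average extra stall cycles per instruction caused by branches
    return branch_freq * (taken_frac * taken_penalty +
                          (1 - taken_frac) * not_taken_penalty)

# Predict-not-taken: taken branches cost a 1-cycle squash, others are free.
# Branch frequency of 20% is an assumption, not a figure from the slides.
print(branch_stall_cpi(0.20, 0.53, 1, 0))  # ~0.106 extra cycles per instruction
```

With Ideal CPI = 1, that would raise the effective CPI to roughly 1.11 under these assumptions.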
Four Branch Hazard Alternatives
#4: Delayed branch
- Define the branch to take place AFTER a following instruction:
    branch instruction
    sequential successor 1
    sequential successor 2
    ...
    sequential successor n
    branch target (if taken)
- Branch delay of length n
- A 1-slot delay allows a proper decision and branch target address in the 5-stage pipeline
Delayed Branch
Where to get instructions to fill the branch delay slot?
- From before the branch instruction
- From the target address: only valuable when the branch is taken
- From the fall-through path: only valuable when the branch is not taken
Compiler effectiveness for a single branch delay slot:
- Fills about 60% of branch delay slots
- About 80% of the instructions executed in branch delay slots are useful in computation
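Combining the two figures gives the fraction of delay slots doing useful work, under the assumption that the 80% figure applies to the filled slots:

```python
fill_rate = 0.60    # fraction of delay slots the compiler manages to fill
useful_rate = 0.80  # of the filled slots, fraction doing useful computation
print(fill_rate * useful_rate)  # ~0.48 -> roughly half the slots do useful work
```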