Introduction to Pipelined Datapath

14:332:331 Computer Architecture and Assembly Language
Week 12: Introduction to Pipelined Datapath
[Adapted from Dave Patterson's UCB CS152 slides and Mary Jane Irwin's PSU CSE331 slides]
331 W12.1

Review: Multicycle Data and Control Path
[Figure: multicycle datapath with PC, Memory (Instr. or Data), IR, MDR, Register File, Sign Extend, two Shift left 2 units, the A and B registers, the ALU with its zero output, and the Out register. The Control FSM takes Instr[31-26] and drives PCWriteCond, PCWrite, IorD, MemRead, MemWrite, MemtoReg, IRWrite, PCSource, Op, SrcB, SrcA, RegWrite, and RegDst.]

Review: RTL Summary
- Instr fetch (all instructions):
    IR = Memory[PC]; PC = PC + 4;
- Decode (all instructions):
    A = Reg[IR[25-21]]; B = Reg[IR[20-16]];
    Out = PC + (sign-extend(IR[15-0]) << 2);
- Execute:
    R-type:  Out = A op B;
    Mem Ref: Out = A + sign-extend(IR[15-0]);
    Branch:  if (A==B) PC = Out;
    Jump:    PC = PC[31-28] || (IR[25-0] << 2);
- Memory access:
    R-type:  Reg[IR[15-11]] = Out;
    Mem Ref: MDR = Memory[Out]; or Memory[Out] = B;
- Writeback:
    Mem Ref: Reg[IR[20-16]] = MDR;
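The two address computations in the RTL above (the branch target formed in Decode and the jump target) can be sketched in Python. This is only an illustration: the function names are hypothetical, and the `pc_plus_4` argument assumes the PC has already been incremented in the fetch step, as the RTL shows.

```python
def sign_extend16(value):
    """Sign-extend a 16-bit field to a Python int."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc_plus_4, instr):
    """Out = PC + (sign-extend(IR[15-0]) << 2), using the incremented PC."""
    offset = sign_extend16(instr & 0xFFFF)
    return (pc_plus_4 + (offset << 2)) & 0xFFFFFFFF


def jump_target(pc_plus_4, instr):
    """PC = PC[31-28] || (IR[25-0] << 2)."""
    return (pc_plus_4 & 0xF0000000) | ((instr & 0x03FFFFFF) << 2)
```

For example, a branch word whose low 16 bits are 0xFFFF (offset -1) fetched with an incremented PC of 0x00400008 gives a target of 0x00400004.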

Review: Multicycle Datapath FSM
[Figure: state diagram, reconstructed as a state list]
- Unless otherwise assigned: PCWrite, IRWrite, MemWrite, RegWrite=0; others=x
- State 0 (Instr Fetch, from Start): IorD=0; MemRead; IRWrite; SrcA=0; SrcB=01; PCSource,Op=00; PCWrite
- State 1 (Decode): SrcA=0; SrcB=11; Op=00; PCWriteCond=0
- State 2 (Execute, Op = lw or sw): SrcA=1; SrcB=10; Op=00; PCWriteCond=0
- State 3 (Memory Access, Op = lw): MemRead; IorD=1; PCWriteCond=0
- State 4 (Write Back): RegDst=0; RegWrite; MemtoReg=1; PCWriteCond=0
- State 5 (Memory Access, Op = sw): MemWrite; IorD=1; PCWriteCond=0
- State 6 (Execute, Op = R-type): SrcA=1; SrcB=00; Op=10; PCWriteCond=0
- State 7: RegDst=1; RegWrite; MemtoReg=0; PCWriteCond=0
- State 8 (Op = beq): SrcA=1; SrcB=00; Op=01; PCSource=01; PCWriteCond
- State 9 (Op = j): PCSource=10; PCWrite
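A minimal sketch of the next-state logic for this FSM, assuming the usual state numbering of the diagram (0 = fetch, 1 = decode, 2 = memory address computation, 3 = lw read, 4 = write back, 5 = sw write, 6 = R-type execute, 7 = R-type completion, 8 = beq, 9 = j). The constant and function names are hypothetical, and only the transitions are modeled, not the control outputs.

```python
(FETCH, DECODE, MEM_ADDR, LW_READ, LW_WB,
 SW_WRITE, R_EXEC, R_DONE, BEQ, JUMP) = range(10)


def next_state(state, op):
    """Return the state that follows `state` for opcode class `op`."""
    if state == FETCH:
        return DECODE
    if state == DECODE:
        return {"lw": MEM_ADDR, "sw": MEM_ADDR, "R-type": R_EXEC,
                "beq": BEQ, "j": JUMP}[op]
    if state == MEM_ADDR:
        return LW_READ if op == "lw" else SW_WRITE
    if state == LW_READ:
        return LW_WB
    if state == R_EXEC:
        return R_DONE
    return FETCH  # states 4, 5, 7, 8, 9 all return to instruction fetch


def cycles(op):
    """Count the clock cycles one instruction spends in the FSM."""
    state, count = FETCH, 0
    while True:
        state = next_state(state, op)
        count += 1
        if state == FETCH:
            return count
```

Under these assumptions `cycles("lw")` is 5, `cycles("sw")` and `cycles("R-type")` are 4, and `cycles("beq")` and `cycles("j")` are 3.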

Review: FSM Implementation
[Figure: combinational control logic. Inputs: Op5-Op0 from Inst[31-26] and the current state from the State Reg. Outputs: Next State and the control signals PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, Op, SourceB, SourceA, RegWrite, RegDst. The State Reg is clocked by the System Clock.]

Single Cycle Disadvantages & Advantages
- Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction
[Figure: Single Cycle Implementation. lw fills Cycle 1; sw finishes early in Cycle 2 and the rest of that cycle is waste]
- Is wasteful of area, since some functional units (e.g., adders) must be duplicated because they cannot be shared during a clock cycle
but
- Is simple and easy to understand

Multicycle Advantages & Disadvantages
- Uses the clock cycle efficiently: the clock cycle is timed to accommodate the slowest instruction step
  - balance the amount of work to be done in each step
  - restrict each step to use only one major functional unit
- Multicycle implementations allow
  - functional units to be used more than once per instruction as long as they are used on different clock cycles
  - faster clock rates
  - different instructions to take a different number of clock cycles
but
- Requires additional internal state registers, muxes, and more complicated (FSM) control

The Five Stages of Load Instruction
[Figure: lw occupies Cycle 1 through Cycle 5 as IFetch, Dec, Exec, Mem, WB]
- IFetch: Instruction Fetch and Update PC
- Dec: Registers Fetch and Instruction Decode
- Exec: Execute R-type; calculate memory address
- Mem: Read/write the data from/to the Data Memory
- WB: Write the data back to the register file

Single Cycle vs. Multiple Cycle Timing
Single Cycle Implementation:
[Figure: lw fills Cycle 1; sw finishes early in Cycle 2 and the rest of that cycle is waste]
Multiple Cycle Implementation:
[Figure: over Cycles 1 to 10, lw takes IFetch, Dec, Exec, Mem, WB; sw takes IFetch, Dec, Exec, Mem; an R-type instruction starts with IFetch in Cycle 10]
- The multicycle clock period is longer than 1/5th of the single cycle clock period because of stage flip-flop overhead

Pipelined MIPS Processor
- Start the next instruction while still working on the current one
  - improves throughput: total amount of work done in a given time
  - instruction latency (execution time, delay time, response time) is not reduced: time from the start of an instruction to its completion
[Figure: over Cycles 1 to 8, lw, sw, and an R-type instruction each pass through IFetch, Dec, Exec, Mem, WB, each starting one cycle after the previous instruction]
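The overlap shown on this slide can be reproduced with a short Python sketch that builds the stage-by-cycle chart for an ideal pipeline. It assumes no hazards and no stalls; the `pipeline_chart` name is hypothetical.

```python
STAGES = ["IFetch", "Dec", "Exec", "Mem", "WB"]


def pipeline_chart(instructions):
    """Return one row per instruction; row[c] is the stage in cycle c + 1."""
    total_cycles = len(STAGES) + len(instructions) - 1
    rows = []
    for i in range(len(instructions)):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage  # instruction i enters the pipeline in cycle i + 1
        rows.append(row)
    return rows


program = ["lw", "sw", "R-type"]
for name, row in zip(program, pipeline_chart(program)):
    print(f"{name:8}" + "".join(f"{cell:8}" for cell in row))
```

Each instruction still spends five cycles in the pipeline (latency is unchanged), but the three instructions together finish in seven cycles.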

Single Cycle, Multiple Cycle, vs. Pipeline
Single Cycle Implementation:
[Figure: Load fills Cycle 1; Store finishes early in Cycle 2 and the rest of that cycle is waste]
Multiple Cycle Implementation:
[Figure: over Cycles 1 to 10, lw takes IFetch, Dec, Exec, Mem, WB; sw takes IFetch, Dec, Exec, Mem; an R-type instruction starts with IFetch in Cycle 10]
Pipeline Implementation:
[Figure: lw, sw, and an R-type instruction each pass through IFetch, Dec, Exec, Mem, WB, each starting one cycle after the previous instruction; one stage is marked as a wasted cycle]
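A rough way to compare the three implementations is to count clocks for the lw, sw, R-type sequence in the diagrams. The sketch below assumes the multicycle step counts from the review slides (lw 5, sw 4, R-type 4) and an ideal five-stage pipeline with no stalls; all function names are hypothetical.

```python
# Steps per instruction in the multicycle design (from the FSM review).
MULTICYCLE_STEPS = {"lw": 5, "sw": 4, "R-type": 4}


def single_cycle_clocks(program):
    """One (long) clock per instruction."""
    return len(program)


def multicycle_clocks(program):
    """Each instruction takes as many short clocks as it has steps."""
    return sum(MULTICYCLE_STEPS[op] for op in program)


def pipelined_clocks(program, stages=5):
    """Fill the pipeline once, then complete one instruction per clock."""
    return stages + len(program) - 1 if program else 0


program = ["lw", "sw", "R-type"]
print(single_cycle_clocks(program),   # 3 long clocks
      multicycle_clocks(program),     # 13 short clocks
      pipelined_clocks(program))      # 7 short clocks
```

The single-cycle count is not directly comparable to the other two, because its clock period is roughly five times longer than the multicycle or pipelined clock.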

Pipelining the MIPS ISA
- What makes it easy
  - all instructions are the same length (32 bits)
  - few instruction formats (three) with symmetry across formats
  - memory operations can occur only in loads and stores
  - operands must be aligned in memory, so a single data transfer requires only one memory access
- What makes it hard
  - structural hazards: what if we had only one memory?
  - control hazards: what about branches?
  - data hazards: what if an instruction's input operands depend on the output of a previous instruction?

MIPS Pipeline Datapath Modifications
- What do we need to add/modify in our MIPS datapath?
  - State registers between pipeline stages to isolate them
[Figure: five-stage datapath (IFetch, Dec, Exec, Mem, WB) with PC, Add (PC + 4), Instruction Memory, Register File, Sign Extend (16 to 32), Shift left 2, branch-target Add, ALU, and Data Memory, separated by the state registers IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB, all driven by the System Clock]

MIPS Pipeline Control Path Modifications
- All control signals are determined during Decode
  - and held in the state registers between pipeline stages
[Figure: the same five-stage datapath (IFetch, Dec, Exec, Mem, WB) with a Control unit in the Dec stage feeding the Dec/Exec, Exec/Mem, and Mem/WB state registers]

Graphically Representing MIPS Pipeline
- Can help with answering questions like:
  - how many cycles does it take to execute this code?
  - what is the ALU doing during cycle 4?
  - is there a hazard, why does it occur, and how can it be fixed?

Why Pipeline? For Throughput!
[Figure: Inst 0 through Inst 4 in instruction order against time (clock cycles), each starting one cycle after the previous one; the first cycles are marked as the time to fill the pipeline]
- Once the pipeline is full, one instruction is completed every cycle

Can pipelining get us into trouble?
- Yes: Pipeline Hazards
  - structural hazards: attempt to use the same resource by two different instructions at the same time
  - data hazards: attempt to use an item before it is ready
    - instruction depends on the result of a prior instruction still in the pipeline
  - control hazards: attempt to make a decision before the condition is evaluated
    - branch instructions
- Can always resolve hazards by waiting
  - pipeline control must detect the hazard
  - take action (or delay action) to resolve hazards

A Unified Memory Would Be a Structural Hazard
[Figure: lw and Inst 1 through Inst 4 in instruction order against time (clock cycles). In the cycle where lw is reading data from memory, a later instruction is reading its instruction from the same memory.]

How About Register File Access?
[Figure: add, Inst 1, Inst 2, add, Inst 4 in instruction order against time (clock cycles); the first add writes the register file in the same cycle in which the second add reads it]
- Can fix register file access hazard by doing reads in the second half of the cycle and writes in the first half.

Register Usage Can Cause Data Hazards
- Dependencies backward in time cause hazards
[Figure: pipeline diagram for the following instructions, in program order]
    add r1,r2,r3
    sub r4,r1,r5
    and r6,r1,r7
    or  r8,r1,r9
    xor r4,r1,r5
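Which of these dependencies point backward in time can be checked mechanically. The sketch below is an illustration under stated assumptions: instructions are written as hypothetical `(op, dest, src1, src2)` tuples, the pipeline has five stages with no forwarding, and the register file writes in the first half of a cycle and reads in the second half, so only the two instructions immediately after a producer are at risk.

```python
# Each instruction is (op, dest, src1, src2); this tuple layout is an assumption.
PROGRAM = [
    ("add", "r1", "r2", "r3"),
    ("sub", "r4", "r1", "r5"),
    ("and", "r6", "r1", "r7"),
    ("or",  "r8", "r1", "r9"),
    ("xor", "r4", "r1", "r5"),
]


def raw_hazards(program, window=2):
    """Return (producer_index, consumer_index, register) for each hazard."""
    hazards = []
    for i, producer in enumerate(program):
        dest = producer[1]
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if dest in program[j][2:]:
                hazards.append((i, j, dest))
            if program[j][1] == dest:
                break  # a later write to dest supersedes this producer
    return hazards


print(raw_hazards(PROGRAM))
```

For the sequence on the slide this reports the `sub` and the `and` as hazards on `r1`; the `or` reads `r1` in the cycle the `add` writes it back, and the `xor` reads it later still.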

One Way to Fix a Data Hazard
[Figure: pipeline diagram for the following sequence, in program order]
    add r1,r2,r3
    stall
    stall
    sub r4,r1,r5
    and r6,r1,r7
- Can fix data hazard by waiting (stall), but this affects throughput
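The cost of waiting can be estimated by scheduling each instruction's fetch so that its Dec stage never comes before the WB stage of an instruction it depends on. This sketch assumes a five-stage pipeline without forwarding, a register file that can be written and then read within one cycle, and hypothetical `(op, dest, src1, src2)` tuples; stalls are modeled as delayed fetches, as in the diagram.

```python
def fetch_cycles(program):
    """Return the cycle in which each instruction enters IFetch."""
    fetch = []
    for i, instr in enumerate(program):
        start = fetch[-1] + 1 if fetch else 1
        for j in range(i):
            if program[j][1] in instr[2:]:
                write_back = fetch[j] + 4           # WB is the fifth stage
                start = max(start, write_back - 1)  # Dec (start + 1) must not precede WB
        fetch.append(start)
    return fetch


program = [
    ("add", "r1", "r2", "r3"),
    ("sub", "r4", "r1", "r5"),
    ("and", "r6", "r1", "r7"),
]
starts = fetch_cycles(program)
stalls = starts[-1] - len(program)
print(starts, stalls)
```

With these assumptions the `sub` is fetched in cycle 4 instead of cycle 2, which matches the two stall slots drawn on the slide.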

Another Way to Fix a Data Hazard
[Figure: pipeline diagram for the following instructions, in program order, with the add result forwarded to later instructions]
    add r1,r2,r3
    sub r4,r1,r5
    and r6,r1,r7
    or  r8,r1,r9
    xor r4,r1,r5
- Can fix data hazard by forwarding results as soon as they are available to where they are needed
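One way to picture forwarding is as a choice of where the Exec stage takes each operand from. The sketch below is an assumption-laden illustration, not the slide's hardware: it treats every producer as an ALU instruction whose result sits in the Exec/Mem register one cycle later and in the Mem/WB register two cycles later, and it uses hypothetical `(op, dest, src1, src2)` tuples.

```python
def forward_source(program, consumer_index, src_reg):
    """Pick where the consumer's Exec stage should read src_reg from."""
    # Check the most recent producer first so the newest value wins.
    for distance, source in ((1, "Exec/Mem"), (2, "Mem/WB")):
        j = consumer_index - distance
        if j >= 0 and program[j][1] == src_reg:
            return source
    return "register file"


program = [
    ("add", "r1", "r2", "r3"),
    ("sub", "r4", "r1", "r5"),
    ("and", "r6", "r1", "r7"),
    ("or",  "r8", "r1", "r9"),
]
for i in range(1, len(program)):
    print(program[i][0], forward_source(program, i, "r1"))
```

Here the `sub` takes `r1` from Exec/Mem, the `and` takes it from Mem/WB, and the `or` can read the register file normally. Loads do not fit this model, because their data is not available until the end of the Mem stage.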

Loads Can Cause Data Hazards
- Dependencies backward in time cause hazards
[Figure: pipeline diagram for the following instructions, in program order]
    lw  r1,100(r2)
    sub r4,r1,r5
    and r6,r1,r7
    or  r8,r1,r9
    xor r4,r1,r5

Stores Can Cause Data Hazards
- Dependencies backward in time cause hazards
[Figure: pipeline diagram for the following instructions, in program order]
    add r1,r2,r3
    sw  r1,100(r5)
    and r6,r1,r7
    or  r8,r1,r9
    xor r4,r1,r5

Forwarding with Load-use Data Hazards
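In the classic five-stage pipeline, forwarding cannot fully hide a load-use dependence: the loaded value exists only at the end of Mem, one cycle too late for the Exec stage of the very next instruction. A minimal sketch of counting the resulting bubbles, assuming full forwarding, hypothetical tuples of the form `(op, dest, sources...)`, and a load written as `("lw", dest, base)`:

```python
def load_use_stalls(program):
    """Count bubbles needed when a load feeds the very next instruction."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "lw" and prev[1] in cur[2:]:
            stalls += 1  # one bubble; Mem/WB forwarding then supplies the value
    return stalls


program = [
    ("lw",  "r1", "r2"),        # lw r1,100(r2); the offset is omitted here
    ("sub", "r4", "r1", "r5"),
    ("and", "r6", "r1", "r7"),
]
print(load_use_stalls(program))
```

Placing an independent instruction between the load and its first use removes the bubble, which is why compilers try to schedule loads early.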

Branch Instructions Cause Control Hazards
- Dependencies backward in time cause hazards
[Figure: pipeline diagram for add, beq, lw, Inst 3, Inst 4, in program order]

One Way to Fix a Control Hazard
[Figure: pipeline diagram for add, beq, stall, stall, lw, Inst 3, in program order]
- Can fix branch hazard by waiting (stall), but this affects throughput
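The throughput cost of stalling on every branch can be estimated with a simple CPI model. The sketch assumes an otherwise ideal pipeline with a base CPI of 1 and two stall cycles per branch, as drawn on the slide; the 20% branch frequency in the usage line is only an illustrative value, and the function name is hypothetical.

```python
def cpi_with_branch_stalls(branch_fraction, stall_cycles=2, base_cpi=1.0):
    """Average CPI when every branch adds a fixed number of stall cycles."""
    return base_cpi + branch_fraction * stall_cycles


# Illustrative value only: one instruction in five is a branch.
print(cpi_with_branch_stalls(0.20))
```

Under these assumptions the average CPI rises from 1.0 to about 1.4, so the pipeline completes roughly 29% fewer instructions per cycle than the ideal case.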

Other Pipeline Structures Are Possible
- What about a (slow) multiply operation?
  - let it take two cycles
  [Figure: pipeline with a MUL unit]
- What if the data memory access is twice as slow as the instruction memory?
  - make the clock twice as slow, or
  - let data memory access take two cycles (and keep the same clock rate)
  [Figure: pipeline stages labeled IM, Reg, DM1, DM2, Reg]

Sample Pipeline Alternatives
[Figure: pipeline stage diagrams for ARM7, StrongARM-1, and XScale]
- ARM7: stages IM, Reg, EX
  - PC update, IM access
  - decode, reg access
  - op, DM access, shift/rotate, commit result (write back)
- StrongARM-1
- XScale: stage labels IM1, IM2, Reg, SHFT, DM1, DM2, Reg
  - activities shown: PC update, BTB access, start IM access; IM access; decode, reg 1 access; op; shift/rotate, reg 2 access; start DM access, exception; DM write, reg write

Summary
- All modern day processors use pipelining
- Pipelining doesn't help latency of a single task; it helps throughput of the entire workload
  - Multiple tasks operating simultaneously using different resources
- Potential speedup = Number of pipe stages
- Pipeline rate is limited by the slowest pipeline stage
  - Unbalanced lengths of pipe stages reduce speedup
  - Time to fill the pipeline and time to drain it reduce speedup
- Must detect and resolve hazards
  - Stalling negatively affects throughput
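Why the potential speedup equals the number of pipe stages, and why fill and drain time reduce it, can be seen from a small calculation. The sketch assumes perfectly balanced stages, no hazards, and an unpipelined machine that spends `stages` cycles of the same length on each instruction; the function name is hypothetical.

```python
def pipeline_speedup(num_instructions, stages=5):
    """Unpipelined cycles divided by pipelined cycles at the same clock."""
    unpipelined = num_instructions * stages
    pipelined = stages + num_instructions - 1  # fill once, then one per cycle
    return unpipelined / pipelined


for n in (1, 5, 100, 1000):
    print(n, round(pipeline_speedup(n), 2))
```

A single instruction sees no speedup at all, while long instruction streams approach (but never reach) a speedup of 5 for a five-stage pipeline.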

Performance
- Purchasing perspective: given a collection of machines, which has the
  - best performance?
  - least cost?
  - best performance / cost?
- Design perspective: faced with design options, which has the
  - best performance improvement?
  - least cost?
  - best performance / cost?
- Both require
  - basis for comparison
  - metric for evaluation
- Our goal is to understand cost & performance implications of architectural choices

Two notions of performance

Plane              DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747         6.5 hours     610 mph    470          286,700
BAC/Sud Concorde   3 hours       1350 mph   132          178,200

- Which has higher performance?
- Time to do the task (Execution Time)
  - execution time, response time, latency
- Tasks per day, hour, week, sec, ns... (Performance)
  - throughput, bandwidth
- Response time and throughput often are in opposition

Definitions
- Performance is in units of things-per-second
  - bigger is better
- If we are primarily concerned with response time

  $$\text{performance}(x) = \frac{1}{\text{execution\_time}(x)}$$

- "X is n times faster than Y" means

  $$n = \frac{\text{Performance}(X)}{\text{Performance}(Y)}$$

Example
- Time of Concorde vs. Boeing 747?
  - Concorde is 1350 mph / 610 mph = 2.2 times faster (= 6.5 hours / 3 hours)
- Throughput of Concorde vs. Boeing 747?
  - Concorde is 178,200 pmph / 286,700 pmph = 0.62 "times faster"
  - Boeing is 286,700 pmph / 178,200 pmph = 1.6 times faster
- Boeing is 1.6 times (60%) faster in terms of throughput
- Concorde is 2.2 times (120%) faster in terms of flying time
- We will focus primarily on execution time for a single job
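The numbers in this example follow directly from the table of flight times, speeds, and passenger counts. A short Python check, using hypothetical dictionary keys and function names and the definition of "n times faster" as a ratio of performances:

```python
PLANES = {
    "Boeing 747": {"hours": 6.5, "mph": 610, "passengers": 470},
    "Concorde":   {"hours": 3.0, "mph": 1350, "passengers": 132},
}


def throughput_pmph(plane):
    """Passenger-miles per hour."""
    return plane["passengers"] * plane["mph"]


def times_faster(perf_x, perf_y):
    """n such that X is n times faster than Y, given two performances."""
    return perf_x / perf_y


boeing, concorde = PLANES["Boeing 747"], PLANES["Concorde"]
# Performance for response time is 1 / execution time.
latency_ratio = times_faster(1 / concorde["hours"], 1 / boeing["hours"])
throughput_ratio = times_faster(throughput_pmph(boeing), throughput_pmph(concorde))
print(round(latency_ratio, 1), round(throughput_ratio, 1))
```

The Concorde comes out about 2.2 times faster on flying time, and the Boeing about 1.6 times faster on throughput, matching the slide.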

Basis of Evaluation
- Actual Target Workload
  - Pros: representative
  - Cons: very specific; non-portable; difficult to run, or measure; hard to identify cause
- Full Application Benchmarks
  - Pros: portable; widely used; improvements useful in reality
  - Cons: less representative
- Small Kernel Benchmarks
  - Pros: easy to run, early in design cycle
  - Cons: easy to fool
- Microbenchmarks
  - Pros: identify peak capability and potential bottlenecks
  - Cons: peak may be a long way from application performance

SPEC95
- Eighteen application benchmarks (with inputs) reflecting a technical computing workload
- Eight integer
  - go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
- Ten floating-point intensive
  - tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
- Must run with standard compiler flags
  - eliminate special undocumented incantations that may not even generate working code for real programs

Metrics of performance
[Figure: levels of the system, from Application down to Transistors, Wires, and Pins, paired with the metrics used at each level]
- Application: answers per month; useful operations per second
- Programming Language, Compiler, ISA: (millions) of instructions per second (MIPS); (millions) of (F.P.) operations per second (MFLOP/s)
- Datapath, Control: megabytes per second
- Function Units, Transistors, Wires, Pins: cycles per second (clock rate)
- Each metric has a place and a purpose, and each can be misused

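Aspects of CPU Performance

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

[Table: rows Program, Compiler, Instr. Set Arch., Organization, Technology; columns instr. count, CPI, clock rate; the table marks which of the three factors each level affects]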
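The three-factor product can be written as a one-line function. The numbers in the usage line (one billion instructions, a CPI of 2.2, a 500 MHz clock) are illustrative assumptions, not values from the slides, and the function name is hypothetical.

```python
def cpu_time_seconds(instruction_count, cpi, clock_rate_hz):
    """CPU time = instructions x cycles per instruction x seconds per cycle."""
    seconds_per_cycle = 1.0 / clock_rate_hz
    return instruction_count * cpi * seconds_per_cycle


# Illustrative values only.
print(cpu_time_seconds(1_000_000_000, 2.2, 500e6))
```

Halving any one factor (fewer instructions, lower CPI, or a shorter cycle) halves the CPU time, provided the other two factors stay fixed.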

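CPI
- Average cycles per instruction

$$\text{CPI} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Clock Cycles}}{\text{Instruction Count}}$$

$$\text{CPU time} = \text{ClockCycleTime} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i$$

- With F_i the "instruction frequency" of instruction class i:

$$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i \quad \text{where} \quad F_i = \frac{I_i}{\text{Instruction Count}}$$

- Invest Resources where time is Spent!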

Example (RISC processor)

Base Machine (Reg / Reg)
Op            Freq   Cycles   CPI(i)   % Time
ALU           50%    1        0.5      23%
Load          20%    5        1.0      45%
Store         10%    3        0.3      14%
Branch        20%    2        0.4      18%
Typical Mix                   2.2

- How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
- How does this compare with using branch prediction to shave a cycle off the branch time?
- What if two instructions could be executed at once?
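The first two questions can be worked through with the weighted-CPI formula. The sketch below assumes that the unlabeled first row of the table is the ALU class (50% frequency, 1 cycle), and that instruction count and clock rate stay the same when the cycle counts change; the names `MIX`, `average_cpi`, and `with_cycles` are hypothetical.

```python
# Instruction class -> (frequency, cycles). The "ALU" label is an assumption.
MIX = {"ALU": (0.50, 1), "Load": (0.20, 5), "Store": (0.10, 3), "Branch": (0.20, 2)}


def average_cpi(mix):
    """CPI = sum over classes of frequency x cycles."""
    return sum(freq * cycles for freq, cycles in mix.values())


def with_cycles(mix, op, cycles):
    """Copy of the mix with one class's cycle count replaced."""
    changed = dict(mix)
    changed[op] = (mix[op][0], cycles)
    return changed


base = average_cpi(MIX)
better_cache = average_cpi(with_cycles(MIX, "Load", 2))
branch_prediction = average_cpi(with_cycles(MIX, "Branch", 1))
print(round(base, 2), round(base / better_cache, 3), round(base / branch_prediction, 3))
```

Under these assumptions the base CPI is 2.2, the better data cache lowers it to 1.6 (about 1.375 times faster), and branch prediction lowers it to 2.0 (1.1 times faster), so the cache improvement helps more. The dual-issue question depends on which instruction pairs can execute together and is not modeled here.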

Amdahl's Law
- Speedup due to enhancement E:

$$\text{Speedup}(E) = \frac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E}$$

- Suppose that enhancement E accelerates a fraction F of the task by a factor S and the remainder of the task is unaffected; then

$$\text{ExTime}(\text{with } E) = \left((1 - F) + \frac{F}{S}\right) \times \text{ExTime}(\text{without } E)$$

$$\text{Speedup}(\text{with } E) = \frac{1}{(1 - F) + \dfrac{F}{S}}$$
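Amdahl's Law is easy to experiment with as a function of F and S. The sketch below uses a hypothetical function name; the second usage line revisits the load row of the RISC example (loads take 1.0 of the 2.2 cycles per instruction, and going from 5 cycles to 2 is a factor of 2.5).

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the time is sped up by `factor`."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


print(round(amdahl_speedup(0.5, 2), 3))          # half the task, twice as fast
print(round(amdahl_speedup(1.0 / 2.2, 2.5), 3))  # load time share, 5 -> 2 cycles
print(round(amdahl_speedup(0.5, 1e9), 3))        # limit is 1 / (1 - F)
```

The last line shows the key consequence: even an almost infinite speedup of half the task cannot make the whole task more than twice as fast.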

Summary: Evaluating Instruction Sets?
- Design-time metrics:
  - Can it be implemented, in how long, at what cost?
  - Can it be programmed? Ease of compilation?
- Static Metrics:
  - How many bytes does the program occupy in memory?
- Dynamic Metrics:
  - How many instructions are executed?
  - How many bytes does the processor fetch to execute the program?
  - How many clocks are required per instruction?
  - How "lean" a clock is practical?
- Best Metric: Time to execute the program!
[Figure: triangle relating Inst. Count, CPI, and Cycle Time]
- NOTE: this depends on instruction set, processor organization, and compilation techniques.