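COSC 6385 Computer Architecture - Pipelining (II)
Edgar Gabriel
Spring 2018

Performance evaluation of pipelines (I)

General speedup formula:

\[ \mathrm{Speedup} = \frac{\mathrm{Time}_{old}}{\mathrm{Time}_{new}} = \frac{IC_{old} \times \mathrm{ClockCycle}_{old} \times CPI_{old}}{IC_{new} \times \mathrm{ClockCycle}_{new} \times CPI_{new}} \]

For a fixed application, let us assume that \( IC_{old} = IC_{new} \):

\[ \mathrm{Speedup} = \frac{\mathrm{ClockCycle}_{old} \times CPI_{old}}{\mathrm{ClockCycle}_{new} \times CPI_{new}} \]

If we additionally assume that the CPU has the same frequency, i.e. \( \mathrm{ClockCycle}_{old} = \mathrm{ClockCycle}_{new} \):

\[ \mathrm{Speedup} = \frac{CPI_{old}}{CPI_{new}} \]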
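Performance evaluation of pipelines (II)

When looking at individual classes of instructions:

\[ \mathrm{Speedup}_{overall} = \frac{\mathrm{Time}_{old}}{\mathrm{Time}_{new}} = \frac{\mathrm{ClockCycle}_{old} \times \sum_{i=1}^{n} IC_i^{old} \times CPI_i^{old}}{\mathrm{ClockCycle}_{new} \times \sum_{i=1}^{n} IC_i^{new} \times CPI_i^{new}} \]

Assuming \( IC_{total} \) is identical in both architectures, and with \( f_i = IC_i / IC_{total} \):

\[ \mathrm{Speedup}_{overall} = \frac{\mathrm{Time}_{old}}{\mathrm{Time}_{new}} = \frac{\mathrm{ClockCycle}_{old} \times \sum_{i=1}^{n} f_i^{old} \times CPI_i^{old}}{\mathrm{ClockCycle}_{new} \times \sum_{i=1}^{n} f_i^{new} \times CPI_i^{new}} \]

Comparing pipelined and non-pipelined execution

An ideal pipeline produces one result per clock cycle, i.e. Ideal CPI = 1.

\[ \mathrm{Time}_{pipelined} = \frac{\mathrm{Time}_{non\_pipelined}}{\mathrm{no\_pipeline\_stages}} \qquad \mathrm{Speedup} = \frac{\mathrm{Time}_{non\_pipelined}}{\mathrm{Time}_{pipelined}} = \mathrm{no\_pipeline\_stages} \]

Using the average instruction execution time (AvIETime):

\[ \mathrm{Speedup} = \frac{\mathrm{AvIETime}_{non\_pipelined}}{\mathrm{AvIETime}_{pipelined}} = \frac{CPI_{non\_pipelined} \times \mathrm{ClockCycle}_{non\_pipelined}}{CPI_{pipelined} \times \mathrm{ClockCycle}_{pipelined}} \]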
Comparing pipelined and non-pipelined execution (II)

Realistic CPI = Ideal CPI + pipeline stall cycles per instruction

Thus, with Ideal CPI = 1:

\[ \mathrm{Speedup} = \frac{\mathrm{AvIETime}_{non\_pipelined}}{\mathrm{AvIETime}_{pipelined}} = \frac{CPI_{non\_pipelined} \times \mathrm{ClockCycle}_{non\_pipelined}}{(1 + \mathrm{PipelineStallCyclesPerInstr}) \times \mathrm{ClockCycle}_{pipelined}} \]

If ClockCycle is constant:

\[ \mathrm{Speedup} = \frac{CPI_{non\_pipelined}}{1 + \mathrm{PipelineStallCyclesPerInstr}} \]

Example I

(A) Given a non-pipelined processor:
- 1 ns clock cycle time
- 4 cycles for ALU operations
- 4 cycles for branches
- 5 cycles for memory operations

(B) Given also a pipelined processor:
- 1.2 ns clock cycle time

Both (A) and (B) execute
- 40% ALU operations
- 40% branches
- 20% memory operations

What is the speedup of (B) over (A) due to pipelining?
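The formulas on these slides are easy to evaluate numerically. The following is a minimal sketch, not part of the original lecture: the function names `avg_instr_time` and `speedup_with_stalls` are made up for illustration, and the instruction mix is taken exactly as stated in Example I (40% ALU, 40% branches, 20% memory operations). Machine (B) is assumed to reach the ideal CPI of 1.

```python
def avg_instr_time(clock_ns, mix):
    """Average instruction execution time: ClockCycle * sum(f_i * CPI_i).

    mix is a list of (fraction, cpi) pairs; the fractions should sum to 1.
    """
    return clock_ns * sum(fraction * cpi for fraction, cpi in mix)


def speedup_with_stalls(cpi_unpipelined, stalls_per_instr,
                        clock_unpipelined=1.0, clock_pipelined=1.0):
    """Speedup = (CPI_non * Clock_non) / ((1 + stalls) * Clock_pipelined)."""
    return (cpi_unpipelined * clock_unpipelined) / (
        (1.0 + stalls_per_instr) * clock_pipelined)


# Example I, instruction classes in the order ALU, branch, memory.
MIX_A = [(0.4, 4), (0.4, 4), (0.2, 5)]   # non-pipelined CPIs
MIX_B = [(0.4, 1), (0.4, 1), (0.2, 1)]   # assumption: ideal CPI = 1

time_a = avg_instr_time(1.0, MIX_A)
time_b = avg_instr_time(1.2, MIX_B)
print(f"AvIETime(A) = {time_a:.2f} ns")
print(f"AvIETime(B) = {time_b:.2f} ns")
print(f"Speedup     = {time_a / time_b:.2f}")
```

`speedup_with_stalls` covers the constant-clock case as well: with both clock arguments left at their defaults it reduces to CPI_non_pipelined / (1 + stall cycles per instruction).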
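Example I (continued)

For machine (A):

\[ \mathrm{AvIETime}(A) = \mathrm{ClockCycle}_A \times \sum_{i=1}^{n} f_i \times CPI_i = 1\,\mathrm{ns} \times (0.4 \times 4 + 0.4 \times 4 + 0.2 \times 5) = 4.2\,\mathrm{ns} \]

For machine (B), assuming ideal CPI (= 1):

\[ \mathrm{AvIETime}(B) = \mathrm{ClockCycle}_B \times \sum_{i=1}^{n} f_i \times CPI_i = 1.2\,\mathrm{ns} \times (0.4 \times 1 + 0.4 \times 1 + 0.2 \times 1) = 1.2\,\mathrm{ns} \]

Thus

\[ \mathrm{Speedup} = \frac{\mathrm{AvIETime}(A)}{\mathrm{AvIETime}(B)} = \frac{4.2\,\mathrm{ns}}{1.2\,\mathrm{ns}} = 3.5 \]

Exceptions

The instruction execution order is interrupted, e.g. by:
- I/O device request
- Invoking an OS service from an application
- Tracing execution
- Breakpoint
- Integer or FP arithmetic anomaly (e.g. overflow)
- Page fault
- Misaligned memory access
- Memory protection violation
- Hardware malfunction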
Classification of Exceptions

Problem with pipelining: different stages of the pipeline can raise exceptions, leading to a different order of exceptions compared to the unpipelined case.

Classes of exceptions:
1. Synchronous vs. asynchronous
2. User requested vs. coerced
3. User maskable vs. user non-maskable
4. Within vs. between instructions
5. Resume vs. terminate

Exceptions

- Most problematic: exceptions raised within instructions, where the instruction must be resumed
- Another program must be invoked to save the state of the program
- Pipelines capable of handling such exceptions are called restartable

Pipeline stage   Possible exceptions
IF               Page fault on instruction fetch; misaligned memory access; memory protection violation
ID               Undefined or illegal opcode
EX               Arithmetic exception
MEM              Page fault on data fetch; misaligned memory access; memory protection violation
WB               None
Exceptions

Since an exception cannot be handled at the moment it occurs:
- A status vector associated with the instruction records the exception
- The status vector is carried along with the instruction through the pipeline
- Writing of data values is disabled once the status vector is set
- In WB the status vector is checked and the exception is handled

=> The exception of instruction i is handled before the exception of instruction i+1
=> Since no data values are written back, the register file is not changed -> the instruction can be repeated

Multi-cycle instructions

- Not all instructions take the same number of cycles to finish! Floating point instructions can take many cycles to complete
- Latency: number of intervening cycles between an instruction that produces a result and an instruction that uses the result. Usually: depth of the EX stage - 1
- Initiation interval: number of cycles that must elapse between issuing two operations of a given type
- Multi-cycle instructions/pipelines increase the probability of WAW and RAW hazards
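To illustrate why the status vector restores program order for exceptions, here is a minimal sketch (not from the original slides). It assumes a stall-free five-stage pipeline in which instruction i enters IF in cycle i; the function name `exception_orders` and the fault scenario are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def exception_orders(faults):
    """faults maps an instruction index to the stage that raises its exception.

    Assumption: instruction i enters IF in cycle i and never stalls, so it
    reaches stage s in cycle i + STAGES.index(s).
    Returns (order in which exceptions occur, order in which they are handled).
    """
    raised_at = {i: i + STAGES.index(stage) for i, stage in faults.items()}
    # With the status vector, every exception is handled in the WB stage.
    handled_at = {i: i + STAGES.index("WB") for i in faults}
    occurrence = sorted(faults, key=lambda i: (raised_at[i], i))
    handling = sorted(faults, key=lambda i: (handled_at[i], i))
    return occurrence, handling


# Instruction 0 faults in MEM (data page fault),
# instruction 1 faults in IF (instruction page fault).
occurrence, handling = exception_orders({0: "MEM", 1: "IF"})
print("occur in order:  ", occurrence)
print("handled in order:", handling)
```

In this scenario the later instruction raises its exception first in time, yet deferring the handling to WB processes the exceptions in program order.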
Example for a multi-cycle pipeline

IF -> ID -> one of four execution paths -> MEM -> WB
- EX
- FP/integer multiply unit: M1 M2 M3 M4 M5 M6 M7
- FP add unit: A1 A2 A3 A4
- FP/integer division unit (non-pipelined): DIV

Functional unit   Latency   Initiation interval
ALU               0         1
Data memory       1         1
FP add            3         1
FP multiply       6         1
FP divide         24        25

Instruction level parallelism

- Exploit parallelism between independent instructions
- Limited by data dependencies
- Limited by branches

Example:

    for (i=0; i<n; i++) {
        c[i] = a[i] + b[i];
    }

- Each iteration of the loop is independent
- Exploiting that fact is not trivial because of register reuse!
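Returning to the latency / initiation interval table of the multi-cycle pipeline, the two columns answer two different questions. The sketch below is an illustration only: the helper names are made up, and it assumes that a dependent instruction issued directly behind its producer must wait exactly `latency` cycles, which follows the definition of latency as the number of intervening cycles.

```python
# (latency, initiation interval) per functional unit, copied from the table.
UNITS = {
    "ALU": (0, 1),
    "Data memory": (1, 1),
    "FP add": (3, 1),
    "FP multiply": (6, 1),
    "FP divide": (24, 25),
}


def issue_cycles(unit, count, first_cycle=0):
    """Earliest issue cycles of `count` independent operations on one unit."""
    _, interval = UNITS[unit]
    return [first_cycle + k * interval for k in range(count)]


def dependent_stall_cycles(unit):
    """Stall cycles of a consumer issued directly behind its producer."""
    latency, _ = UNITS[unit]
    return latency


print(issue_cycles("FP multiply", 3))    # pipelined unit
print(issue_cycles("FP divide", 3))      # non-pipelined unit
print(dependent_stall_cycles("FP add"))
```

The divide unit shows the effect of not being pipelined: independent divides can only start 25 cycles apart, while independent multiplies can start in consecutive cycles.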
Instruction level parallelism

Data dependencies:
- True dependencies: instruction i produces a result required by instruction i+k, k>0 (RAW), sharing a register or a memory location
- Name dependencies: usage of the same register or memory location without data flow
  - Antidependence: instruction i+k writes a register/memory location read by instruction i (WAR). No problem as long as instructions are not reordered
  - Output dependence: instruction i and instruction i+k write the same register/memory location (WAW). No problem as long as instructions are not reordered
- Control dependencies: determine the ordering of an instruction i with respect to a branch

Dynamic scheduling

Up to now:
- Instructions are issued in program order
- If an instruction is stalled in the pipeline, no later instruction can proceed

    DIV.D F0, F2, F4
    ADD.D F10, F0, F8
    SUB.D F12, F8, F14

Here SUB.D does not depend on DIV.D or ADD.D, but it cannot proceed while ADD.D waits for F0.

In order to allow out-of-order execution, the ID stage is split into two parts:
- Instruction issue: decode the instruction and check for structural hazards
- Read operands: read the operands once there is no data hazard
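The dependence classes defined above can be detected mechanically from the destination and source registers of each instruction. The following is a minimal sketch with a made-up helper `dependences`; it compares every pair of instructions and deliberately ignores whether a register is redefined in between, so it reports potential dependences rather than a precise data-flow analysis.

```python
def dependences(program):
    """program is a list of (dest, sources) tuples in program order.

    Returns tuples (i, k, kind, register) with i < k.
    """
    found = []
    for i, (dest_i, srcs_i) in enumerate(program):
        for k in range(i + 1, len(program)):
            dest_k, srcs_k = program[k]
            if dest_i in srcs_k:      # k reads what i writes
                found.append((i, k, "RAW", dest_i))
            if dest_k in srcs_i:      # k overwrites what i reads
                found.append((i, k, "WAR", dest_k))
            if dest_i == dest_k:      # both write the same register
                found.append((i, k, "WAW", dest_i))
    return found


# DIV.D F0,F2,F4 / ADD.D F10,F0,F8 / SUB.D F12,F8,F14
in_order_example = [("F0", ("F2", "F4")),
                    ("F10", ("F0", "F8")),
                    ("F12", ("F8", "F14"))]
print(dependences(in_order_example))

# Illustrative variant in which SUB.D writes F8 instead of F12.
variant = [("F0", ("F2", "F4")),
           ("F10", ("F0", "F8")),
           ("F8", ("F8", "F14"))]
print(dependences(variant))
```

In the first sequence only the true dependence of ADD.D on DIV.D is found, so SUB.D is free to run ahead. The variant adds an antidependence between ADD.D and SUB.D on F8, which becomes a WAR hazard once the two instructions may be reordered.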
Dynamic scheduling

Out-of-order execution introduces the possibility of WAR and WAW hazards:

    Program order            Reordered execution
    DIV.D F0, F2, F4         DIV.D F0, F2, F4
    ADD.D F10, F0, F8        SUB.D F8, F8, F14
    SUB.D F8, F8, F14        ADD.D F10, F0, F8

In the reordered sequence SUB.D overwrites F8 before ADD.D has read it (WAR hazard).

Out-of-order execution only improves performance if
- Multiple instructions can be executed at once
- Multiple functional units are available

All instructions pass through the issue stage in order. Instructions can be bypassed in the read-operand stage.

Algorithms allowing instructions to execute out of order:
- Scoreboarding
- Tomasulo's approach

Scoreboarding

- First implemented in the CDC 6600
- Assumption for the following slides:
  - 2 multipliers
  - 1 adder
  - 1 divider
  - 1 integer unit
- Each instruction goes through the scoreboard
- The scoreboard determines when an instruction can execute
- The scoreboard monitors the usage of the execution units
- The scoreboard monitors when a result can be written to the destination register
Scoreboarding (II)

4 steps of scoreboarding (replacing ID, EX and WB):
1. Issue: the instruction is issued if the functional unit is free and no other active instruction has the same destination register.
2. Read operands: the scoreboard monitors the availability of the operands.
3. Execution
4. Write result: once execution is done, the scoreboard checks for WAR hazards and stalls the instruction if necessary.

Scoreboarding (III)

Scoreboard data structures:
- Instruction status: which of the four steps the instruction is in
- Functional unit status: status of a functional unit
  - Busy: indicates whether the unit is busy or not
  - Op: operation to be performed
  - Fi: destination register number
  - Fj, Fk: source register numbers
  - Qj, Qk: functional units producing the source registers Fj, Fk
  - Rj, Rk: flags indicating whether Fj, Fk are ready. Set to No after the operands are read.
- Register result status: which functional unit will write which register
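The functional unit status and register result status can be modelled directly as small records. The following is a minimal sketch of the issue step only, under stated assumptions: the class `UnitStatus` and the helpers `can_issue` and `issue` are hypothetical names, the unit names follow the example configuration used on the slides, and a source operand counts as ready when no functional unit is registered as its pending producer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UnitStatus:
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None   # units producing Fj / Fk
    qk: Optional[str] = None
    rj: bool = False           # operand-ready flags
    rk: bool = False


def can_issue(units, reg_result, unit, dest):
    """Issue rule: unit is free and no active instruction writes `dest`."""
    return not units[unit].busy and dest not in reg_result


def issue(units, reg_result, unit, op, dest, fj, fk):
    """Fill the functional unit entry and claim the destination register."""
    st = units[unit]
    st.busy, st.op, st.fi, st.fj, st.fk = True, op, dest, fj, fk
    st.qj, st.qk = reg_result.get(fj), reg_result.get(fk)
    st.rj, st.rk = st.qj is None, st.qk is None
    reg_result[dest] = unit


units = {name: UnitStatus() for name in ("Integer", "Mult1", "Mult2", "Add", "Divide")}
reg_result = {}   # register -> functional unit that will write it

issue(units, reg_result, "Integer", "Load", "F2", None, "R3")
second_load_ok = can_issue(units, reg_result, "Integer", "F6")   # unit is busy
issue(units, reg_result, "Mult1", "Mult", "F0", "F2", "F4")
print(second_load_ok)
print(units["Mult1"])
```

After these two issues the multiply entry records the integer unit as the producer of F2 (Qj) with Rj unset, while F4 is immediately ready. Reading operands, executing and writing results would update the same records but are left out of this sketch.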
Scoreboarding example

    L.D   F6, 34(R2)
    L.D   F2, 45(R3)
    MUL.D F0, F2, F4
    SUB.D F8, F6, F2
    DIV.D F10, F0, F6
    ADD.D F6, F8, F2

Following slides are based on a lecture by Jelena Mirkovic, University of Delaware
http://www.cis.udel.edu/~sunshine/courses/f04/cis662/class10.pdf

Assumption:
- ADD and SUB take 2 clock cycles
- MULT takes 10 clock cycles
- DIV takes 40 clock cycles

In the functional unit status tables, idle units are omitted and - marks an empty field.

Time=1: Issue first load

Unit     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
Integer  Yes   Load  F6   -    R2   -        -        -    Yes

Register result status: F6: Integer
Time=2: First load reads operands; second load cannot issue (structural hazard)

Unit     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
Integer  Yes   Load  F6   -    R2   -        -        -    No

Register result status: F6: Integer

Time=3: First load completes execution; second load cannot issue (structural hazard)

Unit     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
Integer  Yes   Load  F6   -    R2   -        -        -    No

Register result status: F6: Integer

(Idle units are omitted; - marks an empty field.)
Time=4: First load writes result; second load cannot issue (structural hazard)

Functional unit status: all units idle
Register result status: empty

Time=5: Second load is issued

Unit     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
Integer  Yes   Load  F2   -    R3   -        -        -    Yes

Register result status: F2: Integer

(Idle units are omitted; - marks an empty field.)
Time=6: Second load reads operands; Mult is issued

Unit     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
Integer  Yes   Load  F2   -    R3   -        -        -    No
Mult1    Yes   Mult  F0   F2   F4   Integer  -        No   Yes

Register result status: F0: Mult1, F2: Integer

Time=7: Second load completes execution; Mult is stalled waiting for F2; Sub is issued

Unit     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
Integer  Yes   Load  F2   -    R3   -        -        -    No
Mult1    Yes   Mult  F0   F2   F4   Integer  -        No   Yes
Add      Yes   Sub   F8   F6   F2   -        Integer  Yes  No

Register result status: F0: Mult1, F2: Integer, F8: Add

(Idle units are omitted; - marks an empty field.)
Time=8: Second load writes result; Mult and Sub stalled (F2); Div is issued

Unit     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
Add      Yes   Sub   F8   F6   F2   -        -        Yes  Yes
Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status: F0: Mult1, F8: Add, F10: Divide

Time=9: Mult and Sub read operands; Div stalled waiting for F0; Add not issued (structural hazard)

Unit     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
Mult1    Yes   Mult  F0   F2   F4   -        -        No   No
Add      Yes   Sub   F8   F6   F2   -        -        No   No
Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status: F0: Mult1, F8: Add, F10: Divide

(Idle units are omitted; - marks an empty field.)
Time=10: Mult executing (1 out of 10 cycles); Sub executing (1 out of 2 cycles); Div stalled (F0)

Unit     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
Mult1    Yes   Mult  F0   F2   F4   -        -        No   No
Add      Yes   Sub   F8   F6   F2   -        -        No   No
Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status: F0: Mult1, F8: Add, F10: Divide

Time=11: Mult executing (2/10); Sub completes execution; Div stalled (F0)

Unit     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
Mult1    Yes   Mult  F0   F2   F4   -        -        No   No
Add      Yes   Sub   F8   F6   F2   -        -        No   No
Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status: F0: Mult1, F8: Add, F10: Divide

(Idle units are omitted; - marks an empty field.)
Time=12: Mult executing (3/10); Sub writes result; Div stalled (F0)

Unit     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
Mult1    Yes   Mult  F0   F2   F4   -        -        No   No
Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status: F0: Mult1, F10: Divide

Time=13: Mult executing (4/10); Div stalled (F0); Add issued

Unit     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
Mult1    Yes   Mult  F0   F2   F4   -        -        No   No
Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status: F0: Mult1, F6: Add, F10: Divide

(Idle units are omitted; - marks an empty field.)
Time=14: Mult executing (5/10); Div stalled (F0); Add reads operands

Unit     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
Mult1    Yes   Mult  F0   F2   F4   -        -        No   No
Add      Yes   Add   F6   F8   F2   -        -        No   No
Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status: F0: Mult1, F6: Add, F10: Divide

Time=15: Mult executing (6/10); Div stalled (F0); Add executes (1 of 2 cycles)

Unit     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
Mult1    Yes   Mult  F0   F2   F4   -        -        No   No
Add      Yes   Add   F6   F8   F2   -        -        No   No
Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status: F0: Mult1, F6: Add, F10: Divide

(Idle units are omitted; - marks an empty field.)
Time=16: Mult executing (7/10); Div stalled (F0); Add completes execution

Unit     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
Mult1    Yes   Mult  F0   F2   F4   -        -        No   No
Add      Yes   Add   F6   F8   F2   -        -        No   No
Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status: F0: Mult1, F6: Add, F10: Divide

Time=17: Mult executing (8/10); Div stalled (F0); Add stalled (WAR hazard on F6)

Unit     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
Mult1    Yes   Mult  F0   F2   F4   -        -        No   No
Add      Yes   Add   F6   F8   F2   -        -        No   No
Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status: F0: Mult1, F6: Add, F10: Divide

(Idle units are omitted; - marks an empty field.)
Time=19: Mult completes execution; Div stalled (F0); Add stalled (WAR hazard on F6)

Unit     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
Mult1    Yes   Mult  F0   F2   F4   -        -        No   No
Add      Yes   Add   F6   F8   F2   -        -        No   No
Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status: F0: Mult1, F6: Add, F10: Divide

Time=20: Mult writes result; Div stalled (F0); Add stalled (WAR hazard on F6)

Unit     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
Add      Yes   Add   F6   F8   F2   -        -        No   No
Divide   Yes   Div   F10  F0   F6   -        -        Yes  Yes

Register result status: F6: Add, F10: Divide

(Idle units are omitted; - marks an empty field.)
Time=21: Div reads operands; Add stalled (WAR hazard on F6)

Unit     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
Add      Yes   Add   F6   F8   F2   -        -        No   No
Divide   Yes   Div   F10  F0   F6   -        -        No   No

Register result status: F6: Add, F10: Divide

Time=22: Div executes (1/40); Add writes result

Unit     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
Divide   Yes   Div   F10  F0   F6   -        -        No   No

Register result status: F10: Divide

(Idle units are omitted; - marks an empty field.)
Time=61: Div completes execution

Unit     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
Divide   Yes   Div   F10  F0   F6   -        -        No   No

Register result status: F10: Divide

Time=62: Div writes result

Functional unit status: all units idle
Register result status: empty
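The cycle-by-cycle walkthrough above can be cross-checked with a small simulation. This is a simplified sketch rather than a full scoreboard: it only tracks the cycle in which each instruction passes the four steps. Its assumptions are that loads execute in 1 cycle, that an event in cycle t becomes visible to other instructions in cycle t+1, and that at most one instruction issues per cycle. The two instructions DIV.D F10, F0, F6 and ADD.D F6, F8, F2 are inferred from the functional unit status entries of the example; all class and function names are made up.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Instr:
    text: str
    unit: str
    dest: str
    srcs: Tuple[str, ...]
    latency: int
    issue: Optional[int] = None
    read: Optional[int] = None
    exec_done: Optional[int] = None
    write: Optional[int] = None


def before(cycle, t):
    """True if the event happened in a cycle strictly earlier than t."""
    return cycle is not None and cycle < t


def simulate(program, unit_count):
    t, next_issue = 0, 0
    while any(ins.write is None for ins in program):
        t += 1
        for k in range(next_issue):            # instructions already issued
            ins, earlier = program[k], program[:k]
            if ins.read is None:
                # Read operands: every earlier producer has written (RAW).
                raw_ok = all(before(e.write, t)
                             for e in earlier if e.dest in ins.srcs)
                if before(ins.issue, t) and raw_ok:
                    ins.read, ins.exec_done = t, t + ins.latency
            elif ins.write is None and before(ins.exec_done, t):
                # Write result: stall while an earlier instruction still
                # has to read the old value of the destination (WAR).
                war = any(not before(e.read, t)
                          for e in earlier if ins.dest in e.srcs)
                if not war:
                    ins.write = t
        if next_issue < len(program):          # in-order issue
            ins = program[next_issue]
            active = [p for p in program[:next_issue] if not before(p.write, t)]
            unit_free = sum(p.unit == ins.unit for p in active) < unit_count[ins.unit]
            waw = any(p.dest == ins.dest for p in active)
            if unit_free and not waw:
                ins.issue = t
                next_issue += 1
    return program


UNIT_COUNT = {"Integer": 1, "Mult": 2, "Add": 1, "Divide": 1}


def example_program():
    return [
        Instr("L.D F6, 34(R2)", "Integer", "F6", ("R2",), 1),
        Instr("L.D F2, 45(R3)", "Integer", "F2", ("R3",), 1),
        Instr("MUL.D F0, F2, F4", "Mult", "F0", ("F2", "F4"), 10),
        Instr("SUB.D F8, F6, F2", "Add", "F8", ("F6", "F2"), 2),
        Instr("DIV.D F10, F0, F6", "Divide", "F10", ("F0", "F6"), 40),
        Instr("ADD.D F6, F8, F2", "Add", "F6", ("F8", "F2"), 2),
    ]


for ins in simulate(example_program(), UNIT_COUNT):
    print(f"{ins.text:20s} issue={ins.issue:2d} read={ins.read:2d} "
          f"exec={ins.exec_done:2d} write={ins.write:2d}")
```

Each printed row corresponds to one row of the instruction status table: the cycle in which the instruction issues, reads its operands, completes execution and writes its result. Changing `UNIT_COUNT` or the latencies gives a quick way to explore how the number of functional units affects the schedule.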
Scoreboarding (IV)

Performance of scoreboarding depends on
- The amount of parallelism available among instructions
- The number of scoreboard entries
- The number and type of functional units
- The presence of antidependences and output dependences