CS450/650 Notes, Winter 2013, A. Morton — Superscalar Pipelines
Scalar Pipeline Limitations (Shen + Lipasti 4.1)

1. Bounded Performance
   P = 1/T = 1/(IC × CPI × cycle time) = (IPC × frequency)/IC
   IPC = instructions per cycle, limited to 1
   need to start multiple instructions per cycle for IPC > 1
   IC is fixed by the ISA
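As a quick numeric check of the bounded-performance equation (the instruction count and frequency below are made-up illustration values, not from the notes):

```python
# P = 1/T = (IPC * frequency) / IC  -- a scalar pipeline is bounded at IPC = 1
IC = 1_000_000_000          # dynamic instruction count (fixed by the ISA)
frequency = 1_000_000_000   # 1 GHz
IPC = 1.0                   # scalar limit: one instruction per cycle

T = IC / (IPC * frequency)  # execution time in seconds
P = 1 / T                   # performance (programs per second)
print(T, P)                 # 1.0 1.0

# starting 2 instructions per cycle (IPC = 2) halves T for the same IC and frequency
assert IC / (2.0 * frequency) == T / 2
```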
   frequency increases are limited by dynamic power consumption
   stage sizes cannot go much smaller and still achieve useful work (currently around 10 gates)
   need a parallel pipeline

2. Inefficient Unification
   subcomputations vary in speed, e.g. integer add is fast (1/2 cycle); f.p. division and memory operations are slow
   need specialized execution units
   need a diversified pipeline
3. Rigid Sequencing
   if instruction i stalls due to a dependency, all following instructions i+1, i+2, ... also stall
   instructions i+1, i+2, ... may not share the dependency, e.g.:
      fmul f3,f1,f2
      fadd f5,f3,f4    (needs f3 from the fmul)
      add  r3,r1,r2    (independent of the fmul)
      sub  r3,r3,#1    (depends only on the add)
   the fmul stall propagates to the fadd, add and sub
   allowing out-of-order execution can hide stall penalties
   need a dynamic pipeline
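The stall propagation in the fmul/fadd/add/sub example can be seen with a toy issue-timing model. The latencies here (4 cycles for fmul, 3 for fadd, 1 for the integer ops) are assumed for illustration, not from the notes, and the out-of-order case ignores issue-width and functional-unit limits:

```python
# Toy issue-timing model for the fmul/fadd/add/sub example above.
# Each tuple: (name, destination, sources, latency in cycles).
prog = [
    ("fmul", "f3", ["f1", "f2"], 4),   # assumed 4-cycle latency
    ("fadd", "f5", ["f3", "f4"], 3),   # RAW-dependent on the fmul
    ("add",  "r3", ["r1", "r2"], 1),   # independent of the fmul
    ("sub",  "r3", ["r3", "#1"], 1),   # depends only on the add
]

def issue_cycles(program, in_order):
    avail = {}        # register -> cycle its value becomes available
    prev_issue = -1   # issue cycle of the previous instruction
    cycles = {}
    for name, dest, srcs, lat in program:
        ready = max((avail.get(s, 0) for s in srcs), default=0)
        if in_order:
            # rigid sequencing: cannot issue before the predecessor issues
            issue = max(ready, prev_issue + 1)
            prev_issue = issue
        else:
            # dynamic pipeline: issue as soon as operands are ready
            issue = ready
        cycles[name] = issue
        avail[dest] = issue + lat
    return cycles

print(issue_cycles(prog, in_order=True))   # add stalls behind the fadd
print(issue_cycles(prog, in_order=False))  # add issues immediately
```

In the in-order case the independent add cannot issue until cycle 5, even though its operands are ready at cycle 0.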
Superscalar Concepts

1. Pipeline Parallelism
   temporal parallelism
   spatial parallelism
   Shen + Lipasti, Fig 4.2(d)
   Intel Pentium pipeline (s=2, in-order), 1993
      requires added register ports
      uses an 8-way interleaved cache for parallel access (accesses to the same bank are serialized)
      V pipeline handles only simple instructions; U pipeline handles all instructions
   Shen + Lipasti, Fig 4.4(b)
2. Pipeline Diversification
   symmetric EX stages
      all instructions incur the maximum penalty
      requires more forwarding paths, or stalls
   asymmetric EX stages
      mix of types should match the dynamic instruction mix
      enough to exploit program ILP
      e.g. CDC 6600 (1964) has 10 functional units
   e.g. Motorola 88110 (1992) has 10 functional units:
      2 integer      single cycle
      1 bit-field    single cycle
      2 graphics     2 cycles (pipelined)
      1 load/store   3 cycles (pipelined)
      1 multiplier   3 cycles (pipelined)
      1 f.p. add     3 cycles (pipelined)
      1 divide       not pipelined
      1 branch       N/A
3. Dynamic Pipeline
   scalar pipeline: interstage buffers hold one instruction, typically for 1 cycle
   superscalar pipeline: interstage buffers hold n instructions
   Shen + Lipasti, Fig 4.8(b)
   if the n entries proceed in lock-step, a stall for one instruction stalls all n entries (plus all preceding stages)
   if entries are independent
      an instruction may stall without affecting other instructions in the buffer
      if following instructions are to proceed, the buffer size must exceed n
      instructions may now exit the buffer out of order
   Shen + Lipasti, Fig 4.8(c)
   example: dynamic pipeline (s=3)
      instructions enter the dispatch buffer in order
      the reorder buffer (ROB) ensures writeback is performed in program order
         necessary for precise exceptions
      reorder buffer entries are allocated at dispatch
   Shen + Lipasti, Fig 4.9
Superscalar Pipeline Structure

subtasks (stages):
1. fetch
2. decode
3. dispatch
4. execute
5. complete (update machine state, i.e. registers)
6. retire (update memory)
1. fetch
   fetch S (pipeline width) instructions per cycle from the I-cache
   Shen + Lipasti, Fig 4.11
   requires S instructions per row
   challenges:
      misalignment requires multiple cycles
      CISC instructions: variable length
      control-flow instructions
   Shen + Lipasti, Fig 4.12
   misalignment solutions
      software: compiler aligns branch targets
         makes object code tuned to a specific cache organization
      hardware: added logic to support wrapping at the end of rows (but not the end of cache lines)
   example: RS/6000 I-cache (1990)
      4 instructions/row, 4 rows/line
      instructions interleaved across 4 sub-arrays
   Shen + Lipasti, Fig 4.13
   T-logic: one per sub-array; detects a misaligned address and increments the index
      e.g. IFAR (instruction fetch address register) indexes to A4: all four instructions (A4, A5, A6, A7) come from the same row
      e.g. IFAR indexes to A10: two instructions (A10, A11) from row 2, two instructions (A12, A13) from row 3
   can't cross cache line boundaries
   two-way set associative (A and B blocks)
   calculating average instructions fetched per cycle
      16 possible start addresses in a cache line
         A0-A12: 4 per cycle
         A13: 3 per cycle
         A14: 2 per cycle
         A15: 1 per cycle
      avg instructions/cycle = (13×4 + 1×3 + 1×2 + 1×1) / 16 = 58/16 ≈ 3.6
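The per-start-address counts and the average can be reproduced with a small sketch of the T-logic constraint (fetch up to one row's worth of instructions, never past the end of the 16-instruction line):

```python
LINE = 16   # instructions per cache line (4 rows of 4)
WIDTH = 4   # fetch width: one row's worth per cycle

def fetched(start):
    # T-logic wraps across row boundaries within the line, but a
    # fetch cannot cross the end of the cache line
    return min(WIDTH, LINE - start)

counts = [fetched(s) for s in range(LINE)]
print(counts)             # [4]*13 + [3, 2, 1]
avg = sum(counts) / LINE
print(avg)                # 3.625
```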
   control-flow instructions
      branch instructions in a fetch group may result in discarding the following instructions, reducing bandwidth
      solution: profiling
         a JIT compiler re-organizes basic blocks so that fallthrough (branch not taken) is the most common case
         doesn't help unconditional branches
      other techniques: branch folding, trace cache (more later)
2. decode
   tasks: identify instruction boundaries, instruction types, interdependencies
   RISC
      fixed-length instructions: identifying boundaries is easy
      regular instruction format: a common op-code field makes identifying instruction types easy
      detecting RAW register hazards within a fetch group of S instructions:
         # comparators = Σ (i=1 to S) 2(i−1) = S(S−1) = O(S²)
      the number of register ports and operand busses also increases linearly with S
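The comparator count can be tabulated to see the quadratic growth:

```python
# RAW-hazard checks inside a fetch group of S instructions: instruction i
# compares its 2 source registers against the destination registers of
# all i-1 earlier instructions in the group.
def comparators(S):
    return sum(2 * (i - 1) for i in range(1, S + 1))   # closed form: S*(S-1)

for S in (2, 4, 8):
    print(S, comparators(S))   # 2 -> 2, 4 -> 12, 8 -> 56
```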
   CISC
      takes multiple cycles/stages, e.g. 5 stages for the Intel P6 microarchitecture (Pentium Pro, ...)
      variable-length instructions: must examine multiple bytes in parallel
      instructions translated to an internal 3-address RISC instruction set for pipelining
         e.g. VAX, 1985
         e.g. AMD K5: ROPs = RISC operations
         e.g. Intel P6: μops = micro-operations
            1 IA32 instruction becomes 1.5 to 2 μops (on average)
   e.g. Intel P6 decode unit
   Shen + Lipasti, Fig 4.14
   decoders 1 & 2: simple instructions only
   decoder 0: all instruction types; can generate up to 4 μops per cycle
      if more than 4 are needed, the μROM is used to emit a sequence of μops, up to 6 μops per cycle
   μops go to the reorder buffer (ROB) for dispatch
      ROB can hold up to 40 μops
   complex decoding requires more depth in the decode stage
      increases branch penalties
   pre-decoding: extra information is added to instructions stored in the I-cache
      speeds up decode
      leverages temporal locality of instruction fetches
   e.g. AMD K5 pre-decode
   Shen + Lipasti, Fig 4.15
   AMD K5 pre-decode
      8 bytes fetched in parallel
      adds 5 bits per instruction byte, identifying:
         start and end bytes of an IA32 instruction
         # of ROPs needed
         op-code and prefix byte locations
      decode can generate 4 ROPs per cycle
      increases I-cache miss penalty (a high hit rate helps)
      increases I-cache data size ~50% (tags and prediction bits don't change)
   pre-decode is used for some RISC processors
      lesser gains than for CISC
      identifies branches early; identifies independent instructions
      e.g. PowerPC 620, UltraSPARC, MIPS R10000, HP PA-8000
   alternative: cache fully-decoded instructions
      e.g. Intel NetBurst (P4) trace cache (more later)
      e.g. Intel Sandy Bridge (Core i)
   Intel Sandy Bridge
      32KB L1 I-cache; pre-decode between the I-cache and the instruction queue
      1.5K μop cache has an 80% hit rate
   David Kanter, http://www.realworldtech.com/sandy-bridge/4/, Fig 3
3. dispatch
   collect operands and distribute instructions to functional units
   fetch and decode are centralized: the fetch group is treated as a unit
   dispatch de-centralizes execution
   instructions pending execution are held in reservation stations, together with their (available) operands
   centralized reservation stations
   Shen + Lipasti, Fig 4.17
   distributed reservation stations
   Shen + Lipasti, Fig 4.18
   centralized
      best utilization of reservation-station entries
      increased hardware complexity:
         control logic
         multi-ported buffer for insert (by dispatch) and remove (by functional units)
         slower
      Intel P6 through Haswell microarchitectures have centralized (unified) reservation stations
         6 ports in Sandy Bridge / Ivy Bridge, 8 ports in Haswell
   distributed
      lower overall utilization: can't share empty entries between functional units
      simpler control, and single-ported insert and remove
      PowerPC 620 has distributed reservation stations
   hybrid: clustered reservation stations
      MIPS R10000
   Solihin et al, 1999, doi=10.1.1.24.8528, Fig 3
   terminology
      dispatch: associate an instruction with a functional unit
      issue: start execution in the functional unit
      dispatch and issue are combined in a centralized R.S.; they are separate steps in a distributed R.S.
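A minimal reservation-station sketch (class and method names are my own, for illustration only) showing the dispatch/issue distinction: dispatch parks an instruction with whatever operands are already available; issue fires, possibly out of order, once all operands have arrived:

```python
class ReservationStation:
    def __init__(self):
        self.entries = []          # list of (op, set of outstanding sources)

    def dispatch(self, op, srcs, valid_regs):
        # park the instruction; record which sources are still outstanding
        waiting = {s for s in srcs if s not in valid_regs}
        self.entries.append((op, waiting))

    def broadcast(self, reg):
        # a functional unit finished: its result tag wakes up waiters
        for _, waiting in self.entries:
            waiting.discard(reg)

    def issue(self):
        # start any entry whose operands are all present (out of order)
        for entry in self.entries:
            op, waiting = entry
            if not waiting:
                self.entries.remove(entry)
                return op
        return None

rs = ReservationStation()
valid = {"r1", "r2", "f4"}                 # registers already computed
rs.dispatch("fadd", ["f3", "f4"], valid)   # f3 still in flight
rs.dispatch("add", ["r1", "r2"], valid)    # operands ready immediately
print(rs.issue())   # add  (issues ahead of the older fadd)
rs.broadcast("f3")  # the unit producing f3 finishes and broadcasts it
print(rs.issue())   # fadd
```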
4. execution
   specialized functional units improve performance
      e.g. Intel NetBurst: double-pumped integer unit could execute two integer instructions/cycle
      e.g. Intel Sandy Bridge: 256-bit FADD unit does 8 single-precision FP adds every cycle
      e.g. Intel Haswell: 256-bit FMA/FADD unit
         fused multiply-add (1 μop) has the same latency (5 cycles) as 1 FMUL instruction
         useful for dot product, matrix multiply, Horner's method for polynomial evaluation
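Horner's method maps naturally onto a chain of fused multiply-adds, one FMA per coefficient; a sketch:

```python
# p(x) = ((a_n*x + a_{n-1})*x + ...)*x + a_0
# each loop iteration is one multiply-add, i.e. a single FMA operation
def horner(coeffs, x):
    acc = 0.0
    for c in coeffs:          # coefficients given highest degree first
        acc = acc * x + c     # one fused multiply-add per coefficient
    return acc

print(horner([2.0, 0.0, 1.0], 3.0))   # 2x^2 + 1 at x=3 -> 19.0
```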
   instruction mix
      ideally matched by the functional units
      in reality, the number of functional units must exceed the pipeline width to avoid stalls waiting on a particular functional unit
      e.g. Intel Haswell
         8 μops wide
         ~20 execution units
         forwarding paths for integer, SIMD integer and FP (scalar or SIMD) are kept separate: fewer ports, less ...
5. complete
   update machine state
   architected registers: those registers the programmer knows about, i.e. those specified in the ISA
      general-purpose registers, f.p. registers, condition-code register, control/status register, program counter
   instructions are marked finished in the reorder buffer when the functional unit finishes them (out of order)
   instructions exit the ROB (in order)
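The finish-out-of-order / complete-in-order behaviour can be sketched as a FIFO with a finished flag per entry (the structure here is an assumed simplification, not the text's design):

```python
from collections import deque

class ROB:
    def __init__(self):
        self.buf = deque()             # program order; head = oldest entry

    def allocate(self, name):          # entry created at dispatch
        self.buf.append([name, False])

    def finish(self, name):            # functional unit done, in any order
        for entry in self.buf:
            if entry[0] == name:
                entry[1] = True

    def complete(self):                # exit strictly from the head, in order
        retired = []
        while self.buf and self.buf[0][1]:
            retired.append(self.buf.popleft()[0])
        return retired

rob = ROB()
for name in ("fmul", "fadd", "add"):
    rob.allocate(name)
rob.finish("add")          # finishes first (out of order) ...
print(rob.complete())      # [] ... but cannot exit past the unfinished fmul
rob.finish("fmul")
rob.finish("fadd")
print(rob.complete())      # ['fmul', 'fadd', 'add']
```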
6. retire
   update memory state (usually the D-cache)
Interrupts and Exceptions [1]
   alter program flow
   interrupts are generated by hardware outside the CPU
   exceptions are generated within the processor
      processor-detected: page fault, f.p. overflow, ...
      program-generated: trap instructions (used for OS calls)
   for interrupts and program-generated exceptions, the fetch unit stops and instructions in the pipeline are finished before servicing

[1] Michal Ludvig, http://www.logix.cz/michal/doc/i386/chp09-00.htm
   interrupts: instruction fetch stops and instructions in the pipeline are finished (drained) before servicing
   processor-detected exceptions: the instruction can't complete and usually needs OS intervention
      the excepting instruction is tagged in the ROB
      when it reaches the head of the ROB:
         some machine state is checkpointed (e.g. PC, status register)
         remaining instructions in the ROB are discarded (precise exceptions)
         the ISR is invoked
      execution resumes at the excepting instruction