Limitations of Scalar Pipelines

Limitations of Scalar Pipelines Superscalar Organization Modern Processor Design: Fundamentals of Superscalar Processors Scalar upper bound on throughput IPC <= 1 or CPI >= 1 Inefficient unified pipeline Long latency for each instruction Rigid pipeline stall policy One stalled instruction stalls all newer instructions Parallel Pipelines Intel Pentium Parallel Pipeline D1 D1 D1 (a) No Parallelism (b) Temporal Parallelism D2 D2 D2 (d) Parallel Pipeline (c) Spatial Parallelism U - Pipe V - Pipe 1

Diversified Pipelines ID RD ALU MEM1 FP1 Power4 Diversified Pipelines FP FX/LD 1 Fetch Q Decode FX/LD 2 I-Cache /CR PC Scan Predict Reorder MEM2 FP2 FP3 FP1 FP2 FX1 LD1 LD2 FX2 CR StQ D-Cache Rigid Pipeline Stall Policy Dynamic Pipelines ID Bypassing of Stalled Not Allowed Stalled Backward Propagation of Stalling RD ALU MEM1 FP1 MEM2 FP2 FP3 ( in order ) ( out of order ) Reorder ( out of order ) ( in order ) 2

Interstage s Stage i (1) Stage i + 1 Stage i (>_ n) Stage i + 1 (a) 1 1 Stage i (n) Stage i +1 (c) n ( in order ) n ( in order ) (b) ( in order ) ( out of order ) Superscalar Pipeline Stages In Program Order Out of Order In Program Order Fetch Decode Execute Complete Retire Issuing Completion Store Limitations of Scalar Pipelines Scalar upper bound on throughput IPC <= 1 or CPI >= 1 Solution: wide (superscalar) pipeline Inefficient unified pipeline Long latency for each instruction Solution: diversified, specialized pipelines Rigid pipeline stall policy One stalled instruction stalls all newer instructions Solution: Out-of-order execution, distributed execution pipelines Impediments to High IPC Register Data Branch Predictor Reorder (ROB) Store Queue I-cache FETCH DECODE COMMIT Integer Floating-point Media Memory ECUTE D-cache Memory Data 3

Superscalar Pipeline Design Fetching Issues Decoding Issues ing Issues Execution Issues Completion & Retiring Issues Objective: Fetch multiple instructions per cycle Challenges: Branches: control dependences Branch target misalignment cache misses Solutions Code alignment (static vs.dynamic) Prediction/speculation PC Memory 3 instructions fetched I-Cache Organization Fetch Alignment Address Address PC XX00001 Row Decoder Cache Line R o w De c oder Cache Line Row decoder 000 001 111 00 01 10 11 1 cache line = 1 physical row 1 cache line = 2 physical rows Fetch group Row width 4

Odd directory sets A & B Even directory sets A & B AR TLB hit and buffer control Interlock, dispatch, branch, execution RIOS-I I Fetch Hardware T 255 0 A0 1 A4 2 A8 3 A12 D B0 B4 B8 mux B12 n T 255 0 A1 1 A5 2 A9 3 A13 D B1 B5 B9 mux B13 T 255 0 A2 1 A6 2 A10 3 A14 buffer network n 1 D B2 B6 mux B10 B14 n 2 255 0 A3 1 A7 2 A11 3 A15 D B3 B7 mux B11 B15 n 3 Issues in Decoding Primary Tasks Identify individual instructions (!) Determine instruction types Determine dependences between instructions Two important factors set architecture Pipeline width and Issue Parallel pipeline Centralized instruction fetch Centralized instruction decode Diversified pipeline Distributed instruction execution Necessity of fetching decoding dispatching FU 1 FU 2 FU 3 FU n execution 5

Centralized Reservation Station (issue) Centralized reservation station (dispatch buffer) Distributed Reservation Station Issue buffer Distributed reservation stations Execute Execute Completion buffer Finish Complete Completion buffer Issues in Execution Bypass Networks Fetch Q I-Cache PC Scan Current trends More parallelism bypassing very challenging Deeper pipelines More diversity Functional unit types Integer Floating point Load/store most difficult to make parallel Branch Specialized units (media) FP FP1 FP2 FX/LD 1 Decode /CR Predict Reorder O(n 2 ) interconnect from/to FU inputs and outputs Associative tag-match to find operands Solutions (hurt IPC, help cycle time) Use RF only (Power4) with no bypass network Decompose into clusters (21264) FX1 LD1 StQ FX/LD 2 LD2 D-Cache FX2 CR 6

Specialized units Issues in Completion/Retirement Shift ALU 2 ALU 0 ALU C (a) A B (A B) C Round/Normalize (b) Out-of-order execution ALU instructions Load/store instructions In-order completion/retirement Precise exceptions Memory coherence and consistency Solutions Reorder buffer Store buffer Load queue snooping (later) A Dynamic Superscalar Processor Fetch /decode buffer Impediments to High IPC Decode In order Out of order In order Issue Execute Finish Complete buffer Reservation stations Reorder/completion buffer Store buffer I-cache Branch FETCH Predictor DECODE Integer Floating-point Media Memory Memory Data ECUTE Reorder Register (ROB) Data COMMIT Store D-cache Queue Retire 7

Superscalar Summary flow Branches, jumps, calls: predict target, direction Fetch alignment cache misses Register data flow Register renaming: RAW/WAR/WAW Memory data flow In-order stores: WAR/WAW Store queue: RAW Data cache misses 8