Case Study: IBM PowerPC 620
- Year shipped: 1995
- Allows out-of-order execution (dynamic scheduling) with in-order commit (hardware speculation).
- Uses a reorder buffer to track when an instruction can commit, but holds uncommitted results in extra rename registers (8 integer and 12 FP).
- When an instruction issues, it is allocated a rename register to hold its result when it completes; when it finally commits, the result is copied to the permanent register.
- Uses separate load and store buffers to hold effective addresses (EFA); store buffers hold both the EFA and the data until store instructions are ready to commit.
Chapter 4 page 83 CS 5515
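The rename-register life cycle described on this slide (allocate at issue, fill at completion, copy to the permanent register at in-order commit) can be sketched as a toy model. This is only an illustration: the class name `RenameCore`, its method names, and the use of a Python `deque` for the reorder buffer are assumptions made here, not details of the 620 hardware.

```python
from collections import deque


class RenameCore:
    """Toy model: rename registers hold results until in-order commit."""

    def __init__(self, num_rename=8):
        self.arch_regs = {}                    # permanent (architectural) registers
        self.free = deque(range(num_rename))   # free rename registers
        self.rename_val = {}                   # rename register -> completed result
        self.rob = deque()                     # reorder buffer: (dest, rename_reg), program order

    def issue(self, dest):
        """Allocate a rename register for the result; None signals a stall."""
        if not self.free:
            return None
        r = self.free.popleft()
        self.rob.append((dest, r))
        return r

    def complete(self, r, value):
        """Execution finished (possibly out of order): park the result."""
        self.rename_val[r] = value

    def commit(self):
        """Commit only from the head of the reorder buffer, in program order."""
        committed = []
        while self.rob and self.rob[0][1] in self.rename_val:
            dest, r = self.rob.popleft()
            self.arch_regs[dest] = self.rename_val.pop(r)  # copy to permanent register
            self.free.append(r)                            # rename register is freed
            committed.append(dest)
        return committed
```

If a younger instruction completes first, `commit()` returns nothing until the older instruction at the head of the reorder buffer has also completed; that is the in-order commit property.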
Superscalar PowerPC 620
- 64-bit bus; PowerPC architecture
- Can fetch, issue, and complete 4 instructions per cycle.
- Has six execution units, each with its own reservation stations (RSs); several RSs share one execution unit.
- Two simple integer units, XSU0 and XSU1: integer add/subtract/logical operations take one cycle.
- One complex integer unit, MCFXU: integer multiply and divide, with a latency of 3 to 20 cycles; pipelined for multiply and unpipelined for divide.
PowerPC 620 Execution Units
- One load-store unit (LSU)
  - Execution latency: 1 cycle for an integer load, 2 cycles for an FP load.
  - Fully pipelined, with its own effective-address (EFA) adder.
  - Has load/store buffers to hold the EFA and/or data; load results are written to rename registers, and store results are held in the store buffers until the instructions commit.
  - Detects memory aliases so that loads can bypass pending stores.
- One floating-point unit (FPU)
  - Different latencies for use of results by another FP instruction: 2 cycles for multiply (fully pipelined); 31 cycles for divide.
- One branch unit (BPU)
  - Completes branches and informs the fetch unit and the reorder buffer unit of mispredictions.
  - Can evaluate branches independently of other instructions.
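The alias check that lets a load bypass pending stores can be illustrated with a minimal sketch. It assumes a conservative policy that the slide does not spell out: the load waits if any older store has the same effective address, or an address that is not yet computed (represented here by `None`). The function name and the exact-match address comparison are hypothetical.

```python
from typing import Optional, Sequence


def load_may_bypass(load_ea: int, pending_store_eas: Sequence[Optional[int]]) -> bool:
    """Return True if the load can go ahead of all older pending stores.

    pending_store_eas holds the effective address of each older store still
    in the store buffer; None means the address is not yet known (assumed
    here to block the load).
    """
    for store_ea in pending_store_eas:
        if store_ea is None or store_ea == load_ea:
            return False  # possible alias: the load must wait
    return True


# A load from 0x100 may pass stores to 0x200 and 0x300, but not a store to 0x100.
can_pass = load_may_bypass(0x100, [0x200, 0x300])
must_wait = not load_may_bypass(0x100, [0x200, 0x100])
```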
Inst. Execution Steps in PowerPC 620
Fetch
- The PC can be obtained by looking up a 256-entry, two-way set-associative branch target buffer (BTB).
- If the instruction is a branch and the BTB misses, a 2048-entry branch prediction buffer (BPB) is used to predict the branch outcome.
- A stack is also used to predict the return addresses of calls.
Decode
- Decodes 4 instructions per cycle.
Issue
- Issues the instructions to the RSs and reads operands from the register files if they are available.
Execution
- Execution-unit contention occurs when more than one RS wants to use the execution unit in the same cycle.
- An RS is freed when its instruction uses the execution unit.
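The fetch-stage prediction order (BTB first, then the BPB on a BTB miss, plus a return-address stack) can be sketched as follows. Only the table sizes come from the slide. The 2-bit saturating counters, the PC-based indexing, the FIFO replacement within a BTB set, and all class and method names are assumptions made for illustration.

```python
class FetchPredictor:
    BTB_SETS = 128       # 256 entries / 2 ways
    BTB_WAYS = 2
    BPB_ENTRIES = 2048

    def __init__(self):
        self.btb = [dict() for _ in range(self.BTB_SETS)]  # per set: tag -> target
        self.bpb = [1] * self.BPB_ENTRIES  # assumed 2-bit counters, start weakly not-taken
        self.return_stack = []

    def _btb_index(self, pc):
        word = pc >> 2  # 4-byte instructions
        return word % self.BTB_SETS, word // self.BTB_SETS

    def btb_lookup(self, pc):
        s, tag = self._btb_index(pc)
        return self.btb[s].get(tag)

    def btb_insert(self, pc, target):
        s, tag = self._btb_index(pc)
        ways = self.btb[s]
        if tag not in ways and len(ways) >= self.BTB_WAYS:
            ways.pop(next(iter(ways)))  # evict oldest entry in the set (assumed FIFO)
        ways[tag] = target

    def train(self, pc, taken):
        i = (pc >> 2) % self.BPB_ENTRIES
        self.bpb[i] = min(3, self.bpb[i] + 1) if taken else max(0, self.bpb[i] - 1)

    def push_call(self, return_addr):
        self.return_stack.append(return_addr)

    def next_pc(self, pc, is_branch=False, is_return=False, branch_target=None):
        target = self.btb_lookup(pc)
        if target is not None:
            return target                       # BTB hit supplies the next PC
        if is_return and self.return_stack:
            return self.return_stack.pop()      # predicted return address
        if is_branch and self.bpb[(pc >> 2) % self.BPB_ENTRIES] >= 2:
            return branch_target                # BPB predicts taken
        return pc + 4                           # fall through
```

In this sketch a BTB hit takes priority over the return stack; the slide does not say how the real hardware orders the two.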
Inst. Execution Steps in PowerPC 620 (continued)
Execution
- Results are written to the rename register via the common data bus (CDB) and also forwarded to any other RSs that need them. Each result is tagged with the name of its rename register, not its reorder buffer number.
Commit
- An instruction commits once all previous instructions have committed, i.e., with no pending exception and no unresolved speculation.
- Rename registers are written to the permanent register file and freed.
- For a store instruction, the LSU is notified so that it can write the stored data to memory.
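The result broadcast, in which each result carries its rename-register tag and every waiting RS captures it, can be sketched as below. The `ReservationStation` fields and the `broadcast` helper are hypothetical names for illustration. The sketch models only the tag match, not bus arbitration or timing.

```python
from dataclasses import dataclass, field


@dataclass
class ReservationStation:
    op: str
    waiting: dict = field(default_factory=dict)  # operand name -> rename-register tag awaited
    values: dict = field(default_factory=dict)   # operand name -> captured value

    def ready(self):
        """An RS can request its execution unit once no operand is outstanding."""
        return not self.waiting


def broadcast(rename_regs, stations, tag, value):
    """Write a result to rename register `tag` and forward it to waiting RSs."""
    rename_regs[tag] = value
    for rs in stations:
        for operand, wanted in list(rs.waiting.items()):
            if wanted == tag:            # match on rename-register tag
                rs.values[operand] = value
                del rs.waiting[operand]


rename_regs = {}
add = ReservationStation("add", waiting={"a": 3}, values={"b": 5})
mul = ReservationStation("mul", waiting={"a": 3, "b": 4})
broadcast(rename_regs, [add, mul], tag=3, value=7)
# add now has both operands; mul still waits on rename register 4.
```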
Limitations to PowerPC 620 performance (ideal CPI = 1/4, but actual is 1/1.3)
- Shortage of replicated execution units: RSs may compete for the same execution unit.
- Losses in specific stages
  - Fetch: misprediction, cache misses, etc.
  - Issue: RS, rename registers, or reorder buffer not available
  - EX: source operands or FU not available, etc.
  - Commit: lack of register/memory write ports
- Limited ILP in programs and finite buffering
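The CPI figures in the slide title can be worked through numerically. This reads "1/1.3" as an achieved rate of about 1.3 instructions per cycle, which is an interpretation of the slide, not a measured result; the variable names are illustrative.

```python
ISSUE_WIDTH = 4      # instructions per cycle the 620 can fetch/issue/complete
ACHIEVED_IPC = 1.3   # assumed reading of the slide's "1/1.3"

ideal_cpi = 1 / ISSUE_WIDTH               # 0.25 cycles per instruction
actual_cpi = 1 / ACHIEVED_IPC             # about 0.77 cycles per instruction
slowdown = actual_cpi / ideal_cpi         # about 3.1x the ideal CPI
utilization = ACHIEVED_IPC / ISSUE_WIDTH  # about a third of the peak issue rate
```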
[Figure: PowerPC 620 block diagram. Labeled components: fetch unit; cache; branch correction; dispatch unit with 8-entry instruction queue; dispatch buses; completion unit with reorder buffer; reorder buffer information; GP registers and GP operand buses; FP registers and FP operand buses; register numbers; operation buses; reservation stations for XSU0, XSU1, MCFXU, LSU, FPU, and BPU; GP result buses; FP result buses; result status buses; data cache.]
[Figure: PowerPC pipeline stages (Fetch, Issue, Execute, Commit). Labeled components: memory buffer; registers; reservation stations; reorder buffer; functional units (XSU0, XSU1, MCFXU, LSU, FPU, BPU); rename registers; commit unit.]