CS 152, Spring 2012 Section 8 Christopher Celio University of California, Berkeley
Agenda: More Out-of-Order
Intel Core 2 Duo (Penryn) vs. NVidia GTX 280
Intel Core 2 Duo (Penryn): dual-core; 2007+; 45 nm; 410 million transistors; ~2 GHz; 3 or 6 MB of cache; 10-35 Watts; 107 mm^2
each core is 22 mm^2; L2 SRAM is 6 mm^2/MB
NVidia GTX 280: 10 core(?) (240 stream processors); 2008; 65 nm; 1.4 billion transistors; 576 mm^2; 602 MHz (core clock); 236 Watts!!!
http://3dimensionaljigsaw.wordpress.com/2008/06/18/physics-based-games-the-new-genre/
Quiz 2 Will be returned this Tuesday
Out-of-Order Control Complexity: MIPS R10000
Control Logic [SGI/MIPS Technologies Inc., 1995]
March 14, 2011, CS152, Spring 2011
Out of Order Processors Yeager. The MIPS R10000 Superscalar Microprocesor. IEE Micro. 1996
Out of Order Processors
BOOM: A Single Issue Slot
Question 1: collect CPI with BHT, without, compare to 5-stage
Question 2: probe the Instruction Window for the potential benefit of dual issue
Question 3: probe IW for dual issue of ALU/Mem ops
[Figure: a single issue slot with fields UOP Code, BrMask, Ctrl..., Val, RDst, RS1, p1, RS2, p2; Br Logic (Resolve, Kill); WDest0 and WDest1 from the register file's two write ports; request and issue signals to the Issue Select Logic; Control Signals, Physical Destination Register, and Physical Source Registers issued to the Register Read stage]
BOOM: A Single Issue Slot
[Figure: a single issue slot with fields UOP Code, BrMask, Ctrl..., Val, RDst, RS1, p1, RS2, p2; Br Logic (Resolve or Kill); WDest0 and WDest1 from the register file's two write ports; request and issue signals to the Issue Select Logic; Control Signals, Physical Destination Register, and Physical Source Registers issued to the Register Read stage]
Each instruction gets a br mask... allows us to kill instructions.
Val: issue slot is valid.
The register file has two write ports, so watch both ports' write addresses.
Each slot asserts request when ready to fire.
One slot gets the issue.
uop holds the micro-op code (is it a LD, an ADD, etc.).
(Note: I show a bus implementation, but it's actually implemented with a bunch of muxes.)
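The wakeup, request, and select behavior annotated on the issue-slot figure can be sketched in software. This is a minimal Python model, not BOOM's hardware description: the class name `IssueSlot`, the method names, and the fixed-priority choice in `select` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class IssueSlot:
    valid: bool        # Val: slot holds an instruction
    uop: str           # micro-op code, e.g. "ADD" or "LD"
    br_mask: int       # one bit per unresolved branch this instruction depends on
    pdst: int          # physical destination register
    prs1: int          # physical source register 1
    prs2: int          # physical source register 2
    p1: bool = False   # source 1 is ready
    p2: bool = False   # source 2 is ready

    def wakeup(self, wdests):
        """Compare both sources against the register file's write-port addresses."""
        for wdest in wdests:
            if wdest == self.prs1:
                self.p1 = True
            if wdest == self.prs2:
                self.p2 = True

    def kill(self, branch_bit):
        """A mispredicted branch kills every instruction whose mask names it."""
        if self.br_mask & branch_bit:
            self.valid = False

    def resolve(self, branch_bit):
        """A correctly predicted branch just clears its bit from the mask."""
        self.br_mask &= ~branch_bit

    @property
    def request(self):
        """A slot requests issue when it is valid and both operands are ready."""
        return self.valid and self.p1 and self.p2


def select(slots):
    """Grant one requesting slot (assumed policy: lowest index wins)."""
    for i, slot in enumerate(slots):
        if slot.request:
            slot.valid = False  # the slot is freed once its instruction issues
            return i
    return None
```

For example, a slot waiting on physical register 2 stops blocking as soon as 2 appears on either write port, at which point `select` can grant it.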
OOO Styles
Data-in-ROB Design (HP PA8000, Intel Pentium Pro, Core2 Duo & Nehalem)
[Figure: Register File holds only committed state. Reorder buffer entries t1, t2, ..., tn with fields Ins#, use, exec, op, p1, src1, p2, src2, pd, dest, data. Load Unit, FUs, and Store Unit return <t, result>. Commit.]
On dispatch into ROB, ready sources can be in regfile or in ROB dest (copied into src1/src2 if ready before dispatch).
On completion, write to dest field and broadcast to src fields.
On issue, read from ROB src fields.
March 9, 2011, CS152, Spring 2011
Unified Physical Register File (MIPS R10K, Alpha 21264, Intel Pentium 4 & Sandy Bridge)
Rename all architectural registers into a single physical register file during decode, no register values read.
Functional units read and write from single unified register file holding committed and temporary registers in execute.
Commit only updates mapping of architectural register to physical register, no data movement.
[Figure: Decode Stage Register Mapping; Read operands at issue; Unified Physical Register File; Committed Register Mapping; Write results at completion; Functional Units]
March 9, 2011, CS152, Spring 2011
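To make the "no data movement at commit" point concrete, here is a small Python sketch of renaming with a unified physical register file. The class `Renamer`, its method names, and the simple FIFO free list are assumptions for illustration, not a description of any of the listed processors.

```python
class Renamer:
    def __init__(self, num_arch, num_phys):
        # Speculative mapping used at decode/rename: arch reg -> phys reg.
        self.map_table = list(range(num_arch))
        # Committed mapping: only updated when an instruction commits.
        self.committed_map = list(range(num_arch))
        # Physical registers not currently holding any live value.
        self.free_list = list(range(num_arch, num_phys))
        # One register file holds both committed and temporary values.
        self.prf = [0] * num_phys

    def rename(self, rd, rs1, rs2):
        """Rename one instruction; no register values are read here."""
        prs1 = self.map_table[rs1]
        prs2 = self.map_table[rs2]
        # Raises IndexError if no register is free; real hardware would stall.
        pdst = self.free_list.pop(0)
        self.map_table[rd] = pdst
        return pdst, prs1, prs2

    def commit(self, rd, pdst):
        """Commit only updates the mapping; the result stays where it is."""
        old = self.committed_map[rd]
        self.committed_map[rd] = pdst
        self.free_list.append(old)  # the previous committed copy is now dead
```

A functional unit would write its result to `prf[pdst]` at completion, and after `commit(rd, pdst)` the architectural value of `rd` is simply `prf[committed_map[rd]]`.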
21264 Instruction Reordering
As mentioned earlier, the 21264 uses explicit renaming, as opposed to a data-in-ROB design.
What does the ROB hold?
BOOM
[Figure: BOOM pipeline. Stages: Fetch, Decode, Rename, Dispatch, Issue, RegisterRead, Execute, Memory, WB. Blocks: Branch Prediction, Fetch, Fetch Buffer, Decode, Register Rename, Issue Window, Unified Register File (2R,2W), ALU, Br Logic (Resolve Branch), LAQ, SAQ, SDQ, Data Mem (addr, wdata, rdata), ROB, Commit.]
DEC Alpha 21264
1996/1997; single-core; 4-way out-of-order; highly speculative; 7-stage; up to 80 instructions in flight; tournament branch predictor
15.2M transistors (6M for logic; rest is caching, history tables); 350 nm; 600 MHz
64 KB I$, 64 KB D$ (on-chip); 1 to 16 MB L2$ (off-chip)
314 mm^2 die (fairly large)
DEC Alpha 21264
21264 Register Renaming
Registers are renamed, then instructions are inserted into the issue queue (window).
Map table backed up on every in-flight instruction.
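Backing up the map table per in-flight instruction is what makes recovery from a misprediction a simple restore. The sketch below is a simplified Python model under stated assumptions: the class `CheckpointedMapTable`, the integer instruction tags, and the bump allocator for physical registers are hypothetical, and free-list recovery is deliberately left out.

```python
class CheckpointedMapTable:
    def __init__(self, num_arch):
        self.table = list(range(num_arch))  # arch reg -> phys reg
        self.snapshots = {}                 # tag -> map table before that instruction
        self.next_preg = num_arch           # simplistic allocator; never reclaims

    def rename_dest(self, tag, rd):
        """Snapshot the table, then give rd a fresh physical register."""
        self.snapshots[tag] = list(self.table)
        self.table[rd] = self.next_preg
        self.next_preg += 1
        return self.table[rd]

    def squash_from(self, tag):
        """Undo instruction `tag` and everything younger (larger tags)."""
        self.table = self.snapshots[tag]
        self.snapshots = {t: s for t, s in self.snapshots.items() if t < tag}

    def retire(self, tag):
        """A committed instruction no longer needs its backup."""
        self.snapshots.pop(tag, None)
```

If instruction 0 is a mispredicted branch, calling `squash_from(1)` restores the mapping as it stood right after the branch, in one step, regardless of how many younger instructions had been renamed.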
21264 Register Renaming What hazards does renaming obviate? In what situations is renaming useful? If you had to choose between branch prediction and renaming, which would you pick?
21264 Register Renaming
What hazards does renaming obviate? WAR, WAW.
In what situations is renaming useful? Code with ILP and name dependencies: loops.
If you had to choose between branch prediction and renaming, which would you pick? Not much ILP within a basic block, so renaming isn't too useful without branch prediction.
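A short Python sketch can show why renaming removes WAR and WAW hazards but leaves RAW hazards in place. The three-instruction program, the register names, and both helper functions are made up for illustration; the hazard check is a simple pairwise comparison and does not model intervening writes.

```python
def hazards(program):
    """program: list of (dest, src1, src2). Returns {(kind, older, younger)}."""
    found = set()
    for j, (dest_j, s1_j, s2_j) in enumerate(program):
        for i in range(j):
            dest_i, s1_i, s2_i = program[i]
            if dest_i in (s1_j, s2_j):
                found.add(("RAW", i, j))   # younger reads what older writes
            if dest_j in (s1_i, s2_i):
                found.add(("WAR", i, j))   # younger overwrites what older reads
            if dest_i == dest_j:
                found.add(("WAW", i, j))   # both write the same name
    return found


def rename(program):
    """Give every destination a fresh name; sources use the latest mapping."""
    table = {}
    renamed = []
    for n, (dest, s1, s2) in enumerate(program):
        p1 = table.get(s1, s1)
        p2 = table.get(s2, s2)
        pd = f"p{n}"            # fresh physical name per instruction
        table[dest] = pd
        renamed.append((pd, p1, p2))
    return renamed


program = [
    ("r1", "r2", "r3"),   # r1 = r2 op r3
    ("r4", "r1", "r5"),   # r4 = r1 op r5   (true dependence on r1)
    ("r1", "r6", "r7"),   # r1 = r6 op r7   (reuses the name r1)
]
before = hazards(program)
after = hazards(rename(program))
```

Here `before` contains a RAW, a WAR, and a WAW hazard, while `after` contains only the RAW hazard between the first two instructions, so the third instruction is free to execute early.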
21264 Superscalar Execution
The 21264 couldn't fit full bypassing into one clock cycle.
Instead, they fully bypass within each of two clusters; inter-cluster bypass takes another cycle.
Question: Stores
When are stores sent to memory? At commit time.
Why are stores saved in a store buffer before commit time? So they can be forwarded to dependent loads.
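The two answers above can be illustrated with a tiny Python model of a store buffer. The class `StoreBuffer` and its method names are assumptions for this sketch, which also assumes every load is younger than all buffered stores and that addresses match exactly (no partial overlaps).

```python
class StoreBuffer:
    def __init__(self, memory):
        self.memory = memory    # dict: address -> value (committed state)
        self.pending = []       # uncommitted stores in program order

    def execute_store(self, addr, data):
        """A store executes speculatively: buffer it, do not touch memory."""
        self.pending.append((addr, data))

    def load(self, addr):
        """Forward from the youngest matching store, else read memory."""
        for st_addr, st_data in reversed(self.pending):
            if st_addr == addr:
                return st_data
        return self.memory.get(addr, 0)

    def commit_oldest_store(self):
        """Only at commit does the store's data reach memory."""
        addr, data = self.pending.pop(0)
        self.memory[addr] = data


mem = {0x10: 1}
sb = StoreBuffer(mem)
sb.execute_store(0x10, 7)
forwarded = sb.load(0x10)        # sees the buffered value
memory_before_commit = mem[0x10]  # memory is still unchanged
sb.commit_oldest_store()
```

Because memory is untouched until commit, a squashed store can simply be dropped from the buffer with no cleanup in memory.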
BOOM: LD/ST Unit
[Figure: LD/ST Unit with SAQ (val, addr), SDQ (val, data), and LAQ (val, st_mask, addr). LD/ST Compare takes sta_val, std_val, st_addr_eq, ld_val, and st_mask, and produces ld_is_rdy, ld_is_byp, and byp_idx. Data Mem has addr, wdata, and rdata (to RF).]
Only showing comparison logic for one load.
ld_is_rdy: load is ready to fire.
ld_is_byp: load can be bypassed out of SDQ.
byp_idx: location in SDQ to get ld data from.
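The signal names in the figure suggest how the comparison for one load might work. The Python sketch below is one plausible reading, not BOOM's actual logic: it assumes `st_mask` marks the stores older than the load, that a higher queue index means a younger store (no wraparound), and that a load must wait whenever an older store's address is still unknown.

```python
def ldst_compare(ld_addr, st_mask, sta_val, std_val, st_addr):
    """Return (ld_is_rdy, ld_is_byp, byp_idx) for a single load.

    st_mask : bit i set means store queue entry i is older than the load
    sta_val : per-entry flag, store address is valid (SAQ)
    std_val : per-entry flag, store data is valid (SDQ)
    st_addr : per-entry store address
    """
    older = [i for i in range(len(st_addr)) if (st_mask >> i) & 1]

    # Assumed conservative policy: wait if any older store address is unknown.
    if not all(sta_val[i] for i in older):
        return False, False, None

    # st_addr_eq: which older stores write the address this load reads?
    matches = [i for i in older if st_addr[i] == ld_addr]
    if not matches:
        return True, False, None       # no conflict: fire to data memory

    youngest = max(matches)            # assumes higher index = younger store
    if std_val[youngest]:
        return True, True, youngest    # bypass the data out of the SDQ
    return False, False, None          # matching store has no data yet


st_addr = [0x10, 0x20, 0x10, 0x30]
sta_val = [True, True, True, False]
std_val = [True, True, True, False]
bypassed = ldst_compare(0x10, 0b0111, sta_val, std_val, st_addr)
from_memory = ldst_compare(0x40, 0b0111, sta_val, std_val, st_addr)
must_wait = ldst_compare(0x10, 0b1111, sta_val, std_val, st_addr)
```

In hardware all of these comparisons happen in parallel; the list comprehensions stand in for the row of `=` comparators in the figure.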
BOOM
[Figure: BOOM pipeline. Stages: Fetch, Decode, Rename, Dispatch, Issue, RegisterRead, Execute, Memory, WB. Blocks: Branch Prediction, Fetch, Fetch Buffer, Decode, Register Rename, Issue Window, Unified Register File (2R,2W), ALU, Br Logic (Resolve Branch), LAQ, SAQ, SDQ, Data Mem (addr, wdata, rdata), ROB, Commit.]
single issue; 6-stage; full branch speculation (BHT); magic, 1-cycle memory (no caches); no bypasses; no floating point; no exceptions
Memory Ordering in the 21264
To execute the critical instruction path quickly, want to execute loads ASAP.
Initially, loads speculatively bypass stores.
On a misspeculation, set a wait bit for that load's PC, so it will behave conservatively from then on.
Clear wait bits periodically.
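The wait-bit scheme described above can be sketched as a small PC-indexed table. This Python model is illustrative only: the class `StoreWaitTable`, the default table size, and the index function are assumptions, not the 21264's documented parameters.

```python
class StoreWaitTable:
    def __init__(self, entries=1024):
        self.entries = entries            # assumed size, for illustration
        self.wait = [False] * entries     # one wait bit per table entry

    def _index(self, pc):
        # Assumed hash: drop the 2 low bits of a 4-byte-aligned PC.
        return (pc >> 2) % self.entries

    def may_bypass_stores(self, pc):
        """Loads speculate past older stores unless their wait bit is set."""
        return not self.wait[self._index(pc)]

    def record_misspeculation(self, pc):
        """This load was caught reading too early: be conservative next time."""
        self.wait[self._index(pc)] = True

    def periodic_clear(self):
        """Clear all bits so stale entries do not penalize loads forever."""
        self.wait = [False] * self.entries


table = StoreWaitTable(entries=16)
first_try = table.may_bypass_stores(0x400)    # initially speculate
table.record_misspeculation(0x400)
second_try = table.may_bypass_stores(0x400)   # now wait for older stores
table.periodic_clear()
after_clear = table.may_bypass_stores(0x400)  # speculation allowed again
```

Because the table is indexed by a hash of the PC, two different loads can share a wait bit; the periodic clear limits how long such aliasing (or a dependence that no longer exists) slows a load down.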
Speculation in the 21264
What does the 21264 speculate on?
Next I$ line/way
Branches, indirect jumps
Exceptions
Load/Store ordering
Load hit/miss: shortens hit time by a cycle
Anything else?
Pentium
http://www.cs.clemson.edu/~mark/330/p6.html
Pentium processor
Questions?