CS 152, Spring 2011 Section 8


1 CS 152, Spring 2011 Section 8 Christopher Celio University of California, Berkeley

2 Agenda Grades Upcoming Quiz 3 What it covers OOO processors VLIW Branch Prediction

3 Intel Core 2 Duo (Penryn) Vs. NVidia GTX 280 Intel Core 2 Duo (Penryn) dual-core nm 410 million transistors ~2GHz 3 or 6MB of cache Watts 107mm² NVidia GTX 280 each core is 22mm² L2 SRAM is 6mm²/MB 10 core(?) (240 stream processors) nm 1.4 Billion transistors 576mm MHz(core clock) 236 Watts!!!

4 Grades Department guidelines: Average GPA Class Average: 75% Class Standard Deviation: 11.5% Homework: 15% Labs: 35% Quizzes: 50%

5 Quiz 3
- superscalar pipelines (in-order & out-of-order)
- out-of-order processors
- VLIW
- what are the different stages? What is done in each stage (e.g., what resources are allocated in decode?)
- register renaming
- explicit versus implicit register renaming designs
- when to allocate registers, when to free registers
- ROBs, instruction windows
- data-in-ROB versus data-not-in-ROB versus split ROB/instruction window designs
- branches and exceptions... how are they handled?
- Load/Store Queues
- when can stores, loads be fired to memory?
- software instruction re-ordering
- loop unrolling
- software pipelining
- how code will get scheduled on different pipelines
- branch prediction
- BHTs, BTBs, 2-bit counters, local history, global history, tournament branch predictors
- when can you make predictions? When do you learn prediction was wrong?

6 Out of Order Processors <lots of drawing on the board here>

7 Out-of-Order Control Complexity: MIPS R10000 Control Logic [ SGI/MIPS Technologies Inc., 1995 ] March 14, 2011 CS152, Spring

8 Out of Order Processors Yeager. The MIPS R10000 Superscalar Microprocessor. IEEE Micro. 1996

9 Out of Order Processors

10 OOO Styles

11 Data-in-ROB Design (HP PA8000, Intel Pentium Pro, Core2 Duo & Nehalem)
- Register file holds only committed state.
- On dispatch into ROB, ready sources can be in regfile or in ROB dest (copied into src1/src2 if ready before dispatch).
- On completion, write to dest field and broadcast to src fields.
- On issue, read from ROB src fields.
[Figure: reorder buffer with entries t1 .. tn (fields: Ins#, use, exec, op, p1, src1, p2, src2, pd, dest, data), Load Unit, FUs, Store Unit, Commit, and the <t, result> broadcast.]
March 9, 2011 CS152, Spring

12 Unified Physical Register File (MIPS R10K, Alpha 21264, Intel Pentium 4 & Sandy Bridge)
- Rename all architectural registers into a single physical register file during decode; no register values are read.
- Functional units read and write a single unified register file, holding committed and temporary registers, in execute.
- Commit only updates the mapping of architectural register to physical register; no data movement.
[Figure: Decode Stage Register Mapping, Unified Physical Register File (read operands at issue, write results at completion), Functional Units, Committed Register Mapping.]
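The "commit only updates mapping" point can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the R10K or 21264 logic: the class name `UnifiedPRF`, its method names, and the register counts are all hypothetical, and recovery from mispredictions is not modeled.

```python
from collections import deque

class UnifiedPRF:
    """Toy unified physical register file (all names and sizes are assumptions)."""

    def __init__(self, num_arch=4, num_phys=8):
        self.values = [0] * num_phys                 # the single physical register file
        self.rename_map = list(range(num_arch))      # speculative mapping, updated at decode
        self.committed_map = list(range(num_arch))   # architectural mapping, updated at commit
        self.free_list = deque(range(num_arch, num_phys))

    def decode(self, arch_dest):
        # Allocate a fresh physical register; no register values are read here.
        phys = self.free_list.popleft()
        self.rename_map[arch_dest] = phys
        return phys

    def complete(self, phys, result):
        # Functional units write results straight into the unified file.
        self.values[phys] = result

    def commit(self, arch_dest, phys):
        # Commit moves no data: it repoints the architectural register
        # and frees the physical register that held the old committed value.
        old = self.committed_map[arch_dest]
        self.committed_map[arch_dest] = phys
        self.free_list.append(old)

    def read_committed(self, arch_reg):
        return self.values[self.committed_map[arch_reg]]

prf = UnifiedPRF()
p = prf.decode(1)                     # rename a write to architectural register 1
prf.complete(p, 42)                   # result is written, but not yet committed
before_commit = prf.read_committed(1)
prf.commit(1, p)
after_commit = prf.read_committed(1)
print(before_commit, after_commit)
```

Before commit the architectural view still sees the old value; after commit it sees the new one, even though the value itself never moved.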

13 DEC Alpha 21264
- 1996/1997
- single-core
- 4-way out-of-order
- highly speculative
- 7-stage
- up to 80 instructions in flight
- tournament branch predictor
- 15.2M transistors (6M for logic; rest is caching, history tables)
- 350 nm, 600 MHz
- 64KB I$, 64KB D$ (on-chip)
- 1 to 16MB L2$ (off-chip)
- 314mm² die (fairly large)

14 DEC Alpha 21264

15 21264 Register Renaming Registers are renamed, then instructions are inserted into the issue queue Map table backed up on every in- flight insn

16 21264 Register Renaming What hazards does renaming obviate? In what situations is renaming useful? If you had to choose between branch prediction and renaming, which would you pick?

17 21264 Register Renaming What hazards does renaming obviate? WAR, WAW In what situations is renaming useful? If you had to choose between branch prediction and renaming, which would you pick?

18 21264 Register Renaming What hazards does renaming obviate? WAR, WAW In what situations is renaming useful? Code with ILP and name dependencies: loops If you had to choose between branch prediction and renaming, which would you pick?

19 21264 Register Renaming What hazards does renaming obviate? WAR, WAW In what situations is renaming useful? Code with ILP and name dependencies: loops If you had to choose between branch prediction and renaming, which would you pick? Not much ILP within a basic block, so renaming isn't too useful without branch prediction
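To make the WAR/WAW answer concrete, here is a minimal Python sketch of explicit renaming with a map table and a free list. The `Renamer` class, the register names, and the register counts are hypothetical, and freeing registers at commit is deliberately left out.

```python
from collections import deque

class Renamer:
    """Toy explicit register renamer: map table plus free list (hypothetical names)."""

    def __init__(self, num_arch, num_phys):
        # Initially architectural register r_i lives in physical register p_i.
        self.map_table = {f"r{i}": f"p{i}" for i in range(num_arch)}
        self.free_list = deque(f"p{i}" for i in range(num_arch, num_phys))

    def rename(self, dest, srcs):
        # Sources read the current mapping, so true (RAW) dependencies are preserved.
        phys_srcs = [self.map_table[s] for s in srcs]
        # Every write gets a fresh physical register, which removes WAR and WAW hazards.
        phys_dest = self.free_list.popleft()
        self.map_table[dest] = phys_dest
        return phys_dest, phys_srcs

program = [
    ("r1", ["r2", "r3"]),  # r1 = r2 op r3
    ("r4", ["r1"]),        # r4 = f(r1): RAW on r1
    ("r1", ["r5"]),        # r1 = g(r5): WAW with insn 0, WAR with insn 1
]

renamer = Renamer(num_arch=8, num_phys=16)
renamed = [renamer.rename(dest, srcs) for dest, srcs in program]
for phys_dest, phys_srcs in renamed:
    print(phys_dest, "<-", phys_srcs)
```

After renaming, the third instruction writes a different physical register than the first, so it can execute before or alongside the first two; only the RAW dependence between the first and second remains.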

20 21264 Superscalar Execution Couldn't fit full bypassing into one clock cycle. Instead, they fully bypass within each of two clusters; inter-cluster bypass takes another cycle.

21 21264 Instruction Reordering As mentioned earlier, uses explicit renaming, as opposed to data-in-ROB design. What does ROB hold?

22 Memory Ordering in the 21264
- To execute the critical instruction path quickly, want to execute loads ASAP.
- Initially, loads speculatively bypass stores.
- On a misspeculation, set a wait bit for that load's PC, so it will behave conservatively from then on.
- Clear wait bits periodically.
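The wait-bit policy above can be sketched as a small table indexed by load PC. This is an illustrative model only: the table size, the indexing function, and the class and method names are assumptions, not details of the 21264.

```python
class LoadWaitTable:
    """Toy model of per-PC wait bits for load/store ordering speculation."""

    def __init__(self, entries=1024):
        self.entries = entries           # table size is an assumption
        self.wait = [False] * entries

    def _index(self, pc):
        return (pc >> 2) % self.entries  # drop the byte offset of 4-byte instructions

    def may_bypass_stores(self, pc):
        # A load issues ahead of older stores unless its wait bit is set.
        return not self.wait[self._index(pc)]

    def record_misspeculation(self, pc):
        # The load bypassed a store it depended on; be conservative from now on.
        self.wait[self._index(pc)] = True

    def periodic_clear(self):
        # Clearing gives loads whose behavior has changed another chance to speculate.
        self.wait = [False] * self.entries

table = LoadWaitTable()
first = table.may_bypass_stores(0x1000)         # speculate initially
table.record_misspeculation(0x1000)
after_trap = table.may_bypass_stores(0x1000)    # now waits for older stores
table.periodic_clear()
after_clear = table.may_bypass_stores(0x1000)   # allowed to speculate again
print(first, after_trap, after_clear)
```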

23 Speculation in the 21264
What does the 21264 speculate on?
- Next I$ line/way
- Branches, indirect jumps
- Exceptions
- Load/Store ordering
- Load hit/miss
  - Shortens hit time by a cycle
- Anything else?

24 Question: Stores
- When are stores sent to memory? At commit time.
- Why are stores saved in a store buffer before commit time? So they can be forwarded to dependent loads.
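Both answers can be seen in a minimal store-buffer sketch. It is a simplification under stated assumptions: the `StoreBuffer` class is hypothetical, memory is a plain dictionary, addresses are exact-match only, and every buffered store is assumed to be older than the load that queries it.

```python
class StoreBuffer:
    """Toy store buffer: stores wait here until commit and forward to later loads."""

    def __init__(self, memory):
        self.memory = memory   # dict: address -> value (committed state)
        self.pending = []      # (address, value) in program order, oldest first

    def execute_store(self, addr, value):
        # Speculative store: buffered, memory is not touched yet.
        self.pending.append((addr, value))

    def execute_load(self, addr):
        # Forward from the youngest buffered store to the same address, if any.
        for st_addr, st_value in reversed(self.pending):
            if st_addr == addr:
                return st_value
        return self.memory.get(addr, 0)

    def commit_oldest_store(self):
        # At commit the store is no longer speculative and updates memory.
        addr, value = self.pending.pop(0)
        self.memory[addr] = value

    def squash(self):
        # On a misprediction or exception, uncommitted stores are discarded.
        self.pending.clear()

memory = {0x10: 1}
sb = StoreBuffer(memory)
sb.execute_store(0x10, 7)
forwarded = sb.execute_load(0x10)       # the load sees the buffered store
memory_before_commit = memory[0x10]     # memory still holds the old value
sb.commit_oldest_store()
memory_after_commit = memory[0x10]
print(forwarded, memory_before_commit, memory_after_commit)
```

Because memory is only written at commit, squashing a wrong-path store never requires undoing a memory update.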

25 VLIW: Very Long Instruction Word
Instruction format: | Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2 |
- Two Integer Units, Single Cycle Latency
- Two Load/Store Units, Three Cycle Latency
- Two Floating-Point Units, Four Cycle Latency
- Multiple operations packed into one instruction
- Each operation slot is for a fixed function
- Constant operation latencies are specified
- Architecture requires guarantee of:
  - Parallelism within an instruction => no cross-operation RAW check
  - No data use before data ready => no data interlocks
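Since the hardware has no interlocks, the compiler must place each operation far enough after its producers. The following Python sketch shows one simple way to do that, a greedy in-order scheduler using the slot counts and latencies from the slide. The `schedule` function, the operation names, and the input format are hypothetical, and only true (RAW) dependencies are tracked.

```python
LATENCY = {"int": 1, "mem": 3, "fp": 4}           # cycles, from the slide
SLOTS_PER_BUNDLE = {"int": 2, "mem": 2, "fp": 2}  # two slots of each kind per instruction

def schedule(ops):
    """ops: list of (name, unit, [producer names]) in program order.
    Returns {name: issue cycle}, where each cycle is one VLIW instruction."""
    issue = {}
    unit_of = {}
    used = {}  # (cycle, unit) -> number of slots already taken
    for name, unit, deps in ops:
        # No data use before data ready: wait out each producer's full latency.
        cycle = max((issue[d] + LATENCY[unit_of[d]] for d in deps), default=0)
        # Each slot has a fixed function, so look for a free slot of the right kind.
        while used.get((cycle, unit), 0) >= SLOTS_PER_BUNDLE[unit]:
            cycle += 1
        used[(cycle, unit)] = used.get((cycle, unit), 0) + 1
        issue[name] = cycle
        unit_of[name] = unit
    return issue

ops = [
    ("ld_a", "mem", []),
    ("ld_b", "mem", []),
    ("ld_c", "mem", []),                 # third memory op spills to the next instruction
    ("fadd", "fp", ["ld_a", "ld_b"]),    # must wait 3 cycles for the loads
    ("addi", "int", []),
    ("st", "mem", ["fadd"]),             # must wait 4 cycles for the FP add
]
issue_cycles = schedule(ops)
print(issue_cycles)
```

Cycles with no operations scheduled correspond to instructions the compiler would fill with NOPs, which is why loop unrolling and software pipelining matter for VLIW code density.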

26 Branch Predictors

27 Branch Predictors
- 2-bit predictor
- branch history table (BHT): a table of 2-bit predictors; predicts taken/not taken
- branch target buffer (BTB): predicts target; typically a table of <PC,target> pairs
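A BHT of 2-bit saturating counters can be sketched as follows. The table size, the PC indexing, and the initial counter value are assumptions made for illustration.

```python
class BranchHistoryTable:
    """Toy BHT: a table of 2-bit saturating counters indexed by PC bits."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        # Counter values 0..3; start weakly not-taken (an assumption).
        self.counters = [1] * (1 << index_bits)

    def _index(self, pc):
        return (pc >> 2) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 predict taken; 0 and 1 predict not taken.
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

# A loop branch that is taken 9 times and then falls through, run 3 times.
bht = BranchHistoryTable()
outcomes = ([True] * 9 + [False]) * 3
mispredicts = 0
for taken in outcomes:
    if bht.predict(0x400) != taken:
        mispredicts += 1
    bht.update(0x400, taken)
print(mispredicts, "of", len(outcomes))
```

The two-bit hysteresis means the single not-taken outcome at loop exit costs one misprediction but does not flip the prediction for the next run of the loop.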

28 L12-29 Branch Target Buffer (BTB)
- 2^k-entry direct-mapped BTB (can also be associative)
- Keep both the branch PC and target PC in the BTB
- PC+4 is fetched if match fails
- Only taken branches and jumps held in BTB
- Next PC determined before branch fetched and decoded
[Figure: the PC goes to the I-Cache and indexes the BTB with k bits; each BTB entry holds Entry PC, Valid, and predicted target PC; outputs are match, valid, and target.]
October 20, 2010 Emer
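The BTB behavior described above fits in a short sketch. The class and method names are hypothetical, the full branch PC is stored as the tag, and the default of 16 entries is chosen only to keep the example small.

```python
class BranchTargetBuffer:
    """Toy 2^k-entry direct-mapped BTB holding (branch PC, target PC) pairs."""

    def __init__(self, k=4):
        self.k = k
        self.entries = [None] * (1 << k)   # each entry: (branch_pc, target_pc) or None

    def _index(self, pc):
        return (pc >> 2) & ((1 << self.k) - 1)

    def next_fetch_pc(self, pc):
        # Looked up with only the fetch PC, before the instruction is decoded.
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == pc:
            return entry[1]    # match: redirect fetch to the predicted target
        return pc + 4          # match fails: fetch PC+4

    def update(self, pc, taken, target):
        i = self._index(pc)
        if taken:
            self.entries[i] = (pc, target)
        elif self.entries[i] is not None and self.entries[i][0] == pc:
            self.entries[i] = None   # only taken branches and jumps are held

btb = BranchTargetBuffer(k=4)
cold = btb.next_fetch_pc(0x100)      # nothing learned yet
btb.update(0x100, True, 0x40)        # branch at 0x100 resolved taken to 0x40
warm = btb.next_fetch_pc(0x100)
alias = btb.next_fetch_pc(0x140)     # same index as 0x100, but the stored PC differs
print(hex(cold), hex(warm), hex(alias))
```

Storing the branch PC alongside the target is what keeps the aliasing lookup at 0x140 from being redirected by the entry for 0x100.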

29 L12-31 Combining BTB and BHT
- BTB entries are considerably more expensive than BHT entries, but can redirect fetches at an earlier stage in the pipeline and can accelerate indirect branches (JR).
- BHT can hold many more entries and is more accurate.
- BHT in a later pipeline stage corrects when BTB misses a predicted taken branch.
- BTB/BHT only updated after branch resolves in E stage.
Pipeline stages (BTB consulted early, BHT in a later stage):
A  PC Generation/Mux
P  Instruction Fetch Stage 1
F  Instruction Fetch Stage 2
B  Branch Address Calc/Begin Decode
I  Complete Decode
J  Steer Instructions to Functional units
R  Register File Read
E  Integer Execute

30 L12-35 Overview of branch prediction
- Best predictors reflect program behavior.
- PC stage (BTB): need next PC immediately; tight loop.
- Decode stage (BP, JMP, Ret): instr type, PC relative targets available; loose loop.
- Reg Read stage: simple conditions, register targets available; loose loop.
- Execute stage: complex conditions available; loose loop.
- Must speculation check always be correct? No

31 Branch Prediction uses both! (tournament predictor)
[Figure: PC, Local History Table, Branch History Table (local prediction), Global History, Branch History Table (global prediction), Tournament Predictor.]

32 Tournament Branch Predictor (Alpha 21264) L12-24
- Local history table (1,024x10b)
- Local prediction (1,024x3b)
- Global prediction (4,096x2b)
- Choice prediction (4,096x2b)
- Global history (12b)
- Choice predictor learns whether best to use local or global branch history in predicting next branch
- Global history is speculatively updated but restored on mispredict
- Claim % success on range of applications
[Figure labels: PC, Prediction]
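The table sizes above map onto a compact Python sketch of a tournament predictor. Several details are assumptions made for illustration: the initial counter values, the convention that a choice counter of 2 or 3 selects the global component, training the choice counter only when the two components disagree, and updating history at branch resolution rather than speculatively (so the restore-on-mispredict step is not modeled).

```python
def _bump(counter, up, maximum):
    """Saturating increment or decrement."""
    return min(maximum, counter + 1) if up else max(0, counter - 1)

class TournamentPredictor:
    def __init__(self):
        self.local_history = [0] * 1024     # 1,024 x 10b, indexed by PC
        self.local_counters = [3] * 1024    # 1,024 x 3b, indexed by local history
        self.global_counters = [1] * 4096   # 4,096 x 2b, indexed by global history
        self.choice_counters = [1] * 4096   # 4,096 x 2b; 2 or 3 means "use global" (assumption)
        self.global_history = 0             # 12b

    def _components(self, pc):
        lh = self.local_history[(pc >> 2) % 1024]
        local_pred = self.local_counters[lh] >= 4
        global_pred = self.global_counters[self.global_history] >= 2
        return lh, local_pred, global_pred

    def predict(self, pc):
        _, local_pred, global_pred = self._components(pc)
        use_global = self.choice_counters[self.global_history] >= 2
        return global_pred if use_global else local_pred

    def update(self, pc, taken):
        lh, local_pred, global_pred = self._components(pc)
        gh = self.global_history
        # Train the chooser only when the components disagree (assumption).
        if local_pred != global_pred:
            self.choice_counters[gh] = _bump(self.choice_counters[gh], global_pred == taken, 3)
        self.local_counters[lh] = _bump(self.local_counters[lh], taken, 7)
        self.global_counters[gh] = _bump(self.global_counters[gh], taken, 3)
        # Shift the outcome into both histories, keeping 10 and 12 bits.
        self.local_history[(pc >> 2) % 1024] = ((lh << 1) | int(taken)) & 0x3FF
        self.global_history = ((gh << 1) | int(taken)) & 0xFFF

# A branch that strictly alternates taken / not taken.
predictor = TournamentPredictor()
late_mispredicts = 0
for i in range(200):
    taken = (i % 2 == 0)
    if i >= 100 and predictor.predict(0x400) != taken:
        late_mispredicts += 1
    predictor.update(0x400, taken)
print(late_mispredicts)
```

An alternating branch is a pattern that history-based components learn once their histories fill, whereas a lone 2-bit counter keeps getting it wrong; the example counts mispredictions only after a warm-up period.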

33 Questions?

34 Questions? ~mark/330/p6.html Pentium processor

35 Acknowledgements
These slides contain material developed and copyright by:
- Arvind (MIT)
- Krste Asanovic (MIT/UCB)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- David Patterson (UCB)
MIT material derived from course
UCB material derived from course CS252, CS152
