CS 152 Computer Architecture and Engineering
Lecture 6: Superpipelining + Branch Prediction
2014-2-6
John Lazzaro (not a prof - John is always OK)
TA: Eric Love
www-inst.eecs.berkeley.edu/~cs152/
CS 152: L6: Superpipelining + Branch Prediction, UC Regents Spring 2014 UCB
Today: First advanced processor lecture
Super-pipelining: Beyond 5 stages.
Short break.
Branch prediction: Can we escape control hazards in long CPU pipelines?
From Appendix C: Filling the branch delay slot
Superpipelining
CS 194-6 L9: Advanced Processors I, UC Regents Fall 2008 UCB
5-Stage Pipeline: A point of departure
Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
[Datapath: IM, Reg, ALU, DM, Reg]
At best, the 5-stage pipeline executes one instruction per clock, with a clock period determined by the slowest stage.
Assumes: perfect caching; no multi-cycle instructions (ex: multiply with an accumulate register); filling all delay slots (branch, load).
Superpipelining: Add more stages (today!)
Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
Goal: Reduce the critical path (the Seconds/Cycle term) by adding more pipeline stages. (Also, power!)
Example: the 8-stage ARM XScale adds extra IF, ID, and data cache stages.
Difficulties: Added penalties for load delays and branch misses.
Ultimate limiter: As logic delay goes to 0, FF clk-to-Q and setup times remain.
Note: Some stages now overlap, and some instructions take extra stages.
5-stage: IF, ID+RF, EX, MEM, WB (IM, Reg, ALU, DM, Reg).
8-stage: IF now takes 2 stages (pipelined I-cache); ID and RF each get a stage; the ALU is split over 3 stages; MEM takes 2 stages (pipelined D-cache).
Superpipelining techniques...
- Split ALU and decode logic over several pipeline stages.
- Pipeline memory: Use more banks of smaller arrays; add pipeline stages between decoders and muxes.
- Remove rarely-used forwarding networks that are on the critical path. (Creates stalls, affects CPI.)
- Pipeline the wires of frequently-used forwarding networks.
- Also: clocking tricks (example: use positive-edge AND negative-edge flip-flops).
Recall: IBM POWER timing closure
Pipeline engineering happens here... about 1/3 of the project schedule.
From "The circuit and physical design of the POWER4 microprocessor", IBM J. Res. and Dev., 46:1, Jan 2002, J.D. Warnock et al.
Pipelining a 256-byte instruction memory
Fully combinational (and slow). Only read behavior shown.
A7-A0: 8-bit read address.
Three high bits (A7-A5) feed a DEMUX that asserts one OE line, selecting one of 8 registers (Byte 0-31, Byte 32-63, ..., Byte 224-255); each register holds 32 bytes (256 bits) and has tri-state Q outputs.
Three more bits (A4-A2) feed a MUX that selects the 32-bit data output D0-D31 (i.e. 4 bytes).
Can we add two pipeline stages?
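The bank-then-mux read path on this slide can be sketched in software. A minimal Python model (class and method names are my own, not from the lecture); the comments mark where the two pipeline registers would go:

```python
class PipelinedIMem:
    """256-byte memory: 8 banks of 32 bytes, 32-bit (4-byte) output."""

    def __init__(self, contents):
        assert len(contents) == 256
        # One "register" of 32 bytes (256 bits) per bank.
        self.banks = [contents[i * 32:(i + 1) * 32] for i in range(8)]

    def read(self, addr):
        # Stage 1: A7-A5 demux selects one of 8 banks.
        # (A pipeline register could capture the selected bank here.)
        bank = self.banks[(addr >> 5) & 0x7]
        # Stage 2: the selected bank drives its contents onto the bus.
        # (A second pipeline register could capture the 256 bits here.)
        word_sel = (addr >> 2) & 0x7   # A4-A2: one of 8 words in the bank
        # Stage 3: mux picks the 32-bit word D0-D31.
        return bytes(bank[word_sel * 4:word_sel * 4 + 4])


mem = PipelinedIMem(bytes(range(256)))
print(mem.read(0x00).hex())  # bytes 0..3 of the memory
print(mem.read(0xE4).hex())  # a word from the last bank (Byte 224-255)
```

With the two registers inserted at the marked points, each read takes three cycles of latency but a new read can start every cycle.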
On a chip: Registers become SRAM cells
Architects specify the number of rows and columns. Word and bit lines slow down as the array grows larger!
[Figure: 4-bit-wide SRAM array. An address decoder (A0-A3) drives word lines Word 0 ... Word 15; each bit column has a write driver (Din 3-0), a precharger, and a sense amp (Dout 3-0), with Precharge and WrEn controls and parallel data I/O lines. Add muxes at the outputs to select a subset of bits.]
How could we pipeline this memory? See the last slide.
Q: Which is longer: word line or bit line?
RISC CPU: 5.85 million devices / 0.65 million devices
IC processes are optimized for small SRAM cells
From a Marvell ARM CPU paper: 90% of the 6.5 million transistors, and 60% of the chip area, is devoted to cache memories.
Implication? SRAM is 6X as dense as logic.
RAM Compilers
On average, 30% of a modern logic chip is SRAM, which is generated by RAM compilers. Compile-time parameters set the number of bits, aspect ratio, ports, etc.
[Fig. 1: 45-degree image of a 22 nm tri-gate LVC SRAM bitcell, showing thin silicon fins wrapped on three sides by a polysilicon gate. Fig. 2: 22 nm HDC and LVC SRAM bitcells. The 22 nm process includes both a high-density 0.092 um^2 6T bitcell (HDC) and a low-voltage 0.108 um^2 6T bitcell (LVC), trading off density, performance, and minimum operating voltage across a range of application requirements.]
CS 250 L1: Fab/Design Interface, UC Regents Fall 2013 UCB
ALU: Pipelining unsigned multiply
[Worked example, garbled in the source: long multiplication of two 4-bit binary numbers, summing shifted copies of the multiplicand.]
Facts to remember:
m bits x n bits = m+n bit product.
Binary makes it easy: 0 => place 0 (0 x multiplicand); 1 => place a copy (1 x multiplicand).
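The place-0-or-place-a-copy rule above is just shift-and-add. A short Python sketch (function name is mine, not from the slides):

```python
def unsigned_multiply(multiplicand, multiplier, n_bits=4):
    """Shift-and-add unsigned multiply: each multiplier bit either
    contributes 0 or a shifted copy of the multiplicand."""
    product = 0
    for i in range(n_bits):
        if (multiplier >> i) & 1:          # bit is 1: add a shifted copy
            product += multiplicand << i   # bit is 0: add nothing
    return product                         # fits in m+n bits


# 4 bits x 4 bits = 8-bit product:
print(unsigned_multiply(0b1101, 0b1011))  # 13 * 11 = 143
```

Even the largest 4x4 case (15 x 15 = 225) fits in 8 bits, illustrating the m+n-bit product rule.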
Building block: Full-adder variant
1-bit signals: x, y, z, s, Cin, Cout.
x: one bit of the multiplicand. y: one bit of the running sum. z: one bit of the multiplier.
If z = 1, {Cout, s} <= x + y + Cin.
If z = 0, {Cout, s} <= y + Cin.
Verilog: a 2-bit entity, assign.
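The two cases above collapse to one expression: z gates x before a normal full add. A tiny Python model of the cell's truth function (name is mine):

```python
def fa_variant(x, y, z, cin):
    """Full-adder variant: multiplier bit z masks multiplicand bit x,
    then a normal full add produces (Cout, s)."""
    total = (x & z) + y + cin           # z=0 reduces to y + cin
    return (total >> 1) & 1, total & 1  # ({Cout}, {s})


print(fa_variant(1, 1, 1, 1))  # z=1: x+y+cin = 3 -> (1, 1)
print(fa_variant(1, 1, 0, 0))  # z=0: just y+cin = 1 -> (0, 1)
```

This matches the slide's Verilog suggestion: a 2-bit `{Cout, s}` result computed by a single assign.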
Put it together: An array computes P = A x B
A fully combinational implementation is slow!
To pipeline the array: place registers between adder stages, and add registers to delay selected A and B bits (not shown).
[Figure: 4x4 array of full-adder cells. Multiplicand bits A3-A0 and multiplier bits B0-B3 enter the array (top-row sum inputs are 0), producing product bits P7-P0.]
Adding pipeline stages is not enough...
MIPS R4000: Simple 8-stage pipeline. 2-cycle load delay, 3-cycle branch delay. (Appendix C, Figure C.52)
Branch stalls are the main reason why pipeline CPI > 1.
Branch Prediction
Add pipeline stages, reduce the clock period
Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
Q. Could adding pipeline stages hurt the CPI for an application? A. Yes, due to these problems:
CPI problem: Taken branches cause longer stalls. Possible solution: Branch prediction, loop unrolling.
CPI problem: Cache misses take more clock cycles. Possible solution: Larger caches; add prefetch opcodes to the ISA.
(Example: the 8-stage ARM XScale.)
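The trade-off in the iron-law equation above can be made concrete with a back-of-the-envelope calculation. All the numbers below (branch frequency, penalties, cycle times) are illustrative assumptions, not measurements from the lecture:

```python
def exec_time(n_insts, base_cpi, branch_frac, penalty, cycle_ns):
    """Iron law with a branch-stall term: CPI grows with the
    per-branch penalty, total time = insts * CPI * cycle."""
    cpi = base_cpi + branch_frac * penalty
    return n_insts * cpi * cycle_ns


# Assumed workload: 1M instructions, 20% branches.
# 5-stage pipe: 1-cycle branch penalty, 2.0 ns clock.
t5 = exec_time(1e6, 1.0, 0.20, 1, 2.0)
# 8-stage pipe: 3-cycle branch penalty, but the clock only
# improves to 1.6 ns (stage overheads rarely scale ideally).
t8 = exec_time(1e6, 1.0, 0.20, 3, 1.6)
print(t5, t8)  # here the deeper pipe is slower overall
```

With these assumed numbers the 8-stage design loses: its CPI rises from 1.2 to 1.6, which outweighs the 20% faster clock. That is exactly why the slide pairs deeper pipelines with branch prediction.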
Recall: Control hazards...
[Datapath: IF (Fetch), ID (Decode), EX (ALU), MEM, WB; PC updated via +4; I-cache instruction memory.]
We avoid stalling by (1) adding a branch delay slot, and (2) adding a comparator to the ID stage. If we add more early stages, we must stall.
Sample program (ISA w/o branch delay slot):
I1: BEQ R4,R3,25
I2: AND R6,R5,R4
I3: SUB R1,R9,R8
The EX stage computes whether the branch is taken. If the branch is taken, the instructions fetched after it MUST NOT complete!
Solution: Branch prediction...
We update the PC based on the outputs of the branch predictor. If it is perfect, the pipe stays full!
Dynamic predictors: a cache of branch history. For each fetch, the predictor answers: A control instr? Taken or not taken? If taken, where to? What PC?
If we predicted incorrectly, the wrongly fetched instructions MUST NOT complete!
Branch predictors cache branch history
Address of the branch instruction (BNEZ R1 Loop): 0b0110[...]01001000.
Branch Target Buffer (BTB): 4096 entries; each holds a 30-bit address tag (compared against the fetch address for a hit) and a target address (PC + 4 + Loop, the taken address).
Branch History Table (BHT): 2 state bits per entry: Taken or Not Taken.
At the EX stage, update the BTB/BHT and kill instructions, if necessary.
Drawn as fully associative to focus on the essentials. In real designs, always direct-mapped.
Branch predictor: direct-mapped version
Address of the BNEZ R1 Loop instruction: 0b011[..]010[..]100. The middle 12 bits index 4096 BTB/BHT entries (direct-mapped, as in real life); an 18-bit address tag determines a hit.
The BTB supplies the target address (PC + 4 + Loop, the taken address); the BHT supplies Taken or Not Taken.
Must check the prediction, and kill the instruction if needed. Update the BHT/BTB for next time, once the true behavior is known.
80-90% accurate.
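The direct-mapped lookup above can be sketched as a table indexed by PC bits. A minimal Python model (class and field names are mine; field widths follow the slide: 18-bit tag, 12-bit index, word-aligned 32-bit PCs; the 2-bit BHT is simplified to one taken bit here):

```python
class BTB:
    """Direct-mapped branch target buffer with a per-entry taken bit."""

    def __init__(self):
        self.entries = [None] * 4096

    def _split(self, pc):
        # PC[31:14] = 18-bit tag, PC[13:2] = 12-bit index.
        return pc >> 14, (pc >> 2) & 0xFFF

    def predict(self, pc):
        tag, idx = self._split(pc)
        e = self.entries[idx]
        if e and e["tag"] == tag and e["taken"]:
            return e["target"]   # hit, predicted taken: redirect fetch
        return pc + 4            # miss or predicted not-taken

    def update(self, pc, taken, target):
        # Called once the EX stage knows the true behavior.
        tag, idx = self._split(pc)
        self.entries[idx] = {"tag": tag, "taken": taken, "target": target}


btb = BTB()
print(hex(btb.predict(0x1000)))      # cold miss: fall through to 0x1004
btb.update(0x1000, True, 0x2000)
print(hex(btb.predict(0x1000)))      # hit, predicted taken: 0x2000
```

Two different branches whose PCs share the same 12 index bits would evict each other's entries, which is one source of the 80-90% (rather than higher) accuracy.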
Simple ("2-bit") Branch History Table entry
Bit 1: Prediction for the next branch (1 = take, 0 = not take). Initialize to 0.
Bit 2: Was the last prediction correct? (1 = yes, 0 = no). Initialize to 1.
After we check a prediction: Flip the prediction bit only if the prediction was incorrect AND the last-predict-correct bit was 0. Set the last-predict-correct bit to 1 if the prediction was correct (or if the prediction bit flips), and to 0 if the prediction was incorrect.
We do not change the prediction the first time it is incorrect. Why?
loop: ADDI R4,R0,11
      SUBI R4,R4,1
      BNE R4,R0,loop
This branch is taken 10 times, then not taken once (end of loop). The next time we enter the loop, we would like to predict "take" the first time through.
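The hysteresis described above can be simulated directly. A short Python sketch of the (prediction, last-correct) update rule, run on the loop branch from the slide (function name is mine):

```python
def update(pred, conf, taken):
    """Return the new (prediction, last-correct) pair after
    comparing the prediction against the true outcome `taken`."""
    if pred == taken:
        return pred, 1          # correct: (re)gain confidence
    if conf == 0:
        return 1 - pred, 1      # wrong twice in a row: flip prediction
    return pred, 0              # wrong once: keep prediction, lose conf


# The loop branch: taken 10 times, then not taken, repeated 3 times.
outcomes = ([1] * 10 + [0]) * 3
pred, conf, misses = 1, 1, 0    # warm state: predicting "take"
for taken in outcomes:
    misses += (pred != taken)
    pred, conf = update(pred, conf, taken)
print(misses)  # one miss per loop exit: 3
```

A 1-bit predictor would miss twice per loop (at the exit, and again on re-entry after flipping); the second bit halves that to one miss per loop, which is the answer to the slide's "Why?".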
80-90% accurate: a 4096-entry, 2-bit predictor. (Figure C.19)
Branch Prediction: Trust, but verify...
[Datapath: Instr Fetch, Decode & Reg Fetch, Execute; PC, I-cache, register file, ALU.]
The branch predictor and BTB sit in the fetch stage and supply the predicted PC, answering: A branch instr? Taken or not taken? If taken, where to? What PC?
Note the instruction type and branch target; pass the prediction info down the pipe alongside each instruction.
At Execute: check all predictions. Take actions if needed (kill instructions, update the predictor).
Flowchart control for dynamic branch prediction. (Figure 3.22)
Spatial Predictors
A C code snippet with branches b1, b2, b3. After compilation, we want to predict the b3 branch (a BEQZ). Can b1 and b2 help us predict it?
Idea: Devote hardware to four 2-bit predictors for the BEQZ branch.
P1: Use if b1 and b2 not taken.
P2: Use if b1 taken, b2 not taken.
P3: Use if b1 not taken, b2 taken.
P4: Use if b1 and b2 taken.
Track the current taken/not-taken status of b1 and b2, and use it to choose from P1...P4 for BEQZ. How?
Branch History Register: Tracks global history
A 2-bit shift register (the Branch History Register) holds the taken/not-taken status of the last 2 branches, shifted in as each branch resolves.
We choose which predictor to use (and update) based on the Branch History Register.
Spatial branch predictor (BTB, tag not shown)
The PC of BEQZ R3 L3 (0b0110[...]01001000) maps to an index into four Branch History Tables P1-P4 (2 state bits each).
The Branch History Register, a shift register holding the outcomes of the (aa==2) and (bb==2) branches in the (aa != bb) branch code, drives a mux that chooses which branch predictor supplies Taken or Not Taken.
Detects patterns in the global branch history. Yeh and Patt, 1992. 95% accurate.
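The BHR-selects-a-table structure can be sketched in a few lines. A Python model (names are mine; for brevity it uses standard 2-bit saturating counters rather than the prediction+last-correct encoding from the earlier slide):

```python
class SpatialPredictor:
    """Global 2-bit history selects one of four 2-bit counters
    per branch-table entry, as in the slide's P1..P4 mux."""

    def __init__(self, entries=1024):
        self.bhr = 0   # taken/not-taken status of the last 2 branches
        self.counters = [[1] * 4 for _ in range(entries)]

    def predict(self, idx):
        # Counter values 2, 3 predict taken; 0, 1 predict not taken.
        return self.counters[idx][self.bhr] >= 2

    def update(self, idx, taken):
        c = self.counters[idx][self.bhr]
        self.counters[idx][self.bhr] = min(3, c + 1) if taken else max(0, c - 1)
        self.bhr = ((self.bhr << 1) | taken) & 0x3  # shift in the outcome


# A branch whose outcome repeats taken, taken, not-taken: each 2-bit
# history maps to a fixed next outcome, so the predictor converges.
p = SpatialPredictor()
pattern = [1, 1, 0] * 20
misses = 0
for i, taken in enumerate(pattern):
    if i >= 30:   # after warm-up, count mispredictions
        misses += (p.predict(5) != bool(taken))
    p.update(5, taken)
print(misses)  # the repeating pattern is fully captured: 0
```

A single 2-bit entry cannot learn this 1,1,0 pattern (it mispredicts every third outcome), which is the "detects patterns" payoff the slide claims for the spatial scheme.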
Performance
One BHT (4096 entries) vs. spatial (4 BHTs, each with 1024 entries). Why 4096 vs. 1024? A fair comparison matches the total # of bits.
Spatial: 95% accurate. (Figure 3.3)
For more details on branch prediction: ...
Predict function returns by stacking call info
[Figure 3.24: program counter; alternate branch history tables; branch prediction; return stack; target cache.]
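A BTB mispredicts returns because one return instruction targets many different call sites; a small stack of call info fixes this. A minimal Python sketch of the idea (class name, depth, and overflow policy are my assumptions, not from the figure):

```python
class ReturnStack:
    """Return-address stack: calls push PC+4, returns pop the
    predicted target."""

    def __init__(self, depth=8):
        self.stack, self.depth = [], depth

    def on_call(self, pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)        # overflow: drop the oldest entry
        self.stack.append(pc + 4)    # return address of this call

    def predict_return(self):
        # Pop the most recent return address; None means no prediction.
        return self.stack.pop() if self.stack else None


rs = ReturnStack()
rs.on_call(0x1000)                   # outer call
rs.on_call(0x2000)                   # nested call
print(hex(rs.predict_return()))      # innermost return first: 0x2004
print(hex(rs.predict_return()))      # then the outer return: 0x1004
```

As long as call depth stays within the stack's capacity, returns predict perfectly; deeper recursion wraps around and loses the oldest entries.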
Hardware limits to superpipelining?
FO4: How many fanout-of-4 inverter delays fit in the clock period.
[Chart: CPU clock periods 1985-2005, measured in FO4 delays (0-100), for Intel 386, 486, Pentium, Pentium 2, Pentium 3, Pentium 4, Itanium; Alpha 21064, 21164, 21264; Sparc, SuperSparc, Sparc64; MIPS; HP PA; PowerPC; AMD K6, K7, x86-64. Thanks to Francois Labonte, Stanford.]
MIPS R2000: 5 stages. Pentium Pro: 10 stages. Pentium 4: 20 stages.
Historical limit: about 12 FO4s. Power wall: the Intel Core Duo has 14 stages.
CS 250 L3: Timing, UC Regents Fall 2013 UCB
CPU DB: Recording Microprocessor History
"With this open database, you can mine microprocessor trends over the past 40 years." Andrew Danowitz, Kyle Kelley, James Mao, John P. Stevenson, Mark Horowitz, Stanford University.
[Chart: FO4 delays per cycle for processor designs, 1985-2015.]
FO4 delay per cycle is roughly proportional to the amount of computation completed per cycle.
On Tuesday
We turn our focus to memory system design...
Have a good weekend!