CS 152 Computer Architecture and Engineering


Lecture 6: Superpipelining + Branch Prediction
February 6, 2014
John Lazzaro (not a prof - John is always OK)
TA: Eric Love
www-inst.eecs.berkeley.edu/~cs152/
UC Regents Spring 2014 UCB

Today: Our first advanced-processor lecture.
Super-pipelining: Beyond 5 stages.
Short break.
Branch prediction: Can we escape control hazards in long CPU pipelines?

From Appendix C: Filling the branch delay slot

Superpipelining
(Slides from CS 194-6 L9: Advanced Processors I, UC Regents Fall 2008 UCB)

5-Stage Pipeline: A point of departure

Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)

[Datapath: IM, Reg, ALU, DM, Reg]

At best, the 5-stage pipeline executes one instruction per clock, with a clock period determined by the slowest stage. This assumes perfect caching, a processor with no multi-cycle instructions (ex: multiply with an accumulate register), and filling all delay slots (branch, load).
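The CPU-time equation on the slide can be sketched as a tiny calculator. The instruction count and clock rate below are made-up illustrative values, not numbers from the lecture.

```python
# The "iron law" of CPU performance from the slide, as a tiny calculator.
# All numbers below are illustrative assumptions, not measurements.

def cpu_time(instructions, cpi, clock_hz):
    """Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)."""
    return instructions * cpi * (1.0 / clock_hz)

# Ideal 5-stage pipeline: CPI = 1 (perfect caches, all delay slots filled).
t5 = cpu_time(instructions=1_000_000, cpi=1.0, clock_hz=1e9)   # 1 GHz clock
print(t5)  # 0.001 seconds
```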

Superpipelining: Add more stages. Today!

Goal: Reduce the critical path, Seconds/Cycle, by adding more pipeline stages. (Also, power!)

Example: 8-stage ARM XScale: extra IF, ID, and data cache stages.
Difficulties: Added penalties for load delays and branch misses.
Ultimate limiter: As logic delay goes to 0, FF clk-to-Q and setup times set a floor on the clock period.

Note: Some stages now overlap, and some instructions take extra stages.

5 stages: IF, ID+RF, EX, MEM, WB. In the 8-stage pipeline:
IF now takes 2 stages (pipelined I-cache).
ID and RF each get a stage.
The ALU is split over 3 stages.
MEM takes 2 stages (pipelined D-cache).

Superpipelining techniques...
- Split ALU and decode logic over several pipeline stages.
- Pipeline memory: Use more banks of smaller arrays; add pipeline stages between decoders, muxes.
- Remove rarely-used forwarding networks that are on the critical path. (Creates stalls, affects CPI.)
- Pipeline the wires of frequently-used forwarding networks.
Also: Clocking tricks (example: use positive-edge AND negative-edge flip-flops).

Recall: IBM POWER timing closure. Pipeline engineering happens here... about 1/3 of the project schedule. From "The circuit and physical design of the POWER4 microprocessor," IBM J. Res. and Dev., 46:1, Jan. 2002, J.D. Warnock et al.

Pipelining a 256-byte instruction memory.

Fully combinational (and slow). Only read behavior shown. A7-A0: 8-bit read address. [Figure: A7-A5 drive a 3-bit DEMUX that asserts one OE line, selecting one of eight 32-byte (256-bit) registers via tri-state Q outputs (Byte 0-31, Byte 32-63, ..., Byte 224-255). A4-A2 drive a 3-bit MUX that selects the 32-bit data output D0-D31, i.e. 4 bytes.] Can we add two pipeline stages?
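A behavioral sketch of the slide's question, with two pipeline registers added: one after the bank read (the DEMUX/OE side) and one after the word-select MUX. The class and names are mine, not from the slide; note that the word-select address bits must travel down the pipeline along with the bank data.

```python
# Two-stage pipelined read of a 256-byte memory built from eight 32-byte
# banks, as on the slide. Stage 1 registers the selected bank's output
# (and the word-select bits); stage 2 registers the MUX output.

class PipelinedIMem:
    def __init__(self, data: bytes):
        assert len(data) == 256
        self.banks = [data[i*32:(i+1)*32] for i in range(8)]
        self.stage1 = None   # registered (bank output, word select)
        self.stage2 = None   # registered 4-byte word output

    def clock(self, addr):
        """One clock edge: returns the word for the address presented 2 cycles ago."""
        out = self.stage2
        if self.stage1 is not None:
            bank_data, word_sel = self.stage1
            self.stage2 = bank_data[word_sel*4:(word_sel+1)*4]
        else:
            self.stage2 = None
        # A7-A5 pick the bank; A4-A2 pick the 4-byte word within it.
        self.stage1 = (self.banks[(addr >> 5) & 0x7], (addr >> 2) & 0x7)
        return out

mem = PipelinedIMem(bytes(range(256)))
mem.clock(0)
mem.clock(4)
print(mem.clock(8))   # the word for address 0, fetched 2 cycles ago
```

The two registers cut the combinational path roughly in half, at the cost of a 2-cycle read latency: exactly the load-delay penalty the superpipelining slides warn about.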

On a chip: Registers become SRAM cells. Architects specify the number of rows and columns. Word and bit lines slow down as the array grows larger!

[Figure: SRAM array. An address decoder (A0-A3) drives word lines Word 0 ... Word 15. Each column of SRAM cells has a precharger and write driver on the parallel data I/O lines (Din 3 ... Din 0, WrEn, Precharge) and a sense amp producing Dout 3 ... Dout 0. Muxes after the sense amps select a subset of bits.]

How could we pipeline this memory? See the last slide. Q: Which is longer: the word line or the bit line?

[Die photo: RISC CPU. 5.85 million devices vs. 0.65 million devices.]

IC processes are optimized for small SRAM cells. From a Marvell ARM CPU paper: 90% of the 6.5 million transistors, and 60% of the chip area, is devoted to cache memories. Implication? SRAM is 6x as dense as logic. (90% of the transistors in 60% of the area, versus 10% in 40%, works out to a 6:1 density ratio.)

RAM Compilers

On average, 30% of a modern logic chip is SRAM, which is generated by RAM compilers. Compile-time parameters set the number of bits, aspect ratio, ports, etc.

[Fig. 1: 45-degree image of a 22 nm tri-gate LVC SRAM bitcell, showing the thin silicon fin wrapped on three sides by a polysilicon gate. Fig. 2: 22 nm HDC and LVC tri-gate SRAM bitcells.]

From the source text: operating SRAM at a single low voltage is desirable, to avoid the integration, routing, and control of multiple supply domains. In the 22 nm tri-gate technology, fin quantization limits the fine-grained width tuning conventionally used to balance read stability and write margin, and presents a challenge for designing minimum-area SRAM bitcells. The 22 nm process technology includes both a high-density 0.092 μm² 6T SRAM bitcell (HDC) and a low-voltage 0.108 μm² 6T SRAM bitcell (LVC) to support tradeoffs among density, performance, and minimum operating voltage across a range of application requirements.

(Slide from CS 250 L1: Fab/Design Interface, UC Regents Fall 2013 UCB)

ALU: Pipelining unsigned multiply

Example: 1101 (13) × 1011 (11) = 10001111 (143).

Facts to remember: m bits × n bits = an (m+n)-bit product. Binary makes it easy:
0 => place 0 (0 × multiplicand)
1 => place a copy (1 × multiplicand)
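The partial-product rule above can be sketched in a few lines: each multiplier bit either places 0 or places a shifted copy of the multiplicand.

```python
# Shift-and-add sketch of the slide's unsigned multiply example.

def unsigned_mul(a: int, b: int, n: int = 4) -> int:
    """m-bit x n-bit = (m+n)-bit product, built from shifted partial products."""
    product = 0
    for i in range(n):
        if (b >> i) & 1:          # multiplier bit = 1 -> add 1 x multiplicand
            product += a << i     # partial product, shifted into position
    return product

print(bin(unsigned_mul(0b1101, 0b1011)))  # 0b10001111, i.e. 13 x 11 = 143
```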

Building block: Full-adder variant

1-bit signals: x, y, z, s, Cin, Cout.
x: one bit of the multiplicand. y: one bit of the running sum. z: one bit of the multiplier.
If z = 1, {Cout, s} <= x + y + Cin
If z = 0, {Cout, s} <= y + Cin
(In Verilog: a 2-bit concatenation on the left-hand side of an assign.)

Put it together: The array computes P = A × B

[Figure: a 4×4 array of full-adder cells. B0-B3 feed the z inputs of each row; A0-A3 feed the x inputs; the running sums y ripple downward from 0s injected at the top; product bits P0-P7 emerge along the bottom and right edges.]

A fully combinational implementation is slow! To pipeline the array: place registers between adder stages (in green), and add registers to delay selected A and B bits (not shown).
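A bit-level sketch of the combinational array on the slide, built from the gated full-adder cell of the previous slide. The row-by-row wiring here is one plausible arrangement of the figure, not a netlist taken from it.

```python
# A grid of full-adder cells where the multiplier bit z gates the
# multiplicand bit x, as in the slide's full-adder variant.

def cell(x, y, z, cin):
    """{cout, s} <= (x & z) + y + cin."""
    total = (x & z) + y + cin
    return total >> 1, total & 1      # (cout, s)

def array_multiply(a_bits, b_bits):
    """a_bits, b_bits: lists of bits, LSB first. Returns product bits, LSB first."""
    n = len(a_bits)
    p = []
    row = [0] * n                     # running sum entering the first row
    for z in b_bits:                  # one row per multiplier bit B0..B3
        cin, new_row = 0, []
        for x, y in zip(a_bits, row):
            cin, s = cell(x, y, z, cin)
            new_row.append(s)
        p.append(new_row[0])          # low sum bit drops out as a product bit
        row = new_row[1:] + [cin]     # rest shifts right into the next row
    return p + row                    # remaining row bits are the high product bits

# 13 x 11 = 143: A = 1101b, B = 1011b, both LSB first.
bits = array_multiply([1, 0, 1, 1], [1, 1, 0, 1])
print(sum(b << i for i, b in enumerate(bits)))  # 143
```

Pipelining this array means registering each `row` (and the not-yet-consumed A and B bits) between iterations of the outer loop, which is exactly where the slide draws the green registers.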

Adding pipeline stages is not enough...

MIPS R4000: A simple 8-stage pipeline, with a 2-cycle load delay and a 3-cycle branch delay. (Appendix C, Figure C.52.) Branch stalls are the main reason why pipeline CPI > 1.

Branch Prediction

Add pipeline stages, reduce the clock period.

Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)

Q. Could adding pipeline stages hurt the CPI for an application (ex: the 8-stage ARM XScale)?
A. Yes, due to these problems:

CPI problem: Taken branches cause longer stalls. Possible solution: Branch prediction, loop unrolling.
CPI problem: Cache misses take more clock cycles. Possible solution: Larger caches; add prefetch opcodes to the ISA.
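The Q&A above can be made concrete with a rough CPI model. All the penalty and frequency numbers below are illustrative assumptions, chosen only to show the direction of the effect, not measured XScale or R4000 values.

```python
# Deeper pipelines raise branch and miss penalties (in cycles), so CPI can
# get worse even as the clock gets faster. Numbers are illustrative only.

def effective_cpi(base_cpi, branch_freq, mispredict_rate, branch_penalty,
                  miss_freq, miss_penalty):
    return (base_cpi
            + branch_freq * mispredict_rate * branch_penalty
            + miss_freq * miss_penalty)

shallow = effective_cpi(1.0, 0.15, 0.10, 3, 0.02, 20)   # 5-stage-ish penalties
deep    = effective_cpi(1.0, 0.15, 0.10, 8, 0.02, 40)   # 8-stage-ish penalties
print(shallow, deep)  # the deeper pipeline has the worse CPI
```

Whether the deeper pipeline wins overall depends on whether its shorter cycle time outweighs the CPI loss in the Seconds/Program product.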

Recall: Control hazards...

[5-stage datapath: IF (Fetch), ID (Decode), EX (ALU), MEM, WB; the PC is updated by +4, and the I-cache feeds the IR pipeline registers.]

We avoid stalling by (1) adding a branch delay slot, and (2) adding a comparator to the ID stage. If we add more early stages, we must stall.

Sample program (ISA without a branch delay slot):
I1: BEQ R4,R3,25
I2: AND R6,R5,R4
I3: SUB R1,R9,R8

[Pipeline diagram, times t1-t8: I1 runs IF ID EX MEM WB; I2 reaches IF ID; I3 reaches IF.] The EX stage computes whether the branch is taken. If the branch is taken, the instructions behind it MUST NOT complete!

Solution: Branch prediction...

[Same 5-stage datapath, with a Branch Predictor added beside fetch.] We update the PC based on the outputs of the branch predictor. If it is perfect, the pipe stays full! Dynamic predictors: a cache of branch history, answering three questions about each fetched instruction: Is it a control instruction? Taken or not taken? If taken, where to? What PC?

[Pipeline diagram, times t1-t8, as before.] The EX stage computes whether the branch is taken. If we predicted incorrectly, these instructions MUST NOT complete!

Branch predictors cache branch history.

The address of the branch instruction (ex: 0b0110[...]01001000, a BNEZ R1 Loop at the bottom of a loop) looks up a Branch Target Buffer (BTB) of 4096 entries. Each entry holds a 30-bit address tag (0b0110[...]0010) and a target address (the taken address, Loop; PC + 4 is the fall-through). A matching tag is a hit. Alongside it, a Branch History Table (BHT) holds 2 state bits per entry: taken or not taken.

At the EX stage, update the BTB/BHT and kill instructions, if necessary.

Drawn as fully associative to focus on the essentials. In real designs, always direct-mapped.

Branch predictor: the direct-mapped version.

The address of the BNEZ R1 Loop instruction (0b011[..]010[..]100) is split: the low 12 bits index 4096 BTB/BHT entries (as in real life... direct-mapped), and an 18-bit address tag (0b011[...]01) is compared to detect a hit. The BTB supplies the target address (the taken address, Loop; otherwise PC + 4); the BHT supplies taken or not taken.

Must check the prediction, and kill the instruction if needed. Update the BHT/BTB for next time, once the true behavior is known. 80-90% accurate.
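The index/tag split can be sketched directly. This is a minimal model, assuming word-aligned 32-bit PCs (so the low 2 bits are dropped before indexing); the class and method names are mine.

```python
# Direct-mapped BTB: low 12 PC bits index 4096 entries, upper bits are the tag.

NUM_ENTRIES = 4096          # 12 index bits

def btb_index(pc):
    return (pc >> 2) % NUM_ENTRIES   # drop the byte offset, keep 12 index bits

def btb_tag(pc):
    return pc >> (2 + 12)            # remaining upper address bits

class BTB:
    def __init__(self):
        self.tags = [None] * NUM_ENTRIES
        self.targets = [None] * NUM_ENTRIES

    def predict(self, pc):
        """Return the predicted target on a hit, else None (fall through to PC+4)."""
        i = btb_index(pc)
        return self.targets[i] if self.tags[i] == btb_tag(pc) else None

    def update(self, pc, target):    # at EX, once the true behavior is known
        i = btb_index(pc)
        self.tags[i], self.targets[i] = btb_tag(pc), target

btb = BTB()
btb.update(0x1000, 0x0F00)           # learn: branch at 0x1000 jumps to 0x0F00
print(hex(btb.predict(0x1000)))      # 0xf00
```

Two branches whose PCs differ by a multiple of NUM_ENTRIES words alias to the same entry; the tag comparison is what turns that aliasing into a miss rather than a wrong target.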

Simple ("2-bit") Branch History Table entry

Two bits, held in two flip-flops:
- Prediction for the next branch (1 = take, 0 = not take). Initialize to 0.
- Was the last prediction correct? (1 = yes, 0 = no). Initialize to 1.

After we check a prediction:
- Flip the prediction bit if the prediction was incorrect and the last-predict-correct bit is 0.
- Set the last-predict-correct bit to 1 if the prediction bit was correct, or if the prediction bit just flipped; set it to 0 if the prediction bit was incorrect.

We do not change the prediction the first time it is incorrect. Why?

      ADDI R4,R0,11
loop: SUBI R4,R4,1
      BNE R4,R0,loop

This branch is taken 10 times, then not taken once (end of loop). The next time we enter the loop, we would like to predict "take" the first time through.
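The two-bit entry above can be simulated on the loop branch to see the hysteresis at work. The class is a direct transcription of the slide's update rules, not a generic saturating counter.

```python
# The slide's 2-bit entry: a prediction bit plus a "was the last
# prediction correct?" bit that adds hysteresis.

class TwoBitEntry:
    def __init__(self):
        self.pred = 0     # prediction for the next branch (1 = take); init 0
        self.corr = 1     # was the last prediction correct? init 1

    def predict(self):
        return self.pred

    def update(self, taken):
        if self.pred == taken:
            self.corr = 1             # correct: note it, keep the prediction
        elif self.corr == 0:
            self.pred ^= 1            # flip only on the 2nd straight miss
            self.corr = 1             # "set to 1 if the prediction bit flips"
        else:
            self.corr = 0             # first miss: keep prediction, note it

# The loop branch: taken 10 times, then not taken once, repeated 3 times.
e = TwoBitEntry()
pattern = ([1] * 10 + [0]) * 3
hits = 0
for outcome in pattern:
    hits += (e.predict() == outcome)
    e.update(outcome)
print(hits, "/", len(pattern))       # 28 / 33
```

After warm-up, each loop pass costs exactly one miss (the final not-taken branch), and the entry still predicts "take" when the loop is re-entered, which is the slide's point.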

A 4096-entry, 2-bit predictor is 80-90% accurate. (Figure C.19)

Branch prediction: Trust, but verify...

[Datapath: Instr Fetch, Decode & Reg Fetch, Execute stages.]

In fetch, the Branch Predictor and BTB make predictions: Is this a branch instruction? Taken or not taken? If taken, where to? What PC? The predicted PC steers instruction fetch.

In decode, note the instruction type and branch target, and pass the prediction info to the next stage along with the instruction.

In execute, the ALU computes the actual branch taken/not-taken result. Check all predictions; take action if needed (kill instructions, update the predictor).

Flowchart control for dynamic branch prediction. (Figure 3.22)

Spatial predictors

C code snippet, compiling to three branches:
b1: if (aa == 2) ...
b2: if (bb == 2) ...
b3: if (aa != bb) ...

We want to predict the b3 branch (a BEQZ after compilation). Can b1 and b2 help us predict it?

Idea: Devote hardware to four 2-bit predictors for the BEQZ branch.
P1: Use if b1 and b2 not taken.
P2: Use if b1 taken, b2 not taken.
P3: Use if b1 not taken, b2 taken.
P4: Use if b1 and b2 taken.

Track the current taken/not-taken status of b1 and b2, and use it to choose from P1...P4 for the BEQZ... How?

Branch History Register: Tracks global history

[Same fetch/decode/execute datapath with the predictor.] We choose which predictor to use (and update) based on the Branch History Register: a 2-bit shift register, clocked on each branch, that holds the taken/not-taken status of the last 2 branches.

Spatial branch predictor (BTB, tag not shown)

The PC of the BEQZ R3 L3 instruction (0b0110[...]01001000) maps to an index into four Branch History Tables, P1-P4, each entry holding 2 state bits. The Branch History Register, holding the outcomes of the (aa==2) and (bb==2) branches, controls a mux that chooses which predictor supplies taken or not taken. This detects patterns in the (aa != bb) branch code. Yeh and Patt, 1992. 95% accurate.
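The structure above can be sketched as a small two-level predictor. This follows the Yeh and Patt idea in spirit but simplifies freely: 2-bit saturating counters stand in for the slide's 2-state-bit entries, the table size and PC hashing are my choices, and the synthetic branch stream (b3 taken only when b1 and b2 both were) is invented to show the correlation being captured.

```python
# A 2-bit global Branch History Register selects one of four 2-bit
# counters (P1..P4) per branch PC, as in the spatial-predictor slide.
import random

class SpatialPredictor:
    def __init__(self):
        self.bhr = 0                            # last 2 branch outcomes
        self.counters = [[1] * 4                # four counters per PC slot,
                         for _ in range(1024)]  # initialized weakly not-taken

    def _slot(self, pc):
        return (pc >> 2) % 1024

    def predict(self, pc):
        return self.counters[self._slot(pc)][self.bhr] >= 2

    def update(self, pc, taken):
        c = self.counters[self._slot(pc)]
        c[self.bhr] = min(3, c[self.bhr] + 1) if taken else max(0, c[self.bhr] - 1)
        self.bhr = ((self.bhr << 1) | taken) & 0b11   # shift in the outcome

# b3 correlates perfectly with the two branches before it.
p = SpatialPredictor()
random.seed(0)
hits = total = 0
for _ in range(500):
    b1, b2 = random.randint(0, 1), random.randint(0, 1)
    p.update(0x100, b1)
    p.update(0x200, b2)
    b3 = b1 & b2
    hits += (p.predict(0x300) == bool(b3))
    total += 1
    p.update(0x300, b3)
print(hits / total)   # close to 1.0: a lone 2-bit entry could not do this
```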

Performance

One BHT (4096 entries) vs. spatial (4 BHTs, each with 1024 entries). Why 4096 vs. 1024? A fair comparison matches the total number of bits. The spatial predictor reaches 95% accuracy. (Figure 3.3; see the text for more details on branch prediction.)

Predict function returns by stacking call info.

[Figure 3.24: the program counter feeds branch history tables (branch prediction), a return stack, and a target cache, with alternate paths selecting among them.]
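The return-stack idea can be sketched in a few lines: calls push their return address, so the predictor can supply the target of a return before the return executes. The class, depth, and the fixed-size-instruction assumption (return address = call PC + 4) are mine.

```python
# A return-address stack: the hardware analogue of the software call stack,
# but holding only return PCs, for predicting return targets at fetch time.

class ReturnStack:
    def __init__(self, depth=8):
        self.stack, self.depth = [], depth

    def on_call(self, pc):
        if len(self.stack) == self.depth:   # a real fixed-depth stack
            self.stack.pop(0)               # overwrites its oldest entry
        self.stack.append(pc + 4)           # return address = call PC + 4

    def predict_return(self):
        return self.stack.pop() if self.stack else None

rs = ReturnStack()
rs.on_call(0x1000)                   # main calls f
rs.on_call(0x2000)                   # f calls g
print(hex(rs.predict_return()))      # 0x2004 (back into f)
print(hex(rs.predict_return()))      # 0x1004 (back into main)
```

A BTB alone mispredicts returns from functions called from several sites, since the last-seen target is wrong whenever the caller changes; the stack handles that case naturally.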

Hardware limits to superpipelining?

[Chart: CPU clock periods, 1985-2005, measured in FO4 delays (how many fanout-of-4 inverter delays fit in the clock period); y-axis 0-100. Processors plotted: Intel 386, 486, Pentium, Pentium 2, Pentium 3, Pentium 4, Itanium; Alpha 21064, 21164, 21264; Sparc, SuperSparc, Sparc64; MIPS; HP PA; PowerPC; AMD K6, K7, x86-64. Thanks to Francois Labonte, Stanford.]

MIPS 2000: 5 stages. Pentium Pro: 10 stages. Pentium 4: 20 stages. Historical limit: about 12 FO4s. Power wall: the Intel Core Duo has 14 stages.

(Slide from CS 250 L3: Timing, UC Regents Fall 2013 UCB)

CPU DB: Recording Microprocessor History

"With this open database, you can mine microprocessor trends over the past 40 years." Andrew Danowitz, Kyle Kelley, James Mao, John P. Stevenson, Mark Horowitz, Stanford University.

[Chart: FO4 delays per cycle for processor designs, 1985-2015; y-axis 0-140 FO4/cycle.]

FO4 delay per cycle is roughly proportional to the amount of computation completed per cycle.

On Tuesday, we turn our focus to memory system design...

Have a good weekend!