CS 152 Computer Architecture and Engineering


1 CS 152 Computer Architecture and Engineering Lecture 20 Advanced Processors I John Lazzaro ( TAs: Ted Hong and David Marquardt www-inst.eecs.berkeley.edu/~cs152/

2 Last Time: Error Correcting Codes
We write a 7-bit word laid out as D₃D₂D₁P₂D₀P₁P₀. Later, we read it back. A cosmic ray hit D₁. But how do we know that? On readout we compute:
P₀ xor D₃ xor D₁ xor D₀ = 1 xor 0 xor 0 xor 0 = 1
P₁ xor D₃ xor D₂ xor D₀ = 1 xor 0 xor 1 xor 0 = 0
P₂ xor D₃ xor D₂ xor D₁ = 0 xor 0 xor 1 xor 0 = 1
Reading the three results as P₂P₁P₀ gives 0b101 = 5. What does 5 mean? The position of the flipped bit! Note: we number the least significant bit with 1, not 0; 0 is reserved for "no errors". To repair, just flip it back...
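
The readout check above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function names `syndrome` and `repair` and the dictionary layout are assumptions made for the example, while the bit values are the ones that appear in the slide's xor arithmetic.

```python
# Bit names ordered by position; position 1 is the least significant bit.
POSITIONS = ["P0", "P1", "D0", "P2", "D1", "D2", "D3"]

def syndrome(word):
    """Return 0 for no error, else the 1-based position of the flipped bit."""
    s0 = word["P0"] ^ word["D3"] ^ word["D1"] ^ word["D0"]
    s1 = word["P1"] ^ word["D3"] ^ word["D2"] ^ word["D0"]
    s2 = word["P2"] ^ word["D3"] ^ word["D2"] ^ word["D1"]
    return (s2 << 2) | (s1 << 1) | s0

def repair(word):
    """Return a copy of the word with the bit named by the syndrome flipped back."""
    fixed = dict(word)
    pos = syndrome(word)
    if pos:
        fixed[POSITIONS[pos - 1]] ^= 1
    return fixed

# The word as read back, using the bit values from the slide's computation.
read_back = {"D3": 0, "D2": 1, "D1": 0, "P2": 0, "D0": 0, "P1": 1, "P0": 1}

print(syndrome(read_back))          # 5, which is the position of D1
print(syndrome(repair(read_back)))  # 0 once D1 is flipped back
```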

3 Today: Beyond the 5-stage pipeline Taxonomy: Introduction to advanced processor techniques. Superpipelining: Increasing the number of pipeline stages. Superscalar: Issuing several instructions in a single cycle.

4 5 Stage Pipeline: A point of departure
Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
[Figure: pipeline datapath with IM, Reg, ALU, DM, Reg.]
At best, the 5-stage pipeline executes one instruction per clock, with a clock period determined by the slowest stage. That best case assumes: perfect caching; an application that does not need multi-cycle instructions (multiply, divide, etc.); and filling all delay slots (branch, load).

5 Superpipelining: Add more stages (Today!)
Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
Goal: Reduce the critical path by adding more pipeline stages. Example: 8-stage ARM XScale: extra IF, ID, and data cache stages. Difficulties: Added penalties for load delays and branch misses. Ultimate Limiter: As logic delay per stage goes to 0, flip-flop clk-to-q and setup times set a floor on the clock period.
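
A rough way to see the ultimate limiter is to model the clock period as the logic delay per stage plus the fixed flip-flop overhead. The sketch below assumes the logic splits evenly across stages, which real designs rarely achieve; `clock_period` and all of the picosecond values are hypothetical, chosen only for illustration.

```python
def clock_period(total_logic_delay, stages, clk_to_q, setup):
    """Idealized clock period: evenly split logic plus flip-flop overhead."""
    return total_logic_delay / stages + clk_to_q + setup

# Hypothetical delays in picoseconds (not taken from the lecture).
LOGIC, CLK_TO_Q, SETUP = 5000.0, 60.0, 40.0

for n in (5, 8, 20, 100):
    # The period shrinks with more stages but never drops below CLK_TO_Q + SETUP.
    print(n, clock_period(LOGIC, n, CLK_TO_Q, SETUP))
```

Under this model, going from 5 to 8 stages helps a lot, while going from 20 to 100 stages buys comparatively little because the flip-flop overhead dominates.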

6 Superscalar: Multiple issues per cycle (Today!)
Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
Goal: Improve CPI by issuing several instructions per cycle. Example: CPU with floating point ALUs: issue 1 FP + 1 integer instruction per cycle. Difficulties: Load and branch delays affect more instructions. Ultimate Limiter: Programs may be a poor match to issue rules.

7 Out of Order: Going around stalls (Thursday)
Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
Goal: Issue instructions out of program order. Example: MULTD is waiting on F4 to load... so let ADDD go first! Difficulties: Bookkeeping is highly complex. A poor fit for lockstep instruction scheduling. Ultimate Limiter: The amount of instruction level parallelism present in an application.

8 Dynamic Scheduling: End lockstep (Thursday)
Goal: Enable out-of-order execution by breaking the pipeline in two: fetch and execution. Example: IBM Power 5.
[Figure: IBM Power 5 pipeline: instruction fetch, then group formation and instruction decode, then out-of-order processing in the branch, load/store, fixed-point, and floating-point pipelines, with paths for branch redirects and for interrupts and flushes.]
Limiters: Design complexity, instruction level parallelism.


9 Throughput and multiple threads (Next Tuesday)
Goal: Use multiple CPUs (real and virtual) to improve (1) the throughput of machines that run many programs and (2) the execution time of multithreaded programs. Example: Sun Niagara (8 SPARCs on one chip). Difficulties: Gaining full advantage requires rewriting applications, OS, libraries. Ultimate limiter: Amdahl's law, memory system performance.
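
To make the Amdahl's law limiter concrete, here is a minimal sketch of the standard formula. The function name `amdahl_speedup` and the 90% parallel fraction are assumptions for illustration; only the count of 8 CPUs comes from the Niagara example.

```python
def amdahl_speedup(parallel_fraction, n_cpus):
    """Speedup when only `parallel_fraction` of the run time scales with CPUs."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cpus)

# Hypothetical program: 90% of its run time parallelizes perfectly.
print(amdahl_speedup(0.9, 8))      # well short of 8x on 8 CPUs
print(amdahl_speedup(0.9, 10**6))  # approaches, but never exceeds, 10x
```

The serial 10% caps the speedup at 1 / 0.1 = 10 no matter how many CPUs are added, which is why rewriting applications to shrink the serial part matters.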

10 Reminder: Friday Test Bench Checkoff
[Figure: lab system block diagram. The Pipelined CPU (Week 5, Xilinx) connects over the IC Bus and DC Bus to the Instruction Cache and Data Cache (Week 4, Xilinx), which connect over the IM Bus and DM Bus to the DRAM Controller (Week 3, Xilinx) and the DRAM.]
Test vector suite for Week 3/4/5 checkoffs, running in ModelSim (3/4) and SPIM (5). Detailed block diagrams, state machines, and Lab 3 CPU changes.

11 Superpipelining

12 Add pipeline stages, reduce clock period
Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
Example: ARM XScale, 8 stages.
Q. Could adding pipeline stages increase the CPI for an application? A. Yes, due to these problems:
CPI Problem                            Possible Solution
Taken branches cause longer stalls     Branch prediction, loop unrolling
Cache misses take more clock cycles    Larger caches, add prefetch opcodes to ISA
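
The trade-off on this slide can be sketched numerically: a deeper pipeline shortens the clock period but raises the CPI through longer branch stalls. Everything below is a hypothetical model for illustration. The helper names, the branch statistics, the stall penalties, and the clock periods are assumptions, not measurements of the XScale or any real CPU.

```python
def cpi(base_cpi, branch_fraction, stall_fraction, branch_penalty):
    """CPI with branch stalls added on top of an ideal base CPI."""
    return base_cpi + branch_fraction * stall_fraction * branch_penalty

def seconds_per_program(instructions, cycles_per_instruction, seconds_per_cycle):
    """Seconds/Program = Instructions/Program x Cycles/Instruction x Seconds/Cycle."""
    return instructions * cycles_per_instruction * seconds_per_cycle

INSTRUCTIONS = 1e9       # hypothetical program length
BRANCH_FRACTION = 0.2    # hypothetical: 20% of instructions are branches
STALL_FRACTION = 0.5     # hypothetical: half of the branches cause a stall

# Hypothetical shallow pipeline: 1-cycle branch penalty, 1.1 ns clock.
shallow_cpi = cpi(1.0, BRANCH_FRACTION, STALL_FRACTION, 1)
shallow_time = seconds_per_program(INSTRUCTIONS, shallow_cpi, 1.1e-9)

# Hypothetical deeper pipeline: 4-cycle branch penalty, 0.725 ns clock.
deep_cpi = cpi(1.0, BRANCH_FRACTION, STALL_FRACTION, 4)
deep_time = seconds_per_program(INSTRUCTIONS, deep_cpi, 0.725e-9)

print(shallow_cpi, shallow_time)
print(deep_cpi, deep_time)
```

With these made-up numbers the deeper pipeline has the worse CPI yet still finishes sooner. A better branch predictor lowers `STALL_FRACTION` and widens that gap, which is the motivation for the slides that follow.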

13 Recall: Control hazards...
[Figure: 5-stage datapath with stages IF (Fetch), ID (Decode), EX (ALU), MEM, WB; the PC is incremented by 0x4 and addresses the I-Cache (instruction memory).]
We avoid stalling by (1) adding a branch delay slot, and (2) adding a comparator to the ID stage. If we add more early stages, we must stall.
Sample Program (ISA w/o branch delay slot):
I1: BEQ R4,R3,25
I2: AND R6,R5,R4
I3: SUB R1,R9,R8
Time:  t1   t2   t3   t4   t5   t6   t7   t8
I1:    IF   ID   EX   MEM  WB
I2:         IF   ID
I3:              IF
I4:
I5:
I6:
The EX stage computes if the branch is taken. If the branch is taken, the instructions fetched behind it MUST NOT complete!

14 Solution: Branch prediction...
[Figure: 5-stage datapath with stages IF (Fetch), ID (Decode), EX (ALU), MEM, WB; a Branch Predictor supplies predictions to the PC logic in front of the I-Cache (instruction memory).]
We update the PC based on the outputs of the branch predictor. If it is perfect, the pipe stays full! Dynamic predictors: a cache of branch history. The predictor's predictions: Is this a control instruction? Taken or not taken? The PC a branch targets.
Time:  t1   t2   t3   t4   t5   t6   t7   t8
I1:    IF   ID   EX   MEM  WB
I2:         IF   ID
I3:              IF
I4:
I5:
I6:
The EX stage computes if the branch is taken. If we predicted incorrectly, the instructions fetched behind it MUST NOT complete!

15 Branch predictors cache branch history
[Figure: the address of a BNEZ R1 Loop instruction, 0b0110[...]0100, is looked up in two structures. Branch Target Buffer (BTB): a 28-bit address tag, 0b0110[...], is compared (=) with the instruction address to signal a Hit, and the entry supplies the target address (the PC of Loop) to use if the branch is taken. Branch History Table (BHT): 2 bits, supplying Taken or Not Taken.]
80-90% accurate. Update the BHT/BTB for next time, once the true behavior is known. Must check the prediction, and kill instructions if needed.

16 Simple (2-bit) Branch History Table Entry
The entry is two flip-flop bits: (1) the prediction for the next branch (1 = taken, 0 = not taken), and (2) was the last prediction correct? (1 = yes, 0 = no). We do not change the prediction the first time it is incorrect. Why?
      ADDI R4,R0,11
loop: SUBI R4,R4,1
      BNE  R4,R0,loop
This branch is taken 10 times, then not taken once (end of loop). The next time we enter the loop, we would like to predict "taken" the first time through.
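
The benefit of not changing the prediction on the first miss can be checked with a small simulation. This is a sketch under stated assumptions: the class names are hypothetical, the entries are assumed to start out predicting "taken", and the update rule (flip the prediction only on a second consecutive miss) is one literal reading of the two bits described above. A 1-bit entry that always predicts the last outcome is included for comparison.

```python
class TwoBitEntry:
    """One BHT entry: a prediction bit plus a 'last prediction correct?' bit."""

    def __init__(self, prediction=True, last_correct=True):
        self.prediction = prediction      # True = predict taken
        self.last_correct = last_correct

    def update(self, taken):
        """Record the true outcome; return True if the prediction was right."""
        correct = (taken == self.prediction)
        if not correct and not self.last_correct:
            self.prediction = taken       # second miss in a row: flip
        self.last_correct = correct
        return correct


class OneBitEntry:
    """Comparison point: always predict whatever the branch did last time."""

    def __init__(self, prediction=True):
        self.prediction = prediction

    def update(self, taken):
        correct = (taken == self.prediction)
        self.prediction = taken
        return correct


def count_misses(entry, outcomes):
    return sum(0 if entry.update(taken) else 1 for taken in outcomes)


# The loop branch: taken 10 times, then not taken once; run the loop 3 times.
outcomes = ([True] * 10 + [False]) * 3

print(count_misses(TwoBitEntry(), outcomes))  # 3: only the loop exits miss
print(count_misses(OneBitEntry(), outcomes))  # 5: re-entry misses as well
```

The 2-bit entry mispredicts only at each loop exit, while the 1-bit entry also mispredicts the first iteration every time the loop is re-entered.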

17 Spatial enhancements: many BHTs...
The address of the branch (for example 0b0110[...] for BNEZ R1 Loop) indexes four Branch History Tables: BHT00, BHT01, BHT10, and BHT11. Which table supplies the Taken or Not Taken prediction depends on: were the last two branches in the instruction stream taken or not? Detects patterns in code such as: if (x < 12) [...] if (x < 6) [...]. 95% accurate. Update the table whose value was used for the branch instruction. Yeh and Patt, 1992.
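
A small simulation can illustrate why selecting a table by the last two branch outcomes helps on correlated code. This is a sketch, not the Yeh and Patt design itself: the class names, the branch addresses 0x40 and 0x48, the input values, the initial state, and the simplification that "taken" means "condition true" are all assumptions made for the example.

```python
from collections import defaultdict


class TwoBitEntry:
    """Prediction bit plus 'last prediction correct?' bit; flips on a second miss."""

    def __init__(self):
        self.prediction = True    # assumed initial state: predict taken
        self.last_correct = True

    def update(self, taken):
        correct = (taken == self.prediction)
        if not correct and not self.last_correct:
            self.prediction = taken
        self.last_correct = correct
        return correct


class SingleTablePredictor:
    """One BHT indexed only by branch address."""

    def __init__(self):
        self.table = defaultdict(TwoBitEntry)

    def update(self, pc, taken):
        return self.table[pc].update(taken)


class FourTablePredictor:
    """Four BHTs; the outcomes of the last two branches pick which one is used."""

    def __init__(self):
        self.history = (False, False)            # (older, newer) outcome
        self.tables = defaultdict(TwoBitEntry)   # keyed by (history, pc)

    def update(self, pc, taken):
        # Update the table whose value was used, then shift in the outcome.
        correct = self.tables[(self.history, pc)].update(taken)
        self.history = (self.history[1], taken)
        return correct


def branch_stream(xs):
    """Two back-to-back branches per x, as in: if (x < 12) ... if (x < 6) ..."""
    for x in xs:
        yield 0x40, x < 12   # hypothetical address of the first branch
        yield 0x48, x < 6    # hypothetical address of the second branch


def count_misses(predictor, stream):
    return sum(0 if predictor.update(pc, taken) else 1 for pc, taken in stream)


xs = [3, 20] * 10   # hypothetical inputs that alternate small and large
print(count_misses(SingleTablePredictor(), branch_stream(xs)))  # 20
print(count_misses(FourTablePredictor(), branch_stream(xs)))    # 4
```

With one table per address, each branch sees an alternating pattern and mispredicts half the time. With four tables, each (history, address) pair sees a constant outcome, so the predictor is perfect after a short warm-up.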

18 Hardware limits to superpipelining?
[Figure: FO4 delays per CPU clock period, plotted over time for several processor families. Annotated points include a MIPS design, the Pentium Pro (10 stages), the Pentium 4 (20 stages), and Cell (11 FO4 delays). Historical limit: about 12.]
FO4: How many fanout-of-4 inverter delays in the clock period. Thanks to Francois Labonte, Stanford.

19 Superscalar

20 Superscalar: A simple example...
Example: Superscalar MIPS. Fetches 2 instructions at a time. If the first is an integer instruction and the second is floating point, issue both in the same cycle.
Integer instruction    FP instruction
LD F0,0(R1)                                 One issue per cycle
LD F6,-8(R1)                                One issue per cycle
LD F10,-16(R1)         ADDD F4,F0,F2        Two issues per cycle
LD F14,-24(R1)         ADDD F8,F6,F2        Two issues per cycle
LD F18,-32(R1)         ADDD F12,F10,F2      Two issues per cycle
SD 0(R1),F4            ADDD F16,F14,F2      Two issues per cycle
SD -8(R1),F8           ADDD F20,F18,F2      Two issues per cycle
SD -16(R1),F12                              One issue per cycle
SD -24(R1),F16                              One issue per cycle
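
The issue rule in this example can be sketched as a small cycle counter. The model below is a simplification made for illustration: it looks at the next two instructions in program order, pairs them only when the first is an integer-pipe instruction and the second is floating point, and ignores all hazards and latencies. The function name and the opcode sets are assumptions.

```python
INT_OPS = {"LD", "SD"}   # issued through the integer pipe in this example
FP_OPS = {"ADDD"}        # issued through the floating point pipe


def issue_cycles(program):
    """Count issue cycles for a list of opcodes under the int+FP pairing rule."""
    cycles = 0
    i = 0
    while i < len(program):
        can_pair = (
            i + 1 < len(program)
            and program[i] in INT_OPS
            and program[i + 1] in FP_OPS
        )
        i += 2 if can_pair else 1
        cycles += 1
    return cycles


# The instruction sequence from the slide, read in issue order.
PROGRAM = ["LD", "LD", "LD", "ADDD", "LD", "ADDD", "LD", "ADDD",
           "SD", "ADDD", "SD", "ADDD", "SD", "SD"]

cycles = issue_cycles(PROGRAM)
print(cycles, cycles / len(PROGRAM))  # 9 cycles for 14 instructions

# A perfectly alternating int/FP stream reaches the best case of 0.5 CPI.
ideal = ["LD", "ADDD"] * 4
print(issue_cycles(ideal) / len(ideal))  # 0.5
```

The slide's sequence takes 9 issue cycles for 14 instructions, a CPI of about 0.64, because the leading loads and trailing stores have no floating point partner.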

21 Superscalar: Visualizing the pipeline
Type               Pipe Stages
Int. instruction   IF   ID   EX   MEM  WB
FP instruction     IF   ID   EX   MEM  WB
Int. instruction        IF   ID   EX   MEM  WB
FP instruction          IF   ID   EX   MEM  WB
Int. instruction             IF   ID   EX   MEM  WB
FP instruction               IF   ID   EX   MEM  WB
Three instructions are affected by a single cycle of load delay. Why?

22 Limitations of lockstep superscalar
Only get 0.5 CPI for a 50/50 float/int mix with no hazards. For games/media, this may be OK. Extending the scheme to speed up general apps (Microsoft Office, ...) is complicated. If one accepts building a complicated machine, there are better ways to do it. Next time: Dynamic Scheduling.
[Figure: out-of-order pipeline: instruction fetch, then group formation and instruction decode, then out-of-order processing in the branch, load/store, fixed-point, and floating-point pipelines, with paths for branch redirects and for interrupts and flushes.]

23 Conclusion: Superpipelining, Superscalar The 5 stage pipeline: a starting point for performance enhancements, a building block for multiprocessing. Superpipelining: Reduce critical path by adding more pipeline stages. Has the potential to increase the CPI. Superscalar: Multiple instructions at once. Programs must fit the issue rules. Adds complexity.
