Pipeline Processor Design

Size: px

Start display at page:

Download "Pipeline Processor Design"

Dina Day
6 years ago
Views:

1 Pipeline Processor Design Beyond Pipeline Architecture Virendra Singh Computer Design and Test Lab. Indian Institute of Science Bangalore Advance Computer Architecture

2 Branch Hazard Consider heuristic branch not taken. Continue fetching instructions in sequence following the branch instructions. If branch is taken (indicated by zero output of ALU): Control generates branch signal in ID cycle. branch activates PCSource signal in the MEM cycle to load PC with new branch address. Three instructions in the pipeline must be flushed if branch is taken can this penalty be reduced? Advance Computer Architecture 2

3 Pipeline Flush If branch is taken (as indicated by zero), then control does the following: Change all control signals to 0, similar to the case of stall for data hazard, i.e., insert bubble in the pipeline. Generate a signal IF.Flush that changes the instruction in the pipeline register IF/ID to 0 (nop). Penalty of branch hazard is reduced by Adding branch detection and address generation hardware in the decode cycle one bubble needed a next address generation logic in the decode stage writes PC+4, branch address, or jump address into PC. Using branch prediction. Advance SE-273@SERC Computer Architecture 3

4 Branch Prediction Useful for program loops. A one-bit prediction scheme: a one-bit buffer carries a history bit that tells what happened on the last branch instruction History bit = 1, branch was taken History bit = 0, branch was not taken taken Predict branch taken 1 Not taken taken Predict branch not taken 0 Not taken Feb 26, 14, 2011 Advance SE-273@SERC Computer Architecture 4

5 Branch Prediction Address of Target History recent branch addresses bit(s) instructions Low-order bits used as index PC+4 Next PC 0 1 PC = Prediction Logic Feb 26, 14, 2011 Advance SE-273@SERC Computer Architecture 5

6 Branch Prediction for a Loop N I = 0 I = I + 1 X = X + R(I) I 10 = 0? Y Store X in memory Execution seq. Execution of Instruction 4 Old Next instr. New hist. hist. Pred. I Act. bit bit Feb 26, 14, 2011 Advance SE-273@SERC Computer Architecture 6 Predicti on Bad Good Good Good Good Good Good Good Good h.bit 10 = 0 branch 1 not 2 taken, 10 h.bit 5 = 1 branch 0 Bad taken.

7 Two-Bit Prediction Buffer Can improve correct prediction statistics. taken Predict branch taken 11 taken Predict branch not taken 01 Not taken taken Not taken taken Predict branch taken 10 Predict branch not taken 00 Not taken Not taken Feb 26, 14, 2011 Advance Computer Architecture 7

8 Branch Prediction for a Loop 1 2 I = 0 I = I + 1 Execution seq. Execution of Instruction 4 Next instr. Old Pred.Bu f Pred. I Act. New pred.b uf Predicti on Good X = X + R(I) Good Good N Good I 10 = 0? Good Y Good Good Store X in memory Good Good Bad Feb 26, 14, 2011 Advance SE-273@SERC Computer Architecture 8

9 Exceptions A typical exception occurs when ALU produces an overflow signal. Control asserts following actions on exception: Change the PC address to hex. This is the location of the exception routine. This is done by adding an additional input to the PC input multiplexer. Overflow is detected in the EX cycle. Similar to data hazard and pipeline flush, Set IF/ID to 0 (nop). Generate ID.Flush and EX.Flush signals to set all control signals to 0 in ID/EX and EX/MEM registers. This also prevents the ALU result (presumed contaminated) from being written in the WB cycle. Feb 26, 14, 2011 Advance SE-273@SERC Computer Architecture 9

10 Limits of Pipelining IBM RISC Experience Control and data dependences add 15% Best case CPI of 1.15, IPC of 0.87 Deeper pipelines (higher frequency) magnify dependence penalties This analysis assumes 100% cache hit rates Hit rates approach 100% for some programs Many important programs have much worse hit rates Later! Feb 26, 14, 2011 Advance Computer Architecture 10

11 Processor Performance Time Processor Performance = Program Instructions Cycles = X X Program Instruction (code size) (CPI) (cycle time) In the 1980 s (decade of pipelining): CPI: 5.0 => 1.15 In the 1990 s (decade of superscalar): CPI: 1.15 => 0.5 (best case) In the 2000 s (decade of multicore): Marginal CPI improvement Time Cycle Feb 26, 14, 2011 Advance SE-273@SERC Computer Architecture 11

12 Amdahl s Law N No. of Processors 1 h 1 - h 1 - f f Time h = fraction of time in serial code f = fraction that is vectorizable v = speedup for f Overall speedup: Speedup = 1 Feb 26, 14, 2011 Advance SE-273@SERC Computer Architecture 12 f 1 + f v

13 Revisit Amdahl s Law Sequential bottleneck Even if v is infinite lim 1 Performance limited by nonvectorizable portion (1-f) f 1 f v = v f N No. of Processors 1 h 1 - h 1 - f f Time Feb 26, 14, 2011 Advance SE-273@SERC Computer Architecture 13

14 Pipelined Performance Model N Pipeline Depth 1 1-g g g = fraction of time pipeline is filled 1-g = fraction of time pipeline is not filled (stalled) Feb 26, 14, 2011 Advance SE-273@SERC Computer Architecture 14

15 Pipelined Performance Model N Pipeline Depth 1 1-g g g = fraction of time pipeline is filled 1-g = fraction of time pipeline is not filled (stalled) Feb 26, 14, 2011 Advance SE-273@SERC Computer Architecture 15

16 Pipelined Performance Model N Pipeline Depth 1 1-g g Tyranny of Amdahl s Law [Bob Colwell] When g is even slightly below 100%, a big performance hit will result Stalled cycles are the key adversary and must be minimized as much as possible Feb 26, 14, 2011 Advance SE-273@SERC Computer Architecture 16

17 Motivation for Superscalar [Agerwala and Cocke] Speedup jumps from 3 to 4.3 for N=6, f=0.8, but s =2 instead of s=1 (scalar) Typical Range Feb 26, 14, 2011 Advance SE-273@SERC Computer Architecture 17

18 Superscalar Proposal Moderate tyranny of Amdahl s Law Ease sequential bottleneck More generally applicable Robust (less sensitive to f) Revised Amdahl s Law: Speedup = 1 1 ( f ) s + f v Feb 26, 14, 2011 Advance SE-273@SERC Computer Architecture 18

19 Limits on Instruction Level Parallelism (ILP) Weiss and Smith [1984] 1.58 Sohi and Vajapeyam [1987] 1.81 Tjaden and Flynn [1970] Tjaden and Flynn [1973] 1.96 Uht [1986] 2.00 Smith et al. [1989] 2.00 Jouppi and Wall [1988] 2.40 Johnson [1991] 2.50 Acosta et al. [1986] 2.79 Wedig [1982] 3.00 Butler et al. [1991] 5.8 Melvin and Patt [1991] 6 Wall [1991] Kuck et al. [1972] 8 Riseman and Foster [1972] 1.86 (Flynn s bottleneck) 7 (Jouppi disagreed) 51 (no control dependences) Nicolau and Fisher [1984] 90 (Fisher s optimism) Feb 26, 14, 2011 Advance SE-273@SERC Computer Architecture 19

20 Superscalar Proposal Go beyond single instruction pipeline, achieve IPC > 1 Dispatch multiple instructions per cycle Provide more generally applicable form of concurrency (not just vectors) Geared for sequential code that is hard to parallelize otherwise Exploit fine-grained or instructionlevel parallelism (ILP) Feb 26, 14, 2011 Advance SE-273@SERC Computer Architecture 20

21 Classifying ILP Machines [Jouppi, DECWRL 1991] Baseline scalar RISC Issue parallelism = IP = 1 Operation latency = OP = 1 Peak IPC = 1 SUCCESSIVE INSTRUCTIONS IF DE EX WB TIME IN CYCLES (OF BASELINE MACHINE) Feb 26, 14, 2011 Advance SE-273@SERC Computer Architecture 21

22 Beyond Pipelining Time taken by a program The number of instruction required to execute the program Average number of cycle required to execute an instruction The processor cycle time Constraints Feb 26, 14, 2011 Advance SE-273@SERC Computer Architecture 22

23 Beyond Pipelining Superscalar Processor Reduces average number of cycles per instruction beyond what is possible in pipelined scalar RISC processor by allowing concurrent execution of instructions in the same pipelined stage as well as concurrent execution of instructions in different pipeline stages Emphasizes multiple concurrent operations on scalar quantities There are many hurdles to support it Feb 26, 2011 Advance Computer Architecture 23

24 Superscalar Architecture Superscalar Processor Simple concept Wide pipeline Instructions are not independent Superscalar architecture is natural descendant of pipelined scalar RISC Superscalar techniques largely concern the processor organization, independent of the ISA and the other architectural features Thus, possibility to develop a processor code compatible with an existing architecture Feb 26, 2011 Advance Computer Architecture 24

25 Superscalar Architecture Fundamental Limitations True data dependencies If an instruction uses a value produced by a previous instruction, the second instruction has a true data dependency on the first instruction Second instruction must be delayed Procedural dependencies Due to change in the program flow Resource conflicts Arises when two instructions want to use the same resource at the same time Can be eliminated by resource duplication Feb 26, 2011 Advance Computer Architecture 25

26 Superscalar Architecture Instruction parallelism and Machine parallelism Instruction parallelism of a program is a measure of the average number of instructions that a superscalar processor might be able to execute at the same time Mostly, ILP is determined by the number of true dependencies and the number of branches in relation to other instructions Feb 26, 2011 Advance Computer Architecture 26

27 Superscalar Architecture Instruction parallelism and Machine parallelism Machine parallelism of a processor is a measure of the ability of processor to take advantage of the ILP Determined by the number of instructions that can be fetched and executed at the same time A challenge in the design of superscalar processor is to achieve good balance between instruction parallelism and machine parallelism Feb 26, 2011 Advance Computer Architecture 27

28 Superscalar Architecture Instruction issue and machine parallelism ILP is not necessarily exploited by widening the pipelines and adding more resources Processor policies towards fetching decoding, and executing instruction have significant effect on its ability to discover instructions which can be executed concurrently Instruction issue is refer to the process of initiating instruction execution Instruction issue policy limits or enhances performance because it determines the processor s look ahead capability Feb 26, 2011 Advance Computer Architecture 28

29 Superscalar Pipelines IF ID RD ALU MEM WB Feb 26, 2011 Advance Computer Architecture 29

30 Superscalar Pipelines Dynamic Pipelines 1. Alleviate the limitations of pipelined implementation 2. Use diversified pipelines 3. Temporal machine parallelism Feb 26, 2011 Advance Computer Architecture 30

31 Superscalar Pipelines (Diversified) IF ID RD EX ALU Mem1 Mem2 FP1 FP2 FP3 BR WB Feb 26, 2011 Advance Computer Architecture 31

32 Superscalar Pipelines (Diversified) Diversified Pipelines Each pipeline can be customized for particular instruction type Each instruction type incurs only necessary latency Certainly less expensive than identical copies If all inter-instruction dependencies are resolved then there is no stall after instruction issue Require special consideration Number and Mix of functional units Feb 26, 2011 Advance Computer Architecture 32

33 Dynamic Pipelines Superscalar Pipelines (Dynamic Pipelines) Buffers are needed Multi-entry buffers Every entry is hardwired to one read port and one write port Complex multi-entry buffers Minimize stalls Feb 26, 2011 Advance Computer Architecture 33

34 Thank You Feb 26, 14, 2011 Advance Computer Architecture 34

Pipelining to Superscalar

Pipelining to Superscalar ECE/CS 752 Fall 207 Prof. Mikko H. Lipasti University of Wisconsin-Madison Pipelining to Superscalar Forecast Limits of pipelining The case for superscalar Instruction-level parallel