EECS 470. Further review: Pipeline Hazards and More. Lecture 2 Winter 2018

Size: px

Start display at page:

Download "EECS 470. Further review: Pipeline Hazards and More. Lecture 2 Winter 2018"

Harvey Joseph
5 years ago
Views:

1 EECS 470 Further review: Pipeline Hazards and ore Lecture 2 Winter 208 Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, artin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, and Wenisch of Carnegie ellon niversity, Purdue niversity, niversity of ichigan, niversity of Pennsylvania, and niversity of Wisconsin.

2 Bureaucracy & Scheduling Portions Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, artin, Roth, Shen, Smith, Sohi, Tyson, Wenisch, Vijaykumar Announcements HW due HW2 posted by end of the week. se office hours! Programming assignment due Tuesday (7 days) Hand-in electronically by :59pm Should be reading C.-C.3 (review) 3., (new material) Get on 470 s Piazza site (link on website) 2

3 Bureaucracy & Scheduling Today Portions Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, artin, Roth, Shen, Smith, Sohi, Tyson, Wenisch, Vijaykumar Touch on performance Cover a bit on ISAs Pickup where we left off on pipelining 3/6

4 Performance Portions Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, artin, Roth, Shen, Smith, Sohi, Tyson, Wenisch, Vijaykumar Performance Key Points Amdahl s law S overall = / ( (-f) f/s ) Iron law Time Program Instructions Program Cycles Instruction Time Cycle Averaging Techniques Arithmetic Time n i Timei n Harmonic Rates n i n Rate i 4

5 Performance Speedup Portions Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, artin, Roth, Shen, Smith, Sohi, Tyson, Wenisch, Vijaykumar While speedup is generally is used to explain the impact of parallel computation, we can also use it to discuss any performance improvement. Keep in mind that if execution time stays the same, speedup is. Speedup= T old T new 200% speedup means that it takes half as long to do something. So 50% speedup actually means it takes twice as long to do something. 5/73

6 ISA Portions Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, artin, Roth, Shen, Smith, Sohi, Tyson, Wenisch, Vijaykumar Instruction Set Architecture EECS 470

7 ISA Portions Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, artin, Roth, Shen, Smith, Sohi, Tyson, Wenisch, Vijaykumar Instruction Set Architecture Instruction set architecture (ISA) is the structure of a computer that a machine language programmer (or a compiler) must understand to write a correct (timing independent) program for that machine IB introducing 360 in IB 360 is a family of binary-compatible machines with distinct microarchitectures and technologies, ranging from odel 30 (8- bit datapath, up to 64KB memory) to odel 70 (64-bit datapath, 52KB memory) and later odel 360/9 (the Tomasulo). - IB 360 replaced 4 concurrent, but incompatible lines of IB architectures developed over the previous 0 years

8 ISA Portions Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, artin, Roth, Shen, Smith, Sohi, Tyson, Wenisch, Vijaykumar ISA: A contract between HW and SW ISA (instruction set architecture) A well-defined hardware/software interface The contract between software and hardware Functional definition of operations, modes, and storage locations supported by hardware Precise description of how to invoke, and access them No guarantees regarding How operations are implemented Which operations are fast and which are slow and when Which operations take more power and which take less

9 ISA Portions Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, artin, Roth, Shen, Smith, Sohi, Tyson, Wenisch, Vijaykumar Components of an ISA Programmer-visible states Program counter, general purpose registers, memory, control registers Programmer-visible behaviors (state transitions) What to do, when to do it Example register-transfer-level description of an instruction if imem[pc]== add rd, rs, rt then pc pc A binary encoding gpr[rd]=gpr[rs]grp[rt] ISAs last 25 years (because of SW cost) be careful what goes in

10 ISA Portions Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, artin, Roth, Shen, Smith, Sohi, Tyson, Wenisch, Vijaykumar RISC vs CISC Recall Iron law: (instructions/program) * (cycles/instruction) * (seconds/cycle) CISC (Complex Instruction Set Computing) Improve instructions/program with complex instructions Easy for assembly-level programmers, good code density RISC (Reduced Instruction Set Computing) Improve cycles/instruction with many single-cycle instructions Increases instruction/program, but hopefully not as much Help from smart compiler Perhaps improve clock cycle time (seconds/cycle) via aggressive implementation allowed by simpler instructions

11 ISA Portions Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, artin, Roth, Shen, Smith, Sohi, Tyson, Wenisch, Vijaykumar What akes a Good ISA? Programmability Easy to express programs efficiently? Implementability Easy to design high-performance implementations? ore recently Easy to design low-power implementations? Easy to design high-reliability implementations? Easy to design low-cost implementations? Compatibility Easy to maintain programmability (implementability) as languages and programs (technology) evolves? x86 (IA32) generations: 8086, 286, 386, 486, Pentium, PentiumII, PentiumIII, Pentium4,

12 ISA Portions Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, artin, Roth, Shen, Smith, Sohi, Tyson, Wenisch, Vijaykumar Typical Instructions (Opcodes) Type Arithmetic and logical and, add Data transfer move, load Control branch, jump, call, return System trap, rett Floating point add, mul, div, sqrt Decimal addd, convert String move, compare What operations are necessary? {sub, ld & st, conditional br.} What is the minimum complete ISA for a von Neuman machine? Too little or too simple not expressive enough difficult to program (by hand) programs tend to be bigger Too much or too complex most of it won t be used Example Instruction too much baggage for implementation. difficult choices during compiler optimization

13 Basic Pipelining Portions Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, artin, Roth, Shen, Smith, Sohi, Tyson, Wenisch, Vijaykumar Basic Pipelining

14 Basic Pipelining Portions Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, artin, Roth, Shen, Smith, Sohi, Tyson, Wenisch, Vijaykumar Before there was pipelining Single-cycle insn0.fetch ulti-cycle Basic datapath: fetch, decode, execute Single-cycle control: hardwired Low CPI () Long clock period (to accommodate slowest instruction) ulti-cycle control: micro-programmed Short clock period High CPI Can we have both low CPI and short clock period? insn0.fetch, dec, exec insn0.dec insn0.exec insn.fetch insn.fetch, dec, exec insn.dec Not if datapath executes only one instruction at a time No good way to make a single instruction go faster insn.exec

15 Basic Pipelining Pipelining Portions Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, artin, Roth, Shen, Smith, Sohi, Tyson, Wenisch, Vijaykumar insn0.fetch ulti-cycle insn0.fetch Pipelined insn0.dec insn0.dec insn.fetch Important performance technique Improves throughput at the expense of latency Why does latency go up? Begin with multi-cycle design insn0.exec insn0.exec insn.dec When instruction advances from stage to 2 allow next instruction to enter stage Each instruction still passes through all stages But instructions enter and leave at a much faster rate Automotive assembly line analogy insn.fetch insn.exec insn.dec insn.exec

16 Basic Pipelining Portions Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, artin, Roth, Shen, Smith, Sohi, Tyson, Wenisch, Vijaykumar Pipeline Illustrated: L Comb. Logic n Gate Delay BW = ~(/n) L n -- 2 Gate Delay L n -- 2 Gate Delay BW = ~(2/n) L n -- Gate 3 Delay L n -- Gate 3 Delay L n -- Gate 3 Delay BW = ~(3/n)

17 Basic Pipelining Portions Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, artin, Roth, Shen, Smith, Sohi, Tyson, Wenisch, Vijaykumar 370 Processor Pipeline Review Fetch Decode Execute emory (Write-back) PC I-cache Reg File AL D-cache T pipeline = T base / 5

18 Basic Pipelining Portions Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, artin, Roth, Shen, Smith, Sohi, Tyson, Wenisch, Vijaykumar Basic Pipelining Data hazards What are they? How do you detect them? How do you deal with them? icro-architectural changes Pipeline depth Pipeline width Forwarding ISA (minor point) Control hazards (time allowing) 8

19 Register file Basic Pipelining Fetch Decode Execute emory WB PC Inst mem PC instruction rega regb R0 R R2 R3 R4 R5 R6 R7 0 PC vala valb offset A L target eq? AL result valb Data memory AL result mdata data dest Bits 0-2 Bits 6-8 Bits dest op dest op dest op IF/ ID ID/ E E/ em em/ WB

20 PC Inst mem Register file A L Data memory IF/ ID ID/ E E/ em em/ WB op dest offset valb vala PC PC target AL result op dest valb op dest AL result mdata eq? instruction 0 R2 R3 R4 R5 R R6 R0 R7 rega regb data dest Fetch Decode Execute emory WB Basic Pipelining

21 Basic Pipelining Fetch Decode Execute emory WB Register file PC Inst mem PC instruction rega regb data R0 R R2 R3 R4 R5 R6 R7 0 PC vala valb offset A L target eq? AL result valb Data memory AL result mdata op op op IF/ ID fwd fwd fwd ID/ E E/ em em/ WB

22 Basic Pipelining Pipeline function for ADD Fetch: read instruction from memory Decode: read source operands from reg Execute: calculate sum emory: Pass results to next stage Writeback: write sum into register file

23 Pipelining & Data Hazards Data Hazards add 2 3 nand time add fetch decode execute memory writeback nand fetch decode execute memory writeback If not careful, you will read the wrong value of R3

24 Pipelining & Data Hazards Three approaches to handling data hazards ake sure there are no hazards in the code If hazards exist, stall the processor until they go away. Detect and Forward If hazards exist, fix up the pipeline to get the correct value (if possible)

25 Pipelining & Data Hazards Detect and Forward Handling data hazards: avoid all hazards Assume the programmer (or the compiler) knows about the processor implementation. ake sure no hazards exist. Put noops between any dependent instructions. add 2 3 noop noop nand write R3 in cycle 5 read R3 in cycle 6

26 Pipelining & Data Hazards Detect and Forward Problems with this solution Old programs (legacy code) may not run correctly on new implementations Longer pipelines need more noops Programs get larger as noops are included Especially a problem for machines that try to execute more than one instruction every cycle Intel EPIC: Often 25% - 40% of instructions are noops Program execution is slower CPI is one, but some I s are noops

27 Pipelining & Data Hazards Detect and Forward Handling data hazards: Detection: detect and stall Compare rega with previous DestRegs 3 bit operand fields Compare regb with previous DestRegs Stall: 3 bit operand fields Keep current instructions in fetch and decode Pass a noop to execute

28 Register file Pipelining & Data Hazards Detect and Forward End of Cycle PC Inst mem PC add 2 3 rega regb data R0 R R2 R3 R4 R5 R6 R PC vala valb offset A L target eq? AL result valb Data memory AL result mdata op op op IF/ ID ID/ E E/ em em/ WB

29 Register file Pipelining & Data Hazards Detect and Forward End of Cycle 2 PC Inst mem PC nand rega regb data R0 R R2 R3 R4 R5 R6 R PC A L target eq? AL result valb Data memory AL result mdata add op op IF/ ID ID/ E E/ em em/ WB

30 Register file Pipelining & Data Hazards Detect and Forward First half of cycle 3 PC Inst mem PC nand Hazard detection 3 3 rega regb data R0 R R2 R3 R4 R5 R6 R PC A L target eq? AL result valb Data memory AL result mdata add op op IF/ ID ID/ E E/ em em/ WB

31 Pipelining & Data Hazards Detect and Forward Hazard detected 3 compare compare compare compare rega regb 3 REG file IF/ ID ID/ E

32 Pipelining & Data Hazards Detect and Forward Hazard detected compare rega 0 regb 3

33 Pipelining & Data Hazards Detect and Forward Handling data hazards: detect and stall the pipeline Detection: until ready Compare rega with previous DestReg 3 bit operand fields Compare regb with previous DestReg Stall: 3 bit operand fields Keep current instructions in fetch and decode Pass a noop to execute

34 Register file First half of cycle 3 Pipelining & Data Hazards Detect and Forward en en PC Inst mem 2 nand Hazard 3 3 rega regb data R0 R R2 R3 R4 R5 R6 R A L target eq? AL result Data memory AL result mdata valb add IF/ ID ID/ E E/ em em/ WB

35 Pipelining & Data Hazards Detect and Forward Handling data hazards: detect and stall the pipeline Detection: until ready Compare rega with previous DestReg 3 bit operand fields Compare regb with previous DestReg Stall: 3 bit operand fields Keep current instructions in fetch and decode Pass a noop to execute

36 Register file Pipelining & Data Hazards Detect and Forward End of cycle 3 PC Inst mem 2 nand rega regb 3 data R0 R R2 R3 R4 R5 R6 R A L 2 Data memory AL result mdata noop add IF/ ID ID/ E E/ em em/ WB

37 Register file First half of cycle 4 Pipelining & Data Hazards Detect and Forward en en PC Inst mem 2 nand Hazard 3 rega regb 3 data R0 R R2 R3 R4 R5 R6 R A L 2 Data memory AL result mdata noop add IF/ ID ID/ E E/ em em/ WB

38 Register file Pipelining & Data Hazards Detect and Forward End of cycle 4 PC Inst mem 2 nand rega regb 3 data R0 R R2 R3 R4 R5 R6 R A L Data memory 2 noop noop add IF/ ID ID/ E E/ em em/ WB

39 Register file Pipelining & Data Hazards Detect and Forward First half of cycle 5 PC Inst mem 2 nand No Hazard 3 rega regb 3 data R0 R R2 R3 R4 R5 R6 R A L Data memory 2 noop noop add IF/ ID ID/ E E/ em em/ WB

40 Pipelining & Data Hazards Detect and Forward Register file End of cycle 5 PC Inst mem 3 add rega regb data R0 R R2 R3 R4 R5 R6 R A L Data memory nand noop noop IF/ ID ID/ E E/ em em/ WB

41 Pipelining & Data Hazards Detect and Forward No more hazard: stalling time add 2 3 nand add nand fetch decode execute memory writeback fetch decode decode decode execute hazard hazard We are careful to get the right value of R3

42 Pipelining & Data Hazards Detect and Forward Problems with detect and stall CPI increases every time a hazard is detected! Is that necessary? Not always! Re-route the result of the add to the nand nand no longer needs to read R3 from reg file It can get the data later (when it is ready) This lets us complete the decode this cycle But we need more control to remember that the data that we aren t getting from the reg file at this time will be found elsewhere in the pipeline at a later cycle.

43 Pipelining & Data Hazards Detect and Forward Handling data hazards: detect and forward Detection: same as detect and stall Except that all 4 hazards are treated differently i.e., you can t logical-or the 4 hazard signals Forward: New datapaths to route computed data to where it is needed New ux and control to pick the right data

44 Pipelining & Data Hazards Detect and Forward Detect and Forward Example add 2 3 nand add lw sw // r3 = r r2 // r5 = r3 NAND r4 // r7 = r3 r6 // r6 = E[r30] // E[r62]=r2

45 Pipelining & Data Hazards Detect and Forward Register file First half of cycle 3 PC Inst mem 2 nand Hazard 3 3 rega regb data R0 R R2 R3 R4 R5 R6 R A L Data memory add IF/ ID fwd fwd fwd ID/ E E/ em em/ WB

46 Pipelining & Data Hazards Detect and Forward Register file End of cycle 3 PC Inst mem 3 add rega regb 3 data R0 R R2 R3 R4 R5 R6 R A L 2 Data memory nand add IF/ ID H ID/ E E/ em em/ WB

47 Pipelining & Data Hazards Detect and Forward Register file First half of cycle 4 PC Inst mem 3 add New Hazard rega regb data R0 R R2 R3 R4 R5 R6 R A L 2 Data memory nand add IF/ ID H ID/ E E/ em em/ WB

48 Pipelining & Data Hazards Detect and Forward Register file End of cycle 4 PC Inst mem 4 lw rega regb 75 3 data R0 R R2 R3 R4 R5 R6 R A L -2 Data memory 2 add nand add IF/ ID H2 ID/ E H E/ em em/ WB

49 Pipelining & Data Hazards Detect and Forward Register file First half of cycle 5 PC Inst mem 4 lw No Hazard 3 rega regb 75 3 data R0 R R2 R3 R4 R5 R6 R A L -2 Data memory 2 add nand add IF/ ID H2 ID/ E H E/ em em/ WB

50 Pipelining & Data Hazards Detect and Forward Register file End of cycle 5 PC Inst mem 5 sw rega regb 67 5 data R0 R R2 R3 R4 R5 R6 R A L 22 Data memory -2 lw add nand IF/ ID ID/ E H2 E/ em H em/ WB

51 Register file First half of cycle 6 Pipelining & Data Hazards Detect and Forward en en PC Inst mem 5 sw Hazard 6 rega regb 67 5 L data R0 R R2 R3 R4 R5 R6 R A L 22 Data memory -2 lw add nand IF/ ID ID/ E H2 E/ em H em/ WB

52 Pipelining & Data Hazards Detect and Forward Register file End of cycle 6 PC Inst mem 5 sw rega regb 6 7 data R0 R R2 R3 R4 R5 R6 R A L 3 Data memory 22 noop lw add IF/ ID ID/ E E/ em H2 em/ WB

53 Pipelining & Data Hazards Detect and Forward Register file First half of cycle 7 PC Inst mem 5 sw Hazard 6 rega regb 6 7 data R0 R R2 R3 R4 R5 R6 R A L 3 Data memory 22 noop lw add IF/ ID ID/ E E/ em H2 em/ WB

54 Register file Pipelining & Data Hazards Detect and Forward End of cycle 7 PC Inst mem rega regb 6 data R0 R R2 R3 R4 R5 R6 R A L Data memory 99 sw noop lw IF/ ID H3 ID/ E E/ em em/ WB

55 Pipelining & Data Hazards Detect and Forward Register file First half of cycle 8 PC Inst mem rega regb 6 data R0 R R2 R3 R4 R5 R6 R A L Data memory 99 sw noop lw IF/ ID H3 ID/ E E/ em em/ WB

56 Pipelining & Data Hazards Detect and Forward Register file End of cycle 8 PC Inst mem rega regb data R0 R R2 R3 R4 R5 R6 R A L Data memory 7 sw noop IF/ ID ID/ E H3 E/ em em/ WB

57 Pipelining & Control Hazards Pipeline function for BEQ Fetch: read instruction from memory Decode: read source operands from reg Execute: calculate target address and test for equality emory: Send target to PC if test is equal Writeback: Nothing left to do

58 Pipelining & Control Hazards Control Hazards beq 0 sub beq sub t 0 t t 2 t 3 t 4 t 5 F D E W F D E W squash

59 Pipelining & Control Hazards Handling Control Hazards (static) No branches? Convert branches to predication Control dependence becomes data dependence (dynamic) Stop fetch until branch resolves Speculate and squash (dynamic) Keep going past branch, throw away instructions if wrong

60 Pipelining & Control Hazards Speculate and Squash Via Predication if (a == b) { x; y = n / d; } sub t a, b jnz t, PC2 add x x, # div y n, d sub t a, b add(!t) x x, # div(!t) y n, d sub t a, b add t2 x, # div t3 n, d cmov(!t) x t2 cmov(!t) y t3

61 Pipelining & Control Hazards Speculate and Squash Handling Control Hazards: Detect & Stall Detection In decode, check if opcode is branch or jump Stall Hold next instruction in Fetch Pass noop to Decode

62 Pipelining & Control Hazards Speculate and Squash PC Inst mem REG file sign ext A L Data memory bnz r IF/ ID Control ID/ E E/ em em/ WB

63 Pipelining & Control Hazards Speculate and Squash Control Hazards beq 0 sub time beq sub fetch decode execute memory writeback fetch fetch fetch fetch Target: or fetch

64 Pipelining & Control Hazards Speculate and Squash Problems with Detect & Stall CPI increases on every branch Are these stalls necessary? Not always! Branch is only taken half the time Assume branch is NOT taken Keep fetching, treat branch as noop If wrong, make sure bad instructions don t complete

65 Pipelining & Control Hazards Speculate and Squash Handling Control Hazards: Speculate & Squash Speculate Not-Taken Assume branch is not taken Squash Overwrite opcodes in Fetch, Decode, Execute with noop Pass target to Fetch

66 Pipelining & Control Hazards Speculate and Squash PC beq sub add nand Inst mem noop add IF/ ID Control REG file sign ext noop sub ID/ E equal A L noop beq E/ em Data memory beq em/ WB

67 Pipelining & Control Hazards Speculate and Squash Problems with Speculate & Squash Always assumes branch is not taken Can we do better? Yes. Predict branch direction and target! Why possible? Program behavior repeats. ore on branch prediction to come...

68 Pipelining & Control Hazards Branch Delay Slot branch: next: target: Branch Delay Slot (IPS, SPARC) t 0 t t 2 t 3 t 4 t 5 F D E W F Squash F D E - Instruction in delay slot executes even on taken branch branch: delay: target: i: beq, 2, tgt j: add 3, 4, 5 What can we put here? W F D E W F D E W F D E W

69 Improving pipeline performance Improving pipeline performance Add more stages Widen pipeline

70 Improving pipeline performance Adding pipeline stages Pipeline frontend Fetch, Decode Pipeline middle Execute Pipeline backend emory, Writeback

71 Improving pipeline performance Adding stages to fetch, decode Delays hazard detection No change in forwarding paths No performance penalty with respect to data hazards

72 Improving pipeline performance Adding stages to execute Check for structural hazards AL not pipelined ultiple AL ops completing at same time Data hazards may cause delays If multicycle op hasn't computed data before the dependent instruction is ready to execute Performance penalty for each stall

73 Improving pipeline performance Adding stages to memory, writeback Instructions ready to execute may need to wait longer for multi-cycle memory stage Adds more pipeline registers Thus more source registers to forward ore complex hazard detection Wider muxes ore control bits to manage muxes

74 Improving pipeline performance Wider pipelines fetch decode execute mem WB fetch decode execute mem WB ore complex hazard detection 2 pipeline registers to forward from 2 more instructions to check 2 more destinations (muxes) Need to worry about dependent instructions in the same stage

75 aking forwarding explicit add r r2, E/em AL result Include direct mux controls into the ISA Hazard detection is now a compiler task New micro-architecture leads to new ISA Is this why this approach always seems to fail? (e.g., simple VLIW, otorola 88k) Can reduce some resources Eliminates complex conflict checkers

EECS 470 Lecture 4. Pipelining & Hazards II. Fall 2018 Jon Beaumont

EECS 470 Lecture 4. Pipelining & Hazards II. Fall 2018 Jon Beaumont GAS STATION Pipelining & Hazards II Fall 208 Jon Beaumont http://www.eecs.umich.edu/courses/eecs470 Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, artin, Roth, Shen, Smith,