EECS 470 Lecture 2. Performance, Power & ISA. Fall 2018. Jon Beaumont

1 Performance, Power & ISA Fall 2018 Jon Beaumont Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Mudge, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, and Wenisch of Carnegie Mellon University, Purdue University, University of Michigan, University of Pennsylvania, and University of Wisconsin. Slide 1

2 Warm Up Riddle A car must drive 2 miles. It drives with an average speed of V1 the first mile. How fast must it travel during the second mile so that its total average speed is twice that of the first mile (i.e. V_Total = 2*V1)? (Vote here: etc.ch/zwn) a) b) ½ V1 c) 2 V1 d) 4 V1 e) Other Slide 2

3 Class logistics Last Time Discussed high level goals of computer architecture: performance; power; cost, security, ease of programmability, etc. Discussed how to increase program performance: mostly through adding parallelism. Limits of parallelism: Amdahl's Law. Slide 3

4 Today Dive into performance metrics a bit more Quantifying performance (throughput and latency) Discuss arithmetic of averages ISA overview Von Neumann architecture CISC vs RISC Power and Energy Start on 5-stage processor and pipeline review Slide 4

5 Administrative Lab 1 due Thursday at 4:29pm. Check off with GSI in OH. Project 1 due Saturday at 11:59pm. 9 submissions so far. Don't leave to the last minute. HW 1 due next Tuesday (9/18) at 11:59pm. Submit to Gradescope (see website). Should cover all material by Wednesday. Everyone have access to Canvas/Piazza/Gradescope? Do I have everyone's picture? Slide 5

6 Performance and Power Trends Source: Chris Batten Dissertation, MIT (2010) Slide 6

7 Performance Two definitions. Latency (execution time): time to finish a fixed task. Throughput (bandwidth): number of tasks in fixed time. Very different: throughput can exploit parallelism, latency can't (baking bread analogy). Often contradictory. Choose definition to match measurement goals. Example: move people from A to B, 10 miles. Car: capacity = 5, speed = 60 miles/hour. Bus: capacity = 60, speed = 20 miles/hour. Latency: car = 10 min, bus = 30 min. Throughput: car = 30 PPH, bus = 120 PPH (PPH = people per hour). Slide 7

8 Performance Improvement Processor A is X times faster than processor B if Latency(P,A) = Latency(P,B) / X and Throughput(P,A) = Throughput(P,B) * X. Processor A is X% faster than processor B if Latency(P,A) = Latency(P,B) / (1+X/100) and Throughput(P,A) = Throughput(P,B) * (1+X/100). Car/bus example: Latency? Car is 3 times (and 200%) faster than bus. Throughput? Bus is 4 times (and 300%) faster than car. Slide 8
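The car/bus comparison can be checked with a few lines of Python. This is a minimal sketch, assuming the example means a 10-mile trip, a car carrying 5 people at 60 miles/hour, a bus carrying 60 people at 20 miles/hour, and throughput counted over one-way trips only; the function names are illustrative and not from the lecture.

```python
def latency_min(distance_miles, speed_mph):
    """One-way trip time in minutes."""
    return distance_miles * 60 / speed_mph

def throughput_pph(capacity, distance_miles, speed_mph):
    """People delivered per hour, ignoring the return trip (assumption)."""
    return capacity * speed_mph / distance_miles

def times_faster(better, worse):
    """Ratio X such that the better value is 'X times faster'."""
    return better / worse

def percent_faster(better, worse):
    """Same comparison expressed as 'X% faster'."""
    return (better / worse - 1) * 100

DISTANCE = 10  # miles (assumed)
car_lat, bus_lat = latency_min(DISTANCE, 60), latency_min(DISTANCE, 20)
car_tp, bus_tp = throughput_pph(5, DISTANCE, 60), throughput_pph(60, DISTANCE, 20)

# For latency, lower is better, so the ratio is slow / fast.
print("car latency advantage:", times_faster(bus_lat, car_lat), "x,",
      percent_faster(bus_lat, car_lat), "%")
# For throughput, higher is better, so the ratio is high / low.
print("bus throughput advantage:", times_faster(bus_tp, car_tp), "x,",
      percent_faster(bus_tp, car_tp), "%")
```

The two metrics rank the vehicles in opposite order, which is why the definition has to match the measurement goal.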

9 Latency vs Throughput What are three computing applications where we care mostly about throughput? What about latency? Slide 9

10 Averaging Performance Numbers I You can add latencies, but not throughput. Latency(P1+P2, A) = Latency(P1,A) + Latency(P2,A). Throughput(P1+P2,A) != Throughput(P1,A) + Throughput(P2,A). E.g., 1 mile @ 30 miles/hour + 1 mile @ 90 miles/hour: average is not 60 miles/hour. 0.033 hours at 30 miles/hour + 0.011 hours at 90 miles/hour: average is only 45 miles/hour! (2 miles / (0.033 + 0.011 hours)) Slide 10

11 Averaging Performance Numbers II Latency(P1+P2, A) = Latency(P1,A) + Latency(P2,A). Throughput(P1+P2,A) = $1 / \left( \frac{1}{\text{Throughput}(P1,A)} + \frac{1}{\text{Throughput}(P2,A)} \right)$. Three averaging techniques. Arithmetic: $\frac{1}{N} \sum_{P=1}^{N} \text{Latency}(P)$, for times: units proportional to time (e.g., latency). Harmonic: $N / \sum_{P=1}^{N} \frac{1}{\text{Throughput}(P)}$, for rates: units inversely proportional to time (e.g., throughput), unless time is fixed. Geometric: $\sqrt[N]{\prod_{P=1}^{N} \text{Speedup}(P)}$, for ratios: unitless quantities (e.g., speedups). Slide 11
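The three averaging techniques can be written out directly. The sketch below is illustrative (the function names are not from the lecture); it applies the harmonic mean to the case of one mile at 30 miles/hour followed by one mile at 90 miles/hour, where equal distances are covered at each rate.

```python
import math

def arithmetic_mean(times):
    """For quantities proportional to time, e.g. latencies."""
    return sum(times) / len(times)

def harmonic_mean(rates):
    """For rates, e.g. throughputs, when each rate covers an equal amount of work."""
    return len(rates) / sum(1 / r for r in rates)

def geometric_mean(ratios):
    """For unitless ratios, e.g. speedups."""
    return math.prod(ratios) ** (1 / len(ratios))

# Two 1-mile legs at 30 and 90 miles/hour: the average speed is the
# harmonic mean of the two rates, not their arithmetic mean.
print(harmonic_mean([30, 90]))    # close to 45
print(arithmetic_mean([30, 90]))  # 60.0, the tempting but wrong answer
```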

12 The Iron Law of Processor Performance Processor Performance = $\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$, i.e. (code size) × (CPI) × (cycle time). Architecture --> Implementation --> Realization: Compiler Designer, Processor Designer, Chip Designer. Slide 12
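As a worked instance of the iron law, the sketch below multiplies the three factors for a hypothetical program. The instruction count, CPI, and clock values are made-up illustration numbers, not figures from the lecture.

```python
def execution_time_s(instructions, cpi, clock_hz):
    """Iron law: time/program = instructions/program * cycles/instruction * seconds/cycle."""
    seconds_per_cycle = 1 / clock_hz
    return instructions * cpi * seconds_per_cycle

# Hypothetical program: 1 billion dynamic instructions, CPI of 2, 500 MHz clock.
base = execution_time_s(1_000_000_000, 2, 500e6)
# Halving any one factor halves the execution time.
fewer_insns = execution_time_s(500_000_000, 2, 500e6)
lower_cpi = execution_time_s(1_000_000_000, 1, 500e6)
faster_clock = execution_time_s(1_000_000_000, 2, 1000e6)
print(base, fewer_insns, lower_cpi, faster_clock)
```

Each factor is owned by a different designer (compiler, processor, chip), so an improvement in one can be cancelled by a regression in another.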

13 Danger: Partial Performance Metrics Micro-architects often ignore dynamic instruction count. Typically work in one ISA/one compiler, so treat it as fixed. Iron law reduces to seconds / instruction = (cycles / instruction) * (seconds / cycle). MIPS (millions of instructions per second) = instructions / second * 10^-6. Cycles / second: clock frequency (in MHz). Example: CPI = 2, clock = 500 MHz, what is MIPS? 0.5 * 500 MHz * 10^-6 = 250 MIPS. Problems: compiler removes instructions, program gets faster; however, MIPS goes down (misleading). Slide 13

14 Danger: Partial Performance Metrics II Micro-architects often ignore instructions/program, but the general public (mostly) also ignores CPI. Equates clock frequency with performance!! Which processor would you buy? Processor A: CPI = 2, clock = 500 MHz. Processor B: CPI = 1, clock = 300 MHz. Probably A, but B is faster (assuming same ISA/compiler). Classic example: 800 MHz Pentium III faster than 1 GHz Pentium 4. Same ISA and compiler. Slide 14
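Both partial-metric pitfalls can be reproduced numerically. The sketch below assumes the processors are A (CPI 2, 500 MHz) and B (CPI 1, 300 MHz); the "optimized" program at the end uses hypothetical numbers chosen only to show MIPS falling while run time improves.

```python
def mips(cpi, clock_hz):
    """Millions of instructions per second = (clock / CPI) / 10^6."""
    return clock_hz / cpi / 1e6

def run_time_s(instructions, cpi, clock_hz):
    return instructions * cpi / clock_hz

mips_a = mips(2, 500e6)   # higher clock, but 2 cycles per instruction
mips_b = mips(1, 300e6)   # lower clock, but 1 cycle per instruction
print(mips_a, mips_b)     # B retires more instructions per second

# Hypothetical compiler optimization on a 500 MHz machine: 20% fewer
# instructions, but the remaining ones average a slightly higher CPI.
before = (run_time_s(1e9, 2.0, 500e6), mips(2.0, 500e6))
after = (run_time_s(0.8e9, 2.2, 500e6), mips(2.2, 500e6))
print(before, after)      # time drops while MIPS also drops
```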

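15 Performance Key Points Amdahl's law: $S_{overall} = \frac{1}{(1 - f) + \frac{f}{S}}$. Iron law: $\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$. Averaging techniques: Arithmetic (time): $\frac{1}{n} \sum_{i=1}^{n} \text{Time}_i$. Harmonic (rates): $n / \sum_{i=1}^{n} \frac{1}{\text{Rate}_i}$. Geometric (ratios): $\sqrt[n]{\prod_{i=1}^{n} \text{Ratio}_i}$. Slide 15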
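Amdahl's law is easy to explore numerically. In this sketch, f is taken to be the fraction of execution time that benefits from the enhancement and S the speedup of that fraction; the sample values are illustrative.

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the work is sped up by a factor s."""
    return 1 / ((1 - f) + f / s)

for f in (0.5, 0.9, 0.99):
    # Even a very large s cannot push the speedup past 1 / (1 - f).
    print(f, amdahl_speedup(f, 2), amdahl_speedup(f, 1e9), 1 / (1 - f))
```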

16 Instruction Set Architecture Slide 16

17 Instruction Set Architecture "Instruction set architecture (ISA) is the structure of a computer that a machine language programmer (or a compiler) must understand to write a correct (timing independent) program for that machine" (IBM, introducing the 360 in 1964). IBM 360 is a family of binary-compatible machines with distinct microarchitectures and technologies, ranging from Model 30 (8-bit datapath, up to 64KB memory) to Model 70 (64-bit datapath, 512KB memory) and later Model 360/91 (the Tomasulo). IBM 360 replaced 4 concurrent, but incompatible lines of IBM architectures developed over the previous 10 years. Slide 17

18 ISA: A contract between HW and SW ISA (instruction set architecture) A well-defined hardware/software interface The contract between software and hardware Functional definition of operations, modes, and storage locations supported by hardware Precise description of how to invoke, and access them No guarantees regarding How operations are implemented Which operations are fast and which are slow and when Which operations take more power and which take less Slide 18

19 von Neumann Model of a Computer Key idea: memory contains both instructions and data. Instructions can be operated on as if they are data. Self-modifying code is mostly discouraged now. But compilers take as input a program and produce another program! Turing machines are von Neumann machines. Slide 19

20 Sequential Model of Computing Each instruction is executed one after the other. Branch instructions can change this, done conditionally. Tied to a program counter. The microarchitectures that we will study conform to the sequential execution model, but under the hood they execute instructions out-of-order (OoO). Other models? Dataflow? Slide 20

21 Components of an ISA Programmer-visible states: program counter, general purpose registers, memory, control registers. Programmer-visible behaviors (state transitions): what to do, when to do it. Example register-transfer-level description of an instruction: if imem[pc] == "add rd, rs, rt" then pc ← pc+1; gpr[rd] = gpr[rs] + gpr[rt]. A binary encoding. ISAs last 25+ years (because of SW cost), so be careful what goes in. Slide 21
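The register-transfer-level description of add can be turned into a tiny executable model of programmer-visible state. This is a sketch only: the dictionary layout, the tuple encoding of instructions, and the step function are hypothetical choices for illustration, not part of any real ISA definition.

```python
def step(state):
    """Apply one instruction's state transition to the programmer-visible state."""
    op, rd, rs, rt = state["imem"][state["pc"]]
    if op == "add":
        # gpr[rd] = gpr[rs] + gpr[rt]; pc <- pc + 1
        state["gpr"][rd] = state["gpr"][rs] + state["gpr"][rt]
        state["pc"] += 1
    else:
        raise ValueError(f"instruction not modeled in this sketch: {op}")
    return state

state = {
    "pc": 0,                          # program counter
    "gpr": [0, 5, 7, 0, 0, 0, 0, 0],  # eight general purpose registers
    "imem": [("add", 3, 1, 2)],       # add r3, r1, r2
}
step(state)
print(state["pc"], state["gpr"][3])
```

The model says nothing about how long the add takes or how it is implemented, which is exactly what the ISA leaves unspecified.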

22 RISC vs CISC Recall Iron law: (instructions/program) * (cycles/instruction) * (seconds/cycle) CISC (Complex Instruction Set Computing) Improve instructions/program with complex instructions Easy for assembly-level programmers, good code density RISC (Reduced Instruction Set Computing) Improve cycles/instruction with many single-cycle instructions Increases instruction/program, but hopefully not as much Help from smart compiler Perhaps improve clock cycle time (seconds/cycle) via aggressive implementation allowed by simpler instructions Slide 22

23 What Makes a Good ISA? Programmability: easy to express programs efficiently? Implementability: easy to design high-performance implementations? More recently: easy to design low-power implementations? Easy to design high-reliability implementations? Easy to design low-cost implementations? Compatibility: easy to maintain programmability (implementability) as languages and programs (technology) evolve? x86 (IA32) generations: 8086, 286, 386, 486, Pentium, PentiumII, PentiumIII, Pentium4, ... Slide 23

24 Typical Instructions (Opcodes)
Type                     Example Instruction
Arithmetic and logical   and, add
Data transfer            move, load
Control                  branch, jump, call, return
System                   trap, rett
Floating point           add, mul, div, sqrt
Decimal                  addd, convert
String                   move, compare
What operations are necessary? {sub, ld & st, conditional br.} What is the minimum complete ISA for a von Neumann machine? Too little or too simple: not expressive enough; difficult to program (by hand); programs tend to be bigger. Too much or too complex: most of it won't be used; too much baggage for implementation; difficult choices during compiler optimization. Slide 24

25 Power Slide 25

26 Introduction Why is power a problem in a μp? Power used by the μp, vs. system power. Dissipating heat: melting (very bad); packaging (to cool: $); heat leads to poorer performance. Providing power: battery; cost of electricity. Slide 26

27 Where does the juice go in laptops? Others have measured ~55% processor increase under max load in laptops [Hsu+Kremer, 2002] Slide 27

28 What about servers? SunFire T2000: DRAM >20%, growing; CPU <25%, shrinking. AC to DC only 60-90% efficient. Need whole-system approaches to save energy. [Figure: pie chart of server power by component: Processor, Memory, I/O, Disk, Services, Fans, AC/DC Conversion] Slide 28

29 Why is power a problem? Why worry about power dissipation? Battery life Thermal issues: affect cooling, packaging, reliability, timing Environment Slide 29

30 Why is power a problem? Total Power Dissipation Trends [Figure: power density (W/cm²) of Pentium, Pentium Pro, Pentium 2, Pentium 3, Pentium 4, and Pentium 4 (Prescott), compared against a hot plate and a nuclear reactor] Slide 30

31 Why is power a problem? Spot Heat Issues in Microprocessors Slide 31

32 Why is power a problem? Packaging cost: complex and expensive (note heatpipe). Source: H. Xie et al., "Packaging the Itanium Microprocessor," Electronic Components and Technology Conference 2002. Slide 32

33 Temperature/di-dt-Constrained Power-Aware Computing Applications Energy-Constrained Computing Slide 33

34 Data center energy use Installed base grows 11%/yr. By 2011, 2.5% of US energy, $7.4 billion/yr. (Source: US EPA) 0.5% of world CO2 emissions; rivals entire Czech Republic. (Source: Mankoff et al., IEEE Computer) Improving energy efficiency is a critical challenge. [Figure: CO2 emissions (mil. metric tons) for Nigeria, data centers, and the Czech Republic] Slide 34

35 Where does all the power go? IT Equipment 52%, Cooling 38%, UPS 5%, Power Delivery 4%, Lighting 1% (Source: Liebert 2007). Servers account for barely half of power. 1W of cooling per 1.5W of IT load. 1W data center: cooling costs $4 to $8 / yr. System designers must think about cooling. Slide 35

36 Why is power a problem? Power-awareness is needed across all computing platforms. Mobile/portable (cell phones, laptops, PDA): battery life is critical. Desktops/Set-Top (PCs and game machines): packaging cost is critical. Servers (mainframes and compute-farms): packaging limits; volumetric (performance density). Slide 36

37 What uses power in a chip Slide 37

38 What uses power in a chip? How CMOS Transistors Work Slide 38

39 What uses power in a chip? MOS Transistors are Switches Slide 39

40 What uses power in a chip? Power: The Basics Dynamic power vs. static power. Dynamic: switching power. Static: leakage power. Dynamic power dominates, but static power is increasing in importance. Static power: steady, per-cycle energy cost. Dynamic power: capacitive and short-circuit. Capacitive power: charging/discharging at transitions from 0→1 and 1→0. Short-circuit power: power due to brief short-circuit current during transitions. Slide 40

41 What uses power in a chip? Dynamic (Capacitive) Power Dissipation [Figure: CMOS gate with input V_IN, output V_OUT, current I, and load capacitance C_L] Data dependent: a function of switching activity. Slide 41

42 What uses power in a chip? Capacitive Power dissipation Power ~ ½ C V² A f. Capacitance (C): function of wire length, transistor size. Supply voltage (V): has been dropping with successive fab generations. Activity factor (A): how often, on average, do wires switch? Clock frequency (f): increasing. Slide 42
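The capacitive power relation, Power ~ ½ C V² A f, can be evaluated directly to see which knob matters most. The component values in this sketch are hypothetical, chosen only to make the scaling visible.

```python
def dynamic_power_w(c_farads, v_volts, activity, f_hz):
    """Capacitive switching power: 0.5 * C * V^2 * A * f."""
    return 0.5 * c_farads * v_volts ** 2 * activity * f_hz

# Hypothetical chip: 100 nF switched capacitance, 1.5 V, 10% activity, 1 GHz.
base = dynamic_power_w(100e-9, 1.5, 0.1, 1e9)
half_f = dynamic_power_w(100e-9, 1.5, 0.1, 0.5e9)   # halve frequency
half_v = dynamic_power_w(100e-9, 0.75, 0.1, 1e9)    # halve voltage
print(base, half_f / base, half_v / base)
```

Power is linear in C, A, and f but quadratic in V, which is why supply voltage has been the main lever across fab generations.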

43 What uses power in a chip? Power vs. Energy Power consumption in watts: rate at which energy is consumed over time; determines battery life in hours; sets packaging limits. Energy efficiency in joules: energy = power * delay (joules = watts * seconds); a lower energy number means less power to perform a computation at the same frequency. Slide 43
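The relation energy = power * delay means that a lower-power design is not automatically a lower-energy one. The two designs in this sketch are hypothetical and exist only to show the distinction.

```python
def energy_joules(power_watts, delay_seconds):
    """Energy = power * delay (joules = watts * seconds)."""
    return power_watts * delay_seconds

# Hypothetical designs running the same task.
design_a = energy_joules(80, 1.0)   # higher power, finishes sooner
design_b = energy_joules(64, 1.5)   # lower power, takes longer
print(design_a, design_b)           # B draws less power but uses more energy
```

Design B is easier to cool (power), while design A drains less battery per task (energy).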

44 What uses power in a chip? Power vs. Energy Slide 44

45 Energy vs Power What are three computing applications where we care about energy more than power? What about power over energy? Slide 45

46 What uses power in a chip? Voltage Scaling Scenario: 80W, 1 BIPS, 1.5V, 1GHz. Cache optimization: IPC decreases by 10%, reduces power by 20% => final processor: 900 MIPS, 64W. What if we just adjust frequency/voltage on processor? How to reduce power by 20%? P = CV²F = CV³ => drop voltage by 7% (and also frequency) => .93*.93*.93 = .8x. So for equal power (64W): cache optimization = 900 MIPS; simple voltage/frequency scaling = 930 MIPS. Slide 46
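The voltage-scaling arithmetic can be checked with a short script. It is a sketch under the assumptions that the baseline is 80 W at 1000 MIPS, that frequency scales in proportion to voltage (so power goes as the cube of the scale factor), and that performance scales linearly with frequency.

```python
def power_after_scaling(power_w, scale):
    """P = C * V^2 * F with F proportional to V, so P scales as scale^3."""
    return power_w * scale ** 3

def scale_for_power_ratio(ratio):
    """Voltage/frequency scale factor needed to reach a target power ratio."""
    return ratio ** (1 / 3)

BASE_POWER_W, BASE_MIPS = 80, 1000   # assumed baseline

scaled_power = power_after_scaling(BASE_POWER_W, 0.93)
scaled_mips = BASE_MIPS * 0.93       # performance tracks frequency (assumption)
print(scaled_power, scaled_mips)     # roughly 64 W at 930 MIPS
print(scale_for_power_ratio(0.8))    # roughly 0.93, i.e. a 7% drop
```

At the same ~64 W budget, plain voltage/frequency scaling keeps about 930 MIPS, beating the 900 MIPS of the cache optimization.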

47 Multicore: Solution to Power-constrained design? Power scales roughly cubically with frequency. Scale clock frequency to 80%. Now add a second core. Same power budget, but 1.6x performance! But: must parallelize application. Remember Amdahl's Law! [Figure: performance and power bars] Slide 47
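The multicore argument combines the cubic power model with Amdahl's law. The sketch below assumes power per core scales as the cube of the frequency scale factor and that single-thread performance scales linearly with frequency; the parallel_fraction parameter is an illustrative addition for the Amdahl caveat.

```python
def multicore(freq_scale, cores, parallel_fraction=1.0):
    """Return (relative power, relative performance) versus one full-speed core."""
    power = cores * freq_scale ** 3
    amdahl = 1 / ((1 - parallel_fraction) + parallel_fraction / cores)
    return power, freq_scale * amdahl

print(multicore(0.8, 2))                         # about the same power, 1.6x performance
print(multicore(0.8, 2, parallel_fraction=0.5))  # barely faster than one fast core
```

The 1.6x figure only holds for a fully parallel application; with half the work serial, the two slower cores give almost no gain.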

48 The Execution Core: Pipelining Slide 48

49 Outline: Understanding the Execution Core 1. 5-stage pipeline (review) 2. Implementing pipeline interlocks (review) 3. Scoreboard scheduling (CDC 6600) 4. Tomasulo's OoO scheduling algorithm (IBM 360) 5. Precise interrupts with a Reorder Buffer (P6, Core) 6. Modern OoO (MIPS R10K, Alpha 21264, Netburst) Slide 49

50 Before there was pipelining Single-cycle: insn0.fetch, dec, exec | insn1.fetch, dec, exec. Multi-cycle: insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec. Basic datapath: fetch, decode, execute. Single-cycle control: hardwired; + low CPI (1); - long clock period (to accommodate slowest instruction). Multi-cycle control: micro-programmed; + short clock period; - high CPI. Slide 50

51 Speeding Up Remember, three ways to speed up a process: reduce number of tasks (possible?); decrease latency of tasks (possible?); parallelize. How do we parallelize this pipeline? [Figure: single-cycle and multi-cycle timelines for insn0 and insn1] Slide 51

52 Parallelize Duplicate pipeline (superscalar): effective, but expensive (>2x hardware overhead). Discuss more later in semester. Or pipeline! [Figure: fetch, dec, exec timelines for insn0 and insn1, shown sequentially, duplicated side by side, and overlapped] Slide 52

53 Pipelining Important performance technique. Improves throughput at the expense of latency. Why does latency go up? Begin with multi-cycle design. When an instruction advances from stage 1 to 2, allow the next instruction to enter stage 1. Each instruction still passes through all stages. + But instructions enter and leave at a much faster rate. Not much hardware overhead (what needs to be added?) [Figure: multi-cycle vs. pipelined timelines for insn0 and insn1] Slide 53
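The throughput benefit of overlapping instructions can be counted in cycles. This sketch assumes an ideal pipeline with no stalls or hazards and equal-length stages; the function names are illustrative.

```python
def multicycle_cycles(n_insns, stages):
    """Multi-cycle design: each instruction finishes all stages before the next starts."""
    return n_insns * stages

def pipelined_cycles(n_insns, stages):
    """Ideal pipeline: fill for `stages` cycles, then one instruction completes per cycle."""
    return stages + (n_insns - 1)

for n in (1, 5, 1000):
    mc, pl = multicycle_cycles(n, 5), pipelined_cycles(n, 5)
    print(n, mc, pl, mc / pl)   # speedup approaches the stage count for large n
```

A single instruction still needs all five stages, so latency does not improve; only the completion rate does.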

54 Pipeline Illustrated: [Figure: a latch (L) followed by combinational logic with n gate delays has BW = ~(1/n); two latched stages of n/2 gate delays each give BW = ~(2/n); three latched stages of n/3 gate delays each give BW = ~(3/n)] Slide 54

55 370 Processor Pipeline Review Fetch, Decode, Execute, Memory, (Write-back). [Figure: +4 adder, PC, I-cache, Reg File, ALU, D-cache] T_pipeline = T_base / 5 Slide 55

56 Stage 1: Fetch Fetch an instruction from memory every cycle. Use PC to index memory. Increment PC (assume no branches for now). Write state to the pipeline register (IF/ID). The next stage will read this pipeline register. Note that the pipeline register must be edge triggered. Slide 56

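57 [Figure: Stage 1 (Fetch) datapath. The PC (with enable) indexes the Instruction Memory/Cache; an adder computes PC + 1; the instruction bits and PC + 1 are written to the IF/ID pipeline register (with enable), which feeds the rest of the pipelined datapath.] Slide 57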

58 Stage 2: Decode Decodes opcode bits. May set up control signals for later stages. Read input operands from the register file, specified by the regA and regB instruction bits. Write state to the pipeline register (ID/EX): opcode; register contents; offset & destination fields; PC+1 (even though decode didn't use it). Slide 58

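59 [Figure: Stage 2 (Decode) datapath. The IF/ID pipeline register supplies the instruction bits and PC + 1 from the Stage 1 (Fetch) datapath; regA and regB index the Register File (which also has destReg, data, and enable inputs); control signals, PC + 1, contents of regA, and contents of regB are written to the ID/EX pipeline register, which feeds the rest of the pipelined datapath.] Slide 59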

60 Stage 3: Execute Perform ALU operation. Input operands can be: contents of regA or regB; offset field on the instruction. Branches: calculate PC+1+offset. Write state to the pipeline register (EX/Mem): ALU result, contents of regB, and PC+1+offset; instruction bits for opcode and destReg specifiers. Slide 60

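61 [Figure: Stage 3 (Execute) datapath. From the ID/EX pipeline register (fed by the Stage 2 Decode datapath), an adder computes PC+1+offset and the ALU operates on the contents of regA and regB; control signals, PC+1+offset, ALU result, and contents of regB are written to the EX/Mem pipeline register, which feeds the rest of the pipelined datapath.] Slide 61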

62 Stage 4: Memory Operation Perform data cache access for memory ops. ALU result contains address for ld and st. Opcode bits control mem R/W and enable signals. Write state to the pipeline register (Mem/WB): ALU result and MemData; instruction bits for opcode and destReg specifiers. Slide 62

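63 [Figure: Stage 4 (Memory) datapath. From the EX/Mem pipeline register (fed by the Stage 3 Execute datapath), the ALU result addresses the Data Memory (enable and R/W inputs) and the contents of regB supply the store data; PC+1+offset goes back to the MUX before the PC in stage 1, along with the MUX control for the PC input; control signals, ALU result, and memory read data are written to the Mem/WB pipeline register, which feeds the rest of the pipelined datapath.] Slide 63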

64 Stage 5: Write back Writing result to register file (if required). Write MemData to destReg for ld instruction. Write ALU result to destReg for arithmetic instruction. Opcode bits control register write enable signal. Slide 64

65 [Figure: Stage 5 (Write back) datapath. From the Mem/WB pipeline register (fed by the Stage 4 Memory datapath), the memory read data or ALU result goes back to the data input of the register file; control signals drive the register write enable; instruction bits 0-2 go back to the destination register specifier.] Slide 65

66 Sample Code (Simple) Run the following code on a pipelined datapath:
add ; reg 3 = reg 1 + reg 2
nand ; reg 6 = ~(reg 4 & reg 5)
lw ; reg 4 = Mem[reg2+2]
add ; reg 5 = reg 2 + reg 5
sw ; Mem[reg3+1] = reg 7
Slide 66

67 [Figure: complete 5-stage pipelined datapath: PC, instruction memory, register file (R0-R7), ALU, and data memory, separated by the IF/ID, ID/EX, EX/Mem, and Mem/WB pipeline registers, which carry op, dest, offset, valA, valB, PC+1, target, ALU result, and mdata] Slide 67

68 [Figure: pipelined datapath, initial state. IF/ID, ID/EX, EX/Mem, and Mem/WB all hold noop.] Slide 68

69 [Figure: pipelined datapath, Time: 1. Fetch: add. IF/ID = add; ID/EX, EX/Mem, Mem/WB = noop.] Slide 69

70 [Figure: pipelined datapath, Time: 2. Fetch: nand. IF/ID = nand; ID/EX = add; EX/Mem, Mem/WB = noop.] Slide 70

71 [Figure: pipelined datapath, Time: 3. Fetch: lw. IF/ID = lw; ID/EX = nand; EX/Mem = add; Mem/WB = noop.] Slide 71

72 [Figure: pipelined datapath, Time: 4. Fetch: add. IF/ID = add; ID/EX = lw; EX/Mem = nand; Mem/WB = add.] Slide 72

73 [Figure: pipelined datapath, Time: 5. Fetch: sw. IF/ID = sw; ID/EX = add; EX/Mem = lw; Mem/WB = nand.] Slide 73

74 [Figure: pipelined datapath, Time: 6. No more instructions. ID/EX = sw; EX/Mem = add; Mem/WB = lw.] Slide 74

75 [Figure: pipelined datapath, Time: 7. No more instructions. EX/Mem = sw; Mem/WB = add.] Slide 75

76 [Figure: pipelined datapath, Time: 8. No more instructions. Mem/WB = sw.] Slide 76

77 [Figure: pipelined datapath, Time: 9. No more instructions. The last instruction (sw) leaves the pipeline.] Slide 77

78 Time graphs
Time: 1          2          3          4          5          6          7          8          9
add   fetch      decode     execute    memory     writeback
nand             fetch      decode     execute    memory     writeback
lw                          fetch      decode     execute    memory     writeback
add                                    fetch      decode     execute    memory     writeback
sw                                                fetch      decode     execute    memory     writeback
Slide 78
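A time graph like the one on this slide can be generated mechanically. The sketch below assumes an ideal 5-stage pipeline in which one instruction is fetched per cycle and nothing stalls (hazards are ignored); the helper names are illustrative.

```python
STAGES = ["fetch", "decode", "execute", "memory", "writeback"]

def stage_at(insn_index, cycle):
    """Stage occupied by the instruction at 0-based insn_index during a 1-based cycle, or None."""
    k = cycle - 1 - insn_index
    return STAGES[k] if 0 <= k < len(STAGES) else None

def time_graph(insns):
    """Render one row per instruction and one column per cycle."""
    total_cycles = len(insns) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(insns):
        cells = [stage_at(i, c) or "" for c in range(1, total_cycles + 1)]
        rows.append(f"{name:<6}" + "".join(f"{cell:<11}" for cell in cells).rstrip())
    return "\n".join(rows)

print(time_graph(["add", "nand", "lw", "add", "sw"]))
```

Five instructions through five stages finish in 5 + (5 - 1) = 9 cycles, matching the nine time steps in the walkthrough.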
