ECE 550D Fundamentals of Computer Systems and Engineering. Fall 2017

1 ECE 550D Fundamentals of Computer Systems and Engineering, Fall 2017
Datapaths
Prof. John Board, Duke University
Slides are derived from work by Profs. Tyler Bletsch and Andrew Hilton (Duke) and Amir Roth (Penn)

2 What did we do last time?
Last time: MIPS assembly.
Practice translating C to assembly together.
Using functions.
Calling conventions: jal (call), jr (return).

3 Now: confluence of MIPS + digital logic
Start of semester: digital logic, the building blocks of digital design.
Most recently: MIPS assembly and the ISA, the lowest level of software.
Now: where they meet.
Datapaths: hardware implementation of processors.
By the way: homework 4 = build a datapath, with some components from the TAs.

4 Necessary ingredient: the ALU
ALU: Arithmetic/Logic Unit.
Performs any supported math or logic operation on two inputs.
Which operation is chosen by a third input.
[Figure: ALU block with inputs A, B, and op, and output out]

5 Add/Subtract With Overflow Detection
[Figure: chain of n full adders with inputs a_{n-1} b_{n-1}, a_{n-2} b_{n-2}, ..., a_1 b_1, a_0 b_0, an Add/Sub control input, sum outputs S_{n-1}, S_{n-2}, ..., S_1, S_0, and an Overflow output]
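
The adder chain on this slide can be sketched in software. The following is a minimal sketch, not the course's reference design: the function names, the default 8-bit width, and the choice to detect overflow as the XOR of the carry into and the carry out of the top bit are assumptions, since the original figure did not survive.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum_bit, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def add_sub(a, b, sub, n=8):
    """n-bit add/subtract built from a chain of full adders.

    sub=0 computes a + b; sub=1 computes a - b by inverting every bit of b
    and feeding 1 into the first carry-in (two's complement negation).
    Returns (result_bits, overflow), where overflow is signed overflow.
    """
    carry = sub
    carry_into_msb = 0
    result = 0
    for i in range(n):
        ai = (a >> i) & 1
        bi = ((b >> i) & 1) ^ sub  # XOR with sub inverts b when subtracting
        if i == n - 1:
            carry_into_msb = carry
        s, carry = full_adder(ai, bi, carry)
        result |= s << i
    # Assumed overflow rule: carry into the top bit differs from carry out.
    overflow = carry_into_msb ^ carry
    return result, overflow


if __name__ == "__main__":
    print(add_sub(100, 27, 0))  # fits in 8 signed bits
    print(add_sub(100, 28, 0))  # exceeds +127, so overflow is flagged
```

One Add/Sub control bit does double duty here: it inverts b and supplies the initial carry-in, so the same adder chain serves both operations.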

6 ALU Slice
[Figure: one-bit ALU slice with inputs a, b, Cin, Add/sub, and F, and outputs Q and Cout]
Add/sub  F  Q
0        0  a + b
1        0  a - b
-        1  NOT b
-        2  a OR b
-        3  a AND b

7 The ALU
[Figure: n ALU slices chained together, with inputs a_{n-1} b_{n-1}, a_{n-2} b_{n-2}, ..., a_1 b_1, a_0 b_0 and a shared ALU control (op), producing Q_{n-1}, Q_{n-2}, ..., Q_1, Q_0 as the ALU output, plus Overflow and "Is non-zero?" outputs]
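
As a word-level illustration of the ALU, here is a minimal sketch that follows the function table from the ALU Slice slide (F = 0 add/subtract, 1 NOT b, 2 OR, 3 AND). The function name `alu`, the 32-bit default width, and the sign-based overflow test are assumptions made for the example; the real design is built from one-bit slices.

```python
def alu(a, b, add_sub, f, n=32):
    """Behavioral n-bit ALU.

    add_sub: 0 = add, 1 = subtract (only used when f == 0)
    f:       0 = add/sub, 1 = NOT b, 2 = a OR b, 3 = a AND b
    Returns (q, overflow, is_nonzero).
    """
    mask = (1 << n) - 1
    sign = 1 << (n - 1)
    a &= mask
    b &= mask
    overflow = False
    if f == 0:
        operand = (~b & mask) if add_sub else b
        q = (a + operand + (1 if add_sub else 0)) & mask
        # Signed overflow: both adder inputs have a sign that differs
        # from the sign of the result.
        overflow = bool((a ^ q) & (operand ^ q) & sign)
    elif f == 1:
        q = ~b & mask
    elif f == 2:
        q = a | b
    elif f == 3:
        q = a & b
    else:
        raise ValueError("f must be 0, 1, 2, or 3")
    return q, overflow, q != 0


if __name__ == "__main__":
    print(alu(5, 3, 0, 0))            # add
    print(alu(5, 5, 1, 0))            # subtract to zero
    print(alu(0b1100, 0b1010, 0, 2))  # OR
```

The third return value mirrors the "Is non-zero?" output, which a branch such as beq can use after subtracting its two operands.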

8 Datapath for MIPS ISA
Consider only the following instructions:
add $1,$2,$3
addi $1,2,$3
lw $1,4($3)
sw $1,4($3)
beq $1,$2,PC_relative_target
j absolute_target
Why only these?
Most other instructions are the same from the datapath viewpoint.
The ones that aren't are left for you to figure out.

9 Remember The von Neumann Model?
[Figure: loop of Instruction Fetch, Instruction Decode, Operand Fetch, Execute, Result Store, Next Instruction]
Instruction Fetch: read instruction bits from memory.
Decode: figure out what those bits mean.
Operand Fetch: read registers (+ mem to get sources).
Execute: do the actual operation (e.g., add the #s).
Result Store: write result to register or memory.
Next Instruction: figure out mem addr of next insn, repeat.

10 Start With Fetch
[Figure: PC feeding Insn Mem, with a +4 incrementer looping back into PC]
Same for all instructions (don't know insn yet).
PC and instruction memory.
A +4 incrementer computes the default next instruction PC.
Details of Insn Mem: later. For now: just assume a bunch of DFFs.

11 First Instruction: add
[Figure: PC, +4 incrementer, Insn Mem, Register File (ports s1, s2, d), ALU]
R-type: Op(6) Rs(5) Rt(5) Rd(5) Sh(5) Func(6)
Decoding: very easy in MIPS.
Add register file and ALU.

12 First Instruction: add
[Figure: PC, +4 incrementer, Insn Mem, Register File (ports s1, s2, d), ALU]
R-type: Op(6) Rs(5) Rt(5) Rd(5) Sh(5) Func(6)
Decoding: very easy in MIPS.
Add register file and ALU.
AND, OR, other R-type identical, just change func code! Same datapath.
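
To see why decoding is "very easy", note that every R-type field sits at a fixed bit position, so decoding is just shifting and masking. The sketch below assumes the standard MIPS layout with Op in the top six bits; the helper names `decode_r_type` and `encode_r_type` are made up for this example.

```python
def decode_r_type(word):
    """Split a 32-bit R-type instruction word into its six fields."""
    return {
        "op": (word >> 26) & 0x3F,    # 6 bits
        "rs": (word >> 21) & 0x1F,    # 5 bits
        "rt": (word >> 16) & 0x1F,    # 5 bits
        "rd": (word >> 11) & 0x1F,    # 5 bits
        "sh": (word >> 6) & 0x1F,     # 5 bits
        "func": word & 0x3F,          # 6 bits
    }


def encode_r_type(op, rs, rt, rd, sh, func):
    """Inverse of decode_r_type, used here only to build test words."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (sh << 6) | func


if __name__ == "__main__":
    # add $1,$2,$3 has op = 0 and func = 0x20 in the MIPS encoding,
    # with rd = 1, rs = 2, rt = 3.
    word = encode_r_type(0, 2, 3, 1, 0, 0x20)
    print(hex(word), decode_r_type(word))
```

In hardware no logic is needed at all for this step: the field boundaries are simply wires tapped from fixed bit ranges of the instruction.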

13 Second Instruction: addi
[Figure: previous datapath plus a sign extension unit (SX) and a mux on the second ALU input]
I-type: Op(6) Rs(5) Rt(5) Immed(16)
Destination register can now be either Rd or Rt.
Add sign extension unit and mux into second ALU input.
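
The two additions for addi can be sketched as follows. This is an illustrative model only: the function names are invented, and the select signal is called `alu_in_b` after the ALUinB control signal that appears on the later control slides.

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def second_alu_input(rt_value, imm16, alu_in_b):
    """Mux on the second ALU input.

    alu_in_b = 0 selects the register value (R-type instructions),
    alu_in_b = 1 selects the sign-extended immediate (addi, lw, sw).
    """
    return sign_extend16(imm16) if alu_in_b else rt_value


if __name__ == "__main__":
    print(sign_extend16(0x0004))               # small positive immediate
    print(sign_extend16(0xFFFC))               # negative immediate
    print(second_alu_input(99, 0xFFFC, 1))     # immediate path selected
```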

14 Third Instruction: lw
[Figure: previous datapath plus Data Mem (ports a, d) and a mux on the register write data]
I-type: Op(6) Rs(5) Rt(5) Immed(16)
Add data memory, address is ALU output.
Add register write data mux to select memory output or ALU output.

15 Fourth Instruction: sw
[Figure: previous datapath plus a path from the register file's second read port to the Data Mem data input]
I-type: Op(6) Rs(5) Rt(5) Immed(16)
Add path from second input register to data memory data input.

16 Fifth Instruction: beq
[Figure: previous datapath plus a left shifter (<<) and adder for the branch target, the ALU zero output (z), and a mux on the PC input]
I-type: Op(6) Rs(5) Rt(5) Immed(16)
Add left shift unit and adder to compute PC-relative branch target.
Add PC input mux to select PC+4 or branch target.
Note: shift by fixed amount is very simple.
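
The branch-target hardware can be modeled in a few lines. This sketch assumes the usual MIPS convention that the 16-bit immediate counts instructions, so it is shifted left by 2 and added to PC+4; the function names and the `taken` flag are illustrative, not taken from the slides.

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc, imm16):
    """PC-relative target: (PC + 4) + (sign-extended immediate << 2)."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF


def next_pc(pc, imm16, taken):
    """PC input mux: branch target if the branch is taken, else PC + 4."""
    return branch_target(pc, imm16) if taken else (pc + 4) & 0xFFFFFFFF


if __name__ == "__main__":
    print(hex(next_pc(0x1000, 3, True)))        # forward branch
    print(hex(next_pc(0x1000, 0xFFFF, True)))   # offset -1: back to 0x1000
    print(hex(next_pc(0x1000, 3, False)))       # not taken
```

Shifting by a fixed amount needs no shifter logic in hardware; the wires are simply connected two positions over, which is the point of the note on this slide.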

17 Sixth Instruction: j
[Figure: previous datapath plus a second left shifter (<<) for the jump immediate and an additional mux on the PC input]
J-type: Op(6) Immed(26)
Add shifter to compute left shift of 26-bit immediate.
Add additional PC input mux for jump target.
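
A sketch of the jump-target computation follows. The slide only mentions the left shift of the 26-bit immediate; taking the upper four address bits from PC+4 is the standard MIPS rule and is stated here as an assumption about this datapath. The function name is hypothetical.

```python
def jump_target(pc, imm26):
    """Absolute jump target for a J-type instruction.

    The 26-bit immediate is shifted left by 2 to form a 28-bit address;
    the top 4 bits are taken from PC + 4 (standard MIPS behavior,
    assumed here).
    """
    upper = (pc + 4) & 0xF0000000
    lower = (imm26 & 0x03FFFFFF) << 2
    return upper | lower


if __name__ == "__main__":
    print(hex(jump_target(0x00400000, 0x100000)))
    print(hex(jump_target(0x90000000, 1)))
```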

18 More Instructions
[Figure: six-instruction datapath]
Figure out datapath modifications for:
jal (J-type)
jr (R-type)

19 Jal
[Figure: six-instruction datapath]
For jal, need to get PC+4 to RF write mux (and constant 31 to destination register ID, probably another mux).

20 JR
[Figure: six-instruction datapath]
For jr, need to get RF read value to next PC mux (and constant 31 to source register ID, again probably another mux).

21 Good practice: Try other insns
[Figure: six-instruction datapath]
Pick other MIPS instructions, contemplate how to add them.

22 Continuous Read Datapath Timing
[Figure: datapath with timeline: Read IMem, Read Registers, Read DMEM, Write DMEM, Write Registers, Write PC]
Works because writes (PC, RegFile, DMem) are independent.
And because no read logically follows any write (until next complete instruction).
ONE LONG CLOCK CYCLE!

23 What Is Control?
[Figure: datapath annotated with control signals BR, JP, Rwd, Rwe, ALUop, DMwe, Rdst, ALUinB]
8 signals control flow of data through this datapath (well, ALUop is more than one bit).
MUX selectors, or register/memory write enable signals.
A real datapath might have hundreds of control signals.

24 Example: Control for add
[Figure: datapath annotated with control values]
BR=0, JP=0, ALUinB=0, ALUop=0, DMwe=0, Rwe=1, Rdst=1, Rwd=0
Control for an instruction: values of all control signals to correctly execute it.

25 Example: Control for sw
[Figure: datapath annotated with control values]
BR=0, JP=0, ALUinB=1, ALUop=0, DMwe=1, Rwe=0, Rdst=X, Rwd=X
Difference between sw and add is 5 signals, 3 if you don't count the X (don't care) signals.

26 Example: Control for beq
[Figure: datapath annotated with control values]
BR=1, JP=0, ALUinB=0, ALUop=1, DMwe=0, Rwe=0, Rdst=X, Rwd=X
Difference between sw and beq is only 4 signals.

27 Let's figure out LW
[Figure: datapath annotated with control signals BR, JP, Rwd, Rwe, ALUop, DMwe, Rdst, ALUinB]
How would these control signals be set for LW?

28 Example: Control for LW
[Figure: datapath annotated with control values]
BR=0, JP=0, ALUinB=1, ALUop=0, DMwe=0, Rwe=1, Rdst=0, Rwd=1
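
The four control examples (add, sw, beq, lw) can be collected into a small lookup table and compared programmatically. In this sketch the signal values are the ones given on the example slides, with the register-destination and register-write-data selects written as Rdst and Rwd; the table layout, the use of None for "don't care", and the helper name are assumptions for illustration.

```python
X = None  # don't care: electrically 0 or 1, but the value does not matter

CONTROL = {
    "add": dict(BR=0, JP=0, ALUinB=0, ALUop=0, DMwe=0, Rwe=1, Rdst=1, Rwd=0),
    "lw":  dict(BR=0, JP=0, ALUinB=1, ALUop=0, DMwe=0, Rwe=1, Rdst=0, Rwd=1),
    "sw":  dict(BR=0, JP=0, ALUinB=1, ALUop=0, DMwe=1, Rwe=0, Rdst=X, Rwd=X),
    "beq": dict(BR=1, JP=0, ALUinB=0, ALUop=1, DMwe=0, Rwe=0, Rdst=X, Rwd=X),
}


def differing_signals(insn_a, insn_b, count_dont_care=True):
    """List the control signals whose values differ between two instructions."""
    diffs = []
    for name, value_a in CONTROL[insn_a].items():
        value_b = CONTROL[insn_b][name]
        if value_a == value_b:
            continue
        if not count_dont_care and (value_a is X or value_b is X):
            continue
        diffs.append(name)
    return diffs


if __name__ == "__main__":
    print(differing_signals("add", "sw"))
    print(differing_signals("add", "sw", count_dont_care=False))
    print(differing_signals("sw", "beq"))
```

Running the comparisons reproduces the counts quoted on the slides: five differences between sw and add (three when don't-cares are ignored) and four between sw and beq.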

29 How Is Control Implemented?
[Figure: datapath with control signals BR, JP, Rwd, Rwe, ALUop, DMwe, Rdst, ALUinB all driven by a box labeled "Control?"]

30 Implementing Control
Each insn has a unique set of control signals.
Most are function of opcode.
Some may be encoded in the instruction itself.
E.g., the ALUop signal is some portion of the MIPS Func field.
+ Simplifies controller implementation
- Requires careful ISA design

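31 Control Implementation: ROM
ROM (read only memory): think rows of bits.
Bits in data words are control signals.
Lines indexed by opcode.
Example: ROM control for 6-insn MIPS datapath.
X is don't care (electrically must be 0 or 1 but doesn't matter).
opcode  BR  JP  ALUinB  ALUop  DMwe  Rwe  Rdst  Rwd
add     0   0   0       0      0     1    1     0
addi    ?   ?   ?       ?      ?     ?    ?     ?
lw      0   0   1       0      0     1    0     1
sw      0   0   1       0      1     0    X     X
beq     1   0   0       1      0     0    X     X
j       ?   ?   ?       ?      ?     ?    X     X
[The 0/1 entries of this table were lost; the add, lw, sw, and beq rows are taken from the control examples on slides 24-28, and "?" marks entries that are not recoverable.]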

32 Control Implementation: Random Logic
Real machines have 100+ insns and 300+ control signals.
30,000+ control bits (~4KB).
Not huge, but hard to make faster than datapath (important!).
Alternative: random logic (random = "non-repeating").
More or less what I did for protocomputer.
Exploits the observation: many signals have few 1s or few 0s.
Example: random logic control for 6-insn MIPS datapath.
[Figure: opcode decoded into add, addi, lw, sw, beq, j lines, combined by gates to drive BR, JP, DMwe, Rwe, Rwd, Rdst, ALUop, ALUinB]
Yes, "random logic" is a very dumb and misleading name for this concept. Sorry.

33 Datapath and Control Timing
[Figure: datapath with Control ROM/random logic; timeline: Read IMem, Read Registers (Read Control ROM), Read DMEM, Write DMEM, Write Registers, Write PC]
Will usually need an IR (instruction register) buffering the current instruction, as in protocomputer, but here can get by with IMem output.

34 Single-Cycle Datapath Performance
[Figure: datapath with Control ROM/random logic]
This machine will work, and it will be simple, but it will be slow.
Goes against "make common case fast" (MCCF) principle.
+ Low Cycles Per Instruction (CPI): 1
- Long clock period: to accommodate slowest insn

35 Interlude: Performance
Previous slide alludes to something new: performance.
Don't just want it to work, but want it to go fast!
Three components to performance:
Number of instructions x Cycles per instruction (CPI) x Clock period (1 / clock frequency)
(Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle) = Seconds / Program

36 Interlude: Performance
Three components to performance:
Number of instructions (<- compiler's job) x Cycles per instruction (CPI) x Clock period (1 / clock frequency)
(Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle) = Seconds / Program
Insns/program: determined by compiler + ISA.
Generally assume fixed program when doing micro-architecture.

37 Micro-architectural factors
Micro-architecture: the details of how the ISA is implemented.
Affects CPI and clock frequency.
Often will look at fixed program, and consider MIPS: Million Instructions Per Second.
MIPS = IPC * Frequency (in MHz)
IPC = Instructions Per Cycle (1 / CPI)
Gives "bigger is better" number.
(Instructions / Cycle) x (Cycles / Second) = Instructions / Second, i.e., IPC x Frequency = Throughput
The use of "MIPS" to mean Millions of Instructions Per Second has nothing to do with the CPU architecture also called MIPS, which actually stands for "Microprocessor without Interlocked Pipeline Stages". The fact that a major CPU architecture shares a name with an important metric for performance is incredibly confusing and dumb, and I apologize. I blame the cocaine-fueled CPU architects of the 1980s.
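
As a quick numeric check of MIPS = IPC * Frequency (in MHz), here is a minimal sketch. The function names are invented, and the 50 ns clock period is borrowed from the single-cycle example used later in the deck.

```python
def frequency_mhz(clock_period_ns):
    """Clock frequency in MHz for a clock period given in nanoseconds."""
    return 1000.0 / clock_period_ns


def mips(ipc, freq_mhz):
    """Millions of instructions per second: IPC times frequency in MHz."""
    return ipc * freq_mhz


if __name__ == "__main__":
    cpi = 1                      # single-cycle datapath
    ipc = 1.0 / cpi
    freq = frequency_mhz(50)     # 50 ns clock period
    print(freq, "MHz ->", mips(ipc, freq), "MIPS")
```

With a 50 ns period the clock runs at 20 MHz, so a CPI of 1 gives 20 MIPS; halving CPI or doubling the frequency would each double the figure.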

38 Best IPC
For now, best we can do: IPC = 1 (CPI = 1).
Do 1 instruction every cycle.
Later: real processors can do multiple instructions at once!
Potentially: IPC > 1! (CPI < 1!)
Best possible IPC depends on design.

39 Performance vs.
1990s: performance at all cost.
Actually more "clock frequency at all cost".
Now: care about other things:
Energy (electric bill, battery life)
Power (cooling, also affects energy)
Area (chip cost)
Reliability (tolerance of transient faults: e.g., charged particle strikes)
Important metric these days: Performance / Watt.
Throughput divided by power consumption.
Why? What device in particular?

40 Performance Modeling and Analysis
Speaking of performance:
Making a processor takes time (years) and money (millions).
Want to know it will perform well before you finish.
If it's wrong, doing it all over is painful.
Performance can be simulated in software.
Estimate what IPC will be.
Guide design.
Patterson and Hennessy's other, more advanced textbook: "Computer Architecture: A Quantitative Approach"

41 Single-Cycle Datapath Performance
[Figure: datapath with Control ROM/random logic]
Goes against "make common case fast" (MCCF) principle.
+ Low Cycles Per Instruction (CPI): 1
- Long clock period: to accommodate slowest insn

42 Alternative: Multi-Cycle Datapath
[Figure: datapath cut into stages by registers PC, IR, A, B, O, D, with control signals labeled by stage (s3, s4, s5)]
Multi-cycle datapath: attacks high clock period.
Cut datapath into multiple stages (5 here), isolate using FFs.
FSM control "walks" insns thru stages (by staging control signals).
Not every instruction needs every stage.
+ Insns can bypass stages and exit early

43 Multi-cycle Datapath FSM
[Figure: FSM with states Next Insn -> Decode]
First state: get a new instruction.
Output signals to fetch (e.g., read enable IMEM).
Next state: always Decode.

44 Multi-cycle Datapath FSM
[Figure: FSM with states Next Insn, Decode, Execute; NOP edge from Decode back to Next Insn]
Second state: Decode.
Output signals to decode instruction (REn RegFile).
Go to Next Insn if NOP, otherwise Execute.

45 Multi-cycle Datapath FSM
[Figure: FSM with states Next Insn, Decode, Execute; Branch edge from Execute back to Next Insn]
Execute state: execute (varies by insn type).
Next state: also depends on insn type.
Branches: Next Insn.

46 Multi-cycle Datapath FSM
[Figure: FSM with states Next Insn, Decode, Execute, Writeback; ALU edge from Execute to Writeback]
Execute state: execute (varies by insn type).
Next state: also depends on insn type.
ALU op: write register. We call this "Writeback" (to register file).

47 Multi-cycle Datapath FSM
[Figure: FSM with states Next Insn, Decode, Execute, Writeback, Read DMEM; Load edge from Execute to Read DMEM]
Execute state: execute (varies by insn type).
Next state: also depends on insn type.
Load: read memory.

48 Multi-cycle Datapath FSM
[Figure: FSM with states Next Insn, Decode, Execute, Writeback, Read DMEM, Write DMEM; Store edge from Execute to Write DMEM]
Execute state: execute (varies by insn type).
Next state: also depends on insn type.
Store: write memory.

49 Multi-cycle Datapath FSM
[Figure: full FSM with states Next Insn, Decode, Execute, Read DMEM, Write DMEM, Writeback]
Read DMEM state: control signals enable DMEM read.
Next state is Writeback (what we read from memory needs to go to a register).

50 Multi-cycle Datapath FSM
[Figure: full FSM with states Next Insn, Decode, Execute, Read DMEM, Write DMEM, Writeback]
Writeback state: control signals enable regfile write.
Next state: Next Insn.

51 Multi-cycle Datapath FSM
[Figure: full FSM with states Next Insn, Decode, Execute, Read DMEM, Write DMEM, Writeback]
Write DMEM state: control signals enable memory write.
Next state: Next Insn.
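
The FSM built up over the last several slides can be written as a transition function. This is a sketch under stated assumptions: the first state is called "Next Insn" here, the instruction kinds ("nop", "branch", "alu", "load", "store") are labels chosen for the example, and the real controller would also emit control signals in each state.

```python
def next_state(state, kind):
    """Transition function for the multi-cycle control FSM."""
    if state == "Next Insn":
        return "Decode"
    if state == "Decode":
        return "Next Insn" if kind == "nop" else "Execute"
    if state == "Execute":
        return {
            "branch": "Next Insn",
            "alu": "Writeback",
            "load": "Read DMEM",
            "store": "Write DMEM",
        }[kind]
    if state == "Read DMEM":
        return "Writeback"
    if state in ("Writeback", "Write DMEM"):
        return "Next Insn"
    raise ValueError("unknown state: " + state)


def states_visited(kind):
    """States one instruction passes through, one state per clock cycle."""
    visited = ["Next Insn"]
    while True:
        following = next_state(visited[-1], kind)
        if following == "Next Insn":
            return visited
        visited.append(following)


if __name__ == "__main__":
    for kind in ("branch", "alu", "store", "load"):
        path = states_visited(kind)
        print(kind, len(path), "cycles:", " -> ".join(path))
```

Counting the states visited gives 3 cycles for branches, 4 for ALU operations and stores, and 5 for loads, which is where the per-instruction cycle counts in the CPI discussion come from.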

52 Multi-Cycle Datapath Example: Add
[Figure: multi-cycle datapath with PC, IR, Register File (s1, s2, d), A, B, SX, O, Data Mem (a, d), D]
Example: Add
Cycle 1: Read IMEM

53 Multi-Cycle Datapath Example: Add
[Figure: multi-cycle datapath with PC, IR, Register File (s1, s2, d), A, B, SX, O, Data Mem (a, d), D]
Example: Add
Cycle 1: Read IMEM
Cycle 2: Decode + Read RF

54 Multi-Cycle Datapath Example: Add
[Figure: multi-cycle datapath with PC, IR, Register File (s1, s2, d), A, B, SX, O, Data Mem (a, d), D]
Example: Add
Cycle 1: Read IMEM
Cycle 2: Decode + Read RF
Cycle 3: ALU

55 Multi-Cycle Datapath Example: Add
[Figure: multi-cycle datapath with PC, IR, Register File (s1, s2, d), A, B, SX, O, Data Mem (a, d), D]
Example: Add
Cycle 1: Read IMEM
Cycle 2: Decode + Read RF
Cycle 3: ALU
Cycle 4: Writeback + Increment PC

56 Multi-Cycle Datapath Performance
[Figure: multi-cycle datapath]
Opposite performance split of single-cycle datapath:
+ Short clock period
- High CPI

60 Multi-cycle Datapath CPI
CPI depends on instructions:
Branches / Jumps: 3 cycles
ALU: 4 cycles
Stores: 4 cycles
Loads: 5 cycles
Overall CPI is weighted average.
Example: 20% loads, 15% stores, 20% branches, 45% ALU
CPI = 0.20 * 5 + 0.15 * 4 + 0.20 * 3 + 0.45 * 4 = 4
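
The weighted average can be checked with a few lines of code. This sketch reads the example mix as 20% loads, 15% stores, 20% branches, and 45% ALU (the reading that sums to 100%); the dictionary layout and function name are assumptions for illustration.

```python
# Cycles per instruction type on the multi-cycle datapath.
CYCLES = {"branch": 3, "alu": 4, "store": 4, "load": 5}


def average_cpi(mix):
    """Weighted-average CPI for a mix of {instruction type: fraction}."""
    total_fraction = sum(mix.values())
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("instruction mix must sum to 1")
    return sum(fraction * CYCLES[kind] for kind, fraction in mix.items())


if __name__ == "__main__":
    mix = {"load": 0.20, "store": 0.15, "branch": 0.20, "alu": 0.45}
    print(average_cpi(mix))
```

Term by term the sum is 1.0 + 0.6 + 0.6 + 1.8, so the 5-cycle loads are offset by the 3-cycle branches and the average lands at 4.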

61 Multi-cycle Datapath Performance
Single-cycle: clock period = 50ns, CPI = 1. Performance = 50 ns/insn.
Multi-cycle: clock period = 10ns, CPI = (0.2*3 + 0.2*5 + 0.6*4) = 4. Performance = 40 ns/insn.
But wait...

62 Multi-Cycle Datapath Performance
[Figure: multi-cycle datapath]
Did not just cut up existing logic into 5 pieces.
Also added logic (flip flops).
So clock period is not 1/5 of single cycle, but slightly longer.

63 Multi-cycle Datapath Performance
Single-cycle: clock period = 50ns, CPI = 1. Performance = 50 ns/insn.
Multi-cycle: clock period = 12ns, CPI = (0.2*3 + 0.2*5 + 0.6*4) = 4. Performance = 48 ns/insn.
Better, but not as exciting.
Can we do better still?
Have our cake (low CPI) and eat it too (high clock frequency)?
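
The comparison on this slide is just clock period times CPI. The sketch below recomputes it with the slide's numbers (50 ns single-cycle; 10 ns idealized and 12 ns realistic multi-cycle); the function names and the speedup calculation are additions for illustration.

```python
def ns_per_insn(clock_period_ns, cpi):
    """Average time per instruction: clock period times cycles per instruction."""
    return clock_period_ns * cpi


def speedup(old_ns_per_insn, new_ns_per_insn):
    """How many times faster the new design is than the old one."""
    return old_ns_per_insn / new_ns_per_insn


if __name__ == "__main__":
    multi_cpi = 0.2 * 3 + 0.2 * 5 + 0.6 * 4   # branches, loads, ALU + stores
    single = ns_per_insn(50, 1)
    ideal_multi = ns_per_insn(10, multi_cpi)
    real_multi = ns_per_insn(12, multi_cpi)
    print(single, ideal_multi, real_multi)
    print("speedup with 12 ns clock:", speedup(single, real_multi))
```

With the flip-flop overhead included, the multi-cycle design is only about 4% faster than single-cycle, which is why the slide calls the result "not as exciting".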

64 Clock Period and CPI
Single-cycle datapath:
+ Low CPI: 1
- Long clock period: to accommodate slowest insn
[Timeline: | insn0.fetch, dec, exec | insn1.fetch, dec, exec |]
Multi-cycle datapath:
+ Short clock period
- High CPI
[Timeline: | insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec |]
Can we have both low CPI and short clock period?
No good way to make a single insn go faster.
+ Insn latency doesn't matter anyway; insn throughput matters
Key: exploit inter-insn parallelism.

65 Pipelining
Pipelining: important performance technique.
Improves insn throughput rather than insn latency.
Exploits parallelism at insn-stage level to do so.
Begin with multi-cycle design:
[Timeline: | insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec |]
When insn advances from stage 1 to 2, next insn enters stage 1:
[Timeline: | insn0.fetch | insn0.dec + insn1.fetch | insn0.exec + insn1.dec | insn1.exec |]
Individual insns take same number of stages.
+ But insns enter and leave at a much faster rate
Physically breaks "atomic" VN loop... but must maintain illusion.
Revisit at end of semester (hopefully).
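
The overlap in the pipelined timeline can be generated mechanically. This is a minimal sketch assuming an ideal pipeline with no stalls or hazards, where instruction i enters the first stage in cycle i; the function names and the three-stage list are chosen to match the timeline on this slide.

```python
def multicycle_cycles(n_insns, n_stages):
    """Cycles when each insn finishes all stages before the next one starts."""
    return n_insns * n_stages


def pipelined_cycles(n_insns, n_stages):
    """Cycles for an ideal pipeline: fill time plus one insn per cycle."""
    return n_stages + n_insns - 1 if n_insns else 0


def pipeline_schedule(n_insns, stages):
    """For each cycle, list which insn occupies which stage (no stalls)."""
    schedule = []
    for cycle in range(pipelined_cycles(n_insns, len(stages))):
        active = []
        for insn in range(n_insns):
            stage_index = cycle - insn  # insn i enters stage 0 in cycle i
            if 0 <= stage_index < len(stages):
                active.append(f"insn{insn}.{stages[stage_index]}")
        schedule.append(active)
    return schedule


if __name__ == "__main__":
    stages = ["fetch", "dec", "exec"]
    for cycle, active in enumerate(pipeline_schedule(2, stages)):
        print("cycle", cycle, active)
    print(multicycle_cycles(2, 3), "cycles multi-cycle vs",
          pipelined_cycles(2, 3), "cycles pipelined")
```

Each instruction still takes three cycles from start to finish, so latency is unchanged; the gain is that a new instruction completes every cycle once the pipeline is full.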

66 Summary
Datapaths: single cycle. What do we need?
Control: how control is implemented.
Multi-cycle: faster clock (yay!), worse CPI (boooo).
Performance: IPC, Performance / Watt, CPU performance equation.
Pipelining: teaser for later!
