COMPUTER ORGANIZATION AND DESIGN


1 COMPUTER ORGANIZATION AND DESIGN, ARM Edition: The Hardware/Software Interface
Chapter 4: The Processor
Modified and extended by R.J. Leduc

2 Digital Logic Introduction
To understand this chapter, you will need some basic digital logic concepts. Earlier, we discussed AND, OR, NOT, and XOR (exclusive OR) logic gates; you should go back and review them. You should also review our earlier discussion of how RAM works and the Boolean logic axioms we covered. We will shortly discuss how multiplexors and registers work. This section of slides includes information from the textbook.

3 Logic Design Basics
Information in the CPU is encoded in binary:
- Low voltage = 0, high voltage = 1
- One wire per bit
- Multi-bit data is encoded on multi-wire buses
Combinational components:
- Operate on data
- Output is a function of the current inputs only, i.e. no history
- Examples are circuits built from AND, OR, and NOT gates without any feedback loops
State (sequential) elements:
- Store information (e.g. flip-flops, registers)

4 2-Input Multiplexor
A 2-input multiplexor has two data sources, x1 and x2, and one output, f. A third input, s, selects which data input is transmitted to the output: if s = 0, then f has the same value as x1; if s = 1, then f has the same value as x2.
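
The selection rule above can be sketched in a few lines of Python. This is only an illustrative model, not part of the original slides; the function name mux2 is made up here, and the gate-level expression is the usual sum-of-products form for a 2-input multiplexor.

```python
def mux2(x1: int, x2: int, s: int) -> int:
    """Return x1 when s == 0 and x2 when s == 1 (all values are single bits)."""
    # Gate-level form: f = (NOT s AND x1) OR (s AND x2)
    return ((1 - s) & x1) | (s & x2)


if __name__ == "__main__":
    # Print the full truth table of the multiplexor.
    for s in (0, 1):
        for x1 in (0, 1):
            for x2 in (0, 1):
                print(f"s={s} x1={x1} x2={x2} -> f={mux2(x1, x2, s)}")
```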

5 2- Input Multiplexor II Chapter 4 The Processor 5

6 4-Input Multiplexor
For 4 inputs, we need two select lines: s0 and s1. If we wish to select between 32 sources as possible inputs, we would need five select inputs (since 2^5 = 32).

7 Multiplexor for 64-Bit Registers
A 32-input multiplexor can select between 32 1-bit sources. To select between 32 64-bit registers, we would need an array of 32-input multiplexors: one for each bit of the register (64 muxes in total). For example, the first multiplexor would select between bit b0 of each of the LEGv8 registers.
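
As a rough sketch of this arrangement, the Python below builds a 64-bit register select out of 64 one-bit-wide multiplexors, each choosing among 32 sources. The names mux32 and select_register are hypothetical, and registers are modelled as plain integers rather than hardware signals.

```python
def mux32(sources, select):
    """1-bit-wide multiplexor with 32 data inputs and a 5-bit select value."""
    assert len(sources) == 32 and 0 <= select < 32
    return sources[select]


def select_register(registers, select):
    """Pick one of 32 64-bit registers using 64 copies of mux32."""
    value = 0
    for bit in range(64):
        # Mux number `bit` sees bit `bit` of every register.
        column = [(reg >> bit) & 1 for reg in registers]
        value |= mux32(column, select) << bit
    return value
```

Calling select_register(registers, 5) returns the same 64-bit value as registers[5]; the point of the sketch is that the result is assembled one bit at a time, one multiplexor per bit.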

8 Combinational Components
n-bit, m x 1 multiplexor: data inputs I0, I1, ..., I(m-1), each n bits wide; log2(m) select inputs S; n-bit output O
- O = I0 if S = 0..00, I1 if S = 0..01, ..., I(m-1) if S = 1..11
log2(n) x n decoder: log2(n) inputs I; outputs O0, O1, ..., O(n-1)
- O0 = 1 if I = 0..00, O1 = 1 if I = 0..01, ..., O(n-1) = 1 if I = 1..11
- With an enable input e, all outputs are 0 if e = 0
n-bit adder: n-bit inputs A and B; outputs sum and carry
- sum = A + B (first n bits)
- carry = (n+1)th bit of A + B
- With a carry-in input Ci, sum = A + B + Ci
n-bit comparator: n-bit inputs A and B; outputs less, equal, and greater
- less = 1 if A < B, equal = 1 if A = B, greater = 1 if A > B
n-bit, m-function ALU: n-bit inputs A and B; log2(m) select inputs S; output O
- O = A op B, where op is determined by S
- May have status outputs carry, zero, etc.

9 Combinational Elements
Adder: Y = A + B
Multiplexor: Y = S ? I1 : I0
Arithmetic/Logic Unit (ALU): Y = F(A, B)

10 Tri-state Buffer
When e = 1, the device acts like a buffer, i.e. output = input. When e = 0, the output is electrically disconnected from the input. For a shared data bus, each register's outputs go through a tri-state buffer. As long as only one register enables its output at a time, there is no conflict. For a large number of devices sharing a common wire, this is more efficient than using multiplexors.

11 D Flip-Flops
The basic building block of a register is a device called a D (data) flip-flop (FF).
A positive edge-triggered D flip-flop:
- Stores a single bit of data
- On the rising edge of the clock (when the signal changes from low to high), stores the value at the D input
- The stored value then appears at the Q output, after a short delay
- Changes of the D input are otherwise ignored
We can add an enable input to the flip-flop:
- When enable is set to 1, it behaves as above
- When enable is set to 0, it keeps its current value and ignores the clock signal
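
The behaviour described above can be modelled with a small Python class. This is a behavioural sketch only: the class name DFlipFlop and the tick method are invented for illustration, and the short clock-to-Q delay is not modelled.

```python
class DFlipFlop:
    """Behavioural model of a positive edge-triggered D flip-flop with enable."""

    def __init__(self):
        self.q = 0            # stored bit, visible at the Q output
        self._prev_clk = 0    # previous clock level, used to detect edges

    def tick(self, clk: int, d: int, enable: int = 1) -> int:
        """Apply the current clock level and D input; return Q."""
        rising_edge = self._prev_clk == 0 and clk == 1
        if rising_edge and enable:
            self.q = d        # capture D only on an enabled rising edge
        self._prev_clk = clk
        return self.q
```

An 8-bit register in this model would simply be eight DFlipFlop objects driven by the same clk and enable values.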

12 D Flip-Flops II Devices shown from top are: D latch (ignore) Positive edge-triggered D FF Negative edge-triggered (clock transitions from high-to-low) D FF To correctly store data, value at D input must be constant for period just before and after desired clock edge. Chapter 4 The Processor 12

13 Registers To create an 8-bit register: Group eight D flip-flops together Gives 8 D inputs, and 8 Q outputs Connect their enable signals together Connect their clock inputs together Chapter 4 The Processor 13

14 Registers II
The example shows two 2-bit registers (R1 and R2), connected to a 2-bit shared data bus. Register i (i in {1,2}) has an input enable signal (Riin) and an output enable signal (Riout). All registers use tri-state buffers to connect to the data bus.

15 Clocking Methodology Most data is stored in state elements such as registers Typical operation in a processor: One state element stores data, which appears at its output This data then propagates through a combinational logic circuit Data then appears at the input of a second state element Chapter 4 The Processor 15

16 Clocking Methodology II
This occurs between clock edges:
- On the first rising edge of the clock, data is stored in State Element 1
- Data propagates through the combinational logic (e.g. an adder)
- The output of the combinational logic is stored in State Element 2 on the next rising edge of the clock
The clock period must be long enough for the data to reach State Element 2 and be stable. The longest delay in the processor determines the minimum clock period.

17 Clocking Methodology III Edge triggered elements allow for a state element to be read and written in the same clock cycle Clock period needs to be long enough for output of combinational logic to reach next input The delay going through the combinational logic must be long enough that newly loaded value of state element can't propagate too quickly back to input of state element Input must not change for short period after clock edge Chapter 4 The Processor 17

18 Chapter Introduction (4.1 Introduction)
From previous chapters, we saw that CPU performance is determined by:
- Instruction count: determined by the ISA and compiler
- CPI and cycle time: determined by the implementation of the processor
In examining the implementation of the CPU, we will see:
- How the ISA determines many aspects of the CPU design
- How different implementations affect clock rate and CPI

19 Chapter Introduction II We will examine two LEGv8 implementations A simplified version A more realistic pipelined version For both we will examine the datapath and controller design We will start with a highly abstract and simplified overview, and then refine the design as we add details Chapter 4 The Processor 19

20 Basic LEGv8 Implementation
We will focus our implementation on a subset of LEGv8 instructions to simplify things. This will demonstrate the key concepts of the datapath and controller; the implementation of the remaining instructions is similar.
The subset of instructions we will focus on is:
- Memory reference:
  Load register: LDUR X0, [X1,#8]
  Store register: STUR X0, [X1,#0]
- Arithmetic/logical:
  ADD, SUB, AND, ORR, e.g. ADD X0, X1, X2
- Branching:
  Compare and Branch: CBZ X0, Label1
  Unconditional Branch: B Label2 // will add this last to design

21 Instruction Execution
For all three classes of instructions (memory reference, arithmetic/logical, and branching), the first three steps are identical:
1. Set the program counter (PC) register to the memory location that contains the next instruction
2. Fetch the instruction from memory
3. Read zero, one, or two registers, using the fields of the instruction to select them:
   - LDUR and CBZ require only one register
   - Most other instructions require two (including STUR, which reads both the base register and the register to be stored)
   - Unconditional branch requires none

22 Instruction Execution II Next action required depends on instruction type Except for unconditional branches, instructions next need to use the ALU Memory reference: need address calculation Arithmetic/logical: perform indicated operation Conditional branch: needs comparison to zero After ALU, next actions are: Memory reference: access memory to read data or write data Arithmetic/logical/load: store output of ALU or data from memory to a register Chapter 4 The Processor 22

23 Instruction Execution III Finally, we need to determine the address of the next instruction to execute If current instruction is a conditional branch and specified register was zero: Load new address (PC plus offset specified as part of instruction) into PC register Otherwise, load PC + 4 into PC register Chapter 4 The Processor 23

24 CPU Overview Provides an abstract and simplified view Omits two important details Chapter 4 The Processor 24

25 (1) Need Multiplexors
Can't just join wires together; use multiplexors.

26 (2) Needs Control Signals Chapter 4 The Processor 26

27 Logic Design Conventions A signal is asserted if it is logically high We assert a signal when it should be set to logically high A signal is deasserted if it is logically low We deassert a signal when it should be set to logically low In textbook, we will only store data on the rising clock edge for our flip-flops and registers The data bus signals are assumed to be 64 bit unless specified otherwise Two overlapping signals are not connected unless there is a dot where they cross each other Chapter 4 The Processor 27

28 Building a Datapath (4.3 Building a Datapath)
We will now examine the major components of a datapath needed to execute each class of LEGv8 instruction.
A datapath element is a unit used to operate on or hold data within the processor. For LEGv8 we have: instruction and data memories, the register file, the ALU, and adders.
A register file is a state element that consists of a set of registers that can be read or written by supplying the register number to be accessed.
We will build a LEGv8 datapath incrementally, refining the overview design.

29 Instruction Fetch 64-bit register Increment by 4 for next instruction Chapter 4 The Processor 29

30 R-Format Instructions
Field:  opcode   Rm      shamt   Rn      Rd
Width:  11 bits  5 bits  6 bits  5 bits  5 bits
To execute an R-format instruction we need to:
- Read two register operands
- Perform the arithmetic/logical operation
- Write the result to a register
This uses a register file and an ALU.
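
To make the field layout concrete, here is a minimal Python sketch that splits a 32-bit R-format word into its five fields. The bit positions follow from the field widths above (opcode in 31:21, Rm in 20:16, shamt in 15:10, Rn in 9:5, Rd in 4:0). The function name decode_r_format is hypothetical, and the opcode value 0x458 used for ADD in the example is an assumption taken from the standard LEGv8 opcode listing, not from these slides.

```python
def decode_r_format(instruction: int) -> dict:
    """Split a 32-bit R-format instruction word into its fields."""
    return {
        "opcode": (instruction >> 21) & 0x7FF,  # bits 31:21, 11 bits
        "Rm":     (instruction >> 16) & 0x1F,   # bits 20:16, 5 bits
        "shamt":  (instruction >> 10) & 0x3F,   # bits 15:10, 6 bits
        "Rn":     (instruction >> 5) & 0x1F,    # bits 9:5, 5 bits
        "Rd":     instruction & 0x1F,           # bits 4:0, 5 bits
    }


# Example word for ADD X0, X1, X2 (opcode value is an assumption, see above).
ADD_X0_X1_X2 = (0x458 << 21) | (2 << 16) | (0 << 10) | (1 << 5) | 0

if __name__ == "__main__":
    print(decode_r_format(ADD_X0_X1_X2))
```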

31 R-Format Instructions II Register file requires: three 5-bit selection inputs to specify the two source registers, and the destination register One 64-bit input to load data to be written to the destination register Two 64-bit outputs Must assert RegWrite input to write to register on next clock edge Chapter 4 The Processor 31

32 R-Format Instructions III ALU has: two 64-bit inputs for operands and a 64 bit output A 4-bit input to select which function to perform A 1-bit output that is asserted when the result of the operation is zero Chapter 4 The Processor 32

33 Load/Store Instructions
Field:  opcode   address  op2     Rn      Rt
Width:  11 bits  9 bits   2 bits  5 bits  5 bits
To perform a load or store we need to:
- Read the base address from the Rn register
- Add the 9-bit signed offset to the base address to get the data address
- If load: read memory and update the register
- If store: write the register value to memory

34 Load/Store Instructions II To perform these instructions, we will need a register file and ALU: Register file provides base address and source/destination register ALU will add base address and address offset We will also need a sign extension unit: Unit takes the 32-bit instruction word as input For a load/store, it extracts a 9-bit offset For CBZ, it will extract a 19-bit offset Chapter 4 The Processor 34
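
A possible software model of the sign extension unit is sketched below. The bit positions (20:12 for the load/store offset, 23:5 for the CBZ offset) are derived from the field widths given on the format slides; the kind parameter is a stand-in invented here for whatever opcode-based selection the real unit performs.

```python
MASK64 = (1 << 64) - 1


def sign_extend(value: int, bits: int) -> int:
    """Sign-extend a `bits`-wide field to a 64-bit two's-complement pattern."""
    sign = 1 << (bits - 1)
    return ((value ^ sign) - sign) & MASK64


def extract_offset(instruction: int, kind: str) -> int:
    """Return the 64-bit sign-extended offset from a 32-bit instruction word."""
    if kind == "D":   # load/store: 9-bit offset in bits 20:12
        return sign_extend((instruction >> 12) & 0x1FF, 9)
    if kind == "CB":  # CBZ: 19-bit offset in bits 23:5
        return sign_extend((instruction >> 5) & 0x7FFFF, 19)
    raise ValueError(f"unknown instruction kind: {kind}")
```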

35 Load/Store Instructions III We also need a data memory unit to read from and write to Data memory unit has: 64-bit address input to select memory location 64-bit input for write operations 64-bit output for read operations Input MemWrite is asserted to enable writing Input MemRead is asserted to specify a read operation Chapter 4 The Processor 35

36 Branch Instruction (CBZ)
Field:  opcode  Offset   Rn
Width:  8 bits  19 bits  5 bits
CBZ requires a register to test for zero, and a signed address offset. The register is passed through the ALU to its output, and the ALU's zero output is set based upon this value.
We also need to calculate the branch target address, which is the address to load into the PC register if the branch is taken (i.e. the register is equal to zero). To calculate the branch target address:
- Sign-extend the displacement to 64 bits
- Shift the displacement left 2 places to multiply by 4 (the displacement is how many words to jump, but each word is 4 bytes)
- Add it to the current value of the PC register (which is the address of the branch instruction)
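
The branch target calculation can be sketched as follows. The function names branch_target and next_pc are hypothetical, and the PC is modelled as a 64-bit unsigned integer that wraps around on overflow.

```python
MASK64 = (1 << 64) - 1


def branch_target(pc: int, offset_field: int, bits: int = 19) -> int:
    """PC + (sign-extended offset << 2), truncated to 64 bits."""
    sign = 1 << (bits - 1)
    offset = (offset_field ^ sign) - sign   # signed displacement in words
    return (pc + (offset << 2)) & MASK64    # shift left 2 = multiply by 4


def next_pc(pc: int, offset_field: int, reg_value: int) -> int:
    """Next PC for CBZ: the branch target if the register is zero, else PC + 4."""
    if reg_value == 0:
        return branch_target(pc, offset_field)
    return (pc + 4) & MASK64
```

For example, a CBZ at address 0x1000 with an offset field of 3 branches to 0x100C when the tested register is zero, and falls through to 0x1004 otherwise.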

37 Branch Instruction II Shows segment of datapath that handles branches Adds a dedicated adder unit to calculate the branch target address Note: Figure 4.9 of the text incorrectly used PC +4 instead of PC in the top adder Chapter 4 The Processor 37

38 Composing the Elements Our first datapath version will execute one instruction in each clock cycle This means: Each datapath element can only do one function at a time Hence, we need separate instruction and data memories Also need two dedicated adders for calculating next instruction address Use multiplexers where alternate data sources are used for different instructions Add needed control signals Chapter 4 The Processor 38

39 R-Type/Load/Store Datapath They can share register file and ALU, but: Bottom ALU input needs to choose between offset and register Destination register input needs to choose between ALU output and data memory Chapter 4 The Processor 39

40 Full Datapath Still not using ALU zero output to determine PCSrc mux input Chapter 4 The Processor 40

41 ALU Control (4.4 A Simple Implementation Scheme)
We will now look at the design of the control unit. To simplify the design of the main control unit, we will first design a simple controller for our ALU. Our ALU offers six functions that we need. To specify the desired function, we need to specify the value of the ALU's four control bits.

42 ALU Control II
The table below shows the available functions and the corresponding 4-bit selection patterns:
ALU control  Function
0000         AND
0001         OR
0010         add
0110         subtract
0111         pass input b
1100         NOR
The ALU is used as follows when executing each instruction type:
- Load/Store: need to add the base register to the offset
- CBZ branch: need to pass the register to the output
- R-type: operation depends on the opcode
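
The function table above maps directly onto a small behavioural model. This Python sketch is illustrative only; the name alu is made up, operands are treated as 64-bit unsigned patterns, and the zero output is returned alongside the result.

```python
MASK64 = (1 << 64) - 1


def alu(control: int, a: int, b: int):
    """Return (result, zero) for the six ALU functions in the table."""
    if control == 0b0000:
        result = a & b        # AND
    elif control == 0b0001:
        result = a | b        # OR
    elif control == 0b0010:
        result = a + b        # add
    elif control == 0b0110:
        result = a - b        # subtract
    elif control == 0b0111:
        result = b            # pass input b
    elif control == 0b1100:
        result = ~(a | b)     # NOR
    else:
        raise ValueError(f"unsupported ALU control value: {control:04b}")
    result &= MASK64          # keep only the low 64 bits
    return result, int(result == 0)
```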

43 ALU Control III
We assume the main control unit will provide a 2-bit ALUOp signal that is derived from the opcode. This will determine the ALU control for all but the R-type instructions. We use X to indicate a don't care condition.
opcode  ALUOp  Operation                   Opcode field  ALU function  ALU control
LDUR    00     load register               XXXXXXXXXXX   add           0010
STUR    00     store register              XXXXXXXXXXX   add           0010
CBZ     01     compare and branch on zero  XXXXXXXXXXX   pass input b  0111
R-type  10     add                                       add           0010
R-type  10     subtract                                  subtract      0110
R-type  10     AND                                       AND           0000
R-type  10     ORR                                       OR            0001
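
One way to express this table in code is shown below. It is a sketch under stated assumptions: the function name alu_control is hypothetical, and the four R-type opcode values are not given in this transcript; they are assumed from the standard LEGv8 opcode listing (ADD 0x458, SUB 0x658, AND 0x450, ORR 0x550).

```python
# Assumed 11-bit R-type opcodes (not present in the slide transcript).
R_TYPE_ALU_CONTROL = {
    0x458: 0b0010,  # ADD -> add
    0x658: 0b0110,  # SUB -> subtract
    0x450: 0b0000,  # AND -> AND
    0x550: 0b0001,  # ORR -> OR
}


def alu_control(alu_op: int, opcode: int = 0) -> int:
    """Map the 2-bit ALUOp (plus the opcode for R-type) to the 4-bit ALU control."""
    if alu_op == 0b00:       # LDUR/STUR: opcode field is a don't care
        return 0b0010        # add
    if alu_op == 0b01:       # CBZ: opcode field is a don't care
        return 0b0111        # pass input b
    if alu_op == 0b10:       # R-type: function depends on the opcode
        return R_TYPE_ALU_CONTROL[opcode]
    raise ValueError(f"unsupported ALUOp value: {alu_op:02b}")
```

In hardware this mapping would be a small combinational circuit derived from a truth table; the dictionary lookup here only mirrors its input/output behaviour.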

44 ALU Control Truth Table Next step is to create a truth table Can then use standard methods to derive a Boolean logic expression for each bit of ALU Control signal Each Boolean expression implemented as a combinatorial circuit Used don't cares to minimize displayed rows Chapter 4 The Processor 44

45 The Main Control Unit We will create the main Control Unit We note that our control signals will be derived from our 32 bit instruction word Note: figure below as well as Fig 4.14 in text are incorrect. Opcode for conditional branch is 31:24, NOT 31:26. Chapter 4 The Processor 45

46 The Main Control Unit II We make the following important observations: Opcodes are found in bits 31:21 First register operand is bits 9:5 (Rn) for R-type and base register of load/store 2nd register operand is bits 20:16 (Rm) for R-type, but at 4:0 (Rt) for a store operation and register for testing on CBZ Means we require a multiplexor to make selection Destination register for R-type and load operations is bits 4:0 Chapter 4 The Processor 46

47 Datapath With Control Unit Chapter 4 The Processor 47

48 Logic for Control Signals
Read Figure 4.16 in the text for a good description of the purpose and operation of the main control signals. The values of the control signals are determined by the instruction's opcode alone. Read the description for Figure 4.18 in the text.

49 Truth Table For Control Signals Use standard methods to derive Boolean logic expressions for each control signal Implement each expression as combinatorial circuit Used don't cares to minimize size Chapter 4 The Processor 49

50 Operation of R-type Instruction
Consider the operation of the datapath for: ADD X1, X2, X3
Everything happens in one clock cycle, but the steps are ordered by the flow of information:
1. On the rising clock edge, the new instruction address is loaded into the PC register
2. The instruction at this address is loaded
3. Registers X2 and X3 are read from the register file, while the control unit (then the ALU control) sets the control signals
4. The ALU does the specified operation on the data read from the register file (ADD, SUB, AND, ORR)
5. The output of the ALU is directed to the Write data input of the register file (X1)
6. On the next rising edge of the clock, the data is saved to X1, while the PC register is loaded with the next instruction address (PC + 4), and the process repeats

51 R-Type Instruction Chapter 4 The Processor 51

52 Operation of Load Instruction
Consider the operation of the datapath for: LDUR X1, [X2,offset]
1. On the rising clock edge, the new instruction address is loaded into the PC register
2. The instruction at this address is loaded
3. Register X2 is read from the register file, while the control unit (then the ALU control) sets the control signals
4. The ALU computes the sum of X2 and the sign-extended address offset
5. The output of the ALU is used as the address for the data memory (MemRead asserted)
6. On the next rising edge of the clock, the output of the data memory is saved to X1, while the PC register is loaded with the next instruction address (PC + 4), and the process repeats

53 Load Instruction For store operation, read description of Figure 4.20 in text Chapter 4 The Processor 53

54 Operation of CBZ Instruction
Consider the operation of the datapath for: CBZ X1, offset
1. On the rising clock edge, the new instruction address is loaded into the PC register
2. The instruction at this address is loaded
3. Register X1 is read from the register file (Read Reg 2) using I[4:0], while the control unit (then the ALU control) sets the control signals
4. The ALU passes the value of Read data 2 to its output, setting signal zero = 1 if the result = 0, and zero = 0 otherwise
5. The value of the PC is added to the sign-extended offset shifted left by 2 (i.e. the branch target address)
6. On the next rising edge of the clock, the PC register is loaded with the branch target address if zero is asserted, otherwise with the next instruction address (PC + 4), and the process repeats

55 CBZ Instruction Chapter 4 The Processor 55

56 Implementing Unconditional Branch
B instruction format: opcode 5 in bits 31:26, offset in bits 25:0
Like CBZ, but with a 26-bit offset, and it always branches. The branch target is calculated by adding the PC to the sign-extended offset shifted left by 2.
Changes needed:
- Add to the control unit a new output called Uncondbranch that is only true when bits 31:26 of the instruction equal 5 (i.e. a B instruction)
- Add an OR gate for the select input of the top right multiplexor
- Extend the sign-extend unit so it can also select the 26-bit offset from a B instruction

57 Datapath With B Added Chapter 4 The Processor 57

58 Performance Issues
The single-cycle design will work correctly, but it is too inefficient for modern designs:
- Every instruction must have the same clock period, which means the longest delay determines the clock period
- Critical path: the load instruction (instruction memory -> register file -> ALU -> data memory -> register file)
- It is not feasible to vary the period for different instructions
- Violates the design principle of making the common case fast
We will improve performance by pipelining.

59 Pipelining Analogy (4.5 An Overview of Pipelining)
Consider the steps needed to process multiple loads of laundry with your roommate:
1. Place load in washer
2. Move load to dryer
3. Place dry load on table and fold
4. Have roommate put clothes away
5. Go to step 1 until all loads are finished
This is not a very time-efficient approach, as it doesn't take into account that steps 1-4 can be done in parallel.

60 Pipelining Analogy II Assuming that we have the resources that each step can be done at same time, we can overlap execution For example, immediately start next load in washer after we move wet clothes to dryer Pipelining is a technique where multiple steps (called stages) are operated concurrently For Laundry, after the first load was finished, all four stages would be active, but on different loads Chapter 4 The Processor 60

61 Pipelining Analogy III
For the sequential version, each load takes 2 hours. For the pipelined version, the first load takes 2 hours, but each additional load takes only an additional 30 minutes!
Four loads: Speedup = 8/3.5 = 2.3
Non-stop (n loads): Speedup = 2n/(0.5n + 1.5), which is approximately 4 (the number of stages) for large n
As n gets large, the speedup tends to the number of stages.
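
The speedup figures above can be checked numerically. The sketch below assumes four stages of 0.5 hours each, as in the laundry example; the function names are made up for illustration.

```python
def sequential_hours(loads: int, stages: int = 4, stage_hours: float = 0.5) -> float:
    """Each load runs through every stage before the next load starts."""
    return loads * stages * stage_hours


def pipelined_hours(loads: int, stages: int = 4, stage_hours: float = 0.5) -> float:
    """The first load takes `stages` steps; each later load adds one step."""
    return (stages + loads - 1) * stage_hours


def speedup(loads: int) -> float:
    return sequential_hours(loads) / pipelined_hours(loads)


if __name__ == "__main__":
    for n in (1, 4, 100, 1_000_000):
        print(n, round(speedup(n), 4))
```

With the default parameters, pipelined_hours(n) equals 0.5n + 1.5, so speedup(n) is the same expression as 2n/(0.5n + 1.5) on the slide.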

62 Pipelining Analogy IV
The time spent per stage (cycle time) is the time required for the longest stage, since laundry must be passed forward by all stages at the same time. We go from a cycle time of 2 hours to one of 30 minutes. After the initial latency to complete the first load, the completion time for each load is the new cycle time. We have not changed latency, but we have improved throughput.

63 LEGv8 Pipeline We can now apply this concept to instruction execution LEGv8 requires five steps to execute instructions IF: Fetch instruction from memory ID: Decode instruction and read register EX: Execute operation or calculate address MEM: Access operand in data memory (if needed) WB: Write result back to register (if needed) Each step becomes one stage of the instruction pipeline Clock period now only needs to be long enough for slowest stage to complete Chapter 4 The Processor 63

64 Pipeline Performance
Assume the elapsed times for the execution stages are:
- 100ps for register read or write
- 200ps for other stages
The elapsed time for each instruction class is shown in the table below:
Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
LDUR      200ps        100ps          200ps   200ps          100ps           800ps
STUR      200ps        100ps          200ps   200ps                          700ps
R-format  200ps        100ps          200ps                  100ps           600ps
CBZ       200ps        100ps          200ps                                  500ps
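
The totals in the table, and the clock periods they imply, can be recomputed with a short script. The dictionary keys and function name below are invented for this sketch; the stage times are the ones assumed on the slide.

```python
# Stage times in picoseconds, as assumed on the slide.
STAGE_PS = {"fetch": 200, "reg_read": 100, "alu": 200, "mem": 200, "reg_write": 100}

# Stages used by each instruction class.
STAGES_USED = {
    "LDUR":     ["fetch", "reg_read", "alu", "mem", "reg_write"],
    "STUR":     ["fetch", "reg_read", "alu", "mem"],
    "R-format": ["fetch", "reg_read", "alu", "reg_write"],
    "CBZ":      ["fetch", "reg_read", "alu"],
}


def total_time(instr: str) -> int:
    """Sum of the stage times an instruction class actually uses."""
    return sum(STAGE_PS[stage] for stage in STAGES_USED[instr])


# Single-cycle: the clock must fit the slowest instruction.
single_cycle_period = max(total_time(i) for i in STAGES_USED)
# Pipelined: the clock must fit the slowest stage.
pipelined_period = max(STAGE_PS.values())
```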

65 Pipeline Performance II Single-cycle (Tc= 800ps) Pipelined (Tc= 200ps) Chapter 4 The Processor 65

66 Pipeline Speedup If we execute 4 instructions, this takes 800 x 4 = 3200 for single-cycle version For pipelined: x 3 = 1400 Speedup is 3200/1400 = 2.29 Reason is that with only 4 instructions, the first one dominates. For a program executing billions of instructions, this should tend to a speedup of 5, the number of stages of the pipelines Assuming that the execution time of each stage is about the same Chapter 4 The Processor 66

67 Pipeline Speedup II
If all stages are balanced, i.e. all take the same time:
Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages
If the stages are not balanced, the speedup is less. The speedup is due to increased throughput; latency (the time for each instruction) does not decrease.

68 Pipelining and ISA Design
The LEGv8 ISA was designed for pipelining:
- All instructions are 32 bits
  Easier to fetch in one stage and decode in the next
  Compare with x86: 1- to 15-byte instructions
- Few and regular instruction formats
  Can decode and read registers in one step
- Load/store addressing
  Memory operands only appear here
  Frees up the ALU to calculate the address in the 3rd stage of the pipeline, and access memory in the 4th stage
  Compare with x86, which can perform operations (e.g. ADD) on operands in memory and would need an extra stage!

69 Hazards
Hazards are situations in which the next instruction cannot execute in the following clock cycle. There are three types:
- A structural hazard occurs when a required resource is needed at the same time by more than one instruction
- A data hazard occurs when an instruction cannot execute a stage because it is waiting on data from an earlier instruction, e.g. waiting for a result to be written to the destination register
- A control hazard occurs when the instruction that was fetched was not the one needed, e.g. we don't know which instruction is needed after a conditional branch until the branch is evaluated

70 Structural Hazards This is when different instructions have a conflict for simultaneous use of a resource If LEGv8 pipeline used a single memory we could get a structural hazard Both load and store require data access If tried to also do an instruction fetch at same time, pipeline would have to stall for that cycle Would cause a pipeline bubble Hence, pipelined datapaths require separate instruction/data memories Or separate instruction/data caches Chapter 4 The Processor 70

71 Data Hazards
A data hazard occurs when an instruction depends on the completion of a data access by a previous instruction:
  ADD X19, X0, X1
  SUB X2, X19, X3
The result of the ADD will not be written to X19 in time to be read for the SUB operation, causing a delay.
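
A dependence like the one above is easy to detect mechanically. The sketch below represents each instruction as a tuple (mnemonic, destination, source 1, source 2); that representation and the name raw_hazard are choices made here for illustration, not anything defined in the slides.

```python
def raw_hazard(producer: tuple, consumer: tuple) -> bool:
    """True if `consumer` reads the register that `producer` writes."""
    destination = producer[1]
    sources = consumer[2:]
    return destination in sources


add = ("ADD", "X19", "X0", "X1")   # ADD X19, X0, X1
sub = ("SUB", "X2", "X19", "X3")   # SUB X2, X19, X3

if __name__ == "__main__":
    print(raw_hazard(add, sub))    # SUB reads X19, which ADD writes
```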

72 Forwarding (aka Bypassing)
In forwarding, we use a result as soon as it is computed:
- Don't wait for it to be stored in a register
- Requires extra connections in the datapath

73 Load-Use Data Hazard
We can't always avoid stalls by forwarding. Consider a data load followed by a SUB instruction that uses the loaded value:
- If the value is not ready when needed, we can't forward backward in time!
- Even with forwarding, the data won't be retrieved until one cycle too late
When we need to delay the pipeline by a cycle or more, we call this a pipeline stall.

74 Code Scheduling to Avoid Stalls
To avoid a stall, the compiler can often help: it can reorder code to avoid using a load result in the next instruction.
C code for A = B + E; C = B + F;
Original order (13 cycles):
  LDUR X1, [X0,#0]
  LDUR X2, [X0,#8]
  (stall)
  ADD  X3, X1, X2
  STUR X3, [X0,#24]
  LDUR X4, [X0,#16]
  (stall)
  ADD  X5, X1, X4
  STUR X5, [X0,#32]
Reordered (11 cycles):
  LDUR X1, [X0,#0]
  LDUR X2, [X0,#8]
  LDUR X4, [X0,#16]
  ADD  X3, X1, X2
  STUR X3, [X0,#24]
  ADD  X5, X1, X4
  STUR X5, [X0,#32]
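
The cycle counts for the two orderings can be reproduced with a simplified model. This sketch assumes a 5-stage pipeline with full forwarding, so the only stall counted is one cycle when an instruction reads a register loaded by the LDUR immediately before it. The tuple layout (mnemonic, destination, sources) and the function names are hypothetical.

```python
def load_use_stalls(program) -> int:
    """Count instructions that read the result of the LDUR directly before them."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        op, destination, _ = previous
        if op == "LDUR" and destination in current[2]:
            stalls += 1
    return stalls


def total_cycles(program, stages: int = 5) -> int:
    """Cycles to drain the pipeline: fill time plus one per extra instruction and stall."""
    return stages + len(program) - 1 + load_use_stalls(program)


# Each entry: (mnemonic, destination register or None, registers read).
ORIGINAL = [
    ("LDUR", "X1", ("X0",)),
    ("LDUR", "X2", ("X0",)),
    ("ADD",  "X3", ("X1", "X2")),
    ("STUR", None, ("X3", "X0")),
    ("LDUR", "X4", ("X0",)),
    ("ADD",  "X5", ("X1", "X4")),
    ("STUR", None, ("X5", "X0")),
]
REORDERED = [ORIGINAL[0], ORIGINAL[1], ORIGINAL[4], ORIGINAL[2],
             ORIGINAL[3], ORIGINAL[5], ORIGINAL[6]]
```

Under these assumptions total_cycles(ORIGINAL) gives 13 and total_cycles(REORDERED) gives 11, matching the counts on the slide.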

75 Control Hazards Control Hazards are caused by branch instructions Conditional branches determine whether we execute the next instruction or the one at the branch target We don't know which until after the condition is evaluated Can try to guess which instruction is next Wrong guess means flushing the pipeline and loading the correct instruction In LEGv8 pipeline Need to compare registers and compute target early in the pipeline Add extra hardware to do it in ID stage Still causes a one cycle delay Chapter 4 The Processor 75

76 Stall on Branch One solution to deal with control hazards: Wait until branch outcome determined before fetching next instruction For LEGv8, adds an extra cycle Chapter 4 The Processor 76

77 Branch Prediction
Long pipelines can't readily determine the branch outcome early, so the stall penalty becomes unacceptable. A better option is to predict the outcome of the branch and only stall if the prediction is wrong.
In the LEGv8 pipeline:
- We can assume (predict) that a branch will never be taken
- Fetch the instruction after the branch, with no delay
- If the branch is actually taken, flush the pipeline and load the branch target instruction

78 More-Realistic Branch Prediction 1. Static branch prediction Based on typical branch behavior Example: for loop and if-statement branches Always predict backward branches taken Always predict forward branches not taken 2. Dynamic branch prediction Hardware measures actual branch behavior e.g., record recent history of each branch Assume future behavior will continue the trend When wrong, stall while re-fetching, and update history Chapter 4 The Processor 78
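
As a concrete illustration of dynamic prediction, the sketch below models a one-bit "last outcome" predictor for a single branch: it predicts that the branch will do whatever it did the previous time. The slide does not name a specific scheme, so this choice, and the function name count_mispredictions, are assumptions made for the example.

```python
def count_mispredictions(outcomes, initial_prediction: bool = False) -> int:
    """Count wrong guesses when predicting each outcome from the previous one."""
    prediction = initial_prediction
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken    # update history with the actual behaviour
    return misses


# A loop branch that is taken 9 times and then falls through once.
LOOP_BRANCH = [True] * 9 + [False]
```

For this loop branch the predictor is wrong on the first iteration and again on the exit, and it repeats both mistakes every time the loop is re-entered.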

79 Pipeline Summary (The BIG Picture)
Pipelining improves performance by increasing instruction throughput:
- Executes multiple instructions in parallel
- Each instruction has the same latency
- Parallelism is achieved without requiring action by the programmer
Pipelining is subject to hazards: structural, data, and control.
Instruction set design affects the complexity of the pipeline implementation.

80 LEGv8 Pipelined Datapath (4.6 Pipelined Datapath and Control)
Instructions and data mostly move from left to right. Exceptions: branch target and register writes.

81 Pipeline registers Once a pipeline is full, each stage is processing part of a different instruction This means each stage needs: To be independent of previous stages To keep a copy of the instruction information that either it needs, or that future stages will need We require registers between stages They will hold needed information produced in previous cycles Chapter 4 The Processor 81

82 Pipeline registers II Figure shows pipeline registers (highlighted) added between stages PC register and pipeline registers assumed to store information each cycle, so no write enable needed Chapter 4 The Processor 82

83 Pipeline Operation
We will first examine the flow of data through the pipeline, focusing on what data needs to be saved in the pipeline registers at each stage. We will then examine the flow of control signals through the pipeline.
We will first look at a single-clock-cycle pipeline diagram:
- Shows pipeline usage for a single cycle at a time
- Highlights the resources used
We'll look at single-clock-cycle diagrams for load and store.

84 IF Stage for Load, Store, Diagram shows which resources (blue) are being used by current stage On rising edge of the clock, PC register loads new instruction address This progresses to address selection for Instruction memory Selected 32 bit instruction appears at data port of memory Instruction and 64 bit PC register content arrive at input of IF/ID Value of PC+4 computed and fed to multiplexor for input of PC register Chapter 4 The Processor 84

85 IF stage for Load, Store, II Chapter 4 The Processor 85

86 ID for Load, Store, ...
- On the rising edge of the clock, the current instruction and the PC register contents from the previous stage are saved to the IF/ID register
- The new instruction stored in the IF/ID register is then decoded by the main controller to generate control signals
- The instruction stored in IF/ID is used to select the read/write registers
- The instruction is also used by the sign-extension unit
- The stored PC value, the output of the two read registers, and the sign-extended offset arrive at the input of the ID/EX register
We need to store in the next pipeline register any data that may be needed by a later stage.

87 ID for Load, Store, II Chapter 4 The Processor 87

88 EX for Load On rising edge of the clock, the stored PC value, the output of the two read registers, and the sign-extended offset are stored in the ID/EX register The new output (PC register and offset) of the ID/EX register is used to calculate a branch target address in case it is needed The stored offset and the stored Read data 1 are used by the ALU to calculate the desired data memory location The branch target address, the output of the ALU (zero and operation result), and the stored value from Read data 2 arrive at input of EX/MEM register Chapter 4 The Processor 88

89 EX for Load II Chapter 4 The Processor 89

90 MEM for Load On rising edge of the clock, branch target address, the output of the ALU (zero and operation result), and the stored value from Read data 2 are stored in the EX/MEM register The new output of the EX/MEM register is used to supply the address and write data inputs to the data memory unit The stored branch target is fed back to the input multiplexor for the PC register The stored output of the ALU (operation result), and the Read data output of the data memory arrive at input of MEM/WB register Chapter 4 The Processor 90

91 MEM for Load II Chapter 4 The Processor 91

92 WB for Load On the rising edge of the clock, the stored output of the ALU (operation result) and the Read data output of the data memory are stored in the MEM/WB register. The new output of the MEM/WB register is connected to the two inputs of the multiplexor that feeds back to the Write data input of the register file. The output of the multiplexor arrives at the Write data input of the register file. On the next rising edge, this data will be stored in the selected write register of the register file.
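
The MEM and WB steps for a load can be sketched together as follows. The data memory is modelled as a Python dictionary, and the control signal names MemRead, MemWrite, and MemToReg are taken from the control slides later in this chapter. The function names are hypothetical, and MemToReg = 1 selecting the memory data is an assumption based on the usual textbook convention.

```python
def memory_stage(alu_result, write_data, data_memory, mem_read, mem_write):
    """Model the MEM stage: return the values headed for the MEM/WB register."""
    read_data = data_memory.get(alu_result, 0) if mem_read else 0
    if mem_write:
        data_memory[alu_result] = write_data  # store: write at the ALU address
    return {"alu_result": alu_result, "read_data": read_data}


def write_back(mem_wb, mem_to_reg):
    """Multiplexor in the WB stage: choose the register file Write data value."""
    return mem_wb["read_data"] if mem_to_reg else mem_wb["alu_result"]
```

For a load, `mem_read` and `mem_to_reg` are both set, so the value read from memory is what reaches the register file.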

93 WB for Load (with error) II Wrong register number. We see now that we have made a design error. When we feed back data to be written to the write register, we are using the select bits from the wrong instruction! The select bits come from the IF/ID register, which by this time holds a later instruction.

94 Corrected Datapath for Load During the instruction decode stage of our load instruction, we need to save the write register select bits in our pipeline registers. We then feed back the saved select bits together with the data to be written.
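
The effect of the error and of the fix can be sketched with a small timing model. It assumes a five-stage pipeline that issues one instruction per cycle with no stalls, so that when instruction i is in WB the IF/ID register already holds instruction i + 3. The function name is hypothetical.

```python
def write_register_used_at_wb(rt_fields, carry_in_pipeline):
    """For each instruction, return the register number selected during its WB.

    rt_fields[i] is the write register field of instruction i.
    """
    selected = []
    for i, rt in enumerate(rt_fields):
        if carry_in_pipeline:
            # Corrected design: the select bits travel through ID/EX,
            # EX/MEM, and MEM/WB alongside the data.
            selected.append(rt)
        else:
            # Faulty design: the select bits are read from IF/ID, which
            # now holds instruction i + 3 (None if no such instruction).
            later = i + 3
            selected.append(rt_fields[later] if later < len(rt_fields) else None)
    return selected
```

With `rt_fields = [1, 2, 3, 4, 5]`, the faulty design would write the first instruction's result into register 4 rather than register 1.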

95 MEM for Store For a store operation, the memory stage is the last active stage. The store must still pass through the write-back stage, even though nothing happens there, because later instructions are already progressing at the maximum rate.

96 Multi-Cycle Pipeline Diagram This form of the diagram shows the resources used by each instruction as it progresses over time.

97 Multi-Cycle Pipeline Diagram II A more traditional form of this type of diagram.

98 Single-Cycle Pipeline Diagram Shows the state of the pipeline in a single given cycle.
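
The relationship between the two kinds of diagram can be sketched in a few lines. The multi-cycle view is a table with one row per instruction and one column per clock cycle, and the single-cycle view is one column of that table. This assumes an ideal five-stage pipeline without stalls. The function names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_rows(count):
    """Multi-cycle view: one row per instruction, one column per cycle."""
    total_cycles = count + len(STAGES) - 1
    rows = []
    for i in range(count):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage  # instruction i enters IF in cycle i
        rows.append(row)
    return rows


def stages_in_cycle(count, cycle):
    """Single-cycle view: which instruction occupies each stage in `cycle`."""
    rows = pipeline_rows(count)
    return {row[cycle]: i for i, row in enumerate(rows) if row[cycle]}
```

Cycles and instructions are numbered from 0 here, so `stages_in_cycle(5, 4)` describes the first cycle in which all five stages are busy.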

99 Pipelined Control (Simplified) We now need to add a controller. We start by adding control signals to the existing design.

100 Pipelined Control Design We first note that this design ignores the data and control hazards we discussed in Section 4.5. We will borrow as much as we can from the single-cycle design. We will keep the same Main and ALU controllers, branch logic, control signals, and multiplexor design. The Main controller will create its control signals during the ID stage, and we will then pass the needed control signals forward via the pipeline registers.

101 Pipelined Control Design II As the PC and pipeline registers are written on each clock cycle, they don't need separate write signals. As each control signal is associated with a component that is active only during a single stage, we can assign each signal to a stage. IF: nothing needed here. ID: need to set Reg2Loc. EX: need to set ALUOp and ALUSrc. MEM: need to set Branch, MemRead, and MemWrite. WB: need to set MemToReg and RegWrite.

102 Pipelined Control Propagation Control signals are derived from the instruction during the ID stage. Signals needed for the ID stage are kept local. The rest are saved into pipeline registers and passed forward to the stage where they are needed.
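
The grouping of control signals by stage, and the resulting contents of each pipeline register, can be sketched as a small table lookup. The signal names are those listed on the previous slide. The dictionary and function names are hypothetical.

```python
# Control signals grouped by the stage in which they are used.
STAGE_SIGNALS = {
    "ID": ["Reg2Loc"],
    "EX": ["ALUOp", "ALUSrc"],
    "MEM": ["Branch", "MemRead", "MemWrite"],
    "WB": ["MemToReg", "RegWrite"],
}
STAGE_ORDER = ["ID", "EX", "MEM", "WB"]


def signals_carried_after(stage):
    """Control signals held in the pipeline register that follows `stage`."""
    later_stages = STAGE_ORDER[STAGE_ORDER.index(stage) + 1:]
    return [name for s in later_stages for name in STAGE_SIGNALS[s]]
```

Under this model the ID/EX register carries seven signal names, EX/MEM carries five, and MEM/WB carries two, with each stage consuming its own group and passing the rest forward.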

103 Pipelined Control The next slide shows the full design, including the pipelined control signals and the stage in which each is used. As instruction bits 31:21 are used by the ALU Control unit in the EX stage, we must add these bits to the ID/EX register.

104 Pipelined Control II

105 Read for Own Interest Read Sections 4.7, 4.8, 4.9, and 4.10 for your own interest.

106 Read On Your Own Read Sections 4.14 and 4.15 on your own.

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor. COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor The Processor - Introduction

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition The Processor - Introduction

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count CPI and Cycle time Determined

More information

The Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture

The Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture The Processor Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut CSE3666: Introduction to Computer Architecture Introduction CPU performance factors Instruction count

More information

Chapter 4 The Processor 1. Chapter 4A. The Processor

Chapter 4 The Processor 1. Chapter 4A. The Processor Chapter 4 The Processor 1 Chapter 4A The Processor Chapter 4 The Processor 2 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware 4.1 Introduction We will examine two MIPS implementations

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

Processor (I) - datapath & control. Hwansoo Han

Processor (I) - datapath & control. Hwansoo Han Processor (I) - datapath & control Hwansoo Han Introduction CPU performance factors Instruction count - Determined by ISA and compiler CPI and Cycle time - Determined by CPU hardware We will examine two

More information

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle? CSE 2021: Computer Organization Single Cycle (Review) Lecture-10b CPU Design : Pipelining-1 Overview, Datapath and control Shakil M. Khan 2 Single Cycle with Jump Multi-Cycle Implementation Instruction:

More information

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3. Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =2n/05n+15 2n/0.5n 1.5 4 = number of stages 4.5 An Overview

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

Pipelining. CSC Friday, November 6, 2015

Pipelining. CSC Friday, November 6, 2015 Pipelining CSC 211.01 Friday, November 6, 2015 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory register file ALU data memory register file Not

More information

LECTURE 3: THE PROCESSOR

LECTURE 3: THE PROCESSOR LECTURE 3: THE PROCESSOR Abridged version of Patterson & Hennessy (2013):Ch.4 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU

More information

Chapter 4. The Processor. Instruction count Determined by ISA and compiler. We will examine two MIPS implementations

Chapter 4. The Processor. Instruction count Determined by ISA and compiler. We will examine two MIPS implementations Chapter 4 The Processor Part I Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations

More information

COMPUTER ORGANIZATION AND DESIGN. The Hardware/Software Interface. Chapter 4. The Processor: A Based on P&H

COMPUTER ORGANIZATION AND DESIGN. The Hardware/Software Interface. Chapter 4. The Processor: A Based on P&H COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 4 The Processor: A Based on P&H Introduction We will examine two MIPS implementations A simplified version A more realistic pipelined

More information

Full Datapath. Chapter 4 The Processor 2

Full Datapath. Chapter 4 The Processor 2 Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

More information

The Processor (1) Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

The Processor (1) Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University The Processor (1) Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)

More information

The Processor: Datapath and Control. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

The Processor: Datapath and Control. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University The Processor: Datapath and Control Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Introduction CPU performance factors Instruction count Determined

More information

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Introduction Chapter 4.1 Chapter 4.2 Review: MIPS (RISC) Design Principles Simplicity favors regularity fixed size instructions small number

More information

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14 MIPS Pipelining Computer Organization Architectures for Embedded Computing Wednesday 8 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy 4th Edition, 2011, MK

More information

Systems Architecture

Systems Architecture Systems Architecture Lecture 15: A Simple Implementation of MIPS Jeremy R. Johnson Anatole D. Ruslanov William M. Mongan Some or all figures from Computer Organization and Design: The Hardware/Software

More information

Full Datapath. Chapter 4 The Processor 2

Full Datapath. Chapter 4 The Processor 2 Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

More information

Chapter 4. The Processor. Computer Architecture and IC Design Lab

Chapter 4. The Processor. Computer Architecture and IC Design Lab Chapter 4 The Processor Introduction CPU performance factors CPI Clock Cycle Time Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS

More information

Processor (II) - pipelining. Hwansoo Han

Processor (II) - pipelining. Hwansoo Han Processor (II) - pipelining Hwansoo Han Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 =2.3 Non-stop: 2n/0.5n + 1.5 4 = number

More information

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Computer Organization and Structure. Bing-Yu Chen National Taiwan University Computer Organization and Structure Bing-Yu Chen National Taiwan University The Processor Logic Design Conventions Building a Datapath A Simple Implementation Scheme An Overview of Pipelining Pipelined

More information

Chapter 4. The Processor Designing the datapath

Chapter 4. The Processor Designing the datapath Chapter 4 The Processor Designing the datapath Introduction CPU performance determined by Instruction Count Clock Cycles per Instruction (CPI) and Cycle time Determined by Instruction Set Architecure (ISA)

More information

ECE260: Fundamentals of Computer Engineering

ECE260: Fundamentals of Computer Engineering Datapath for a Simplified Processor James Moscola Dept. of Engineering & Computer Science York College of Pennsylvania Based on Computer Organization and Design, 5th Edition by Patterson & Hennessy Introduction

More information

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri Department of Computer and IT Engineering University of Kurdistan Computer Architecture Pipelining By: Dr. Alireza Abdollahpouri Pipelined MIPS processor Any instruction set can be implemented in many

More information

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16 4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3 Emil Sekerinski, McMaster University, Fall Term 2015/16 Instruction Execution Consider simplified MIPS: lw/sw rt, offset(rs) add/sub/and/or/slt

More information

CMSC Computer Architecture Lecture 4: Single-Cycle uarch and Pipelining. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 4: Single-Cycle uarch and Pipelining. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 4: Single-Cycle uarch and Pipelining Prof. Yanjing Li University of Chicago Administrative Stuff! Lab1 due at 11:59pm today! Lab2 out " Pipeline ARM simulator "

More information

COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESI COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

More information

Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version

Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version MIPS Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

CSEE 3827: Fundamentals of Computer Systems

CSEE 3827: Fundamentals of Computer Systems CSEE 3827: Fundamentals of Computer Systems Lecture 21 and 22 April 22 and 27, 2009 martha@cs.columbia.edu Amdahl s Law Be aware when optimizing... T = improved Taffected improvement factor + T unaffected

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations Determined by ISA

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor 1 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A

More information

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

ECE260: Fundamentals of Computer Engineering

ECE260: Fundamentals of Computer Engineering Pipelining James Moscola Dept. of Engineering & Computer Science York College of Pennsylvania Based on Computer Organization and Design, 5th Edition by Patterson & Hennessy What is Pipelining? Pipelining

More information

Chapter 5: The Processor: Datapath and Control

Chapter 5: The Processor: Datapath and Control Chapter 5: The Processor: Datapath and Control Overview Logic Design Conventions Building a Datapath and Control Unit Different Implementations of MIPS instruction set A simple implementation of a processor

More information

COMP2611: Computer Organization. The Pipelined Processor

COMP2611: Computer Organization. The Pipelined Processor COMP2611: Computer Organization The 1 2 Background 2 High-Performance Processors 3 Two techniques for designing high-performance processors by exploiting parallelism: Multiprocessing: parallelism among

More information

EIE/ENE 334 Microprocessors

EIE/ENE 334 Microprocessors EIE/ENE 334 Microprocessors Lecture 6: The Processor Week #06/07 : Dejwoot KHAWPARISUTH Adapted from Computer Organization and Design, 4 th Edition, Patterson & Hennessy, 2009, Elsevier (MK) http://webstaff.kmutt.ac.th/~dejwoot.kha/

More information

Lecture 15: Pipelining. Spring 2018 Jason Tang

Lecture 15: Pipelining. Spring 2018 Jason Tang Lecture 15: Pipelining Spring 2018 Jason Tang 1 Topics Overview of pipelining Pipeline performance Pipeline hazards 2 Sequential Laundry 6 PM 7 8 9 10 11 Midnight Time T a s k O r d e r A B C D 30 40 20

More information

CSCI 402: Computer Architectures. Fengguang Song Department of Computer & Information Science IUPUI. Today s Content

CSCI 402: Computer Architectures. Fengguang Song Department of Computer & Information Science IUPUI. Today s Content 3/6/8 CSCI 42: Computer Architectures The Processor (2) Fengguang Song Department of Computer & Information Science IUPUI Today s Content We have looked at how to design a Data Path. 4.4, 4.5 We will design

More information

4.1.3 [10] < 4.3>Which resources (blocks) produce no output for this instruction? Which resources produce output that is not used?

4.1.3 [10] < 4.3>Which resources (blocks) produce no output for this instruction? Which resources produce output that is not used? 2.10 [20] < 2.2, 2.5> For each LEGv8 instruction in Exercise 2.9 (copied below), show the value of the opcode (Op), source register (Rn), and target register (Rd or Rt) fields. For the I-type instructions,

More information

Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining

Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining Single-Cycle Design Problems Assuming fixed-period clock every instruction datapath uses one

More information

Topic #6. Processor Design

Topic #6. Processor Design Topic #6 Processor Design Major Goals! To present the single-cycle implementation and to develop the student's understanding of combinational and clocked sequential circuits and the relationship between

More information

Full Datapath. CSCI 402: Computer Architectures. The Processor (2) 3/21/19. Fengguang Song Department of Computer & Information Science IUPUI

Full Datapath. CSCI 402: Computer Architectures. The Processor (2) 3/21/19. Fengguang Song Department of Computer & Information Science IUPUI CSCI 42: Computer Architectures The Processor (2) Fengguang Song Department of Computer & Information Science IUPUI Full Datapath Branch Target Instruction Fetch Immediate 4 Today s Contents We have looked

More information

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard Consider: a = b + c; d = e - f; Assume loads have a latency of one clock cycle:

More information

Chapter 4 The Processor 1. Chapter 4B. The Processor

Chapter 4 The Processor 1. Chapter 4B. The Processor Chapter 4 The Processor 1 Chapter 4B The Processor Chapter 4 The Processor 2 Control Hazards Branch determines flow of control Fetching next instruction depends on branch outcome Pipeline can t always

More information

Introduction. Datapath Basics

Introduction. Datapath Basics Introduction CPU performance factors - Instruction count; determined by ISA and compiler - CPI and Cycle time; determined by CPU hardware 1 We will examine a simplified MIPS implementation in this course

More information

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware

More information

LECTURE 5. Single-Cycle Datapath and Control

LECTURE 5. Single-Cycle Datapath and Control LECTURE 5 Single-Cycle Datapath and Control PROCESSORS In lecture 1, we reminded ourselves that the datapath and control are the two components that come together to be collectively known as the processor.

More information

TDT4255 Computer Design. Lecture 4. Magnus Jahre. TDT4255 Computer Design

TDT4255 Computer Design. Lecture 4. Magnus Jahre. TDT4255 Computer Design 1 TDT4255 Computer Design Lecture 4 Magnus Jahre 2 Outline Chapter 4.1 to 4.4 A Multi-cycle Processor Appendix D 3 Chapter 4 The Processor Acknowledgement: Slides are adapted from Morgan Kaufmann companion

More information

CO Computer Architecture and Programming Languages CAPL. Lecture 18 & 19

CO Computer Architecture and Programming Languages CAPL. Lecture 18 & 19 CO2-3224 Computer Architecture and Programming Languages CAPL Lecture 8 & 9 Dr. Kinga Lipskoch Fall 27 Single Cycle Disadvantages & Advantages Uses the clock cycle inefficiently the clock cycle must be

More information

CPU Organization (Design)

CPU Organization (Design) ISA Requirements CPU Organization (Design) Datapath Design: Capabilities & performance characteristics of principal Functional Units (FUs) needed by ISA instructions (e.g., Registers, ALU, Shifters, Logic

More information

Chapter 4 (Part II) Sequential Laundry

Chapter 4 (Part II) Sequential Laundry Chapter 4 (Part II) The Processor Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Sequential Laundry 6 P 7 8 9 10 11 12 1 2 A T a s k O r d e r A B C D 30 30 30 30 30 30 30 30 30 30

More information

COSC 6385 Computer Architecture - Pipelining

COSC 6385 Computer Architecture - Pipelining COSC 6385 Computer Architecture - Pipelining Fall 2006 Some of the slides are based on a lecture by David Culler, Instruction Set Architecture Relevant features for distinguishing ISA s Internal storage

More information

The Processor (3) Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

The Processor (3) Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University The Processor (3) Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)

More information

Outline. A pipelined datapath Pipelined control Data hazards and forwarding Data hazards and stalls Branch (control) hazards Exception

Outline. A pipelined datapath Pipelined control Data hazards and forwarding Data hazards and stalls Branch (control) hazards Exception Outline A pipelined datapath Pipelined control Data hazards and forwarding Data hazards and stalls Branch (control) hazards Exception 1 4 Which stage is the branch decision made? Case 1: 0 M u x 1 Add

More information

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 4: Datapath and Control

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 4: Datapath and Control ELEC 52/62 Computer Architecture and Design Spring 217 Lecture 4: Datapath and Control Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University, Auburn, AL 36849

More information

What is Pipelining? Time per instruction on unpipelined machine Number of pipe stages

What is Pipelining? Time per instruction on unpipelined machine Number of pipe stages What is Pipelining? Is a key implementation techniques used to make fast CPUs Is an implementation techniques whereby multiple instructions are overlapped in execution It takes advantage of parallelism

More information

Pipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Pipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Pipeline Hazards Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Hazards What are hazards? Situations that prevent starting the next instruction

More information

Lecture 7 Pipelining. Peng Liu.

Lecture 7 Pipelining. Peng Liu. Lecture 7 Pipelining Peng Liu liupeng@zju.edu.cn 1 Review: The Single Cycle Processor 2 Review: Given Datapath,RTL -> Control Instruction Inst Memory Adr Op Fun Rt

More information

Lecture 4: Review of MIPS. Instruction formats, impl. of control and datapath, pipelined impl.

Lecture 4: Review of MIPS. Instruction formats, impl. of control and datapath, pipelined impl. Lecture 4: Review of MIPS Instruction formats, impl. of control and datapath, pipelined impl. 1 MIPS Instruction Types Data transfer: Load and store Integer arithmetic/logic Floating point arithmetic Control

More information

The MIPS Processor Datapath

The MIPS Processor Datapath The MIPS Processor Datapath Module Outline MIPS datapath implementation Register File, Instruction memory, Data memory Instruction interpretation and execution. Combinational control Assignment: Datapath

More information

CS 31: Intro to Systems Digital Logic. Kevin Webb Swarthmore College February 3, 2015

CS 31: Intro to Systems Digital Logic. Kevin Webb Swarthmore College February 3, 2015 CS 31: Intro to Systems Digital Logic Kevin Webb Swarthmore College February 3, 2015 Reading Quiz Today Hardware basics Machine memory models Digital signals Logic gates Circuits: Borrow some paper if

More information

ECE260: Fundamentals of Computer Engineering

ECE260: Fundamentals of Computer Engineering ECE260: Fundamentals of Computer Engineering Pipelined Datapath and Control James Moscola Dept. of Engineering & Computer Science York College of Pennsylvania ECE260: Fundamentals of Computer Engineering

More information

5 th Edition. The Processor We will examine two MIPS implementations A simplified version A more realistic pipelined version

5 th Edition. The Processor We will examine two MIPS implementations A simplified version A more realistic pipelined version COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 4 5 th Edition Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined

More information

Introduction. Chapter 4. Instruction Execution. CPU Overview. University of the District of Columbia 30 September, Chapter 4 The Processor 1

Introduction. Chapter 4. Instruction Execution. CPU Overview. University of the District of Columbia 30 September, Chapter 4 The Processor 1 Chapter 4 The Processor Introduction CPU performance factors Instruction count etermined by IS and compiler CPI and Cycle time etermined by CPU hardware We will examine two MIPS implementations simplified

More information

CENG 3420 Computer Organization and Design. Lecture 06: MIPS Processor - I. Bei Yu

CENG 3420 Computer Organization and Design. Lecture 06: MIPS Processor - I. Bei Yu CENG 342 Computer Organization and Design Lecture 6: MIPS Processor - I Bei Yu CEG342 L6. Spring 26 The Processor: Datapath & Control q We're ready to look at an implementation of the MIPS q Simplified

More information

Pipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ...

Pipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ... CHAPTER 6 1 Pipelining Instruction class Instruction memory ister read ALU Data memory ister write Total (in ps) Load word 200 100 200 200 100 800 Store word 200 100 200 200 700 R-format 200 100 200 100

More information

Instruction word R0 R1 R2 R3 R4 R5 R6 R8 R12 R31

Instruction word R0 R1 R2 R3 R4 R5 R6 R8 R12 R31 4.16 Exercises 419 Exercise 4.11 In this exercise we examine in detail how an instruction is executed in a single-cycle datapath. Problems in this exercise refer to a clock cycle in which the processor

More information

Single Cycle Datapath

Single Cycle Datapath Single Cycle atapath Lecture notes from MKP, H. H. Lee and S. Yalamanchili Section 4.-4.4 Appendices B.7, B.8, B.,.2 Practice Problems:, 4, 6, 9 ing (2) Introduction We will examine two MIPS implementations

More information
