ECE369. Chapter 5 ECE369

Similar documents
The Processor: Datapath & Control

Lets Build a Processor

Multicycle Approach. Designing MIPS Processor

ﻪﺘﻓﺮﺸﻴﭘ ﺮﺗﻮﻴﭙﻣﺎﻛ يرﺎﻤﻌﻣ MIPS يرﺎﻤﻌﻣ data path and ontrol control

Chapter 4 The Processor (Part 2)

Topic #6. Processor Design

ENE 334 Microprocessors

Lecture 5 and 6. ICS 152 Computer Systems Architecture. Prof. Juan Luis Aragón

EECE 417 Computer Systems Architecture

Systems Architecture I

CPE 335. Basic MIPS Architecture Part II

Processor: Multi- Cycle Datapath & Control

Chapter 5: The Processor: Datapath and Control

CC 311- Computer Architecture. The Processor - Control

LECTURE 6. Multi-Cycle Datapath and Control

CSE 2021 COMPUTER ORGANIZATION

CSE 2021 COMPUTER ORGANIZATION

Implementing the Control. Simple Questions

RISC Processor Design

Mapping Control to Hardware

Lecture 8: Control COS / ELE 375. Computer Architecture and Organization. Princeton University Fall Prof. David August

Processor (I) - datapath & control. Hwansoo Han

Lecture 5: The Processor

Multiple Cycle Data Path

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: Data Paths and Microprogramming

Inf2C - Computer Systems Lecture 12 Processor Design Multi-Cycle

Alternative to single cycle. Drawbacks of single cycle implementation. Multiple cycle implementation. Instruction fetch

Major CPU Design Steps

RISC Architecture: Multi-Cycle Implementation

Points available Your marks Total 100

CS/COE0447: Computer Organization

CS/COE0447: Computer Organization

Design of the MIPS Processor

Computer Science 141 Computing Hardware

Chapter 5 Solutions: For More Practice

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 4: Datapath and Control

Processor (multi-cycle)

Systems Architecture

CO Computer Architecture and Programming Languages CAPL. Lecture 18 & 19

RISC Architecture: Multi-Cycle Implementation

Design of the MIPS Processor (contd)

ALUOut. Registers A. I + D Memory IR. combinatorial block. combinatorial block. combinatorial block MDR

ECE 313 Computer Organization FINAL EXAM December 14, This exam is open book and open notes. You have 2 hours.

Chapter 4. The Processor. Computer Architecture and IC Design Lab

RISC Design: Multi-Cycle Implementation

Microprogramming. Microprogramming

COMPUTER ORGANIZATION AND DESIGN. The Hardware/Software Interface. Chapter 4. The Processor: A Based on P&H

Multicycle conclusion

TDT4255 Computer Design. Lecture 4. Magnus Jahre. TDT4255 Computer Design

Note- E~ S. \3 \S U\e. ~ ~s ~. 4. \\ o~ (fw' \i<.t. (~e., 3\0)

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Single vs. Multi-cycle Implementation

Using a Hardware Description Language to Design and Simulate a Processor 5.8

Initial Representation Finite State Diagram. Logic Representation Logic Equations

Single Cycle Datapath

THE HONG KONG UNIVERSITY OF SCIENCE & TECHNOLOGY Computer Organization (COMP 2611) Spring Semester, 2014 Final Examination

ECE232: Hardware Organization and Design

Control Unit for Multiple Cycle Implementation

Control & Execution. Finite State Machines for Control. MIPS Execution. Comp 411. L14 Control & Execution 1

CPU Organization (Design)

COMP303 - Computer Architecture Lecture 8. Designing a Single Cycle Datapath

Processor Implementation in VHDL. University of Ulster at Jordanstown University of Applied Sciences, Augsburg

ECE 3056: Architecture, Concurrency and Energy of Computation. Single and Multi-Cycle Datapaths: Practice Problems

Chapter 4. The Processor

Outline. Combinational Element. State (Sequential) Element. Clocking Methodology. Input/Output of Elements

CENG 3420 Computer Organization and Design. Lecture 06: MIPS Processor - I. Bei Yu

The Processor (1) Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Chapter 4. The Processor

CSE 2021: Computer Organization Fall 2010 Solution to Assignment # 3: Multicycle Implementation

The Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture

Multi-cycle Approach. Single cycle CPU. Multi-cycle CPU. Requires state elements to hold intermediate values. one clock cycle or instruction

The Processor: Datapath and Control. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Digital Design & Computer Architecture (E85) D. Money Harris Fall 2007

CS232 Final Exam May 5, 2001

4. What is the average CPI of a 1.4 GHz machine that executes 12.5 million instructions in 12 seconds?

Computer Architecture Chapter 5. Fall 2005 Department of Computer Science Kent State University

ECE260: Fundamentals of Computer Engineering

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Inf2C - Computer Systems Lecture Processor Design Single Cycle

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

LECTURE 5. Single-Cycle Datapath and Control

Single Cycle Datapath

EECS150 - Digital Design Lecture 10- CPU Microarchitecture. Processor Microarchitecture Introduction

Chapter 4. The Processor. Instruction count Determined by ISA and compiler. We will examine two MIPS implementations

Computer Organization & Design The Hardware/Software Interface Chapter 5 The processor : Datapath and control

Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining

CPE 335 Computer Organization. Basic MIPS Architecture Part I

Lecture 4: Review of MIPS. Instruction formats, impl. of control and datapath, pipelined impl.

CSE 2021 Computer Organization. Hugh Chesser, CSEB 1012U W10-M

The MIPS Processor Datapath

EECS150 - Digital Design Lecture 9- CPU Microarchitecture. Watson: Jeopardy-playing Computer

Single Cycle Data Path

Announcements. This week: no lab, no quiz, just midterm

CS Computer Architecture Spring Week 10: Chapter

CSE Computer Architecture I Fall 2009 Lecture 13 In Class Notes and Problems October 6, 2009

Review: Abstract Implementation View

Chapter 4. The Processor Designing the datapath

CENG 3420 Lecture 06: Datapath

Initial Representation Finite State Diagram Microprogram. Sequencing Control Explicit Next State Microprogram counter

Transcription:

Chapter 5 1

State Elements Unclocked vs. Clocked Clocks used in synchronous logic Clocks are needed in sequential logic to decide when an element that contains state should be updated. State element 1 Combinational logic State element 2 Clock cycle 2

Latches and Flip-flops C Q D _ Q 3

Latches and Flip-flops D D C D latch Q D C D latch Q Q Q Q C 4

Latches and Flip-flops Latches: whenever the inputs change, and the clock is asserted Flip-flop: state changes only on a clock edge (edge-triggered methodology) 5

SRAM 6

SRAM vs. DRAM Which one has a better memory density? static RAM (SRAM): value stored in a cell is kept on a pair of inverting gates dynamic RAM (DRAM), value kept in a cell is stored as a charge in a capacitor. DRAMs use only a single transistor per bit of storage, By comparison, SRAMs require four to six transistors per bit Which one is faster? In DRAMs, the charge is stored on a capacitor, so it cannot be kept indefinitely and must periodically be refreshed. (called dynamic) Synchronous RAMs?? is the ability to transfer a burst of data from a series of sequential addresses within an array or row. 7

Datapath & control design We will design a simplified MIPS processor The instructions supported are Memory-reference instructions: lw, sw Arithmetic-logical instructions: add, sub, and, or, slt Control flow instructions: beq, j Generic implementation Use the program counter (PC) to supply instruction address Get the instruction from memory Read registers Use the instruction to decide exactly what to do All instructions use the ALU after reading the registers Why? memory-reference? arithmetic? control flow? 8

ALU Control ALU's operation based on instruction type and function code Example: add $t1, $s7, $s8 000000 10111 11000 01001 00000 100000 000 AND op rs rt rd shamt funct 001 OR 010 Add 110 Subtract 111 Set-on-less-than lw $t0, 32($s2) 35 18 8 32 op rs rt 16-bit number 9

Summary of Instruction Types R-Type: op=0 31:26 25:21 20:16 15:11 10:6 5:0 op rs rt rd shamt funct Load/Store: op=35 or 43 31:26 25:21 20:16 15:0 op rs rt address Branch: op=4 31:26 25:21 20:16 15:0 op rs rt address 10

Building blocks Instruction address Instruction PC Add Sum MemWrite Instruction memory a. Instruction memory b. Program counter c. Adder Address Write data Data memory Read data 16 Sign 32 extend Register numbers Data 5 Read 3 register 1 Read 5 data 1 Read 5 register 2 Registers Write register Write data Read data 2 RegWrite Data ALU control ALU ALU result a. Registers b. ALU Zero MemRead a. Data memory unit b. Sign-extension unit Why do we need each of these? 11

Fetching instructions 12

Reading registers 13

Load/Store memory access 14

Branch target 15

Combining datapath for memory and R-type instructions 16

Appending instruction fetch 17

Now Insert Branch 18

The simple datapath 19

Control For each instruction Select the registers to be read (always read two) Select the 2nd ALU input Select the operation to be performed by ALU Select if data memory is to be read or written Select what is written and where in the register file Select what goes in PC Information comes from the 32 bits of the instruction 20

Adding control to datapath 21

Adding control to datapath 0 4 Add Instruction [31 26] Control RegDst Branch MemRead MemtoReg ALUOp MemWrite ALUSrc RegWrite Shift left 2 Add ALU result M u x 1 PC Read address Instruction memory Instruction [31 0] Instruction [25 21] Instruction [20 16] Instruction [15 11] 0 M u x 1 Read register 1 Read register 2 Registers Write register Write data Read data 1 Read data 2 0 M u x 1 Zero ALU ALU result Address Write data Data memory Read data 1 M u x 0 Instruction [15 0] 16 Sign 32 extend ALU control Instruction [5 0] 22

ALU Control given instruction type 00 = lw, sw 01 = beq, 10 = arithmetic 23

Control (Reading Assignment: Appendix C.2) Simple combinational logic (truth tables) Inputs Op5 Op4 ALUOp ALUOp0 ALUcontrol block Op3 Op2 Op1 Op0 ALUOp1 Outputs F (5 0) F3 F2 F1 F0 Operation2 Operation1 Operation0 Operation R-format Iw sw beq RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp1 ALUOpO 24

Instruction RegDst ALUSrc R-format lw sw beq Memto- Reg Reg Write Mem Read Mem Write Branch ALUOp1 ALUp0 25

Datapath in Operation for R-Type Instruction Memto- Reg Mem Mem Instruction RegDst ALUSrc Reg Write Read Write Branch ALUOp1 ALUp0 R-format 1 0 0 1 0 0 0 1 0 lw sw beq 26

Datapath in Operation for Load Instruction Memto- Reg Mem Mem Instruction RegDst ALUSrc Reg Write Read Write Branch ALUOp1 ALUp0 R-format 1 0 0 1 0 0 0 1 0 lw 0 1 1 1 1 0 0 0 0 sw X 1 X 0 0 1 0 0 0 beq 27

Datapath in Operation for Branch Equal Instruction Memto- Reg Mem Mem Instruction RegDst ALUSrc Reg Write Read Write Branch ALUOp1 ALUp0 R-format 1 0 0 1 0 0 0 1 0 lw 0 1 1 1 1 0 0 0 0 sw X 1 X 0 0 1 0 0 0 beq X 0 X 0 0 0 1 0 1 28

Datapath with control for Jump instruction J-type instructions use 6 bits for the opcode, and 26 bits for the immediate value (called the target). newpc <- PC[31-28] IR[25-0] 00 29

Timing: Single cycle implementation Calculate cycle time assuming negligible delays except Memory (2ns), ALU and adders (2ns), Register file access (1ns) 30

Why is Single Cycle not GOOD??? Memory - 2ns; ALU - 2ns; Adder - 2ns; Reg - 1ns Instruction class Instruction memory Register read ALU Data memory Register write Total (in ns) ALU type 2 1 2 0 1 6 Load word 2 1 2 2 1 8 Store word 2 1 2 2 7 Branch 2 1 2 5 Jump 2 2 what if we had floating point instructions to handle? 31

1 clock cycle fixed vs. variable for each instruction Memory - 2ns; ALU - 2ns; Adder - 2ns; Reg - 1ns Loads 24% Stores 12% R-type 44% Branch 18% Jumps 2% Instruction class Instruction memory Register read ALU Data memory Register write Total (in ns) ALU type 2 1 2 0 1 6 Load word 2 1 2 2 1 8 Store word 2 1 2 2 7 Branch 2 1 2 5 Jump 2 2 32

1 clock cycle fixed vs. variable for each instruction Memory - 2ns; ALU - 2ns; Adder - 2ns; Reg - 1ns Loads 24% Stores 12% R-type 44% Branch 18% Jumps 2% CPU = IC * CPI * CC CPU = 8*24% + 7*12% + 6*44% + 5*18% + 2*2% CPU = 6.3ns Instruction class Instruction memory Register read ALU Data memory Register write Total (in ns) ALU type 2 1 2 0 1 6 Load word 2 1 2 2 1 8 Store word 2 1 2 2 7 Branch 2 1 2 5 Jump 2 2 33

Single Cycle Problems Wasteful of area Each unit used once per clock cycle Clock cycle equal to worst case scenario Will reducing the delay of common case help? 34

Multicycle Approach Ability to allow different cycles for different instructions Ability to share functional units within the execution of a single instruction Each step in execution in 1 cycle Functional unit to be used more than once per instruction As long as it is used in different cycle This will lead to reduction in hardware 35

Where we are headed One Solution: use a smaller cycle time have different instructions take different numbers of cycles a multicycle datapath shares resources Break up the instructions into steps, each step takes a cycle balance the amount of work to be done restrict each cycle to use only one major functional unit At the end of a cycle store values for use in later cycles (easiest thing to do) introduce additional internal registers PC Address Instruction or data Memory Data Instruction register Memory data register Data Register # Registers Register # Register # A B ALU ALUOut What is new? One ALU Single memory Some registers 36

Multicycle Datapath R I op rs rt rd shamt funct op rs rt 16 bit address J op 26 bit address What is this wire for? 37

Multicycle Datapath with Control Signals Missing wires?? 38

Multicycle Datapath, All Together 39

Idea behind multicycle approach We define each instruction from the ISA perspective Break it down into steps following our rule that data flows through at most one major functional unit (e.g., balance work across steps) Introduce new registers as needed (e.g, A, B, ALUOut, MDR, etc.) Finally try and pack as much work into each step (avoid unnecessary cycles) while also trying to share steps where possible (minimizes control, helps to simplify solution) Result: Our book s multicycle Implementation! 40

Breaking down an instruction ISA definition of arithmetic: Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]] Could break down to: IR <= Memory[PC] A <= Reg[IR[25:21]] B <= Reg[IR[20:16]] ALUOut <= A op B Reg[IR[20:16]] <= ALUOut We forgot an important part of the definition of arithmetic! PC <= PC + 4 41

Five Execution Steps Instruction Fetch Instruction Decode and Register Fetch Execution, Memory Address Computation, or Branch Completion Memory Access or R-type instruction completion Write-back step INSTRUCTIONS TAKE FROM 3-5 CYCLES! 42

Step 1: Instruction Fetch Use PC to get instruction and put it in the Instruction Register. Increment the PC by 4 and put the result back in the PC. Can be described succinctly using RTL "Register-Transfer Language" IR <= Memory[PC]; PC <= PC + 4; Can we figure out the values of the control signals? What is the advantage of updating the PC now? 43

Step 2: Instruction Decode and Register Fetch Read registers rs and rt in case we need them Compute the branch address in case the instruction is a branch RTL: A <= Reg[IR[25:21]]; B <= Reg[IR[20:16]]; ALUOut <= PC + (sign-extend(ir[15:0]) << 2); We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic) 44

Step 3 Execution (instruction dependent) Memory Reference: ALUOut <= A + sign-extend(ir[15:0]); R-type: ALUOut <= A op B; Branch: if (A==B) PC <= ALUOut; Jump: PC <= {PC[31:28],(IR[25:0],2 b00)}; 45

Step 4 and 5 Step 4 R-type Completion or memory-access Memory access or R-type completion: MDR <= Memory[ALUout]; or Memory[ALUout] <= B; R-type instructions finish Reg[IR[15:11]] <= ALUOut; Step 5 Write-back step Reg[IR[20:16]] <= MDR; Why not do this in Step 4? 46

Summary: 47

Defining the Control Now that we have determined what the control signals are and when they have to be asserted Next step: implementing the control unit Single cycle datapath used truth tables Not feasible Alternatives Finite state machines Microprogramming 48

Implementing the Control Value of control signals is dependent upon: what instruction is being executed which step is being performed Use the information we ve accumulated to specify a finite state machine specify the finite state machine graphically, or use microprogramming Implementation can be derived from specification 49

Finite State Machine Control A <= Reg[IR[25:21]]; B <= Reg[IR[20:16]]; ALUOut <= PC + (sign-extend(ir[15:0]) << 2); IR <= Memory[PC]; PC <= PC + 4; 50

Big Picture 51

State Table 52

State Table 53

ROM Implementation ROM = "Read Only Memory" values of memory locations are fixed ahead of time A ROM can be used to implement a truth table if the address is m-bits, we can address 2 m entries in the ROM. our outputs are the bits of data that the address points to. m n 0 0 0 0 0 1 1 0 0 1 1 1 0 0 0 1 0 1 1 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 1 1 1 0 0 1 1 0 1 1 1 0 1 1 1 m is the "height", and n is the "width" 54

ROM Implementation (Appendix-C!!!) How many inputs are there? 6 bits for opcode, 4 bits for state = 10 address lines (i.e., 2 10 = 1024 different addresses) How many outputs are there? 16 datapath-control outputs, 4 state bits = 20 outputs ROM is 2 10 x 20 = 20K bits (and a rather unusual size) Rather wasteful, since for lots of the entries, the outputs are the same i.e., opcode is often ignored 55

Big Picture 56

Putting All Together 57

PLA 58

59

ROM vs PLA Break up the table into two parts 4 state bits tell you the 16 outputs, 2 4 x 16 bits of ROM 10 bits tell you the 4 next state bits, 2 10 x 4 bits of ROM Total: 4.3K bits of ROM PLA is much smaller can share product terms only need entries that produce an active output can take into account don't cares Size is (#inputs #product-terms) + (#outputs #product-terms) For this example = (10x17)+(20x17) = 510 PLA cells PLA cells usually about the size of a ROM cell (slightly bigger) 60

Another Implementation Style Complex instructions: the "next state" is often current state + 1 Control unit PLA or ROM Input Outputs PCWrite PCWriteCond IorD MemRead MemWrite IRWrite BWrite MemtoReg PCSource ALUOp ALUSrcB ALUSrcA RegWrite RegDst AddrCtl 1 State Adder Address select logic Op[5 0] Instruction register opcode field 61

Microprogramming Control unit Microcode memory Input Outputs PCWrite PCWriteCond IorD MemRead MemWrite IRWrite BWrite MemtoReg PCSource ALUOp ALUSrcB ALUSrcA RegWrite RegDst AddrCtl Datapath 1 Microprogram counter Adder Address select logic Instruction register opcode field What are the microinstructions? 62

Pentium 4 Pipelining is important (last IA-32 without it was 80386 in 1985) Control Control I/O interface Instruction cache Enhanced floating point and multimedia Control Data cache Integer datapath Secondary cache and memory interface Chapter 7 Chapter 6 Advanced pipelining hyperthreading support Control Pipelining is used for the simple instructions favored by compilers Simply put, a high performance implementation needs to ensure that the simple instructions execute quickly, and that the burden of the complexities of the instruction set penalize the complex, less frequently used, instructions 63

Chapter 5 Summary If we understand the instructions We can build a simple processor! If instructions take different amounts of time, multi-cycle is better Datapath implemented using: Combinational logic for arithmetic State holding elements to remember bits Control implemented using: Combinational logic for single-cycle implementation Finite state machine for multi-cycle implementation 64