Lecture 10 Multi-Cycle Implementation

Similar documents
COMP303 - Computer Architecture Lecture 10. Multi-Cycle Design & Exceptions

Single-Cycle Examples, Multi-Cycle Introduction

ECE 313 Computer Organization EXAM 2 November 9, 2001

CSE 2021 Computer Organization. Hugh Chesser, CSEB 1012U W10-M

Lecture 5 and 6. ICS 152 Computer Systems Architecture. Prof. Juan Luis Aragón

ENE 334 Microprocessors

Lecture 5: The Processor

Systems Architecture I

CC 311- Computer Architecture. The Processor - Control

RISC Design: Multi-Cycle Implementation

ECE 313 Computer Organization Name SOLUTION EXAM 2 November 3, Floating Point 20 Points

LECTURE 6. Multi-Cycle Datapath and Control

Lab 8: Multicycle Processor (Part 1) 0.0

Computer Organization & Design The Hardware/Software Interface Chapter 5 The processor : Datapath and control

RISC Processor Design

ECE 313 Computer Organization FINAL EXAM December 11, Multicycle Processor Design 30 Points

The Processor: Datapath & Control

CPE 335. Basic MIPS Architecture Part II

ALUOut. Registers A. I + D Memory IR. combinatorial block. combinatorial block. combinatorial block MDR

Multiple Cycle Data Path

CO Computer Architecture and Programming Languages CAPL. Lecture 18 & 19

ECE369. Chapter 5 ECE369

Processor (multi-cycle)

Multicycle Approach. Designing MIPS Processor

Computer Science 141 Computing Hardware

Inf2C - Computer Systems Lecture 12 Processor Design Multi-Cycle

Initial Representation Finite State Diagram. Logic Representation Logic Equations

ECE 313 Computer Organization FINAL EXAM December 14, This exam is open book and open notes. You have 2 hours.

Implementing the Control. Simple Questions

Computer and Information Sciences College / Computer Science Department The Processor: Datapath and Control

Lets Build a Processor

Multi-cycle Approach. Single cycle CPU. Multi-cycle CPU. Requires state elements to hold intermediate values. one clock cycle or instruction

ﻪﺘﻓﺮﺸﻴﭘ ﺮﺗﻮﻴﭙﻣﺎﻛ يرﺎﻤﻌﻣ MIPS يرﺎﻤﻌﻣ data path and ontrol control

CSE 2021 COMPUTER ORGANIZATION

CS2214 COMPUTER ARCHITECTURE & ORGANIZATION SPRING 2014

Processor: Multi- Cycle Datapath & Control

Review: Abstract Implementation View

RISC Architecture: Multi-Cycle Implementation

CENG 3420 Lecture 06: Datapath

Initial Representation Finite State Diagram Microprogram. Sequencing Control Explicit Next State Microprogram counter

Chapter 5: The Processor: Datapath and Control

CENG 3420 Computer Organization and Design. Lecture 06: MIPS Processor - I. Bei Yu

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 4: Datapath and Control

CSE 2021 COMPUTER ORGANIZATION

RISC Architecture: Multi-Cycle Implementation

Mapping Control to Hardware

ECE473 Computer Architecture and Organization. Processor: Combined Datapath

Microprogramming. Microprogramming

CPE 335 Computer Organization. Basic MIPS Architecture Part I

Major CPU Design Steps

CS152 Computer Architecture and Engineering Lecture 13: Microprogramming and Exceptions. Review of a Multiple Cycle Implementation

Chapter 4 The Processor (Part 2)

EECE 417 Computer Systems Architecture

Outline. Combinational Element. State (Sequential) Element. Clocking Methodology. Input/Output of Elements

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: Data Paths and Microprogramming

ECE232: Hardware Organization and Design

Topic #6. Processor Design

ECE468 Computer Organization and Architecture. Designing a Multiple Cycle Controller

Multicycle conclusion

Processor (I) - datapath & control. Hwansoo Han

Alternative to single cycle. Drawbacks of single cycle implementation. Multiple cycle implementation. Instruction fetch

Processor Design CSCE Instructor: Saraju P. Mohanty, Ph. D. NOTE: The figures, text etc included in slides are borrowed

Points available Your marks Total 100

5.7. Microprogramming: Simplifying Control Design 5.7

COMP303 Computer Architecture Lecture 9. Single Cycle Control

Recap: A Single Cycle Datapath. CS 152 Computer Architecture and Engineering Lecture 8. Single-Cycle (Con t) Designing a Multicycle Processor

Merging datapaths: (add,lw, sw)

Outline of today s lecture. EEL-4713 Computer Architecture Designing a Multiple-Cycle Processor. What s wrong with our CPI=1 processor?

CSE 2021 Computer Organization. Hugh Chesser, CSEB 1012U W9-W

Chapter 5 Solutions: For More Practice

THE HONG KONG UNIVERSITY OF SCIENCE & TECHNOLOGY Computer Organization (COMP 2611) Spring Semester, 2014 Final Examination

EECS 322 Computer Architecture Improving Memory Access: the Cache

Systems Architecture

Pipelined Datapath. One register file is enough

CS152 Computer Architecture and Engineering. Lecture 8 Multicycle Design and Microcode John Lazzaro (

CSE 141 Computer Architecture Spring Lectures 11 Exceptions and Introduction to Pipelining. Announcements

CMOS VLSI Design. MIPS Processor Example. Outline

ECS 154B Computer Architecture II Spring 2009

The Processor (1) Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Note- E~ S. \3 \S U\e. ~ ~s ~. 4. \\ o~ (fw' \i<.t. (~e., 3\0)

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Designing a Multicycle Processor

The Big Picture: Where are We Now? EEM 486: Computer Architecture. Lecture 3. Designing a Single Cycle Datapath

Lecture Topics. Announcements. Today: Single-Cycle Processors (P&H ) Next: continued. Milestone #3 (due 2/9) Milestone #4 (due 2/23)

CS2214 COMPUTER ARCHITECTURE & ORGANIZATION SPRING 2014

The overall datapath for RT, lw,sw beq instrucution

Lecture 6: Microprogrammed Multi Cycle Implementation. James C. Hoe Department of ECE Carnegie Mellon University

Lecture 8: Control COS / ELE 375. Computer Architecture and Organization. Princeton University Fall Prof. David August

The Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture

Today s Menu. Multi-Cycle Exceptions Pipelining. Exceptions and Interrupts. Handling Exceptions and Interrupts. Why This Is Very Messy.

Introduction to Pipelined Datapath

Review Multicycle: What is Happening. Controlling The Multicycle Design

Chapter 4. The Processor. Computer Architecture and IC Design Lab

CS/COE0447: Computer Organization

CS/COE0447: Computer Organization

Computer Architecture Chapter 5. Fall 2005 Department of Computer Science Kent State University

Lecture 6 Datapath and Controller

Lecture 7 Pipelining. Peng Liu.

CS232 Final Exam May 5, 2001

How to design a controller to produce signals to control the datapath

Transcription:

Lecture 10 ulti-cycle Implementation 1

Today s enu ulti-cycle machines Why multi-cycle? Comparative performance Physical and Logical Design of Datapath and Control icroprogramming 2

ulti-cycle Solution Idea: Let the FASTEST instruction determine clock period Instr Class 1 Instr Class 2 Instr Class 3 Takes 4 cycles Takes 2 cycles 1 3 3 1 2 1 Less Wasted Time 3

ulti-cycle Reality We are going to go further than allowing the fastest instruction to determine rate We are going to break EVERY instruction up into phases R-class Load Branch Store 4

IPS Architecture Instruction Classes AL Load Store Branch Jump IF mem rd mem rd mem rd mem rd mem rd ID OF op code op code op code op code op code operands operands operands operands operands 1 or 2 reg rd 2 reg rd 2 reg rd 2 regs rd PC, (immd) immd immd PC, immd immd E op addr gen addr gen cond gen addr gen mem rd mem wr addr gen RS reg wr reg wr NI PC+4 PC+4 PC+4 if cond; update PC update PC 5

Note That Instruction decode and register read must be done for all classes of instructions PC+4 can be done right after fetch Address generation for Jump can be performed during the decode step The same adder can be shared among: Instruction fetch logic (PC = PC + 4) Address generation logic (Address = A + IR[15:0]) Branch target calculation (Next PC = (PC + 4) + IR[15:0]) AL operations (ALOutput = A + B) Same memory port can be used to access instructions and data 6

Full Diagram of ulti-cycle achine Figure 5.33 7

Recall IPS Instruction Formats R-type Instructions 31-26 25-21 20-16 15-11 10-6 5-0 op rs rt rd shamt funct Load or Store 31-26 25-21 20-16 15-0 35 or 43 rs rt immediate (address) Branch 31-26 25-21 20-16 15-0 4 rs rt immediate (address) Jump 31-26 25-0 2 immediate (address) 8

Execution Steps (1) Instruction Fetch IR = emory[pc]; PC = PC + 4; 9

Fetch PC[31:28] 30 Shift left 2 Jump Address Target 32 emory PC Instruction Register Read Reg1 Read Read Reg2 Data 1 Write Reg Write Data Read Data 2 A B 4 Zero AL ALOut Write Data DR Sign Extend 16 32 Shift left 2 10

Execution Steps (2) Instruction Decode and Register Fetch A = Reg[IR[25..21]]; B = Reg[IR[20..16]]; Target = PC + 4 + (signextend(ir[15..0]) << 2); 11

Decode and Register Op Fetch PC[31:28] 30 Shift left 2 Jump Address Target 32 emory PC Instruction Register Read Reg1 Read Read Reg2 Data 1 Write Reg Write Data Read Data 2 A B 4 Zero AL ALOut Write Data DR Sign Extend 16 32 Shift left 2 12

Execution Step (3) Execution, memory address computation or branch completion emory Reference ALOut = A + signextend(ir[15..0]); Arithmetic/Logical Operation ALOut = A + B Branch If (A == B) PC = Target; Jump PC = PC[31..28] (IR[25..0) << 2); 13

Execute (R-type) PC[31:28] 30 Shift left 2 Jump Address Target 32 emory PC Instruction Register Read Reg1 Read Read Reg2 Data 1 Write Reg Write Data Read Data 2 A B 4 Zero AL ALOut Write Data DR Sign Extend 16 32 Shift left 2 14

Execution Step (4) emory access or R-type instruction completion emory Reference DR = emory[alout]; or emory[alout] = B; Arithmetic/Logical Instructions (R-type) Reg[IR[15..11]] = ALOut; Branch, Jump Nothing 15

Execute (R-type) PC[31:28] 30 Shift left 2 Jump Address Target 32 emory PC Instruction Register Read Reg1 Read Read Reg2 Data 1 Write Reg Write Data Read Data 2 A B 4 Zero AL ALOut Write Data DR Sign Extend 16 32 Shift left 2 16

Execution Step (5) emory Read completion (Load only) Reg[IR[20..16]] = DR; 17

Finite State achine Control Instruction Fetch/Decode and Register Fetch emory Access R-Type Instructions Branch Instruction Jump Instruction 18

ulticycle Control emread emwrite IRWrite RegWrite AL SelA PC IorD Instruction Register RegDest Read Reg1 Read Read Reg2Data 1 Write Reg Write Data Read Data 2 A B 4 Zero AL ALOut Write Data DR Sign Shift Extend left 2 16 32 emtoreg Instruction [5:0] AL SelB AL Control AL Op 19

Instruction Fetch and Decode Start Instruction Fetch State 0 emread ALSelA = 0 IorD = 0 IRWrite ALSelB = 01 ALOp = 00 PCWrite PCSource = 00 Instruction Decode/ Register Fetch State 1 ALSelA = 0 ALSelB = 11 ALOp = 00 emory Reference FS R-Type FS Branch FS Jump FS 20

emory-reference FS From State 1 2 emory address computation ALSelA = 1 ALSelB = 10 ALOp = 00 3 emread IorD =1 5 emory Access emwrite IorD= 1 emory Access 4 Write-back step To State 0 RegWrite emtoreg = 1 RegDst = 0 21

R-type FS From State 1 6 Execution ALSelA = 1 ALSelB =00 ALOp = 10 7 R-type completion RegDst = 1 RegWrite emtoreg = 0 To state 0 22

Branch FS From State 1 8 Branch Completion ALSelA = 1 ALSelB =00 ALOp = 10 PCWriteCond PCSource = 01 To state 0 23

Jump FS From State 1 9 PCWrite PCSource = 10 Jump Completion To state 0 24

Complete State Diagram Figure 5.42 from book 25

ulticycle Control emread emwrite IRWrite RegWrite AL SelA PC IorD Instruction Register RegDest Read Reg1 Read Read Reg2Data 1 Write Reg Write Data Read Data 2 A B 4 Zero AL ALOut Write Data DR Sign Shift Extend left 2 16 32 emtoreg Instruction [5:0] AL SelB AL Control AL Op 26

Performance of ulticycle Implementation Each type of instruction can take a variable # of cycles Example Assume the following instruction distributions: loads 5 cycles 22% stores 4 cycles 11% R-type 4 cycles 49% branches 3 cycles 16% jump 3 cycles 2% What s the average Cycles Per Instruction (CPI) CPI = (CP clock cycles/instruction Count) CPI = (5 cycles * 0.22) + (4 cycles * 0.11) + (4 cycles * 0.49) + (3 cycles * 0.16) + (3 cycles * 0.02) CPI = 4.04 cycles per instruction What was the CPI for the single-cycle machine? Single cycle implies 1 clock cycle per instruction --> CPI = 1.0 So isn t the single-cycle machine about 4 times faster? 27

Performance of ulticycle Implementation The correct answer should consider the clock cycle time as well: For the single cycle implementation, the cycle time is given by the worst case delay: T cycle = 40ns (for load instructions, see slide 8) For the multicycle implementation, the cycle time is given by the worst case delay over all execution steps: T cycle = 10 ns (for each of the steps 1, 2, 3, or 4). The execution time per instruction is: CPI * T cycle = 40 * 1 = 40 ns per instruction for the single cycle machine CPI * T cycle = 10 * 4.04 = 40.4 ns per instruction for the multicycle machine Thus, the single cycle machine is only 1% faster When considering other types of units (e.g., FP), the single cycle implementation can be very inefficient. 28

icrocode: Another Approach Another way to implement a ealy machine: Inputs RO N: Inputs : Outputs S: State Bits Outputs Storage: + S (bits/word) 2 N+S (words) 29

icrocode II: oore achines Inputs RO Inputs RO Outputs Outputs 30

So who cares? Imagine: Hundreds of instructions Tens of different instruction classes (we ve seen four) Instructions that take anywhere from 1 to 100 cycles to complete That s what any real ISA has Now imagine drawing the FS diagram for that! Solution 1: Write Verilog and use synthesis (today) Solution 2: se some programming lessons 31

FS structure Class 1: Fetch Decode First cycles are the same No reconvergence after decode Limited Branching Class 2: Class N: 32

Exploiting the structure for microcode +1 Instruction Class (Decode from InstructionReg) icrocode emory FetchAddr So now what can we do: Go to instruction class address Go to the next word in the memory Go back to the fetch address Outputs Control everything by assigning bits in the word to control signals 33

icrocode Word Definition AL Control: SRC1: SRC2: Register Control: emory: PCWrite control: Sequencing: Add, Subt, Func code PC, A B, 4, Extend, Extshift Read, Write AL, Write DR Read PC, Read AL, Write AL AL, ALOut-cond, JumpAddress Seq, Fetch, Dispatch Total Word Size >= 13 34

Why is this nice? Reduce complexity of control design Only way to do it before synthesis tools Only way to encode Complex Instruction Sets Allows bug fixes, optimizations after real hardware Now how do people do this today? Specify Style for: Registers Latches Combinational Logic Finite State achines 35

And Never Forget arketing It is easy to sell a 1GHz processor It is harder to sell Yes our processor is only 500Hz, but it actually achieves a lower CPI, and therefore really is higher performance. 36

Other ulti-cycle Data Paths Single cycle datapaths are strongly implied by ISA Is this true of multi-cycle implementations? NO! There are a lot of possible data paths 37

Exceptions and Interrupts Exceptions are exceptional events that disrupt the normal flow of a program Terminology varies between different machines Examples of Interrupts ser hitting the keyboard Disk drive asking for attention Arrival of a network packet Examples of Exceptions Divide by zero Overflow Page fault 38

Handling Exceptions and Interrupts When do we jump to an exception? pon detection, invoke the OS to service the event Right when it occurs? What about in the middle of executing a multi-cycle instruction Difficult to abort the middle of an instruction Processor checks for event at the end of every instruction Processor provides EPC & Cause registers to inform OS of cause EPC - Exception Program Counter Holds PC that the OS should jump to when resuming execution Cause Register Holds bit-encoded cause of the exception 39

Exception Flow When an exception (or interrupt) occurs, control is transferred to the OS ser Process Operating System Event exception Exception return (optional) Exception processing by exception handler 40

Summary Single cycle implementations have to consider the worst case delay through the datapath to come-up with the cycle time. ulticycle implementations have the advantage of using a different number of cycles for executing each instruction. In general, the multicycle machine is better than the single cycle machine, but the actual execution time strongly depends on the workload. The most widely used machine implementation is neither single cycle, nor multicycle it s the pipelined implementation. 41