EECS 151/251A Fall 2017 Digital Design and Integrated Circuits. Instructor: John Wawrzynek and Nicholas Weaver. Lecture 13 EE141



Project Introduction
- You will design and optimize a RISC-V processor.
- Phase 1: Design a functional processor and demonstrate it.
- Phase 2:
  - ASIC Lab: improve performance using an accelerator.
  - FPGA Lab: add streaming input-output acceleration.

EECS151/251A Project
[Figure: PYNQ board (FPGA lab) with the RISC-V core and I/O functions]
- Executes the most commonly used RISC-V instructions (http://riscv.org).
- Lightly pipelined (high-performance) implementation.
- FPGA focus: add I/O functions and optimize.
- ASIC focus: add accelerators.

Outline
- Processor Introduction
- MIPS CPU implementation
- Pipelining for Performance
- 3-Stage Pipeline
Check Harris & Harris, Chapter 6.

Processor Introduction

The Purpose of This Lecture
- To speed-run 1.5 weeks of 61C in preparation for the project, since the project requires you to build a RISC-V processor.
- To cover the differences between RISC-V and MIPS.

RISC-V vs MIPS
- All RISC processors are effectively the same except for one or two design decisions that "seemed like a good idea at the time".
- MIPS "seemed like a good idea": the branch delay slot. Always execute the instruction after a branch or jump, whether or not the branch or jump is taken.
- RISC-V: nothing yet, but the immediate encoding can be hard to explain to people.
- Lectures are MIPS (to match the book); the project is RISC-V.

Real Differences
- Different register naming conventions and calling conventions: $4 vs x4, $s0 vs s0, etc.
- Instruction encodings and encoding formats:
  - 3 encodings (MIPS) vs 6 (RISC-V)
  - all-0 is a no-op in MIPS but invalid in RISC-V
  - all RISC-V immediates are sign-extended
  - RISC-V doesn't support "trap on overflow" signed math
- Instruction alignment:
  - RISC-V only requires 2-byte alignment for instructions when the optional 16-bit instruction encoding is included
  - RISC-V also supports some 48-bit and 64-bit instructions in extensions
- Instructions:
  - RISC-V has a dedicated "compare 2 registers and branch" operation
  - RISC-V doesn't have j or jr, just jal and jalr: write to x0 to eliminate the side effect

Abstraction Layers
- Architecture: the programmer's view of the computer, defined by instructions (operations) and operand locations.
- Microarchitecture: how to implement an architecture in hardware (covered in great detail later).
- The microarchitecture is built out of logic circuits and memory elements (this semester).
- All logic circuits and memory elements are implemented in the physical world with transistors.
- This semester we will implement our projects using circuits on FPGAs (field-programmable gate arrays) or standard-cell ASIC design.

Interpreting Machine Code
A processor is a machine code interpreter built in hardware!

Processor Microarchitecture Introduction
- Microarchitecture: how to implement an architecture in hardware.
- A good example of how to put the principles of digital design into practice.
- Introduction to the EECS151/251A final project.

MIPS Microarchitecture Organization
Datapath + Controller + External Memory

How to Design a Processor: step-by-step
1. Analyze the instruction set architecture (ISA) to determine the datapath requirements:
   - the meaning of each instruction is given by its data transfers (register transfers)
   - the datapath must include storage elements for the ISA registers
   - the datapath must support each data transfer
2. Select a set of datapath components and establish a clocking methodology.
3. Assemble a datapath that meets the requirements.
4. Analyze the implementation of each instruction to determine the settings of the control points that effect the data transfer.
5. Assemble the control logic.

MIPS CPU Implementation - Datapath

Review: The MIPS Instruction Formats

Format              Fields (bit range, width)
R-type (register)   op [31:26] 6 bits | rs [25:21] 5 bits | rt [20:16] 5 bits | rd [15:11] 5 bits | shamt [10:6] 5 bits | funct [5:0] 6 bits
I-type (immediate)  op [31:26] 6 bits | rs [25:21] 5 bits | rt [20:16] 5 bits | address/immediate [15:0] 16 bits
J-type (jump)       op [31:26] 6 bits | target address [25:0] 26 bits

The different fields are:
- op: operation ("opcode") of the instruction
- rs, rt, rd: the source and destination register specifiers
- shamt: shift amount
- funct: selects the variant of the operation in the op field
- address / immediate: address offset or immediate value
- target address: target address of the jump instruction
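To make the field boundaries concrete, here is a minimal Python sketch that slices a 32-bit instruction word into the fields listed above. The function name `decode` and the dictionary layout are illustrative choices, not part of the lecture; only the bit ranges come from the format table.

```python
def decode(instr: int) -> dict:
    """Split a 32-bit MIPS instruction word into every field it could contain.

    Which fields are meaningful depends on the format (R, I or J), which is
    determined by the opcode; this helper simply extracts all of them.
    """
    return {
        "op":     (instr >> 26) & 0x3F,       # bits 31:26
        "rs":     (instr >> 21) & 0x1F,       # bits 25:21
        "rt":     (instr >> 16) & 0x1F,       # bits 20:16
        "rd":     (instr >> 11) & 0x1F,       # bits 15:11 (R-type)
        "shamt":  (instr >> 6) & 0x1F,        # bits 10:6  (R-type)
        "funct":  instr & 0x3F,               # bits 5:0   (R-type)
        "imm16":  instr & 0xFFFF,             # bits 15:0  (I-type)
        "target": instr & 0x03FFFFFF,         # bits 25:0  (J-type)
    }


# add $5, $3, $4 assembled by hand: op=0, rs=3, rt=4, rd=5, shamt=0, funct=0x20
word = (0 << 26) | (3 << 21) | (4 << 16) | (5 << 11) | (0 << 6) | 0x20
fields = decode(word)
```

In hardware this "decode" is just wiring: each field is a fixed slice of the instruction bus, which is why the formats keep rs and rt in the same place across R-type and I-type.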

Subset for Lecture
- add, sub, or, slt (R-type)
    addu rd, rs, rt
    subu rd, rs, rt
    Fields: op [31:26] 6 bits | rs [25:21] 5 bits | rt [20:16] 5 bits | rd [15:11] 5 bits | shamt [10:6] 5 bits | funct [5:0] 6 bits
- lw, sw (I-type)
    lw rt, rs, imm16
    sw rt, rs, imm16
    Fields: op [31:26] 6 bits | rs [25:21] 5 bits | rt [20:16] 5 bits | immediate [15:0] 16 bits
- beq (I-type)
    beq rs, rt, imm16
    Fields: op [31:26] 6 bits | rs [25:21] 5 bits | rt [20:16] 5 bits | immediate [15:0] 16 bits

Register Transfer Descriptions
All start with instruction fetch:
    {op, rs, rt, rd, shamt, funct} ← IMEM[ PC ]   OR   {op, rs, rt, Imm16} ← IMEM[ PC ]
THEN

inst   Register Transfers
add    R[rd] ← R[rs] + R[rt], PC ← PC + 4;
sub    R[rd] ← R[rs] - R[rt], PC ← PC + 4;
or     R[rd] ← R[rs] | R[rt], PC ← PC + 4;
slt    R[rd] ← (R[rs] < R[rt]) ? 1 : 0, PC ← PC + 4;
lw     R[rt] ← DMEM[ R[rs] + sign_ext(imm16) ], PC ← PC + 4;
sw     DMEM[ R[rs] + sign_ext(imm16) ] ← R[rt], PC ← PC + 4;
beq    if ( R[rs] == R[rt] ) then PC ← PC + 4 + {sign_ext(imm16), 00} else PC ← PC + 4;
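The register transfers above can be read almost directly as an interpreter. The following Python sketch is a behavioral model under stated assumptions, not the lecture's datapath: instructions are passed as already-decoded tuples rather than fetched from IMEM, memory is a dictionary keyed by byte address, uninitialized memory reads as 0, and the names `step`, `state`, `sign_ext16` and `to_signed32` are hypothetical.

```python
MASK = 0xFFFFFFFF  # registers and the PC are 32 bits wide


def sign_ext16(imm16):
    """Interpret a 16-bit field as a signed two's-complement value."""
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def to_signed32(x):
    """Interpret a 32-bit register value as signed (needed for slt)."""
    return x - 0x100000000 if x & 0x80000000 else x


def step(state, instr):
    """Apply one instruction's register transfers to the architectural state."""
    R, dmem = state["R"], state["DMEM"]
    name, a, b, c = instr
    next_pc = state["PC"] + 4                      # default: PC <- PC + 4
    if name == "add":                              # (name, rd, rs, rt)
        R[a] = (R[b] + R[c]) & MASK
    elif name == "sub":
        R[a] = (R[b] - R[c]) & MASK
    elif name == "or":
        R[a] = R[b] | R[c]
    elif name == "slt":
        R[a] = 1 if to_signed32(R[b]) < to_signed32(R[c]) else 0
    elif name == "lw":                             # (name, rt, rs, imm16)
        R[a] = dmem.get((R[b] + sign_ext16(c)) & MASK, 0)
    elif name == "sw":                             # (name, rt, rs, imm16)
        dmem[(R[b] + sign_ext16(c)) & MASK] = R[a]
    elif name == "beq":                            # (name, rs, rt, imm16)
        if R[a] == R[b]:
            next_pc += sign_ext16(c) << 2          # {sign_ext(imm16), 00}
    else:
        raise ValueError(f"instruction not in the lecture subset: {name}")
    R[0] = 0                                       # MIPS register 0 always reads as zero
    state["PC"] = next_pc & MASK


state = {"PC": 0, "R": [0] * 32, "DMEM": {}}
state["R"][3], state["R"][4] = 7, 5
program = [
    ("add", 5, 3, 4),        # R[5] <- 7 + 5
    ("sw", 5, 0, 16),        # DMEM[16] <- R[5]
    ("lw", 6, 0, 16),        # R[6] <- DMEM[16]
    ("beq", 5, 6, 0xFFFF),   # taken, offset -1 word: PC <- PC + 4 - 4
]
for instr in program:
    step(state, instr)
```

A model like this is useful as a golden reference when checking a hardware implementation: both should agree on register and memory contents after every instruction.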

Microarchitecture
Multiple implementations for a single architecture:
- Single-cycle: each instruction executes in a single clock cycle.
- Multicycle: each instruction is broken up into a series of shorter steps, with one step per clock cycle.
- Pipelined (a variant of multicycle): each instruction is broken up into a series of steps with one step per clock cycle, and multiple instructions execute at once by overlapping in time.
- Superscalar: multiple functional units execute multiple instructions at the same time.
- Out of order: hey, who says we have to follow the program exactly...

CPU clocking (1/2)
Single-cycle CPU: all stages of an instruction are completed within one long clock cycle. The clock cycle is made sufficiently long to allow each instruction to complete all stages without interruption and within one cycle.
Stages: 1. Instruction Fetch, 2. Decode / Register Read, 3. Execute, 4. Memory, 5. Register Write

CPU clocking (2/2)
Multiple-cycle CPU: only one stage of an instruction per clock cycle. The clock is made as long as the slowest stage.
Stages: 1. Instruction Fetch, 2. Decode / Register Read, 3. Execute, 4. Memory, 5. Register Write
Several significant advantages over single-cycle execution: unused stages in a particular instruction can be skipped, OR instructions can be pipelined (overlapped).

MIPS State Elements
State encodes everything about the execution status of a processor:
- PC register
- 32 registers
- Memory
Note: for these state elements, the clock is used for write but not for read (asynchronous read, synchronous write).

Single-Cycle Datapath: lw fetch
First consider executing lw: R[rt] ← DMEM[ R[rs] + sign_ext(imm16) ]
STEP 1: Fetch the instruction.

Single-Cycle Datapath: lw register read
R[rt] ← DMEM[ R[rs] + sign_ext(imm16) ]
STEP 2: Read the source operands from the register file.

Single-Cycle Datapath: lw immediate
R[rt] ← DMEM[ R[rs] + sign_ext(imm16) ]
STEP 3: Sign-extend the immediate.

Single-Cycle Datapath: lw address
R[rt] ← DMEM[ R[rs] + sign_ext(imm16) ]
STEP 4: Compute the memory address.

Single-Cycle Datapath: lw memory read
R[rt] ← DMEM[ R[rs] + sign_ext(imm16) ]
STEP 5: Read data from memory and write it back to the register file.

Single-Cycle Datapath: lw PC increment
PC ← PC + 4
STEP 6: Determine the address of the next instruction.

Single-Cycle Datapath: sw
DMEM[ R[rs] + sign_ext(imm16) ] ← R[rt]
Write the data in rt to memory.

Single-Cycle Datapath: R-type instructions
R[rd] ← R[rs] op R[rt]
- Read from rs and rt.
- Write ALUResult to the register file.
- Write to rd (instead of rt).

Single-Cycle Datapath: beq
if ( R[rs] == R[rt] ) then PC ← PC + 4 + {sign_ext(imm16), 00}
- Determine whether the values in rs and rt are equal.
- Calculate the branch target address: BTA = (sign-extended immediate << 2) + (PC + 4)

Complete Single-Cycle Processor

MIPS Processor Implementation - Control

ALU Control

F[2:0]  Function
000     A & B
001     A | B
010     A + B
011     not used
100     A & ~B
101     A | ~B
110     A - B
111     SLT
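The ALU function table can be exercised with a small behavioral model. This is a sketch, not the lecture's ALU: the name `alu` is hypothetical, operands are treated as 32-bit values, and SLT is modeled as a signed comparison (an assumption; a hardware ALU typically derives it from the sign bit of A - B).

```python
MASK = 0xFFFFFFFF  # model a 32-bit ALU


def to_signed32(x):
    """Interpret a 32-bit value as two's-complement signed."""
    return x - 0x100000000 if x & 0x80000000 else x


def alu(f, a, b):
    """Return the ALU result for the 3-bit control code f = F[2:0]."""
    not_b = ~b & MASK
    if f == 0b000:
        return a & b
    if f == 0b001:
        return a | b
    if f == 0b010:
        return (a + b) & MASK
    if f == 0b100:
        return a & not_b
    if f == 0b101:
        return a | not_b
    if f == 0b110:
        return (a - b) & MASK
    if f == 0b111:
        return 1 if to_signed32(a) < to_signed32(b) else 0   # SLT, signed (assumed)
    raise ValueError("F = 011 is not used")


difference = alu(0b110, 3, 5)   # 3 - 5 wraps around to a 32-bit two's-complement value
```

Note how F[2] acts as an "invert B" bit in this encoding: codes 100, 101 and 110 are the F[2] = 0 operations applied to ~B (with a carry-in of 1 for subtraction).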

Control Unit

Control Unit: ALU Decoder

ALUOp[1:0]  Meaning
00          Add
01          Subtract
10          Look at Funct
11          Not used

ALUOp[1:0]  Funct          ALUControl[2:0]
00          XXXXXX         010 (Add)
X1          XXXXXX         110 (Subtract)
1X          100000 (add)   010 (Add)
1X          100010 (sub)   110 (Subtract)
1X          100100 (and)   000 (And)
1X          100101 (or)    001 (Or)
1X          101010 (slt)   111 (SLT)
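The ALU decoder table is small enough to write out as a lookup. The sketch below follows the rows in the order given, so the X1 row is checked before the 1X rows; the names `alu_decoder` and `FUNCT_TO_ALUCONTROL` are hypothetical, and raising `KeyError` for an unlisted funct value is a modeling choice (hardware would output something, possibly a don't-care).

```python
# Funct field -> ALUControl for R-type instructions (rows with ALUOp = 1X)
FUNCT_TO_ALUCONTROL = {
    0b100000: 0b010,  # add -> Add
    0b100010: 0b110,  # sub -> Subtract
    0b100100: 0b000,  # and -> And
    0b100101: 0b001,  # or  -> Or
    0b101010: 0b111,  # slt -> SLT
}


def alu_decoder(alu_op, funct):
    """Map the 2-bit ALUOp and 6-bit funct to the 3-bit ALUControl."""
    if alu_op == 0b00:
        return 0b010                      # Add (lw, sw, addi)
    if alu_op & 0b01:
        return 0b110                      # X1: Subtract (beq)
    return FUNCT_TO_ALUCONTROL[funct]     # 1X: look at funct


lw_control = alu_decoder(0b00, 0b000000)   # funct is ignored when ALUOp = 00
```

Splitting control into a main decoder (opcode to ALUOp) and this ALU decoder keeps each truth table small, which is the point of the two-level scheme on the slide.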

Control Unit: Main Decoder

Instruction  Op[5:0]  RegWrite  RegDst  AluSrc  Branch  MemWrite  MemtoReg  ALUOp[1:0]
R-type       000000   1         1       0       0       0         0         10
lw           100011   1         0       1       0       0         1         00
sw           101011   0         X       1       0       1         X         00
beq          000100   0         X       0       1       0         X         01

Single-Cycle Datapath Example: or

Extended Functionality: addi
No change to the datapath.

Control Unit: addi

Instruction  Op[5:0]  RegWrite  RegDst  AluSrc  Branch  MemWrite  MemtoReg  ALUOp[1:0]
R-type       000000   1         1       0       0       0         0         10
lw           100011   1         0       1       0       0         1         00
sw           101011   0         X       1       0       1         X         00
beq          000100   0         X       0       1       0         X         01
addi         001000   1         0       1       0       0         0         00

Extended Functionality: j

Control Unit: Main Decoder (with Jump)

Instruction  Op[5:0]  RegWrite  RegDst  AluSrc  Branch  MemWrite  MemtoReg  ALUOp[1:0]  Jump
R-type       000000   1         1       0       0       0         0         10          0
lw           100011   1         0       1       0       0         1         00          0
sw           101011   0         X       1       0       1         X         00          0
beq          000100   0         X       0       1       0         X         01          0
j            000010   0         X       X       X       0         X         XX          1
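The main decoder truth tables (including the addi and j rows) translate directly into a lookup keyed by opcode. This is a Python sketch with hypothetical names (`main_decoder`, `CONTROL_TABLE`, `COLUMNS`); don't cares are represented as `None`, the opcode for j is taken to be the standard MIPS encoding 000010, and Jump = 0 for addi is an assumption since the addi slide predates the Jump column.

```python
X = None  # don't care

COLUMNS = ("RegWrite", "RegDst", "AluSrc", "Branch",
           "MemWrite", "MemtoReg", "ALUOp", "Jump")

# opcode -> control word, one row per instruction class
CONTROL_TABLE = {
    0b000000: (1, 1, 0, 0, 0, 0, 0b10, 0),  # R-type
    0b100011: (1, 0, 1, 0, 0, 1, 0b00, 0),  # lw
    0b101011: (0, X, 1, 0, 1, X, 0b00, 0),  # sw
    0b000100: (0, X, 0, 1, 0, X, 0b01, 0),  # beq
    0b001000: (1, 0, 1, 0, 0, 0, 0b00, 0),  # addi (Jump = 0 assumed)
    0b000010: (0, X, X, X, 0, X, X, 1),     # j
}


def main_decoder(op):
    """Return the control signals for a 6-bit opcode as a dict."""
    return dict(zip(COLUMNS, CONTROL_TABLE[op]))


# Signals that change architectural state are fully specified in every row.
side_effecting_ok = all(
    main_decoder(op)["RegWrite"] is not X and main_decoder(op)["MemWrite"] is not X
    for op in CONTROL_TABLE
)
```

The final check mirrors the rule about don't cares: RegWrite and MemWrite cause side effects, so no row leaves them unspecified.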

Reminder on Don't Cares (X)
- You can use don't cares on any signal that doesn't change state.
  - Gives the synthesis tool freedom, at the cost of potential bugs if you mess up.
- You must never specify don't care on signals which cause side effects.
  - Otherwise the tool will cause unintended writes.

A Verilog Convention
Control logic works really well as a case statement:

    always @* begin
      op  = instr[31:26];    // opcode is the top 6 bits of the instruction
      imm = instr[15:0];
      ...
      reg_dst   = 1'bx;      // Don't care
      reg_write = 1'b0;      // Do care, side effecting
      ...
      case (op)
        6'b000000: begin
          reg_write = 1'b1;
          ...
        end
        ...
      endcase
    end

Processor Pipelining

Review: Processor Performance (The Iron Law)
Program Execution Time = (# instructions) × (cycles/instruction) × (seconds/cycle)
                       = # instructions × CPI × T_C
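As a quick worked example of the Iron Law, the sketch below multiplies the three factors for a made-up workload. All numbers are assumptions chosen for illustration (100 billion instructions, CPI of 1, a 925 ps cycle); none come from the lecture, and `execution_time` is a hypothetical helper name.

```python
def execution_time(n_instructions, cpi, t_c_seconds):
    """Iron Law: time = instructions x cycles/instruction x seconds/cycle."""
    return n_instructions * cpi * t_c_seconds


# Assumed workload and clock period (illustrative values only)
n_instructions = 100e9      # 100 billion instructions
single_cycle = execution_time(n_instructions, cpi=1.0, t_c_seconds=925e-12)

# A pipelined design trades a shorter cycle for a CPI above 1 (values assumed)
pipelined = execution_time(n_instructions, cpi=1.15, t_c_seconds=550e-12)
speedup = single_cycle / pipelined
```

The comparison shows why all three terms must be considered together: the pipelined design in this example wins even though its CPI is worse, because its cycle time shrinks by more than its CPI grows.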

Single-Cycle Performance
T_C is limited by the critical path (lw).

Single-Cycle Performance
Single-cycle critical path:
$T_c = t_{q\_pc} + t_{mem} + \max(t_{RFread},\ t_{sext} + t_{mux}) + t_{ALU} + t_{mem} + t_{mux} + t_{RFsetup}$
In most implementations, the limiting paths are: memory, ALU, register file.
$T_c = t_{q\_pc} + 2\,t_{mem} + t_{RFread} + t_{mux} + t_{ALU} + t_{RFsetup}$
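To see how the two forms of the critical-path expression relate, here is a small Python calculation. The delay values are assumptions picked for illustration (in picoseconds), not figures from the lecture; the dictionary keys simply mirror the subscripts in the formula.

```python
# Assumed element delays in picoseconds (illustrative only)
t = {
    "q_pc": 30,      # PC register clock-to-Q
    "mem": 250,      # instruction or data memory read
    "RFread": 150,   # register file read
    "sext": 10,      # sign extension
    "mux": 25,       # multiplexer
    "ALU": 200,      # ALU
    "RFsetup": 20,   # register file setup
}

# Full expression: the register read and the sign-extend + mux paths run in parallel
t_c_full = (t["q_pc"] + t["mem"] + max(t["RFread"], t["sext"] + t["mux"])
            + t["ALU"] + t["mem"] + t["mux"] + t["RFsetup"])

# Simplified expression, valid when t_RFread >= t_sext + t_mux
t_c_simplified = (t["q_pc"] + 2 * t["mem"] + t["RFread"] + t["mux"]
                  + t["ALU"] + t["RFsetup"])

max_clock_ghz = 1000.0 / t_c_full   # 1000 ps per ns, so 1000 / T_c[ps] is in GHz
```

With these assumed delays the two memory accesses account for more than half of the cycle, which is the usual motivation for splitting the memories into their own pipeline stages.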

Pipelined MIPS Processor
- Temporal parallelism.
- Divide the single-cycle processor into 5 stages: Fetch, Decode, Execute, Memory, Writeback.
- Add pipeline registers between stages.

Single-Cycle vs. Pipelined Performance

Single-Cycle and Pipelined Datapath

Corrected Pipelined Datapath
WriteReg must arrive at the same time as Result.

Pipelined Control
- Same control unit as the single-cycle processor.
- Control signals are delayed to the proper pipeline stage.

Pipeline Hazards
A hazard occurs when an instruction depends on results from a previous instruction that hasn't completed.
Types of hazards:
- Data hazard: register value not written back to the register file yet.
- Control hazard: next instruction not decided yet (caused by branches).

Processor Pipelining
Deeper pipeline example:

    IF1 IF2 ID  X1  X2  M1  M2  WB
        IF1 IF2 ID  X1  X2  M1  M2  WB

- Deeper pipelines => less logic per stage => higher clock rate.
- But deeper pipelines* => more hazards => more cost and/or higher CPI. Cycles per instruction might go up because of unresolvable hazards.
- Remember: Performance = 1 / execution time = f_clk / (# instructions × CPI).
- How about shorter pipelines? Less cost, less performance.

* Many designs included pipelines as long as 7, 10 and even 20 stages (like in the Intel Pentium 4). The later "Prescott" and "Cedar Mill" Pentium 4 cores (and their Pentium D derivatives) had a 31-stage pipeline.

3-Stage Pipeline

3-Stage Pipeline (used for project)
The blocks in the datapath with the greatest delay are IMEM, ALU, and DMEM. Allocate one pipeline stage to each:
- I: Use the PC register as the address to IMEM and retrieve the next instruction. The instruction gets stored in a pipeline register, also called the instruction register in this case.
- X: Use the ALU to compute the result or memory address, or to compare registers for a branch.
- M: Access data memory or an I/O device for a load or store. Allow for the setup time of the register file write.
Most details you will need to work out for yourself. Some details to follow. In particular, let's look at hazards.

3-stage Pipeline Data Hazard

    add $5, $3, $4    I  X  M       reg 5 value updated at the end of M
    add $7, $6, $5       I  X  M    reg 5 value needed in X

The fix: selectively forward the ALU result back to the input of the ALU.
- Need to add a mux at the input to the ALU, and add control logic to sense when to activate it. Check the book for details.
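The "control logic to sense when to activate" the forwarding mux can be sketched as a comparison between the register being read in X and the register being written by the instruction in M. This Python model is one plausible formulation, not the project's specified logic; the function and parameter names are hypothetical, and the register-zero check reflects that register 0 is constant zero in both MIPS and RISC-V.

```python
def forward_from_m(x_src_reg, m_dest_reg, m_reg_write):
    """Return True when the ALU input mux should select the forwarded result.

    x_src_reg   : register number read by the instruction in the X stage
    m_dest_reg  : register number written by the instruction in the M stage
    m_reg_write : True if the instruction in M actually writes a register
    """
    return bool(m_reg_write) and m_dest_reg != 0 and m_dest_reg == x_src_reg


# add $5, $3, $4 is in M while add $7, $6, $5 is in X
forward_a = forward_from_m(x_src_reg=6, m_dest_reg=5, m_reg_write=True)  # $6: no hazard
forward_b = forward_from_m(x_src_reg=5, m_dest_reg=5, m_reg_write=True)  # $5: forward
```

Each ALU operand needs its own comparison and its own mux, since either source register (or both) may match the destination of the previous instruction.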

3-stage Pipeline Load Hazard

    lw  $5, offset($4)    I  X  M       memory value known at the end of M; it is written into the regfile on that edge
    add $7, $6, $5           I  X  M    value needed in X

The fix: delay the dependent instruction by one cycle to allow the load to complete, then send the result of the load directly to the ALU.

    lw  $5, offset($4)    I  X    M
    add $7, $6, $5           I    nop  nop
    add $7, $6, $5                I    X    M
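The stall condition for the load hazard can be expressed as a check made while the dependent instruction is still in the I stage. The sketch below is an assumed formulation in Python with hypothetical names; it flags a stall when the instruction currently in X is a load whose destination is one of the source registers of the instruction about to leave I.

```python
def must_stall(i_src_regs, x_is_load, x_dest_reg):
    """Return True if the instruction in I must be held for one cycle.

    i_src_regs : source register numbers of the instruction in the I stage
    x_is_load  : True if the instruction in the X stage is a load
    x_dest_reg : destination register of the instruction in the X stage
    """
    return bool(x_is_load) and x_dest_reg != 0 and x_dest_reg in i_src_regs


# lw $5, offset($4) is in X while add $7, $6, $5 is in I
stall_dependent = must_stall(i_src_regs=(6, 5), x_is_load=True, x_dest_reg=5)
stall_independent = must_stall(i_src_regs=(6, 3), x_is_load=True, x_dest_reg=5)
```

When the condition is true, the PC and the instruction register hold their values and a nop is sent down the pipeline, which produces the bubble shown in the diagram.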

3-stage Pipeline Control Hazard

    beq $1, $2, L1        I  X  M             branch address ready at the end of X
    add $5, $3, $4           I  X  M          but needed here, at the next fetch
    add $6, $1, $2              I  X  M
    L1: sub $7, $6, $5             I  X

The fix: several possibilities*
1. Always delay the fetch of the instruction after a branch. Simple, but all branches now take 2 cycles (lowers performance).
2. Assume the branch is not taken, continue with the instruction at PC+4, and correct later if wrong. Simple, and only some branches take 2 cycles (better performance).
3. Predict branch taken or not taken based on history (state), and correct later if wrong. Complex, but very few branches take 2 cycles (best performance).

* MIPS defines a branch delay slot; RISC-V doesn't.

Predict not taken Control Hazard
The branch address is ready at the end of the X stage:
- If the branch is not taken, do nothing.
- If the branch is taken, kill the instruction in the I stage (about to enter the X stage) and fetch at the new target address (PC).

Branch not taken:

    bne $1, $1, L1        I  X  M
    add $5, $3, $4           I  X  M
    add $6, $1, $2              I  X  M
    L1: sub $7, $6, $5             I  X

Branch taken:

    beq $1, $1, L1        I  X    M
    add $5, $3, $4           I    nop  nop
    L1: sub $7, $6, $5            I    X    M
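The cost of the predict-not-taken scheme can be estimated by counting one bubble per taken branch. The Python sketch below assumes an idealized 3-stage pipeline in which taken branches are the only source of bubbles (no load stalls); `total_cycles` and the trace format are hypothetical, and the trace lists only the instructions that actually complete.

```python
def total_cycles(trace, stages=3):
    """Count cycles for a trace of completed instructions.

    trace  : list of (mnemonic, branch_taken) pairs, in execution order
    stages : pipeline depth; the last instruction needs stages - 1 extra
             cycles to drain after it is fetched
    Each taken branch kills the one instruction fetched behind it (1 bubble).
    """
    bubbles = sum(1 for _, taken in trace if taken)
    return len(trace) + (stages - 1) + bubbles


# Not-taken case: all four instructions flow through back to back
not_taken = [("bne", False), ("add", False), ("add", False), ("sub", False)]

# Taken case: the add fetched after beq is killed, so only beq and sub complete
taken = [("beq", True), ("sub", False)]

cycles_not_taken = total_cycles(not_taken)
cycles_taken = total_cycles(taken)
```

In the taken case the two completed instructions need 5 cycles instead of 4, matching the diagram where sub is fetched one cycle later than it would be without the killed add.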

EECS151 Project CPU Pipelining Summary
3-stage pipeline: I (instruction fetch), X (execute), M (access data memory).
Pipeline rules:
- Writes/reads to/from DMem are clocked on the leading edge of the clock in the M stage.
- Writes to RegFile use the trailing edge of the clock of the M stage: reg-file writes are 180 degrees out of phase.
- Instruction decode and register file access are up to you.
- Branch: predict not-taken.
- Load: 1-cycle delay/stall.
- Bypass ALU for data hazards.
More details in the upcoming spec.