EC 513 Computer Architecture

Similar documents
Lecture 4 - Pipelining

Simple Instruction Pipelining

Lecture 08: RISC-V Single-Cycle Implementa9on CSE 564 Computer Architecture Summer 2017

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 4 Reduced Instruction Set Computers

Non-Pipelined Processors

Lecture 3 - From CISC to RISC

Programmable Machines

CS 152 Computer Architecture and Engineering. Lecture 3 - From CISC to RISC

Programmable Machines

Lecture 3 - From CISC to RISC

Lecture 6 Datapath and Controller

Lecture 7 Pipelining. Peng Liu.

Chapter 4. The Processor. Computer Architecture and IC Design Lab

CS 152 Computer Architecture and Engineering. Lecture 3 - From CISC to RISC

CS 152 Computer Architecture and Engineering. Lecture 3 - From CISC to RISC

Processor (I) - datapath & control. Hwansoo Han

CSCI-564 Advanced Computer Architecture

Full Datapath. CSCI 402: Computer Architectures. The Processor (2) 3/21/19. Fengguang Song Department of Computer & Information Science IUPUI

CS3350B Computer Architecture Quiz 3 March 15, 2018

Chapter 4. The Processor

CS 152, Spring 2011 Section 2

ece4750-tinyrv-isa.txt

Full Name: NetID: Midterm Summer 2017

ISA and RISCV. CASS 2018 Lavanya Ramapantulu

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

EECS150 - Digital Design Lecture 10- CPU Microarchitecture. Processor Microarchitecture Introduction

The Processor (1) Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Instruction Set Architecture. "Speaking with the computer"

EECS 151/251A Fall 2017 Digital Design and Integrated Circuits. Instructor: John Wawrzynek and Nicholas Weaver. Lecture 13 EE141

The Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

C 1. Last Time. CSE 490/590 Computer Architecture. ISAs and MIPS. Instruction Set Architecture (ISA) ISA to Microarchitecture Mapping

CSCI 402: Computer Architectures. Fengguang Song Department of Computer & Information Science IUPUI. Today s Content

ECE 4750 Computer Architecture Topic 2: From CISC to RISC

Chapter 4. The Processor

EECS150 - Digital Design Lecture 9- CPU Microarchitecture. Watson: Jeopardy-playing Computer

Lecture 4: Review of MIPS. Instruction formats, impl. of control and datapath, pipelined impl.

The Processor: Datapath & Control

Stored Program Concept. Instructions: Characteristics of Instruction Set. Architecture Specification. Example of multiple operands

COMPUTER ORGANIZATION AND DESIGN. The Hardware/Software Interface. Chapter 4. The Processor: A Based on P&H

Chapter 4 The Processor 1. Chapter 4A. The Processor

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

CENG 3420 Lecture 06: Datapath

Lecture 3: Single Cycle Microarchitecture. James C. Hoe Department of ECE Carnegie Mellon University

COMP303 - Computer Architecture Lecture 8. Designing a Single Cycle Datapath

ECE 486/586. Computer Architecture. Lecture # 7

Systems Architecture

The Processor: Datapath and Control. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Lecture Topics. Announcements. Today: Single-Cycle Processors (P&H ) Next: continued. Milestone #3 (due 2/9) Milestone #4 (due 2/23)

CENG 3420 Computer Organization and Design. Lecture 06: MIPS Processor - I. Bei Yu

ECE232: Hardware Organization and Design

are Softw Instruction Set Architecture Microarchitecture are rdw

CPE 335 Computer Organization. Basic MIPS Architecture Part I

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 4: Datapath and Control

Instruction Set Architecture (ISA)

Lecture 12: Single-Cycle Control Unit. Spring 2018 Jason Tang

Review: Abstract Implementation View

Lecture 2: RISC V Instruction Set Architecture. James C. Hoe Department of ECE Carnegie Mellon University

Computer Architecture

Chapter 3 MIPS Assembly Language. Ó1998 Morgan Kaufmann Publishers 1

ECE170 Computer Architecture. Single Cycle Control. Review: 3b: Add & Subtract. Review: 3e: Store Operations. Review: 3d: Load Operations

CS 152 Computer Architecture and Engineering. Lecture 3 - From CISC to RISC. Last Time in Lecture 2

CENG3420 Lecture 03 Review

Major CPU Design Steps

Lecture 10: Simple Data Path

ELEC / Computer Architecture and Design Fall 2013 Instruction Set Architecture (Chapter 2)

Chapter 4. The Processor

Reminder: tutorials start next week!

CSE 378 Midterm 2/12/10 Sample Solution

Lecture 2: RISC V Instruction Set Architecture. Housekeeping

COMPUTER ORGANIZATION AND DESIGN

The MIPS Processor Datapath

Computer Architecture 计算机体系结构. Lecture 2. Instruction Set Architecture 第二讲 指令集架构. Chao Li, PhD. 李超博士

Improving Performance: Pipelining

CSE 141 Computer Architecture Spring Lecture 3 Instruction Set Architecute. Course Schedule. Announcements

Instructions: MIPS arithmetic. MIPS arithmetic. Chapter 3 : MIPS Downloaded from:

Chapter 4. The Processor Designing the datapath

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017!

Operating Systems and Interrupts/Exceptions

Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Pipeline Overview. Dr. Jiang Li. Adapted from the slides provided by the authors. Jiang Li, Ph.D. Department of Computer Science

CO Computer Architecture and Programming Languages CAPL. Lecture 18 & 19

CS/COE1541: Introduction to Computer Architecture

CS31001 COMPUTER ORGANIZATION AND ARCHITECTURE. Debdeep Mukhopadhyay, CSE, IIT Kharagpur. Instructions and Addressing

Outline. EEL-4713 Computer Architecture Designing a Single Cycle Datapath

Design of the MIPS Processor

CS/COE0447: Computer Organization

CS/COE0447: Computer Organization

ECE 4750 Computer Architecture, Fall 2014 T01 Single-Cycle Processors

Lecture 09: RISC-V Pipeline Implementa8on CSE 564 Computer Architecture Summer 2017

The Big Picture: Where are We Now? EEM 486: Computer Architecture. Lecture 3. Designing a Single Cycle Datapath

CPE 335. Basic MIPS Architecture Part II

EC 413 Computer Organization

Pipeline design. Mehran Rezaei

361 datapath.1. Computer Architecture EECS 361 Lecture 8: Designing a Single Cycle Datapath

Initial Representation Finite State Diagram. Logic Representation Logic Equations

Chapter 4. The Processor. Instruction count Determined by ISA and compiler. We will examine two MIPS implementations

Processor Architecture


Transcription:

EC 513 Computer Architecture Single-cycle ISA Implementation Prof. Michel A. Kinsy

Computer System View Processor Applications Compiler Firmware ISA Memory organization Digital Design Circuit Design Operating System Datapath & Control Layout I/O system

Role of the Instruction Set Software Hardware Instruction Set

Instruction Set Architecture (ISA) Instructions are the language the computer understand Instruction Set is the vocabulary of that language It serves as the hardware/software interface Defines instruction format (bit encoding) Number of explicit operands per instruction Operand location Number of bits per instruction Instruction length: fixed, short, long, or variable., Examples: MIPS, Alpha, x86, IBM 360, VAX, ARM, JVM

Instruction Set Architecture (ISA) Many possible implementations of the same ISA 360 implementations: model 30 (c. 1964), z900 (c. 2001) x86 implementations: 8086 (c. 1978), 80186, 286, 386, 486, Pentium, Pentium Pro, Pentium-4, Core i7, AMD Athlon, AMD Opteron, Transmeta Crusoe, SoftPC MIPS implementations: R2000, R4000, R10000,... JVM: HotSpot, PicoJava, ARM Jazelle,...

ISA and Performance Instructions per program depends on source code, compiler technology and ISA Cycles per instructions (CPI) depends upon the ISA and the microarchitecture Time per cycle depends upon the microarchitecture and the base technology Time = Instructions Cycles Time Program Program * Instruction * Cycle

Processor Performance Instructions per program depends on source code, compiler technology and ISA Cycles per instructions (CPI) depends upon the ISA and the microarchitecture Time per cycle depends upon the microarchitecture and the base technology Microarchitecture CPI cycle time Microcoded >1 short Single-cycle unpipelined 1 long Pipelined 1 short

Brief Overview of the RISC-V ISA A new, open, free ISA from Berkeley Several variants RV32, RV64, RV128 Different data widths I Base Integer instructions M Multiply and Divide A Atomic memory instructions F and D Single and Double precision floating point V Vector extension And many other modular extensions We will focus on the RV32I the base 32-bit variant

RV32I Register State 32 general purpose registers (GPR) x0, x1,, x31 32-bit wide integer registers x0 is hard-wired to zero Program counter (PC) 32-bit wide CSR (Control and Status Registers) User-mode cycle (clock cycles) // read only instret (instruction counts) // read only Machine-mode hartid (hardware thread ID) // read only mepc, mcause etc. used for exception handling Custom mtohost (output to host) // write only custom extension

Instruction Types Register-to-Register Arithmetic and Logical operations Control Instructions alter the sequential control flow Memory Instructions move data to and from memory CSR Instructions move data between CSRs and GPRs; the instructions often perform read-modifywrite operations on CSRs Privileged Instructions are needed by the operating systems, and most cannot be executed by user programs

R-type instruction Instruction Formats 7 5 5 3 5 7 funct7 rs2 rs1 funct3 rd opcode I-type instruction & I-immediate (32 bits) 12 5 3 5 7 imm[11:0] rs1 funct3 rd opcode I-imm = signextend(inst[31:20]) S-type instruction & S-immediate (32 bits) 7 5 5 3 5 7 imm[11:5] rs2 rs1 funct3 imm[4:0] opcode S-imm = signextend({inst[31:25], inst[11:7]})

Datapath: Reg-Reg ALU Instructions 0x4 Add RegWrite clk PC clk addr inst Inst. Memory inst<19:15> inst<24:20> inst<11:7> we rs1 rs2 rd1 ws wd rd2 GPRs ALU inst<14:12> ALU Control OpCode 7 5 5 3 5 7 funct7 rs2 rs1 funct3 rd opcode rd ß(rs) func (rt)

Datapath: Reg-Imm ALU Instructions PC clk 0x4 Add addr inst Inst. Memory inst<19:15> inst<11:7> inst<31:20> inst<14:12> RegWrite clk we rs1 rs2 rd1 ws wd rd2 GPRs Imm Select ALU Control ALU ImmSel OpCode 12 5 3 5 7 imm[11:0] rs1 funct3 rd opcode rt ß (rs) op immediate

Conflicts in Merging Datapath PC clk 0x4 Add addr inst Inst. Memory inst<19:15> inst<24:20> inst<15:11> inst<31:20> inst<14:12> RegWrite clk we rs1 rs2 rd1 ws wd rd2 GPRs Imm Select ALU Control ALU ImmSel OpCode 7 5 5 3 5 7 funct7 rs2 rs1 funct3 rd opcode imm[11:0] rs1 funct3 rd opcode rd ß(rs) func (rt) rt ß (rs) op immediate

Conflicts in Merging Datapath PC clk 0x4 Add addr inst Inst. Memory inst<19:15> inst<24:20> inst<15:11> inst<31:20> inst<14:12> RegWrite clk we rs1 rs2 rd1 ws wd rd2 GPRs Imm Select ALU Control ALU ImmSel Op2Sel Reg / Imm OpCode 7 5 5 3 5 7 funct7 rs2 rs1 funct3 rd opcode imm[11:0] rs1 funct3 rd opcode rd ß(rs) func (rt) rt ß (rs) op immediate

Instruction Formats SB-type instruction & B-immediate (32 bits) 1 6 5 5 3 4 1 7 imm[12] imm[10:5] rs2 rs1 funct3 imm[4:1] imm[11] opcode B-imm = signextend({inst[31], inst[7], inst[30:25], inst[11:8], 1 b0}) U-type instruction & U-immediate (32 bits) 20 imm[31:12] U-imm = signextend({inst[31:12], 12 b0}) UJ-type instruction & J-immediate (32 bits) imm[20] imm[10:1] imm[11] imm[19:12] rd J-imm = signextend({inst[31], inst[19:12], inst[20], inst[30:21], 1 b0}) 5 7 rd opcode 1 10 1 8 5 7 opcode

Computational Instructions Register-Register instructions (R-type) 7 5 5 3 5 7 funct7 rs2 rs1 funct3 rd opcode opcode=op: rd ß rs1 (funct3, funct7) rs2 funct3 = SLT/SLTU/AND/OR/XOR/SLL funct3= ADD funct7 = 0000000: rs1 + rs2 funct7 = 0100000: rs1 rs2 funct3 = SRL funct7 = 0000000: logical shift right funct7 = 0100000: arithmetic shift right

Computational Instructions Register-immediate instructions (I-type) 12 5 3 5 7 imm[11:0] rs1 funct3 rd opcode opcode = OP-IMM: rd ß rs1 (funct3) I-imm I-imm = signextend(inst[31:20]) funct3 = ADDI/SLTI/SLTIU/ANDI/ORI/XORI A slight variant in coding for shift instructions - SLLI / SRLI / SRAI rd ß rs1 (funct3, inst[30]) I-imm[4:0]

Computational Instructions Register-immediate instructions (U-type) 20 imm[31:12] 5 7 rd opcode opcode = LUI : rd ß U-imm opcode = AUIPC : rd ß pc + U-imm U-imm = {inst[31:12], 12 b0}

Control Instructions Unconditional jump and link (UJ-type) 1 10 1 8 5 7 imm[20] imm[10:1] imm[11] imm[19:12] rd opcode = JAL: rd ß pc + 4; pc ß pc + J-imm J-imm = signextend({inst[31], inst[19:12], inst[20], inst[30:21], 1 b0}) Jump ±1MB range Unconditional jump via register and link (I-type) 12 5 3 5 7 imm[11:0] rs1 funct3 rd opcode opcode = JALR: rd ß pc + 4; pc ß (rs1 + I-imm) & ~0x01 I-imm = signextend(inst[31:20]) opcode

Control Instructions Conditional branches (SB-type) 1 6 5 5 3 4 1 7 imm[12] imm[10:5] rs2 rs1 funct3 imm[4:1] imm[11] opcode opcode = BRANCH: pc ß compare(funct3, rs1, rs2)? pc + B-imm : pc + 4 B-imm = signextend({inst[31], inst[7], inst[30:25], inst[11:8], 1 b0}) Jump ±4KB range funct3 = BEQ/BNE/BLT/BLTU/BGE/BGEU

Conditional Branches (BEQ/BNE/BLT/BGE/BLTU/BGEU) PCSel br RegWrite MemWrite WBSel pc+4 0x4 Add clk Add PC clk addr inst Inst. Memory we rs1 rs2 rd1 ws wd rd2 GPRs Imm Select ALU Control Br Logic ALU clk we addr rdata Data Memory wdata OpCode ImmSel Op2Sel

Load & Store Instructions Load (I-type) 12 5 3 5 7 imm[11:0] rs1 funct3 rd opcode opcode = LOAD: rd ß mem[rs1 + I-imm] I-imm = signextend(inst[31:20]) funct3 = LW/LB/LBU/LH/LHU Store (S-type) 7 5 5 3 5 7 imm[11:5] rs2 rs1 funct3 imm[4:0] opcode opcode = STORE: mem[rs1 + S-imm] ß rs2 S-imm = signextend({inst[31:25], inst[11:7]}) funct3 = SW/SB/SH

Datapath for Memory Instructions Should program and data memory be separate? Harvard style: separate (Aiken and Mark 1 influence) read-only program memory read/write data memory Princeton style: the same (von Neumann s influence) single read/write memory for program and data Executing a Load or Store instruction requires accessing the memory more than once

Harvard Architecture RegWrite MemWrite PC clk 0x4 Add addr inst Inst. Memory base disp clk we rs1 rs2 rd1 ws wd rd2 GPRs Imm Select ALU clk we addr rdata Data Memory wdata WBSel ALU / Mem ALU Control OpCode ImmSel Op2Sel 7 5 5 3 5 7 imm rs2 rs1 funct3 imm opcode imm[11:0] rs1 funct3 rd opcode Store (rs) + displacement Load

Hardwired Control Hardwired Control is pure Combinational Logic ImmSel Op2Sel op code equal? Combinational Logic FuncSel MemWrite WBSel RegDst RegWrite PCSel

ALU Control & Immediate Extension Inst<14:12> (Func3) Inst<6:0> (Opcode) + 0? ALUop FuncSel ( Func, Op, +, 0? ) Decode Map ImmSel ( IType 12,SType 12, UType 20 )

Hardwired Control Table Opcode ImmSel Op2Sel FuncSel MemWr RFWen WBSel WASel PCSel ALU ALUi LW SW BEQ true BEQ false J * Reg Func no yes ALU rd IType 12 Imm Op no yes ALU rd pc+4 IType 12 Imm + no yes Mem rd pc+4 SType 12 Imm + yes no * * pc+4 SBType 12 * * no no * * SBType 12 * * no no * * * * * no no * * pc+4 pc+4 jabs JAL * * * no yes PC X1 jabs JALR * * * no yes PC rd rind br

Single-Cycle Hardwired Control Harvard architecture: we will assume that clock period is sufficiently long for all of and the following steps to be completed : 1. instruction fetch 2. decode and register fetch 3. ALU operation 4. data fetch if required 5. register write-back setup time t C > t IFetch + t RFetch + t ALU + t DMem + t RWB

Princeton Microarchitecture PCen PCSel RegWrite MemWrite WBSel 0x4 Add clk Add PC clk clk IR 31 we rs1 rs2 rd1 ws wd rd2 GPRs Imm Select ALU Control Br Logic ALU clk we addr rdata Data Memory wdata Fetch phase IRen OpCode RegDst ImmSel Op2Sel AddrSel

Two-State Controller In the Princeton Microarchitecture, a flipflop can be used to remember the phase fetch phase AddrSel=PC IRen=on PCen=off Wen=off execute phase AddrSel=ALU IRen=off PCen=on Wen=on

Hardwired Controller IR op code equal? old combinational logic (Harvard)... ImmSel, Op2Sel, FuncSel, MemWrite, WBSel, RegDst, RegWrite, PCSel MemWrite RegWrite Wen S 1-bit Toggle FF I-fetch / Execute new combinational logic Pcen Iren AddrSel

Princeton architecture Clock Period t C-Princeton > max {t M, t RF + t ALU + t M + t WB } t C-Princeton > t RF + t ALU + t M + t WB while in the hardwired Harvard architecture t C-Harvard > t M + t RF + t ALU + t M + t WB which will execute instructions faster?

Clock Rate vs CPI Suppose t M >> t RF + t ALU + t WB t C-Princeton = 0.5 * t C-Harvard CPI Princeton = 2 CPI Harvard = 1 No difference in performance!

Princeton Microarchitecture Can we overlap instruction fetch and execute? 0x4 PC Add we addr rdata Memory wdata IR we rs1 rs2 rd1 ws wd rd2 GPRs Imm Select ALU we addr rdata Memory wdata fetch phase execute phase

Princeton Microarchitecture Only one of the phases is active in any cycle a lot of datapath is not in use at any given time PC 0x4 Add we addr rdata Memory wdata IR The same (mux not shown) we rs1 rs2 rd1 ws wd rd2 GPRs Imm Select ALU we addr rdata Memory wdata fetch phase execute phase

Stalling the instruction fetch PC 0x4 Add we addr rdata Memory wdata fetch phase nop stall? IR we rs1 rs2 rd1 ws wdrd2 GPRs Imm Select execute phase When stall condition is indicated don t fetch a new instruction and don t change the PC insert a nop in the IR set the Memory Address mux to ALU (not shown) ALU we addr rdata Memory wdata

Pipelined Princeton Architecture Clock: t C-Princeton > t RF + t ALU + t M CPI: (1- f) + 2f cycles per instruction where f is the fraction of instructions that cause a stall

Instructions to Read and Write CSR 12 5 3 5 7 csr rs1 funct3 rd opcode opcode = SYSTEM CSRW rs1, csr (funct3 = CSRRW, rd = x0): csr ß rs1 CSRR csr, rd (funct3 = CSRRS, rs1 = x0): rd ß csr

GCD in C // require: x >= 0 && y > 0 int gcd(int a, int b) { int t; while(a!= 0) { if(a >= b) { a = a - b; } else { t = a; a = b; b = t; } } return b; }

GCD in RISC-V Assembler // a: x1, b: x2, t: x3 begin: beqz x1, done // if(x1 == 0) goto done blt x1, x2, bigger // if(x1 < x2) goto b_bigger sub x1, x1, x2 j begin bigger: mv x3, x1 mv x1, x2 mv x2, x3 j begin done: // x1 := x1 - x2 // goto begin // x3 := x1 // x1 := x2 // x2 := x3 // goto begin // now x2 contains the gcd

GCD in RISC-V Assembler // a: a0, b: a1, t: t0 gcd: beqz a0, done blt a0, a1, bigger sub a0, a0, a1 j gcd bigger: mv t0, a0 mv a0, a1 mv a1, t0 j gcd done: // now a1 contains the gcd mv a0, a1 // move to a0 for returning ret // jr ra

Exception handling When an exception is caused Hardware saves the information about the exception in CSRs: mepc exception PC mcause cause of the exception mstatus.mpp privilege mode of exception Processor jumps to the address of the trap handler (stored in the mtvec CSR) and increases the privilege level An exception handler, a software program, takes over and performs the necessary action

Software for interrupt handling Hardware transfers control to the common software interrupt handler (CH) which: 1. Saves all GPRs into the memory pointed by mscratch 2. Passes mcause, mepc, stack pointer to the IH (a C function) to handle the specific interrupt 3. On the return from the IH, writes the return value to mepc CH 4. Loads all GPRs from the memory 1 5. Execute ERET, which does: 2 set pc to mepc pop mstatus (mode, enable) stack 3 4 5 GPR IH IH IH

Single Cycle RISC-V Processor

Single Cycle RISC-V Processor

Single Cycle RISC-V Processor

Single Cycle RISC-V Processor

Single Cycle RISC-V Processor

Next Class Instruction Pipelining and Hazards