Lecture 08: RISC-V Single-Cycle Implementa9on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu www.secs.oakland.edu/~yan 1
Acknowledgements The notes cover Appendix C of the textbook, but we use RISC-V instead of MIPS ISA Slides for general RISC ISA implementalon are adapted from Lecture slides for Computer OrganizaLon and Design, FiRh EdiLon: The Hardware/SoRware Interface textbook for general RISC ISA implementalon Slides for RISC-V single-cycle implementalon are adapted from Computer Science 152: Computer Architecture and Engineering, Spring 2016 by Dr. George Michelogiannakis from UC Berkeley 2
Introduc9on CPU performance factors CPU Time = Instructions InstrucLon count Program * Cycles Instruction *Time Cycle Determined by ISA and compiler CPI and Cycle Lme Determined by CPU hardware Simple subset, shows most aspects Memory reference: lw, sw ArithmeLc/logical: add, sub, and, or, slt Control transfer: beq, j
Components of a Computer Processor Control Enable? Read/Write Memory Input Datapath PC Address Program Bytes Registers Write Data ArithmeLc & Logic Unit (ALU) Read Data Data Output Processor-Memory Interface I/O-Memory Interfaces 4
The CPU Processor (CPU): the aclve part of the computer that does all the work (data manipulalon and decisionmaking) Datapath: porlon of the processor that contains hardware necessary to perform operalons required by the processor (the brawn) Control : porlon of the processor (also in hardware) that tells the datapath what needs to be done (the brain) 5
Datapath and Control Datapath designed to support data transfers required by instruclons Controller causes correct transfers to happen PC instruction memory rd rs rt registers ALU Data memory +4 imm opcode, funct Controller 6
Logic Design Basics Information encoded in binary Low voltage = 0, High voltage = 1 One wire per bit Multi-bit data encoded on multi-wire buses Combinational circuit Operate on data Output is a function of input State (sequential) circuit Store information
Combinational Circuits AND-gate Y = A & B n Adder n Y = A + B A B + Y A B I0 I1 M u x S Y n MulLplexer n Y = S? I1 : I0 Y n ArithmeLc/Logic Unit n Y = F(A, B) A ALU Y B F Chapter 4 The Processor 8
Sequential Circuits Register: stores data in a circuit Uses a clock signal to determine when to update the stored value Edge-triggered: update when Clk changes from 0 to 1 D Clk Q Clk D Q
Edge-Triggered D Flip Flops D Q Value of D is sampled on posi9ve clock edge. Q outputs sampled value for rest of cycle. CLK D Q
Sequential Circuits Register with write control Only updates on clock edge when write control input is 1 Used when stored value is required later Clk D Write Clk Q Write D Q
Clocking Methodology Combinational logic transforms data during clock cycles Between clock edges Input from state elements, output to state element Longest delay determines clock period
Single cycle data paths Processor uses synchronous logic design (a clock ). f! T! 1 MHz! 1 μs! 10 MHz! 100 ns! 100 MHz! 10 ns! 1 GHz! 1 ns! All state elements act like posilve edge-triggered flip flops. D Q Reset? clk
Hardware Elements of CPU CombinaLonal circuits Mux, Decoder, ALU,... A 0 A 1 A n-1. Sel Mux lg(n) O A lg(n) Decoder. O 0 O 1 O n-1 A B OpSelect - Add, Sub,... - And, Or, Xor, Not,... - GT, LT, EQ, Zero,... ALU Result Comp? Synchronous state elements Flipflop, Register, Register file, SRAM, DRAM En Clk D ff Q Clk En D Q Edge-triggered: Data is sampled at the rising edge
Reads are combinalonal Register Files D 0 register D 1 D 2... D n-1 En Clk ff ff ff... ff Q 0 Q 1 Q 2... Q n-1 Clock WE ReadSel1 ReadSel2 WriteSel WriteData rs1 rs2 ws wd we Register file 2R+1W rd1 rd2 ReadData1 ReadData2 15
Register File Implementa9on RISC-V integer instruclons have at most 2 register source operands rd clk wdata rdata1 rdata2 5 32 32 32 reg 0 rs1 5 rs2 5 we reg 1 reg 31 16
A Simple Memory Model WriteEnable Clock Address WriteData MAGIC RAM ReadData Reads and writes are always completed in one cycle a Read can be done any Lme (i.e. combinalonal) a Write is performed at the rising clock edge if it is enabled => the write address and data must be stable at the clock edge Later in the course we will present a more realis:c model of memory 17
Five Stages of Instruc9on Execu9on Stage 1: InstrucLon Fetch Stage 2: InstrucLon Decode Stage 3: ALU (ArithmeLc-Logic Unit) Stage 4: Memory Access Stage 5: Register Write 18
Stages of Execu9on on Datapath PC instruction memory rd rs rt registers ALU Data memory +4 imm 1. InstrucLon Fetch 2. Decode/ Register Read 3. Execute 4. Memory 5. Register Write 19
Stages of Execu9on (1/5) There is a wide variety of instruclons: so what general steps do they have in common? Stage 1: InstrucLon Fetch The 32-bit instruclon word must first be fetched from memory the cache-memory hierarchy also, this is where we Increment PC PC = PC + 4, to point to the next instruclon: byte addressing so + 4 20
Stages of Execu9on (2/5) Stage 2: InstrucLon Decode: gather data from the fields (decode all necessary instruclon data) 1. read the opcode to determine instruclon type and field lengths 2. read in data from all necessary registers for add, read two registers for addi, read one register for jal, no reads necessary 21
Stages of Execu9on (3/5) Stage 3: ALU (ArithmeLc-Logic Unit): the real work of most instruclons is done here AL operalons: arithmelc (+, -, *, /), shiring, logic (&, ), comparisons (slt) loads and stores lw $t0, 40($t1) the address we are accessing in memory = the value in $t1 PLUS the value 40 AddiLon is done in this stage CondiLonal branch Comparison is done in this stage (one solulon) 22
Stages of Execu9on (4/5) Stage 4: Memory Access: only load and store instruclons the others remain idle during this stage or skip it all together since these instruclons have a unique step, we need this extra stage to account for them as a result of the cache system, this stage is expected to be fast 23
Stages of Execu9on (5/5) Stage 5: Register Write most instruclons write the result of some computalon into a register examples: arithmelc, logical, shirs, loads, slt what about stores, branches, jumps? don t write anything into a register at the end these remain idle during this firh stage or skip it all together 24
Stages of Execu9on on Datapath PC instruction memory rd rs rt registers ALU Data memory +4 imm 1. InstrucLon Fetch 2. Decode/ Register Read 3. Execute 4. Memory 5. Register Write 25
Instruction Execution PC instruction memory, fetch instruction Register numbers register file, read registers Depending on instruction class Use ALU to calculate Arithmetic result Memory address for load/store Branch condition and target address Access data memory for load/store PC target address or PC + 4
CPU Components
Multiplexers n Can t just join wires together n Use mullplexers
Control Signals
Building a Datapath Datapath Elements that process data and addresses in the CPU Registers, ALUs, mux s, memories, We will build a RISCV datapath incrementally Refining the overview design
Instruction Fetch 32-bit register Increment by 4 for next instruclon
R-Format Instructions Read two register operands Perform arithmetic/logical operation Write register result
Load/Store Instructions Read register operands Calculate address using 12-bit offset Use ALU, but sign-extend offset Load: Read memory and update register Store: Write register value to memory
Branch Instructions Read register operands Compare operands Use ALU, subtract and check Zero output Calculate target address Sign-extend displacement Shift left 2 places (word displacement) Add to PC + 4 Already calculated by instruction fetch
Branch Instructions Just re-routes wires Sign-bit wire replicated
Composing the Elements First-cut data path does an instruction in one clock cycle Each datapath element can only do one function at a time Hence, we need separate instruction and data memories Use multiplexers where alternate data sources are used for different instructions
R-Type/Load/Store Datapath
Full Datapath
ALU Control ALU used for Load/Store: F = add Branch: F = subtract R-type: F depends on funct field ALU control Function 0000 AND 0001 OR 0010 add 0110 subtract 0111 set-on-less-than 1100 NOR
ALU Control Assume 2-bit ALUOp derived from opcode Combinational logic derives ALU control opcode ALUOp Operation funct ALU function ALU control lw 00 load word XXXXXX add 0010 sw 00 store word XXXXXX add 0010 beq 01 branch equal XXXXXX subtract 0110 R-type 10 add 100000 add 0010 subtract 100010 subtract 0110 AND 100100 AND 0000 OR 100101 OR 0001 set-on-less-than 101010 set-on-less-than 0111
Datapath: Reg-Reg ALU Instruc9ons 0x4 Add RegWriteEn clk PC clk addr inst Inst. Memory Inst<19:15> Inst<24:20> Inst<11:7> we rs1 rs2 rd1 wa wd rd2 GPRs ALU Inst<14:12> ALU Control OpCode RegWrite Timing? 7 5 5 3 5 7 func7 rs2 rs1 func3 rd opcode rd (rs1) func (rs2) 31 25 24 20 19 15 14 12 11 7 6 0 41
Datapath: Reg-Imm ALU Instruc9ons 0x4 Add RegWriteEn clk PC clk addr inst Inst. Memory Inst<19:15> Inst<11:7> Inst<31:20> Inst<14:12> we rs1 rs2 rd1 wa wd rd2 GPRs Imm Select ALU Control ALU OpCode ImmSel 12 5 3 5 7 immediate12 rs1 func3 rd opcode rd (rs1) op immediate 31 20 19 15 14 12 11 7 6 0 42
Conflicts in Merging Datapath PC clk 0x4 Add addr inst Inst. Memory Inst<19:15> Inst<24:20> Inst<11:7> Inst<31:20> Inst<14:12> RegWrite clk we rs1 rs2 rd1 wa wd rd2 GPRs Imm Select ALU Control ALU Introduce muxes OpCode ImmSel 7 5 5 3 5 7 func7 rs2 rs1 func3 rd opcode rd (rs1) func (rs2) immediate12 rs1 func3 rd opcode rd (rs1) op immediate 31 20 19 15 14 12 11 7 6 0 43
Datapath for ALU Instruc9ons 0x4 Add RegWriteEn clk PC clk addr inst Inst. Memory <19:15> <24:20> <11:7> Inst<31:20> we rs1 rs2 rd1 wa wd rd2 GPRs Imm Select ALU <14:12> ALU Control <6:0> OpCode ImmSel Op2Sel Reg / Imm 7 5 5 3 5 7 func7 rs2 rs1 func3 rd opcode rd (rs1) func (rs2) immediate12 rs1 func3 rd opcode rd (rs1) op immediate 31 20 19 15 14 12 11 7 6 0 44
Load/Store Instruc9ons PC clk 0x4 Add addr inst Inst. Memory base disp RegWriteEn clk we rs1 rs2 rd1 wa wd rd2 GPRs Imm Select ALU Control ALU MemWrite clk we addr rdata Data Memory wdata WBSel ALU / Mem OpCode ImmSel Op2Sel 7 5 5 3 5 7 imm rs2 rs1 func3 imm opcode Store (rs1) + displacement immediate12 rs1 func3 rd opcode Load 31 20 19 15 14 12 11 7 6 0 rs1 is the base register rd is the destination of a Load, rs2 is the data source for a Store 45
RISC-V Condi9onal Branches 7 imm 31 25 5 rs2 24 20 5 rs1 19 15 3 func3 14 12 5 imm 11 7 7 opcode 6 0 BEQ/BNE BLT/BGE BLTU/BGEU Compare two integer registers for equality (BEQ/BNE) or signed magnitude (BLT/BGE) or unsigned magnitude (BLTU/ BGEU) 12-bit immediate encodes branch target address as a signed offset from PC, in units of 16-bits (i.e., shir ler by 1 then add to PC). 46
Condi9onal Branches (BEQ/BNE/BLT/BGE/BLTU/BGEU) PCSel br RegWrEn MemWrite WBSel pc+4 0x4 Add Add clk PC clk addr inst Inst. Memory we rs1 rs2 rd1 wa wd rd2 GPRs Imm Select Br Logic ALU Control Bcomp? ALU clk we addr rdata Data Memory wdata OpCode ImmSel Op2Sel 47
Full RISCV1Stage Datapath PC JumpReg TargGen rs1 rs2 Branch CondGen br_eq? br_lt? br_ltu? RISC-V Sodor 1-Stage pc+4 jalr branch jump exception pc_sel PC +4 addr val Instruction Mem PC+4 data Inst ir[31:25], ir[11:7] ir[31:12] ir[31:20] ir[31:20] ir[31:12] Branch TargGen Jump TargGen IType Sign Extend SType Sign Extend UType PC Op2Sel ALU co-processor (CSR) registers wb_sel ir[11:7] wa rf_wen en ir[24:20] ir[19:15] addr Reg File data rs2 rs1 wd Reg File Op1Sel AluFun Decoder Control Signals rs2 rdata addr wdata Data Mem Note: for simplicity, the CSR File (control and status registers) and associated datapath is not shown Execute Stage mem_rw mem_val 48
Hardwired Control is pure Combina9onal Logic op code Equal? combinalonal logic ImmSel Op2Sel FuncSel MemWrite WBSel WASel RegWriteEn PCSel 49
ALU Control & Immediate Extension Inst<14:12> (Func3) Inst<6:0> (Opcode) + 0? ALUop FuncSel ( Func, Op, +, 0? ) Decode Map ImmSel ( IType 12, SType 12, UType 20 ) 50
Hardwired Control Table Opcode ImmSel Op2Sel FuncSel MemWr RFWen WBSel WASel PCSel ALU ALUi LW SW BEQ true BEQ false J JAL JALR * Reg Func no yes ALU rd IType 12 Imm Op no yes ALU rd pc+4 IType 12 Imm + no yes Mem rd pc+4 SType 12 Imm + yes no * * pc+4 SBType 12 * * no no * * SBType 12 * * no no * * * * * no no * * pc+4 pc+4 jabs * * * no yes PC X1 jabs * * * no yes PC rd rind br Op2Sel= Reg / Imm WBSel = ALU / Mem / PC WASel = rd / X1 PCSel = pc+4 / br / rind / jabs 51
Single-Cycle Hardwired Control clock period is sufficiently long for all of the following steps to be completed : 1. InstrucLon fetch 2. Decode and register fetch 3. ALU operalon 4. Data fetch if required 5. Register write-back setup Lme => tc > tifetch + trfetch + talu+ tdmem+ trwb At the rising edge of the following clock, the PC, register file and memory are updated 52
Implementa9on in Real Load-Store RISC ISAs designed for efficient pipelined implementalons Inspired by earlier Cray machines (CDC 6600/7600) RISC-V ISA implemented using Chisel hardware construclon language Chisel: h}ps://chisel.eecs.berkeley.edu/ Ge~ng started: h}ps://chisel.eecs.berkeley.edu/2.2.0/ge~ng-started.html Check resource page for slides and other info 53
Chisel in one slides Module IO Wire Reg Mem 54
UCB RISC-V Sodor h}ps://github.com/ucb-bar/riscv-sodor Single-cycle: h}ps://github.com/ucb-bar/riscv-sodor/tree/master/src/ rv32_1stage 55