ECE 4750 Computer Architecture Topic 2: From CISC to RISC

Christopher Batten
School of Electrical and Computer Engineering
Cornell University
http://www.csl.cornell.edu/courses/ece4750
slide revision: 2013-09-08-23-34

Motivating RISC Memory Basics Single-Cycle Unpipelined MIPS Processor Multi-Cycle Unpipelined MIPS Processor

CPI for Microcoded Machine

Inst 1: 7 cycles
Inst 2: 5 cycles
Inst 3: 10 cycles

Total clock cycles = 7 + 5 + 10 = 22
Total instructions = 3
Clocks per Instruction (CPI) = 22 / 3 = 7.33

CPI is always an average over a large number of instructions

ECE 4750 T02: From CISC to RISC 2 / 43
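The CPI arithmetic above can be checked with a few lines of Python. This is only a sketch of the averaging step; the function name `average_cpi` is a hypothetical helper, not something from the course material.

```python
# Cycle counts for the three example instructions on the slide.
cycles_per_inst = [7, 5, 10]


def average_cpi(cycle_counts):
    """Return total cycles divided by the number of instructions."""
    return sum(cycle_counts) / len(cycle_counts)


cpi = average_cpi(cycles_per_inst)
print(f"total cycles = {sum(cycles_per_inst)}, CPI = {cpi:.2f}")
```

With only three instructions the average is not very meaningful; in practice the same calculation is applied to a long dynamic instruction trace.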

Iron Law of Processor Performance

$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$

Instructions / program depends on source code, compiler, ISA
Cycles / instruction (CPI) depends on ISA, microarchitecture
Time / cycle depends on microarchitecture and implementation

Topic        Microarchitecture          CPI   Cycle Time
last topic   Microcoded                 >1    short
this topic   Single-Cycle Unpipelined   1     long
this topic   Multi-Cycle Unpipelined    >1    short
next topic   Pipelined                  1     short
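As a rough illustration of the Iron Law, the sketch below multiplies the three factors for two design points. The instruction count, CPI values, and cycle times are hypothetical numbers chosen only to show the trade-off between CPI and cycle time; they are not measurements from the slides.

```python
def execution_time(instructions, cpi, cycle_time_ns):
    """Iron Law: time/program = insts/program * cycles/inst * time/cycle."""
    return instructions * cpi * cycle_time_ns


# Hypothetical workload and design points (assumptions for illustration).
N = 1_000_000
single_cycle = execution_time(N, cpi=1.0, cycle_time_ns=10.0)   # CPI 1, long cycle
multi_cycle = execution_time(N, cpi=4.33, cycle_time_ns=2.5)    # CPI >1, short cycle

print(f"single-cycle: {single_cycle / 1e6:.2f} ms")
print(f"multi-cycle:  {multi_cycle / 1e6:.2f} ms")
```

Under these assumed numbers the two designs land close together, which is the point of the table: a lower CPI only helps if it is not cancelled by a longer cycle time.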

Agenda

Technology Trends Motivating RISC
Memory Basics
Single-Cycle Unpipelined MIPS Processor
Multi-Cycle Unpipelined MIPS Processor

Minicomputers in the 1970s

Implemented with racks of discrete components
Used microcode to implement CISC ISA
Applications in business, scientific, commercial computing

Extremely popular VAX-11/780 first available in 1977; often used as a baseline for benchmarking and assumed to have a speed of 1M instructions/second (1 MIPS): 5 MHz, TTL devices

Microprocessors in the 1970s

Microprocessors made possible by new integrated circuit technology
Constrained by what could fit on a single chip, leading to few-bit datapaths with hardwired control
Initial application was embedded control

First microprocessor is the Intel 4004, fabricated in 1971 and designed for a desktop printing calculator: 750 kHz, 8-16 cycles/inst, 8 µm PMOS, 2.3K transistors, 12 mm², microcoded control to implement CISC ISA

8-bit microprocessors used in hobbyist personal computers
  Micral, Altair, TRS-80, Apple II
  Usually had 16-bit address space (64 KB directly addressable)
  Simple BASIC interpreter in ROM or on cassette tape

DRAM in the 1970s

Dramatic progress in MOSFET memory technology
  1970: Intel introduces first DRAM (Model 1103 with 1 Kb)
  1979: Fujitsu introduces 64 Kb DRAM
By the mid-1970s it became obvious that microprocessors would soon have >64 KB of physical memory

VisiCalc as Killer App and Eventually the IBM PC

Analyzing Microcoded Machines

John Cocke and group at IBM
  Working on a simple pipelined processor, the 801, and advanced compilers
  Ported experimental PL8 compiler to IBM 370, using only simple register-register and load/store instructions similar to the 801
  Resulting code ran faster than code from other existing compilers that used all 370 instructions! (up to 6 MIPS, whereas 2 MIPS was considered good before)
Joel Emer and Douglas Clark at DEC
  Measured VAX-11/780 using external hardware
  Found it was actually a 0.5 MIPS machine, not a 1 MIPS machine
  20% of VAX instrs = 60% of µcode, but only 0.2% of the dynamic execution
VAX 8800, high-end VAX in 1984
  Control store: 16K × 147b RAM, Unified Cache: 64K × 8b RAM
  4.5× more microstore RAM than cache RAM!
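The VAX 8800 storage comparison can be worked out directly, reading the two sizes as 16K words of 147 bits and 64K words of 8 bits. The sketch assumes K = 1024; the ratio does not depend on that choice.

```python
K = 1024  # assumption: K means 1024 entries

control_store_bits = 16 * K * 147   # 16K words x 147 bits per word
cache_bits = 64 * K * 8             # 64K words x 8 bits per word
ratio = control_store_bits / cache_bits

print(f"control store: {control_store_bits} bits")
print(f"unified cache: {cache_bits} bits")
print(f"ratio: {ratio:.2f}x")
```

The ratio comes out slightly above 4.5, matching the figure quoted on the slide.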

From CISC to RISC

Key changes in technology constraints
  Logic, RAM, ROM all implemented with MOS transistors
  RAM same speed as ROM
Use fast RAM to build fast instruction cache of user-visible instructions, not fixed hardware microfragments
  Change contents of fast instruction memory to fit what the app needs
Use simple ISA to enable hardwired pipelined implementation
  Most compiled code only used a few of the CISC instructions
  Simpler encoding allowed pipelined implementations
  Load/Store Reg-Reg ISA as opposed to Mem-Mem ISA
Further benefit with integration
  Early 1980s: fit 32-bit datapath, small caches on a single chip
  No chip crossing in the common case allows faster operation

From CISC to RISC

Vertical µcode Controller    RISC Controller
µPC                          User PC
ROM for µinst                RAM for Instr Cache
Small Decoder                "Larger" Decoder

Berkeley RISC Chips

RISC-I, fabricated in 1982 under the direction of David Patterson, was probably the first VLSI RISC processor: 1 MHz, 5 µm NMOS, 44.5K transistors, 77 mm²

RISC-II was the 1983 follow-up with several improvements: 3 MHz, 3 µm NMOS, 40.7K transistors, 60 mm²

Stanford MIPS Chips

First MIPS prototype fabricated in 1984 under the direction of John Hennessy; MIPS-X was the 1986 follow-up: 5-stage, 20 MHz, 2 µm 2-layer CMOS

John Hennessy left Stanford to form MIPS Computer Systems; their first chip was the MIPS R2000 in 1986: 8-15 MHz, 2 µm 2-layer CMOS, 110K transistors, 80 mm²

MIPS vs. VAX

[Figure: chart of the ratio of MIPS to VAX (vertical axis 0.0 to 4.0) with three series, Performance Ratio, Instructions Executed Ratio, and CPI Ratio, across the benchmarks spice, matrix, nasa7, fpppp, tomcatv, doduc, espresso, eqntott, li]

2x more instr
6x lower CPI
2-4x higher perf

Source: H&P, Appendix J, from Bhandarkar and Clark, 1991
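The three annotations are linked by the Iron Law. The sketch below combines the instruction-count and CPI ratios into a speedup, assuming equal cycle times on the two machines (that equality is an assumption made here for illustration, and `speedup` is a hypothetical helper).

```python
def speedup(inst_ratio, cpi_ratio, cycle_time_ratio=1.0):
    """Speedup of machine B over machine A.

    Each ratio is B's value divided by A's value, so a smaller product
    means B finishes the program sooner.
    """
    return 1.0 / (inst_ratio * cpi_ratio * cycle_time_ratio)


# MIPS executes about 2x the instructions of VAX at about 1/6 the CPI.
mips_over_vax = speedup(inst_ratio=2.0, cpi_ratio=1.0 / 6.0)
print(f"estimated speedup: {mips_over_vax:.1f}x")
```

The estimate of roughly 3x sits inside the 2-4x performance range shown on the chart.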

CISC/RISC Convergence

[Figure: block diagram of the MIPS R10000, excerpted from "MIPS R10000 Uses Decoupled Architecture", Microprocessor Report, Vol. 8, No. 14, October 24, 1994. Figure 1 caption: The R10000 uses deep instruction queues to decouple the instruction fetch logic from the five function units.]

MIPS R10K uses sophisticated out-of-order engine; branch delay slot not useful (Gwennap, MPR, 1994)

Intel Nehalem frontend breaks x86 CISC into smaller RISC-like µops; µcode engine handles rarely used complex instr (Kanter, Real World Technologies, 2009)

Register File with Combinational Read

[Figure: single register built from n flip-flops with inputs D0 ... Dn-1, outputs Q0 ... Qn-1, and shared Clk and En]

[Figure: 2R+1W register file with inputs Clock, WE, ReadSel1 (rs1), ReadSel2 (rs2), WriteSel (ws), WriteData (wd) and outputs ReadData1 (rd1), ReadData2 (rd2)]

Register File Implementation

[Figure: register file with registers reg 0, reg 1, ..., reg 31; 5-bit selects rs1, rs2, ws; 32-bit write data wd; 32-bit read data rd1, rd2]

Register files with a large number of ports are difficult to implement
  Almost all MIPS instrs have exactly two register source operands
  Intel's Itanium general-purpose register file has 128 registers with 8 read ports and 4 write ports!
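A behavioral sketch of the 2R+1W register file may help make the port names concrete. It models reads as combinational and the write as happening on a clock edge when WE is set. The class and method names are hypothetical, and holding register 0 at zero follows the usual MIPS convention rather than anything stated on the slide.

```python
class RegisterFile:
    """2R+1W register file: two combinational read ports, one clocked write port."""

    def __init__(self, num_regs=32, width=32):
        self.regs = [0] * num_regs
        self.mask = (1 << width) - 1

    def read(self, rs1, rs2):
        # Combinational read: outputs reflect the current contents immediately.
        return self.regs[rs1], self.regs[rs2]

    def clock_edge(self, we, ws, wd):
        # The write takes effect only at the rising edge, and only if WE is set.
        # Assumption (MIPS convention): register 0 always reads as zero.
        if we and ws != 0:
            self.regs[ws] = wd & self.mask


rf = RegisterFile()
rf.clock_edge(we=True, ws=5, wd=42)    # write 42 into reg 5
rf.clock_edge(we=False, ws=6, wd=7)    # WE low, so reg 6 is unchanged
rd1, rd2 = rf.read(5, 6)
print(rd1, rd2)
```

Each extra read port would add another argument to `read`, which is trivial in software but costly in hardware, as the Itanium example suggests.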

Magic Memory Model

[Figure: MAGIC RAM with inputs Clock, WriteEnable, Address, WriteData and output ReadData]

Read is combinational
Write is performed at the rising clock edge if enabled
  Write address must be stable at the clock edge
Later we will consider using a more realistic memory

More Realistic Memory Model

[Figure: SRAM with inputs Clock, WriteEnable, Address, WriteData and output ReadData]

Synchronous operation
Read data ready next cycle
Read/write data buses share the same internal bit lines

[Figures: Simplified SRAM Read; Simplified SRAM Write]
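The difference between the two memory models can be sketched behaviorally. In this hypothetical model the magic memory returns read data in the same cycle, while the synchronous memory only updates its read output on a clock edge. Allowing either a read or a write per cycle, but not both, is an assumption made here to reflect the shared bit lines.

```python
class MagicMemory:
    """Combinational read, write on the clock edge if enabled."""

    def __init__(self, size):
        self.data = [0] * size

    def read(self, addr):
        return self.data[addr]          # available in the same cycle

    def clock_edge(self, we, addr, wdata):
        if we:
            self.data[addr] = wdata


class SyncMemory:
    """Synchronous SRAM model: read data appears after the next clock edge."""

    def __init__(self, size):
        self.data = [0] * size
        self.rdata = None               # output register, invalid until a read

    def clock_edge(self, we, addr, wdata=None):
        # Assumption: one access per cycle, either a write or a read.
        if we:
            self.data[addr] = wdata
        else:
            self.rdata = self.data[addr]


magic = MagicMemory(16)
magic.clock_edge(True, 3, 99)
magic_value = magic.read(3)             # no extra cycle needed

sram = SyncMemory(16)
sram.clock_edge(True, 3, 99)            # cycle 1: write
before = sram.rdata                     # nothing has been read yet
sram.clock_edge(False, 3)               # cycle 2: read request captured
after = sram.rdata                      # data is now visible
```

The extra cycle on the read path is one reason a realistic memory fits more naturally into a multi-cycle design than a single-cycle one.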

MIPS Instruction Formats

ALU
  bits    31:26   25:21  20:16  15:11  10:6  5:0
  field   0       rs     rt     rd     0     func
  R[rd] ← R[rs] func R[rt]

ALUI
  bits    31:26   25:21  20:16  15:0
  field   opcode  rs     rt     immediate
  R[rt] ← R[rs] op immediate

LD/ST
  bits    31:26   25:21  20:16  15:0
  field   opcode  rs     rt     offset
  LD: R[rt] ← M[ R[rs] + sext(offset) ]
  ST: M[ R[rs] + sext(offset) ] ← R[rt]

BEQZ
  bits    31:26   25:21  20:16  15:0
  field   opcode  rs     0      offset
  if ( R[rs] == 0 ) PC ← PC+4 + offset*4

JR/JALR
  bits    31:26   25:21  20:16  15:0
  field   opcode  rs     0      0
  PC ← R[rs]
  JALR also does R[31] ← PC+8

J/JAL
  bits    31:26   25:0
  field   opcode  target
  PC ← jtarg( PC, target )
  JAL also does R[31] ← PC+8
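The bit ranges in the formats above translate directly into shift-and-mask field extraction. The following is a minimal sketch; the helper names are hypothetical, and the example word uses opcode 9 for ADDIU, which comes from the standard MIPS32 encoding rather than from the slide.

```python
def bits(word, hi, lo):
    """Extract bits hi..lo (inclusive) of a 32-bit word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def sext16(value):
    """Sign-extend a 16-bit field to a Python int."""
    return value - 0x10000 if value & 0x8000 else value


def decode(inst):
    """Split an instruction word into every field any format might use."""
    return {
        "opcode": bits(inst, 31, 26),
        "rs": bits(inst, 25, 21),
        "rt": bits(inst, 20, 16),
        "rd": bits(inst, 15, 11),
        "func": bits(inst, 5, 0),
        "imm": sext16(bits(inst, 15, 0)),
        "target": bits(inst, 25, 0),
    }


def branch_target(pc, offset):
    """BEQZ target as written on the slide: PC+4 + offset*4."""
    return (pc + 4 + offset * 4) & 0xFFFFFFFF


# ADDIU with rt = 8, rs = 9, immediate = -4 (opcode 9 per MIPS32).
word = (9 << 26) | (9 << 21) | (8 << 16) | 0xFFFC
fields = decode(word)
print(fields["opcode"], fields["rs"], fields["rt"], fields["imm"])
```

Because rs and rt sit in the same bit positions in every format, the register file can be read before the opcode has been fully decoded.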

Instruction Execution Steps

1. Instruction fetch
2. Decode and register fetch
3. ALU operation
4. Memory operation if required
5. Register write-back if required

Also: computation of the next instruction to fetch

Datapath: Reg-Reg Instructions (ADDU)

[Figure: datapath with PC, 0x4 adder, Inst. Memory (addr, inst), GPRs (rs1, rs2, ws, wd, rd1, rd2), ALU (output z), and ALU Control (OpCode); instruction fields inst<25:21>, inst<20:16>, inst<15:11>, inst<5:0>; control signal RegWrite]

Datapath: ALUI Reg-Imm Instructions (ADDIU)

[Figure: datapath with PC, 0x4 adder, Inst. Memory (addr, inst), GPRs (rs1, rs2, ws, wd, rd1, rd2), Imm Ext, ALU (output z), and ALU Control (OpCode); instruction fields inst<25:21>, inst<20:16>, inst<15:0>, inst<31:26>; control signals RegWrite, ExtSel]

Address Conflicts in Merged Datapath with Muxes

[Figure: merged datapath with PC, 0x4 adder, Inst. Memory (addr, inst), GPRs (rs1, rs2, ws, wd, rd1, rd2), Imm Ext, ALU (output z), and ALU Control (OpCode); instruction fields inst<25:21>, inst<20:16>, inst<20:16>, inst<15:11>, inst<15:0>, inst<31:26>, inst<5:0>; control signals RegWrite, ExtSel]

Datapath: ALU and ALUI Instructions

[Figure: datapath with PC, 0x4 adder, Inst. Memory (addr, inst), GPRs (rs1, rs2, ws, wd, rd1, rd2), Imm Ext, ALU (output z), and ALU Control (OpCode); instruction fields <25:21>, <20:16>, <15:11>, <15:0>, <31:26>, <5:0>; control signals RegWrite, RegDst (rt / rd), ExtSel, OpSel, BSrc (Reg / Imm)]

Approach for Program and Data Memory

Harvard-style: separate program and data memories
  Inspired by Howard Aiken and the Mark I
  Read-only program memory
  Read/write data memory
  Need some way to load program memory
Princeton-style: unified program and data memories
  Inspired by von Neumann
  Single read/write memory for both
  Load/store instructions require accessing memory twice during execution
Most modern machines are mixed, with separate instruction and data caches but a unified main memory that holds both the program and data

Datapath: Load Instructions (LW)

[Figure: datapath with PC, 0x4 adder, Inst. Memory (addr, inst), GPRs (rs1, rs2, ws, wd, rd1, rd2), Imm Ext, ALU (output z), ALU Control (OpCode), and Data Memory (addr, rdata, wdata); instruction fields rs, offset; control signals RegWrite, MemWrite, WBSrc (ALU / Mem), RegDst, ExtSel, OpSel, BSrc]

Datapath: Store Instructions (SW)

[Figure: datapath with PC, 0x4 adder, Inst. Memory (addr, inst), GPRs (rs1, rs2, ws, wd, rd1, rd2), Imm Ext, ALU (output z), ALU Control (OpCode), and Data Memory (addr, rdata, wdata); instruction fields rs, offset; control signals RegWrite, MemWrite, WBSrc (ALU / Mem), RegDst, ExtSel, OpSel, BSrc]

Datapath: Conditional Branches (BEQZ)

[Figure: datapath with PC, 0x4 adder, branch-target adder, Inst. Memory (addr, inst), GPRs (rs1, rs2, ws, wd, rd1, rd2), Imm Ext, ALU (outputs z and zero?), ALU Control (OpCode), and Data Memory (addr, rdata, wdata); PCSrc selects between pc+4 and br; control signals RegWrite, MemWrite, WBSrc, RegDst, ExtSel, OpSel, BSrc]

Datapath: Register-Indirect Jumps (JR)

[Figure: datapath with PC, 0x4 adder, branch-target adder, Inst. Memory (addr, inst), GPRs (rs1, rs2, ws, wd, rd1, rd2), Imm Ext, ALU (outputs z and zero?), ALU Control (OpCode), and Data Memory (addr, rdata, wdata); PCSrc selects between pc+4, br, and rind; control signals RegWrite, MemWrite, WBSrc, RegDst, ExtSel, OpSel, BSrc]

Datapath: Register-Indirect Jump-&-Link (JALR)

[Figure: datapath with PC, 0x4 adder, branch-target adder, Inst. Memory (addr, inst), GPRs (rs1, rs2, ws, wd, rd1, rd2) with constant 31 as an additional write-select input, Imm Ext, ALU (outputs z and zero?), ALU Control (OpCode), and Data Memory (addr, rdata, wdata); PCSrc selects between pc+4, br, and rind; control signals RegWrite, MemWrite, WBSrc, RegDst, ExtSel, OpSel, BSrc]

Datapath: Absolute Jump-&-Link (J, JAL)

[Figure: datapath with PC, 0x4 adder, branch-target adder, Inst. Memory (addr, inst), GPRs (rs1, rs2, ws, wd, rd1, rd2) with constant 31 as an additional write-select input, Imm Ext, ALU (outputs z and zero?), ALU Control (OpCode), and Data Memory (addr, rdata, wdata); PCSrc selects between pc+4, br, rind, and jabs; control signals RegWrite, MemWrite, WBSrc, RegDst, ExtSel, OpSel, BSrc]

Final Harvard Style Datapath for MIPS

[Figure: complete datapath with PC, 0x4 adder, branch-target adder, Inst. Memory (addr, inst), GPRs (rs1, rs2, ws, wd, rd1, rd2) with constant 31 as an additional write-select input, Imm Ext, ALU (outputs z and zero?), ALU Control (OpCode), and Data Memory (addr, rdata, wdata); PCSrc selects between pc+4, br, rind, and jabs; control signals RegWrite, MemWrite, WBSrc, RegDst, ExtSel, OpSel, BSrc]
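One way to see how the final datapath behaves is a small behavioral model in which every instruction completes in one call to `step`. This is only a sketch under several assumptions: instructions are pre-decoded tuples `(op, a, b, c)` instead of 32-bit words, branch delay slots are ignored, and all names are hypothetical. The separate `imem` and `dmem` dictionaries mirror the Harvard-style split.

```python
def step(state, imem, dmem):
    """Execute one instruction in a single 'cycle'."""
    pc, regs = state["pc"], state["regs"]
    op, a, b, c = imem[pc]                  # instruction fetch
    next_pc = pc + 4                        # PCSrc = pc+4 unless overridden
    if op == "addu":                        # a = rd, b = rs, c = rt
        regs[a] = (regs[b] + regs[c]) & 0xFFFFFFFF
    elif op == "addiu":                     # a = rt, b = rs, c = immediate
        regs[a] = (regs[b] + c) & 0xFFFFFFFF
    elif op == "lw":                        # a = rt, b = rs, c = offset
        regs[a] = dmem.get(regs[b] + c, 0)
    elif op == "sw":                        # a = rt, b = rs, c = offset
        dmem[regs[b] + c] = regs[a]
    elif op == "beqz":                      # a = rs, b = offset
        if regs[a] == 0:
            next_pc = pc + 4 + b * 4        # PCSrc = br
    elif op == "j":                         # a = absolute target
        next_pc = a                         # PCSrc = jabs
    regs[0] = 0                             # register 0 stays zero
    state["pc"] = next_pc                   # PC updates at the clock edge


# Sum 3 + 2 + 1 into r2, then store the result at address 64.
imem = {
    0: ("addiu", 1, 0, 3),      # r1 = 3 (counter)
    4: ("addiu", 2, 0, 0),      # r2 = 0 (sum)
    8: ("beqz", 1, 3, 0),       # if r1 == 0 go to 8 + 4 + 3*4 = 24
    12: ("addu", 2, 2, 1),      # r2 = r2 + r1
    16: ("addiu", 1, 1, -1),    # r1 = r1 - 1
    20: ("j", 8, 0, 0),         # back to the loop test
    24: ("sw", 2, 0, 64),       # M[r0 + 64] = r2
}
dmem = {}
state = {"pc": 0, "regs": [0] * 32}
cycles = 0
while state["pc"] in imem:
    step(state, imem, dmem)
    cycles += 1
print(state["regs"][2], dmem[64], cycles)
```

Since every instruction takes exactly one call, the number of loop iterations equals the dynamic instruction count, which is what CPI = 1 means for this design.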

Hardwired Controller is Pure Combinational Logic

[Figure: combinational logic block with inputs op code and zero?, and outputs ExtSel, BSrc, OpSel, MemWrite, WBSrc, RegDst, RegWrite, PCSrc]

[Figure: decode map driven by Inst<31:26> (Opcode) and Inst<5:0> (Func), producing the ALU op]

OpSel ( Func, Op, +, 0? )
ExtSel ( sext16, uext16, High16 )

Hardwired Control Table

Opcode     ExtSel   BSrc  OpSel  MemW  RegW  WBSrc  RegDst  PCSrc
ALU        *        Reg   Func   no    yes   ALU    rd      pc+4
ALUi       sext16   Imm   Op     no    yes   ALU    rt      pc+4
ALUiu      uext16   Imm   Op     no    yes   ALU    rt      pc+4
LW         sext16   Imm   +      no    yes   Mem    rt      pc+4
SW         sext16   Imm   +      yes   no    *      *       pc+4
BEQZ z=0   sext16   *     0?     no    no    *      *       br
BEQZ z=1   sext16   *     0?     no    no    *      *       pc+4
J          *        *     *      no    no    *      *       jabs
JAL        *        *     *      no    yes   PC     R31     jabs
JR         *        *     *      no    no    *      *       rind
JALR       *        *     *      no    yes   PC     R31     rind

BSrc = Reg / Imm
WBSrc = ALU / Mem / PC
RegDst = rt / rd / R31
PCSrc = pc+4 / br / rind / jabs

Source: CS152, Spring 2010 (January 26, 2010)
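Because the controller is pure combinational logic, the control table above can be sketched as a lookup with no stored state. The version below is an illustration, not the course's implementation: `None` stands for a don't-care entry, the ALU row labels follow one reading of the table, and `taken` is a hypothetical flag meaning the BEQZ condition R[rs] == 0 holds.

```python
FIELDS = ("ExtSel", "BSrc", "OpSel", "MemW", "RegW", "WBSrc", "RegDst", "PCSrc")

# One row per opcode class; None stands for "*" (don't care).
TABLE = {
    "ALU":   (None,     "Reg", "Func", False, True,  "ALU", "rd",  "pc+4"),
    "ALUi":  ("sext16", "Imm", "Op",   False, True,  "ALU", "rt",  "pc+4"),
    "ALUiu": ("uext16", "Imm", "Op",   False, True,  "ALU", "rt",  "pc+4"),
    "LW":    ("sext16", "Imm", "+",    False, True,  "Mem", "rt",  "pc+4"),
    "SW":    ("sext16", "Imm", "+",    True,  False, None,  None,  "pc+4"),
    "BEQZ":  ("sext16", None,  "0?",   False, False, None,  None,  None),
    "J":     (None,     None,  None,   False, False, None,  None,  "jabs"),
    "JAL":   (None,     None,  None,   False, True,  "PC",  "R31", "jabs"),
    "JR":    (None,     None,  None,   False, False, None,  None,  "rind"),
    "JALR":  (None,     None,  None,   False, True,  "PC",  "R31", "rind"),
}


def control(opcode, taken=False):
    """Return the control signals for one instruction as a dict."""
    signals = dict(zip(FIELDS, TABLE[opcode]))
    if opcode == "BEQZ":
        # PCSrc is the only signal that depends on the zero test.
        signals["PCSrc"] = "br" if taken else "pc+4"
    return signals


print(control("LW"))
print(control("BEQZ", taken=True)["PCSrc"])
```

Only the BEQZ row needs the zero test as an extra input; every other output is a function of the opcode alone.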

Single-Cycle Hardwired Control

Requires that the clock period is sufficiently long so that all of the following steps can be completed:

1. Instruction fetch
2. Decode and register fetch
3. ALU operation
4. Data read or data store if required
5. Register write-back setup time if required

$$t_c > t_{ifetch} + t_{rfrd} + t_{ALU} + t_{dmem} + t_{rfwr}$$

At the rising edge of the clock, the PC, the register file, and the memory are updated
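The single-cycle constraint is a sum over the five steps. The sketch below evaluates it for a set of hypothetical component delays in nanoseconds; none of these numbers come from the slides.

```python
# Hypothetical component delays in ns (assumptions for illustration only).
single_cycle_delays = {
    "ifetch": 2.0,   # instruction fetch
    "rfrd": 1.0,     # decode and register file read
    "alu": 2.0,      # ALU operation
    "dmem": 2.0,     # data memory access
    "rfwr": 1.0,     # register write-back setup
}

# The clock period must exceed the sum of all steps on the longest path.
min_single_cycle_period = sum(single_cycle_delays.values())
print(f"t_c must exceed {min_single_cycle_period} ns")
```

Every instruction pays this full period, even one that skips the memory step.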

Multi-Cycle Unpipelined Datapath

[Figure: multi-cycle datapath with PC, 0x4 adder, Inst. Memory (addr, rdata), IR, GPRs (rs1, rs2, ws, wd, rd1, rd2), Imm Ext, ALU, and Data Memory (addr, rdata, wdata), divided into fetch phase, decode & reg-fetch phase, execute phase, memory phase, and write-back phase]

Clock period is reduced by dividing the execution of an instruction into multiple cycles; this also allows for more realistic synchronous memory

$$t_c > \max(t_{ifetch}, t_{rf}, t_{ALU}, t_{dmem}, t_{rfwr})$$

CPI will of course be greater than one
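For the multi-cycle design the clock period is bounded by the slowest phase rather than by the sum of all phases. The sketch below uses hypothetical phase delays and an illustrative three-instruction mix to estimate the average time per instruction; all numbers are assumptions.

```python
# Hypothetical phase delays in ns (assumptions for illustration only).
phase_delays = {"ifetch": 2.0, "rf": 1.0, "alu": 2.0, "dmem": 2.0, "rfwr": 1.0}

# The clock period must exceed the slowest single phase.
min_multi_cycle_period = max(phase_delays.values())

# Illustrative cycle counts for three instructions.
cycles_per_inst = [5, 3, 5]
cpi = sum(cycles_per_inst) / len(cycles_per_inst)
avg_time_per_inst = cpi * min_multi_cycle_period

print(f"t_c must exceed {min_multi_cycle_period} ns")
print(f"CPI = {cpi:.2f}, average time per instruction = {avg_time_per_inst:.2f} ns")
```

With these assumed numbers the shorter cycle is roughly offset by the higher CPI, so the benefit of the multi-cycle design depends on how unbalanced the phases are and on how many instructions can skip phases.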

Multi-Cycle Unpipelined Controller

[Figure: multicycle PARCv1 state diagram, from ECE 4750 Computer Architecture, Fall 2011, Lab 2: Multicycle PARCv2 Processor (Figure 2, Appendix)]

Summary

Microcoding less attractive due to evolving technology constraints
Unpipelined µarch is a first step towards the RISC design philosophy
Iron Law of processor performance helps explain the design space

Microcoded (CPI = 7.33): Inst 1 7 cycles, Inst 2 5 cycles, Inst 3 10 cycles
Single-Cycle Unpipelined (CPI = 1): Inst 1 1 cycle, Inst 2 1 cycle, Inst 3 1 cycle
Multi-Cycle Unpipelined (CPI = 4.33): Inst 1 5 cycles, Inst 2 3 cycles, Inst 3 5 cycles

Topic        Microarchitecture          CPI   Cycle Time
last topic   Microcoded                 >1    short
this topic   Single-Cycle Unpipelined   1     long
this topic   Multi-Cycle Unpipelined    >1    short
next topic   Pipelined                  1     short

Acknowledgements

Some of these slides contain material developed and copyrighted by:
Arvind (MIT), Krste Asanović (MIT/UCB), Joel Emer (Intel/MIT), James Hoe (CMU), John Kubiatowicz (UCB), David Patterson (UCB)

MIT material derived from course 6.823
UCB material derived from courses CS152 and CS252