
Lecture 3: General Purpose Processor Design
CSCE 665 Advanced VLSI Systems
Instructor: Saraju P. Mohanty, Ph. D.
NOTE: The figures, text etc included in slides are borrowed from various books, websites, authors pages, and other sources for academic purpose only. The instructor does not claim any originality.

Lecture Outline
- General Purpose Processor
- Program Execution
- Construction of a Simple MIPS Processor
- Single Cycle Processor
- Multicycle Processor
- Pipelined Processor

Levels of Representation
- High Level Language Program:
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
- Compiler
- Assembly Language Program:
    lw $5, 0($2)
    lw $6, 4($2)
    sw $6, 0($2)
    sw $5, 4($2)
- Assembler
- Machine Language Program
- Machine Interpretation
- Control Signal Specification:
    ALUOP[0:3] <= InstReg[9:11] & MASK

Execution Cycle
- Instruction Fetch: Obtain instruction from program storage
- Instruction Decode: Determine required actions and instruction size
- Operand Fetch: Locate and obtain operand data
- Execute: Compute result value or status
- Result Store: Deposit results in storage for later use
- Next Instruction: Determine successor instruction

The Processor: Datapath & Control
Let us look at an implementation of the MIPS, simplified to contain only:
- memory-reference instructions: lw, sw
- arithmetic-logical instructions: add, sub, and, or, slt
- control flow instructions: beq, j
Generic Implementation:
- use the program counter (PC) to supply the instruction address
- get the instruction from memory
- read registers
- use the instruction to decide exactly what to do

More Implementation Details
Abstract / Simplified View:
[Figure: PC, instruction memory (Address, Instruction), Registers (three Register # inputs, Data), ALU, data memory (Address, Data)]
Two types of functional units:
- elements that operate on data values (combinational)
- elements that contain state (sequential)

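Register File: Read Operation
Built using D flip-flops; the read path (multiplexors that select a register) is combinational in nature
[Figure: Read register number 1 and Read register number 2 drive two multiplexors over Register 0, Register 1, ..., Register n-1, Register n, producing Read data 1 and Read data 2]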

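Register File: Write Operation
We still use the real clock to determine when to write
[Figure: the Register number feeds a decoder; each decoder output is combined with the Write signal to drive the C input of Register 0, Register 1, ..., Register n-1, Register n; Register data feeds every D input]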

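Register File: Block Diagram
[Figure: Register file with inputs Read register number 1, Read register number 2, Write register, Write data, Write; outputs Read data 1, Read data 2]
- Three address ports
- One data input port
- Two data output ports
- One control signal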
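The read and write behaviour described on the register file slides can be sketched in software. This is a minimal illustrative model, not the lecture's design: the class name `RegisterFile`, the method names, and the choice of 32 registers of 32 bits are assumptions made for the example.

```python
class RegisterFile:
    """Toy register file: two read ports, one write port (hypothetical model)."""

    def __init__(self, num_regs=32):
        # Assumption: 32 registers, each holding a 32-bit value.
        self.regs = [0] * num_regs

    def read(self, read_reg1, read_reg2):
        # Reads are combinational: the outputs simply follow the
        # two register numbers, with no clock involved.
        return self.regs[read_reg1], self.regs[read_reg2]

    def clock_edge(self, write_reg, write_data, write):
        # State changes only on a clock edge, and only when the
        # Write control signal is asserted.
        if write:
            self.regs[write_reg] = write_data & 0xFFFFFFFF


rf = RegisterFile()
rf.clock_edge(write_reg=8, write_data=42, write=True)
rf.clock_edge(write_reg=9, write_data=7, write=False)  # Write deasserted: ignored
print(rf.read(8, 9))
```

The three address ports map to `read_reg1`, `read_reg2`, and `write_reg`; the single control signal is `write`.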

Functional Units - I
a: Instruction Memory: after an instruction address is put, the instruction residing at that address appears at the output port
b: Program Counter: a simple upcounter
c: Adder: a 2's complement adder
[Figure: a. Instruction memory (Instruction address in, Instruction out); b. Program counter (PC); c. Adder (Add, Sum)]

Functional Units - II
a: Register File: its construction, read, and write operations are as discussed previously
b: ALU (Arithmetic & Logic Unit): recall the design we have discussed in the last two classes. Note the Zero output
[Figure: a. Register File (5-bit register numbers: Read register 1, Read register 2, Write register; Write data; Read data 1, Read data 2; RegWrite); b. ALU (3-bit ALU control, Zero, ALU result)]

Functional Units - III
a: Data Memory Unit: similar to the Instruction Memory Unit, only that it can be written into as well
- Two input ports, for address and write data, and one output port (for read data out)
- Two control signals: for read and write operations
b: Sign Extension Unit: extends a 16-bit input operand to 32 bits
[Figure: a. Data memory unit (Address, Write data, Read data, MemRead, MemWrite); b. Sign-extension unit (16 in, 32 out)]
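As a small illustration of what the sign extension unit does, the sketch below replicates bit 15 of a 16-bit field into the upper 16 bits of a 32-bit word. The function name `sign_extend16` is hypothetical and chosen only for this example.

```python
def sign_extend16(value):
    """Extend a 16-bit two's complement field to a 32-bit bit pattern."""
    value &= 0xFFFF              # keep only the low 16 bits
    if value & 0x8000:           # sign bit set: fill the upper 16 bits with 1s
        return value | 0xFFFF0000
    return value                 # sign bit clear: upper 16 bits stay 0


print(hex(sign_extend16(0x0004)))  # 0x4
print(hex(sign_extend16(0xFFFC)))  # 0xfffffffc
```

The second call shows a negative offset (-4) keeping its value when widened to 32 bits.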

Datapath for Instruction Fetch (Piece I)
Fetching instructions and incrementing the program counter by 4
[Figure: PC feeds the Read address of the Instruction memory and an adder that adds 4]

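Datapath for R-type Instructions (Piece II)
[Figure: Registers (Read register 1, Read register 2, Write register, Write data, Read data 1, Read data 2, RegWrite) feeding the ALU (3-bit ALU operation, Zero, ALU result)]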

Datapath for Load/Store (Piece III)
Datapath for load or store:
(1) Register access
(2) Memory address calculation
(3) Read/write memory
(4) Write into the register file (if the instruction is a load)
[Figure: Registers, ALU (3-bit ALU operation, Zero, ALU result), Data memory (Address, Write data, Read data, MemRead, MemWrite), Sign extend (16 to 32)]

Datapath for Branch (Piece IV)
- The Shift left 2 unit adds 00 at the low-order end of the sign-extended offset
- Control logic decides whether the incremented PC or the branch target should replace the PC, based on the Zero output of the ALU
[Figure: PC + 4 from the instruction datapath and the sign-extended offset shifted left by 2 feed an adder whose Sum is the Branch target; Registers and ALU (3-bit ALU operation) produce Zero, which goes to the branch control logic]
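The branch piece of the datapath can be traced with a few lines of arithmetic. This is a minimal sketch under stated assumptions: 32-bit addresses, a 16-bit offset field, and hypothetical helper names (`sign_extend16`, `next_pc`) that do not come from the slides.

```python
def sign_extend16(value):
    """Extend a 16-bit two's complement field to 32 bits."""
    value &= 0xFFFF
    return value | 0xFFFF0000 if value & 0x8000 else value


def next_pc(pc, offset16, zero, branch=True):
    """Choose between PC + 4 and the branch target, as the control logic does."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    # Shift left 2 appends 00 to the sign-extended offset (word to byte offset).
    shifted = (sign_extend16(offset16) << 2) & 0xFFFFFFFF
    branch_target = (pc_plus_4 + shifted) & 0xFFFFFFFF
    # The branch is taken only for a branch instruction whose ALU result is zero.
    return branch_target if (branch and zero) else pc_plus_4


print(hex(next_pc(0x1000, 3, zero=True)))   # taken: 0x1004 + 12
print(hex(next_pc(0x1000, 3, zero=False)))  # not taken: PC + 4
```

An offset of -1 (0xFFFF) with Zero asserted branches back to the branch instruction itself, since the offset is relative to PC + 4.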

Datapath Construction Strategy
- Now we have pieces of datapath that are capable of performing distinct functions
- We want to stitch them together to yield a final datapath that can execute all the instructions (lw, sw, add, sub, and, or, slt, beq, j)
- We will use multiplexors (or muxes for short) for stitching the datapath

Datapath Construction (Merge Pieces II & III)
Piece II + Piece III
[Figure: the R-type datapath (Registers, ALU) shown next to the load/store datapath (Registers, Sign extend, ALU, Data memory)]

And you will get..
Rule: Whenever we have more than one input feeding a functional unit, introduce a multiplexor (this gives rise to a control signal, more later..)
[Figure: merged datapath with Registers, Sign extend, ALU, Data memory, and two multiplexors controlled by ALUSrc and MemtoReg]

Datapath Construction contd (merge Piece I)
Just tack the Instruction Fetch and PC increment logic at the front!
[Figure: PC, Instruction memory, and the PC + 4 adder placed in front of the merged datapath with the ALUSrc and MemtoReg multiplexors]

Datapath Construction (Finally, merge Piece IV)
[Figure: the previous datapath plus Shift left 2, the branch target adder, and a multiplexor controlled by PCSrc that selects the next PC]

Final Datapath
- Data flows through various paths under the influence of control signals
- There are seven control signals (of type Read, Write, or Mux Select)
[Figure: complete single cycle datapath. Instruction [31-0] is split into fields [25-21], [20-16], [15-11], [15-0], and [5-0]; control points shown are PCSrc, RegWrite, RegDst, ALUSrc, ALUOp (into ALU control), MemRead, MemWrite, and MemtoReg]

Defining the Control..
- Selecting the operations to perform (ALU, read/write, etc.)
- Controlling the flow of data (multiplexor inputs)
- Information comes from the 32 bits of the instruction
- Example: add $8, $7, $8
  Instruction Format: op rs rt rd shamt funct
- The ALU's operation is based on instruction type and function code
We will design two control units:
(1) ALU Control, to generate appropriate function select signals for the ALU
(2) Main Control, to generate signals for functional units other than the ALU
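To make the instruction format concrete, the sketch below splits a 32-bit R-format word into its six fields. It assumes the standard MIPS field widths (6, 5, 5, 5, 5, 6 bits) and the standard encoding of add (op = 0, funct = 0x20); the function name `decode_r_type` is hypothetical.

```python
def decode_r_type(word):
    """Split a 32-bit R-format instruction into its fields."""
    return {
        "op":    (word >> 26) & 0x3F,  # bits 31-26
        "rs":    (word >> 21) & 0x1F,  # bits 25-21
        "rt":    (word >> 16) & 0x1F,  # bits 20-16
        "rd":    (word >> 11) & 0x1F,  # bits 15-11
        "shamt": (word >> 6) & 0x1F,   # bits 10-6
        "funct": word & 0x3F,          # bits 5-0
    }


# add $8, $7, $8  ->  rd = 8, rs = 7, rt = 8
word = (0 << 26) | (7 << 21) | (8 << 16) | (8 << 11) | (0 << 6) | 0x20
fields = decode_r_type(word)
print(hex(word), fields)
```

The op field goes to the main control, the funct field to the ALU control, and rs, rt, rd to the register file ports.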

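ALU Control - Truth Table & Implementation
Describe it using a truth table (can turn into gates):

  ALUOp1 ALUOp0 | F5 F4 F3 F2 F1 F0 | Operation
    0      0    | X  X  X  X  X  X  |   010
    X      1    | X  X  X  X  X  X  |   110
    1      X    | X  X  0  0  0  0  |   010
    1      X    | X  X  0  0  1  0  |   110
    1      X    | X  X  0  1  0  0  |   000
    1      X    | X  X  0  1  0  1  |   001
    1      X    | X  X  1  0  1  0  |   111

[Figure: ALU control block with inputs ALUOp1, ALUOp0 and F3, F2, F1, F0 (from F (5-0)); outputs Operation2, Operation1, Operation0]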
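The ALU control truth table can also be written as a small function. This is a sketch that assumes the usual textbook encoding for this MIPS subset: ALUOp 00 means add (lw/sw), ALUOp X1 means subtract (beq), ALUOp 1X means "look at the funct field", and the 3-bit operation codes are 000 and, 001 or, 010 add, 110 subtract, 111 set on less than. The name `alu_control` is hypothetical.

```python
# Low four funct bits (F3..F0) -> 3-bit ALU operation, for R-type instructions.
# Assumption: standard funct codes add=0x20, sub=0x22, and=0x24, or=0x25, slt=0x2A.
FUNCT_TO_OPERATION = {
    0b0000: 0b010,  # add
    0b0010: 0b110,  # subtract
    0b0100: 0b000,  # and
    0b0101: 0b001,  # or
    0b1010: 0b111,  # set on less than
}


def alu_control(alu_op, funct):
    """Return the 3-bit ALU operation for a 2-bit ALUOp and 6-bit funct."""
    if alu_op == 0b00:          # lw / sw: add to form the address
        return 0b010
    if alu_op & 0b01:           # beq: subtract and test Zero
        return 0b110
    # ALUOp = 1X: R-type, F5 and F4 are don't-cares
    return FUNCT_TO_OPERATION[funct & 0b1111]


print(format(alu_control(0b10, 0x2A), "03b"))
```

Only F3..F0 are examined because F5 and F4 are don't-cares for every R-type row.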

Designing the Main Control
[Figure: single cycle datapath with the Control unit added. Instruction [31-26] feeds Control, which drives RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite; Instruction [5-0] feeds ALU control]

Control Signals and their Effects

RegDst
  Deasserted: the register destination number for the Write register comes from the rt field (bits 20-16)
  Asserted: the register destination number for the Write register comes from the rd field (bits 15-11)
RegWrite
  Deasserted: none
  Asserted: the register on the Write register input is written with the value on the Write data input
ALUSrc
  Deasserted: the second ALU operand comes from the second register file output (Read data 2)
  Asserted: the second ALU operand is the sign-extended lower 16 bits of the instruction
PCSrc
  Deasserted: the PC is replaced by the output of the adder that computes the value of PC + 4
  Asserted: the PC is replaced by the output of the adder that computes the branch target
MemRead
  Deasserted: none
  Asserted: data memory contents designated by the address input are put on the Read data output
MemWrite
  Deasserted: none
  Asserted: data memory contents designated by the address input are replaced by the value on the Write data input
MemtoReg
  Deasserted: the value fed to the register Write data input comes from the ALU
  Asserted: the value fed to the register Write data input comes from the data memory

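Main Control: Truth Table & Implementation

  Instruction | RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp1 ALUOp0
  R-format    |   1      0       0        1        0       0       0      1      0
  lw          |   0      1       1        1        1       0       0      0      0
  sw          |   X      1       X        0        0       1       0      0      0
  beq         |   X      0       X        0        0       0       1      0      1

[Figure: implementation with inputs Op5, Op4, Op3, Op2, Op1, Op0 decoded into R-format, lw, sw, beq; outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0]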
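The main control is a pure function of the 6-bit opcode, so it can be sketched as a lookup table. The values below assume the usual textbook settings for this MIPS subset and the standard opcodes (R-format 0, lw 35, sw 43, beq 4); the names `Control`, `MAIN_CONTROL`, and `main_control` are hypothetical, and `None` stands for a don't-care (X).

```python
from collections import namedtuple

Control = namedtuple(
    "Control",
    "reg_dst alu_src mem_to_reg reg_write mem_read mem_write branch alu_op",
)

X = None  # don't-care

# Assumption: standard MIPS opcodes and the usual single cycle control values.
MAIN_CONTROL = {
    0b000000: Control(1, 0, 0, 1, 0, 0, 0, 0b10),  # R-format
    0b100011: Control(0, 1, 1, 1, 1, 0, 0, 0b00),  # lw
    0b101011: Control(X, 1, X, 0, 0, 1, 0, 0b00),  # sw
    0b000100: Control(X, 0, X, 0, 0, 0, 1, 0b01),  # beq
}


def main_control(opcode):
    """Look up the control signals for a 6-bit opcode."""
    return MAIN_CONTROL[opcode]


lw = main_control(0b100011)
print(lw)
```

In hardware the same table becomes a handful of AND gates that recognise each opcode, ORed together per output.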

Our Simple Control Structure
- All of the logic is combinational
- We wait for everything to settle down, and the right thing to be done:
  - the ALU might not produce the "right answer" right away
  - we use write signals along with the clock to determine when to write
- Cycle time is determined by the length of the longest path
[Figure: State element 1 -> Combinational logic -> State element 2, spanning one clock cycle]
We are ignoring some details like setup and hold times

We designed a Single Cycle Implementation
Calculate cycle time assuming negligible delays except: memory (2ns), ALU and adders (2ns), register file access (1ns)
[Figure: complete single cycle datapath with PCSrc, RegWrite, RegDst, ALUSrc, ALUOp, ALU control, MemRead, MemWrite, and MemtoReg]
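The cycle time exercise can be worked through by summing the delays along the path each instruction class uses. This sketch assumes delays of 2 ns for memory, 2 ns for the ALU and adders, and 1 ns for a register file access, and it assumes which units each instruction passes through; the names `PATHS`, `delays`, and `cycle_time` are hypothetical.

```python
MEM, ALU, REG = 2, 2, 1  # assumed delays in ns

# Units on the critical path of each instruction class (assumed).
PATHS = {
    "R-type": [MEM, REG, ALU, REG],       # fetch, read regs, ALU, write reg
    "lw":     [MEM, REG, ALU, MEM, REG],  # fetch, read regs, address, memory, write reg
    "sw":     [MEM, REG, ALU, MEM],       # fetch, read regs, address, memory
    "beq":    [MEM, REG, ALU],            # fetch, read regs, compare
    "j":      [MEM],                      # fetch only
}

delays = {name: sum(path) for name, path in PATHS.items()}
cycle_time = max(delays.values())  # the slowest instruction sets the clock

print(delays)
print("cycle time:", cycle_time, "ns")
```

Under these assumptions the load is the slowest instruction, so every instruction pays the load's delay in a single cycle design.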

How does the single cycle datapath work?
Let us understand this by highlighting the portions of the datapath that are active when an R-type instruction is executed
For an R-type instruction we go through the following phases:
- Phase 1: Instruction Fetch
- Phase 2: Register File Read
- Phase 3: ALU execution
- Phase 4: Write the Result into the Register File
NOTE: All four phases are completed in only ONE clock cycle, and hence it is a single cycle implementation

R-type Instruction Phase 1 (Instruction Fetch)
[Figure: single cycle datapath with control, instruction fetch portion highlighted]

R-type Instruction Phase 2 (Register Read)
[Figure: single cycle datapath with control, register read portion highlighted]

R-type Instruction Phase 3 (ALU execution)
[Figure: single cycle datapath with control, ALU portion highlighted]

R-type Instruction Phase 4 (Write the Result)
[Figure: single cycle datapath with control, register write-back portion highlighted]

How do we handle jump?
[Figure: single cycle datapath extended for jump. Instruction [25-0] is shifted left by 2 (26 bits to 28 bits) and combined with PC+4 [31-28] to form Jump address [31-0]; a Jump control signal selects it through an additional multiplexor]
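The jump address formation shown in the figure can be sketched as bit manipulation. It assumes the standard MIPS j format (opcode 2 in bits 31-26, a 26-bit target in bits 25-0); the function name `jump_address` is hypothetical.

```python
def jump_address(pc, instruction):
    """Form the 32-bit jump target from PC + 4 and the instruction word."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    target26 = instruction & 0x03FFFFFF   # Instruction [25-0]
    shifted28 = target26 << 2             # shift left 2: 26 bits -> 28 bits
    upper4 = pc_plus_4 & 0xF0000000       # PC + 4 [31-28]
    return upper4 | shifted28


# j with a 26-bit target field of 0x100, fetched from address 0x40000000
instr = (0b000010 << 26) | 0x100
print(hex(jump_address(0x40000000, instr)))
```

Because only 28 bits come from the instruction, a jump stays within the 256 MB region selected by the upper four bits of PC + 4.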

Where we are headed
Single Cycle Problems:
- what if we had a more complicated instruction like floating point?
- wasteful of area
One Solution:
- use a smaller cycle time
- have different instructions take different numbers of cycles
- a multicycle datapath

Single Cycle Implementation: Summary
- All instructions are executed in only one clock cycle
- We built a single cycle datapath from scratch
- We designed an appropriate controller to generate the correct control signals
- Not all instructions are born equal: some require more work, some less. The disadvantage of a single cycle implementation is that the slowest instruction determines the clock cycle width
- In reality, nobody implements the single cycle approach
- Suggestion: STARE, STARE, STARE at the single cycle datapath to familiarize yourself
- Given the single cycle datapath, you should be able to highlight the active portions of the datapath for any given instruction (just as we did for an R-type instruction in class)

Single Cycle Implementation - Recap
Single Cycle Problems:
- what if we had a more complicated instruction like floating point?
- wasteful of area
- cycle width determined by the slowest instruction
One Solution:
- use a smaller cycle time
- have different instructions take different numbers of cycles
- a multicycle datapath

Multicycle Approach: High Level View
[Figure: PC, Memory (Address, Instruction or data), Instruction register, Memory data register, Registers (three Register # inputs, Data), A and B registers, ALU, ALUOut]

Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: timing comparison against the clock.
 Single Cycle Implementation: Load takes all of Cycle 1; Store takes Cycle 2 and leaves part of it as Waste.
 Multiple Cycle Implementation (Cycle 1 to Cycle 10): Load = Ifetch, Reg, Exec, Mem, Wr; Store = Ifetch, Reg, Exec, Mem; R-type = Ifetch, ...
 Pipeline Implementation: Load, Store, and R-type each go through Ifetch, Reg, Exec, Mem, Wr, overlapped and offset by one cycle]

Basic Idea
- IF: Instruction fetch
- ID: Instruction decode / register file read
- EX: Execute / address calculation
- MEM: Memory access
- WB: Write back
[Figure: single cycle datapath divided into the five stages IF, ID, EX, MEM, WB]

Corrected Datapath
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB between the stages]

Datapath with Control
[Figure: pipelined datapath with control. The Control unit generates the EX, M, and WB control groups, which travel through ID/EX, EX/MEM, and MEM/WB; signals shown include PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite, and MemtoReg]

Pipelining Summary
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Multiple tasks operate simultaneously using different resources
- Potential speedup = number of pipe stages
- Pipeline rate is limited by the slowest pipeline stage
- Unbalanced lengths of pipe stages reduce speedup
- Time to fill the pipeline and time to drain it reduce speedup
- Three types of pipeline hazards: structural, data, and control/branch
- Stalling helps any kind of hazard
- Data hazard solutions: stalling, forwarding, hazard detection
- Control or branch hazard solutions: stalling, branch prediction, delayed branching
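The claims about potential speedup and fill/drain time can be checked with simple cycle counting. This is a sketch under idealized assumptions: perfectly balanced stages, no hazards or stalls, and an unpipelined baseline in which every instruction takes one cycle per stage. The function names are hypothetical.

```python
def unpipelined_cycles(n_instructions, n_stages):
    # Each instruction runs through every stage before the next one starts.
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    # n_stages cycles to fill the pipeline for the first instruction,
    # then one instruction completes per cycle.
    return n_stages + (n_instructions - 1)


def speedup(n_instructions, n_stages):
    return unpipelined_cycles(n_instructions, n_stages) / pipelined_cycles(
        n_instructions, n_stages
    )


for n in (3, 10, 1000):
    print(n, pipelined_cycles(n, 5), round(speedup(n, 5), 2))
```

For short instruction sequences the fill time dominates and the speedup is well below the number of stages; it approaches 5 only as the instruction count grows.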