Outline
- Combinational & sequential logic
- Single-cycle CPU
- Multi-cycle CPU

[Figure slides: Combinational Element; Examples of Combinational Elements; State (Sequential) Element; Clocking Methodology; Input/Output of Elements]
[Figure slides: Register File; MIPS64 Formats]

Common Steps in Execution
- Execution of all instructions requires the following steps:
  - send the PC to memory and fetch the instruction stored at the location specified by the PC
  - read 1-2 registers, using the fields specifying the registers in the instruction
- All instructions use ALU functionality:
  - data transfer instructions: compute address
  - ALU instructions: execute ALU operations
  - branch instructions: comparison & address computation

Differences in Execution
- Data transfer (strictly load/store ISA)
  - load: access memory for read data {ld R, (R)}
  - store: access memory for write data {sd (R), R}
- ALU instruction
  - no memory access for operands
  - access a register for write of result {add R1, R2, R3}
- Branch instruction
  - change PC content based on comparison {bnez R, Loop}

[Figure slides: Summary; Datapath & Control]

Datapath & Control
- Datapath: the signal path through which data in the CPU flows, including the functional elements
- Elements of the datapath:
  - combinational elements
  - state (sequential) elements
- Control path: the signal path from the controller to the datapath elements
  - exercises timing & control over datapath elements
What Should be in the Datapath
- At a minimum we need combinational and sequential logic elements in the datapath to support the following functions:
  - fetch instructions and data from memory
  - read registers
  - decode instructions and dispatch them to the execution unit
  - execute arithmetic & logic operations
  - update state elements (registers and memory)

Datapath Schematic
[Figure: datapath schematic]

Datapath Building Blocks: Instruction Access
- Program Counter (PC): a register that points to the next instruction to be fetched
  - it is incremented each clock cycle
- The content of the PC is input to the Instruction Memory
- The instruction is fetched and supplied to downstream datapath elements
- An Adder is used to increment the PC by 4 in preparation for the next instruction (why?)
- Adder: an ALU with control input hardwired to perform the add operation only
- For reasons that will become clear later, we assume separate memory units for instructions & data

Datapath Building Blocks: R-Type Instruction
- Used for arithmetic & logic operations
  - read two registers, rs and rt
  - the ALU operates on the registers' content
  - write the result to rd
- Example: add R1, R2, R3 gives rs=R2, rt=R3, rd=R1
- Controls:
  - RegWrite is asserted to enable the write at the clock edge
  - ALUop controls the operation

I-Type Instruction: load/store
- rs contains the base field for the displacement address mode
- rt specifies the register
  - to load from memory, for a load
  - to write to memory, for a store
- Immediate contains the address offset
- To compute the memory address, we must
  - sign-extend the 16-bit immediate to 64 bits
  - add it to the base in rs

Required Datapath Elements for load/store
- Register file
  - load: registers to read for the base address & to write for the data
  - store: registers to read for the base address & for the data
- Sign extender
  - to sign-extend and condition the immediate field for 2's complement addition of the address offset using the 64-bit ALU
- ALU
  - to add the base address and the sign-extended immediate field
- Data memory to load/store data
  - memory address; data input for store; data output for load
  - control inputs: MemRead, MemWrite, clock
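The load/store address computation described above can be sketched in a few lines. This is a minimal illustration, not the slides' own code: the helper names `sign_extend` and `effective_address` are hypothetical, and the 16-bit field width and 64-bit register width are assumptions taken from the MIPS64 setting of these slides.

```python
def sign_extend(value: int, from_bits: int = 16, to_bits: int = 64) -> int:
    """Sign-extend a from_bits-wide field to a to_bits-wide two's complement pattern."""
    value &= (1 << from_bits) - 1           # keep only the bits of the field
    if value & (1 << (from_bits - 1)):      # sign bit set: the field is negative
        value -= 1 << from_bits
    return value & ((1 << to_bits) - 1)     # re-encode as an unsigned to_bits pattern


def effective_address(base: int, imm16: int, width: int = 64) -> int:
    """Base register plus sign-extended offset, wrapped to the register width."""
    mask = (1 << width) - 1
    return (base + sign_extend(imm16, 16, width)) & mask


if __name__ == "__main__":
    # Offset -4 (0xFFFC as a 16-bit field) applied to base 0x1000.
    print(hex(effective_address(0x1000, 0xFFFC)))
```

The wrap-around mask stands in for the fixed width of the ALU: a negative offset becomes a large unsigned pattern, and the carry out of the top bit is simply discarded.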
Datapath Building Blocks: load/store
[Figure: load/store datapath with register file, sign extend, ALU (zero output), and data memory (MemRead, MemWrite)]

I-Type Instruction: bne
[I-type format: opcode (6 bits), rs (5 bits), rt (5 bits), immediate (16 bits)]
- The branch datapath must compute the branch condition & the branch address
  - rs and rt refer to the registers to be compared for the branch condition
  - if Reg[rs] != Reg[rt], PC = PC + Imm<<2 (note that at this point the PC is already incremented; in effect PC_current = (PC_previous + 4) + Imm<<2)
  - else if Reg[rs] == Reg[rt], the PC remains unchanged: PC_current = (PC_previous + 4), so the next sequential instruction is taken
- Required functional elements: RegFile, sign extender, adder, shifter

Sign Extend & Shift Operations
- Sign extension is required because
  - the 16-bit offset must be expanded to 64 bits in order to be used in the 64-bit adder
  - we are using 2's complement arithmetic
- Shift by 2 is required because
  - instructions are 32 bits wide and are aligned on a word (4 bytes) boundary
  - in effect we are using an 18-bit offset instead of 16

Datapath Building Blocks: bne
[Figure: bne datapath with sign extend, shift left 2, and branch adder]

Computing PC & Branch Condition
- The operands of bne are compared in the same ALU we use for load/store/arithmetic/logic instructions
  - the ALU provides a ZERO output signal to indicate the condition
  - the ZERO signal controls what instruction will be fetched next, depending on whether the branch is taken or not
- We also need to compute the address
  - we may not be able to use the ALU if it is being used to compute the branch condition (more on this later)
  - need an additional ADDER (an ALU hardwired to add only) to compute the branch address

Putting it All Together
- Combine datapath building blocks to build the full datapath
  - now we must decide some specifics of implementation
- Single-cycle CPU
  - each instruction executes in one clock cycle
  - CPI = 1 for all instructions
- Multi-cycle CPU
  - instructions execute in multiples of a shorter clock cycle
  - different instructions have different CPI
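The bne next-PC arithmetic described above (increment first, then add the shifted, sign-extended offset) can be sketched as follows. The function names are hypothetical, and the increment of 4 and shift of 2 are assumed from the 32-bit, word-aligned instructions these slides describe.

```python
def sign_extend16(imm16: int) -> int:
    """Interpret a 16-bit field as a signed two's complement integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc_bne(pc: int, rs_val: int, rt_val: int, imm16: int) -> int:
    """Next PC for bne: the PC is incremented first, then the offset is applied."""
    pc_plus_4 = pc + 4                       # address of the next sequential instruction
    if rs_val != rt_val:                     # operands differ: branch taken
        return pc_plus_4 + (sign_extend16(imm16) << 2)
    return pc_plus_4                         # operands equal: fall through


if __name__ == "__main__":
    print(hex(next_pc_bne(0x100, 1, 2, 3)))       # taken, forward by 3 words
    print(hex(next_pc_bne(0x100, 5, 5, 3)))       # not taken
```

Because the offset counts words rather than bytes, a 16-bit field reaches as far as an 18-bit byte offset would.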
Single-Cycle CPU
- One clock cycle for all instructions
- No datapath resource can be used more than once per clock cycle
  - results in resource duplication for elements that must be used more than once
  - examples: separate memory units for instructions and data; two ALUs for conditional branches
- Some datapath elements may be shared through multiplexing as long as they are used once

The Processor: Datapath & Control
- We're ready to look at an implementation of the MIPS
- Simplified to contain only:
  - memory-reference instructions: lw, sw
  - arithmetic-logical instructions: add, sub, and, or, slt
  - control flow instructions: beq, j
- Generic implementation:
  - use the program counter (PC) to supply the instruction address
  - get the instruction from memory
  - read registers
  - use the instruction to decide exactly what to do
- All instructions use the ALU after reading the registers. Why? memory-reference? arithmetic? control flow?

MIPS Fetch-Execute Processor Architecture
[Figure sequence: initialize first instruction; activate; route to register]
[Figure sequence: route instruction to the Instruction Register (IR); select appropriate data from the register file; route data to the Arithmetic Logic Unit (ALU); do the computation; store the result; increment PC to point to the next instruction]
[Figure sequence: increment PC to point to the next instruction; execute the next instruction]

State Elements
- Unclocked vs. clocked
- Clocks are used in synchronous logic
  - when should an element that contains state be updated?
[Figure: clock waveform showing falling edge, rising edge, clock period, cycle time]

An unclocked state element
- The set-reset latch
  - output depends on present inputs and also on past inputs
[Figure: SR latch with inputs R, S and outputs Q and its complement]

Latches and Flip-flops
- Output is equal to the stored value inside the element (don't need to ask for permission to look at the value)
- Change of state (value) is based on the clock
- Latches: output changes whenever the inputs change and the clock is asserted (level-triggered methodology)
- Flip-flops: state changes only on a clock edge (edge-triggered methodology)
- A clocking methodology defines when signals can be read and written; we wouldn't want to read a signal at the same time it was being written

D-latch
- Two inputs:
  - the data value to be stored (D)
  - the clock signal (C) indicating when to read & store D
- Two outputs:
  - the value of the internal state (Q) and its complement
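The difference between the level-triggered latch and the edge-triggered flip-flop can be illustrated with a small behavioural model. This is a sketch under assumptions: the class names are hypothetical, and the choice of the falling edge for the flip-flop is an assumption, since the text above does not say which edge is used.

```python
class DLatch:
    """Level-triggered D latch: transparent while the clock C is asserted."""

    def __init__(self) -> None:
        self.q = 0                      # stored state

    def update(self, c: int, d: int) -> tuple:
        if c:                           # clock asserted: Q follows D
            self.q = d & 1
        return self.q, self.q ^ 1       # (Q, complement of Q)


class DFlipFlop:
    """Edge-triggered D flip-flop; the falling edge is an assumed choice."""

    def __init__(self) -> None:
        self.q = 0
        self._prev_c = 0                # clock value seen on the previous call

    def update(self, c: int, d: int) -> tuple:
        if self._prev_c == 1 and c == 0:    # falling edge detected
            self.q = d & 1
        self._prev_c = c
        return self.q, self.q ^ 1


if __name__ == "__main__":
    latch, ff = DLatch(), DFlipFlop()
    for c, d in [(1, 1), (0, 0), (1, 0), (0, 0)]:
        print((c, d), latch.update(c, d), ff.update(c, d))
```

Feeding both models the same (C, D) sequence shows the latch changing while the clock is high and the flip-flop changing only at the edge.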
D flip-flop
- Output changes only on the clock edge
[Figure: D flip-flop built from two D latches]

Our Implementation
- An edge-triggered methodology
- Typical execution:
  - read contents of some state elements
  - send values through some combinational logic
  - write results to one or more state elements
[Figure: state element, combinational logic, state element, within one clock cycle]

Register File
- Built using D flip-flops
[Figure: register file read ports; two read register numbers select among Register 0 ... Register n through multiplexors]
- Do you understand? What is the "Mux" above?

Abstraction
- Make sure you understand the abstractions!
- Sometimes it is easy to think you do, when you don't
[Figure: a wide multiplexor (inputs A, B, Select) drawn as an array of 1-bit multiplexors]

Register File
- Note: we still use the real clock to determine when to write
[Figure: register file write port; Write signal and register number feed an n-to-2^n decoder selecting Register 0 ... Register n]

Simple Implementation
- Include the functional units we need for each instruction
[Figure: a. instruction memory, b. program counter, c. adder; registers and ALU; a. data memory unit, b. sign-extension unit]
Building the Datapath
- Use multiplexors to stitch them together
[Figure: single-cycle datapath with multiplexors controlled by PCSrc, ALUSrc, and MemtoReg]

Control
- Selecting the operations to perform (ALU, read/write, etc.)
- Controlling the flow of data (multiplexor inputs)
- Information comes from the 32 bits of the instruction
- Example: add $8, $17, $18
  Instruction format:
    000000  10001  10010  01000  00000  100000
    op      rs     rt     rd     shamt  funct
- ALU's operation is based on instruction type and function code

Control
- e.g., what should the ALU do with this instruction?
- Example: lw $1, 100($2)
    35   2    1    100
    op   rs   rt   16 bit offset
- ALU control input:
    0000  AND
    0001  OR
    0010  add
    0110  subtract
    0111  set-on-less-than
    1100  NOR
- Why is the code for subtract 0110 and not 0011?

Control
- Must describe hardware to compute the 4-bit ALU control input
  - given instruction type: 00 = lw, sw; 01 = beq; 10 = arithmetic (ALUOp computed from instruction type)
  - function code for arithmetic
- Describe it using a truth table (can turn into gates)

[Figure: single-cycle datapath with control unit; control outputs RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite]

Control signal settings:
  Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
  R-format     1       0       0         1         0        0         0       1       0
  lw           0       1       1         1         1        0         0       0       0
  sw           X       1       X         0         0        1         0       0       0
  beq          X       0       X         0         0        0         1       0       1
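The two-level ALU control described above (ALUOp from the instruction type, then the function code for arithmetic) can be written as a small lookup. This is a sketch: the function name `alu_control` is hypothetical, and the 4-bit encodings and funct values are the usual textbook MIPS values, assumed here rather than read from the slides.

```python
# Assumed 4-bit ALU control encodings (standard textbook values).
ALU_AND, ALU_OR, ALU_ADD, ALU_SUB, ALU_SLT = 0b0000, 0b0001, 0b0010, 0b0110, 0b0111

# MIPS funct field (instruction bits 5:0) for the R-type subset used here.
FUNCT_TO_ALU = {
    0b100000: ALU_ADD,   # add
    0b100010: ALU_SUB,   # sub
    0b100100: ALU_AND,   # and
    0b100101: ALU_OR,    # or
    0b101010: ALU_SLT,   # slt
}


def alu_control(alu_op: int, funct: int = 0) -> int:
    """Map the 2-bit ALUOp plus the funct field to the ALU control input."""
    if alu_op == 0b00:                   # lw / sw: add base and offset
        return ALU_ADD
    if alu_op == 0b01:                   # beq: subtract and test the Zero output
        return ALU_SUB
    if alu_op == 0b10:                   # R-type: decided by the function code
        return FUNCT_TO_ALU[funct & 0x3F]
    raise ValueError("unsupported ALUOp")


if __name__ == "__main__":
    print(format(alu_control(0b10, 0b100010), "04b"))   # sub
```

Splitting the decode this way keeps the main control small: it only needs to classify the opcode, and the funct bits are examined in one place.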
[Figure slides: Load; Branch on Equal]

Control
- Simple combinational logic (truth tables)
[Figure: ALU control block (inputs ALUOp and function bits F3-F0, output Operation) and main control logic (inputs Op5-Op0; outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp)]

Our Simple Control Structure
- All of the logic is combinational
- We wait for everything to settle down, and the right thing to be done
  - the ALU might not produce the "right answer" right away
  - we use write signals along with the clock to determine when to write
- Cycle time is determined by the length of the longest path
[Figure: state element, combinational logic, state element, within one clock cycle]
- We are ignoring some details like setup and hold times

Single Cycle Implementation
- Calculate cycle time assuming negligible delays except:
  - memory (200ps), ALU and adders (100ps), register file access (50ps)
[Figure: single-cycle datapath]

Where we are headed
- Single Cycle Problems:
  - what if we had a more complicated instruction like floating point?
  - wasteful of area
- One Solution:
  - use a "smaller" cycle time
  - have different instructions take different numbers of cycles
  - a "multicycle" datapath:
[Figure: multicycle datapath with PC, memory, instruction register, memory data register, registers, A, B, ALU, ALUOut]
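The cycle-time exercise above can be worked by summing the unit delays along each instruction class's path and taking the maximum. This is a sketch under assumptions: the delays of 200 ps (memory), 100 ps (ALU and adders), and 50 ps (register file access) and the unit sequences per class are assumed for illustration, and the names `PATHS`, `path_delays`, and `single_cycle_time` are hypothetical.

```python
# Assumed unit delays in picoseconds.
MEM_PS, ALU_PS, REG_PS = 200, 100, 50

# Units on each instruction class's critical path, in order of use.
PATHS = {
    "R-type": [MEM_PS, REG_PS, ALU_PS, REG_PS],           # fetch, read, ALU, write
    "lw":     [MEM_PS, REG_PS, ALU_PS, MEM_PS, REG_PS],   # fetch, read, ALU, mem, write
    "sw":     [MEM_PS, REG_PS, ALU_PS, MEM_PS],           # fetch, read, ALU, mem
    "beq":    [MEM_PS, REG_PS, ALU_PS],                   # fetch, read, compare
}


def path_delays() -> dict:
    """Total delay in ps for each instruction class."""
    return {name: sum(units) for name, units in PATHS.items()}


def single_cycle_time() -> int:
    """A single-cycle clock must cover the slowest instruction."""
    return max(path_delays().values())


if __name__ == "__main__":
    print(path_delays())
    print(single_cycle_time())   # 600 with the assumed delays (the lw path)
```

Under these assumptions every instruction pays for the load path, which is the motivation for the shorter multicycle clock.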
Multicycle Approach
- Reuse functional units
  - ALU used to compute address and to increment PC
  - memory used for instruction and data
- Control signals will not be determined directly by the instruction
  - e.g., what should the ALU do for a "subtract" instruction?
- There must be some sequencing involved, so use a finite state machine for control

Multicycle Approach
- Break up the instructions into steps; each step takes a cycle
  - balance the amount of work to be done
  - restrict each cycle to use only one major functional unit
- At the end of a cycle
  - store values for use in later cycles (easiest thing to do)
  - this introduces additional "internal" registers
[Figure: multicycle datapath with instruction register, memory data register, A, B, and ALUOut]

Instructions from ISA perspective
- Consider each instruction from the perspective of the ISA
- Example: the add instruction changes a register
  - instruction specified by the PC
  - destination register specified by bits 15:11 of the instruction
  - new value is the sum ("op") of two registers
  - source registers specified by bits 25:21 and 20:16 of the instruction
    Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]]
  - in order to accomplish this, we must break up the instruction (kind of like introducing variables when programming)

Breaking down an instruction
- ISA definition of arithmetic:
    Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]]
- Could break down to:
    IR <= Memory[PC]
    A <= Reg[IR[25:21]]
    B <= Reg[IR[20:16]]
    ALUOut <= A op B
    Reg[IR[15:11]] <= ALUOut
- Don't forget an important part of the operation:
    PC <= PC + 4

Idea behind multicycle approach
- We define each instruction from the ISA perspective
- Break it down into steps following the rule that data flows through, at most, one major functional unit (e.g., balance work across steps)
- Introduce new registers as needed (A, B, ALUOut, MDR, etc.)
- Finally, try to pack as much work into each step as possible (avoid unnecessary cycles) while also trying to share steps where possible (minimizes control and likely hardware, helps to simplify the solution)
- Result: the textbook's multicycle implementation

Five Execution Steps
1. Instruction Fetch
2. Instruction Decode and Register Fetch
3. Execution, Memory Address Computation, or Branch Completion
4. Memory Access or R-type instruction completion
5. Write-back step

INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!
Step 1: Instruction Fetch
- Use the PC to get the instruction and put it in the Instruction Register
- Increment the PC by 4 and put the result back in the PC
- Can be described succinctly using RTL, "Register-Transfer Language":
    IR <= Memory[PC];
    PC <= PC + 4;
- What is the advantage of updating the PC now?

Step 2: Instruction Decode and Register Fetch
- Read registers rs and rt in case we need them
- Compute the branch address in case the instruction is a branch
- RTL:
    A <= Reg[IR[25:21]];
    B <= Reg[IR[20:16]];
    ALUOut <= PC + (sign-extend(IR[15:0]) << 2);
- We aren't setting any control lines based on the instruction type

Step 3: (instruction dependent)
- The ALU is performing one of three functions, based on instruction type
- Memory Reference:
    ALUOut <= A + sign-extend(IR[15:0]);
- R-type:
    ALUOut <= A op B;
- Branch:
    if (A==B) PC <= ALUOut;

Step 4: (R-type or memory-access)
- Loads and stores access memory
    MDR <= Memory[ALUOut];
  or
    Memory[ALUOut] <= B;
- R-type instructions finish
    Reg[IR[15:11]] <= ALUOut;
- The write actually takes place at the end of the cycle on the edge

Step 5: Write-back step
    Reg[IR[20:16]] <= MDR;
- Which instruction needs this?

[Summary slide: table not recovered]
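To make the register-transfer steps concrete, here is a minimal sketch that walks one R-type add through the multicycle steps using the internal registers IR, A, B, and ALUOut. The function names, the dictionary-based memory, and the 32-bit result mask are assumptions made for illustration.

```python
def encode_r_type(rs: int, rt: int, rd: int, funct: int, shamt: int = 0, op: int = 0) -> int:
    """Pack the R-type fields into a 32-bit instruction word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def bits(word: int, hi: int, lo: int) -> int:
    """Extract word[hi:lo], inclusive, as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def sign_extend16(x: int) -> int:
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def run_add(memory: dict, reg: list, pc: int) -> tuple:
    """Execute one R-type add; returns (new_pc, cycles_used)."""
    # Step 1: instruction fetch
    ir = memory[pc]
    pc = pc + 4
    # Step 2: decode and register fetch (branch target computed speculatively)
    a = reg[bits(ir, 25, 21)]
    b = reg[bits(ir, 20, 16)]
    alu_out = pc + (sign_extend16(bits(ir, 15, 0)) << 2)
    # Step 3: execution (overwrites the unused branch target)
    alu_out = (a + b) & 0xFFFFFFFF          # 32-bit result, an assumed width
    # Step 4: R-type completion
    reg[bits(ir, 15, 11)] = alu_out
    return pc, 4


if __name__ == "__main__":
    regs = [0] * 32
    regs[17], regs[18] = 5, 7
    mem = {0: encode_r_type(rs=17, rt=18, rd=8, funct=0b100000)}   # add $8, $17, $18
    print(run_add(mem, regs, 0), regs[8])
```

Each commented step corresponds to one clock cycle, so an R-type instruction finishes in four cycles and never reaches the write-back step.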
Simple Questions
- How many cycles will it take to execute this code?
          lw $t2, 0($t3)
          lw $t3, 4($t3)
          beq $t2, $t3, Label
          nop
          add $t5, $t2, $t3
          sw $t5, 8($t3)
  Label:  ...
- What is going on during the 8th cycle of execution?
- In what cycle does the actual addition of $t2 and $t3 take place?

[Figure: multicycle datapath with control; control outputs PCWriteCond, PCWrite, IorD, MemRead, MemWrite, MemtoReg, IRWrite, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst]

Review: finite state machines
- Finite state machines:
  - a set of states and
  - a next state function (determined by current state and the input)
  - an output function (determined by current state and possibly input)
[Figure: inputs and current state feed the next-state function; the clocked state feeds the output function]
- We'll use a Moore machine (output based only on current state)

Review: finite state machines
- Example: B. 3
  A friend would like you to build an "electronic eye" for use as a fake security device. The device consists of three lights lined up in a row, controlled by the outputs Left, Middle, and Right, which, if asserted, indicate that a light should be on. Only one light is on at a time, and the light "moves" from left to right and then from right to left, thus scaring away thieves who believe that the device is monitoring their activity. Draw the graphical representation for the finite state machine used to specify the electronic eye. Note that the rate of the eye's movement will be controlled by the clock speed (which should not be too great) and that there are essentially no inputs.
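One possible answer to the electronic-eye exercise is a four-state Moore machine, sketched below. It is only one way to solve it: the state names and the choice of starting on the left light are assumptions. The middle light needs two states so that the machine remembers the direction of travel.

```python
# Next-state function: there are no inputs, so it depends on the state alone.
NEXT_STATE = {
    "LEFT": "MID_RIGHTWARD",
    "MID_RIGHTWARD": "RIGHT",
    "RIGHT": "MID_LEFTWARD",
    "MID_LEFTWARD": "LEFT",
}

# Moore output function: (Left, Middle, Right) depends only on the current state.
OUTPUTS = {
    "LEFT": (1, 0, 0),
    "MID_RIGHTWARD": (0, 1, 0),
    "RIGHT": (0, 0, 1),
    "MID_LEFTWARD": (0, 1, 0),
}


def run_eye(cycles: int, state: str = "LEFT") -> list:
    """Return the light outputs for the given number of clock cycles."""
    trace = []
    for _ in range(cycles):
        trace.append(OUTPUTS[state])     # outputs are read from the current state
        state = NEXT_STATE[state]        # state advances on each clock edge
    return trace


if __name__ == "__main__":
    for left, middle, right in run_eye(8):
        print(left, middle, right)
```

Four states need two state bits, and exactly one output is asserted in every state.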
Implementing the Control
- Value of control signals is dependent upon:
  - what instruction is being executed
  - which step is being performed
- Use the information we've accumulated to specify a finite state machine
  - specify the finite state machine graphically, or
  - use microprogramming
- Implementation can be derived from specification

Graphical Specification of FSM
- Note:
  - don't care if not mentioned
  - asserted if name only
  - otherwise exact value
- How many state bits will we need?
[Figure: FSM state diagram with states for instruction fetch (start), instruction decode/register fetch, memory address computation, memory access (load), memory read completion step, memory access (store), execution, R-type completion, branch completion, and jump completion, each listing the control signals it asserts]
Finite State Machine for Control
- Implementation:
[Figure: control logic with inputs Op5-Op0 (instruction register opcode field) and state bits S3-S0; outputs PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, and next-state bits NS3-NS0 feeding the state register]

PLA Implementation
- If I picked a horizontal or vertical line could you explain it?
[Figure: PLA with inputs Op5-Op0 and S3-S0; outputs are the control signals and NS3-NS0]

ROM Implementation
- ROM = "Read Only Memory"
  - values of memory locations are fixed ahead of time
- A ROM can be used to implement a truth table
  - if the address is m bits, we can address 2^m entries in the ROM
  - our outputs are the bits of data that the address points to
  - m is the "height", and n is the "width"

ROM Implementation
- How many inputs are there?
  6 bits for opcode, 4 bits for state = 10 address lines (i.e., 2^10 = 1024 different addresses)
- How many outputs are there?
  16 datapath-control outputs, 4 state bits = 20 outputs
- ROM is 2^10 x 20 = 20K bits (and a rather unusual size)
- Rather wasteful, since for lots of the entries the outputs are the same, i.e., the opcode is often ignored

ROM vs PLA
- Break up the table into two parts
  - 4 state bits tell you the 16 outputs, 2^4 x 16 bits of ROM
  - 10 bits tell you the 4 next state bits, 2^10 x 4 bits of ROM
  - Total: 4.3K bits of ROM
- PLA is much smaller
  - can share product terms
  - only need entries that produce an active output
  - can take into account don't cares
- Size is (#inputs x #product-terms) + (#outputs x #product-terms)
  For this example = (10x17)+(20x17) = 510 PLA cells
- PLA cells are usually about the size of a ROM cell (slightly bigger)

Another Implementation Style
- Complex instructions: the "next state" is often current state + 1
[Figure: control unit with PLA or ROM, state register, adder (+1), and address select logic driven by AddrCtl and Op[5-0] from the instruction register opcode field; outputs are the datapath control signals]
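The ROM and PLA size comparison above is simple arithmetic and can be checked with a short script. The counts used here (6 opcode bits, 4 state bits, 16 datapath-control outputs, 17 product terms) are assumptions for the worked example, and the function names `rom_bits` and `pla_cells` are hypothetical.

```python
def rom_bits(address_bits: int, output_bits: int) -> int:
    """A ROM stores one output word for every possible address."""
    return (2 ** address_bits) * output_bits


def pla_cells(inputs: int, outputs: int, product_terms: int) -> int:
    """AND-plane cells plus OR-plane cells."""
    return inputs * product_terms + outputs * product_terms


# Assumed sizes for the worked example.
OPCODE_BITS, STATE_BITS, CONTROL_OUTPUTS, PRODUCT_TERMS = 6, 4, 16, 17

address_bits = OPCODE_BITS + STATE_BITS           # 10 address lines
output_bits = CONTROL_OUTPUTS + STATE_BITS        # 20 outputs

single_rom = rom_bits(address_bits, output_bits)                # one big table
split_rom = (rom_bits(STATE_BITS, CONTROL_OUTPUTS)              # state -> control outputs
             + rom_bits(address_bits, STATE_BITS))              # opcode+state -> next state
pla = pla_cells(address_bits, output_bits, PRODUCT_TERMS)

if __name__ == "__main__":
    print(single_rom, split_rom, pla)    # 20480 4352 510
```

With 1K = 1024, the single ROM is 20K bits and the split ROMs are 4.25K bits, which the slide rounds to 4.3K.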
Details
- Dispatch ROM 1
    Op      Opcode name  Value
    000000  R-format     0110
    000010  jmp          1001
    000100  beq          1000
    100011  lw           0010
    101011  sw           0010
- Dispatch ROM 2
    Op      Opcode name  Value
    100011  lw           0011
    101011  sw           0101
- Address-control action per state
    State number  Address-control action     Value of AddrCtl
    0             Use incremented state      3
    1             Use dispatch ROM 1         1
    2             Use dispatch ROM 2         2
    3             Use incremented state      3
    4             Replace state number by 0  0
    5             Replace state number by 0  0
    6             Use incremented state      3
    7             Replace state number by 0  0
    8             Replace state number by 0  0
    9             Replace state number by 0  0
[Figure: PLA or ROM with state register, adder, dispatch ROM 1, dispatch ROM 2, and address select logic driven by AddrCtl and the instruction register opcode field]

Microprogramming
[Figure: control unit with microcode memory, microprogram counter, adder, and address select logic; outputs are the datapath control signals; input is the instruction register opcode field]
- What are the "microinstructions"?

Microprogramming
- A specification methodology
  - appropriate if hundreds of opcodes, modes, cycles, etc.
  - signals specified symbolically using microinstructions

    Label     ALU control  SRC1  SRC2     Register control  Memory     PCWrite control  Sequencing
    Fetch     Add          PC    4                          Read PC    ALU              Seq
              Add          PC    Extshft  Read                                          Dispatch 1
    Mem1      Add          A     Extend                                                 Dispatch 2
    LW2                                                     Read ALU                    Seq
                                          Write MDR                                     Fetch
    SW2                                                     Write ALU                   Fetch
    Rformat1  Func code    A     B                                                      Seq
                                          Write ALU                                     Fetch
    BEQ1      Subt         A     B                                     ALUOut-cond      Fetch
    JUMP1                                                              Jump address     Fetch

- Will two implementations of the same architecture have the same microcode?
- What would a microassembler do?

Microinstruction format
(each entry: field value, signals active, comment)
- ALU control
  - Add: ALUOp = 00. Cause the ALU to add.
  - Subt: ALUOp = 01. Cause the ALU to subtract; this implements the compare for branches.
  - Func code: ALUOp = 10. Use the instruction's function code to determine ALU control.
- SRC1
  - PC: ALUSrcA = 0. Use the PC as the first ALU input.
  - A: ALUSrcA = 1. Register A is the first ALU input.
- SRC2
  - B: ALUSrcB = 00. Register B is the second ALU input.
  - 4: ALUSrcB = 01. Use 4 as the second ALU input.
  - Extend: ALUSrcB = 10. Use output of the sign extension unit as the second ALU input.
  - Extshft: ALUSrcB = 11. Use the output of the shift-by-two unit as the second ALU input.
- Register control
  - Read: Read two registers using the rs and rt fields of the IR as the register numbers and putting the data into registers A and B.
  - Write ALU: RegWrite, RegDst = 1, MemtoReg = 0. Write a register using the rd field of the IR as the register number and the contents of the ALUOut as the data.
  - Write MDR: RegWrite, RegDst = 0, MemtoReg = 1. Write a register using the rt field of the IR as the register number and the contents of the MDR as the data.
- Memory
  - Read PC: MemRead, IorD = 0. Read memory using the PC as address; write result into IR (and the MDR).
  - Read ALU: MemRead, IorD = 1. Read memory using the ALUOut as address; write result into MDR.
  - Write ALU: MemWrite, IorD = 1. Write memory using the ALUOut as address, contents of B as the data.
- PC write control
  - ALU: PCSource = 00, PCWrite. Write the output of the ALU into the PC.
  - ALUOut-cond: PCSource = 01, PCWriteCond. If the Zero output of the ALU is active, write the PC with the contents of the register ALUOut.
  - jump address: PCSource = 10, PCWrite. Write the PC with the jump address from the instruction.
- Sequencing
  - Seq: AddrCtl = 11. Choose the next microinstruction sequentially.
  - Fetch: AddrCtl = 00. Go to the first microinstruction to begin a new instruction.
  - Dispatch 1: AddrCtl = 01. Dispatch using the ROM 1.
  - Dispatch 2: AddrCtl = 10. Dispatch using the ROM 2.

Maximally vs. Minimally Encoded
- No encoding:
  - 1 bit for each datapath operation
  - faster, requires more memory (logic)
  - used for Vax 780, an astonishing 400K of memory!
- Lots of encoding:
  - send the microinstructions through logic to get control signals
  - uses less memory, slower
- Historical context of CISC:
  - too much logic to put on a single chip with everything else
  - use a ROM (or even RAM) to hold the microcode
  - it's easy to add new instructions

Microcode: Trade-offs
- Distinction between specification and implementation is sometimes blurred
- Specification advantages:
  - easy to design and write
  - design architecture and microcode in parallel
- Implementation (off-chip ROM) advantages:
  - easy to change since values are in memory
  - can emulate other architectures
  - can make use of internal registers
- Implementation disadvantages, SLOWER now that:
  - control is implemented on the same chip as the processor
  - ROM is no longer faster than RAM
  - no need to go back and make changes
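The address-select logic and dispatch ROMs described earlier for the sequencer-based control unit can be sketched as a next-state function. This is a rough sketch: the table contents follow the usual textbook state numbering (0 for fetch through 9 for jump completion) and are assumed rather than read from the slides, and the names `next_state` and `state_sequence` are hypothetical.

```python
# Dispatch ROMs map an opcode name to the next state number (assumed values).
DISPATCH_ROM_1 = {"R-format": 6, "jmp": 9, "beq": 8, "lw": 2, "sw": 2}
DISPATCH_ROM_2 = {"lw": 3, "sw": 5}

# AddrCtl per state: 0 = go to state 0, 1 = dispatch ROM 1,
# 2 = dispatch ROM 2, 3 = use the incremented state.
ADDR_CTL = {0: 3, 1: 1, 2: 2, 3: 3, 4: 0, 5: 0, 6: 3, 7: 0, 8: 0, 9: 0}


def next_state(state: int, opcode_name: str) -> int:
    """Address select logic: choose the source of the next state number."""
    ctl = ADDR_CTL[state]
    if ctl == 0:
        return 0                              # start fetching a new instruction
    if ctl == 1:
        return DISPATCH_ROM_1[opcode_name]
    if ctl == 2:
        return DISPATCH_ROM_2[opcode_name]
    return state + 1                          # sequential: adder output


def state_sequence(opcode_name: str) -> list:
    """States visited by one instruction, starting from fetch (state 0)."""
    seq = [0]
    while True:
        nxt = next_state(seq[-1], opcode_name)
        if nxt == 0:
            return seq
        seq.append(nxt)


if __name__ == "__main__":
    for name in ("lw", "sw", "R-format", "beq", "jmp"):
        print(name, state_sequence(name))
```

The length of each sequence is the cycle count for that instruction class, which falls between 3 and 5 under these assumptions.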
Historical Perspective
- In the '60s and '70s microprogramming was very important for implementing machines
- This led to more sophisticated ISAs and the VAX
- In the '80s RISC processors based on pipelining became popular
- Pipelining the microinstructions is also possible!
- Implementations of IA-32 architecture processors since the 486 use:
  - "hardwired control" for simpler instructions (few cycles, FSM control implemented using PLA or random logic)
  - "microcoded control" for more complex instructions (large numbers of cycles, central control store)
- The IA-64 architecture uses a RISC-style ISA and can be implemented without a large central control store

Pentium 4
- Pipelining is important (last IA-32 without it was the 80386 in 1985)
[Figure: Pentium 4 die with control, I/O interface, instruction cache, data cache, enhanced floating point and multimedia, integer datapath, secondary cache and memory interface, advanced pipelining and hyperthreading support; regions annotated with the chapters that cover them]
- Pipelining is used for the simple instructions favored by compilers
  "Simply put, a high performance implementation needs to ensure that the simple instructions execute quickly, and that the burden of the complexities of the instruction set penalize the complex, less frequently used, instructions"

Pentium 4
- Somewhere in all that "control" we must handle complex instructions
- Processor executes simple microinstructions, 70 bits wide (hardwired)
- 120 control lines for integer datapath (400 for floating point)
- If an instruction requires more than 4 microinstructions to implement, control comes from the microcode ROM (8000 microinstructions)
- It's complicated!

Chapter 5 Summary
- If we understand the instructions, we can build a simple processor!
- If instructions take different amounts of time, multi-cycle is better
- Datapath implemented using:
  - combinational logic for arithmetic
  - state holding elements to remember bits
- Control implemented using:
  - combinational logic for single-cycle implementation
  - finite state machine for multi-cycle implementation