The final datapath
[Figure: the final single-cycle datapath, with the PC, instruction memory, register file, ALU, data memory, sign extend and shift-left-2 units, the two extra adders, the multiplexers, and the control signals PCSrc, RegWrite, RegDst, ALUSrc, ALUOp, MemWrite, MemRead and MemToReg.]

Control
The control unit is responsible for setting all the control signals so that each instruction is executed properly.
- The control unit's input is the 32-bit instruction word.
- The outputs are values for the blue control signals in the datapath.
Most of the signals can be generated from the instruction opcode alone, and not the entire 32-bit word.
To illustrate the relevant control signals, we will show the route that is taken through the datapath by R-type, lw, sw and beq instructions.

R-type instruction path
The R-type instructions include add, sub, and, or, and slt.
The ALUOp is determined by the instruction's func field.
[Figure: single-cycle datapath showing the route taken by an R-type instruction.]

lw instruction path
An example load instruction is lw $t0, -4($sp).
The ALUOp must select add, to compute the effective address.
[Figure: single-cycle datapath showing the route taken by lw.]
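The effective-address computation for a load with a negative offset such as -4 can be sketched in Python. This is only an illustration: the helper names `sign_extend16` and `effective_address` and the sample value of $sp are assumptions, not something taken from the slides.

```python
def sign_extend16(imm16: int) -> int:
    """Interpret a 16-bit instruction field as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def effective_address(base: int, imm16: int) -> int:
    """ALU add of the base register and the sign-extended constant, wrapped to 32 bits."""
    return (base + sign_extend16(imm16)) & 0xFFFFFFFF


# Hypothetical stack pointer value; 0xFFFC is the 16-bit encoding of -4.
sp = 0x7FFFEFFC
addr = effective_address(sp, 0xFFFC)
```

The same computation serves sw, since both instructions form their address as base register plus sign-extended constant.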

sw instruction path
An example store instruction is sw $a0, 16($sp).
The ALUOp must select add, again to compute the effective address.
[Figure: single-cycle datapath showing the route taken by sw.]

beq instruction path
One sample branch instruction is beq $at, $0, offset.
The ALUOp is subtract, to test for equality.
The branch may or may not be taken, depending on the ALU's Zero output.
[Figure: single-cycle datapath showing the route taken by beq.]
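As a rough sketch of the branch decision, the Python function below models the subtract, the Zero output, and the choice between PC + 4 and the branch target. The function name `next_pc` and its argument names are illustrative assumptions; `offset` is taken to be the already sign-extended word offset.

```python
def next_pc(pc: int, rs_val: int, rt_val: int, offset: int) -> int:
    """Return the next PC for a beq, given the two register values being compared."""
    # The ALU subtracts the operands; Zero is true when they are equal.
    zero = ((rs_val - rt_val) & 0xFFFFFFFF) == 0
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    # The branch adder adds the shifted offset to PC + 4.
    branch_target = (pc_plus_4 + (offset << 2)) & 0xFFFFFFFF
    pc_src = zero  # the instruction is known to be beq in this sketch
    return branch_target if pc_src else pc_plus_4
```

For example, `next_pc(0x1000, 5, 5, 3)` selects the branch target, while `next_pc(0x1000, 5, 6, 3)` falls through to PC + 4.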

Control signal table

Operation  RegDst  RegWrite  ALUSrc  ALUOp  MemWrite  MemRead  MemToReg
add        1       1         0       add    0         0        0
sub        1       1         0       sub    0         0        0
and        1       1         0       and    0         0        0
or         1       1         0       or     0         0        0
slt        1       1         0       slt    0         0        0
lw         0       1         1       add    0         1        1
sw         X       0         1       add    1         0        X
beq        X       0         0       sub    0         0        X

- sw and beq are the only instructions that do not write any registers.
- lw and sw are the only instructions that feed the constant field to the ALU (ALUSrc = 1). They also depend on the ALU to compute the effective memory address.
- ALUOp for R-type instructions depends on the instruction's func field.
- The PCSrc control signal (not listed) should be set if the instruction is beq and the ALU's Zero output is true.
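One way to make the table concrete is a small lookup structure. The sketch below assumes the usual multiplexer polarity (RegDst = 1 selects the rd field, ALUSrc = 1 selects the sign-extended constant, MemToReg = 1 selects the memory output) and writes ALUOp as an operation name rather than a bit pattern; the names `Control` and `CONTROL` are illustrative.

```python
from typing import NamedTuple, Optional

X = None  # stands for a "don't care" entry


class Control(NamedTuple):
    reg_dst: Optional[int]
    reg_write: int
    alu_src: int
    alu_op: str
    mem_write: int
    mem_read: int
    mem_to_reg: Optional[int]


CONTROL = {
    "add": Control(1, 1, 0, "add", 0, 0, 0),
    "sub": Control(1, 1, 0, "sub", 0, 0, 0),
    "and": Control(1, 1, 0, "and", 0, 0, 0),
    "or":  Control(1, 1, 0, "or",  0, 0, 0),
    "slt": Control(1, 1, 0, "slt", 0, 0, 0),
    "lw":  Control(0, 1, 1, "add", 0, 1, 1),
    "sw":  Control(X, 0, 1, "add", 1, 0, X),
    "beq": Control(X, 0, 0, "sub", 0, 0, X),
}

# Instructions that never write the register file, and those using the constant as an ALU input.
no_register_write = [op for op, c in CONTROL.items() if c.reg_write == 0]
constant_to_alu = [op for op, c in CONTROL.items() if c.alu_src == 1]
```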

Generating control signals
The control unit needs 13 bits of inputs.
- Six bits make up the instruction's opcode.
- Six bits come from the instruction's func field.
- It also needs the Zero output of the ALU.
The control unit generates 10 bits of output, corresponding to the signals mentioned on the previous page.
You can build the actual circuit by using big K-maps, big Boolean algebra, or big circuit design programs.
The textbook presents a slightly different control unit.
[Figure: control unit block with inputs I[31-26], I[5-0] and Zero, and outputs RegDst, RegWrite, ALUSrc, ALUOp, MemWrite, MemRead, MemToReg and PCSrc.]
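The combinational behaviour of such a control unit can be sketched as a pure function of the opcode, the func field and Zero. The opcode and func values are the standard MIPS encodings; the function name `control_unit`, the choice to return ALUOp as an operation name (it would occupy three of the ten output bits), and resolving don't-care outputs to 0 are assumptions made for this illustration.

```python
R_TYPE, LW, SW, BEQ = 0x00, 0x23, 0x2B, 0x04  # MIPS opcodes
FUNC_TO_ALU = {0x20: "add", 0x22: "sub", 0x24: "and", 0x25: "or", 0x2A: "slt"}


def control_unit(opcode: int, func: int, zero: int) -> dict:
    """Map the 6-bit opcode, 6-bit func field and Zero flag to control signals."""
    if opcode == R_TYPE:
        alu_op = FUNC_TO_ALU[func]      # R-type: the func field picks the operation
    elif opcode in (LW, SW):
        alu_op = "add"                  # effective address computation
    elif opcode == BEQ:
        alu_op = "sub"                  # equality test
    else:
        raise ValueError(f"unsupported opcode {opcode:#x}")
    return {
        "RegDst": int(opcode == R_TYPE),      # don't care for sw/beq, 0 here
        "RegWrite": int(opcode in (R_TYPE, LW)),
        "ALUSrc": int(opcode in (LW, SW)),
        "ALUOp": alu_op,
        "MemWrite": int(opcode == SW),
        "MemRead": int(opcode == LW),
        "MemToReg": int(opcode == LW),        # don't care for sw/beq, 0 here
        "PCSrc": int(opcode == BEQ and zero == 1),
    }
```

Only PCSrc depends on Zero, and only R-type instructions look at the func field, which is why most signals follow from the opcode alone.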

Summary - Single Cycle Datapath
A datapath contains all the functional units and connections necessary to implement an instruction set architecture.
- For our single-cycle implementation, we use two separate memories, an ALU, some extra adders, and lots of multiplexers.
- MIPS is a 32-bit machine, so most of the buses are 32 bits wide.
The control unit tells the datapath what to do, based on the instruction that's currently being executed.
- Our processor has ten control signals that regulate the datapath.
- The control signals can be generated by a combinational circuit with the instruction's 32-bit binary encoding as input.
Now we'll see the performance limitations of this single-cycle machine and try to improve upon it.

Multicycle datapath
We just saw a single-cycle datapath and control unit for our simple MIPS-based instruction set.
A multicycle processor fixes some shortcomings in the single-cycle CPU.
- Faster instructions are not held back by slower ones.
- The clock cycle time can be decreased.
- We don't have to duplicate any hardware units.
A multicycle processor requires a somewhat simpler datapath which we'll see today, but a more complex control unit that we'll see later.

The single-cycle design again
A control unit (not shown) generates the control signals from the instruction's op and func fields.
[Figure: the single-cycle datapath.]

The example add from last time
Consider the instruction add $s4, $t1, $t2.

op      rs     rt     rd     shamt  func
000000  01001  01010  10100  00000  100000

Assume $t1 and $t2 initially contain 1 and 2 respectively.
Executing this instruction involves several steps.
1. The instruction word is read from the instruction memory, and the program counter is incremented by 4.
2. The sources $t1 and $t2 are read from the register file.
3. The values 1 and 2 are added by the ALU.
4. The result (3) is stored back into $s4 in the register file.
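The R-type encoding of this instruction can be checked with a few lines of Python. The register numbers ($t1 = 9, $t2 = 10, $s4 = 20) and the add func code (0x20) are the standard MIPS values; the helper name `encode_r_type` is an illustrative assumption.

```python
def encode_r_type(op: int, rs: int, rt: int, rd: int, shamt: int, func: int) -> int:
    """Pack the six R-type fields (6, 5, 5, 5, 5 and 6 bits) into a 32-bit word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | func


T1, T2, S4 = 9, 10, 20   # MIPS register numbers for $t1, $t2, $s4
ADD_FUNC = 0x20          # func field for add

# add $s4, $t1, $t2: rs and rt are the sources, rd is the destination.
word = encode_r_type(0, T1, T2, S4, 0, ADD_FUNC)
bits = f"{word:032b}"
```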

How the add goes through the datapath
[Figure: the single-cycle datapath annotated with the values that flow through it while the add executes, starting with PC and PC+4.]

State elements
In an instruction like add $t1, $t1, $t2, how do we know $t1 is not updated until after its original value is read?
[Figure: the state elements of the datapath: the register file, the data memory and the PC.]

The datapath and the clock
STEP 1: A new instruction is loaded from memory. The control unit sets the datapath signals appropriately so that
- registers are read,
- ALU output is generated,
- data memory is read and
- branch target addresses are computed.
STEP 2:
- The register file is updated for arithmetic or lw instructions.
- Data memory is written for a sw instruction.
- The PC is updated to point to the next instruction.
In a single-cycle datapath everything in Step 1 must complete within one clock cycle.
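The two steps can be mimicked in software by computing the new value from the current state first and committing it only afterwards, which is roughly what the clock edge does for the state elements. The function name `clock_cycle` and the dictionary-based register file are assumptions made for this sketch.

```python
def clock_cycle(regs: dict, rd: str, rs: str, rt: str) -> dict:
    """Model one cycle of an add: read and compute first, then commit at the clock edge."""
    # STEP 1: combinational work sees only the current register contents.
    result = (regs[rs] + regs[rt]) & 0xFFFFFFFF
    # STEP 2: the state element is updated, producing the next state.
    next_regs = dict(regs)
    next_regs[rd] = result
    return next_regs


before = {"$t1": 1, "$t2": 2}
after = clock_cycle(before, "$t1", "$t1", "$t2")  # add $t1, $t1, $t2
```

Because the write happens only in the second step, the source value of $t1 read in the first step is still the original one even though $t1 is also the destination.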

The slowest instruction...
If all instructions must complete within one clock cycle, then the cycle time has to be large enough to accommodate the slowest instruction.
For example, lw $t0, -4($sp) needs 8ns, assuming the delays shown here.

reading the instruction memory       2ns
reading the base register $sp        1ns
computing memory address $sp-4       2ns
reading the data memory              2ns
storing data back to $t0             1ns
                                     8ns

[Figure: datapath annotated with component delays: instruction memory 2 ns, register file 1 ns, ALU 2 ns, data memory 2 ns.]

...determines the clock cycle time
If we make the cycle time 8ns then every instruction will take 8ns, even if they don't need that much time.
For example, the instruction add $s4, $t1, $t2 really needs just 6ns.

reading the instruction memory       2ns
reading registers $t1 and $t2        1ns
computing $t1 + $t2                  2ns
storing the result into $s4          1ns
                                     6ns

[Figure: datapath annotated with the same component delays.]
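The two latencies can be reproduced by summing the component delays along each instruction's path. The sketch below assumes the delays used in these examples (2ns for each memory and for the ALU, 1ns for a register file read or write); the dictionary names are illustrative.

```python
# Component delays in nanoseconds, as assumed in the examples.
DELAY_NS = {
    "instruction memory": 2,
    "register read": 1,
    "ALU": 2,
    "data memory": 2,
    "register write": 1,
}

# Units each instruction passes through, in order.
PATHS = {
    "lw": ["instruction memory", "register read", "ALU", "data memory", "register write"],
    "add": ["instruction memory", "register read", "ALU", "register write"],
}

latency_ns = {instr: sum(DELAY_NS[unit] for unit in units) for instr, units in PATHS.items()}

# A single-cycle clock must be long enough for the slowest instruction.
cycle_time_ns = max(latency_ns.values())
```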

How bad is this?
With these same component delays, a sw instruction would need 7ns, and beq would need just 5ns.
Let's consider the gcc instruction mix from p. 89 of the textbook.

Instruction   Frequency
Arithmetic    48%
Loads         22%
Stores        11%
Branches      19%

With a single-cycle datapath, each instruction would require 8ns.
But if we could execute instructions as fast as possible, the average time per instruction for gcc would be:
(48% x 6ns) + (22% x 8ns) + (11% x 7ns) + (19% x 5ns) = 6.36ns
The single-cycle datapath is about 1.26 times slower!
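The weighted average is easy to check numerically. In the sketch below the store and branch frequencies are taken as 11% and 19%, the values that make the mix sum to 100% and reproduce the 6.36ns figure; the variable names are illustrative.

```python
# (frequency, time in ns) for each instruction class in the gcc mix.
MIX = {
    "arithmetic": (0.48, 6),
    "loads": (0.22, 8),
    "stores": (0.11, 7),
    "branches": (0.19, 5),
}

SINGLE_CYCLE_NS = 8  # every instruction takes the worst-case time

average_ns = sum(freq * time for freq, time in MIX.values())
slowdown = SINGLE_CYCLE_NS / average_ns
```

Here `average_ns` comes out at about 6.36 and `slowdown` at about 1.26.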

It gets worse...
We've made very optimistic assumptions about memory latency:
- Main memory access on modern machines takes >50ns.
- For comparison, an ALU on the Pentium 4 takes ~0.3ns.
Our worst case cycle (loads/stores) includes 2 memory accesses.
- A modern single-cycle implementation would be stuck at <10MHz.
- Caches will improve common case access time, not worst case.
Tying frequency to the worst case path violates the first law of performance!!

A multistage approach to instruction execution
We've informally described instructions as executing in several steps.
1. Instruction fetch and PC increment.
2. Reading sources from the register file.
3. Performing an ALU computation.
4. Reading or writing (data) memory.
5. Storing data back to the register file.
What if we made these stages explicit in the hardware design?

Performance benefits
Each instruction can execute only the stages that are necessary.
- Arithmetic: stages 1, 2, 3 and 5
- Load: stages 1 to 5
- Store: stages 1 to 4
- Branches: stages 1 to 3
This would mean that instructions complete as soon as possible, instead of being limited by the slowest instruction.

The clock cycle
Things are simpler if we assume that each stage takes one clock cycle.
- This means instructions will require multiple clock cycles to execute.
- But since a single stage is fairly simple, the cycle time can be low.
For the proposed execution stages and the sample datapath delays shown earlier, each stage needs 2ns at most.
- This accounts for the slowest devices, the ALU and data memory.
- A 2ns clock cycle time corresponds to a 500MHz clock rate!
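The clock rate follows from the slowest stage. The sketch below assumes one dominant component per stage, with the sample delays used earlier (2ns for memories and the ALU, 1ns for register file access); the names are illustrative.

```python
# Worst-case delay of each proposed stage, in nanoseconds.
STAGE_DELAY_NS = {
    1: 2,  # instruction fetch (memory) and PC increment (ALU)
    2: 1,  # register file read
    3: 2,  # ALU computation
    4: 2,  # data memory access
    5: 1,  # register file write
}

# Every stage gets one cycle, so the cycle must fit the slowest stage.
cycle_time_ns = max(STAGE_DELAY_NS.values())

# 1 / (cycle time in ns) gives GHz; multiply by 1000 for MHz.
clock_rate_mhz = 1000 / cycle_time_ns
```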

Cost benefits
As an added bonus, we can eliminate some of the extra hardware from the single-cycle datapath.
- We will restrict ourselves to using each functional unit once per cycle, just like before.
- But since instructions require multiple cycles, we could reuse some units in a different cycle during the execution of a single instruction.
For example, we could use the same ALU:
- to increment the PC (first clock cycle), and
- for arithmetic operations (third clock cycle).

Two extra adders
Our original single-cycle datapath had an ALU and two adders.
The arithmetic-logic unit had two responsibilities.
- Doing an operation on two registers for arithmetic instructions.
- Adding a register to a sign-extended constant, to compute effective addresses for lw and sw instructions.
One of the extra adders incremented the PC by computing PC + 4.
The other adder computed branch targets, by adding a sign-extended, shifted offset to (PC + 4).

The extra single-cycle adders
[Figure: the single-cycle datapath with the two extra adders (PC + 4 and branch target) highlighted.]

Our new adder setup
We can eliminate both extra adders in a multicycle datapath, and instead use just one ALU, with multiplexers to select the proper inputs.
A 2-to-1 mux ALUSrcA sets the first ALU input to be the PC or a register.
A 4-to-1 mux ALUSrcB selects the second ALU input from among:
- the register file (for arithmetic operations),
- a constant 4 (to increment the PC),
- a sign-extended constant (for effective addresses), and
- a sign-extended and shifted constant (for branch targets).
This permits a single ALU to perform all of the necessary functions.
- Arithmetic operations on two register operands.
- Incrementing the PC.
- Computing effective addresses for lw and sw.
- Adding a sign-extended, shifted offset to (PC + 4) for branches.
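The two multiplexers can be sketched as a small selection function. The select encodings used here (ALUSrcA: 0 = PC, 1 = register; ALUSrcB: 0 = register, 1 = constant 4, 2 = sign-extended constant, 3 = sign-extended and shifted constant) are an assumption made for illustration, as are the function and parameter names.

```python
def alu_inputs(alu_src_a: int, alu_src_b: int, pc: int, reg_a: int, reg_b: int, imm: int):
    """Return the two ALU operands chosen by ALUSrcA and ALUSrcB.

    imm is the already sign-extended 16-bit constant.
    """
    first = pc if alu_src_a == 0 else reg_a          # 2-to-1 mux
    second = (reg_b, 4, imm, imm << 2)[alu_src_b]    # 4-to-1 mux
    return first, second


# The four uses of the single ALU, with hypothetical operand values.
pc, reg_a, reg_b, imm = 0x1000, 7, 9, -4
arithmetic = alu_inputs(1, 0, pc, reg_a, reg_b, imm)       # register op register
pc_increment = alu_inputs(0, 1, pc, reg_a, reg_b, imm)     # PC + 4
address = alu_inputs(1, 2, pc, reg_a, reg_b, imm)          # register + constant
branch_target = alu_inputs(0, 3, pc, reg_a, reg_b, imm)    # PC + (constant << 2)
```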

The multicycle adder setup highlighted
[Figure: the multicycle datapath with the single ALU and its input multiplexers ALUSrcA and ALUSrcB highlighted; the ALUSrcB inputs are the register file output, the constant 4, the sign-extended constant, and the sign-extended, shifted constant.]

Eliminating a memory
Similarly, we can get by with one unified memory, which will store both program instructions and data (a Princeton architecture).
This memory is used in both the instruction fetch and data access stages, and the address could come from either:
- the PC register (when we're fetching an instruction), or
- the ALU output (for the effective address of a lw or sw).
We add another 2-to-1 mux, IorD, to decide whether the memory is being accessed for instructions or for data.
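A minimal sketch of the IorD selection, assuming that 0 selects the PC (instruction fetch) and 1 selects the ALU output (data access). The dictionary standing in for the unified memory and its contents are hypothetical.

```python
def memory_address(i_or_d: int, pc: int, alu_out: int) -> int:
    """2-to-1 mux: choose the address presented to the unified memory."""
    return pc if i_or_d == 0 else alu_out


# Hypothetical unified memory holding one instruction word and one data word.
memory = {0x0000: 0x012AA020, 0x0100: 42}

fetched_instruction = memory[memory_address(0, pc=0x0000, alu_out=0x0100)]
loaded_data = memory[memory_address(1, pc=0x0000, alu_out=0x0100)]
```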

The new memory setup highlighted
[Figure: the multicycle datapath with the unified memory and the IorD multiplexer highlighted.]

Intermediate registers
Sometimes we need the output of a functional unit in a later clock cycle during the execution of one instruction.
- The instruction word fetched in stage 1 determines the destination of the register write in stage 5.
- The ALU result for an address computation in stage 3 is needed as the memory address for lw or sw in stage 4.
These outputs will have to be stored in intermediate registers for future use. Otherwise they would probably be lost by the next clock cycle.
- The instruction read in stage 1 is saved in the Instruction register.
- Register file outputs from stage 2 are saved in registers A and B.
- The ALU output will be stored in a register ALUOut.
- Any data fetched from memory in stage 4 is kept in the Memory data register, also called MDR.

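The final multicycle datapath
[Figure: the final multicycle datapath, with the PC, the unified memory, the Instruction register (IR) split into fields [31-26], [25-21], [20-16], [15-11] and [15-0], the Memory data register, the register file, registers A and B, the ALU with multiplexers ALUSrcA and ALUSrcB, ALUOut, sign extend and shift left 2, and the control signals IorD, RegDst, MemToReg, RegWrite and ALUOp.]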

Register write control signals
We have to add a few more control signals to the datapath.
Since instructions now take a variable number of cycles to execute, we cannot update the PC on each cycle.
- Instead, a PCWrite signal controls the loading of the PC.
- The instruction register also has a write signal, IRWrite. We need to keep the instruction word for the duration of its execution, and must explicitly re-load the instruction register when needed.
The other intermediate registers, MDR, A, B and ALUOut, will store data for only one clock cycle at most, and do not need write control signals.
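The difference between write-controlled registers and the always-latching intermediate registers can be sketched as follows. The class name `MulticycleRegs`, the method `clock_edge` and the sample values are assumptions made for illustration only.

```python
class MulticycleRegs:
    """Registers of a multicycle datapath; only PC and IR have write enables."""

    def __init__(self) -> None:
        self.pc = self.ir = self.mdr = self.a = self.b = self.alu_out = 0

    def clock_edge(self, *, next_pc, mem_data, reg_a, reg_b, alu_result,
                   pc_write, ir_write) -> None:
        if pc_write:               # PC loads only when PCWrite is asserted
            self.pc = next_pc
        if ir_write:               # IR loads only when IRWrite is asserted
            self.ir = mem_data
        # MDR, A, B and ALUOut latch on every clock edge.
        self.mdr, self.a, self.b, self.alu_out = mem_data, reg_a, reg_b, alu_result


regs = MulticycleRegs()
# Cycle 1 (fetch): load IR with the instruction word and PC with PC + 4.
regs.clock_edge(next_pc=4, mem_data=0x012AA020, reg_a=0, reg_b=0,
                alu_result=4, pc_write=True, ir_write=True)
# Cycle 2: PC and IR are held; the other registers pick up new values.
regs.clock_edge(next_pc=99, mem_data=0, reg_a=1, reg_b=2,
                alu_result=3, pc_write=False, ir_write=False)
```

After the second edge the PC and IR still hold the values from the fetch, while MDR, A, B and ALUOut have been overwritten.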

Summary
A single-cycle CPU has two main disadvantages.
- The cycle time is limited by the worst case latency.
- It requires more hardware than necessary.
A multicycle processor splits instruction execution into several stages.
- Instructions only execute as many stages as required.
- Each stage is relatively simple, so the clock cycle time is reduced.
- Functional units can be reused on different cycles.
We made several modifications to the single-cycle datapath.
- The two extra adders and one memory were removed.
- Multiplexers were inserted so the ALU and memory can be used for different purposes in different execution stages.
- New registers are needed to store intermediate results.
Next time, we'll look at controlling this datapath.