The single-cycle design from last time

Multicycle datapath

Last time we saw a single-cycle datapath and control unit for our simple MIPS-based instruction set.

A multicycle processor fixes some shortcomings in the single-cycle CPU.
- Faster instructions are not held back by slower ones.
- The clock cycle time can be decreased.
- We don't have to duplicate any hardware units.

A multicycle processor requires a somewhat simpler datapath, which we'll see today, but a more complex control unit that we'll save for next week.

The single-cycle design from last time

[Figure: the single-cycle datapath, showing the PC, instruction memory, register file, ALU, data memory, sign extend and shift-left-2 units, the two extra adders, and the control signals RegWrite, RegDst, ALUSrc, ALUOp, MemRead, MemWrite, MemToReg and PCSrc.]

A control unit (not shown) generates all the datapath control signals from the instruction's op and func fields.

The example add from last time

Consider the instruction add $s4, $t1, $t2.

  op      rs     rt     rd     shamt  func
  000000  01001  01010  10100  00000  100000

Assume $t1 and $t2 initially contain 1 and 2 respectively.

Executing this instruction involves several steps.
1. The instruction word is read from the instruction memory, and the program counter is incremented by 4.
2. The sources $t1 and $t2 are read from the register file.
3. The values 1 and 2 are added by the ALU.
4. The result (3) is stored back into $s4 in the register file.
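As a rough illustration of the field layout above, the sketch below packs the six R-type fields of this add into a 32-bit word. The register numbers use the standard MIPS numbering ($t1 = 9, $t2 = 10, $s4 = 20); the helper name encode_rtype is made up for this example.

```python
# Standard MIPS register numbers for the three registers in the example.
REGS = {"$t1": 9, "$t2": 10, "$s4": 20}

def encode_rtype(rd, rs, rt, shamt=0, funct=0x20, op=0):
    """Pack op, rs, rt, rd, shamt and func into one 32-bit instruction word.

    The defaults (op = 0, func = 0x20) are the values for add.
    """
    return ((op << 26) | (REGS[rs] << 21) | (REGS[rt] << 16)
            | (REGS[rd] << 11) | (shamt << 6) | funct)

word = encode_rtype("$s4", "$t1", "$t2")   # add $s4, $t1, $t2
bits = f"{word:032b}"
# Split the bit string at the field boundaries: 6, 5, 5, 5, 5, 6 bits.
fields = [bits[0:6], bits[6:11], bits[11:16], bits[16:21], bits[21:26], bits[26:32]]
print(" ".join(fields))
print(hex(word))
```

The rs and rt fields carry the two sources and rd carries the destination, which is why the destination register number is only known once the instruction word has been fetched.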

How the add goes through the datapath

[Figure: the single-cycle datapath annotated with the values that flow through it while add $s4, $t1, $t2 executes, including PC+4 being written back to the PC.]

Edge-triggered state elements

In an instruction like add $t1, $t1, $t2, how do we know $t1 is not updated until after its original value is read?

We'll assume that our state elements are positive edge triggered, and can be updated only on the positive edge of a clock signal.
- The register file and data memory have explicit write control signals, RegWrite and MemWrite. These units can be written to only if the control signal is asserted and there is a positive clock edge.
- In a single-cycle machine the PC is updated on each clock cycle, so we don't bother to give it an explicit write control signal.

[Figure: the register file (with RegWrite), the data memory (with MemWrite) and the PC.]
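The behaviour described above can be sketched with a toy register file whose reads are combinational and whose writes only commit on a positive clock edge. The class and method names here are invented for illustration; this is a minimal model, not a description of real hardware.

```python
class RegisterFile:
    """Toy register file: reads see current state, writes commit at the edge."""

    def __init__(self, values):
        self.values = dict(values)
        self.pending = None          # (register, data) waiting for the clock edge

    def read(self, reg):
        return self.values[reg]

    def request_write(self, reg, data, reg_write):
        # The write is only scheduled if the RegWrite control signal is asserted.
        self.pending = (reg, data) if reg_write else None

    def positive_edge(self):
        # State changes happen here and nowhere else.
        if self.pending is not None:
            reg, data = self.pending
            self.values[reg] = data
            self.pending = None

rf = RegisterFile({"$t1": 1, "$t2": 2})

# add $t1, $t1, $t2: during the cycle the old $t1 is read and the sum computed.
total = rf.read("$t1") + rf.read("$t2")
rf.request_write("$t1", total, reg_write=True)
before_edge = rf.read("$t1")     # still the original value
rf.positive_edge()
after_edge = rf.read("$t1")      # updated only after the edge
print(before_edge, after_edge)
```

Because the write is deferred to the edge, the ALU sees the original $t1 for the whole cycle even though $t1 is also the destination.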

The datapath and the clock

1. On a positive clock edge, the PC is updated with a new address.
2. A new instruction can then be loaded from memory. The control unit sets the datapath signals appropriately so that
   - registers are read,
   - ALU output is generated,
   - data memory is read or written, and
   - branch target addresses are computed.
3. Several things happen on the next positive clock edge.
   - The register file is updated for arithmetic or lw instructions.
   - Data memory is written for a sw instruction.
   - The PC is updated to point to the next instruction.

In a single-cycle datapath everything in Step 2 must complete within one clock cycle, before the next positive clock edge.

The slowest instruction...

If all instructions must complete within one clock cycle, then the cycle time has to be large enough to accommodate the slowest instruction.

For example, lw $t0, -4($sp) needs 8ns, assuming the delays shown here.

  reading the instruction memory      2ns
  reading the base register $sp       1ns
  computing memory address $sp-4      2ns
  reading the data memory             2ns
  storing data back to $t0            1ns
  total                               8ns

[Figure: the single-cycle datapath annotated with component delays: instruction memory 2ns, register file 1ns, ALU 2ns, data memory 2ns.]

...determines the clock cycle time

If we make the cycle time 8ns then every instruction will take 8ns, even if they don't need that much time.

For example, the instruction add $s4, $t1, $t2 really needs just 6ns.

  reading the instruction memory      2ns
  reading registers $t1 and $t2       1ns
  computing $t1 + $t2                 2ns
  storing the result into $s4         1ns
  total                               6ns

[Figure: the single-cycle datapath annotated with the same component delays as before.]
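The two delay tables can be reproduced by summing the delays of the units each instruction passes through. The sketch below does this for lw and add; the dictionary keys are names chosen for this example, and the delays are the sample values from the slides.

```python
# Sample component delays from the slides, in nanoseconds.
DELAY_NS = {"imem": 2, "reg_read": 1, "alu": 2, "dmem": 2, "reg_write": 1}

# Units each instruction uses, in order.
PATHS = {
    "lw":  ["imem", "reg_read", "alu", "dmem", "reg_write"],
    "add": ["imem", "reg_read", "alu", "reg_write"],
}

latency_ns = {name: sum(DELAY_NS[unit] for unit in units)
              for name, units in PATHS.items()}

# A single-cycle clock must be long enough for the slowest instruction.
cycle_time_ns = max(latency_ns.values())
wasted_ns = {name: cycle_time_ns - t for name, t in latency_ns.items()}

print(latency_ns)      # per-instruction latency
print(cycle_time_ns)   # single-cycle clock period
print(wasted_ns)       # idle time per instruction
```

With these numbers an add leaves 2ns of every 8ns cycle unused.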

How bad is this?

With these same component delays, a sw instruction would need 7ns, and beq would need just 5ns.

Let's consider the gcc instruction mix from p. 89 of the textbook (ed2).

  Instruction   Frequency
  Arithmetic    48%
  Loads         22%
  Stores        11%
  Branches      19%

With a single-cycle datapath, each instruction would require 8ns. But if we could execute instructions as fast as possible, the average time per instruction for gcc would be:

  (48% x 6ns) + (22% x 8ns) + (11% x 7ns) + (19% x 5ns) = 6.36ns

The single-cycle datapath is about 1.26 times slower!
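The weighted average above is easy to check numerically. This sketch uses the gcc mix with stores at 11% and branches at 19%, the values that make the frequencies sum to 100% and the total come out to 6.36ns.

```python
# (frequency, ideal latency in ns) for each instruction class in the gcc mix.
MIX = {
    "arithmetic": (0.48, 6),
    "loads":      (0.22, 8),
    "stores":     (0.11, 7),
    "branches":   (0.19, 5),
}

SINGLE_CYCLE_NS = 8   # every instruction takes the full cycle

# Average time per instruction if each ran as fast as its own path allows.
ideal_avg_ns = sum(freq * ns for freq, ns in MIX.values())
slowdown = SINGLE_CYCLE_NS / ideal_avg_ns

print(round(ideal_avg_ns, 2))
print(round(slowdown, 2))
```

The slowdown is simply the ratio of the fixed 8ns cycle to the ideal average.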

It gets worse...

Our small instruction set includes only very simple operations. If we supported more complex, time-consuming instructions, then the performance penalty of a single-cycle machine could be much higher.
- Integer multiplication and division, or floating-point operations
- Complex addressing modes like the 8086
- Vector-based, SIMD instructions like MMX

...and worse...

A single-cycle datapath also uses extra hardware: one ALU is not enough, since we must do up to three calculations in one clock cycle for a beq.

[Figure: the single-cycle datapath, with the ALU and the two extra adders.]

...and worse

This is also why we used a Harvard architecture with two memories; you can't easily read two addresses from the same memory in one cycle.

[Figure: the single-cycle datapath, with separate instruction and data memories.]

A multistage approach to instruction execution

We've informally described instructions as executing in several steps.
1. Instruction fetch and PC increment.
2. Reading sources from the register file.
3. Performing an ALU computation.
4. Reading or writing (data) memory.
5. Storing data back to the register file.

What if we made these stages explicit in the hardware design?

Performance benefits

Each instruction can execute only the stages that are necessary.
- Arithmetic operations never read or write memory.
- A sw instruction does not save anything to the register file.
- Branches neither access memory nor write to the registers.

This would mean that instructions complete as soon as possible, instead of being limited by the slowest instruction.
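One way to picture this is a table of which of the five proposed stages each instruction class needs. The sketch below encodes the three observations above; the stage names are shorthand chosen for this example, and lw is assumed to use all five stages.

```python
STAGES = ["fetch", "reg_read", "alu", "mem", "write_back"]

# Stages each class skips, following the observations on the slide.
SKIPPED = {
    "arithmetic": {"mem"},                 # never reads or writes memory
    "lw":         set(),                   # assumed to need every stage
    "sw":         {"write_back"},          # saves nothing to the register file
    "beq":        {"mem", "write_back"},   # no memory access, no register write
}

# Keep stage order by filtering the full list.
stages_used = {name: [s for s in STAGES if s not in skip]
               for name, skip in SKIPPED.items()}
stage_count = {name: len(used) for name, used in stages_used.items()}

for name, used in stages_used.items():
    print(f"{name:10s} {len(used)} stages: {', '.join(used)}")
```

Under the one-stage-per-cycle assumption made on the next slide, these counts are also the number of clock cycles each class takes.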

The clock cycle

Things are simpler if we assume that each stage takes one clock cycle.
- This means instructions will require multiple clock cycles to execute.
- But since a single stage is fairly simple, the cycle time can be low.

For the proposed execution stages and the sample datapath delays shown earlier, each stage needs 2ns at most.
- This accounts for the slowest devices, the ALU and memory.
- A 2ns clock cycle time corresponds to a 500MHz clock rate!
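The cycle time and clock rate follow directly from the slowest stage. A small sketch, using the sample delays from earlier and stage names chosen for this example:

```python
# Worst-case delay of each proposed stage, in nanoseconds (sample values).
STAGE_DELAY_NS = {
    "fetch": 2,        # instruction memory
    "reg_read": 1,     # register file read
    "alu": 2,          # ALU computation
    "mem": 2,          # data memory access
    "write_back": 1,   # register file write
}

# Every stage gets one cycle, so the cycle must fit the slowest stage.
cycle_ns = max(STAGE_DELAY_NS.values())

# A period of T nanoseconds is a frequency of 1000 / T megahertz.
clock_mhz = 1000 / cycle_ns

print(cycle_ns, "ns per cycle")
print(clock_mhz, "MHz")
```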

Cost benefits

As an added bonus, we can eliminate some of the extra hardware from the single-cycle datapath.
- We are still restricted to using each functional unit once per cycle, just like before.
- But since instructions require multiple cycles, we could reuse some units in a different cycle during the execution of a single instruction.

For example, we could use an ALU to increment the PC in the first clock cycle of an instruction execution, and then reuse that ALU for arithmetic operations in the third cycle.

Two extra adders

Our original single-cycle datapath had an ALU and two adders.

The arithmetic-logic unit had two responsibilities.
- Doing an operation on two registers for arithmetic instructions. (3rd stage)
- Adding a register to a sign-extended constant, to compute effective addresses for lw and sw instructions. (3rd stage)

One of the extra adders incremented the PC by computing PC + 4. (1st stage)

The other adder computed branch targets, by adding a sign-extended, shifted offset to (PC + 4). (3rd stage)

The extra single-cycle adders

[Figure: the single-cycle datapath with the two extra adders highlighted: the PC + 4 adder and the branch target adder.]

Our new adder setup

We can eliminate both extra adders in a multicycle datapath, and instead use just one ALU, with multiplexers to select the proper inputs.

A 2-to-1 mux ALUSrcA sets the first ALU input to be the PC or a register.

A 4-to-1 mux ALUSrcB selects the second ALU input from among:
- the register file (for arithmetic operations),
- a constant 4 (to increment the PC),
- a sign-extended constant (for effective addresses), and
- a sign-extended and shifted constant (for branch targets).

This permits a single ALU to perform all of the necessary functions.
- Arithmetic operations on two register operands.
- Incrementing the PC.
- Computing effective addresses for lw and sw.
- Adding a sign-extended, shifted offset to (PC + 4) for branches.
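The two multiplexers can be sketched as a small function that picks the ALU operands. The select encodings used here (ALUSrcA: 0 = PC, 1 = register; ALUSrcB: 0 = register, 1 = constant 4, 2 = sign-extended constant, 3 = sign-extended and shifted constant) are an assumption for this example, as are the function and parameter names.

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def alu_inputs(alu_src_a, alu_src_b, pc, reg_a, reg_b, imm16):
    """Return the (first, second) ALU operands chosen by the two muxes."""
    first = reg_a if alu_src_a else pc          # 2-to-1 mux (assumed encoding)
    ext = sign_extend16(imm16)
    second = (reg_b, 4, ext, ext << 2)[alu_src_b]   # 4-to-1 mux (assumed encoding)
    return first, second

# Arbitrary sample values for the PC, two register outputs and an immediate.
pc, reg_a, reg_b, imm16 = 0x1000, 0x2000, 7, 0xFFFC   # imm16 encodes -4

arith     = alu_inputs(1, 0, pc, reg_a, reg_b, imm16)   # register + register
increment = alu_inputs(0, 1, pc, reg_a, reg_b, imm16)   # PC + 4
address   = alu_inputs(1, 2, pc, reg_a, reg_b, imm16)   # register + offset
branch    = alu_inputs(0, 3, pc, reg_a, reg_b, imm16)   # PC + (offset << 2)

for name, ops in [("arith", arith), ("increment", increment),
                  ("address", address), ("branch", branch)]:
    print(name, ops, "sum =", hex(sum(ops)))
```

Each of the four uses listed above corresponds to one combination of the two select signals, so one adder circuit covers them all.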

The multicycle adder setup highlighted

[Figure: the multicycle datapath with the single ALU and its input multiplexers ALUSrcA and ALUSrcB highlighted; the ALUSrcB inputs include the constant 4, the sign-extended constant, and the sign-extended constant shifted left by 2.]

Eliminating a memory

Similarly, we can get by with one unified memory, which will store both program instructions and data.

This memory is used in both the instruction fetch and data access stages, and the address could come from either:
- the PC register (when we're fetching an instruction), or
- the ALU output (for the effective address of a lw or sw).

We add another 2-to-1 mux, IorD, to decide whether the memory is being accessed for instructions or for data.
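A minimal sketch of the IorD mux in front of a unified memory is shown below. The encoding (0 selects the PC, 1 selects the ALU output) is an assumption for this example, and the memory contents and addresses are made up.

```python
def memory_address(i_or_d, pc, alu_out):
    """2-to-1 mux: pick the address presented to the unified memory.

    Assumed encoding: 0 = instruction fetch (PC), 1 = data access (ALU output).
    """
    return alu_out if i_or_d else pc

# One memory holds both an instruction and a data word (made-up contents).
memory = {
    0x0040: "lw $t0, -4($sp)",   # program instruction
    0x1FFC: 42,                  # data word
}

pc = 0x0040
alu_out = 0x2000 + (-4)          # effective address computed by the ALU

fetched = memory[memory_address(0, pc, alu_out)]   # instruction fetch stage
loaded = memory[memory_address(1, pc, alu_out)]    # data access stage
print(fetched)
print(loaded)
```

The two accesses happen in different stages, which is what lets a single memory port serve both.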

The new memory setup highlighted

[Figure: the multicycle datapath with the unified memory and the IorD address multiplexer highlighted.]

Intermediate registers

Sometimes we need the output of a functional unit in a later clock cycle during the execution of one instruction.
- The instruction word fetched in stage 1 determines the destination of the register write in stage 5.
- The ALU result for an address computation in stage 3 is needed as the memory address for lw or sw in stage 4.

These outputs will have to be stored in intermediate registers for future use. Otherwise they would probably be lost by the next clock cycle.
- The instruction read in stage 1 is saved in the Instruction register.
- Register file outputs from stage 2 are saved in registers A and B.
- The ALU output will be stored in a register ALUOut.
- Any data fetched from memory in stage 4 is kept in the Memory data register, also called MDR.
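To see how the intermediate registers carry values from one stage to the next, here is a rough walk-through of an add, using the stage numbering from the proposed execution stages. It is a software sketch only: the function name, the dictionary-based instruction format and the addresses are invented for illustration, and real control sequencing is not modelled.

```python
def execute_add(pc, memory, regs):
    """Step one add through the stages, recording each intermediate register."""
    trace = []

    # Stage 1: fetch the instruction into IR and increment the PC.
    ir = memory[pc]
    pc += 4
    trace.append(("stage 1", {"IR": ir}))

    # Stage 2: register file outputs are saved in A and B.
    a, b = regs[ir["rs"]], regs[ir["rt"]]
    trace.append(("stage 2", {"A": a, "B": b}))

    # Stage 3: the ALU result is saved in ALUOut.
    alu_out = a + b
    trace.append(("stage 3", {"ALUOut": alu_out}))

    # Stage 4 (memory) is skipped for arithmetic instructions.
    # Stage 5: IR still supplies the destination; ALUOut supplies the data.
    regs[ir["rd"]] = alu_out
    trace.append(("stage 5", {ir["rd"]: alu_out}))

    return pc, trace

memory = {0x1000: {"op": "add", "rd": "$s4", "rs": "$t1", "rt": "$t2"}}
regs = {"$t1": 1, "$t2": 2, "$s4": 0}

new_pc, trace = execute_add(0x1000, memory, regs)
for stage, saved in trace:
    print(stage, saved)
print(hex(new_pc), regs["$s4"])
```

Note that stage 5 depends on values produced in stages 1 and 3, which is exactly why IR and ALUOut have to hold them.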

The final multicycle datapath

[Figure: the final multicycle datapath, showing the PC, the IorD mux, the unified memory, the Instruction register with its fields IR[31-26], [25-21], [20-16], [15-11] and [15-0], the Memory data register, the register file, registers A and B, the ALUSrcA and ALUSrcB muxes, the ALU, ALUOut, and the PCSource mux.]

Register write control signals

We have to add a few more control signals to the datapath.

Since instructions now take a variable number of cycles to execute, we cannot update the PC on each cycle.
- Instead, a PCWrite signal controls the loading of the PC.
- The instruction register also has a write signal, IRWrite. We need to keep the instruction word for the duration of its execution, and must explicitly re-load the instruction register when needed.

The other intermediate registers, MDR, A, B and ALUOut, will store data for only one clock cycle at most, and do not need write control signals.
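The difference between registers with and without a write control signal can be sketched as follows. The Reg class is a made-up model, and the values clocked in are arbitrary.

```python
class Reg:
    """Edge-triggered register, optionally gated by a write control signal."""

    def __init__(self, value=None, has_write_enable=False):
        self.value = value
        self.has_write_enable = has_write_enable

    def clock(self, next_value, write=False):
        # Positive edge: load unless a write signal exists and is deasserted.
        if not self.has_write_enable or write:
            self.value = next_value

ir = Reg(has_write_enable=True)    # loaded only when IRWrite is asserted
alu_out = Reg()                    # no write control: reloads on every edge

# Cycle 1: instruction fetch, IRWrite asserted.
ir.clock("add $s4, $t1, $t2", write=True)
alu_out.clock(100)                 # arbitrary ALU result

# Cycle 2: IRWrite deasserted, so whatever is on the memory output is ignored.
ir.clock("unrelated memory output", write=False)
alu_out.clock(200)                 # overwritten with the new ALU result

print(ir.value)
print(alu_out.value)
```

IR keeps the instruction across cycles, while ALUOut only ever holds the most recent result, which is all that MDR, A, B and ALUOut are required to do.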

Summary

A single-cycle CPU has two main disadvantages.
- The cycle time is limited by the slowest instruction.
- It requires more hardware than necessary.

A multicycle processor splits instruction execution into several stages.
- Instructions only execute as many stages as required.
- Each stage is relatively simple, so the clock cycle time is reduced.
- Functional units can be reused on different cycles.

We made several modifications to the single-cycle datapath.
- The two extra adders and one memory were removed.
- Multiplexers were inserted so the ALU and memory can be used for different purposes in different execution stages.
- New registers are needed to store intermediate results.

Next Monday we'll look at controlling this beast, which will also help us understand how this datapath works.