Prof. Kozyrakis. 1. (10 points) Consider the following fragment of Java code:

EE8 Winter 25 Homework #2 Solutions Due Thursday, Feb 2, 5 PM

1. (10 points) Consider the following fragment of Java code:

    for (i=0; i<=60; i=i+3) a[i] = b[i] + c;

Assume that a and b are arrays of words and the base address of a is in $a0 and the base address of b is in $a1. Register $t0 is associated with variable i and register $s0 is associated with the value of c. You may also assume that any address constants you need are available to be loaded from memory. Write the code for MIPS. How many instructions are executed during the running of this code if there are no array out-of-bounds exceptions thrown? How many memory data references will be made during execution?

Hint: To indicate branching to error handling code you may use syntax such as:

    bne $t, $t, DescriptionOfError

Solution: To test for loop termination, the (address) constant 241 is needed. Assume that it is placed in memory when the program is loaded. This solution assumes that the memory addresses storing the lengths of the arrays are in $a2 and $a3 for a and b respectively:

            lw   $t8, AddressConstant241($zero)  # $t8 = 241
            lw   $t7, 0($a2)                     # $t7 = length of a[]
            lw   $t6, 0($a3)                     # $t6 = length of b[]
            add  $t0, $zero, $zero               # initialize i = 0
    Loop:   slt  $t4, $t0, $zero                 # $t4 = 1 if i < 0
            bne  $t4, $zero, IndexOutOfBounds    # if i < 0, goto Error
            slt  $t4, $t0, $t6                   # $t4 = 0 if i >= length of b[]
            beq  $t4, $zero, IndexOutOfBounds    # if i >= length, goto Error
            slt  $t4, $t0, $t7                   # $t4 = 0 if i >= length of a[]
            beq  $t4, $zero, IndexOutOfBounds    # if i >= length, goto Error
            add  $t1, $a1, $t0                   # $t1 = address of b[i]
            lw   $t2, 0($t1)                     # $t2 = b[i]
            add  $t2, $t2, $s0                   # $t2 = b[i] + c
            add  $t3, $a0, $t0                   # $t3 = address of a[i]
            sw   $t2, 0($t3)                     # a[i] = b[i] + c
            addi $t0, $t0, 12                    # i = i + 12 (3 words of 4 bytes)
            slt  $t4, $t0, $t8                   # $t4 = 1 if $t0 < 241, i.e., i <= 60
            bne  $t4, $zero, Loop                # goto Loop if i <= 60

The loop runs 21 times (i = 0, 3, ..., 60). The number of instructions executed is 4 + 21 × 14 = 298. The number of data references made is 3 + 21 × 2 = 45. Exception and termination checks must be handled correctly (as above).
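The instruction and data-reference tallies for Problem 1 can be checked with a short script. This is only a sketch under stated assumptions: the loop index takes the values 0, 3, ..., 60, there are 4 setup instructions (3 of them loads), and each iteration executes 14 instructions including one load and one store. The function name and parameters are hypothetical, chosen for illustration.

```python
def loop_counts(last_index=60, step=3, setup_instrs=4, loop_instrs=14,
                setup_refs=3, refs_per_iter=2):
    """Tally dynamic instructions and memory data references for the loop.

    All defaults are assumptions read off the MIPS solution: 4 setup
    instructions (3 loads), 14 instructions per iteration, and one lw
    plus one sw per iteration.
    """
    iterations = len(range(0, last_index + 1, step))
    instructions = setup_instrs + iterations * loop_instrs
    data_refs = setup_refs + iterations * refs_per_iter
    return iterations, instructions, data_refs


iterations, instructions, data_refs = loop_counts()
print(iterations, instructions, data_refs)
```

Under these assumptions the script reports 21 iterations, 298 instructions and 45 data references.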

2. (5 points) Suppose we have made the following measurements of average CPI for instructions:

    Instruction           Average CPI
    Arithmetic            1.0 clock cycles
    Data transfer         1.7 clock cycles
    Conditional branch    2.5 clock cycles
    Jump                  2.2 clock cycles

Compute the effective CPI for MIPS. Use the Core MIPS instruction frequencies for SPEC2006int in Figure 3.28 (on page 236 of the 5th edition of the textbook) to obtain the instruction mix.

Solution:

Effective CPI = Sum of (CPI of instruction type × Frequency of execution)

The average instruction frequencies for SPEC2000int and SPEC2000fp are:

    0.457 (arithmetic and logic)
    0.338 (data transfer)
    0.17  (conditional branch)
    0.018 (jump)

Thus, the effective CPI is:

    0.457 × 1.0 + 0.338 × 1.7 + 0.17 × 2.5 + 0.018 × 2.2 = 1.496 (round off to 1.5)

Dividing this answer by 0.98 (to get 1.53) is also fine, as the total instruction percent does not add up to 1.
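The weighted sum in Problem 2 is easy to reproduce. The sketch below assumes the frequencies 0.457, 0.338, 0.17 and 0.018 and the CPIs 1.0, 1.7, 2.5 and 2.2; the dictionary layout and function name are illustrative choices, not part of the assignment.

```python
# (frequency, CPI) per instruction class; values assumed from the solution.
MIX = {
    "arithmetic": (0.457, 1.0),
    "data transfer": (0.338, 1.7),
    "conditional branch": (0.17, 2.5),
    "jump": (0.018, 2.2),
}


def effective_cpi(mix, normalize=False):
    """Return the sum of frequency * CPI over all instruction classes.

    With normalize=True the result is divided by the total frequency,
    which compensates for a mix that does not add up to 1.
    """
    cpi = sum(freq * cycles for freq, cycles in mix.values())
    if normalize:
        cpi /= sum(freq for freq, _ in mix.values())
    return cpi


print(round(effective_cpi(MIX), 3))
print(round(effective_cpi(MIX, normalize=True), 2))
```

The unnormalized value rounds to 1.5. Normalizing by the exact frequency total (0.983) instead of the rounded 0.98 gives a value slightly below 1.53.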

3. (5 points) Computer A has an overall CPI of 1.9 and can be run at a clock rate of 1.8 GHz. Computer B has a CPI of 2.6 and can be run at a clock rate of 2.4 GHz. We have a particular program we wish to run. When compiled for computer A, this program has exactly 100,000 instructions. How many instructions would the program need to have when compiled for computer B, in order for the two computers to have exactly the same execution time for this program?

Solution:

    Time = InstrCount × CPI × Clock Cycle Time
    Time for A = 100,000 × 1.9 × (1 / 1.8 GHz)
    Time for B = InstrCountB × 2.6 × (1 / 2.4 GHz)

If the two execution times should be equal, then:

    InstrCountB = (2.4 GHz × 1.9 × 100,000) / (1.8 GHz × 2.6) = 97,436

Note that the instruction count is lower for computer B than for computer A on the same program. To achieve this in real life, one would need a dramatically different architecture (e.g. B is a CISC machine) or a much more aggressive compiler for B.
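As a quick check of Problem 3, the relation Time = InstrCount × CPI / ClockRate can be solved for the unknown instruction count. The sketch assumes computer A runs 100,000 instructions at CPI 1.9 and 1.8 GHz, and computer B has CPI 2.6 at 2.4 GHz; the function name is hypothetical.

```python
def matching_instruction_count(count_a, cpi_a, clock_a_hz, cpi_b, clock_b_hz):
    """Instruction count on B that gives the same execution time as A."""
    time_a = count_a * cpi_a / clock_a_hz   # seconds on computer A
    return time_a * clock_b_hz / cpi_b      # solve time_b == time_a


count_b = matching_instruction_count(100_000, 1.9, 1.8e9, 2.6, 2.4e9)
print(round(count_b))
```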

4. (10 points) Consider the following idea: Let's modify the instruction set architecture and remove the ability to specify an offset for memory access instructions. Specifically, all load-store instructions with nonzero offsets would become pseudoinstructions and would be implemented using two instructions. For example:

    addi $at, $t1, 4    # add the offset to a temporary
    lw   $t0, $at       # new way of doing lw $t0, 4($t1)

What changes would you make to the single-cycle datapath and control if this simplified architecture were to be used?

Solution: The key is recognizing that we no longer have to go through the ALU and then to memory. We would not want to add zero using the ALU; instead we want to provide a path directly from the Read data 1 output of the Register File to the read/write address lines of the memory (assuming the instruction format does not change). The output of the ALU would no longer connect to memory. The control does not need to change, but some of the control signals are now don't cares. Assuming we are not implementing addi or addiu, it is possible to remove the ALUSrc control signal and the multiplexer that it controls, thus having just the data from the Read data 2 output (of the Register File) going into the ALU. This results in additional optimizations to ALU control.

5. (10 points) MIPS chooses to simplify the structure of its instructions. The way we implement complex instructions through the use of MIPS instructions is to decompose such complex instructions into multiple simpler MIPS ones. Show how MIPS can implement the instruction swap $rs, $rt, which swaps the contents of registers $rs and $rt, in software, i.e., using MIPS instructions. Consider the case in which there is an available register that may be used as well as the case in which no such register exists. If the implementation of this instruction in hardware will increase the clock period of a single-instruction implementation by 8%, what percentage of swap operations in the instruction mix would be needed to recommend implementing it in hardware? What if the clock period would increase by 15%?

Solution:

Available register ($rd) case: swap $rs,$rt can be implemented as follows:

    addi $rd,$rs,0
    addi $rs,$rt,0
    addi $rt,$rd,0

No available register case:

    sw   $rs,temp($r0)
    addi $rs,$rt,0
    lw   $rt,temp($r0)

Alternate solution:

    xor $rs,$rs,$rt
    xor $rt,$rs,$rt
    xor $rs,$rs,$rt

Clock cycle tradeoff evaluation: Software takes three cycles, and hardware takes one cycle. Let Rs be the ratio of swaps in the code mix. Also, assume a base CPI = 1 (which it is for the MIPS). Let T be the original clock period and T_hw the lengthened clock period of the hardware implementation. Now:

Avg time per instruction:

    (Software): Rs × 3 × T + (1 - Rs) × 1 × T = (2Rs + 1) × T
    (Hardware): T_hw

Hardware implementation makes sense only if:

    T_hw <= (2Rs + 1) × T

8% increase in clock period: T_hw = 1.08 × T, so 1.08 <= 2Rs + 1, i.e. if swap instructions are greater than 4% of the instruction mix (Rs >= 0.04), then a hardware implementation would be preferable.

15% increase in clock period: T_hw = 1.15 × T, so 1.15 <= 2Rs + 1, i.e. if swap instructions are greater than 7.5% of the instruction mix, then a hardware implementation would be preferable.
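Both parts of the swap solution can be illustrated briefly. The sketch below assumes a base CPI of 1, a three-instruction software swap, and clock-period increases of 8% and 15%; it also mirrors the three-XOR sequence on plain integers. The function names are hypothetical.

```python
def break_even_swap_fraction(clock_increase, software_cycles=3):
    """Smallest swap fraction Rs for which hardware swap is no slower.

    Solves (1 + clock_increase) * T <= (Rs * software_cycles + (1 - Rs)) * T
    for Rs, assuming every other instruction takes one cycle.
    """
    return clock_increase / (software_cycles - 1)


def xor_swap(rs, rt):
    """Mirror of: xor $rs,$rs,$rt ; xor $rt,$rs,$rt ; xor $rs,$rs,$rt."""
    rs ^= rt
    rt ^= rs
    rs ^= rt
    return rs, rt


print(break_even_swap_fraction(0.08))
print(break_even_swap_fraction(0.15))
print(xor_swap(5, 9))
```

One caveat worth keeping in mind: if $rs and $rt name the same register, the XOR sequence clears it instead of leaving it unchanged.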

6. (20 points) The following C program is compiled into MIPS objects with no optimization and with -O2 optimization.

    int A[], B[];

    main()
    {
        int i;
        int c = ;

        for (i=0; i < 100; i++)
            A[i] = B[i] + c;
    }

Unoptimized Code:

     0: li gp, 0
     4: addi gp, gp, 0
     8: add gp, gp, t9
     c: addi sp, sp, -24
    10: sw gp, (sp)
    14: sw fp, 2(sp)
    18: sw gp, 6(sp)
    1c: move fp, sp
    20: li v,
    24: sw v, 2(fp)
    28: sw zero, 8(fp)
    2c: lw v, 8(fp)
    30: slti v, v, 100
    34: bne v, zero, 3c
    38: j 88
    3c: lw v, 8(fp)
    40: move v, v
    44: sll v, v, 2
    48: lw v, (gp)
    4c: add v, v, v
    50: lw v, 8(fp)
    54: move a, v
    58: sll v, a, 2
    5c: lw a, 4(gp)
    60: add v, v, a
    64: lw a, (v)
    68: lw v, 2(fp)
    6c: add a, a, v
    70: sw a, (v)
    74: lw v, 8(fp)
    78: addi v, v, 1
    7c: move v, v
    80: sw v, 8(fp)
    84: j 2c
    88: move sp, s8
    8c: lw fp, 2(sp)
    90: addi sp, sp, 24
    94: jr ra

Optimized with -O2:

     0: li gp, 0
     4: addi gp, gp, 0
     8: add gp, gp, t9
     c: li a2,
    10: move a, zero
    14: lw a, (gp)
    18: lw v, 4(gp)
    1c: lw v, (v)
    20: addi v, v, 4
    24: addi a, a, 1
    28: add v, v, a2
    2c: sw v, (a)
    30: slti v, a, 100
    34: addi a, a, 4
    38: bne v, zero, 1c
    3c: jr ra

a. (10 points) Please identify the optimizations used by the compiler to transform the code from the unoptimized version into the optimized one and point out where they are applied. Note: the 0s seen in the first few lines in both versions of the function are only place holders for unknown constants, so you should

not assume that gp is initialized to 0. Furthermore, t9 in both versions contains the offset between gp and the address storing the pointer to array A.

Solution:

Copy propagation: Instructions 40, 54 and 7c are removed.

Arithmetic identity/algebraic simplification: Since (i+1) × 4 == (i × 4) + 4, instructions 40 and 4c, and 54 and 60, which compute the addresses of the new A[i] and B[i], are transformed to 34 and 20 respectively.

Leaf routine optimization: It is a leaf routine and there is no need to save and restore fp and gp. There is also no need to store i and c on the stack since they are only used locally. As a result no stack space needs to be allocated. Thus instructions c-18, 24, 3c, 50, 68, 74, 80 and 88-90 in the unoptimized code are removed, and 28-2c are reduced to instruction 10 in the optimized version.

Loop invariant code motion: Since the arrays A and B are in static memory, instructions 48 and 5c, which load the base addresses of A and B, are moved above the loop (instructions 14-18 in the optimized code) to reduce the number of dynamic instructions.

Loop inversion: Since the lower and upper bounds of the for loop are constants, the loop can be transformed into a while loop that has a lower loop overhead. Thus, instructions 30-38 and 84 are transformed to 30 and 38 in the optimized version.

b. (7 points) Please compute the number of dynamic instructions and show the instruction mix (types: ALU, Branch, Memory) for both versions of the code.

Unoptimized version: 11 (before loop) + 22 (in loop) × 100 + 7 (after loop) = 2218

    ALU      1009/2218 = 45.5%
    Branch    202/2218 =  9.1%
    Memory   1007/2218 = 45.4%

Optimized version: 7 (before loop) + 8 (in loop) × 100 + 1 (after loop) = 808

    ALU      505/808 = 62.5%
    Branch   101/808 = 12.5%
    Memory   202/808 = 25%

c. (3 points) In the optimized code, find the code or data references that need to be resolved by the linker.

The constants in instructions 0 and 4, which initialize $gp to point to the middle of the static data area of memory, need to be resolved by the linker. The register $t9 accounts for the offset between the initial value of $gp and where the base address of the first array is stored. The branch at 38 is PC-relative, so it does not need to be resolved by the linker.
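The instruction-mix percentages for Problem 6b can be recomputed from per-class totals. The sketch below assumes a loop of 100 iterations and the class totals 1009/202/1007 (unoptimized) and 505/101/202 (optimized); the dictionary and function names are illustrative only.

```python
# Per-class dynamic instruction counts, assumed from the solution to 6b.
UNOPTIMIZED = {"ALU": 1009, "Branch": 202, "Memory": 1007}
OPTIMIZED = {"ALU": 505, "Branch": 101, "Memory": 202}


def instruction_mix(counts):
    """Return (total, {class: percentage}) for a dict of dynamic counts."""
    total = sum(counts.values())
    shares = {kind: 100.0 * n / total for kind, n in counts.items()}
    return total, shares


for name, counts in (("unoptimized", UNOPTIMIZED), ("optimized", OPTIMIZED)):
    total, shares = instruction_mix(counts)
    summary = ", ".join(f"{kind} {pct:.1f}%" for kind, pct in shares.items())
    print(f"{name}: {total} instructions ({summary})")
```

Comparing the two totals shows that the optimized version executes a little over a third of the dynamic instructions of the unoptimized one, with most of the saving coming from removed memory accesses.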

7. (5 points) Using the figure below, show all the necessary data and control path for instruction jalr rd, rs in the single-cycle MIPS processor discussed in lecture.

[Figure: single-cycle MIPS datapath and control (PC, instruction memory, register file, ALU, data memory, sign extend, ALU control, jump address logic).]

[Solution figure: the same datapath annotated with the additions for jalr; the added paths are labeled "Jalr" and "PC + 4".]

8. (15 Points) It happens quite often that we wish to index through and access each element of an array. Absent from MIPS, but present in other assembly languages/instruction sets, are load/store commands which also increment the indexing register. For example, lwinc $rt, offset($rs) would perform the normal load and subsequently increment $rs by 4. Please either describe in words, or show in the figure below, all necessary modifications needed to support these instructions in the single-cycle MIPS processor discussed in lecture.

    load / store    Rs       Rt       Offset
    31:26           25:21    20:16    15:0

The datapath requires an additional ALU to increment the content of the $rs register (Read data 1) by 4 (7 points). The output of this ALU is fed back to the register file, which needs a second write port (8 points) because two writes to the register file are required in a single cycle. The new write port will be controlled by a new signal, "Write 2." We assume that the destination register for the second write is always the same as Read register 1 ($rs). This way "Write 2" indicates that there is a second write to the register file, to the register identified by "Read register 1," and the data is fed through Write data 2.

Adding a second register file would be incorrect since then the contents of the two would have to be kept consistent.

9. (2 Points) The popular x86 instruction set by Intel allows arithmetic instructions to directly access memory for one of their source operands. The primary benefit is that fewer instructions will be executed because we won't have to first load that source operand into a register. The primary disadvantage is that the cycle time will have to increase to account for the additional time to read memory during the arithmetic instruction. Consider adding a new instruction to the MIPS ISA:

    addm $t2, $t3, $t4    // $t2 = $t3 + Memory[$t4]

a). (5 Points) Consider the single-cycle MIPS processor datapath shown below. Show the datapath changes needed to implement addm. Describe each change in 1-2 sentences. Name control signals, but don't worry about their values for now.

b). (5 Points) Determine the control signals necessary to implement addm in the single-cycle MIPS processor. For each control signal specify in the following table whether it needs to be 0, 1, or X (don't care) to implement addm. There are additional lines for the control signals needed for datapath changes you made in 2.c. The ALUop control signal can take one of the following values: add, sub, or, X.