Prof. Kozyrakis. 1. (10 points) Consider the following fragment of Java code:


EE8 Winter 25 Homework #2 Solutions
Due Thursday, Feb 2, 5 PM

1. (10 points) Consider the following fragment of Java code:

    for (i=0; i<=6; i=i+3) a[i] = b[i] + c;

Assume that a and b are arrays of words and the base address of a is in $a0 and the base address of b is in $a1. Register $t0 is associated with variable i and register $s0 is associated with the value of c. You may also assume that any address constants you need are available to be loaded from memory. Write the code for MIPS. How many instructions are executed during the running of this code if there are no array out-of-bounds exceptions thrown? How many memory data references will be made during execution?

Hint: To indicate branching to error handling code you may use syntax such as:

    bne $t, $t, DescriptionOfError

Solution: To test for loop termination, the (address) constant 24 is needed. Assume that it is placed in memory when the program is loaded. This solution assumes that the memory addresses storing the lengths of the arrays are in $a2 and $a3 for a and b respectively:

          lw   $t8, AddressConstant24($zero)  # $t8 = 24
          lw   $t7, ($a2)                     # $t7 = length of a[]
          lw   $t6, ($a3)                     # $t6 = length of b[]
          add  $t0, $zero, $zero              # initialize i = 0
    Loop: slt  $t4, $t0, $zero                # $t4 = 1 if i < 0
          bne  $t4, $zero, IndexOutOfBounds   # if i < 0, goto Error
          slt  $t4, $t0, $t6                  # $t4 = 0 if i >= length
          beq  $t4, $zero, IndexOutOfBounds   # if i >= length, goto Error
          slt  $t4, $t0, $t7                  # $t4 = 0 if i >= length
          beq  $t4, $zero, IndexOutOfBounds   # if i >= length, goto Error
          add  $t1, $a1, $t0                  # $t1 = address of b[i]
          lw   $t2, ($t1)                     # $t2 = b[i]
          add  $t2, $t2, $s0                  # $t2 = b[i] + c
          add  $t3, $a0, $t0                  # $t3 = address of a[i]
          sw   $t2, ($t3)                     # a[i] = b[i] + c
          addi $t0, $t0, 2                    # i = i + 2
          slt  $t4, $t0, $t8                  # $t4 = 1 if $t0 < 24, i.e., i <= 6
          bne  $t4, $zero, Loop               # goto Loop if i <= 6

The number of instructions executed is 4 + 2 4 = 288. The number of data references made is 3 + 2 2 = 45. Exception and termination checks must be handled correctly (as above).
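The totals for problem 1 follow from the shape of the listing: four instructions (three of them loads) run once before the loop, and every iteration executes the 14 instructions from Loop to the final bne, two of which (one lw, one sw) reference data memory. The following is a minimal Python sketch of that bookkeeping. The function name and the `iterations` parameter are illustrative assumptions, not part of the original solution, and the sketch assumes no bounds check ever fails.

```python
def loop_totals(iterations):
    """Return (instructions, data_references) for the problem 1 listing.

    Assumes every bounds check passes, so each iteration runs all 14
    loop instructions. `iterations` is supplied by the caller.
    """
    setup_instructions = 4       # three lw plus the add that clears i
    setup_data_refs = 3          # the three lw before the loop
    per_iter_instructions = 14   # Loop: slt ... bne Loop
    per_iter_data_refs = 2       # one lw of b[i], one sw to a[i]

    instructions = setup_instructions + iterations * per_iter_instructions
    data_refs = setup_data_refs + iterations * per_iter_data_refs
    return instructions, data_refs
```

Calling `loop_totals(n)` with the number of times the loop body runs gives both answers at once; the iteration count itself comes from the loop bounds in the problem statement.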

2. (5 points) Suppose we have made the following measurements of average CPI for instructions:

    Instruction           Average CPI
    Arithmetic            1.0 clock cycles
    Data transfer         1.7 clock cycles
    Conditional branch    2.5 clock cycles
    Jump                  2.2 clock cycles

Compute the effective CPI for MIPS. Use the Core MIPS instruction frequencies for SPEC2006int in Figure 3.28 (on page 236 of the 5th edition of the textbook) to obtain the instruction mix.

Solution: Effective CPI = Sum of (CPI of instruction type * Frequency of execution)

The average instruction frequencies for SPEC2000int and SPEC2000fp are:

    0.457 (arithmetic and logic)
    0.338 (data transfer)
    0.17  (conditional branch)
    0.018 (jump)

Thus, the effective CPI:

    0.457 * 1.0 + 0.338 * 1.7 + 0.17 * 2.5 + 0.018 * 2.2 = 1.496 (round off to 1.5)

Dividing this answer by 0.98 (to get 1.53) is also fine, as the total instruction percent does not add up to 100.
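The effective CPI in problem 2 is a frequency-weighted sum, which is easy to check numerically. The sketch below assumes the per-class CPIs are 1.0, 1.7, 2.5 and 2.2 and the frequencies are 0.457, 0.338, 0.17 and 0.018, as read from the problem and its solution. The names `effective_cpi`, `mix`, `raw` and `covered` are illustrative.

```python
def effective_cpi(mix):
    """Sum of frequency * CPI over (frequency, cpi) pairs."""
    return sum(freq * cpi for freq, cpi in mix)


# (frequency, CPI) per instruction class; values assumed from the problem.
mix = [
    (0.457, 1.0),  # arithmetic and logic
    (0.338, 1.7),  # data transfer
    (0.170, 2.5),  # conditional branch
    (0.018, 2.2),  # jump
]

raw = effective_cpi(mix)                   # weighted sum over the listed classes
covered = sum(freq for freq, _ in mix)     # fraction of instructions accounted for
```

Here `covered` comes out just under 1, which is why the solution also accepts dividing the weighted sum by roughly 0.98.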

3. (5 points) Computer A has an overall CPI of 1.9 and can be run at a clock rate of 1.8 GHz. Computer B has a CPI of 2.6 and can be run at a clock rate of 2.4 GHz. We have a particular program we wish to run. When compiled for computer A, this program has exactly 100,000 instructions. How many instructions would the program need to have when compiled for Computer B, in order for the two computers to have exactly the same execution time for this program?

Solution: Time = InstrCount * CPI * Clock Cycle Time

    Time for A = 100,000 * 1.9 * (1 / 1.8 GHz)
    Time for B = InstrCountB * 2.6 * (1 / 2.4 GHz)

If the two execution times should be equal, then:

    InstrCountB = (2.4 GHz * 1.9 * 100,000) / (1.8 GHz * 2.6) = 97,436

Note that the instruction count is much lower for computer B than for computer A on the same program. To achieve this in real life, one would need a dramatically different architecture (e.g. B is a CISC machine) or a much more aggressive compiler for B.
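The equal-execution-time condition in problem 3 can be checked with a few lines of arithmetic. This sketch assumes the values CPI 1.9 at 1.8 GHz with 100,000 instructions for computer A, and CPI 2.6 at 2.4 GHz for computer B. The function and parameter names are illustrative.

```python
def matching_instruction_count(count_a, cpi_a, clock_a_ghz, cpi_b, clock_b_ghz):
    """Instruction count for B that gives the same execution time as A.

    Uses time = instruction_count * CPI / clock_rate. With clock rates in
    GHz the intermediate time is in nanoseconds, but only the ratio matters.
    """
    time_a = count_a * cpi_a / clock_a_ghz
    return time_a * clock_b_ghz / cpi_b


count_b = matching_instruction_count(100_000, 1.9, 1.8, 2.6, 2.4)
```

Rounding `count_b` to the nearest integer reproduces the instruction count given in the solution.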

4. ( points) Consider the following idea: Let's modify the instruction set architecture and remove the ability to specify an offset for memory access instructions. Specifically, all load-store instructions with nonzero offsets would become pseudoinstructions and would be implemented using two instructions. For example:

    addi $at, $t, 4   # add the offset to a temporary
    lw   $t, $at      # new way of doing lw $t, 4($t)

What changes would you make to the single-cycle datapath and control if this simplified architecture were to be used?

Solution: The key is recognizing that we no longer have to go through the ALU and then to memory. We would not want to add zero using the ALU; instead we want to provide a path directly from the Read data 1 output of the Register File to the read/write address lines of the memory (assuming the instruction format does not change). The output of the ALU would no longer connect to memory. The control does not need to change, but some of the control signals are now don't cares. Assuming we are not implementing addi or addiu, it is possible to remove the ALUSrc control signal and the multiplexer that it controls, thus having just the data from the Read data 2 output (of the Register File) going into the ALU. This results in additional optimizations to ALU control.

5. ( points) MIPS chooses to simplify the structure of its instructions. The way we implement complex instructions through the use of MIPS instructions is to decompose such complex instructions into multiple simpler MIPS ones. Show how MIPS can implement the instruction swap $rs, $rt, which swaps the contents of registers $rs and $rt, in software, i.e., using MIPS instructions. Consider the case in which there is an available register that may be used as well as the case in which no such register exists. If the implementation of this instruction in hardware will increase the clock period of a single-instruction implementation by 8%, what percentage of swap operations in the instruction mix would recommend implementing it in hardware? What if the clock period would increase by 15%?

Solution: Available register ($rd) case: swap $rs, $rt can be implemented as follows:

    addi $rd, $rs, 0
    addi $rs, $rt, 0
    addi $rt, $rd, 0

No available register case:

    sw   $rs, temp($r0)
    addi $rs, $rt, 0
    lw   $rt, temp($r0)

Alternate solution:

    xor $rs, $rs, $rt
    xor $rt, $rs, $rt
    xor $rs, $rs, $rt

Clock cycle tradeoff evaluation: Software takes three cycles, and hardware takes one cycle. Let Rs be the ratio of swaps in the code mix. Also, assume a base CPI = 1 (which it is for the MIPS). Let T be the original clock period and T_hw the lengthened clock period of the machine with a hardware swap. Now:

Avg time per instruction:

    (Software): Rs*3*T + (1 - Rs)*1*T = (2Rs + 1) * T
    (Hardware): T_hw

Hardware implementation makes sense only if: T_hw <= (2Rs + 1) * T

8% increase in clock period: Clock period = 1.08 * T, i.e. if swap instructions are greater than 4% of the instruction mix (Rs >= 0.04), then a hardware implementation would be preferable.

15% increase in clock period: Clock period = 1.15 * T, i.e. if swap instructions are greater than 7.5% of the instruction mix, then a hardware implementation would be preferable.
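Both parts of the problem 5 solution can be sanity-checked in a few lines. The sketch below mirrors the three-xor register swap and solves the break-even inequality for the swap fraction, assuming clock-period increases of 8% and 15% and a three-instruction software swap. The function names and the `software_cycles` parameter are illustrative assumptions.

```python
def xor_swap(rs, rt):
    """Swap two integer values with three xors and no temporary."""
    rs ^= rt   # rs = rs ^ rt
    rt ^= rs   # rt = original rs
    rs ^= rt   # rs = original rt
    return rs, rt


def break_even_swap_fraction(clock_increase, software_cycles=3):
    """Smallest swap fraction Rs at which a hardware swap pays off.

    Software cost per instruction: (1 + (software_cycles - 1) * Rs) * T.
    Hardware cost per instruction: (1 + clock_increase) * T.
    Setting hardware <= software gives Rs >= clock_increase / (software_cycles - 1).
    """
    return clock_increase / (software_cycles - 1)
```

With `clock_increase` set to 0.08 and 0.15 the function returns the 4% and 7.5% thresholds. One caveat on the xor sequence at the register level: if $rs and $rt name the same register, the first xor clears it, so that case would need separate handling.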

6. (20 points) The following C program is compiled into MIPS objects with no optimization and with O2 optimization.

    int A[], B[];
    main() {
      int i;
      int c = ;
      for (i=0; i < ; i++)
        A[i] = B[i] + c;
    }

Unoptimized Code:

     0: li   gp, 0
     4: addi gp, gp, 0
     8: add  gp, gp, t9
     c: addi sp, sp, -24
    10: sw   gp, (sp)
    14: sw   fp, 2(sp)
    18: sw   gp, 6(sp)
    1c: move fp, sp
    20: li   v, 
    24: sw   v, 2(fp)
    28: sw   zero, 8(fp)
    2c: lw   v, 8(fp)
    30: slti v, v, 
    34: bne  v, zero, 3c
    38: j    88
    3c: lw   v, 8(fp)
    40: move v, v
    44: sll  v, v, 2
    48: lw   v, (gp)
    4c: add  v, v, v
    50: lw   v, 8(fp)
    54: move a, v
    58: sll  v, a, 2
    5c: lw   a, 4(gp)
    60: add  v, v, a
    64: lw   a, (v)
    68: lw   v, 2(fp)
    6c: add  a, a, v
    70: sw   a, (v)
    74: lw   v, 8(fp)
    78: addi v, v, 
    7c: move v, v
    80: sw   v, 8(fp)
    84: j    2c
    88: move sp, s8
    8c: lw   fp, 2(sp)
    90: addi sp, sp, 24
    94: jr   ra

Optimized with O2:

     0: li   gp, 0
     4: addi gp, gp, 0
     8: add  gp, gp, t9
     c: li   a2, 
    10: move a, zero
    14: lw   a, (gp)
    18: lw   v, 4(gp)
    1c: lw   v, (v)
    20: addi v, v, 4
    24: addi a, a, 
    28: add  v, v, a2
    2c: sw   v, (a)
    30: slti v, a, 
    34: addi a, a, 4
    38: bne  v, zero, 1c
    3c: jr   ra

a. (10 points) Please identify the optimizations used by the compiler to transform the code from the unoptimized version into the optimized one and point out where they are applied. Note: the 0s seen in the first few lines in both versions of the function are only placeholders for unknown constants, so you should not assume that gp is initialized to 0. Furthermore, t9 in both versions contains the offset between gp and the address storing the pointer to array A.

Solution:

Copy propagation: Instructions 40, 54 and 7c are removed.

Arithmetic identity/algebraic simplification: Since (i+1) * 4 == (i * 4) + 4, instructions 40 and 4c, and 54 and 60, which compute the new A[i] and B[i], are transformed to 34 and 20 respectively.

Leaf routine optimization: It is a leaf routine and there is no need to save and restore fp and gp. There is also no need to store i and c on the stack since they are only used locally. As a result no stack space needs to be allocated. Thus instructions c-18, 24, 3c, 50, 68, 74, 80 and 88-90 in the unoptimized code are removed, and 28-2c are reduced to instruction 10 in the optimized version.

Loop invariant code motion: Since the arrays A and B are in static memory, instructions 48 and 5c, which load the base addresses of A and B, are moved above the loop (instructions 14-18 in the optimized code) to reduce the number of dynamic instructions.

Loop inversion: Since the lower and upper bounds of the for loop are constants, the loop can be transformed into a while loop that has a lower loop overhead. Thus, instructions 30-38 and 84 are transformed to 30 and 38 in the optimized version.

b. (7 points) Please compute the number of dynamic instructions and show the instruction mix (types: ALU, Branch, Memory) for both versions of the code.

Unoptimized version: 11 (before loop) + 22 (in loop) * + 7 (after loop) = 228

    ALU     9/228 = 46%
    Branch  22/228 = 9%
    Memory  7/228 = 45%

Optimized version: 7 (before loop) + 8 (in loop) * + 1 (after loop) = 88

    ALU     55/88 = 62%
    Branch  /88 = 13%
    Memory  22/88 = 25%

c. (3 points) In the optimized code, find the code or data references that need to be resolved by the linker.

The constants in instructions 0 and 4, which initialize $gp to point to the middle of the static data area of memory. The register $t9 accounts for the offset between the initial value of $gp and where the base address of the first array is stored. The branch at 38 is not PC-relative, so this needs to be resolved by the linker.

7. (5 points) Using the figure below, show all the necessary data and control path for instruction jalr rd, rs in the single-cycle MIPS processor discussed in lecture.

[Figure: single-cycle MIPS datapath with control unit, shown blank and then annotated for the solution with the labels "Jalr" and "PC + 4".]

8. (15 Points) It happens quite often that we wish to index through and access each element of an array. Absent from MIPS, but present in other assembly languages/instruction sets, are load/store commands which also increment the indexing register. For example, lwinc $rt, offset($rs) would perform the normal load and subsequently increment $rs by 4. Please either describe in words, or show in the figure below, all necessary modifications needed to support these instructions in the single-cycle MIPS processor discussed in lecture.

    load / store   Rs      Rt      Offset
    31:26          25:21   20:16   15:0

The datapath requires an additional ALU to increment the content of the $rs register (Read data 1) by 4 (7 points). The output of this is fed back to the register file, which needs a second write port (8 points) because two writes to the register file are required in a single cycle. The new write port will be controlled by a new signal, "Write 2." We assume that the destination register for the second write is always the same as Read register 1 ($rs). This way "Write 2" indicates that there is a second write to the register file, to the register identified by "Read register 1," and the data is fed through Write data 2.

Adding a second register file would be incorrect since then the contents of the two would have to be kept consistent.

9. (2 Points) The popular x86 instruction set by Intel allows arithmetic instructions to directly access memory for one of their source operands. The primary benefit is that fewer instructions will be executed because we won't have to first load that source operand into a register. The primary disadvantage is that the cycle time will have to increase to account for the additional time to read memory during the arithmetic instruction. Consider adding a new instruction to the MIPS ISA:

    addm $t2, $t3, $t4   // $t2 = $t3 + Memory[$t4]

a). (5 Points) Consider the single-cycle MIPS processor datapath shown below. Show the datapath changes needed to implement addm. Describe each change in 1-2 sentences. Name control signals, but don't worry about their values for now.

b). (5 Points) Determine the control signals necessary to implement addm in the single-cycle MIPS processor. For each control signal, specify in the following table whether it needs to be 0, 1, or X (don't care) to implement addm. There are additional lines for the control signals needed for datapath changes you made in 2.c. The ALUop control signal can take one of the following values: add, sub, or, X.