University of California at Berkeley. D. Patterson & R. Yung

Similar documents
University of California at Berkeley. D. Patterson & R. Yung. Midterm I

Midterm I March 3, 1999 CS152 Computer Architecture and Engineering

Introduction. Datapath Basics

Chapter 4. The Processor

Final Exam Fall 2007

Homework 3. Assigned on 02/15 Due time: midnight on 02/21 (1 WEEK only!) B.2 B.11 B.14 (hint: use multiplexors) CSCI 402: Computer Architectures

Midterm #2 Solutions April 23, 1997

CPU Organization (Design)

Systems Architecture

The Processor: Datapath and Control. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Chapter 4. The Processor. Instruction count Determined by ISA and compiler. We will examine two MIPS implementations

Mark Redekopp and Gandhi Puvvada, All rights reserved. EE 357 Unit 15. Single-Cycle CPU Datapath and Control

Chapter 4. The Processor Designing the datapath

University of California College of Engineering Computer Science Division -EECS. CS 152 Midterm I

Chapter 4. The Processor

4. What is the average CPI of a 1.4 GHz machine that executes 12.5 million instructions in 12 seconds?

Computer Organization EE 3755 Midterm Examination

University of California, Berkeley. Midterm II. You are allowed to use a calculator and one 8.5" x 1" double-sided page of notes.

361 div.1. Computer Architecture EECS 361 Lecture 7: ALU Design : Division

Tailoring the 32-Bit ALU to MIPS

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Lecture Topics. Announcements. Today: Single-Cycle Processors (P&H ) Next: continued. Milestone #3 (due 2/9) Milestone #4 (due 2/23)

University of California, Berkeley College of Engineering Department of Electrical Engineering and Computer Science

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Grading: 3 pts each part. If answer is correct but uses more instructions, 1 pt off. Wrong answer 3pts off.

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

Processor (I) - datapath & control. Hwansoo Han

CENG 3420 Computer Organization and Design. Lecture 06: MIPS Processor - I. Bei Yu

ECE232: Hardware Organization and Design

ECE331: Hardware Organization and Design

Midterm I October 6, 1999 CS152 Computer Architecture and Engineering

CS3350B Computer Architecture Winter 2015

CS 2506 Computer Organization II

MIPS ISA. 1. Data and Address Size 8-, 16-, 32-, 64-bit 2. Which instructions does the processor support

Problem maximum score 1 35pts 2 22pts 3 23pts 4 15pts Total 95pts

Stack Memory. item (16-bit) to be pushed. item (16-bit) most recent

CENG 3420 Lecture 06: Datapath

CS 351 Exam 2, Fall 2012

CISC 662 Graduate Computer Architecture. Lecture 4 - ISA MIPS ISA. In a CPU. (vonneumann) Processor Organization

COMPUTER ORGANIZATION AND DESIGN. The Hardware/Software Interface. Chapter 4. The Processor: A Based on P&H

Computer Arithmetic Multiplication & Shift Chapter 3.4 EEC170 FQ 2005

EE557--FALL 1999 MAKE-UP MIDTERM 1. Closed books, closed notes

The Processor (1) Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Midterm I March 1, 2001 CS152 Computer Architecture and Engineering

Programmable Machines

Programmable Machines

ECE260: Fundamentals of Computer Engineering

Outline. EEL-4713 Computer Architecture Multipliers and shifters. Deriving requirements of ALU. MIPS arithmetic instructions

Mark Redekopp, All rights reserved. EE 357 Unit 11 MIPS ISA

CS 251, Winter 2018, Assignment % of course mark

Midterm I March 12, 2003 CS152 Computer Architecture and Engineering

CSCE 212: FINAL EXAM Spring 2009

Major CPU Design Steps

Computer Organization EE 3755 Midterm Examination

04S1 COMP3211/9211 Computer Architecture Tutorial 1 (Weeks 02 & 03) Solutions

Chapter 4. The Processor

ECE 3056: Architecture, Concurrency and Energy of Computation. Single and Multi-Cycle Datapaths: Practice Problems

2. (3 pts) There are two different ways to measure performance. What are they? 3. (3 pts) Why doesn t MIPS have a subtract immediate instruction?

Chapter 1. Computer Abstractions and Technology. Lesson 3: Understanding Performance

Review: MIPS Organization

ECE260: Fundamentals of Computer Engineering

Final Exam Spring 2017

CS 351 Exam 2 Mon. 11/2/2015

Data paths for MIPS instructions

Single-Cycle Examples, Multi-Cycle Introduction

Processor. Han Wang CS3410, Spring 2012 Computer Science Cornell University. See P&H Chapter , 4.1 4

Computer Architecture

Review: Abstract Implementation View

Lecture Topics. Announcements. Today: The MIPS ISA (P&H ) Next: continued. Milestone #1 (due 1/26) Milestone #2 (due 2/2)

4. (2 pts) What is the only valid and unimpeachable measure of performance?

CS 61C: Great Ideas in Computer Architecture Datapath. Instructors: John Wawrzynek & Vladimir Stojanovic

The MIPS Processor Datapath

Truth Table! Review! Use this table and techniques we learned to transform from 1 to another! ! Gate Diagram! Today!

EE 457 Midterm Summer 14 Redekopp Name: Closed Book / 105 minutes No CALCULATORS Score: / 100

Midterm. Sticker winners: if you got >= 50 / 67

CS 251, Winter 2019, Assignment % of course mark

Midterm Questions Overview

The Processor: Datapath & Control

Implementing the Control. Simple Questions


CS 2506 Computer Organization II

UC Berkeley CS61C : Machine Structures

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

CISC 662 Graduate Computer Architecture. Lecture 4 - ISA

CS 2506 Computer Organization II Test 1. Do not start the test until instructed to do so! printed

ECEN 651: Microprogrammed Control of Digital Systems Department of Electrical and Computer Engineering Texas A&M University

CS 61C: Great Ideas in Computer Architecture Pipelining and Hazards

RISC Processor Design

Computer Science 61C Spring Friedland and Weaver. The MIPS Datapath

Performance Evaluation CS 0447

ENE 334 Microprocessors

Materials: 1. Projectable Version of Diagrams 2. MIPS Simulation 3. Code for Lab 5 - part 1 to demonstrate using microprogramming

3. (2 pts) Clock rates have grown by a factor of 1000 while power consumed has only grown by a factor of 30. How was this accomplished?

CS 2506 Computer Organization II Test 1

ECE260: Fundamentals of Computer Engineering

CS3350B Computer Architecture Quiz 3 March 15, 2018

The overall datapath for RT, lw,sw beq instrucution

CS Computer Architecture Spring Week 10: Chapter

Computer Organization EE 3755 Midterm Examination

Chapter 4. The Processor. Computer Architecture and IC Design Lab

Transcription:

1 University of California at Berkeley College of Engineering Computer Science Division { EECS CS 152 Fall 1995 D. Patterson & R. Yung Computer Architecture and Engineering Midterm I Solutions Question #1: Technology and performance [20 pts] a) Calculate the average execution time for each instruction with an innitely fast memory. Which is faster and by what factor? Show your work. [6 pts] InstructionTime GaAS = InstructionTime CMOS = InstructionTime CMOS InstructionTime GaAS = 3:75 2:5 =1:5 CPI Clock rate = 2:5 1000MHz = 2.5 nanoseconds CPI Clock rate = 200MHz 0:75 = 3.75 nanoseconds So, the GaAS microprocessor is 1.5 times faster than a CMOS microprocessor. Grading : 4 points for showing your work. 1 point for having correct instruction execution times. 1 point for having correct performance factor. b) How many seconds will each CPU take to execute a 1 billion instruction program? [3 pts] Execution Time of Program = (number of instructions) (avg inst exec. time) Execution Time of Program on GaAS = (1 10 9 ) (2:5 10, 9) = 2.5 seconds Execution Time of Program on CMOS = (1 10 9 ) (3:75 10, 9) = 3.75 seconds Grading : 1 point for showing your work. 2 points for having correct program execution times. c) What is the cost of an untested GaAs die for this CPU? Repeat the calculation for a CMOS die. Show your work. [7 pts] dies/wafer = (wafer diameter=2) 2, die area die yield = wafer yield 1+ cost of die = dies/wafer GaAS = cost of wafer bdies per waferdie yieldc j (10=2) 2 pwafer diameter 2 die area defects per unit area die area, p, 10 1 21 die yield GaAS = 0:8, 1+ 41 2,2 k, 4 =52 =0:088, test dies per wafer

2 cost of die GaAS = $2000 b520:088c = $500 dies/wafer CMOS = j (20=2) 2 p, 20 2 22 k, 4 = 121 die yield CMOS = 0:9, 1+ 12,2 =0:225 cost of die CMOS = $1000 b1210:225c = $37:04 2 Grading : 1 point for correctly using oor functions in formulas. 1 point deducted if formulas used are good, but computed values are wrong. Rest of the points on showing your work. d) What is the ratio of the cost of the GaAs die to the cost of the CMOS die? [1 pt] Cost of GaAS die Cost of CMOS die = $500:00 $37:04 =13:5 Grading : 1 point for using an equation which sets up the ratio correctly. e) Based on the costs and performance ratios of the CPU calculated above, what is the ratio of cost/performance of the CMOS CPU to the GaAs CPU? [3 pts] performance = 1 execution time cost CMOS perf. CMOS cost GaAS perf. GaAS = cost CMOSexec time CMOS cost GaASexec time GaAS = $37:043:75ns $500:002:5ns =0:111 Grading : 2 points for using the correct formula. 1 point for correct value.

3 Question #2: Instruction Set Architecture and Registers Sets [15 pts] Imagine a modied MIPS instruction set with a register le consisting of 64 general-purpose registers rather than the usual 32. Assume that we still want to use a uniform instruction length of four bytes and that the opcode size is constant. Also assume that you can expand and contract elds in an instruction, but that you cannot omit them. a) How would the format of R-type (arithmetic and logical) instructions change? Label all the elds with their name and bit length. What is the consequence of this change? [4 pts] opcode rs rt rd shamt func 6 bits 6 bits 6 bits 6 bits 2 bits 6 bits Since you have 64 registers, you must have 6 bits for the register elds. The function eld stays at 6 bits since you do not want to delete any instructions. The consequence of this is that the shift amount eld is reduced to 2 bits, thus allowing you to do max of 3-bit shifts per instruction. Therefore, if you want to shift more than 3 places, you must use more than one instruction. Grading : -1 for non-reasonable consequences, i.e. it is just shorter. -2 if the function eld was reduced. -3 for a decent attempt. b) How does this change the I-type instructions? What is the consequence of this change? [3 pts] opcode rs rt immediate 6 bits 6 bits 6 bits 14 bits Since the register elds are 6 bits, the immediate eld is reduced to just 14 bits. The consequence of this is that for branches, your range of addresses is cut down by a factor of 4. Grading : -1 for non-reasonable consequences, i.e. it is just shorter. -2 for a decent attempt c) How does this change the J-type instructions? What is the consequence of this change? [3 pts] jump target 6 bits 26 bits Nothing needs to be changed since the "jump" instruction does not use any registers.

4 Grading : -1 for non-reasonable consequences, i.e. it is just shorter. -2 for a decent attempt d) Imagine we are translating machine code to use the larger register set. Give an example of an instruction that used to t into the old format, but is impossible to translate directly into a single instruction in the new format. Write a short sequence of instructions that could replace it. [5 pts] One simple example is the "sll" instruction. Old instruction that ts in 16-bit immediate eld: sll $4,$4,9 Now becomes: sll $4,$4,3 sll $4,$4,3 sll $4,$4,3 Grading : -1 for mistakes of the form: addi 4;0, 32768 as an example for the old format. 16-bit values are signed for arithmetic operations, therefore the range is from -32768 to +32767. -2 for a decent attempt. -4 for writing anything that resembles an example.

5 Question #3: Left Shift vs. Multiply [20 pts] Design a 16-bit left shifter that shifts 0 to 15 bits using only 4:1 multiplexors. a) How many levels of 4:1 multiplexors are needed? Show your work. [4 pts] This question only makes sense if you read it as asking for the minimum number of multiplexors needed for the operation. It can clearly be done with arbitrarily more. The minimum number of levels is log 4 16=2 Grading: For an answer of 2 levels: 2 pts. For "showing your work," either by writing the formula or answering #3 correctly : 2pts. An answer of 3 levels, if supported by reasonable logic here or in #3 : 2pts. b) If the delay per multiplexor is 2ns, what is the speed of this shifter? (Assume zero delay for the wires.) [3 pts] 2 levels 2ns/level = 4ns. Grading: 1 pt if you can multiply correctly. 2 pts for the correct answer. c) Draw the four leftmost bits of the multiplexors with the proper connections. (You mightwant to practice drawing it on the back of a page, and then transfer the nal version here.) [6 pts] The simplest solution here was to draw each output O 15... O 12 separately, each using the shift amount bit pattern as the control, ie: i0 --- \ 0 --- \ i1 --- i0 --- i2 --- ---. i1 --- ---. i3 --- / i2 --- / i4 --- \ i3 --- \ i5 --- `- \ i4 --- `- \ i6 --- --. i5 --- --. i7 --- / `-- -- o15 i6 --- / `-- -- o14,---,--- i8 --- \ i7 --- \ i9 ---,- / i8 ---,- / i10 -- -' i9 --- -' i11 -- / i10 -- / i12 -- \ i11 -- \ i13 -- i12 -- i14 -- ---' i13 -- ---' i15 -- / i14 -- /

6 0 --- \ 0 --- \ 0 --- 0 --- i0 --- ---. 0 --- ---. i1 --- / i0 --- / i2 --- \ i1 --- \ i3 --- `- \ i2 --- `- \ i4 --- --. i3 --- --. i5 --- / `-- -- o13 i4 --- / `-- -- o12,---,--- i6 --- \ i5 --- \ i7 ---,- / i6 ---,- / i8 --- -' i7 --- -' i9 --- / i8 --- / i10 -- \ i9 --- \ i11 -- i10 -- i12 -- ---' i11 -- ---' i13 -- / i12 -- / The control signals to the rst level muxes are s 3 s 2, and the control signals to the second level muxes are s 1 s 0, where s is the shift amount bit pattern. Grading: We traced three arbitrary paths through the presented array. If all three worked, and all inputs/outputs were labeled: 6 pts. Leaving o the input/output labeling resulting in a loss of from 1-3 points, depending on the severity. From 1-2 points was taken o for improper or no labeling of the control inputs, again depending on their complexity and how severe the omissions were. Partial credit was awarded for partially working solutions. d) Multiplies can be accomplished by left shifts if the multiplier is a power of two. Write the MIPS instruction(s) that performs multiply via left shifts for a multiplier that is positive and a power of two. Assume the multiplicand is in register $4, the multiplier is in register $5, and that the least signicant 32 bits of the product should be left in $6. You can ignore the potential for overow in the result register. [7 pts] A good solution has four instructions in the loop, a single branch instruction, and avoids a delayed branch stall and/or error: LI $C, -1 ANDI $T, $5, 1 LOOP: ADDI $C, $C, 1 SHR $5, $5, 1 BEQZ $T, LOOP ANDI $T, $5, 1 ; Delay slot, taken from top of target EXIT: SHL $6, $4, $C Grading: Start with full credit for nominally working code. Then, -1 pt for the delay slot being wasted or doing useless work (i.e. the SHL). -2 pts for delay slot causing improper execution. -2 pts for using div rather than shr. -2 pts for using mult rather than shl somewhere. -1 to -2 pts for failing on certain cases, such as when $5 = 1. 2 pts were given to solutions that assumed that $5 contained the log of the multiplier and implemented a single instruction. -1 pt for illegal MIPS instructions, -1 pt for wrong MIPS instructions. Just implementing the hardware multiply algorithm from class in software was worth 2 pts. No points were deducted for checking if the multiplier was zero, although zero is not a power of two.

7 extra credit) In class, we've seen three versions of the multiply operation (ignoring Booth encoding). Approximately how many clock cycles does a multiply take using the fastest algorithm? Approximately how much faster are multiplies that use the shifting technique (assuming the multiplier is a power of two)? [+4 pts] Original multiply takes from 1 32 = 32 to 2 32 = 64 clock cycles (just answering one or the other was accepted). Most common mistake here was forgetting that it is an algorithm in hardware, not a MIPS program doing the same thing. The shifting technique takes a variable amount of time, based on the location of the 1 in the multiplier. Total clock cycles also depends on the number of instructions in the loop. For the code above, number of clock cycles = 3 + 4 (size of multiplier). Assuming the multiplier is 2 i and i ranges form 0 to 15, and that i is randomly distributed (not necessarily true!), the minimum, average, and maximum time to execute is 3 + 4 = 7, 3 + 4 16 = 67, and 3 + 4 32 = 131 clock cycles. Thus, the shifting algorithm is faster when the size of the multipiler is less than or equal to 15 bits (32768) for the 64 clock original multiply, or less than or equal to 7 bits (128) in the 32 clock cycle original. Grading: +1 pt for the original multiply cycle time. +1 pt for the shift method's cycle time. +2 pts for realizing the performance is variable based on the multiplier, and analyzing the performance in accordance. +3 pt max for an otherwise correct solution with a minor error.

8 Question #4: Enhancing the Single Cycle Datapath [30 pts] Recall the 32-bit single-cycle control and datapath from class. a) A MIPS instruction that would be useful to have for writing loops is Decrement and Branch if Not Zero (DBNZ). What changes and additions would be needed to the single cycle data path to support DBNZ? Modify the datapath below to show that support. You do not need to write the control signals. The only change to the datapath needed is to add a hardwired \1" to the ALU MUX. To execute the DBNZ instruction, the register le supplies the value of register Rd to the ALU on bus A, and the ALU MUX supplies \1" to the ALU. The ALU performs a subtraction, the output is written back into Rd, and the \Zero" output is used to select PC + 4 or PC+IMM as the next instruction to fetch. All of the other components needed to execute DBNZ are already in the datapath. This instruction is a good example of execution concurrency. Every other instructions we've studied uses either the ALU data output or the ALU \Zero" output in its execution, but never both. DBNZ uses both. Grading : Most people lost points for adding unnecessary or redundant components. Adding another ALU: -5 points. Adding unnecessary MUXes: -3 points. Unclear descriptions or annotations: -1 to -5 points. b) The basic datapath supports only 32-bit loads. Imagine we wanted to augment the instruction set with new I-type load instructions Load from Lower to Upper Halfword (LLUH) and Load from Upper to Lower Halfword (LULH). LLUH: Rd M[oset + base] 15::0 jj 0 16 LULH: Rd 0 16 jj M[oset + base] 31::16 What datapath changes must be made to support this style of 16-bit loads? Modify the datapath reproduced below to show that support. There are two good solutions to this problem: 1. Insert two MUXes in busw, one to select the correct value for the upper halfword, the other to select the correct value for the lower halfword. 2. Add a shifter that can shift the value on busw 16 bits left or right. Or; add two shifters, one to shift 16 bits left, the other to shift 16 bits right. A MUX selects whether the output of the left shifter, the right shifter, or the unshifted value gets written into the register le. Grading : Most people either got this completely correct or made fundamental mistakes. Incorrect logic: -5 to -10 points. Overlooking the fact that all for instructions other than LLUH and LULH the data won't be shifted: -5 points. Unclear descriptions or annotations: -1 to -5 points.