ECE 3056 Sample Final Examination


Overview

The following applies to all problems unless otherwise explicitly stated.

Consider a 2 GHz MIPS processor with a canonical 5-stage pipeline and 32 general-purpose registers. The branch condition and jump addresses are computed in decode. Branches are assumed not taken. The datapath implements forwarding.

The processor has 2 Gbytes of memory and 32 Kbyte direct mapped separate L1 instruction and data caches with 64 byte lines, backed by a unified 4 Mbyte 16-way set associative L2 cache with 512 byte lines and a unified 32-way 16 Mbyte L3 cache with 1024 byte lines. Applications have a 4 Gbyte virtual address space and 16 Kbyte pages. Hit time for the L1 is one cycle. On a miss in the L1, it takes 4 cycles to access the L2 cache and 1 cycle to transfer each 32-bit word from the L2 to the L1. Similarly, on a miss in the L2 cache it takes 8 cycles to access the L3 and 2 cycles for each 128-bit element (i.e., the L3 is accessed in chunks of 128 bits when moving data to the L2). The memory system is banked with 16 banks, each bank delivering one 32-bit word in 64 cycles. The bus to the memory system takes 2 cycles to transmit an address and is 64 bits wide, i.e., it can transfer 64 bits in a single clock cycle between the memory and the L3. The virtual memory system is supported by separate 8-entry fully associative TLBs for data and for instructions.

Program execution statistics are the following.

    Instruction Class   Frequency
    Load                28%
    Store               15%
    ALU                 40%
    Branches            12%
    Jumps               5%

1. (15 pts) Consider execution of a MIPS code sequence on the pipelined datapath shown on the next page.

a. Show the modifications required to the pipelined datapath overleaf to implement support for the jal instruction. To receive credit you must be clear.

There is more than one solution. The solution below has three functional additions, plus controller updates and a usage caveat.

1. Compute the jump address: take the least significant 26 bits of the instruction, multiply by 4 (converting the word address to a byte address), and concatenate the most significant 4 bits of the PC to generate a 32-bit address. This address is a new input to the PCSrc mux. The controller generates the proper PCSrc mux control bits.
2. Add a third input to the RegDst mux hardwired to 31. The mux control is now 2 bits.
3. Make PC+4 available through WB and as a new input to the MemToReg mux. Expand the number of bits that control this mux.
4. The controller is updated to include a flush of IF/ID and to generate a jal signal.
5. Since register 31 is not written until jal reaches WB, the procedure must have at least one instruction before executing a jr instruction.

b. What would be the values of all of the control signals to realize this implementation? Note that some of these control signal values can be don't care.

    Instr.  RegDst  AluSrc  MemToReg  RegWrite  MemRead  MemWrite  Branch  AluOp
    jal     10      0       10        1         0        0         0       00

[Figure: pipelined MIPS datapath (not reproduced)]

c. Now consider the execution of the following code sequence, specifically the execution of the function func. Why will execution be incorrect in the datapath described on page 2, and what can you do to ensure correct execution? Assume support for the jr instruction is similar to that of the jal instruction, and that addi and move are supported similarly to the add instruction.

            .data
    start:  .word 0x32, 33, 0x34, 35
            .text
            la   $a0, start
            addi $a1, $a0, 0x10
            jal  func
            li   $v0, 10
            syscall
    func:   add  $t7, $zero, $zero
    loop:   lw   $t6, 0($a0)
            add  $t7, $t7, $t6
            addi $a0, $a0, 4
            bne  $a1, $a0, loop
            move $v0, $t7
            jr   $31

Since branches are resolved in decode, bne will not receive the correct forwarded value. Therefore we must either insert 2 stall cycles or 2 nops between addi $a0, $a0, 4 and bne $a1, $a0, loop. There is also a load-to-use hazard on the lw instruction that must either be handled in hardware or avoided with a nop.

2. (10 pts) We wish to implement an atomic swap instruction swap r1, offset(r2). This instruction atomically swaps the contents of a register with a location in memory.

a. What does it mean for the execution of this instruction to be atomic?

The execution of the instruction is atomic if all of its actions are completed in their entirety without any intervening accesses (reads or writes) to the state that is being accessed. For example, the increment of a memory location is atomic if, once begun, there is no intervening access to the memory location until the increment operation has completed.

b. Using a few instructions, provide a MIPS assembly language implementation of this instruction using the following MIPS instructions.

    load linked:       ll r1, offset(r2)
    store conditional: sc r1, offset(r2)

The following code sequence atomically swaps the contents of register $s4 and the memory location whose address is in register $s1.

    loop:   add $t0, $zero, $s4
            ll  $t1, 0($s1)       ; load linked
            sc  $t0, 0($s1)       ; store conditional
            beq $t0, $zero, loop  ; branch if store fails
            add $s4, $zero, $t1   ; put loaded value in $s4

3. (30 pts) Consider the memory system.

a. (5 pts) Show the breakdown of the following addresses: i) virtual address, ii) physical address, iii) physical address as interpreted by the L1 cache, iv) physical address as interpreted by the L2 cache, and v) physical address as interpreted by the L3 cache. Be clear!

    VA:  18-bit virtual page number  | 14-bit page offset
    PA:  18-bit physical page number | 14-bit page offset

The most significant bit of the physical address will be zero if the address space is only 2 Gbytes. This would apply to the address breakdowns below.

    L1:  17-bit tag | 9-bit index | 6-bit offset
    L2:  14-bit tag | 9-bit index | 9-bit offset
    L3:  13-bit tag | 9-bit index | 10-bit offset

b. (5 pts) Assuming a data miss in the L1 and L2 but a hit in the L3, what is the miss penalty in cycles?

    L2 hit, service to L1: 4 + 16 * 1 = 20 cycles
    L3 hit, service to L2: 8 + (128/4) * 2 = 72 cycles

Note: an L2 cache line is 512 bytes = 128 words. It takes 2 cycles to transmit 4 words (128 bits) from the L3 to the L2, and there is an initial cost of 8 cycles. Therefore the miss penalty is 20 + 72 = 92 cycles. Note that this is not the average hit time; this is only the miss penalty when there is a miss in the L1 and L2 and the data is served from the L3.

c. (5 pts) Assuming a Dirty bit and a Valid bit for each tag entry, what is the size in bits of each of the following?

    L1 directories: I-cache + D-cache = (2^9 * 19) * 2 = 19 Kbits. Note that each tag is 17 bits, since we are assuming full 32-bit addresses; with the Valid and Dirty bits, each entry is 19 bits.
    L2 directories: (2^9 * 16) * 2^4 = 2^17 bits. Note that there are 16 ways and each entry has 14 + 1 + 1 = 16 bits.

d. (8 pts) The instruction and data miss rates in the L1 are 2% and 4% respectively. The unified local miss rate in the L2 is 25% and in the L3 is 44%. What is the worst-case increase in CPI due to memory accesses?

    Main memory miss penalty = (2 + 64 + (16/2) * 1) * 16 = 1184 cycles

Note that transmission of an address takes 2 cycles, while two words can be transmitted from the memory system to the cache in 1 cycle over the 64-bit bus. Each memory cycle delivers 16 words (one from each of the 16 banks), and 16 memory cycles are necessary to deliver a 1024-byte L3 cache line. From the previous sub-problems we already have the miss penalties between the other layers of the cache.

    CPI increase due to instruction misses = 0.02 * (20 + 0.25*72 + 0.25*0.44*1184)
    CPI increase due to data misses        = 0.04 * 0.43 * (20 + 0.25*72 + 0.25*0.44*1184)

(Loads and stores together account for 28% + 15% = 43% of instructions.)

e. (7 pts) Provide a sequence of data references (addresses) to the L1 data cache that will generate a miss on every reference.

Addresses 0x000048C0 and 0x800048C0 map to the same line in the L1 data cache (they differ only in their tag bits). An alternating sequence of references to these two addresses will generate a miss on every reference.

4. (10 pts) The base CPI of the core is 1.0, 40% of loads produce a load-to-use hazard, 80% of these can be filled with useful instructions, and 30% of branches are taken. We have separate direct mapped I and D caches (IF and MEM stages). If we change the L1 cache designs from direct mapped to two-way set associative, we will reduce the L1 instruction miss rate from 2% to 1.6% and the L1 data miss rate from 4% to 3%. However, we will need to add a pipeline stage to each of IF and MEM. Assuming every L1 miss hits in the L2, is the change to a set associative L1 worth it? Justify your answer.

    CPI-increase     = 0.3*0.12*1 + 0.4*0.28*0.2*1 + 0.02*20 + 0.04*0.43*20
    CPI-increase-opt = 0.3*0.12*2 + 0.4*0.28*0.2*2 + 0.016*20 + 0.03*0.43*20

(The terms are, in order: taken branches, unfilled load-to-use hazards, instruction misses, and data misses.)

    E1 = #I * CPI-increase * clock_cycle
    E2 = #I * CPI-increase-opt * clock_cycle

Compare E1 and E2 to determine whether this is a good idea.

5. (10 pts) Consider a 32x32 matrix stored in memory in row major form, i.e., all elements of the first row followed by all elements of the second row, and so on. Each element is an integer that takes 4 bytes. The matrix is stored starting at 0x10010000.

a. (5 pts) In which L2 set can we find the element A(2,4)?

The first element is A(0,0) and is stored at memory location 0x10010000. The byte address of the first element of row 2 is 0x10010100: the first two rows (rows 0 and 1) have 64 elements, or 256 bytes, and 256 = 0x100. The 4th element of row 2 is 4*4 = 0x10 bytes offset from the first element of the row. Therefore the address of A(2,4) is 0x10010110. The L2 has a line size of 512 bytes (a 9-bit offset). The number of sets in the L2 is 2^22 / (2^9 * 2^4) = 512 (a 9-bit index). The 9-bit set index in address 0x10010110 is 0x080.

b. Consider two threads working on two different halves of the matrix. Each thread executes on a different core. How would you assign the matrix elements between the two threads so that one thread never accesses a cache line that contains elements belonging to the other thread, i.e., a cache line can only hold elements that are processed by one thread?

Consider the L1 cache, since all other cache line sizes are larger than, and multiples of, the L1 cache line size, and accesses by the core go to the L1 cache. Each line is 64 bytes, or 16 words. Therefore, each row of the matrix (32 elements) is stored in two successive cache lines. The first 16 rows of the matrix are stored in 32 (16 * 2) consecutive cache lines, while the remaining 16 rows are stored in the next 32 cache lines. Thread 1 should be allocated the first 16 rows and thread 2 the next 16 rows. Now each thread never accesses cache lines accessed by the other thread.

6. (15 pts) Consider a bus-based cache-coherent multiprocessor with four cores. Each core has a 1 Kbyte fully associative cache with 256 byte lines. Main memory is 64 Kbytes. The following table shows the state of each cache line in each cache and the corresponding tag. An invalid line (INV) is simply an empty line, and a miss always fills an invalid line before evicting a valid line (V). A valid line means multiple copies may exist in the caches and are not dirty. If a cache has the only copy (which can also be dirty), the state of the line is EX.

a. To keep the caches coherent, what could be the state of the caches after i) processor 0 generates a read to address 0x1300, ii) processor 3 generates a write to address 0xCD40, and iii) core 1 generates a write to address 0xC104? You need only show changes.

    Core 0    Core 1    Core 2    Core 3
    V   13    EX  C1    V   13    V   9B
    INV CD    INV CD    EX  FE    V   00
    V   12    INV AA    V   A3    V   12
    V   33    V   D0    INV 22    EX  CD

b. Can a write-through cache design be used to solve the coherence problem? How or why not?

No. A write-through alone does not affect copies of the data in other caches; those copies would remain stale (and incorrect).

7. (10 pts) This question relates to the disk subsystem.

a. The disks have a controller with 0.5 ms overhead and a transfer rate of 100 Mbytes/sec at 7200 rpm, with average seek times of 6 ms. We have to make a choice between 4 Kbyte and 256 Kbyte pages. What is the difference in page fault service times between these page sizes?

The only element of the disk access latency that varies as a function of the page size is the transfer time.

    latency = controller_overhead + (1/2) rotational_delay + average_seek_time + transfer_time
    transfer_time = page_size * 10^-8 (secs/byte)

b. Consider power gating (turning off) the parity disk in a RAID 4 organization. Under what conditions would you have to power up the parity disk?

Whenever a write operation occurs, the parity disk has to be updated and therefore powered up. Under continuous read operations it can remain powered off. It is also powered up on a failure, when the contents of the failed disk have to be reconstructed.