CS/CoE 1541 Mid Term Exam (Fall 2018).

Name:

Question 1: (6+3+3+4+4 = 20 points) For this question, refer to the following pipeline architecture.

a) Consider the execution of the following code (5 instructions) on the architecture:

L:  lw   r3, 8(r10)
    add  r4, r5, r6
    beq  r9, r10, LL
    sw   r7, 100(r8)
    addi r1, r11, 8

Assuming that at the end of some cycle, C, the lw instruction is in the MEM/WB buffer, specify the values of the following control signals at the end of cycle C (use x if you do not care about the value being 0 or 1).

ID/E.RegWrite: 0    E/MEM.RegWrite: 1    MEM/WB.RegWrite: 1
ID/E.MemtoReg: x    E/MEM.MemtoReg: 0    MEM/WB.MemtoReg: 1
ID/E.MemWrite: 0    E/MEM.MemWrite: 0
ID/E.MemRead:  x    E/MEM.MemRead:  x
ID/E.Branch:   1
ID/E.RegDst:   x
ID/E.ALUSrc:   0
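
For reference, the values above can be read off a table of per-instruction-class control signals. The sketch below (Python, not part of the exam) tabulates those signals following the standard single-issue MIPS conventions of the course textbook (Patterson & Hennessy); whichever instruction occupies a buffer at the end of cycle C supplies that buffer's control bits.

    # Hedged sketch: per-instruction-class control signals, following the
    # textbook (P&H) single-issue MIPS conventions assumed by this exam.
    # 'x' marks a don't-care value.

    CONTROL = {
        "lw":     dict(RegWrite=1, MemtoReg=1,   MemWrite=0, MemRead=1, Branch=0, RegDst=0,   ALUSrc=1),
        "sw":     dict(RegWrite=0, MemtoReg="x", MemWrite=1, MemRead=0, Branch=0, RegDst="x", ALUSrc=1),
        "R-type": dict(RegWrite=1, MemtoReg=0,   MemWrite=0, MemRead=0, Branch=0, RegDst=1,   ALUSrc=0),
        "beq":    dict(RegWrite=0, MemtoReg="x", MemWrite=0, MemRead=0, Branch=1, RegDst="x", ALUSrc=0),
    }

    # At the end of cycle C: lw is in MEM/WB, the add (R-type) is in E/MEM,
    # and the beq is in ID/E, so for example:
    print("MEM/WB.RegWrite =", CONTROL["lw"]["RegWrite"])       # 1
    print("E/MEM.MemtoReg  =", CONTROL["R-type"]["MemtoReg"])   # 0
    print("ID/E.Branch     =", CONTROL["beq"]["Branch"])        # 1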

b) What will be stored in the PC and in the IF/ID buffer at the end of cycle C?

- The PC will store: L+16
- The IF/ID buffer will store: field A will store L+16 and field B will store the 32-bit instruction sw r7, 100(r8)

c) What is the reason for having the three shaded components in the E stage?

- The shaded shift-left-2 unit: to multiply the branch offset by 4.
- The shaded ALU unit: to compute the branch target address.
- The shaded AND gate: to force the target address into the PC only when the instruction in E is a taken branch.

d) Now assume that the branch condition of the beq instruction, which is in the ID/E buffer at the end of cycle C, evaluates to true. What actions (in terms of changing the values stored in the inter-stage buffers) will be taken by the architecture during cycle C+1 as a result of the branch being taken?

1) PC = the branch target address.
2) IF/ID = 0 (to force the opcode to 0, the opcode of a no-op).
3) ID/E.MemWrite = ID/E.RegWrite = ID/E.Branch = 0.

e) If 25% of the instructions executing on this architecture are branch instructions, 30% are lw/sw instructions, and 45% are ALU instructions, what would be the CPI for the architecture, assuming that 40% of the branches are taken (consider only control hazards and ignore the effect of other hazards)?

CPI = 1 + 0.25 * 0.4 * 2 = 1.2
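
As a quick sanity check of the arithmetic in part (e), the short sketch below (Python, not part of the exam) recomputes the CPI from the given instruction mix, the fraction of taken branches, and the 2-cycle penalty for a taken branch that follows from resolving branches in the E stage.

    # Hedged sketch: recompute the CPI of Question 1(e).
    # Assumes a base CPI of 1 and a 2-cycle penalty per taken branch
    # (the branch is resolved in the E stage, squashing 2 fetched instructions).

    branch_frac = 0.25        # fraction of branch instructions
    taken_frac = 0.40         # fraction of branches that are taken
    taken_branch_penalty = 2  # bubbles per taken branch

    cpi = 1 + branch_frac * taken_frac * taken_branch_penalty
    print(f"CPI = {cpi:.2f}")   # -> CPI = 1.20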

Question 2 (5+4+4 = 13 points): Assume that the 5-stage pipelined architecture has only one memory unit, which is used for fetching instructions (in the IF stage) and for reading/writing data (in the MEM stage). Hence, if at a given cycle the IF stage wants to read an instruction and the MEM stage wants to read or write data, then one of the two stages has to stall and use the memory in a later cycle. Assume that when the IF and MEM stages compete for the memory in a given cycle, priority is given to the MEM stage.

a) Complete the following timing diagram to trace the execution of the first 7 instructions of the following code segment:

I1: lw   $1, 100($0)
I2: and  $3, $2, $4
I3: add  $9, $9, $10
I4: sw   $7, 50($6)
I5: sub  $3, $3, $1
I6: sub  $2, $2, $2
I7: bneq $9, $6, I1
I8: ...
I9: ...

Cycle |    IF     |    ID     |    E      |   MEM     |    WB
  1   | I1: lw    |           |           |           |
  2   | I2: and   | I1: lw    |           |           |
  3   | I3: add   | I2: and   | I1: lw    |           |
  4   | I4: sw    | I3: add   | I2: and   | I1: lw    |
  5   | I4: sw    | --        | I3: add   | I2: and   | I1: lw
  6   | I5: sub   | I4: sw    | --        | I3: add   | I2: and
  7   | I6: sub   | I5: sub   | I4: sw    | --        | I3: add
  8   | I7: bneq  | I6: sub   | I5: sub   | I4: sw    | --
  9   | I7: bneq  | --        | I6: sub   | I5: sub   | I4: sw
 10   |           | I7: bneq  | --        | I6: sub   | I5: sub
 11   |           |           | I7: bneq  | --        | I6: sub
 12   |           |           |           | I7: bneq  | --
 13   |           |           |           |           | I7: bneq

(In cycles 4 and 8, the fetches of I4 and I7 are delayed by one cycle because the lw/sw in the MEM stage uses the memory and has priority; each conflict injects one bubble into the pipeline.)

b) Assuming that the loop (I1-I7) will execute 10000 iterations and that the branch condition/target is resolved in the E stage, compute the CPI during the execution of the loop if the branch is always predicted not-taken (ignore the time to fill up the pipeline).

During each iteration (7 instructions) there will be two bubbles due to structural hazards and two bubbles due to control hazards (the taken branch squashes the two instructions fetched after it). Hence, each iteration will take 11 cycles to complete.

CPI = 11/7 ≈ 1.57

c) What would the CPI be if a 1-bit branch predictor is used?

A 1-bit branch predictor will be correct 9998 out of 10000 times, so the misprediction penalty can be ignored; only the two structural-hazard bubbles per iteration remain. Hence, each iteration will take 9 cycles to complete.

CPI = 9/7 ≈ 1.29
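
The sketch below (Python, not part of the exam) tallies the same cycle counts over the whole loop under the assumptions of parts (b) and (c): 7 instructions and 2 structural-hazard bubbles per iteration, and a 2-cycle penalty per mispredicted branch. The totals converge to 11/7 and 9/7, differing only slightly because of the first and last iterations.

    # Hedged sketch: recompute the loop CPIs of Question 2(b) and 2(c).

    instructions_per_iter = 7
    structural_bubbles = 2
    branch_penalty = 2
    iterations = 10000

    # (b) predict not-taken: the branch is mispredicted in every iteration but the last
    mispredictions_b = iterations - 1
    cycles_b = iterations * (instructions_per_iter + structural_bubbles) \
               + mispredictions_b * branch_penalty
    print("CPI (b) =", cycles_b / (iterations * instructions_per_iter))  # ~11/7

    # (c) 1-bit predictor: wrong only twice (first and last execution of the branch)
    mispredictions_c = 2
    cycles_c = iterations * (instructions_per_iter + structural_bubbles) \
               + mispredictions_c * branch_penalty
    print("CPI (c) =", cycles_c / (iterations * instructions_per_iter))  # ~9/7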

Question 3 (6+6 = 12 points): The following figure shows two forwarding paths, from the E/MEM and MEM/WB buffers to the E stage. A forwarding unit sets the mux control signals, A and B, to 1 or 2 whenever it detects data dependencies that may cause hazards.

a) Using the notation given in the figure for the information stored in the ID/E, E/MEM and MEM/WB buffers, state the condition(s) that will cause the forwarding unit to set control signal A to the value 1.

(MEM/WB.Rd == ID/E.Rs) AND (MEM/WB.RegWrite == 1) AND (MEM/WB.Rd != 0)

b) As is clear from the figure, the branch condition is determined in the ID stage. A data hazard that we did not discuss in class occurs when the instruction in the E stage writes into a register, $r, and in the same cycle a branch instruction uses register $r to determine the branch condition. For example,

add $r1, $r2, $r3
beq $r1, $r5, L

causes a data hazard because the condition of the branch is evaluated using the old value of $r1 rather than the value computed by the add instruction. By drawing on the figure, show an additional forwarding path that can be used to avoid the above data hazard.
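
A minimal sketch (Python, not part of the exam) of the condition in part (a), with the pipeline buffers modeled as plain dictionaries whose field names follow the notation used in the question; the helper name forward_A_is_1 is made up for illustration.

    # Hedged sketch of the forwarding condition from Question 3(a).

    def forward_A_is_1(id_e: dict, mem_wb: dict) -> bool:
        """Return True when mux control A should be set to 1, i.e. when the
        value being written back from MEM/WB must be forwarded to the first
        ALU operand of the instruction now in the E stage."""
        return (mem_wb["RegWrite"] == 1
                and mem_wb["Rd"] != 0            # register $0 is never forwarded
                and mem_wb["Rd"] == id_e["Rs"])

    # Example: a lw writing $3 is in MEM/WB while "add $4, $3, $6" is in E (Rs = 3).
    print(forward_A_is_1(id_e={"Rs": 3}, mem_wb={"Rd": 3, "RegWrite": 1}))  # True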

Question 4 (5+5+5 = 15 points): Consider the following loop, which accumulates in $r3 the sum of the N elements of a vector. The value of N is initially stored in $r5 and the address of the first element of the vector is initially stored in $r2.

I1: lw   $r1, 0($r2)
I2: add  $r3, $r3, $r1
I3: addi $r2, $r2, -4
I4: addi $r5, $r5, -1
I5: bneq $r5, $0, I1

In this question, assume a superscalar architecture with two pipelines, one for ALU/branch instructions and the other for lw/sw instructions, with hardware support for forwarding.

a) Show, using the table below, the schedule for one iteration of the loop on the superscalar. You should rearrange the instructions to minimize the execution time (of course, without violating correctness).

b) Assuming that N is a multiple of 3, show the code for the loop after unrolling it twice.

lw   $r1, 0($r2)
lw   $r6, -4($r2)
lw   $r7, -8($r2)
add  $r3, $r3, $r1
add  $r3, $r3, $r6
add  $r3, $r3, $r7
addi $r2, $r2, -12
addi $r5, $r5, -3
bneq $r5, $0, I1

c) Show, using the table below, the schedule for one iteration of the unrolled loop on the superscalar. You should rearrange the instructions to minimize the execution time.

Answer of part a: schedule for the original loop (one iteration)

Cycle | ALU/branch pipeline | Load/store pipeline
  1   | addi $r5, $r5, -1   | lw $r1, 0($r2)
  2   | addi $r2, $r2, -4   |
  3   | add  $r3, $r3, $r1  |
  4   | bneq $r5, $0, I1    |

Answer of part c: schedule for the unrolled loop

Cycle | ALU/branch pipeline | Load/store pipeline
  1   | addi $r5, $r5, -3   | lw $r1, 0($r2)
  2   | addi $r2, $r2, -12  | lw $r6, -4($r2)
  3   | add  $r3, $r3, $r1  | lw $r7, 4($r2)
  4   | add  $r3, $r3, $r6  |
  5   | add  $r3, $r3, $r7  |
  6   | bneq $r5, $0, I1    |
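
As a rough check (not part of the exam) of the benefit of unrolling: the schedule of part (a) completes 1 vector element every 4 cycles, while the schedule of part (c) completes 3 elements every 6 cycles, ignoring pipeline fill time and any inter-iteration stalls.

    # Hedged sketch: cycles per vector element for the two schedules above.

    cycles_original, elements_original = 4, 1   # schedule of part (a)
    cycles_unrolled, elements_unrolled = 6, 3   # schedule of part (c)

    cpe_orig = cycles_original / elements_original
    cpe_unrolled = cycles_unrolled / elements_unrolled
    print(f"original: {cpe_orig} cycles/element, unrolled: {cpe_unrolled} cycles/element")
    print(f"speedup from unrolling ~ {cpe_orig / cpe_unrolled:.1f}x")   # ~ 2.0x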

Question 5 (10 points): Indicate which of the following sentences is true and which is false. For the first 4 sentences, refer to the execution of the following sequence of instructions on a 5-stage CPU pipeline:

I1: add $r1, $r3, $r2
I2: lw  $r1, 100($r2)
I3: lw  $r3, 104($r1)

1. Assuming no forwarding or stalling hardware, a data hazard will occur because I2 will not use the correct register data.
2. Assuming no forwarding or stalling hardware, a data hazard will occur because I3 will not use the correct register data.
3. Assuming forwarding but no stalling hardware, a data hazard will occur because I2 will not use the correct register data.
4. Assuming forwarding but no stalling hardware, a data hazard will occur because I3 will not use the correct register data.
5. When an exception occurs in a 5-stage pipelined CPU, the address of the exception handler is stored in the EPC (Exception PC) register.
6. In the DRAM open-row policy, opening a row occurs before every access to the row.
7. In the DRAM closed-row policy, closing a row occurs after every access to the row.
8. In the write-allocate cache policy, a block is allocated in the cache on either a read or a write miss.
9. Increasing the cache's associativity reduces compulsory misses.
10. If memory words are accessed sequentially (e.g., 1, 2, 3, 4, 5, 6, ...) and the cache block size is 4 words, then the cache miss rate is 25%.
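
To illustrate the scenario described in statement 10, the sketch below (Python, not part of the exam) simulates sequential word accesses against a cold cache with 4-word blocks and reports the resulting miss rate; the cache is assumed large enough that no block is evicted, since only the block-size arithmetic matters here.

    # Hedged sketch: miss rate for sequential word accesses with 4-word blocks,
    # assuming a cold cache that never evicts a block.

    BLOCK_SIZE_WORDS = 4
    accesses = range(1, 10001)          # word addresses 1, 2, 3, ...

    cached_blocks = set()
    misses = 0
    for addr in accesses:
        block = addr // BLOCK_SIZE_WORDS
        if block not in cached_blocks:  # first touch of a block is a miss
            misses += 1
            cached_blocks.add(block)

    print(f"miss rate = {misses / len(accesses):.2%}")   # ~25%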

Question 6 (4+5+6 = 15 points): In this question, use the notation [tag, M(address), ...] to describe the content of each entry. For example, [4, M(46)] indicates that the entry contains tag = 4 and the data from memory location (word address) 46. Similarly, [4, M(46), M(47)] indicates that the entry contains a block of two words from word addresses 46 and 47.

a) Show the content of each of the caches shown below after the two memory references (word addresses) 31, 36.

(a) An 8-word, direct-mapped cache with block size = 1 word

Index | Content of cache
  0   |
  1   |
  2   |
  3   |
  4   | [4, M(36)]
  5   |
  6   |
  7   | [3, M(31)]

(b) An 8-word, direct-mapped cache with block size = 2 words

Block index | Content of cache
     0      |
     1      |
     2      | [4, M(36), M(37)]
     3      | [3, M(30), M(31)]

b) Show the content of the cache shown below after the three memory references (word addresses) 31, 36, 20.

(c) A 32-word, 2-way set-associative cache with block size = 2 words

Block index | Content of cache (Way 1) | Content of cache (Way 2)
     0      |                          |
     1      |                          |
     2      | [2, M(36), M(37)]        | [1, M(20), M(21)]
     3      |                          |
     4      |                          |
     5      |                          |
     6      |                          |
     7      | [1, M(30), M(31)]        |
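
The index/tag arithmetic behind these cache contents is summarized in the sketch below (Python, not part of the exam): the block address is the word address divided by the block size, the set index is the block address modulo the number of sets, and the tag is the remaining upper bits.

    # Hedged sketch: index/tag computation for the caches in Question 6.
    # Addresses are word addresses; block sizes are in words.

    def place(word_addr, block_size_words, num_sets):
        """Return (set_index, tag) for a word address."""
        block_addr = word_addr // block_size_words
        return block_addr % num_sets, block_addr // num_sets

    # (a) 8 words, direct mapped, 1-word blocks -> 8 sets
    for a in (31, 36):
        print("(a)", a, "->", place(a, 1, 8))    # 31 -> (7, 3), 36 -> (4, 4)

    # (b) 8 words, direct mapped, 2-word blocks -> 4 sets
    for a in (31, 36):
        print("(b)", a, "->", place(a, 2, 4))    # 31 -> (3, 3), 36 -> (2, 4)

    # (c) 32 words, 2-way set associative, 2-word blocks -> 16 blocks / 2 ways = 8 sets
    for a in (31, 36, 20):
        print("(c)", a, "->", place(a, 2, 8))    # 31 -> (7, 1), 36 -> (2, 2), 20 -> (2, 1)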

Question 7 (5+5+5 = 15 points): Assume that the CPI of a particular CPU is 2.0 with an infinitely large cache, and consider the addition of a 4-way, 1-MByte L1 cache to the CPU. You have two possible options for the L1 cache:

Option C1: uses a block size of 32 bytes and results in an effective miss rate of 4%.
Option C2: uses a block size of 64 bytes and results in an effective miss rate of 3%.

Because the block size in C2 is larger than the block size in C1, the miss penalty for C2 is 100 cycles while the miss penalty of C1 is 80 cycles.

(a) By computing and comparing the CPI for each of the cache options, determine which option results in a faster system.

CPI with C1 (32-byte blocks) = 2.0 + 0.04 * 80 = 5.2
CPI with C2 (64-byte blocks) = 2.0 + 0.03 * 100 = 5.0

Hence, option C2 is faster than option C1.

(b) To compare the total number of bits (for tag and data, ignoring other bits) required to implement the L1 cache, complete the following sentences (N is the number of bits in an address):

For option C1,
- the tag for each block in the cache consists of N - 5 - 13 = N - 18 bits (14 bits if N = 32),
- the total number of blocks in the cache is 2^15.

For option C2,
- the tag for each block in the cache consists of N - 6 - 12 = N - 18 bits (14 bits if N = 32),
- the total number of blocks in the cache is 2^14.

Hence, option C1 uses more bits (total) for tags and data than option C2.

(c) Assume that an L2 cache is added to option C2 to reduce the miss penalty, for blocks that are not in L1 but are in L2, to 10 cycles. Assuming that the miss penalty of L2 is 100 cycles and that its hit rate is 30% (30% of the misses from L1 are found in L2), what is the CPI of the system with two levels of caches?

CPI = 2.0 + 0.03 * (10 + 0.7 * 100) = 4.4
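
A final arithmetic sketch (Python, not part of the exam) recomputes the CPIs of parts (a) and (c) and the tag width and block count of part (b), assuming N = 32 address bits for the concrete tag values.

    # Hedged sketch: recompute the numbers in Question 7, assuming N = 32 address bits.

    import math

    BASE_CPI = 2.0
    CACHE_BYTES = 1 << 20          # 1-MByte L1 cache
    WAYS = 4
    N = 32                         # address bits (assumption for the concrete values)

    def l1_cpi(miss_rate, miss_penalty):
        return BASE_CPI + miss_rate * miss_penalty

    def tag_bits_and_blocks(block_bytes):
        blocks = CACHE_BYTES // block_bytes
        sets = blocks // WAYS
        offset_bits = int(math.log2(block_bytes))
        index_bits = int(math.log2(sets))
        return N - offset_bits - index_bits, blocks

    print("CPI C1 =", l1_cpi(0.04, 80))                       # 5.2
    print("CPI C2 =", l1_cpi(0.03, 100))                      # 5.0
    print("C1 tag bits, blocks =", tag_bits_and_blocks(32))   # (14, 2**15)
    print("C2 tag bits, blocks =", tag_bits_and_blocks(64))   # (14, 2**14)

    # (c) two-level hierarchy on top of C2: every L1 miss pays 10 cycles to reach L2,
    # and the 70% of L1 misses that also miss in L2 pay an extra 100 cycles.
    print("CPI with L2 =", BASE_CPI + 0.03 * (10 + 0.7 * 100))   # 4.4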