ECE 3056 Exam II - Solutions
October 30th, 2013, 4:35 pm - 5:55 pm

1. (20 pts) Consider the pipelined SPIM datapath with forwarding, where branches are predicted not taken. Assume branch instructions occur 10% of the time and are taken 45% of the time, with a penalty of 3 cycles. With forwarding, the load delay slot is one cycle and can be filled with useful instructions 60% of the time. 25% of the instructions are loads, and 30% of these introduce load delay hazards.

a. (10 pts) What is the increase in CPI due to load delay slots and branch hazards?

    CPI_new = CPI_base + stalls-due-to-branches + stalls-due-to-loads
    CPI_new = CPI_base + (0.1 * 0.45 * 3) + (0.25 * 0.3 * 0.4 * 1)

b. (10 pts) With the goal of reducing execution time, we wish to pipeline all memory accesses to 3 cycles. This will reduce the clock cycle time by 15%. Is this a good idea? Provide a quantitative justification for your answer.

Both IF and MEM are increased to 3 cycles each. Therefore branch penalties increase to 5 cycles and load penalties increase to 3 cycles.

    CPI_opt = CPI_base + (0.1 * 0.45 * 5) + (0.25 * 0.3 * 0.4 * 3)
    EX_old = #I * CPI_new * clock_cycle
    EX_opt = #I * CPI_opt * 0.85 * clock_cycle

If EX_opt < EX_old, it is a good idea.
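The comparison in part (b) can be carried out numerically. A minimal sketch, assuming CPI_base = 1 (an ideal base CPI, which the problem does not state explicitly):

```python
# Numerical check of Problem 1, assuming CPI_base = 1 (an assumption;
# the problem does not give a base CPI).
CPI_base = 1.0

# Part (a): added stalls from branches and unfilled load delay slots.
branch_stalls = 0.10 * 0.45 * 3          # 10% branches, 45% taken, 3-cycle penalty
load_stalls = 0.25 * 0.30 * 0.40 * 1     # 25% loads, 30% hazards, 40% unfilled slots
CPI_new = CPI_base + branch_stalls + load_stalls

# Part (b): 3-cycle IF and MEM raise the penalties to 5 and 3 cycles,
# while the clock cycle shrinks by 15%.
CPI_opt = CPI_base + 0.10 * 0.45 * 5 + 0.25 * 0.30 * 0.40 * 3
ratio = (CPI_opt * 0.85) / CPI_new       # EX_opt / EX_old for the same instruction count

print(f"CPI_new = {CPI_new:.3f}, CPI_opt = {CPI_opt:.3f}, "
      f"EX_opt/EX_old = {ratio:.3f}")
```

The ratio comes out below 1 under these assumptions, so the optimized design finishes sooner and the change is a good idea.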

2. (40 pts) Consider the pipelined SPIM datapath shown at the end of this question.

a. (10 pts) Show the additions to the datapath that would be necessary to correctly implement forwarding. Specifically, i) identify the locations of the forwarding muxes on the figure (clearly!), ii) list the specific inputs to each forwarding mux below (i.e., signal names), and iii) identify the location on the datapath that is the source of each signal. Full credit requires that I can read and understand your modifications.

ForwardA mux inputs:
    ID/EX.RegisterRs (contents of Rs)
    EX/MEM.ALUResult
    MEM/WB.WriteData (output of the MemToReg mux)

ForwardB mux inputs:
    ID/EX.RegisterRt (contents of Rt)
    EX/MEM.ALUResult
    MEM/WB.WriteData (output of the MemToReg mux)

b. (10 pts) Write the equation for the forwarding logic for the Rs input to the ALU. Make sure that your notation makes clear the signals being used and the stage they are from.

    ForwardA <= "10" when ((EX/MEM_RegWrite = '1') and
                           (EX/MEM_wreg_addr /= "00000") and
                           (EX/MEM_wreg_addr = reg_rs)) else
                "01" when ((MEM/WB_RegWrite = '1') and
                           (MEM/WB_wreg_addr /= "00000") and
                           (EX/MEM_wreg_addr /= reg_rs) and
                           (MEM/WB_wreg_addr = reg_rs)) else
                "00";

Here wreg_addr is the destination register number and reg_rs is the Rs source register number.
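The selection logic in part (b) can be mirrored in software for illustration. A minimal sketch (Python stands in for the VHDL; register numbers are plain integers, and the signal names follow the solution's notation):

```python
# Sketch of the ForwardA selection logic from part (b), mirroring the
# solution's VHDL conditions. Register numbers are plain integers here.
def forward_a(exmem_regwrite, exmem_wreg_addr,
              memwb_regwrite, memwb_wreg_addr, reg_rs):
    """Return the ForwardA mux select for the Rs ALU input."""
    # Forward from EX/MEM when it writes a nonzero register matching Rs.
    if exmem_regwrite and exmem_wreg_addr != 0 and exmem_wreg_addr == reg_rs:
        return "10"
    # Otherwise forward from MEM/WB under the same conditions, provided
    # the younger EX/MEM result does not already cover Rs.
    if (memwb_regwrite and memwb_wreg_addr != 0
            and exmem_wreg_addr != reg_rs and memwb_wreg_addr == reg_rs):
        return "01"
    return "00"  # no forwarding: use the register-file value

print(forward_a(True, 9, False, 0, 9))   # EX/MEM hazard on $9
print(forward_a(False, 5, True, 7, 7))   # MEM/WB hazard on $7
print(forward_a(False, 0, False, 0, 3))  # no hazard
```

Note that the checks against register number 0 implement the "/= "00000"" guards: writes to $0 are never forwarded.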

c. (10 pts) Now consider this datapath after augmenting it with forwarding and appropriate hazard detection logic, executing the following instruction sequence. Registers $12 and $13 are initialized to 0x1028 and 0x4004 respectively. The first instruction is at 0x40040000. Assume all other registers are properly initialized. Further, consider the state of the datapath during cycle 7 (the first cycle is cycle 0). What are the values of the datapath elements below?

    loop: lw   $7, 4($12)
          lw   $8, 4($13)
          add  $9, $8, $7
          sw   $9, 4($13)
          addi $12, $12, 4
          addi $13, $13, 4
          addi $5, $5, -1
          bne  $5, $0, loop

    IF/ID.PC             = 0x40040018
    MEM/WB.RegWrite      = 1
    EX/MEM.ALUResult     = 0x4008
    EX/MEM.BranchAddress = 0x40040020 (the sum of PC+4 and the immediate (4) * 4)
    WB/ID.MemToReg       = 0

There is a stall between the second and third instructions. On cycle 7 the addi $5, $5, -1 instruction is in IF, and there are no stalls in the pipeline on this cycle.

d. (5 pts) On what cycle does bne exit the pipeline in part (c), the first time through the loop?

bne will be in WB on cycle 12. Remember the pipeline has one load-to-use hazard between instructions 2 and 3.

e. (5 pts) If 10% of all instructions were branches, what would be the maximum impact on CPI of having a perfect branch predictor, i.e., one that was correct 100% of the time?

CPI impact = prob-of-branch * penalty = 0.1 * 3 = 0.3. At best, CPI can be improved by 0.3 (assuming we were otherwise wrong all of the time).
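The bound in part (e) is a one-line computation. A minimal sketch, assuming every branch otherwise paid the full 3-cycle penalty:

```python
# Part (e): best-case CPI improvement from a perfect branch predictor,
# assuming every branch would otherwise pay the full penalty.
branch_fraction = 0.10   # 10% of instructions are branches
branch_penalty = 3       # cycles lost per resolved-late branch
max_cpi_improvement = branch_fraction * branch_penalty
print(f"maximum CPI improvement = {max_cpi_improvement:.1f}")
```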

3. (25 pts) Consider the following code segment executing on the pipelined datapath (from Problem 2, i.e., no forwarding or hazard detection). In ID, register writes occur in the first half of the cycle and reads in the second half.

    1. loop: lw   $t3, 0($t0)       # load first element
    2.       add  $t2, $t2, $t3     # update sum
    3.       sw   $t2, 1028($t0)    # store partial sum
    4.       addi $t0, $t0, 4       # point to next word
    5.       addi $t1, $t1, -1      # decrement count
    6.       bne  $t1, $zero, loop  # check if done

a. (10 pts) Insert nops to ensure this code executes correctly on this datapath. Show the code with nops.

    loop: lw   $t3, 0($t0)
          nop
          nop
          add  $t2, $t2, $t3
          nop
          nop
          sw   $t2, 1028($t0)
          addi $t0, $t0, 4
          addi $t1, $t1, -1
          nop
          nop
          bne  $t1, $zero, loop

(Without forwarding, a dependent instruction must be at least three slots after its producer, so two nops separate each dependent pair: lw/add, add/sw, and addi/bne.)

b. (10 pts) Shorten the execution time by re-scheduling instructions to remove nops. Show the rescheduled code.

    loop: lw   $t3, 0($t0)
          addi $t1, $t1, -1
          add  $t2, $t2, $t3
          addi $t0, $t0, 4
          bne  $t1, $zero, loop
          sw   $t2, 1028($t0)

The addi $t1, $t1, -1 can be moved up since there is no dependence. The other instructions can be moved up since the add will not affect the sw instruction, since there is no hazard detection.

c. (5 pts) What is the speedup achieved by executing a single unscheduled loop iteration on the pipelined datapath relative to executing a single loop iteration on a multi-cycle datapath? Assume the same clock frequency and ignore pipeline setup time.

On a multi-cycle datapath these 6 instructions will take 5 + 4 + 4 + 4 + 4 + 3 = 24 cycles. On the pipelined datapath the loop will take 15 cycles (see the solution to part (a) of this problem). Note that each loop iteration will incur the three stalls after the branch, except for the last iteration. Speedup = 24/15.
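The cycle accounting above can be written out explicitly. A minimal sketch, assuming the multi-cycle latencies used in the solution (lw = 5, R-type/addi/sw = 4, bne = 3) and the 15-cycle pipelined iteration from part (a):

```python
# Cycle accounting for part (c), using the per-instruction multi-cycle
# latencies from the solution and the pipelined count from part (a).
multicycle_cycles = 5 + 4 + 4 + 4 + 4 + 3  # lw, add, sw, addi, addi, bne
pipelined_cycles = 15                      # from part (a), including branch stalls
speedup = multicycle_cycles / pipelined_cycles
print(f"multi-cycle = {multicycle_cycles}, pipelined = {pipelined_cycles}, "
      f"speedup = {speedup:.2f}")
```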

4. (15 pts) Consider a pipelined SPIM core with full support for forwarding and load-to-use hazard detection between instructions within a thread. Now consider execution of two threads on this core, where each thread executes the following function but over distinct data sets.

    func: move $v0, $a0       # get array base address
          move $v1, $a1       # get array count
          add  $v1, $v1, $v0  # array ending address
          move $t1, $0        # initialize loop sum
    loop: lw   $s6, 0($v0)    # load first value
          add  $t1, $t1, $s6  # update sum
          addi $v0, $v0, 4    # update pointer
          slt  $t0, $v0, $v1  # check for termination
          bne  $t0, $0, loop  # onto next element
          move $v0, $t1       # pass sum back to main
          jr   $ra            # return to main

a. (5 pts) With fine-grained, instruction-level hardware multithreading and two threads (T1 and T2), what instruction from each thread is in each stage of the pipeline on cycle 11 (the first cycle is cycle 0)? Assume each instruction is a native instruction.

    IF:  T2: add $t1, $t1, $s6
    ID:  T1: add $t1, $t1, $s6
    EX:  T2: lw $s6, 0($v0)
    MEM: T1: lw $s6, 0($v0)
    WB:  T2: move $t1, $0

There are no stalls, since the load-to-use hazard is covered by multithreading.

b. (5 pts) Assuming branches are tested in EX and the branch condition is available in MEM (as in our base pipeline), how many hardware threads do you need with fine-grained multithreading before branch penalties can be completely hidden? Explain.

Four. This is because the branch penalty is three cycles. Each instruction in a thread will be followed by three instructions from the other three threads before the next instruction in the same thread is executed. The branch condition is known before the next instruction in this thread is fetched. Thus, the branch penalty is completely hidden.
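The cycle-11 snapshot in part (a) can be checked with a short round-robin model. A minimal sketch, assuming no stalls, T1 fetched on even cycles and T2 on odd cycles, and straight-line fetch (valid here because neither thread has looped back by cycle 11):

```python
# Round-robin fine-grained multithreading model for part (a): two threads
# run the same 11-instruction function, alternating fetch every cycle,
# with no stalls (the load-to-use hazard is covered by the interleaving).
instrs = [
    "move $v0, $a0", "move $v1, $a1", "add $v1, $v1, $v0", "move $t1, $0",
    "lw $s6, 0($v0)", "add $t1, $t1, $s6", "addi $v0, $v0, 4",
    "slt $t0, $v0, $v1", "bne $t0, $0, loop", "move $v0, $t1", "jr $ra",
]

def stage_contents(cycle):
    """Instruction occupying each stage at the given cycle (first cycle is 0)."""
    out = {}
    for depth, stage in enumerate(("IF", "ID", "EX", "MEM", "WB")):
        fetched = cycle - depth             # cycle this instruction entered IF
        thread = "T1" if fetched % 2 == 0 else "T2"
        out[stage] = f"{thread}: {instrs[fetched // 2]}"
    return out

for stage, occupant in stage_contents(11).items():
    print(f"{stage:>3}: {occupant}")
```

Each thread advances one instruction every two cycles, so the instruction fetched at cycle c is that thread's (c // 2)-th instruction; the output matches the table in the solution.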

c. (5 pts) What is the difference between thread level parallelism (TLP) and instruction level parallelism (ILP)? Instruction level parallelism refers to the concurrent execution of multiple instructions from the same thread. Thread level parallelism refers to the concurrent execution of two distinct sequences of instructions (threads). The parallelism aspect denotes that the two threads belong to the same process and share an address space, but can execute in parallel in the same core or on different cores.