CSE 490/590 Computer Architecture Homework 2

1. Suppose that you have the following out-of-order datapath with a 1-cycle ALU, 2-cycle Mem, 3-cycle Fadd, 5-cycle Fmul, no branch prediction, and in-order fetch and commit.

[Datapath figure: IF -> ID -> Issue -> {ALU, Mem, Fadd, Fmul} -> Commit]

Consider the following sequence of instructions.

I1:  LD   F4, 0(R1)
I2:  LD   F3, 0(R2)
I3:  FADD F6, F3, F4
I4:  FMUL F1, F6, F3
I5:  LD   F5, 0(R3)
I6:  FADD F2, F5, F4
I7:  FMUL F5, F2, F5
I8:  FADD F3, F1, F4
I9:  FMUL F5, F3, F2
I10: FADD F6, F2, F4

Fill in the ROB and the renaming table (below).

Answer: Note that because of the commit stage, the ROB also holds data, so the answer table below does not show the complete picture of the ROB. In other words, we're dealing with the ROB in slides 6-8 of ilp2.pptx, and the renaming table therefore holds only tags, not the actual data values.

ROB:

Tag   op    dst   src1  src2  Instruction
T1    LD    T1    R1    0     I1
T2    LD    T2    R2    0     I2
T3    FADD  T3    T2    T1    I3
T4    FMUL  T4    T3    T2    I4
T5    LD    T5    R3    0     I5
T6    FADD  T6    T5    T1    I6
T7    FMUL  T7    T6    T5    I7
T8    FADD  T8    T4    T1    I8
T9    FMUL  T9    T8    T6    I9
T10   FADD  T10   T6    T1    I10

Renaming table (each row shows the mapping written by that instruction; R1, R2, and R3 are only read, so they are never renamed):

I1:  F4 -> T1
I2:  F3 -> T2
I3:  F6 -> T3
I4:  F1 -> T4
I5:  F5 -> T5
I6:  F2 -> T6
I7:  F5 -> T7
I8:  F3 -> T8
I9:  F5 -> T9
I10: F6 -> T10

After I10 the table reads: F1 = T4, F2 = T6, F3 = T8, F4 = T1, F5 = T9, F6 = T10.
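To see how the table above is generated, here is a minimal renaming sketch in Python (an illustration, not part of the original solution): it assumes one fresh tag per instruction, allocated in program order, and rewrites each source operand through the current mapping.

# Minimal renaming sketch: one fresh tag per instruction, sources rewritten
# through the current register -> tag mapping. Reproduces the ROB table above.
insts = [
    ("LD",   "F4", "R1", "0"),
    ("LD",   "F3", "R2", "0"),
    ("FADD", "F6", "F3", "F4"),
    ("FMUL", "F1", "F6", "F3"),
    ("LD",   "F5", "R3", "0"),
    ("FADD", "F2", "F5", "F4"),
    ("FMUL", "F5", "F2", "F5"),
    ("FADD", "F3", "F1", "F4"),
    ("FMUL", "F5", "F3", "F2"),
    ("FADD", "F6", "F2", "F4"),
]

rename = {}  # architectural register -> most recent tag
for i, (op, dst, s1, s2) in enumerate(insts, start=1):
    tag = f"T{i}"
    # A source reads the latest tag if the register has been renamed;
    # otherwise it reads the architectural register (or a literal offset).
    src1, src2 = rename.get(s1, s1), rename.get(s2, s2)
    rename[dst] = tag  # the destination always gets the fresh tag
    print(f"{tag}\t{op}\t{tag}\t{src1}\t{src2}")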

2. Consider the following instructions. Assume that the initial values of R1, R2, and R3 are all 0.

loop:    SUBI R2, R1, 2
         BNEZ R2, target1
         ADDI R3, R3, 1
target1: ADDI R1, R1, 1
         SUBI R4, R1, 3
         BNEZ R4, loop

(a) Explain what the code does.

Answer: In pseudo-code:

loop:
    if R1 == 2, then R3 = R3 + 1
    R1 = R1 + 1
    if R1 != 3, then goto loop

Thus, the code loops three times (R1 = 0, 1, and 2), and it increments R3 by 1 in the last iteration.

(b) Change the code to minimize the number of registers necessary.

Answer: R2 and R4 only hold temporary values, so we can use just one register:

loop:    SUBI R2, R1, 2
         BNEZ R2, target1
         ADDI R3, R3, 1
target1: ADDI R1, R1, 1
         SUBI R2, R1, 3
         BNEZ R2, loop

(c) Assume that we have a single 1-bit branch predictor that stores the outcome of the last branch and predicts accordingly: it predicts "taken" if the last branch was taken, and "not taken" otherwise. Show the results of all predictions throughout the execution.

Answer: Assume that the predictor starts at Not Taken. In total, the code loops three times, so there are six branches.

Branch               Prediction   Actual Result
1st BNEZ (target1)   Not Taken    Taken
2nd BNEZ (loop)      Taken        Taken
3rd BNEZ (target1)   Taken        Taken
4th BNEZ (loop)      Taken        Taken
5th BNEZ (target1)   Taken        Not Taken
6th BNEZ (loop)      Not Taken    Not Taken
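The prediction table for part (c) can be replayed with a short Python sketch (an illustration, not part of the original solution); the branch outcomes are the six taken/not-taken results derived in part (a).

# A single shared 1-bit predictor replayed over the six branch outcomes.
outcomes = [True, True, True, True, False, False]  # taken?
state = False  # start at "not taken", as assumed in the answer

for i, actual in enumerate(outcomes, start=1):
    prediction = state
    verdict = "hit" if prediction == actual else "miss"
    print(f"branch {i}: predict {'T' if prediction else 'NT'}, "
          f"actual {'T' if actual else 'NT'} ({verdict})")
    state = actual  # a 1-bit predictor just remembers the last outcome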

(d) Assume that we have one 2-bit branch predictor. Show the results of all predictions throughout the execution.

Answer: Assume that the predictor starts in the Not Taken & Right state. The first field is the predicted direction; the second records whether the last prediction was right, so the direction flips only after two consecutive mispredictions.

Branch               Prediction          Actual Result
1st BNEZ (target1)   Not Taken & Right   Taken
2nd BNEZ (loop)      Not Taken & Wrong   Taken
3rd BNEZ (target1)   Taken & Right       Taken
4th BNEZ (loop)      Taken & Right       Taken
5th BNEZ (target1)   Taken & Right       Not Taken
6th BNEZ (loop)      Taken & Wrong       Not Taken

(e) Assume that we have four 2-bit branch predictors per branch instruction, as well as one 2-bit shift register that stores the outcomes of the last two branches (i.e., a two-level branch predictor). Show the results of all predictions throughout the execution.

Answer: Assume that every predictor starts in the Not Taken & Right state and that the global history starts at Not Taken, Not Taken. We use two tables, one per branch instruction; each instruction has four predictors, one for each possible value of the global history (the tables below do not show the four cases separately). Out of the six branches, only the 3rd and 5th share the same predictor entry; all of the others use different ones.

BNEZ (target1):
Branch   History (Last Two)     Prediction          Actual Result
1st      Not Taken, Not Taken   Not Taken & Right   Taken
3rd      Taken, Taken           Not Taken & Right   Taken
5th      Taken, Taken           Not Taken & Wrong   Not Taken

BNEZ (loop):
Branch   History (Last Two)     Prediction          Actual Result
2nd      Not Taken, Taken       Not Taken & Right   Taken
4th      Taken, Taken           Not Taken & Right   Taken
6th      Taken, Not Taken       Not Taken & Right   Not Taken
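As a cross-check, here is a small Python sketch (an illustration, not part of the original solution) of the two-level predictor in part (e), using the same "direction flips after two consecutive misses" state machine as part (d).

# Two-level predictor sketch: a 2-outcome global history selects one 2-bit
# predictor per (branch, history) pair. States are (direction, confident).
outcomes = [  # (branch, taken?) in execution order
    ("target1", True), ("loop", True),
    ("target1", True), ("loop", True),
    ("target1", False), ("loop", False),
]

history = (False, False)  # last two branch outcomes, oldest first
table = {}                # (branch, history) -> [predict_taken, confident]

for pc, actual in outcomes:
    state = table.setdefault((pc, history), [False, True])  # Not Taken & Right
    print(f"{pc}: history={history}, predict={state[0]}, actual={actual}")
    if state[0] == actual:
        state[1] = True                    # correct: (re)gain confidence
    elif state[1]:
        state[1] = False                   # first miss: lose confidence
    else:
        state[0], state[1] = actual, True  # second miss: flip direction
    history = (history[1], actual)         # shift in the newest outcome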

3. (Example on p. 77) Consider the following code:

loop: LD   F0, 0(R1)
      FADD F4, F0, F2
      ST   F4, 0(R1)
      ADDI R1, R1, -8
      BNE  R1, R2, loop

Show how to unroll the loop so that there are four copies of the loop body, assuming that R1 - R2 is initially a multiple of 32, which means that the number of loop iterations is a multiple of 4. Eliminate any obviously redundant computations and do not reuse any of the registers.

Answer: Please refer to the textbook.
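For intuition, here is a source-level Python analogue of the transformation (an illustration only, not the textbook's MIPS answer): the loop computes x[i] = x[i] + s over an array, and unrolling by four amortizes the index update and branch over four copies of the body.

# Rolled form: one element, one index update, one branch per iteration.
def add_scalar(x, s):
    for i in range(len(x)):
        x[i] += s

# Unrolled by four: the iteration count is a multiple of 4, as the problem
# assumes, so no cleanup loop is needed. Each copy of the body touches its
# own element, mirroring "do not reuse any of the registers".
def add_scalar_unrolled(x, s):
    assert len(x) % 4 == 0
    for i in range(0, len(x), 4):
        x[i]     += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s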

4. (Example on p. 116) Suppose we have a VLIW that can issue two memory references, two FP operations, and one integer operation or branch in every clock cycle. Show an unrolled version of the loop x[i] = x[i] + s (see p. 76 for the MIPS code) for such a processor. Unroll as many times as necessary to eliminate any stalls. Ignore delayed branches.

Answer: Please refer to the textbook.
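To make the issue constraints concrete, here is a small Python sketch (an illustration only, not the textbook's schedule): each VLIW word is a bundle of slots, and a word is legal only if it respects the per-cycle limits stated in the problem.

# One VLIW word = up to 2 memory refs + up to 2 FP ops + at most 1 int/branch.
from dataclasses import dataclass, field

@dataclass
class VLIWWord:
    mem: list = field(default_factory=list)        # memory-reference slots
    fp: list = field(default_factory=list)         # FP-operation slots
    int_or_br: list = field(default_factory=list)  # integer op or branch

    def is_legal(self) -> bool:
        return (len(self.mem) <= 2 and len(self.fp) <= 2
                and len(self.int_or_br) <= 1)

# One plausible early cycle of the unrolled loop: two independent loads.
w = VLIWWord(mem=["LD F0, 0(R1)", "LD F6, -8(R1)"])
assert w.is_legal()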

5. Suppose that you have a multithreaded, single-issue, in-order datapath with a 1-cycle ALU, 2-cycle Mem, 3-cycle Fadd, 5-cycle Fmul, no branch prediction, and in-order fetch and commit. Consider the following instructions:

loop: LD   F0, 0(R1)
      FADD F3, F3, F2
      ADDI R1, R1, -8
      BNE  F0, F2, loop

What is the minimum number of threads necessary to fully utilize the datapath for each of the following strategies?

(a) Fixed switching: the CPU switches to a different thread every cycle in a round-robin fashion.

Answer: We need to fill the pipeline bubbles with useful instructions, so the general strategy is to look at the worst case: find where the pipeline bubbles occur and how many cycles they waste. That count determines the minimum number of threads needed to keep the pipeline busy. In general, bubbles can come from memory operations, control hazards, and data hazards. In the code above, the problematic instruction is the BNE, because we assume that there is no branch prediction: in the worst case the branch is not resolved until EX, so the IF and ID slots behind it become bubbles. Thus, we need at least 3 threads to replace the bubbles.

(b) Data-dependent switching: the CPU switches to a different thread when an instruction cannot proceed due to a dependency.

Answer: The general strategy is the same: consider the worst case and count how many threads are needed to fill the possible bubbles. There are two dependences: one between LD and ADDI on R1, and one between LD and BNE on F0. Neither leads to a stall in a simple pipeline, so more threads cannot improve performance under this strategy; a single thread already suffices.
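The arithmetic behind part (a) can be stated in a few lines of Python (an illustration only, assuming, as the answer does, that the branch resolves in EX, two stages after IF).

# With fixed round-robin switching over N threads, consecutive instructions
# of one thread enter the pipeline N cycles apart, so the N-1 slots behind a
# branch belong to other threads. Covering the IF and ID bubbles left by an
# unresolved BNE therefore needs N - 1 >= 2, i.e. N >= 3.
bubbles_behind_branch = 2  # IF and ID slots wasted until BNE resolves in EX

def min_threads(bubbles: int) -> int:
    return bubbles + 1

print(min_threads(bubbles_behind_branch))  # -> 3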