
CS433: Computer Architecture
Spring 2006 Homework 3
Total Points: 49 (undergrad), 57 (graduate)
Due Date: Feb. 8, 2006 by 1:30 pm (see the course information handout for more details on late submissions)

Directions: Please read the course information handout for information on groups, collaboration, and other homework policies. At the top of the first page of your homework solutions, please write your name and NETID, your partner's name and NETID, and whether you are a 3-hour or 4-hour student. On each successive page, write your NETID. Please show all work that you used to arrive at your answer. Answers without justification will not receive credit.

On-campus students: please submit the homework in class. I2CS students: please submit the homework to cs433hw@ad.uiuc.edu, and write your name, NETID, and your partner's name and NETID on the first page of your homework solutions.

Problem 1 [25 Points]: Tomasulo and Hardware Speculation

For this problem, consider the following architecture specifications:

Functional Unit Type    Cycles in EX    Number of Functional Units
Integer                 1               1
FP Divider              15              1

Assume you have unlimited reservation stations. Memory accesses use the integer functional unit to perform effective-address calculation during the EX stage. For stores, memory is accessed during the EX stage (Tomasulo's algorithm without speculation) or the commit stage (Tomasulo's algorithm with speculation). All loads access memory during the EX stage. Loads and stores stay in EX for 1 cycle. Functional units are not pipelined. If an instruction moves to its WB stage in cycle x, then an instruction that is waiting on the same functional unit (due to a structural hazard) can start executing in cycle x+1. An instruction waiting for data on the CDB can move to its EX stage in the cycle after the CDB broadcast. Only one instruction can write to the CDB in one clock cycle. Branches and stores do not need the CDB.

Whenever there is a conflict for a functional unit or the CDB, assume that the oldest (by program order) of the conflicting instructions gets access, while the others are stalled. Assume that BNEQZ occupies the integer functional unit for its computation and spends one cycle in EX. Assume that the result from the integer functional unit is also broadcast on the CDB and forwarded to dependent instructions through the CDB (just like any floating-point instruction).

Part A [11 Points]

Complete the following table using Tomasulo's algorithm, but without assuming any hardware speculation on branches. That is, an instruction after a branch cannot issue until the cycle after the branch completes its EX. Assume a single-issue machine. Fill in the cycle numbers in each pipeline stage for each instruction in the first two iterations of the loop, assuming the branch is always taken. The entries for the first instruction in the first iteration are filled in for you. Explain the reason for any stalls.

Iteration 1:

Instruction               Issue   EX   WB   Reason for Stalls
LP: L.D   F0, 0(R1)       1       2    3
    DIV.D F2, F0, F6
    DIV.D F6, F6, F2
    DADDI R1, R1, #-3
    BNEQZ R1, LP

Iteration 2:

Instruction               Issue   EX   WB   Reason for Stalls
LP: L.D   F0, 0(R1)
    DIV.D F2, F0, F6
    DIV.D F6, F6, F2
    DADDI R1, R1, #-3
    BNEQZ R1, LP

Part B [14 Points]

Complete the following table using Tomasulo's algorithm, but this time assume hardware speculation and dual issue. That is, assume that an instruction can issue even before the branch has completed (or started) its execution (as with perfect branch and

target prediction). However, assume that an instruction after a branch cannot issue in the same cycle as the branch; the earliest it can issue is in the cycle immediately after the branch (to give time to access the branch history table and/or buffer). Any other pair of instructions can issue in the same cycle (assuming all necessary dependences are satisfied). Additionally, assume that you have as large a reorder buffer as you need. Further, two instructions can commit each cycle. Recall also that stores only calculate target addresses in EX, perform memory accesses during the commit stage, and do nothing during the WB stage. Fill in the cycle numbers in each pipeline stage for each instruction in the first two iterations of the loop, assuming the branch is always taken. The entries for the first instruction of the first iteration are filled in for you. CMT stands for the commit stage. Explain the reason for any stalls.

Iteration 1:

Instruction               Issue   EX   WB   CMT   Reason for Stalls
LP: L.D   F0, 0(R1)       1       2    3    4
    DIV.D F2, F0, F6
    DIV.D F6, F6, F2
    DADDI R1, R1, #-3
    BNEQZ R1, LP

Iteration 2:

Instruction               Issue   EX   WB   CMT   Reason for Stalls
LP: L.D   F0, 0(R1)
    DIV.D F2, F0, F6
    DIV.D F6, F6, F2
    DADDI R1, R1, #-3
    BNEQZ R1, LP

Problem 2: Dynamic Branch Prediction [12 points]

Consider the following MIPS code. The register r0 is always 0.

    daddi r2, r0, #2

L1: daddi r1, r0, #4
L2: dsubi r1, r1, #1
    bnez  r1, L2    -- Branch 1
    dsubi r2, r2, #1
    bnez  r2, L1    -- Branch 2

Each table below refers to only one branch. For instance, Branch 1 will be executed 8 times; those 8 executions should be recorded in the table for Branch 1. Similarly, Branch 2 is executed only 2 times.

Part A [4 points]

Assume that 1-bit branch predictors are used. When the processor starts to execute the above code, both predictors contain the value N (not taken). What is the number of correct predictions? Use the following tables to record the prediction and action of each branch. The first entry is filled in for you.

Branch 1:

Step   Predicted Branch 1   Actual Branch 1
1      N                    T
2
3
4
5
6
7
8

Branch 2:

Step   Predicted Branch 2   Actual Branch 2
1      N                    T
2

Part B [4 Points]

Now assume that 4-bit saturating counters are used. When the processor starts to execute the above code, both counters contain the value 7. What is the number of correct predictions? Use the following tables to record the prediction and action of each branch. The first entry is filled in for you.
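The bookkeeping in these tables is mechanical, so it can be cross-checked with a few lines of code. The sketch below is my own illustration of the standard update rules for the two predictor types in Parts A and B; the outcome strings in the examples are made up and are not the outcome sequences of the loop above.

```python
# Sketch of a 1-bit predictor and an n-bit saturating counter (illustration
# only; the outcome strings used below are arbitrary examples).

def simulate_1bit(outcomes, init='N'):
    """1-bit predictor: always predicts the last observed outcome."""
    state, correct = init, 0
    for actual in outcomes:            # outcomes is a string like 'TTNT'
        if state == actual:
            correct += 1
        state = actual                 # overwrite with the latest outcome
    return correct

def simulate_saturating(outcomes, bits=4, init=7):
    """n-bit saturating counter: predict T iff counter >= 2**(bits-1)."""
    counter, correct = init, 0
    top = (1 << bits) - 1              # saturation value, 15 for 4 bits
    for actual in outcomes:
        prediction = 'T' if counter >= (1 << (bits - 1)) else 'N'
        if prediction == actual:
            correct += 1
        # move toward 'taken' on T, toward 'not taken' on N, saturating
        counter = min(counter + 1, top) if actual == 'T' else max(counter - 1, 0)
    return correct
```

Note that a 4-bit counter initialized to 7 (binary 0111) is just below the taken threshold of 8, so its first prediction is N, which is consistent with the filled-in first row of the Part B tables.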

Branch 1:

Step   Counter Value   Predicted Branch 1   Actual Branch 1
1      0111            N                    T
2
3
4
5
6
7
8

Branch 2:

Step   Counter Value   Predicted Branch 2   Actual Branch 2
1      0111            N                    T
2

Part C [4 points]

Now assume that two-level correlating predictors of the form (2,1) are used (assume global history, i.e., similar to the correlating-predictor example discussed in class). When the processor starts to execute the above code, the outcome of the previous two branches is not taken (N). Also assume that the initial state of the predictors of all branches is not taken (N). What is the number of correct predictions? Use the following table to record your steps. Record the "New State" of the predictors in the form W/X/Y/Z, where:

W - the state for the case where the last branch and the branch before the last are both TAKEN
X - the state for the case where the last branch is TAKEN and the branch before the last is NOT TAKEN
Y - the state for the case where the last branch is NOT TAKEN and the branch before the last is TAKEN
Z - the state for the case where the last branch and the branch before the last are both NOT TAKEN

The first entry is filled in for you.

Branch 1:

Step   Predicted Branch 1   Actual Branch 1   New State
1      N                    T                 N/N/N/T
2
3
4
5

6
7
8

Branch 2:

Step   Predicted Branch 2   Actual Branch 2   New State
1      N                    T                 N/N/T/N
2

Problem 3 [8 points]

Assume you have a machine with a 9-stage in-order pipeline with the following stages:

F1  F2  D  R  A1  A2  M1  M2  WB

F1 starts the fetch. If the instruction is a branch, predict whether it is taken or not; if taken, predict the target address. The branch target address is available at the end of this stage. The branch condition is resolved later in the pipeline.

Part A [4 points]

What is the penalty in cycles for a branch whose outcome is mispredicted?

Part B [4 points]

What is the penalty when a branch is correctly predicted as taken, but the branch target address is incorrectly predicted?

Problem 4 [4 points]

Suppose we have a deeply pipelined processor for which we implement a branch-target buffer (BTB) for conditional branches. Assume that the misprediction penalty is always 4 cycles and the buffer miss penalty is always 3 cycles. Assume 85% prediction accuracy and a 20% frequency of conditional branches. What should the hit rate in the branch-target buffer be for this processor to be faster than a processor that has a fixed 2-cycle branch penalty? (Assume that the two processors are equivalent in every other respect.) Assume a

base CPI of 1 when there are no conditional branch stalls.

GRADUATE PROBLEM

Problem 5 [8 points]

This problem concerns the implications of the reorder buffer size on performance. Consider a processor implementing Tomasulo's algorithm with reservation stations and the reorder buffer scheme described in detail in the lecture notes. Assume infinite processor resources unless stated otherwise; e.g., infinite execution units and infinite reservation stations. Assume a perfect branch predictor, and assume there are no data dependences in the instruction stream we are considering. Assume the maximum instruction fetch rate is 12 instructions per cycle. (The other stages in the pipeline have no constraints; e.g., the processor can decode an unbounded number of instructions per cycle.)

Part (A): (2 points)

Suppose all instructions take one cycle to execute and the processor has an infinite reorder buffer. What is the average instructions-per-cycle rate (IPC) for this processor?

Part (B): (2 points)

Consider the system in Part (A), except that now every 49th instruction is a load that misses in the cache, and the miss latency is 500 cycles. What is the average IPC for this processor?

Part (C): (4 points)

Consider the system in Part (B), except that now the reorder buffer size is 48 entries. What is the average IPC for this processor? If the IPC is less than 12, then what is the smallest reorder buffer size for which the IPC will be 12 again? (Assume the reorder buffer size can only be a multiple of 12.)
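Returning to Problem 4, the comparison between the two machines can be sanity-checked with a simple CPI accounting. The penalty model in this sketch is my own assumption (BTB hits pay the misprediction penalty whenever the prediction is wrong; BTB misses always pay the miss penalty); the default parameter values mirror the figures stated in the problem.

```python
# Sketch of a simplified CPI comparison for Problem 4 (the per-branch
# penalty model below is an assumption, not the required derivation).

def cpi_with_btb(hit_rate, accuracy=0.85, branch_freq=0.20,
                 mispred_penalty=4, miss_penalty=3, base_cpi=1.0):
    """Average CPI for the machine with a branch-target buffer."""
    per_branch = (hit_rate * (1 - accuracy) * mispred_penalty
                  + (1 - hit_rate) * miss_penalty)
    return base_cpi + branch_freq * per_branch

def cpi_fixed_penalty(penalty=2, branch_freq=0.20, base_cpi=1.0):
    """Average CPI for the comparison machine with a fixed branch penalty."""
    return base_cpi + branch_freq * penalty
```

Under this model a perfect BTB beats the fixed-penalty machine (1.12 vs. 1.40 CPI) and a BTB that always misses loses to it (1.60 CPI), so the break-even hit rate lies strictly between 0 and 1; sweeping hit_rate locates it.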