Print: First Name: ......... SOLUTION ............... Last Name: ............................. Student Number: ...............................................

University of Toronto
Faculty of Applied Science and Engineering
Midterm Examination, November 3, 2014
ECE552F Computer Architecture
Examiner: Natalie Enright Jerger

1. There are 4 questions and 8 pages. Do all questions. The total number of marks is 48. The duration of the test is 50 minutes.
2. ALL WORK IS TO BE DONE ON THESE SHEETS! Use the back of the pages if you need more space. Be sure to indicate clearly if your work continues elsewhere.
3. Please put your final solution in the box if one is provided.
4. Clear and concise answers will be considered more favourably than ones that ramble. Do not fill space just because it exists!
5. You may use a single 8.5" x 11" aid sheet.
6. You may use faculty-approved non-programmable calculators.
7. Always give some explanation or reasoning for how you arrived at your solutions, to help the marker understand your thinking.
8. State your assumptions and show your work. Use your time wisely, as not all questions will require the same amount of time. If you think that assumptions must be made to answer a question, state them clearly. If there are multiple possibilities, note that there are, explain why, and then provide at least one possible answer and state the corresponding assumptions.
9. Only exams written in pen can be considered for remarking.

Page 1 of 8

This page is for grading purposes only. The marks breakdown is given for each question.

Question 1 [11]   Question 2 [19]   Question 3 [8]   Question 4 [10]   Total [48]

1. Pipelining

[3 marks] (a) Branches represent 30% of dynamic instructions. Branches are statically predicted not-taken, and 40% of branches are taken. Loads make up 30% of dynamic instructions, and 65% of loads are followed immediately by a dependent ALU instruction in the dynamic instruction sequence. Consider a 4-stage pipeline where the Execute and Memory Access stages have been combined into one stage (called XM). The branch outcome is known at the end of the XM stage. Full forwarding exists in this pipeline. Calculate the CPI for this pipeline implementation.

Because X and M are combined, a load's result can be forwarded from the end of XM directly into the next instruction's XM stage, so load-use pairs cause no stalls; only mispredicted (taken) branches cost 2 cycles each.

CPI = 1 + 0.3 × 0.4 × 2 = 1.24
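The arithmetic can be checked with a short sketch (variable names are illustrative, not part of the question):

```python
# CPI for the 4-stage pipeline of part (a): only taken branches stall
# (2-cycle penalty, resolved at the end of XM); forwarding from XM to XM
# removes all load-use stalls in this design.
branch_frac = 0.30      # fraction of dynamic instructions that are branches
taken_frac = 0.40       # fraction of branches that are taken (mispredicted)
branch_penalty = 2      # cycles lost per taken branch

cpi = 1 + branch_frac * taken_frac * branch_penalty
print(cpi)  # 1.24
```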

[8 marks] (b) This part of the question assumes the typical 5-stage in-order pipeline used in class (F, D, X, M, W). The table below gives the fraction of instructions that have a particular type of RAW data dependence. The type of RAW data dependence is identified by the stage that produces the result (X or M) and the instruction that consumes the result (the 1st instruction that follows the one that produces the result, the 2nd instruction that follows, or both). Assume that the register write is done in the first half of the clock cycle and the register read is done in the second half of the cycle.

  X to 1st   M to 1st   X to 2nd   M to 2nd   X to 1st       Other RAW
  Only       Only       Only       Only       and M to 2nd   Dependences
  5%         20%        5%         10%        10%            0%

Let us assume that we cannot afford the three-input muxes that are needed for full forwarding. We have to decide if it is better to implement MX forwarding or WX forwarding. Which of the two options results in fewer data stall cycles? You must show your work to justify your answer. An answer without any justification will not receive full marks.

WX forwarding is better: it leads to 0.35 stall cycles per instruction, while MX forwarding leads to 0.65. To receive full marks, the answer should enumerate the stall conditions and/or calculate the stall cycles.

                          MX forwarding        WX forwarding
Case 1 (X to 1st):        F D X M W            F D X M W
                          F D X M W            F D d* X M W
Case 2 (M to 1st):        F D X M W            F D X M W
                          F d* d* D X M W      F D d* X M W
Case 3 (X to 2nd):        F D X M W            F D X M W
                          F D X M W            F D X M W
                          F d* D X M W         F D X M W
Case 4 (M to 2nd):        F D X M W            F D X M W
                          F D X M W            F D X M W
                          F d* D X M W         F D X M W
Case 5 (X to 1st          F D X M W            F D X M W
  and M to 2nd):          F D X M W            F D d* X M W
                          F d* D X M W         F p* D X M W

MX: 0 × 0.05 + 2 × 0.2 + 1 × 0.05 + 1 × 0.1 + 1 × 0.1 = 0.65
WX: 1 × 0.05 + 1 × 0.2 + 0 × 0.05 + 0 × 0.1 + 1 × 0.1 = 0.35
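The weighted totals can be double-checked with a small sketch; the per-case penalty counts come from the pipeline diagrams above, and the dictionary names are illustrative:

```python
# Expected data-stall cycles per instruction for each forwarding option:
# sum over dependence cases of (case frequency) x (stall cycles in that case).
freqs = {"X->1st": 0.05, "M->1st": 0.20, "X->2nd": 0.05,
         "M->2nd": 0.10, "X->1st & M->2nd": 0.10}

# Penalties read off the MX and WX pipeline diagrams.
mx_penalty = {"X->1st": 0, "M->1st": 2, "X->2nd": 1,
              "M->2nd": 1, "X->1st & M->2nd": 1}
wx_penalty = {"X->1st": 1, "M->1st": 1, "X->2nd": 0,
              "M->2nd": 0, "X->1st & M->2nd": 1}

mx = sum(freqs[c] * mx_penalty[c] for c in freqs)
wx = sum(freqs[c] * wx_penalty[c] for c in freqs)
print(mx, wx)  # 0.65 0.35 -> WX forwarding is better
```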

2. Dynamic Scheduling

[16 marks] (a) Consider a MIPS R10K processor with the following execution latencies and one reservation station/functional unit for each type of instruction:

- 5 cycles for an add
- 12 cycles for a multiply
- 20 cycles for a divide

Suppose a segment of the program contains the following four instructions:

i1: DIV  R3, R5 -> R2
i2: ADD  R1, R4 -> R3
i3: MULT R2, R6 -> R4
i4: ADD  R2, R3 -> R4

Considering the MIPS R10K implementation, if i1 is issued at cycle 0, in what cycle will each instruction complete and retire? The ROB size is infinite, as is the number of physical registers.

Assumptions: Multiple instructions can issue in the same cycle (i3, i4). An instruction can begin execution in the cycle after the instruction it depends on broadcasts its tag on the CDB.

     Complete   Retire
i1   21         22
i2   7          23
i3   34         35
i4   27         36

[3 marks] (b) Explain how WAW hazards are avoided in Tomasulo's algorithm.

WAW hazards are avoided through the register map table. When an instruction writes the CDB, it updates the register file only if the ID in the map table still matches its own ID; if a later instruction writing the same register has since been renamed into the map table, the earlier instruction does not update the register file. The answer must be more detailed than just "register renaming".
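The completion and retire cycles in the table can be reproduced with a small dependence-tracking sketch. This is a simplified model of the stated assumptions (execution starts the cycle after operands are broadcast, complete = start + latency, in-order retirement of one instruction per cycle), not a full R10K simulation; all names are illustrative:

```python
# Per-operation execution latencies from the question.
latency = {"DIV": 20, "ADD": 5, "MULT": 12}

# (name, op, source regs, dest reg, issue cycle); i1 issues at cycle 0,
# i2 at cycle 1, i3 and i4 together at cycle 2 (per the assumptions).
prog = [("i1", "DIV",  ["R3", "R5"], "R2", 0),
        ("i2", "ADD",  ["R1", "R4"], "R3", 1),
        ("i3", "MULT", ["R2", "R6"], "R4", 2),
        ("i4", "ADD",  ["R2", "R3"], "R4", 2)]

ready = {}                 # reg -> completion (CDB) cycle of its latest producer
complete, retire = {}, {}
last_retire = -1
for name, op, srcs, dst, issue in prog:
    # Execution starts the cycle after issue and after all operands broadcast.
    start = 1 + max([issue] + [ready.get(s, 0) for s in srcs])
    complete[name] = start + latency[op]
    ready[dst] = complete[name]        # renaming: later readers see this value
    # In-order retirement, one per cycle, earliest the cycle after completion.
    last_retire = max(complete[name] + 1, last_retire + 1)
    retire[name] = last_retire
print(complete, retire)
```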

[8 marks] 3. Consider two possible improvements to a processor design. The first improvement can speed up floating-point arithmetic instructions by a factor of 8. The second improvement can speed up load and store instructions by a factor of 3. Let F_fp and F_ls be the fractions of execution time spent on floating-point and load/store instructions respectively. The executions of these two sets of instructions are non-overlapping in time. What should the relation be between the fractions F_fp and F_ls such that a machine built with the first improvement outperforms a machine built with the second improvement?

By Amdahl's law, the first machine wins when

    1 / ((1 - F_fp) + F_fp/8)  >  1 / ((1 - F_ls) + F_ls/3)

i.e. when (1 - F_fp) + F_fp/8 < (1 - F_ls) + F_ls/3, which simplifies to (7/8) F_fp > (2/3) F_ls, or

    F_fp / F_ls > 16/21
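The 16/21 threshold can be sanity-checked numerically. The helper `speedup` and the particular value of F_ls below are illustrative choices, not part of the question:

```python
# Amdahl's law: overall speedup when a fraction `frac` of execution time
# is accelerated by `factor`.
def speedup(frac, factor):
    return 1.0 / ((1.0 - frac) + frac / factor)

f_ls = 0.4                       # arbitrary load/store fraction for the check
threshold = 16.0 / 21.0

# Exactly at F_fp / F_ls = 16/21 the two machines tie; just above the
# threshold the FP improvement wins, just below it the load/store one wins.
tie = abs(speedup(threshold * f_ls, 8) - speedup(f_ls, 3)) < 1e-12
fp_wins_above = speedup(threshold * 1.01 * f_ls, 8) > speedup(f_ls, 3)
ls_wins_below = speedup(threshold * 0.99 * f_ls, 8) < speedup(f_ls, 3)
print(tie, fp_wins_above, ls_wins_below)  # True True True
```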

4. Branch Prediction

Consider the following code sequence. Assume that each instruction is encoded in one 32-bit word.

Address   Instruction
0x0038    L3: ...
0x003C        ...
0x0040    SUBI R1, 2 -> R3
0x0044    BNEZ R3, L1
0x0048    ADD R0, R0 -> R1
0x004C    L1: SUBI R2, 2 -> R3
0x0050    BNEZ R3, L2
0x0054    ADD R0, R0 -> R2
0x0058    L2: SUB R1, R2 -> R3
0x005C    BEQZ R3, L3

[4 marks] (a) Show the contents of a 4-entry branch target buffer (BTB) after one execution of the code starting at PC = 0x0040. Assume the BTB is initially empty. Discard the lowest-order PC bits that never change and use the next set of bits to index into the BTB. Assume the initial register values are such that every branch is taken. Each entry can hold 16 bits of information.

Entry # (index)   Contents
0                 0x0058
1                 0x004C
2                 (empty)
3                 0x0038
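The BTB indices can be checked mechanically. A sketch of the indexing rule from the question (drop the two low PC bits that never change for word-aligned instructions, use the next two bits as the index):

```python
# Taken branches from one pass over the code: branch PC -> branch target.
branches = {0x0044: 0x004C,   # BNEZ R3, L1
            0x0050: 0x0058,   # BNEZ R3, L2
            0x005C: 0x0038}   # BEQZ R3, L3

btb = {}
for pc, target in branches.items():
    index = (pc >> 2) & 0x3   # discard 2 low bits, next 2 bits index 4 entries
    btb[index] = target
print({k: hex(v) for k, v in sorted(btb.items())})
# entry 0 -> 0x0058, entry 1 -> 0x004C, entry 3 -> 0x0038; entry 2 unused
```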

[6 marks] (b) Consider the case where we use a global branch direction predictor with a 3-bit global history register. Execution of the 13th iteration of the code on the previous page is about to start. Provide an example of i) a feasible value of the 3-bit global branch history, and ii) an infeasible value of the global branch history. To receive marks, you must justify your answers. Simply writing some combination of N and T values without any explanation is not sufficient. Other answers are possible, provided correct/sufficient justification is given.

i. Feasible 3-bit global branch history: Assume R1 initially has the value 2 and R2 initially has the value 2. The first two branches will be not taken and the third branch will be taken, leading to a feasible global history of NNT.

ii. Infeasible global branch history: From the previous answer, we can see that if both of the first two branches are not taken, then both ADD instructions execute, leaving R1 = R2 = 0, so the third branch (BEQZ on R1 - R2 = 0) can never be not taken. An infeasible history is therefore NNN.
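The feasibility argument can be checked by simulating one iteration of the loop. This is a sketch under the assumptions that R0 always holds 0 (so ADD R0, R0 -> Rx clears Rx) and that the chosen initial R1/R2 values are the ones assumed above; the function name is illustrative:

```python
# Simulate one iteration of the three-branch loop and return its
# taken/not-taken outcomes as a string of T/N characters.
def one_iteration(r1, r2):
    hist = []
    r3 = r1 - 2                                # SUBI R1, 2 -> R3
    hist.append("T" if r3 != 0 else "N")       # BNEZ R3, L1
    if r3 == 0:
        r1 = 0                                 # ADD R0, R0 -> R1 (fall-through)
    r3 = r2 - 2                                # SUBI R2, 2 -> R3
    hist.append("T" if r3 != 0 else "N")       # BNEZ R3, L2
    if r3 == 0:
        r2 = 0                                 # ADD R0, R0 -> R2 (fall-through)
    r3 = r1 - r2                               # SUB R1, R2 -> R3
    hist.append("T" if r3 == 0 else "N")       # BEQZ R3, L3
    return "".join(hist)

print(one_iteration(2, 2))  # NNT: feasible history from part i.
# NNN never occurs: two not-taken branches force R1 = R2 = 0, so BEQZ is taken.
```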