Computer Architecture and Engineering. CS152 Quiz #4 Solutions


Problem Q4.1: Out-of-Order Scheduling

Problem Q4.1.A

ROB    Decode   Issued    WB   Committed   OP      Dest   Src1   Src2
I1       -1        0       1       2       L.D      T0     R2     -
I2        0        2      12      13       MUL.D    T1     T0     F0
I3        1       13      15      16       ADD.D    T2     T1     F0
I4        2        3       5      17       ADDI     T3     R2     -
I5        3        4       6      18       L.D      T4     T3     -
I6        4        7      17      19       MUL.D    T5     T4     T4
I7        5       18      20      21       ADD.D    T6     T5     T2

Table Q4.1-1 (Decode/Issued/WB/Committed are cycle numbers)

Common mistakes: forgetting in-order commit, forgetting integer result bypassing, ignoring the single-ported ROB/register files, and not making issue wait both for writeback of the source operands and for decode to complete.

Problem Q4.1.B

ROB    Decode   Issued    WB   Committed   OP      Dest   Src1   Src2
I1       -1        0       1       2       L.D      T0     R2     -
I2        0        2      12      13       MUL.D    T1     T0     F0
I3        3       13      15      16       ADD.D    T0     T1     F0
I4       14       15      17      18       ADDI     T1     R2     -
I5       17       18      19      20       L.D      T0     T1     -
I6       19       20      30      31       MUL.D    T1     T0     T0
I7       21       31      33      34       ADD.D    T0     T1     F3

Table Q4.1-2 (Decode/Issued/WB/Committed are cycle numbers)

Common mistakes: forgetting in-order commit, and not reusing T0 and T1 appropriately.
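The reuse of T0 and T1 in Table Q4.1-2 falls out mechanically from renaming with a small free list: a physical register returns to the free list when the instruction that overwrites the same architectural destination commits. The following is a minimal sketch of that bookkeeping only, not the quiz machine itself; the two-register free list and the function names are illustrative assumptions chosen to mirror Table Q4.1-2.

    from collections import deque

    # Minimal rename-stage sketch: map architectural destinations to
    # physical registers drawn from a free list; recycle a physical
    # register when the instruction that overwrites the same
    # architectural register commits. (Assumption: only 2 rename
    # registers, T0 and T1, as in Table Q4.1-2.)
    free_list = deque(["T0", "T1"])
    rename_map = {}            # architectural reg -> current physical reg

    def rename(arch_dest):
        phys = free_list.popleft()          # decode stalls if empty
        prev = rename_map.get(arch_dest)    # freed when this instr commits
        rename_map[arch_dest] = phys
        return phys, prev

    def commit(prev):
        if prev is not None:
            free_list.append(prev)          # previous mapping is now dead

    # Example: two writes to the same architectural register F2.
    p1, prev1 = rename("F2")   # F2 -> T0, nothing to free yet
    p2, prev2 = rename("F2")   # F2 -> T1, T0 freed when this commits
    commit(prev2)              # returns T0 to the free list for reuse
    print(p1, p2, list(free_list))  # T0 T1 ['T0']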

Problem Q4.2: Fetch Pipelines

PC  F1  F2  D1  D2  RN  RF  EX

PC     - PC generation
F1, F2 - ICache access
D1, D2 - Instruction decode
RN     - Rename/Reorder
RF     - Register file read
EX     - Integer execute

Problem Q4.2.A Pipelining Subroutine Returns

Immediately after what pipeline stage does the processor know that it is executing a subroutine return instruction? D2

Immediately after what pipeline stage does the processor know the subroutine return address? RF

How many pipeline bubbles are required when executing a subroutine return? 6

Problem Q4.2.B Adding a BTB

A subroutine can be called from many different locations, so a single subroutine return can return to many different locations. A BTB holds only the address of the last caller.

Problem Q4.2.C Adding a Return Stack

Normally, instruction fetch must wait until the return instruction finishes the RF stage before the return address is known. With the return stack, as soon as the return instruction is decoded in D2, instruction fetch can begin fetching from the return address. This saves 2 cycles.

A return address is pushed after a JAL/JALR instruction is decoded in D2. A return address is popped after a JR r31 instruction is decoded in D2.
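As a concrete illustration of the Q4.2.C mechanism, here is a minimal return-stack sketch. The class name, the 8-entry depth, and the overflow policy are assumptions not given in the problem; the push/pop triggers are exactly the D2 events described above.

    # Minimal return-address-stack sketch for the Q4.2.C scheme.
    # Push when D2 decodes a JAL/JALR; pop when D2 decodes a JR r31.
    class ReturnStack:
        def __init__(self, depth=8):        # depth is an assumption
            self.stack, self.depth = [], depth

        def on_decode(self, opcode, pc):
            """Called at the end of D2 for every decoded instruction.
            Returns a predicted fetch target, or None (no redirect)."""
            if opcode in ("JAL", "JALR"):
                if len(self.stack) == self.depth:
                    self.stack.pop(0)        # overwrite oldest on overflow
                self.stack.append(pc + 1)    # return addr = next instruction
                return None
            if opcode == "JR r31" and self.stack:
                return self.stack.pop()      # predicted return address
            return None

    rs = ReturnStack()
    rs.on_decode("JAL", pc=0xA)                 # call: push 0xB
    print(hex(rs.on_decode("JR r31", pc=0xB)))  # return: predicts 0xb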

Problem Q4.2.D Return Stack Operation

A:   JAL B
A+1:
A+2:
B:   JR r31
B+1:
B+2:

instruction   time --->
A      PC F1 F2 D1 D2 RN RF EX
A+1       PC F1 F2 D1 (killed)
A+2          PC F1 F2 (killed)
A+3             PC F1 (killed)
A+4                PC (killed)
B                     PC F1 F2 D1 D2 RN RF EX
B+1                      PC F1 F2 D1 (killed)
B+2                         PC F1 F2 (killed)
B+3                            PC F1 (killed)
B+4                               PC (killed)
A+1                                  PC F1 F2 D1 D2 RN RF EX

A+1 through A+4 and B+1 through B+4 are fetched down the fall-through path and killed when the redirect occurs after the jump's D2 stage, so each redirect costs 4 bubbles (2 fewer than the 6 from Q4.2.A).

Problem Q4.2.E Handling Return Address Mispredicts

When a value is popped off the return stack after D2, it is saved for two cycles as part of the pipeline state. After the RF stage of the return instruction, the actual r31 is compared against the predicted return address. If the addresses match, then we are done. Otherwise, we mux the correct program counter in at the PC stage and kill the instructions in F1 and F2. Depending on how fast the address comparison is assumed to be, you might also kill the instruction in D1. So a return mispredict costs an additional 2 or 3 cycles.

Problem Q4.2.F Further Improving Performance

Ben should add a cache of the most recently encountered return instruction addresses. During F1, the contents of the cache are looked up to see if any entry matches the current program counter. If one does, then by the end of F1 (instead of D2) we know that we have a return instruction, and we can use the return stack to supply the return address.
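The Q4.2.E verify-and-recover step can be sketched as follows. The function name and the dict-of-stages pipeline representation are assumptions made for illustration; the compare-at-RF, redirect-at-PC, and kill-F1/F2 behavior is the one described above.

    # Sketch of the Q4.2.E check: the popped prediction rides with the
    # return instruction; at RF the actual r31 is compared against it,
    # and a mismatch redirects fetch and squashes wrong-path work.
    def check_return(predicted_pc, actual_r31, pipeline):
        if predicted_pc == actual_r31:
            return None                      # prediction correct: no action
        for stage in ("F1", "F2"):           # possibly also "D1"; see text
            pipeline[stage] = None           # squash wrong-path instruction
        return actual_r31                    # correct PC muxed in at PC stage

    pipeline = {"F1": "X+1", "F2": "X", "D1": "ret"}
    redirect = check_return(predicted_pc=0x40, actual_r31=0x80,
                            pipeline=pipeline)
    print(redirect, pipeline)  # 128 {'F1': None, 'F2': None, 'D1': 'ret'}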

Problem Q4.3: Bounds on ILP

Problem Q4.3.A Maximum ILP

The solution to this question comes from Little's Law. Every instruction in flight needs a separate destination physical register, unless it does not produce a result. We also need enough physical registers to map the committed architectural registers. We want to find the minimum number of physical registers that lets some piece of code saturate the machine's pipelines.

The machine cannot fetch and commit more than four instructions per cycle, so we pick the four lowest-latency functional units (i.e., the two integer and the two memory units) out of the six available, as these require the fewest instructions in flight to saturate the machine's pipelines. An instruction allocates a physical register in the decode stage and must then hold it through at least the issue stage, the execute stage(s), the writeback stage, and the commit stage; that is at least 3 cycles in addition to the number of execute stages.

An acceptable answer was:

minimum physical registers = 2 * (3 + 1)    // two integer instructions
                           + 2 * (3 + 2)    // two memory instructions
                           + # architectural registers
                           = 18 + # architectural registers

Answers based on this Little's Law argument received 6/7 points, minus 1 point if the number of architectural registers was not included.

We obtain a better solution if we realize that we only need to find one piece of code that saturates the pipelines, and that stores and branches do not need to allocate a destination register. Consider the following unrolled loop, which zeros a region of memory, written in MIPS assembly:

loop: sw    r0, 0(r1)
      sw    r0, 4(r1)
      bne   r1, r2, loop
      addiu r1, r1, 8     # in delay slot

This loop will reach a steady state of one iteration per cycle, and will saturate the machine (four instructions per cycle) while allocating only one physical register per cycle (and also freeing one physical register per cycle). The number of physical registers needed for this loop is only:

minimum physical registers = 4    // the addiu instructions in flight
                           + # architectural registers
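For reference, the Little's Law step in the first computation above can be written out as a throughput-times-residency identity; the rates and latencies are exactly the ones given above, so this is the same arithmetic made explicit:

    N = \sum_i \lambda_i T_i = 2 \times (3 + 1) + 2 \times (3 + 2) = 18

where \lambda_i is the sustained allocation rate of each functional-unit class (two integer and two memory instructions per cycle) and T_i is the number of cycles each destination physical register stays allocated.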

The loop-based answer received full credit (7/7). Another possible reading of the question was to find a piece of code with a different performance bottleneck, e.g., a data dependency, and to find the number of physical registers required for it. (This wouldn't actually saturate the machine's pipelines.) These answers received almost full credit, depending on the quality of the answer.

Common errors: Many students incorrectly thought that each instruction needs to allocate three physical registers, one for each source operand and one for the destination.

Problem Q4.3.B Maximum ILP with no FPU

The answer does not change, because the FPU would not be used in a solution that minimizes the number of physical registers needed. However, if part A had assumed code with a performance bottleneck that included the FPU, then a valid answer is that the number would change. This was given full credit.

Common errors: Most errors were due to not understanding the setup of the previous part, or to not explaining that the floating-point units are the longest-latency units and hence were avoided in part A.

Problem Q4.3.C Minimum Registers

There is no minimum, as we can always create a scenario where a larger number of physical registers would allow the instruction fetch/decode stages to run ahead and find an instruction that could execute in parallel with the preceding instructions. For example, an arbitrarily long serial dependency chain can be followed by an independent instruction:

add.d f1, f1, f2
add.d f1, f1, f2
...                 # many similar dependent instructions elided
add.d f1, f1, f2
sub   r1, r2, r3    # an independent instruction

Given that the fetch stage can proceed faster than the serially dependent instructions can execute, we can always find a case where more physical registers lead to faster execution, given an infinite ROB.

Problem Q4.3.D ROB Size

Yes. Stores and branches do not require new physical registers, and together they average around 10-25% of a typical instruction mix. So giving the ROB up to ~30% more entries than the difference between the physical and architectural register counts will tend to improve performance by allowing a larger active instruction window. Some examples could benefit from even larger ROBs, e.g., the code from part A.

Common errors: Many students assumed that every ROB entry needs a separate physical register (this incorrect answer received 1 point).
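To make the ~30% figure concrete (this is a worked reading of the numbers above, not a property of any particular machine): if a fraction f of instructions allocates no destination register, a window of W in-flight instructions consumes only (1 - f) W rename registers, so with P physical and A architectural registers the largest useful window is roughly

    W \approx \frac{P - A}{1 - f}

Taking f = 0.25, the top of the 10-25% range, gives W \approx 1.33 (P - A), i.e., about 30% more ROB entries than spare physical registers.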