EECS 470 Midterm Exam Winter 2009

Name:
unique name:

Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so.

Scores:
# | Points
1 | / 18
2 | / 12
3 | / 29
4 | / 21
5 | / 20
Total | / 100
Bonus | / 3

NOTES:
Closed book/notes.
Calculators are allowed, but no PDAs, Portables, Cell phones, etc.
Don't spend too much time on any one problem.
You have about 90 minutes for the exam (avg. 18 minutes per problem).
There are 9 pages including this one. Please ensure you have all pages.
Be sure to show work and explain what you've done when asked to do so.

1) Short answer [18 points]

a) Does pipelining improve latency or throughput? [3 points]
Answer: Throughput.

b) Give 2 reasons why we don't build processors with massive register files (e.g., tens of thousands of registers). [6 points]
Answer: Requires too many bits in opcode. Register file accesses would be extremely slow.

c) Give an example of a microprocessor component that exploits locality. [3 points]
Answer: Cache.

d) Most instruction sets include both PC-relative branches and branch-to-register instructions.

i) State an advantage of PC-relative branches and give an example of a software construct where they are used. [3 points]
Answer: PC-relative takes fewer bits in opcode; allows relocatable code. Used for loops.

ii) State an advantage of branch-to-register instructions and give an example of a software construct where they are used. [3 points]
Answer: Allows branching arbitrary distances; allows dynamic targets and lookup tables for targets. Used for function calls, virtual functions, switch statements, function pointers.

2) Performance Analysis [12 points]

a) Suppose 80% of a program is parallelizable (performance scales linearly with the number of cores), while the other 20% is serial (must run on one core). What is the speedup over a uniprocessor when running the program on a quad-core machine? [6 points]
Answer: 1 / ((1 - 0.8) + (0.8 / 4)) = 2.5

b) Suppose you run two applications one after the other on a Core 2 Duo. The two applications contain the same number of instructions. The first application runs at 4 instructions per cycle (IPC), while the IPC of the second application is 2. What is the overall average IPC? [6 points]
Answer: 2 / (¼ + ½) = 8/3 ≈ 2.67
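Both answers to problem 2 can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names `amdahl_speedup` and `average_ipc` are made up for this example, and it assumes the quad-core count of 4 and the first application's IPC of 4 that the answers imply.

```python
from fractions import Fraction

def amdahl_speedup(parallel_fraction, cores):
    """Speedup over one core when parallel_fraction of the work scales linearly."""
    serial_fraction = 1 - parallel_fraction
    return 1 / (serial_fraction + parallel_fraction / cores)

def average_ipc(ipcs):
    """Overall IPC for programs with equal instruction counts run back to back.

    Each program contributes N instructions and N / ipc cycles, so the
    overall IPC is the harmonic mean of the individual IPCs.
    """
    return len(ipcs) / sum(1 / ipc for ipc in ipcs)

# Part (a): 80% parallel on 4 cores (core count assumed from "quad-core").
speedup = amdahl_speedup(Fraction(4, 5), 4)
# Part (b): IPC 4 for the first application (assumed), IPC 2 for the second.
overall = average_ipc([Fraction(4), Fraction(2)])

print(float(speedup), float(overall))
```

Exact fractions are used so the results compare cleanly against 5/2 and 8/3; a plain arithmetic mean of the two IPCs would give 3, which overstates the combined rate because the slower application occupies more cycles.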

3) Reorder Buffers in the P6 microarchitecture [29 points]

c) Briefly explain the purpose of a reorder buffer. [3 points]
Answer: Enables precise state for speculation/exceptions.

d) What effect does a reorder buffer have on performance? [3 points]
Answer: It reduces performance because it introduces a new structural hazard.

e) Draw a diagram showing the contents of a single reorder buffer (ROB) entry for a P6-like microarchitecture (i.e., one having an architectural register file). Identify all the fields stored within the entry and label the width (in bits) of each. Don't forget to include any "instruction status" bits used by any pipeline stages. Assume a 32-bit machine with 32 architectural registers, a 64-entry ROB (you only need to draw one entry) and 16 reservation stations. [5 points]
Answer (field: width in bits, with grading points in parentheses):
Value: 32 (1)
PC and/or calculated target: 32 (1)
Dest Reg Name: 5 (1)
Executed: 1 (1)
Exception/Mispredict: 1 (1)
(optional: opcode)
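The field widths in the ROB entry follow directly from the machine parameters. The sketch below recomputes them as a sanity check. It assumes a 64-entry ROB, which is consistent with the 6-bit tags used elsewhere in the answer key; the dictionary layout and names such as `rob_entry_bits` are illustrative only, not part of the exam.

```python
ARCH_REGS = 32     # architectural registers, from the problem statement
ROB_ENTRIES = 64   # assumed ROB size (consistent with 6-bit tags)
WORD_BITS = 32     # 32-bit machine

def index_bits(count):
    """Bits needed to name one of `count` items."""
    return (count - 1).bit_length()

# Field names are hypothetical; the widths follow the answer key.
rob_entry_bits = {
    "value": WORD_BITS,
    "pc_or_target": WORD_BITS,
    "dest_reg": index_bits(ARCH_REGS),
    "executed": 1,
    "exception_or_mispredict": 1,
}

rob_tag_bits = index_bits(ROB_ENTRIES)      # width of a tag that names a ROB entry
entry_total = sum(rob_entry_bits.values())  # bits per entry, without the optional opcode

print(rob_entry_bits, rob_tag_bits, entry_total)
```

The tag width is not a field of the entry itself. It sizes the ports that carry ROB entry numbers, such as the tags broadcast on the CDB.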

f) Finish the drawing below to indicate the input and output ports of the ROB module from part (c) for a 2-wide superscalar machine. For each port, label its width and indicate during which pipeline stage the port is used (assume the P6 pipeline stages discussed in class: Fetch, Decode, Issue, Execute, Complete, Retire). Assume that the head and tail pointers are maintained within the ROB module, and that the ROB does not support early branch resolution. (Note: this problem is significantly harder than it looks at first glance; think carefully about all the signals required to get instructions in and out of the ROB. I suggest doing this problem last.) [18 points]

Each port is listed as: name, (width in bits), pipeline stage, (grading points).

Inputs:
Dispatch Enable (2) - Dispatch (0.5)
Dest Register x2 (5) - Dispatch (2)
PC x2 - Dispatch (0.5)
(optional: opcode - Dispatch)
(optional: retire enable x2)
(optional: squash)
CDB Value x2 (32) - Complete (2)
CDB Tag x2 (6) - Complete (1)
CDB Write Enable x2 (0.5)
CDB Exception/Mispredict x2 - Complete (0.5)
Source Operand Tag x4 (6) - Dispatch (1)
(optional: Clock)

Outputs:
Full - Dispatch (1)
Almost Full - Dispatch (0.5)
Source Op. Value x4 (32) - Dispatch (2)
Correct PC (32) - Retire (0.5)
Retirement Value x2 (32) - Retire (2)
Retirement Register x2 (5) - Retire (2)
Head complete bits x2 - Retire (0.5)
Head Mispredict/Exception x2 - Retire (0.5)
Tail pointer / next tag - Dispatch (1)
(optional: head pointer)

Bonus) In 1-2 sentences, explain how a history buffer differs from a reorder buffer. [+3 bonus]

4) Handling RAW Memory Dependences. [20 points]

a) Consider the following sequence of load and store instructions (the first operand contains the address for the load or store, the second is the source/destination register for the value):

(1) store [r1], r16
(2) store [r2], r17
(3) load [r4], r18
(4) store [r5], r19
(5) load [r6], r20

i) Explain the necessary/sufficient conditions to execute instruction #5 non-speculatively. [3 points]
Answer: Stores (1), (2), (4) have calculated their addresses; load (5)'s address can be calculated (its input registers are ready).

ii) Suppose we want to issue load (5) earlier, speculatively. What are the conditions to issue the load to the memory system? [3 points]
Answer: Its address has been calculated.

iii) What, precisely, are we speculating? (I.e., what is the hardware guessing about the values of the registers accessed by the loads and stores?) [3 points]
Answer: That r6 is different from r5, r2, r1.

iv) Describe a sequence of events where load #5 is issued speculatively, but the speculation fails (i.e., the conditions you specify in your answer to (iii) turn out to be false). [3 points]
Answer: r5 resolves, the load issues, then r2 resolves to the same value as r6.

v) What does the processor core have to do to fix the mis-speculation? [3 points]
Answer: Squash and re-execute load 5 and all subsequent instructions.

b) A Memory Dependence Predictor is a piece of hardware that tries to reduce the frequency of mis-speculation events like the scenario you described in your answer to (a.iv). You can think of the predictor as a black box that takes in some information about a load instruction and tries to guess if the load should execute speculatively or not. Internally, the predictor stores some state about what it has observed in the past. Propose a high-level design for a memory dependence predictor. In particular, describe what the inputs to your predictor black box are and what state it contains. Briefly argue why your design will provide high prediction accuracy and require reasonable resources. [6 points]
Answer: Save the PCs of load instructions that receive their value via forwarding in the table. Predict that the load should not execute speculatively if it has an entry in the table.
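The predictor described in the answer to (b) can be sketched in a few lines. This is only one possible reading of that answer, not the exam's reference design: the direct-mapped organization, the 64-entry table size, the 4-byte instruction assumption, the class name `MemDepPredictor`, the helper `conflicts`, and the PC and address values are all assumptions made for illustration.

```python
class MemDepPredictor:
    """Direct-mapped table of load PCs that previously depended on an older store."""

    def __init__(self, entries=64):          # table size is an assumption
        self.entries = entries
        self.table = [None] * entries

    def _index(self, pc):
        return (pc >> 2) % self.entries      # assumes 4-byte instructions

    def may_speculate(self, load_pc):
        """True if the load may issue before older store addresses resolve."""
        return self.table[self._index(load_pc)] != load_pc

    def train(self, load_pc):
        """Remember a load that got its value from an older store."""
        self.table[self._index(load_pc)] = load_pc


def conflicts(load_addr, older_store_addrs):
    """Check made once all older store addresses are known."""
    return load_addr in older_store_addrs


pred = MemDepPredictor()
LOAD5_PC = 0x1010                            # hypothetical PC of load (5)

# First encounter: no history, so the load issues speculatively.
first = pred.may_speculate(LOAD5_PC)
# Later a store address (say r2) resolves to the load's address (made-up values).
if first and conflicts(0x80, [0x40, 0x80, 0xC0]):
    pred.train(LOAD5_PC)                     # squash, re-execute, and remember the PC
# Next encounter: the predictor holds the load back.
second = pred.may_speculate(LOAD5_PC)
print(first, second)
```

Indexing by load PC keeps the state small (one tag per entry) and relies on the observation that a given static load tends to behave the same way each time it executes. Two loads that map to the same entry simply evict each other, which costs accuracy but never correctness, because the address check still catches any violation.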

5) MIPS R10K Microarchitecture. [20 points]

On the next pages, you will find a set of charts showing a snapshot of a MIPS R10K-like microarchitecture after one cycle executing a sequence of instructions. You must advance this machine 5 additional clock cycles (to the end of cycle #6). Use the cycle-by-cycle state tables to record the contents of each hardware structure at the end of each clock cycle.

Assume the following:

Assume the machine is a 2-wide superscalar (i.e., it can issue, complete, and retire at most 2 instructions per cycle). If there are conflicts among instructions, the machine always selects the oldest instructions first.

Ignore the fetch stage. Assume all instructions have been fetched and are ready for dispatch whenever the out-of-order core allows.

This machine has 4 architectural registers, a 5-entry ROB, 4 reservation stations, and 9 physical registers.

There are 2 add functional units with a 1-cycle latency, and 1 fully-pipelined multiply functional unit with a 2-cycle latency (fully-pipelined means the multiply unit has 2 pipeline stages; it can issue a new multiply each cycle, however, multiplies take 2 cycles to execute).

Assume there is no bypassing in the X stage, but C will bypass to S through the physical register file.

Assume reservation stations are freed as early as possible and can be reused as soon as they are freed.

Note that there are 6 instructions, but the cycle-by-cycle tables only have space for the 5 ROB entries. Be sure to wrap back to the top of the ROB if you dispatch the 6th instruction.

Here is the instruction sequence:

(1) R3 = R1 * R2
(2) R4 = R4 * 10
(3) R1 = R3 + R2
(4) R2 = R2 + 5
(5) R3 = R4 + R2
(6) R1 = R1 + R3

Pay attention to the cycle number on each chart; be sure you fill them out in the correct order! If you make a mistake and need additional blank copies of the fill-in sheet, ask the exam proctor. Make sure the old sheets are torn up and the new ones are stapled to your exam!!!

SOLUION R10K Cycle # 1 ht # Insn old S X C t 1 R3=R1*R2 p5 p3 h 2 R=R*10 p6 p 3 5 Map able Reg + r1 p1+ r2 p2+ r3 p5 p7,p8,p9 R10K Cycle # ht # Insn old S X C t 1 R3=R1*R2 p5 p3 2 3-2 R=R*10 p6 p 3-3 R1=R3+R2 p7 p1 R2=R2+5 p8 p2 3 h 5 R3=R+R2 p9 p5 Map able Reg + r1 p7 r2 p8 r3 p9 # op 1 2 1 R3=R1*R2 p5 p1+ p2+ 2 R=R*10 p6 p+ 3 # op 1 2 1 R3=R+R2 p9 p6 p8 2 3 R1=R3+R2 p7 p5 p2+ R10K Cycle # 2 ht # Insn old S X C t 1 R3=R1*R2 p5 p3 2 2 R=R*10 p6 p 3 R1=R3+R2 p7 p1 h R2=R2+5 p8 p2 5 Map able Reg + r1 p7 r2 p8 r3 p5 p9 R10K Cycle # 5 ht # Insn old S X C t 1 R3=R1*R2 p5 p3 2 3-5 2 R=R*10 p6 p 3-5 3 R1=R3+R2 p7 p1 5 R2=R2+5 p8 p2 3 5 h 5 R3=R+R2 p9 p5 Map able Reg + r1 p7 r2 p8+ r3 p9 p5 p8 # op 1 2 1 R3=R1*R2 p5 p1+ p2+ 2 R=R*10 p6 p+ 3 R1=R3+R2 p7 p5 p2+ R2=R2+5 p8 p2+ # op 1 2 1 R3=R+R2 p9 p6 p8+ 2 3 R1=R3+R2 p7 p5+ p2+ R10K Cycle # 3 ht # Insn old S X C t 1 R3=R1*R2 p5 p3 2 3 2 R=R*10 p6 p 3 3 R1=R3+R2 p7 p1 R2=R2+5 p8 p2 3 h 5 R3=R+R2 p9 p5 Map able Reg + r1 p7 r2 p8 r3 p9 R10K Cycle # 6 ht # Insn old S X C h 1 R1=R1+R3 p3 p7 t 2 R=R*10 p6 p 3-5 6 3 R1=R3+R2 p7 p1 5 6 R2=R2+5 p8 p2 3 5 5 R3=R+R2 p9 p5 6 Map able Reg + r1 p3 r2 p8+ r3 p9 + p6 # op 1 2 1 R3=R+R2 p9 p6 p8 2 R=R*10 p6 p+ 3 R1=R3+R2 p7 p5 p2+ R2=R2+5 p8 p2+ # op 1 2 1 R3=R+R2 p9 p6+ p8+ 2 R1=R1+R3 p3 p3 p9 3 9/9
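The T and Told columns of the ROB can be reproduced with a short renaming sketch. It is a simplification that ignores all pipeline timing and tracks destination registers only. It assumes 4 architectural registers initially mapped to p1 through p4, physical registers p5 through p9 on the free list, and that the second instruction writes R4. The function name `rename` and the rule "retire the oldest instruction whenever the free list is empty" are illustrative stand-ins for the real retirement logic.

```python
from collections import deque

# Destination register of each instruction, in program order.
# Entry 2 assumes the second instruction is R4 = R4 * 10.
DESTS = ["r3", "r4", "r1", "r2", "r3", "r1"]

def rename(dests, num_arch=4, num_phys=9):
    """Return (T, Told) for each instruction plus the final map table."""
    map_table = {f"r{i}": f"p{i}" for i in range(1, num_arch + 1)}
    free_list = deque(f"p{i}" for i in range(num_arch + 1, num_phys + 1))
    pending_told = deque()   # Told of each un-retired instruction, oldest first
    renamed = []
    for dest in dests:
        if not free_list:
            # Timing is ignored: retire the oldest instruction, which frees its Told.
            free_list.append(pending_told.popleft())
        t_old = map_table[dest]      # previous mapping, freed at retirement
        t_new = free_list.popleft()  # newly allocated physical register
        map_table[dest] = t_new
        pending_told.append(t_old)
        renamed.append((t_new, t_old))
    return renamed, map_table

renamed, final_map = rename(DESTS)
for number, (t_new, t_old) in enumerate(renamed, start=1):
    print(number, t_new, t_old)
print(final_map)
```

The sixth instruction receives p3 because the free list is empty until instruction 1 retires and releases its Told. This is also why that instruction cannot dispatch until cycle 6 in the solution.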