EECS 470 Final Exam Fall 2015

Name: ____________________   unique name: ____________________

Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so.

Scores:

  Page #    Points
  2         /17
  3         /11
  4         /13
  5         /10
  6         /11
  7         /9
  8         /6
  9         /8
  10&11     /15
  Total     /100

NOTES:
  Open notes, open book.
  There are 11 pages including this one.
  Calculators are allowed, but no PDAs, Portables, Cell phones, etc.
  Don't spend too much time on any one problem.
  You have about 120 minutes for the exam.
  Be sure to show work and explain what you've done when asked to do so. Getting partial credit without showing work will be rare.

Page 1 of 11

1) Multiple choice/fill-in-the-blank. Pick the best answer. [10 points, -2 per wrong/blank answer, min 0]

  a) You would expect a direct-mapped 256-byte cache with 32-byte cache lines to get a hit about ____% of the time on a memory access with a stack distance of 1.

  b) In a superscalar processor, the hardware / compiler / operating system detects the data dependencies between concurrently fetched instructions. In a VLIW processor, the hardware / compiler / operating system does.

  c) In a bus-based multi-processor system, a load request must go to the bus unless the cache has the data in the S or E or I / S or E or M / E or M / M state.

  d) The fundamental cause of false register dependencies in a program is
       imperfect branch prediction.
       instructions which generate exceptions.
       a finite number of architected registers.
       an inability of the compiler to detect the false dependencies.

  e) One advantage of a CISC ISA over a RISC ISA is that you would expect
       more memory operations
       the decode to be simpler
       fewer complex instructions
       a better Icache hit rate

  f) The compiler often has difficulty moving a load above a branch in program order because
       there might be a true dependence.
       there might be a name dependence.
       it could cause an exception that wouldn't otherwise occur.

2) Each of the following can be said to be a feature of the ISA or of the microarchitecture. Circle each of the following that can be said to be a feature of the ISA. [7 points, -2 per incorrectly circled or not circled answer, minimum 0]

       The depth of an in-order pipeline
       The size of the L1 cache
       Existence of predicated instructions
       The encoding of a given instruction
       Gshare branch prediction
       Number of CDBs
       Number of physical registers
       Number of architectural registers

Page 2 of 11

3) Short answer [11 points]

  a) Which would be simpler to design, a 4-wide superscalar out-of-order machine or a 4-wide VLIW machine? Justify your answer. [4]

  b) Consider the following program running on 4 different processors on the same multi-processor system. Each processor gets a unique result from the CPUID instruction (either 0, 1, 2 or 3). Notice that the array big is read-only.

       main(int argc, char * argv[])
       {
           int A[4];         // shared global array
           int big[400000];  // shared global array initialized elsewhere
           int x,i;          // local variables put into a register by the
                             // compiler. Each processor/thread has its own
           x=cpuid();        // gets the CPU number of the current processor
                             // so processor 1 returns "1", processor
                             // 0 returns "0" etc.
           for(i=0;i<400000;i++)
               A[x]+=big[i];
       }

     i) Assuming the array big is initialized so all elements are 1, what would you expect the final value of the array A to be? (This is not a trick question). [2]

          A[0]=          A[1]=          A[2]=          A[3]=

     ii) When measured in the lab, it turns out that processors 0, 1 and 2 are issuing around 400,000 BRILs on the bus when running this program, while processor 3 only issues one. What is likely causing all those reads for ownership and why is processor 3 issuing so many fewer than the others? [5]

Page 3 of 11
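The problem does not state the cache line size or where A is placed in memory, so any explanation of the bus traffic depends on those details. As a minimal sketch, assuming 4-byte ints and a hypothetical 32-byte line and base address, the helpers below (illustrative names, not from the exam) report whether two elements of A fall on the same cache line. That is the precondition for two processors to contend for ownership of one line while writing different elements.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: which cache line holds element idx of an array that
 * starts at byte address base? elem_bytes and line_bytes are assumed values. */
uint64_t line_of_element(uint64_t base, unsigned idx,
                         unsigned elem_bytes, unsigned line_bytes)
{
    assert(line_bytes > 0);
    return (base + (uint64_t)idx * elem_bytes) / line_bytes;
}

/* Returns 1 if elements i and j sit on the same cache line, else 0. */
int same_line(uint64_t base, unsigned i, unsigned j,
              unsigned elem_bytes, unsigned line_bytes)
{
    return line_of_element(base, i, elem_bytes, line_bytes) ==
           line_of_element(base, j, elem_bytes, line_bytes);
}
```

With a hypothetical base of 0x1014, same_line(0x1014, 0, 2, 4, 32) is 1 while same_line(0x1014, 0, 3, 4, 32) is 0, because A[3] would start at 0x1020 on the next line. This is only one layout that would fit the lab observation, not a confirmed answer.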

4) Caches [13 points]

  a) Provide the shortest possible reference stream where a 2-way associative cache will get a hit, while a direct-mapped cache will get a miss. Assume both are 4KB caches with 32-byte lines and provide the addresses in hex. [4]

  b) Consider the following C code:

       int SIZE, STRIDE;
       int A[SIZE];   // ints are 4 bytes on this computer
       // Initialize SIZE and STRIDE here
       for(j=0;j<N;j++)
           for(i=0;i<SIZE;i=i+STRIDE)
               X+=A[i];

     Assume N is a very large number. Approximately what hit rates would you expect to get on a 4 KB, two-way associative data cache with 32-byte lines given the following values for STRIDE and SIZE? You are to assume that every value other than the array A is kept in registers and that shorts are 2 bytes in size. [9 points, -3 per wrong/blank box, min 0]

                    SIZE=4096      SIZE=3072 (2048*1.5)
       STRIDE=1     [        ]     [        ]
       STRIDE=4     [        ]     [        ]

Page 4 of 11
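Both parts of this problem depend on how an address splits into a set index and a tag. The sketch below shows that split for a cache described by its total size, line size, and associativity. The function names are hypothetical, and it assumes the usual modulo placement with the index taken from the bits just above the line offset.

```c
#include <assert.h>
#include <stdint.h>

/* Number of sets in a cache of cache_bytes with line_bytes lines and `ways` ways. */
unsigned num_sets(unsigned cache_bytes, unsigned line_bytes, unsigned ways)
{
    assert(line_bytes > 0 && ways > 0);
    return cache_bytes / (line_bytes * ways);
}

/* Set that a byte address maps to (line number modulo number of sets). */
unsigned set_index(uint64_t addr, unsigned cache_bytes,
                   unsigned line_bytes, unsigned ways)
{
    return (unsigned)((addr / line_bytes) %
                      num_sets(cache_bytes, line_bytes, ways));
}

/* Tag stored with the line (line number divided by number of sets). */
uint64_t tag_of(uint64_t addr, unsigned cache_bytes,
                unsigned line_bytes, unsigned ways)
{
    return (addr / line_bytes) / num_sets(cache_bytes, line_bytes, ways);
}
```

For the 4KB, 32-byte-line caches in part a), num_sets gives 128 sets when direct-mapped and 64 sets when 2-way. Addresses 0x0 and 0x1000 map to set 0 with different tags in both organizations, so they compete for a single line in the direct-mapped cache but can coexist in the 2-way cache.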

4) Consider a processor running a given application that performs 300 million loads and 100 million stores per second. Assume the following is true: [10 points]

       The processor's multi-level cache system gets a 90% hit rate on both loads and stores.
       Cache lines are 32 bytes in size.
       There is no prefetching and the instruction cache never misses.
       20% of all lines evicted from the last level of the cache are dirty.
       The cache is write-back and no-write allocate.
       All loads and stores are to 4-byte values.
       The bus supports both 4-byte and 32-byte transactions.
       There is no coherence traffic (only one processor).

  a) What is the read bandwidth (bytes/second) on the bus? Show your work. [4]

  b) What is the write bandwidth (bytes/second) on the bus? Show your work. [6]

Page 5 of 11
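The arithmetic for this kind of question is miss rate times access rate times bytes per transaction. The sketch below sets that up, assuming each load miss fetches one full 32-byte line. The function names are hypothetical, and this is one reading of the problem rather than the official solution.

```c
#include <assert.h>

/* Misses per second for a stream of accesses with the given hit rate. */
double misses_per_sec(double accesses_per_sec, double hit_rate)
{
    assert(hit_rate >= 0.0 && hit_rate <= 1.0);
    return accesses_per_sec * (1.0 - hit_rate);
}

/* Bytes per second moved by events_per_sec bus transactions of `bytes` each. */
double bus_bytes_per_sec(double events_per_sec, double bytes)
{
    assert(bytes >= 0.0);
    return events_per_sec * bytes;
}
```

Under that assumption, misses_per_sec(300e6, 0.90) is about 30 million load misses per second, and moving 32 bytes for each gives about 960 million bytes per second of read traffic. Write traffic can be built the same way from store misses (4-byte writes under no-write allocate) and dirty-line write-backs. How many lines are evicted per second is an interpretation the problem leaves to the reader.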

5) Consider a case of having 3 processors using a snoopy MESI protocol where the memories can snarf data. All three have a 2-line direct-mapped cache with each line consisting of 16 bytes. The caches begin with all lines marked as invalid. Fill in the following tables indicating:

       If the processor gets a hit or a miss in its cache.
       If a HIT or HITM (or nothing) occurs on the bus during snoop.
       What bus transaction(s) (if any) the processor performs (BRL, BWL, BRIL, BIL).
       For misses only, indicate if the miss is compulsory, capacity, conflict, or coherence. A coherence miss is one where there would have been a hit, had some other processor not caused an invalidation of that line.

Finally, indicate the state of each processor's cache after all of these memory operations have completed. The operations occur in the order shown. [11 points, -1 per wrong or blank, minimum of 0]

  Processor   Address   Read/Write   Hit/Miss   Bus transaction(s)   HIT/HITM   "4C's" miss type
  1           0x100     Read
  1           0x120     Read
  1           0x100     Read
  2           0x104     Read
  1           0x100     Write
  1           0x200     Read
  1           0x118     Write
  1           0x100     Read
  2           0x100     Read
  2           0x100     Write
  3           0x110     Read

              Proc 1              Proc 2              Proc 3
          Address   State     Address   State     Address   State
  Set 0
  Set 1

Page 6 of 11

6) For purposes of this problem, assume the power of a single processor is approximately proportional to performance cubed. Say that you have two designs for a die: [9 points]

       (1) a single processor on the die that does 10 BIPS while drawing 200W
       (2) three processors on a die. They draw a total of 200W and have the performance you'd expect from voltage/frequency scaling (per the assumption above).

  a) On a trivially parallelizable benchmark, what performance in BIPS would (2) achieve? [3]

  b) On a benchmark that cannot be parallelized, what performance in BIPS would (2) achieve? [3]

  c) Say that power was the sole limiting factor on performance (area, cost, etc. are of no concern). How could you optimize performance to do well in both cases? [3]

Page 7 of 11
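If power scales as performance cubed, then performance scales as the cube root of power. The sketch below applies that relation, assuming the 200W budget is split evenly across the cores and that the cube law holds exactly. The function names are hypothetical, and the cube root uses Newton's iteration only to avoid depending on the math library.

```c
#include <assert.h>

/* Cube root of a positive value by Newton's iteration on x^3 = v. */
double cube_root(double v)
{
    assert(v > 0.0);
    double x = v > 1.0 ? v : 1.0;   /* start at or above the root */
    for (int i = 0; i < 100; i++)
        x = (2.0 * x + v / (x * x)) / 3.0;
    return x;
}

/* Per-core performance when a fixed power budget is split evenly over
 * `cores` cores and power is proportional to performance cubed. */
double per_core_bips(double single_core_bips, int cores)
{
    assert(cores > 0);
    return single_core_bips * cube_root(1.0 / cores);
}
```

Under those assumptions, per_core_bips(10.0, 3) comes to roughly 6.93 BIPS per core. A perfectly parallel workload would see about three times that in aggregate, and a serial one would see only the single-core figure.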

7) Circle the correct answer [6 points, -2 per blank or wrong answer, min 0]

  a) The physical memory is effectively a cache of the page table / TLB / disk / data cache.

  b) The TLB is effectively a cache of page table / physical memory / disk / data cache.

  c) If I have a virtually indexed, physically tagged cache with 32-byte blocks that is 4-way associative and the virtual memory system has 8KB pages, then I know that:
       The cache index is 13 bits.
       The cache index is 9 bits.
       The cache size cannot be greater than 8KB.
       The cache may suffer from the synonym problem.
       None of the above.

  d) If I have a cache with 32-byte blocks that is 4-way associative and the virtual memory system has 8KB pages, where the TLB comes after the cache, then I know that:
       The cache index is 13 bits.
       The cache index is 9 bits.
       The cache size cannot be greater than 8KB.
       The cache may suffer from the synonym problem.
       None of the above.

Page 8 of 11
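Parts c) and d) turn on whether the cache's index and block-offset bits fit inside the page offset. The sketch below checks that condition for a given geometry. The cache sizes used in the note that follows are hypothetical, since the problem does not give one, and the function names are illustrative.

```c
#include <assert.h>

/* log2 of a value that must be an exact power of two. */
unsigned log2_exact(unsigned v)
{
    assert(v != 0 && (v & (v - 1)) == 0);
    unsigned bits = 0;
    while (v > 1) {
        v >>= 1;
        bits++;
    }
    return bits;
}

/* Number of set-index bits for the given cache geometry. */
unsigned index_bits(unsigned cache_bytes, unsigned line_bytes, unsigned ways)
{
    assert(line_bytes > 0 && ways > 0);
    return log2_exact(cache_bytes / (line_bytes * ways));
}

/* Returns 1 if index plus block-offset bits fit within the page offset,
 * so the set can be selected from untranslated address bits. */
int index_within_page_offset(unsigned cache_bytes, unsigned line_bytes,
                             unsigned ways, unsigned page_bytes)
{
    return index_bits(cache_bytes, line_bytes, ways) + log2_exact(line_bytes)
           <= log2_exact(page_bytes);
}
```

For a hypothetical 32KB, 4-way cache with 32-byte blocks and 8KB pages, the index is 8 bits and the check passes (8 + 5 = 13 bits). Doubling the cache to 64KB gives a 9-bit index and the check fails, which is the situation in which synonyms become a concern.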

8) Your boss has asked you to design a module in SystemVerilog to determine whether a branch instruction should be taken. This module, "br_taken", outputs a 1-bit signal "taken". The module header and all necessary input/output signals are provided. You may assume the following:

       A valid branch is either conditional or unconditional and uses a comparison operator if appropriate.
       Your module must output correctly for all valid input. For invalid input you should output taken = 0.
       Comparisons should be performed as: <op1> <comparison operator> <op2>

Your code must be synthesizable and should not produce any latches. [8 points]

       module br_taken(
           input  logic        cond_br,    // 1 if the branch is conditional,
                                           // otherwise 0
           input  logic        uncond_br,  // 1 if the branch is unconditional,
                                           // otherwise 0
           input  logic [63:0] op1,        // 64-bit, unsigned, integer operand
           input  logic [63:0] op2,        // 64-bit, unsigned, integer operand
           input  logic [1:0]  comp,       // 0: less-than, 1: equals,
                                           // 2: greater-than
           output logic        taken       // 0: not taken, 1: taken
       );
           // Your code here

Page 9 of 11
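When writing a testbench for a module like this, a software reference model can help pin down the intended truth table. The C sketch below is one possible reading of the specification, not a solution in SystemVerilog. It assumes that a valid branch sets exactly one of cond_br and uncond_br, that an unconditional branch ignores comp, and that comp == 3 is invalid. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Reference model: returns 1 if the branch is taken, 0 otherwise.
 * Invalid input (assumed: both or neither branch flag set, or comp == 3
 * on a conditional branch) yields 0. */
int br_taken_model(int cond_br, int uncond_br,
                   uint64_t op1, uint64_t op2, unsigned comp)
{
    assert(comp <= 3);                 /* comp is a 2-bit field */
    if (!!cond_br == !!uncond_br)      /* both set or neither set */
        return 0;
    if (uncond_br)                     /* unconditional: always taken */
        return 1;
    switch (comp) {                    /* conditional: unsigned compare */
    case 0:  return op1 < op2;
    case 1:  return op1 == op2;
    case 2:  return op1 > op2;
    default: return 0;                 /* comp == 3 treated as invalid */
    }
}
```

Giving every input combination a defined result in the model mirrors the no-latch requirement: in hardware, the same effect usually comes from assigning taken a default value before any conditional logic.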

9) Consider the following tables that represent the state of a processor that implements what we have called the P6 algorithm:

  RAT
    Arch Reg #   ROB# (-- if in ARF)
    0            --
    1            4
    2            1
    3            --
    4            2
    5            --

  ROB
    Buffer Number   PC   Done with EX?   Dest. Arch Reg #   Value
    0               20   N               4
    1               24   N               2
    2               28   Y               4                  100
    3               32   Y               --                 --
    4               36   N               1
    5
    6
    7
    8

  RS
    RS#   Op type   Op1 ready?   Op1 RoB/value   Op2 ready?   Op2 RoB/value   Dest ROB
    0     +         Y            5               Y            6               0
    1
    2     *         N            0               Y            6               1
    3     *         Y            9               Y            7               4
    4

  ARF
    Reg#    0   1   2   3   4   5
    Value   4   5   6   7   8   9

The instruction at PC 32 is a branch that has been predicted not-taken, but it is actually taken. The destination of the branch is PC 200, where the following code resides:

       R3=R3+R4 // A (PC 200)
       R1=R1+R3 // B
       R5=R1+R3 // C

Show the state of the above tables if instruction A has retired, inst B has not started executing, while C has progressed as far along as possible. Be sure to label the head and tail of the ROB. Please place instruction A in slot 5 of the ROB. When other arbitrary decisions need to be made, you are to just make them. Be sure to update the head and tail. [15]

(A second copy is available on the following page, please cross out the one you don't want graded!)

Page 10 of 11

(Extra copy, cross out if not used.)

Consider the following tables that represent the state of a processor that implements what we have called the P6 algorithm:

  RAT
    Arch Reg #   ROB# (-- if in ARF)
    0            --
    1            4
    2            1
    3            --
    4            2
    5            --

  ROB
    Buffer Number   PC   Done with EX?   Dest. Arch Reg #   Value
    0               20   N               4
    1               24   N               2
    2               28   Y               4                  100
    3               32   Y               --                 --
    4               36   N               1
    5
    6
    7
    8

  RS
    RS#   Op type   Op1 ready?   Op1 RoB/value   Op2 ready?   Op2 RoB/value   Dest ROB
    0     +         Y            5               Y            6               0
    1
    2     *         N            0               Y            6               1
    3     *         Y            9               Y            7               4
    4

  ARF
    Reg#    0   1   2   3   4   5
    Value   4   5   6   7   8   9

The instruction at PC 32 is a branch that has been predicted not-taken, but it is actually taken. The destination of the branch is PC 200, where the following code resides:

       R3=R3+R4 // A (PC 200)
       R1=R1+R3 // B
       R5=R1+R3 // C

Show the state of the above tables if instruction A has retired, inst B has not started executing, while C has progressed as far along as possible. Be sure to label the head and tail of the ROB. Please place instruction A in slot 5 of the ROB. When other arbitrary decisions need to be made, you are to just make them. Be sure to update the head and tail. [15]

Page 11 of 11