Spring 2014 CSE Qualifying Exam Core Subjects


February 22, 2014

Architecture

1. ROB trace - single issue

Assume the following:
- Ten ROB slots.
- 1 integer execution unit, 1 reservation station, one cycle to execute.
- 1 floating-point add unit, 3 reservation stations, 7 cycles to execute.
- 1 floating-point multiply unit, 2 reservation stations, 12 cycles to execute.
- 1 load unit with only two reservation stations. Loads that hit require 2 clock cycles, both performed in the load unit.
- Functional units are not pipelined.
- There is no forwarding between functional units; results are communicated over the common data bus (CDB).
- The execute stage (EX) performs both the effective address calculation and the memory access for loads and stores; thus the pipeline is IF/ID/IS/EX/WB.
- Assume all memory references are hits, and that branches are predicted taken and all predictions are correct.
- The issue (IS) and write-back (WB) stages each require one clock cycle.
- The Branch on Not Equal to Zero (BNEZ) instruction requires one clock cycle.

loop:  LD     F0, 0(R1)
       MUL.D  F4, F0, F0
       ADD.D  F8, F8, F0
       S.D    F8, 0(R2)
       ADD.D  F6, F6, F4
       DADDIU R1, R1, +8
       DADDIU R2, R2, +8
       BNE    R1, R3, loop

(a) Fill in the table below for as much as you can, assuming the BNE is taken and the prediction succeeds. (Do not add rows to the table.)

Iter. | Instruction | Issues at | Execute | Memory | Write CDB | Commit

(b) What are the major differences between Tomasulo's algorithm and the reorder buffer (ROB)?
(c) What additional hardware is required (if any) to support two CDBs?
(d) What additional hardware is required to support loads requiring 10 or 100 cycles due to misses in the L1 and L2 caches?
(e) What additional hardware is required to support dual issue?

2. In the system we are analyzing, the memory hierarchy has:
- Separate L1 instruction and data caches, hit time = processor cycle time.
- 64KB L1 instruction cache with 1% miss rate, 32B blocks.
- 64KB L1 data cache with 10% miss rate, 32B blocks.
- 512K unified L2 cache with 64B blocks, local miss rate 20%, hit time = 8 cycles.
- Main memory access time is 100 cycles for the first 128 bits; subsequent 128-bit chunks are available every 8 cycles.
- Both L1 caches are direct mapped; the L2 is four-way set associative.
- Assume there are no misses in main memory.

(a) What is the miss penalty for accesses to L2?
(b) What is the average memory access time for instruction references?
(c) What is the average memory access time for data references?
(d) Assume the only memory-reference instructions are loads (25%) and stores (5%). What percentage of total memory references are data references?
(e) What is the overall average memory access time?

3. Your company has just bought a new dual-core processor, and you have been tasked with optimizing your software for this processor. You will run two applications on this dual-core processor, but the resource requirements are not equal. The first application needs 75% of the resources, and the other only 25%.

(a) Given that 60% of the first application is parallelizable, how much speedup would you achieve with that application if run in isolation?
(b) Given that 95% of the second application is parallelizable, how much speedup would this application observe if run in isolation?
(c) Given that 60% of the first application is parallelizable, how much overall system speedup would you observe if you parallelized it, but not the second application?
(d) How much overall system speedup would you achieve if you parallelized both applications, given the information in parts (a) and (b)?
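As a sanity check on Amdahl's-law questions like problem 3, the arithmetic can be set up in a few lines of Python. This is an illustrative sketch, not part of the exam; the function names are mine, and it assumes each application, when parallelized, runs its parallel fraction across both cores.

```python
# Amdahl's law: speedup when a fraction p of the work is
# parallelizable across n cores (the rest stays serial).
def amdahl(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Overall system speedup when components with the given workload
# weights are each sped up by the given factors.
def system_speedup(weights, speedups):
    return 1.0 / sum(w / s for w, s in zip(weights, speedups))

s1 = amdahl(0.60, 2)   # app 1 in isolation: 1 / (0.4 + 0.6/2)
s2 = amdahl(0.95, 2)   # app 2 in isolation: 1 / (0.05 + 0.95/2)
only1 = system_speedup([0.75, 0.25], [s1, 1.0])  # parallelize app 1 only
both = system_speedup([0.75, 0.25], [s1, s2])    # parallelize both apps
```

The same `system_speedup` helper covers parts (c) and (d): the unparallelized application simply gets a speedup factor of 1.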

Compilers

1. Consider the following grammar:

S → V = E
E → E − E | E + E | ( E ) | V
V → id | id [ Elist ]
Elist → Elist c E | E

(The c stands for "comma.")

(a) Construct the sets of LR(1) items for this grammar. (Stop at 8 sets.)
    i. Compute I0.
    ii. For which symbols x is GOTO(I0, x) nontrivial?
    iii. For each such symbol x, compute GOTO(I0, x).
    iv. Compute GOTO([P, $], =), where P is the item S → V · = E.
(b) For the sets of LR(1) items from above, provide the entries of the LR(1) parse table.

2. Consider the following CFG for formulas in predicate logic (the start symbol is start):

start ::= formula
formula ::= formula1 ∨ conjunction | conjunction
conjunction ::= conjunction1 ∧ literal | literal
literal ::= ¬ literal1 | ( formula ) | ( quantifier VAR ) literal1 | R ( VAR1 , VAR2 )
quantifier ::= ∀ | ∃

Here, the token R denotes a fixed binary relation symbol, and the token VAR represents any variable name. We say that a variable occurrence in a formula is free if it is not within the scope of a quantifier over the same variable. For example, in the formula

R(x, y) ∧ (∀x)(∃z) R(x, y),

the first occurrence of x and both occurrences of y are free, but the second and third occurrences of x are not free. The occurrence of z is not free, either.

Add semantic rules to the grammar above that correctly set a Boolean attribute isfree of each occurrence of VAR in the formula. You may assume that each VAR has a given name attribute, which is a string. You may also use a setofstrings ADT for an unordered, duplicate-free collection of strings that supports the following nondestructive operations:

- setofstrings emptyset() returns the empty set of strings.
- boolean ismember(string x, setofstrings S) returns true iff x is a member of the set S.
- setofstrings add(string x, setofstrings S) returns S ∪ {x}.

You may define any attributes you find useful. [Hint: most attributes are inherited.]

3. The following fragment of 3-address code was produced by a nonoptimizing compiler:

 1  start: i = 1
 2  loop1: if i > n goto part2
 3         j = 1
 4         sum = 0
 5  loop2: if j > i goto fin1
 6         o = i * 8
 7         o = o + j
 8         s = a[o]
 9         t = j * 8
10         v = a[t]
11         y = s * v
12         if s < n goto loop1
13         sum = sum + y
14         j = j + 1
15         goto loop2
16  fin1:  j = sum
17         goto fin2
18  fin2:  i = i + o
19         o = i * 8
20         a[o] = sum
21         goto loop2
22  part2: no-op

Assume that there are no entry points into the code from outside other than at start.

(a) (20% credit) Decompose the code into basic blocks B1, B2, ..., giving a range of line numbers for each.

(b) (30% credit) Draw the control-flow graph, describe any unreachable code, and coalesce any nodes if possible.
(c) (30% credit) Is variable i live just before line 7? Is the variable o live just before line 9? Is the variable y live just before line 14? Explain. Assume that n and sum are the only live variables immediately after line 22.
(d) (20% credit) Describe any simplifying transformations that can be performed on the code (i.e., transformations that preserve the semantics but reduce (i) the complexity of an instruction, (ii) the number of instructions, (iii) the number of branches, or (iv) the number of variables).
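For part (a) of problem 3, the textbook leader rule makes the decomposition mechanical, and it is small enough to script. The sketch below is illustrative and not part of the exam; instead of parsing real 3-address code, it hard-codes a simplified view of the fragment as a map from each jump's line number to its target line.

```python
# Leader-based basic-block decomposition: a leader is (1) the first
# instruction, (2) the target of any jump, and (3) the instruction
# immediately following a jump. Each block runs from one leader up to
# (but not including) the next.
def basic_blocks(n_lines, jumps):
    """jumps: dict mapping a jump's line number to its target line."""
    leaders = {1}                      # rule (1)
    for line, target in jumps.items():
        leaders.add(target)            # rule (2)
        if line + 1 <= n_lines:
            leaders.add(line + 1)      # rule (3)
    starts = sorted(leaders)
    return [(s, (starts[i + 1] - 1) if i + 1 < len(starts) else n_lines)
            for i, s in enumerate(starts)]

# Jumps in the fragment, with labels resolved to line numbers
# (start=1, loop1=2, loop2=5, fin1=16, fin2=18, part2=22).
jumps = {2: 22, 5: 16, 12: 2, 15: 5, 17: 18, 21: 5}
print(basic_blocks(22, jumps))
# [(1, 1), (2, 2), (3, 4), (5, 5), (6, 12), (13, 15), (16, 17), (18, 21), (22, 22)]
```

Note that this only finds block boundaries; reachability (needed for part (b)) still requires following the control-flow edges from start.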

Algorithms

1. Find tight asymptotic bounds on any positive-valued function T(n) satisfying the following recurrence for all n ≥ 2:

T(n) = T(√n) + T(n − 1) + n.

That is, find an expression f(n), as simple as possible, such that T(n) = Θ(f(n)). Prove that T(n) = O(f(n)) (upper bound only) using the substitution method.

2. You are given as input an array S[1 .. n] of n integers and a positive integer d.

(a) (50%) Design an algorithm to determine whether there are two elements S[i] and S[j] of S with i < j such that S[j] − S[i] = d. The algorithm should run in time O(n lg n).
(b) (50%) Suppose that S is given in sorted order. Design an algorithm to solve this problem in time O(n).

3. Consider the following decision problem:

DOUBLE HAMILTONIAN CIRCUIT (DHC)
Instance: A simple undirected graph G = (V, E) with no self-loops.
Question: Is there a circuit in G that includes each vertex of G exactly twice?

As usual, a circuit in G is by definition a sequence of vertices v_1, v_2, ..., v_l, where l ≥ 3, {v_i, v_(i+1)} ∈ E for all 1 ≤ i < l, and {v_l, v_1} ∈ E. Also, the same edge is never traversed twice in a row, i.e., v_i ≠ v_(i+2) for all 1 ≤ i ≤ l − 2, v_(l−1) ≠ v_1, and v_l ≠ v_2. The question is whether there exists such a sequence in which each vertex shows up exactly twice.

DHC is clearly in NP. Show that DHC is NP-hard via a polynomial reduction from HAMILTONIAN CIRCUIT.

Theory

Not given in Spring 2014.
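One standard approach to Algorithms problem 2, sketched here for orientation rather than as the official solution: sort S and advance two indices over it, which gives the O(n lg n) bound for part (a); if S is already sorted, the same linear scan alone answers part (b). This sketch checks only whether two values differing by exactly d exist, which is the natural reading once the array is sorted; the function names are mine.

```python
# Two-pointer scan over a sorted array: does it contain values
# x and y with y - x == d, for a positive d?
def has_pair_with_diff_sorted(s, d):
    i, j = 0, 1
    while j < len(s):
        gap = s[j] - s[i]
        if gap == d:
            return True
        elif gap < d:
            j += 1      # gap too small: widen by moving the upper index
        else:
            i += 1      # gap too large: narrow by moving the lower index
    return False

# Part (a): sort first, then scan -- O(n lg n) overall.
def has_pair_with_diff(s, d):
    return has_pair_with_diff_sorted(sorted(s), d)
```

Each iteration advances one of the two indices, and neither index moves more than n times, so the scan itself is O(n); since d > 0, the lower index can never overtake the upper one.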