
Midterm Exam
October 13, 2014

Name:

Problem:   1    2    3    4    total
Points:

Exam rules:
- Time: 90 minutes.
- Individual test: No team work!
- Open book, open notes.
- No electronic devices, except an unprogrammed calculator.
- No phone, no internet!

Good luck and have fun!

1. Basics (10 Pts.)

Note: Answers should be no longer than two sentences.

(a) Assume the 5 least significant bits of the branch address index the branch prediction buffer, and each entry of the branch predictor holds a 2-bit counter. What is the total size of the branch predictor?

(b) Assuming infinitely many reservation stations, what is the best performance achievable under Tomasulo's algorithm?

(c) Assume a 4 kB cache with a block size of 1 kB. What values can N take to make this cache N-way set associative? How is the cache usually classified when N takes the maximum possible value?

(d) Consider a processor with an instruction length of 16 bits and 64 general-purpose registers, so each address field is 6 bits. How many two-address instructions can be encoded?

(e) A two-level cache architecture features a hit time of 2 cycles for the L1 cache, 8 cycles for the L2 cache, and a miss penalty of 50 cycles for L2 cache misses. If 1000 memory accesses produce 60 L1 misses and 30 L2 misses, what is the average miss penalty for the L1 cache?
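As a study aid, the calculations in parts (a), (d), and (e) can be set up with the standard textbook formulas. The formulas below are the usual ones (index bits determine predictor entries; opcode bits are what remains of the instruction word; the L1 miss penalty is the L2 access time plus the local L2 miss rate times the L2 miss penalty); they are not taken from the exam text, so check them against your course's conventions before relying on the numbers.

```python
# Sketch of Problem 1's calculations using standard textbook formulas.
# These are illustrative, not official exam solutions.

# (a) 5 index bits -> 2^5 entries, each a 2-bit saturating counter.
predictor_bits = (2 ** 5) * 2

# (d) 16-bit instructions with two 6-bit address fields leave
# 16 - 12 = 4 opcode bits -> 2^4 distinct two-address instructions.
two_address_opcodes = 2 ** (16 - 2 * 6)

# (e) Average L1 miss penalty = L2 access time
#     + (local L2 miss rate) * (L2 miss penalty).
l2_local_miss_rate = 30 / 60   # 30 of the 60 L1 misses also miss in L2
l1_miss_penalty = 8 + l2_local_miss_rate * 50

print(predictor_bits, two_address_opcodes, l1_miss_penalty)
```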

Additional space:

2. Multiple Pipelines with Hazards and Forwarding

Consider a single-issue, 5-stage pipeline design. For all instructions, fetching (F), decoding (D), memory (M), and write-back (W) take 1 cycle each. The execution time is 3 cycles for multiplication (E_MUL), 6 cycles for division (E_DIV), and 1 cycle for all other executions (E). The processor has an integer ALU, a multiplier, and a divider. Furthermore, the memory (M) and write-back (W) stages (and only those) can accommodate multiple instructions without a structural hazard, as long as only one instruction needs memory access or register access, respectively. If more than one instruction requires access to memory or registers, preference is given to instructions in issue order (while the other(s) stall(s)).

(a) Show the forwarding and stalls for the following MIPS code for one loop iteration. Indicate stalls by (S) and forwarding by clearly visible arrows between the appropriate stages.

(b) After heavy optimization, multiplication and division are merged into one unit that completes execution in 2 cycles. The processor now has only an integer ALU and one combined MUL/DIV engine. Show the forwarding and stalls for the following MIPS code for one loop iteration, and compare the result to the previous engine. Assuming the second engine is clocked 10% faster, express the performance gain for the given loop. Assume the execution of the loop ends after the fetch of the last instruction.

Loop: LD    F2, 0(R1)
      DIV   F8, F2, F0
      MULT  F2, F6, F2
      LD    F4, 0(R2)
      ADD   F4, F0, F4
      ADD   F10, F8, F2
      ADDI  R1, R1, 8
      ADDI  R2, R2, 8
      SD    F4, 0(R2)
      SUB   R20, R4, R1
      BNZ   R20, Loop

Table for part (a):

Instruction        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
LD   F2, 0(R1)
DIV  F8, F2, F0
MULT F2, F6, F2
LD   F4, 0(R2)
ADD  F4, F0, F4
ADD  F10, F8, F2
ADDI R1, R1, 8
ADDI R2, R2, 8
SD   F4, 0(R2)
SUB  R20, R4, R1
BNZ  R20, Loop

Table for part (b):

Instruction        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
LD   F2, 0(R1)
DIV  F8, F2, F0
MULT F2, F6, F2
LD   F4, 0(R2)
ADD  F4, F0, F4
ADD  F10, F8, F2
ADDI R1, R1, 8
ADDI R2, R2, 8
SD   F4, 0(R2)
SUB  R20, R4, R1
BNZ  R20, Loop
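For the performance-gain question in part (b), the comparison reduces to a ratio of execution times: time = cycles x cycle time, and a 10% faster clock shrinks the cycle time by a factor of 1.1. A small helper sketching this; the cycle counts passed in below are placeholder values, not the exam's answers, so substitute the loop lengths you read off your two pipeline tables.

```python
# Sketch for part (b): overall speedup of the optimized MUL/DIV engine.
# exec time = cycles * cycle_time; the new engine's clock is 10% faster,
# so its cycle time is 1/1.1 of the old one.

def perf_gain(cycles_old, cycles_new, clock_speedup=1.10):
    """Return old_time / new_time, i.e. the overall speedup."""
    return (cycles_old / cycles_new) * clock_speedup

# Placeholder cycle counts -- replace with the values from your tables.
print(perf_gain(19, 15))
```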

3. Single-Issue vs. Multiple-Issue Processor

Consider the following code and latencies for this problem. We ignore instruction fetch and decode as well as write-back in this problem. All instructions execute in one cycle, plus the latency given in the table below.

Loop: LD    F6, 0(R2)
      LD    F2, 8(R2)
      MULTD F8, F2, F6
      LD    F4, 8(R2)
      ADD   F10, F0, F6
      ADD   F6, F8, F4
      ADDI  R2, R2, -16
      BNE   R2, R3, Loop

Latencies:
Memory LD     +3
Branch        +1
ADD, ADDI     +0
MULTD         +5

(a) What is the baseline performance of the code? Assume that one instruction per cycle can be issued and that a following instruction only executes after the previous one has produced its result.

(b) Now assume instruction execution is non-blocking: execution is delayed only by true data dependences. Rewrite the code (without reordering it), indicating when each instruction starts execution, and indicate stalls, i.e. cycles in which no new instruction starts execution. How many cycles does one execution of the code take now?

(c) Now consider a two-issue design. Suppose there are two execution pipelines, each capable of beginning execution of one instruction per cycle. Results are immediately forwarded between the pipelines. The pipelines stall only for true data dependences. Note that instructions may only be issued once all inputs are available, and instructions must be issued in order. Write down the execution order per cycle as before. How many cycles does one execution of the loop take now?
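The rules of part (b) can be cross-checked with a minimal scheduling model: issue in program order at most one instruction per cycle, and delay the start of an instruction until every source register's value is available. The register encoding below and the availability rule (a result is usable 1 + latency cycles after its producer starts) are my modelling assumptions, not part of the exam text, so verify the rule matches your lecture's timing convention before trusting the resulting schedule.

```python
# Minimal model of part (b): single-issue, in-order start, non-blocking
# execution, stalls only on true (RAW) dependences. Assumption: a result
# becomes usable at cycle (start + 1 + extra latency).

LAT = {"LD": 3, "MULTD": 5, "ADD": 0, "ADDI": 0, "BNE": 1}

# (opcode, destination, sources) for the loop body; None = no destination.
loop = [
    ("LD",    "F6",  ["R2"]),
    ("LD",    "F2",  ["R2"]),
    ("MULTD", "F8",  ["F2", "F6"]),
    ("LD",    "F4",  ["R2"]),
    ("ADD",   "F10", ["F0", "F6"]),
    ("ADD",   "F6",  ["F8", "F4"]),
    ("ADDI",  "R2",  ["R2"]),
    ("BNE",   None,  ["R2", "R3"]),
]

def schedule(instrs):
    """Return the start cycle of each instruction (cycle 1 = first start)."""
    ready = {}            # register -> cycle its new value becomes usable
    starts, prev = [], 0
    for op, dest, srcs in instrs:
        # In-order, one start per cycle, and wait for all source values.
        start = max([prev + 1] + [ready.get(r, 1) for r in srcs])
        starts.append(start)
        prev = start
        if dest:
            ready[dest] = start + 1 + LAT[op]
    return starts

print(schedule(loop))
```

Gaps between consecutive start cycles in the output are exactly the stall cycles the question asks you to mark.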

Additional space:

         Problem A   Problem B   Problem C
Cycle    Pipeline    Pipeline    Pipeline A   Pipeline B
(blank rows for cycles 1 through 30)

4. Scoreboard Algorithm (10 Pts.)

For this problem use the single-issue scoreboard MIPS pipeline with the functional units listed in the table below.

FU type                  Execution time   Number of FUs
Integer (LD, SD, ADDI)   1                3
FP adder                 3                2
FP multiplier            5                2

When applying the scoreboard algorithm, consider the following arbitration rules:

- Only one instruction can be issued per clock cycle.
- Functional units are not pipelined.
- The execution stage (EX) does both the effective address calculation and the memory access for loads and stores. Thus the pipeline is IS / RO / EX / WB. All stages (IS, RO, WB) except EX take 1 cycle to complete.
- There is no forwarding between functional units. Both integer and floating-point results are communicated through the registers.
- Memory accesses use the integer functional unit to perform effective address calculation. All loads and stores access memory during the EX stage.
- If an instruction A has a read-after-write dependence on instruction B, instruction A can read operands (RO) only AFTER the WB of instruction B is complete.
- Whenever there is a conflict for a functional unit, assume program order.
- When an instruction is done executing in its functional unit and is waiting to write its result, it still occupies the functional unit (meaning no other instruction may enter). However, in the clock cycle in which it writes its result, a new instruction can be issued to that FU.

(a) Using the scoreboard approach, fill in the table below to indicate the cycle number at each stage for all instructions. Indicate the reason to stall by hazard type and the causing register or FU. Also indicate which FU the instruction is issued to.

(b) Indicate the state of the scoreboard and register file after instruction 7.

Instruction        Issue   Read Ops   Execute Complete   Write Result   Assigned FU   Reason to Stall
LD    F0, 0(R0)
MUL.D F1, F0, F1
ADD.D F0, F0, F2
SD    F1, 0(R1)
LD    F2, 4(R0)
MUL.D F3, F2, F1
ADD.D F4, F1, F2
SD    F3, 4(R3)
ADDI  R3, R3, 1
ADD.D F0, F4, F1

The scoreboard and register state tables should contain the state after cycle 7.

Name        Busy   Op   Fi   Fj   Fk   Qj   Qk   Rj   Rk
Integer 1
Integer 2
Integer 3
FP Add 1
FP Add 2
FP Mul 1
FP Mul 2

Name   FU        Name   FU
F0               R0
F1               R1
F2               R2
F3               R3
F4               R4
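The columns of the functional-unit status table have the standard scoreboard meanings (as in Hennessy and Patterson): Fi is the destination register, Fj/Fk the source registers, Qj/Qk the functional units that will produce them, and Rj/Rk flags saying whether each source is ready and not yet read. A sketch of that record as a data structure, to help keep the fields straight while filling in part (b); the example values at the bottom are illustrative only, not the exam's answer.

```python
# One row of the scoreboard's functional-unit status table, with the
# standard meanings of each column. A study sketch, not exam code.
from dataclasses import dataclass
from typing import Optional

@dataclass
class FUStatus:
    busy: bool = False          # Busy: is this FU currently assigned?
    op: Optional[str] = None    # Op: operation being performed
    fi: Optional[str] = None    # Fi: destination register
    fj: Optional[str] = None    # Fj: first source register
    fk: Optional[str] = None    # Fk: second source register
    qj: Optional[str] = None    # Qj: FU producing Fj (None if value ready)
    qk: Optional[str] = None    # Qk: FU producing Fk (None if value ready)
    rj: bool = False            # Rj: Fj ready and not yet read
    rk: bool = False            # Rk: Fk ready and not yet read

# Illustrative entry: MUL.D F1, F0, F1 issued to an FP multiplier while
# F0 is still being produced by an integer unit (values are made up).
mul1 = FUStatus(busy=True, op="MUL.D", fi="F1", fj="F0", fk="F1",
                qj="Integer 1", qk=None, rj=False, rk=True)
```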

Additional space: