Advanced Computer Architecture CMSC 611 Homework 3. Due in class Oct 17th, 2012

(Show your work to receive partial credit)

1) For the following code snippet, list the data dependencies and rewrite the code to resolve name dependencies. (15 points)

    Loop: LD  R2, 0(R7)   ----- 0
          Add R1, R2, R3  ----- 1
          Sub R4, R6, R1  ----- 2
          Add R2, R5, R6  ----- 3
          LD  R4, 32(R1)  ----- 4
          SD  36(R1), R2  ----- 5
          BEQ R4, Loop    ----- 6

Data dependencies:
- 1 & 2, 1 & 4, and 1 & 5 on R1
- 0 & 1 and 3 & 5 on R2
- 4 & 6 on R4

Name dependencies:
- Anti-dependence between 1 and 3 because of R2
- Output dependence between 2 and 4 because of R4
- Output dependence between 0 and 3 because of R2

The name dependencies can be resolved by renaming registers R2 and R4 to temporary registers T2 and T4:

    Loop: LD  R2, 0(R7)   ----- 0
          Add R1, R2, R3  ----- 1
          Sub R4, R6, R1  ----- 2
          Add T2, R5, R6  ----- 3
          LD  T4, 32(R1)  ----- 4
          SD  36(R1), T2  ----- 5
          BEQ T4, Loop    ----- 6
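As a cross-check, the dependence lists above can be generated mechanically. The Python sketch below is an illustration added here, not part of the assignment; the (writes, reads) encoding of each instruction is an assumption made for this example. It scans the sequence once, pairing each read with the most recent writer (RAW/data), each write with the reads since the previous write (WAR/anti), and each write with the previous writer (WAW/output).

    # Classify register dependences in a straight-line instruction sequence.
    # Each instruction is modeled by the registers it writes and reads
    # (an encoding assumed here for illustration).
    instrs = [
        ("LD R2, 0(R7)",   {"R2"}, {"R7"}),        # 0
        ("Add R1, R2, R3", {"R1"}, {"R2", "R3"}),  # 1
        ("Sub R4, R6, R1", {"R4"}, {"R6", "R1"}),  # 2
        ("Add R2, R5, R6", {"R2"}, {"R5", "R6"}),  # 3
        ("LD R4, 32(R1)",  {"R4"}, {"R1"}),        # 4
        ("SD 36(R1), R2",  set(),  {"R1", "R2"}),  # 5
        ("BEQ R4, Loop",   set(),  {"R4"}),        # 6
    ]

    last_write = {}  # reg -> index of its most recent writer
    last_reads = {}  # reg -> readers seen since that writer
    raw, war, waw = [], [], []
    for i, (_, writes, reads) in enumerate(instrs):
        for r in reads:  # reads happen before the instruction's own write
            if r in last_write:
                raw.append((last_write[r], i, r))
            last_reads.setdefault(r, []).append(i)
        for r in writes:
            for j in last_reads.get(r, []):
                war.append((j, i, r))
            if r in last_write:
                waw.append((last_write[r], i, r))
            last_write[r], last_reads[r] = i, []

    for kind, deps in (("data (RAW)", raw), ("anti (WAR)", war),
                       ("output (WAW)", waw)):
        for i, j, r in deps:
            print(f"{kind}: {i} -> {j} on {r}")

Running it reproduces exactly the three lists given above.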

2) From the course slides (multi-cycle FP pipeline for MIPS). (45 points) Use the instruction latencies as indicated in the selected slide to first show all the stalls that are present in the following piece of code if the branch is not taken. Then unroll this loop as many times as needed and schedule the instructions to remove all the stalls. You may rename registers, i.e., use new registers, and/or change the immediate/offset values. You can ignore structural hazards as well. What is the speedup that you achieved after unrolling and scheduling?

    Loop: LD    F0, 0(r0)
          LD    F2, 0(r2)
          MULTD F4, F0, F2
          LD    F6, 0(r4)
          ADDD  F8, F4, F6
          SD    0(r6), F8
          ADDI  r0, r0, #4
          ADDI  r2, r2, #4
          ADDI  r4, r4, #4
          ADDI  r6, r6, #4
          SUBI  r8, r8, #1
          BNEZ  r8, Loop

With the given latencies, the original loop executes as follows (the number after each instruction is the clock cycle in which it executes; gaps are stall cycles):

    Loop: LD    F0, 0(r0)       1
          LD    F2, 0(r2)       2
          MULTD F4, F0, F2      4
          LD    F6, 0(r4)       5
          ADDD  F8, F4, F6     11
          SD    0(r6), F8      15
          ADDI  r0, r0, #4     16
          ADDI  r2, r2, #4     17
          ADDI  r4, r4, #4     18
          ADDI  r6, r6, #4     19
          SUBI  r8, r8, #1     20
          BNEZ  r8, Loop       22

so one iteration takes 23 cycles. Unrolling the loop twice is sufficient to remove all the stalls:

    Loop: LD    F0, 0(r0)       1
          LD    F2, 0(r2)       2
          LD    F10, 4(r0)      3
          LD    F12, 4(r2)      4
          MULTD F4, F0, F2      5
          MULTD F14, F10, F12   6
          LD    F6, 0(r4)       7
          LD    F16, 4(r4)      8
          ADDI  r0, r0, #8      9
          ADDI  r2, r2, #8     10
          ADDI  r4, r4, #8     11
          ADDD  F8, F4, F6     12
          ADDD  F18, F14, F16  13
          ADDI  r6, r6, #8     14
          SUBI  r8, r8, #2     15
          SD    -8(r6), F8     16
          BNEZ  r8, Loop       17
          SD    -4(r6), F18    18

The unrolled loop is (23 × 2)/18 ≈ 2.56 times faster than the original with respect to clock cycles consumed (ignoring pipeline fill time). The CPI of the original loop is 23/12 ≈ 1.92; the CPI of the unrolled loop is 18/18 = 1, so after unrolling the CPI matches the ideal CPI.
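As a quick arithmetic check of the speedup and CPI claims above (a sketch added for illustration; the variable names are not from the assignment):

    # Speedup and CPI check for the unrolling result above.
    rolled_cycles, rolled_instrs = 23, 12      # one original iteration
    unrolled_cycles, unrolled_instrs = 18, 18  # covers two original iterations

    speedup = (rolled_cycles * 2) / unrolled_cycles
    print(f"speedup    = {speedup:.2f}x")                           # 2.56x
    print(f"CPI before = {rolled_cycles / rolled_instrs:.2f}")      # 1.92
    print(f"CPI after  = {unrolled_cycles / unrolled_instrs:.2f}")  # 1.00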

3) Tomasulo's algorithm. Consider the following specifications. (40 points)

    FU type         Cycles in EX    Number of FUs
    Integer               1               1
    FP adder              5               1
    FP multiplier         8               1
    FP divider           24               1

Assume the following:
- Functional units are not pipelined.
- All stages except EX take one cycle to complete.
- There is no limit on reservation stations.
- There is no forwarding between functional units; both integer and floating-point results are communicated through the CDB.
- Memory accesses use the integer functional unit to perform the effective address calculation. Loads and stores access memory during the EX stage, which performs both the effective address calculation and the memory access; they take one cycle to execute and share a memory access unit.
- There are unlimited load/store buffers and an infinite instruction queue.
- If an instruction is in the WR stage in cycle x, then an instruction that is waiting on the same functional unit (due to a structural hazard) can begin execution in cycle x, unless it needs to read the CDB, in which case it can only start executing in cycle x + 1.
- Only one instruction can write to the CDB in a clock cycle.
- Branches and stores do not need the CDB since they don't have a WR stage.
- Whenever there is a conflict for a functional unit or the CDB, assume program order.
- When an instruction is done executing in its functional unit and is waiting for the CDB, it still occupies the functional unit and its reservation station (no other instruction may enter).
- Treat the BNEZ instruction as an integer instruction. Assume an LD instruction after the BNEZ can be issued the cycle after the BNEZ is issued, due to branch prediction.

Fill in the execution profile for the code given in the table, including the cycles each instruction occupies in the IS, EX, and WR stages, and comments justifying your answer, such as the type of hazard and the registers involved.

    #  Instruction           IS  EX     WR  Comments
    1  LD F0, 0(r0)           1  2       3
    2  ADDD F2, F0, F4        2  4-8     9  RAW on F0 from #1
    3  MULTD F4, F2, F6       3  17-24  25  RAW on F2 from #2; structural hazard
                                            on the FP multiplier from #7 (only
                                            one FP multiply unit)
    4  ADDD F6, F8, F10       4  9-13   14  Structural hazard on the FP adder
                                            from #2
    5  DADDI r0, r0, #8       5  6       7
    6  LD F1, 0(r1)           6  7       8
    7  MULTD F1, F1, F8       7  9-16   17  RAW on F1 from #6
    8  ADDD F6, F3, F5        8  14-18  19  Structural hazard on the FP adder
                                            from #4
    9  DADDI r1, r1, #8       9  10     11

Note on #3: although it issues before #7, its F2 operand (written by #2 in cycle 9) is usable only in cycle 10, while #7's F1 operand (written by #6 in cycle 8) is usable in cycle 9. #7 therefore acquires the single FP multiplier first, and #3 must wait until #7's WR cycle (17) to begin execution.
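To make the timing rules concrete, here is a small cycle-driven Python sketch, added for illustration only (it is not the method the assignment asks for). The tuple encoding of the code and all identifiers are assumptions made for this example; the FP divider, branches, and memory details beyond a one-cycle integer EX are left out because this code sequence does not exercise them.

    # Cycle-driven sketch of the Tomasulo timing rules stated above.
    LAT = {"INT": 1, "FPADD": 5, "FPMUL": 8}  # EX cycles; FP divide (24) unused

    code = [  # (text, functional unit, destination, sources)
        ("LD F0, 0(r0)",     "INT",   "F0", ["r0"]),
        ("ADDD F2, F0, F4",  "FPADD", "F2", ["F0", "F4"]),
        ("MULTD F4, F2, F6", "FPMUL", "F4", ["F2", "F6"]),
        ("ADDD F6, F8, F10", "FPADD", "F6", ["F8", "F10"]),
        ("DADDI r0, r0, #8", "INT",   "r0", ["r0"]),
        ("LD F1, 0(r1)",     "INT",   "F1", ["r1"]),
        ("MULTD F1, F1, F8", "FPMUL", "F1", ["F1", "F8"]),
        ("ADDD F6, F3, F5",  "FPADD", "F6", ["F3", "F5"]),
        ("DADDI r1, r1, #8", "INT",   "r1", ["r1"]),
    ]
    n = len(code)

    # Renaming at issue: each source is either already available (-1) or
    # tagged with the index of the in-flight instruction that produces it.
    producer, tags = {}, []
    for i, (_, _, dest, srcs) in enumerate(code):
        tags.append([producer.get(r, -1) for r in srcs])
        producer[dest] = i

    issue = [i + 1 for i in range(n)]   # one in-order issue per cycle
    ex_s, ex_e, wr = [None] * n, [None] * n, [None] * n
    fu_free = {f: 0 for f in LAT}       # cycle the FU's occupant writes back
    cdb = set()                         # WR cycles already claimed

    for cycle in range(2, 60):
        for i, (_, fu, _, _) in enumerate(code):  # program order = priority
            if ex_s[i] is not None or cycle <= issue[i]:
                continue                # already started, or not yet issued
            # A CDB-delivered operand is usable the cycle after its
            # producer's WR; a freed FU is grabbed in its occupant's WR cycle.
            ready = all(t == -1 or (wr[t] is not None and wr[t] < cycle)
                        for t in tags[i])
            if ready and fu_free[fu] <= cycle:
                ex_s[i], ex_e[i] = cycle, cycle + LAT[fu] - 1
                w = ex_e[i] + 1
                while w in cdb:         # only one CDB write per cycle
                    w += 1
                cdb.add(w)
                wr[i] = fu_free[fu] = w  # FU stays occupied until write-back

    for i, (text, *_) in enumerate(code):
        print(f"#{i+1} {text:17} IS {issue[i]:>2}  "
              f"EX {ex_s[i]}-{ex_e[i]}  WR {wr[i]}")

Running it prints the same IS, EX, and WR cycles as the table above.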