Alexandria University

Similar documents
3.16 Historical Perspective and References

1. Pipeline CPI =Ideal pipeline CPI+ Structural Stalls + not fully pipelined. Insert the bubble clock cycle

Case Study 1: Exploring the Impact of Microarchitectural Techniques

CS433 Homework 2 (Chapter 3)

Four Steps of Speculative Tomasulo cycle 0

CS433 Homework 2 (Chapter 3)

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)

For this problem, consider the following architecture specifications: Functional Unit Type Cycles in EX Number of Functional Units

Hardware-Based Speculation

Static, multiple-issue (superscaler) pipelines

Load1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1

CS146 Computer Architecture. Fall Midterm Exam

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

CSE Lecture 13/14 In Class Handout For all of these problems: HAS NOT CANNOT Add Add Add must wait until $5 written by previous add;

c. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations?

Instruction Level Parallelism

Hardware-based Speculation

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Copyright 2012, Elsevier Inc. All rights reserved.

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS

Instruction-Level Parallelism and Its Exploitation

EECC551 Exam Review 4 questions out of 6 questions

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Instruction Frequency CPI. Load-store 55% 5. Arithmetic 30% 4. Branch 15% 4

CS433 Midterm. Prof Josep Torrellas. October 16, Time: 1 hour + 15 minutes

Chapter 4. The Processor

5008: Computer Architecture

CS433 Midterm. Prof Josep Torrellas. October 19, Time: 1 hour + 15 minutes

5008: Computer Architecture HW#2

CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

4. What is the average CPI of a 1.4 GHz machine that executes 12.5 million instructions in 12 seconds?

Super Scalar. Kalyan Basu March 21,

Multiple Instruction Issue and Hardware Based Speculation

ECE260: Fundamentals of Computer Engineering

CS252 Graduate Computer Architecture Midterm 1 Solutions

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

Pipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ...

EITF20: Computer Architecture Part3.2.1: Pipeline - 3

CPE 631 Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation

Week 11: Assignment Solutions

Multi-cycle Instructions in the Pipeline (Floating Point)

Dynamic Control Hazard Avoidance

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Course on Advanced Computer Architectures

Static vs. Dynamic Scheduling

CS 2410 Mid term (fall 2015) Indicate which of the following statements is true and which is false.

Multiple Issue ILP Processors. Summary of discussions

Instruction-Level Parallelism. Instruction Level Parallelism (ILP)

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

EEC 581 Computer Architecture. Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW)

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

Full Datapath. Chapter 4 The Processor 2

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Please state clearly any assumptions you make in solving the following problems.

Final Exam Fall 2007

Midterm I March 21 st, 2007 CS252 Graduate Computer Architecture

Handout 2 ILP: Part B

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

Chapter 4 The Processor 1. Chapter 4D. The Processor

Chapter 3: Instruction Level Parallelism (ILP) and its exploitation. Types of dependences

ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013

Updated Exercises by Diana Franklin

Outline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar, VLIW. CPE 631 Session 19 Exploiting ILP with SW Approaches

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Page # CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Michela Taufer

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

Pipelining. CSC Friday, November 6, 2015

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

Good luck and have fun!

Processor (IV) - advanced ILP. Hwansoo Han

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Pipelining and Vector Processing

Advanced Computer Architecture CMSC 611 Homework 3. Due in class Oct 17 th, 2012

Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining

CS152 Computer Architecture and Engineering. Complex Pipelines, Out-of-Order Execution, and Speculation Problem Set #3 Due March 12

CS425 Computer Systems Architecture

Metodologie di Progettazione Hardware-Software

EXAM #1. CS 2410 Graduate Computer Architecture. Spring 2016, MW 11:00 AM 12:15 PM

Reduction of Data Hazards Stalls with Dynamic Scheduling So far we have dealt with data hazards in instruction pipelines by:

Getting CPI under 1: Outline

ILP: Instruction Level Parallelism

Chapter 4. The Processor

Lecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996

COMPUTER ORGANIZATION AND DESIGN

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011

Instruction Pipelining Review

LECTURE 3: THE PROCESSOR

EEC 581 Computer Architecture. Lec 7 Instruction Level Parallelism (2.6 Hardware-based Speculation and 2.7 Static Scheduling/VLIW)

Adapted from David Patterson s slides on graduate computer architecture

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer

Announcements. ECE4750/CS4420 Computer Architecture L11: Speculative Execution I. Edward Suh Computer Systems Laboratory

Computer Architecture Homework Set # 3 COVER SHEET Please turn in with your own solution

Exploiting ILP with SW Approaches. Aleksandar Milenković, Electrical and Computer Engineering University of Alabama in Huntsville

CS152 Computer Architecture and Engineering. Complex Pipelines

Transcription:

Alexandria University Faculty of Engineering Computer and Communications Department CC322: CC423: Advanced Computer Architecture Sheet 3: Instruction- Level Parallelism and Its Exploitation 1. What would be the baseline performance (in cycles, per loop iteration) of the code sequence in Figure 1 if no new instruction s execution could be initiated until the previous instruction s execution had completed? Ignore front-end fetch and decode. Assume for now that execution does not stall for lack of the next instruction, but only one instruction/cycle can be issued. Assume the branch is taken, and that there is a one-cycle branch delay slot. Figure 1 2. Think about what latency numbers really mean they indicate the number of cycles a given function requires to produce its output, nothing more. If the overall pipeline stalls for the latency cycles of each functional unit, then you are at least guaranteed that any pair of back-to-back instructions (a producer followed by a consumer ) will execute correctly. But not all instruction pairs have a producer/consumer relationship. Sometimes two adjacent instructions have nothing to do with each other. How many cycles would the loop body in the code sequence in Figure 1 require if the pipeline detected true data dependences and only stalled on those, rather than blindly stalling everything just because one functional unit is busy? Show the

code with <stall > inserted where necessary to accommodate stated latencies. (Hint: An instruction with latency +2 requires two <stall > cycles to be inserted into the code sequence. Think of it this way: A one-cycle instruction has latency 1 + 0, meaning zero extra wait states. So, latency 1 + 1 implies one stall cycle; latency 1 + N has N extra stall cycles. 3. Consider a multiple-issue design. Suppose you have two execution pipelines, each capable of beginning execution of one instruction per cycle, and enough fetch/decode bandwidth in the front end so that it will not stall your execution. Assume results can be immediately forwarded from one execution unit to another, or to itself. Further assume that the only reason an execution pipeline would stall is to observe a true data dependency. Now how many cycles does the loop require? 4. Reorder the instructions to improve performance of the code in Figure 1. Assume the two-pipe machine in Exercise 3 has been dealt with successfully. Just worry about observing true data dependences and functional unit latencies for now. How many cycles does your reordered code take? 5. Every cycle that does not initiate a new operation in a pipe is a lost opportunity, in the sense that your hardware is not living up to its potential. a. In your reordered code from Exercise 4, what fraction of all cycles, counting both pipes, were wasted (did not initiate a new op)? b. Loop unrolling is one standard compiler technique for finding more parallelism in code, in order to minimize the lost opportunities for performance. Hand-unroll two iterations of the loop in your reordered code from Exercise 4. c. What speedup did you obtain? (For this exercise, just color the N + 1 iteration s instructions green to distinguish them from the Nth iteration s instructions; if you were actually unrolling the loop, you would have to reassign registers to prevent collisions between the iterations.) 6. In this exercise, we will look at the simple register renaming: when the hardware register renamer sees a source register, it substitutes the

destination T register of the last instruction to have targeted that source register. When the rename table sees a destination register, it substitutes the next available T for it, but superscalar designs need to handle multiple instructions per clock cycle at every stage in the machine, including the register renaming. A simple scalar processor would therefore look up both src register mappings for each instruction and allocate a new dest mapping per clock cycle. Superscalar processors must be able to do that as well, but they must also ensure that any dest -to-src relationships between the two concurrent instructions are handled correctly. Consider the sample code sequence in Figure 2. Assume that we would like to simultaneously rename the first two instructions. Further assume that the next two available T registers to be used are known at the beginning of the clock cycle in which these two instructions are being renamed. Conceptually, what we want is for the first instruction to do its rename table lookups and then update the table per its destination s T register. Then the second instruction would do exactly the same thing, and any interinstruction dependency would thereby be handled correctly. But there s not enough time to write that T register designation into the renaming table and then look it up again for the second instruction, all in the same clock cycle. That register substitution must instead be done live (in parallel with the register rename table update). Figure 3 shows a circuit diagram, using multiplexers and comparators, that will accomplish the necessary onthe-fly register renaming. Your task is to show the cycle-by-cycle state of the rename table for every instruction of the code shown in Figure 2. Assume the table starts out with every entry equal to its index (T0 = 0 ; T1 = 1, ). Figure 2

Figure 3 7. Assume a five-stage single-pipeline microarchitecture (fetch, decode, execute, memory, write-back) and the code in Figure 4. All ops are one cycle except LW and SW, which are 1 + 2 cycles, and branches, which are 1 + 1 cycles. There is no forwarding. Show the phases of each instruction per clock cycle for one iteration of the loop. a. How many clock cycles per loop iteration are lost to branch overhead? b. Assume a static branch predictor, capable of recognizing a backwards branch in the Decode stage. Now how many clock cycles are wasted on branch overhead? c. Assume a dynamic branch predictor. How many cycles are lost on a correct prediction? Figure 4

8. In this exercise, we will look at how variations on Tomasulo s algorithm perform when running the following loop and The functional units (FUs) are described in the table below. Assume the following: Functional units are not pipelined. There is no forwarding between functional units; results are communicated by the common data bus (CDB). The execution stage (EX) does both the effective address calculation and the memory access for loads and stores. Thus, the pipeline is IF/ID/IS/EX/WB. Loads require one clock cycle. The issue (IS) and write-back (WB) result stages each require one clock cycle. There are five load buffer slots and five store buffer slots. Assume that the Branch on Not Equal to Zero (BNEZ) instruction require one clock cycle. a. For this problem use the single-issue Tomasulo MIPS pipeline of Figure 5 with the pipeline latencies from the table above. Show the number of stall cycles for each instruction and what clock cycle each instruction begins execution (i.e., enters its first EX cycle) for

three iterations of the loop. How many cycles does each loop iteration take? Report your answer in the form of a table with the following column headers: Iteration (loop iteration number) Instruction Issues (cycle when instruction issues) Executes (cycle when instruction executes) Memory access (cycle when memory is accessed) Write CDB (cycle when result is written to the CDB) Comment (description of any event on which the instruction is waiting) Show three iterations of the loop in your table. You may ignore the first instruction. b. Repeat part (a) but this time assume a two-issue Tomasulo algorithm and a fully pipelined floating-point unit (FPU). Figure 5

9. Tomasulo s algorithm has a disadvantage: Only one result can compute per clock per CDB. Use the hardware configuration and latencies from the previous question and find a code sequence of no more than 10 instructions where Tomasulo s algorithm must stall due to CDB contention. Indicate where this occurs in your sequence. 10. Suppose we have a deeply pipelined processor, for which we implement a branch-target buffer for the conditional branches only. Assume that the misprediction penalty is always four cycles and the buffer miss penalty is always three cycles. Assume a 90% hit rate, 90% accuracy, and 15% branch frequency. How much faster is the processor with the branch-target buffer versus a processor that has a fixed two-cycle branch penalty? Assume a base clock cycle per instruction (CPI) without branch stalls of one. 11. Consider a branch-target buffer that has penalties of zero, two, and two clock cycles for correct conditional branch prediction, incorrect prediction, and a buffer miss, respectively. Consider a branch-target buffer design that distinguishes conditional and unconditional branches, storing the target address for a conditional branch and the target instruction for an unconditional branch. a. What is the penalty in clock cycles when an unconditional branch is found in the buffer? b. Determine the improvement from branch folding for unconditional branches. Assume a 90% hit rate, an unconditional branch frequency of 5%, and a two-cycle penalty for a buffer miss. How much improvement is gained by this enhancement? How high must the hit rate be for this enhancement to provide a performance gain?