Computer Architecture Practical 1: Pipelining

Issued: Monday 28 January 2008
Due: Friday 15 February 2008 at 4.30pm (at the ITO)

This is the first of two practicals for the Computer Architecture module of CS3. Together the practicals make up 25% of the final mark for the module. This practical consists of pen-and-paper exercises. Assessment will be based on the correctness and the clarity of your solutions. The practical is to be solved individually, to assess your competence on the subject; please bear in mind the School of Informatics guidelines on plagiarism. You must return your solutions to the ITO before the due date and time shown above.

Problem 1 [10 Marks]

Use the following code fragment:

    loop: L.D    F4, 0(R2)
          L.D    F6, 0(R3)
          MUL.D  F8, F4, F0
          MUL.D  F10, F6, F2
          ADD.D  F12, F8, F10
          DADDUI R2, R2, 8
          DADDUI R3, R3, 8
          DSUBU  R5, R4, R2
          BNEZ   R5, loop

Assume that the initial value of R4 is R2+7992, that R2 and R3 contain the base addresses of some arrays, and that F0 and F2 contain data values pre-loaded prior to the loop. For this exercise assume the standard five-stage integer pipeline and the MIPS FP pipeline as described in the lectures, with the latencies and initiation intervals shown in the table below. If structural hazards arise from write-back contention, assume that the earliest instruction gets priority and the other instructions are stalled.

    Functional unit   Latency   Initiation interval
    Integer ALU          0               1
    Data memory          1               1
    FP add               3               1
    FP multiply          6               1
    FP divide           24              24
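A quick sanity check on the trip count, which follows directly from the code above (no extra assumptions): R2 advances by 8 each iteration and the loop exits when DSUBU finds R4 - R2 = 0, so the body executes

    7992 / 8 = 999 iterations.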

a. Show the timing of this instruction sequence for the MIPS FP pipeline without any forwarding or bypassing hardware, but assuming that a register read and a register write in the same clock cycle forward through the register file. Assume that the branch is handled by flushing the pipeline. If all memory references hit in the cache, how many cycles does this loop take to execute?

b. Show the timing of this instruction sequence for the MIPS FP pipeline with normal forwarding and bypassing hardware. Assume that the branch is handled by predicting it as not taken. If all memory references hit in the cache, how many cycles does this loop take to execute?

Problem 2 [10 Marks]

A machine is called underpipelined if additional levels of pipelining can be added without appreciably changing its pipeline-stall behaviour. Suppose that the five-stage MIPS integer pipeline were changed to four stages by merging EX and MEM and lengthening the clock cycle by 50%. Assume that branches are resolved in the ID stage in both cases. How much faster would the conventional MIPS pipeline be than the underpipelined MIPS on integer code only? Make sure you include the effect of any change in pipeline stalls, given that, in the application mix running on the five-stage pipeline, 4% of instructions stall due to branches and 5% of instructions stall due to loads.
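One way to set up the Problem 2 comparison, shown here only as a sketch (working out how the stalls change on the merged four-stage pipeline is the substance of the exercise):

    Speedup = Time_4-stage / Time_5-stage
            = (IC x CPI_4 x 1.5 Tc) / (IC x CPI_5 x Tc)

where IC is the instruction count, Tc is the five-stage clock period, and each CPI is 1 plus that machine's stall fraction; for the five-stage pipeline the given figures yield CPI_5 = 1 + 0.04 + 0.05 = 1.09.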
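Problem 3 below turns on the distinction between dependences within a single iteration and dependences carried across iterations. As a warm-up, consider these two toy loops (illustrative only; they are not the assigned code):

    /* Loop-carried true dependence: s[i] needs s[i-1] from the
       previous iteration, so the iterations cannot run in parallel
       as written. */
    for (i = 1; i < 100; i = i + 1)
        s[i] = s[i-1] + x[i];

    /* No loop-carried dependence: each iteration reads and writes
       only its own elements, so all iterations are independent. */
    for (i = 1; i < 100; i = i + 1)
        t[i] = u[i] + v[i];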

Problem 3 [10 Marks]

Here is an unusual loop. First, list the dependences, and then rewrite the loop so that it is parallel.

    for (i = 1; i < 100; i = i + 1) {
        a[i]   = b[i] + c[i];   /* S1 */
        b[i]   = a[i] + d[i];   /* S2 */
        a[i+1] = a[i] + e[i];   /* S3 */
    }

Problem 4 [20 Marks]

Assume the pipeline latencies shown in the table below and a one-cycle delayed branch. Unroll the following loop a sufficient number of times to schedule it without any delays, and show the schedule after eliminating any redundant overhead instructions. Despite the fact that the loop is not parallel, it can be scheduled with no delays.

    loop: L.D    F0, 0(R1)
          L.D    F4, 0(R2)
          MUL.D  F0, F0, F4
          ADD.D  F2, F0, F2
          DADDI  R1, R1, -8
          DADDI  R2, R2, -8
          BNEZ   R1, loop

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op                     3
    FP ALU op                      Store double                          2
    Load double                    FP ALU op                             1
    Load double                    Store double                          0

Problem 5 [30 Marks]

In this exercise we will look at how a common vector loop runs on a variety of pipelined versions of the MIPS integer and FP pipelines. The loop is the so-called DAXPY loop (double-precision a*X plus Y), the central operation in Gaussian elimination. The following code implements the DAXPY operation, Y = a*X + Y, for a vector length of 100:

    loop: L.D    F2, 0(R1)     ; load X(i)
          MUL.D  F4, F2, F0    ; multiply a*X(i)
          L.D    F6, 0(R2)     ; load Y(i)
          ADD.D  F6, F4, F6    ; add a*X(i) + Y(i)
          S.D    0(R2), F6     ; store Y(i)
          DADDUI R1, R1, 8     ; increment X index
          DADDUI R2, R2, 8     ; increment Y index
          DSGTUI R3, R1, 800   ; test if done
          BEQZ   R3, loop      ; loop if not done

where DSGTUI is the instruction "double set greater than unsigned integer", which sets the result register to true if the value of the source register is greater than the unsigned immediate value.

For the questions below, assume that the integer operations issue and complete in one clock cycle (including loads) and that their results are fully bypassed. Ignore the branch delay. Use the FP latencies from Problem 4 of this practical, and assume that the FP units are fully pipelined.
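For reference, the operation this loop implements can be written in C as follows (a sketch: the names a, X and Y are assumed here, and the assembly walks the arrays by the addresses in R1 and R2 rather than by index):

    for (i = 0; i < 100; i = i + 1)
        Y[i] = a * X[i] + Y[i];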

a. Assume the single-issue MIPS pipeline described above and do not perform any re-ordering of instructions. Show the number of stall cycles for each instruction and the clock cycle on which each instruction begins execution in the first iteration of the loop. How many clock cycles does each loop iteration take?

b. Unroll the code to make four copies of the body and schedule it for the single-issue MIPS pipeline described above. When unrolling and scheduling, you should optimise the code to eliminate as many stalls as possible. How many clock cycles does each loop iteration take?

c. Consider the original (non-unrolled) DAXPY code. Assume hardware implementing Tomasulo's algorithm with one integer unit, three FP adders, and two FP multipliers. Show the state of the reservation stations and the register-status tables (as in the slides of Lecture 7) when the DSGTUI writes its result on the CDB. Do not include the branch.

d. Assume a superscalar architecture that can issue two independent operations per clock cycle (including two integer/load-store operations). Unroll the DAXPY code to make four copies of the body and schedule it. Assume one fully pipelined copy of each functional unit (including the integer unit). When unrolling and scheduling, you should optimise the code to eliminate as many stalls as possible. How many clock cycles will each iteration take?

Problem 6 [20 Marks]

Consider a predicated instruction set in which all instructions have been augmented with predication. For instance, a predicated ADD would have the following syntax:

    (p1) ADD R1, R2, R3

which is executed to completion if p1 is TRUE and discarded if p1 is FALSE. Assume the RISC architecture is extended with 8 predicate registers (p1, ..., p8). Consider also that a special compare instruction is used to set the predicate registers, as follows:

    CMP.EQ p1, p2 = R1, R2

which sets p1 to TRUE and p2 to FALSE if R1 = R2, and p1 to FALSE and p2 to TRUE if R1 != R2.

Consider the following code fragment:

          DSUB   R1, R13, R14
          BNEZ   R1, L1
          DADDI  R2, R2, 1
          SD     0(R7), R2
          J      L2
    L1:   MUL.D  F0, F0, F2
    L2:   ADD.D  F0, F4, F0
          S.D    0(R8), F0

a. Using predicated instructions, write this code fragment as a single basic block. List all the data and control dependences in the original code fragment and in your predicated version.

b. Assume the 5-stage MIPS integer pipeline and the MIPS FP pipeline with the latencies of Problem 4. Branches and jumps are resolved in the ID stage, and the (possibly) incorrectly fetched instruction is discarded. Assume full bypassing is supported for both integer and FP register operands. Further assume that predicated instructions obtain the values of the predicate registers at the beginning of the ID stage (and stall if they are not yet available) and are either allowed to continue or turned into NOPs by the end of the ID stage. Finally, assume that the compare instruction described above generates its predicate results at the end of the ID stage (it requires only a simple comparator) and that bypassing of predicate register operands is supported from ID back to ID. Show the timing of the original instruction sequence and of your predicated version when each path is executed. What is the performance improvement (if any) of the predicated version for each execution path?
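The transformation asked for in part (a) is commonly called if-conversion. As a warm-up, here is the idea applied to a toy fragment written with the CMP.EQ convention defined above (illustrative only; it is not the assigned code, and it assumes the usual MIPS convention that R0 is hardwired to zero):

    Branching version:
          BEQZ  R1, L1
          ADD   R2, R2, R3     ; fall-through path, runs when R1 != 0
          J     L2
    L1:   SUB   R2, R2, R4     ; taken path, runs when R1 == 0
    L2:   ...

    Predicated version (a single basic block, no branches):
          CMP.EQ p1, p2 = R1, R0
          (p2) ADD R2, R2, R3  ; discarded unless R1 != 0
          (p1) SUB R2, R2, R4  ; discarded unless R1 == 0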

Nigel Topham 2008