UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568

Similar documents
5008: Computer Architecture HW#2

Good luck and have fun!

CIS 662: Midterm. 16 cycles, 6 stalls

4. What is the average CPI of a 1.4 GHz machine that executes 12.5 million instructions in 12 seconds?

CS433 Homework 2 (Chapter 3)

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

Updated Exercises by Diana Franklin

Instruction-Level Parallelism and Its Exploitation

CS433 Homework 2 (Chapter 3)

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568

Hardware-based Speculation

CSE 141 Summer 2016 Homework 2

Advanced Computer Architecture CMSC 611 Homework 3. Due in class Oct 17 th, 2012

CS / ECE 6810 Midterm Exam - Oct 21st 2008

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

Instruction Pipelining Review

Performance. CS 3410 Computer System Organization & Programming. [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon]

Chapter 4. The Processor

CMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1

CSE Lecture 13/14 In Class Handout For all of these problems: HAS NOT CANNOT Add Add Add must wait until $5 written by previous add;

HW1 Solutions. Type Old Mix New Mix Cost CPI

Pipelining. CSC Friday, November 6, 2015

Instr. execution impl. view

CS/COE1541: Introduction to Computer Architecture

CS146 Computer Architecture. Fall Midterm Exam

Please state clearly any assumptions you make in solving the following problems.

The Processor Pipeline. Chapter 4, Patterson and Hennessy, 4ed. Section 5.3, 5.4: J P Hayes.

Instruction Frequency CPI. Load-store 55% 5. Arithmetic 30% 4. Branch 15% 4

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Appendix C: Pipelining: Basic and Intermediate Concepts

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568

CENG 3531 Computer Architecture Spring a. T / F A processor can have different CPIs for different programs.

Lecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996

CO Computer Architecture and Programming Languages CAPL. Lecture 15

Structure of Computer Systems

Preventing Stalls: 1

CSE502 Lecture 15 - Tue 3Nov09 Review: MidTerm Thu 5Nov09 - Outline of Major Topics

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Chapter 9. Pipelining Design Techniques

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

CS433 Homework 3 (Chapter 3)

Write only as much as necessary. Be brief!

For this problem, consider the following architecture specifications: Functional Unit Type Cycles in EX Number of Functional Units

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard

INSTRUCTION LEVEL PARALLELISM

ECE260: Fundamentals of Computer Engineering

ECE 341. Lecture # 15

Multi-cycle Instructions in the Pipeline (Floating Point)

Reduction of Data Hazards Stalls with Dynamic Scheduling So far we have dealt with data hazards in instruction pipelines by:

Advantages of Dynamic Scheduling

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline

DYNAMIC SPECULATIVE EXECUTION

Practice Assignment 1

Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions.

CS 2410 Mid term (fall 2015) Indicate which of the following statements is true and which is false.

Designing for Performance. Patrick Happ Raul Feitosa

CS 251, Winter 2018, Assignment % of course mark

EECC551 Exam Review 4 questions out of 6 questions

Predict Not Taken. Revisiting Branch Hazard Solutions. Filling the delay slot (e.g., in the compiler) Delayed Branch

COMPUTER ORGANIZATION AND DESI

ECE 486/586. Computer Architecture. Lecture # 12

Week 11: Assignment Solutions

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

LECTURE 3: THE PROCESSOR

ADVANCED COMPUTER ARCHITECTURES: Prof. C. SILVANO Written exam 11 July 2011

ECE 505 Computer Architecture

Alexandria University

Computer Architecture Practical 1 Pipelining

What is Pipelining? RISC remainder (our assumptions)

MEASURING COMPUTER TIME. A computer faster than another? Necessity of evaluation computer performance

1 Hazards COMP2611 Fall 2015 Pipelined Processor

COSC4201 Instruction Level Parallelism Dynamic Scheduling

What is Pipelining? Time per instruction on unpipelined machine Number of pipe stages

Processor (II) - pipelining. Hwansoo Han

Chapter 4. The Processor

Parts A and B both refer to the C-code and 6-instruction processor equivalent assembly shown below:

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Four Steps of Speculative Tomasulo cycle 0

Midnight Laundry. IC220 Set #19: Laundry, Co-dependency, and other Hazards of Modern (Architecture) Life. Return to Chapter 4

Instruction Pipelining

Instruction word R0 R1 R2 R3 R4 R5 R6 R8 R12 R31

Instruction Pipelining

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.

Review: Evaluating Branch Alternatives. Lecture 3: Introduction to Advanced Pipelining. Review: Evaluating Branch Prediction

ECE 154A Introduction to. Fall 2012

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome

EE557--FALL 1999 MAKE-UP MIDTERM 1. Closed books, closed notes

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

Hardware-Based Speculation

CS430 Computer Architecture

COSC 6385 Computer Architecture - Pipelining (II)

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

Pipelining and Vector Processing

Final Exam Fall 2007

Advanced Computer Architecture

Transcription:

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Computer Architecture ECE 568 Sample Midterm I Questions Israel Koren ECE568/Koren Sample Midterm.1.1 1. The cost of a pipeline can be estimated by c+k h where c is the cost of all logic stages, h is the cost of the latches in a single stage and k is the number of stages. A pipeline performance/cost ratio (PCR) has been defined as PCR = Throughput/(c+kh) where Throughput=1/t and t =t_{latch}+t/k. T is the total time required for the nonpipelined execution. Derive an expression for the optimal number of pipeline stages, k_opt, that maximizes the PCR. ECE568/Koren Sample Midterm.1.2 Page 1

2. Suppose 25% of all instructions in a program are conditional branches (of which 65% are taken), and 10% are unconditional jumps. Consider a four-deep pipeline, with each stage being executed in one cycle. Thus, each instruction takes 4 cycles to be executed, in the 1st of which the instruction is fetched. The branch is resolved at the end of the 2nd cycle for unconditional jumps, and at the end of the 3rd cycle for conditional branches. Assuming that only the 1st pipeline stage can always be done independently of whether the branch is taken, and ignoring all other pipeline stalls, how much faster would the machine be without any branches in the workload? Also assume that without any branches, the machine completes on the average, one instruction every 1.3 clock cycles. ECE568/Koren Sample Midterm.1.3 3. A given pipelined processor executes floating-point instructions of the type Fi <- Fj op Fk. It has an I stage (instruction fetch) and a D stage (instruction decode and operand fetch) each taking 1 cycle. The execution phase of a floating-point ADD operation and a floating-point Multiply operation takes 2 and 4 cycles, respectively. Each execution is followed by a single (clock) cycle write back (W) into the floating-point register file which can support simultaneous reads but only a single write per cycle. The processor executes the following sequence of floating-point instructions: S1: F2 <- F1 + F4 S4: F4 <- F1 * F2 S2: F3 <- F2 * F4 S5: F1 <- F2 + F3 S3: F3 <- F4 + F5 S6: F4 <- F1 * F2 ECE568/Koren Sample Midterm.1.4 Page 2

(a) Assume that the processor has a single floating-point adder and a single floating-point multiplier. Show all data and structural hazards that may occur while executing the given program. Indicate the exact type of each hazard. S1: F2 <- F1 + F4 S4: F4 <- F1 * F2 S2: F3 <- F2 * F4 S5: F1 <- F2 + F3 S3: F3 <- F4 + F5 S6: F4 <- F1 * F2 ECE568/Koren Sample Midterm.1.5 (b) Assume that the floating-point adder (with execution time of 2 cycles) and the floating-point multiplier (with execution time of 4 cycles) can operate in parallel and are fully pipelined (with a throughput of 1). Show the exact timing for the above program on the charts below for the following alternatives: (I) Basic pipeline with data forwarding. (II) Software rescheduling and then executing on a pipeline with data forwarding. (III) Dynamic scheduling in hardware using Tomasulo's algorithm with two reservation stations for each execution unit. Make and justify any additional assumptions that are needed to your opinion. (I) S1: F2 <- F1 + F4 S2: F3 <- F2 * F4 S3: F3 <- F4 + F5 S4: F4 <- F1 * F2 S5: F1 <- F2 + F3 S6: F4 <- F1 * F2 ECE568/Koren Sample Midterm.1.6 Page 3

(II) Software rescheduling and then executing on a pipeline with data forwarding. S1: F2 <- F1 + F4 S2: F3 <- F2 * F4 S3: F3 <- F4 + F5 S4: F4 <- F1 * F2 S5: F1 <- F2 + F3 S6: F4 <- F1 * F2 ECE568/Koren Sample Midterm.1.7 (III) Dynamic scheduling in hardware using Tomasulo's algorithm with two reservation stations for each execution unit. S1: F2 <- F1 + F4 S2: F3 <- F2 * F4 S3: F3 <- F4 + F5 S4: F4 <- F1 * F2 S5: F1 <- F2 + F3 S6: F4 <- F1 * F2 ECE568/Koren Sample Midterm.1.8 Page 4

8. A 1.12 GHz MIPS" processor (with a 5-stage pipeline) executes program A at a rate of 900 million instructions per second. (a) What is the CPI of the processor when executing this program? Do you expect the CPI to be different when executing other programs? 1.24 (b) The program executes 2.5 billion instructions. Calculate the total execution time of the program. 2.77 (c) Program B executes on the same processor, has a CPI=1.33 and the following instruction mix and average number of clock cycles per instruction, for non-branch instructions: Calculate the percentage of executed branch instructions for program B and the average number of cycles for these branch instructions. Avg. # of cycles for Branch = 1.8 (d) If an access to the data cache memory in this processor takes only 1 clock cycle, what is the probability that the instruction immediately following the Load instruction would need the data obtained from the memory by the Load? 30% ECE568/Koren Sample Midterm.1.9 4. The EX stage in a 5-stage pipeline (with data forwarding) has been replaced by 2 stages EX1 and EX2 to accommodate floating-point add/subtract (FP Add) and Integer multiply (Int Mlt) which need 2 cycles for their execution. Integer add/subtract (Int Add) would only use one of these 2 stages (either EX1 or EX2) and similarly, load/store (Ld/St) would use only one EX stage for calculating the memory address. The two EX stages also allow inclusion of conditional branch instructions (BC) that compare the values stored in 2 registers. For these BC instructions one EX stage (either EX1 or EX2) will subtract the values in the two registers and determine the condition while the other EX stage will calculate the target address. Assume the frequencies of instructions shown in the table below: (a) Ignoring all types of hazards, what is the speedup provided by the 6-stage pipeline (IF,ID,EX1, EX2,MEM,WB) compared to a non-pipelined machine? Indicate in the table above, the number of cycles that each type of instruction would take in the non-pipelined machine, and assume that the clock period of the 6-stage pipeline had to be increased by 2%. ECE568/Koren Sample Midterm.1.10 Page 5

(b) Assuming that you, as the designer, can decide in which out of the two EX cycles the integer addition will be executed, what would be the best decision if the goal is to minimize pipeline stalling because of data hazards? Consider an Int Add that follows either a Load or Int Mlt or a previous Int Add. Would your decision change if you consider an Int Add that is followed by either Store or Int Mlt? ECE568/Koren Sample Midterm.1.11 (c) In this part you have to decide whether, for conditional branches, to execute the condition check in EX1 and the target address calculation in EX2 or vice versa. Assume first that a static predict taken strategy is followed. Which execution order will be better? Repeat the analysis for a static predict not-taken strategy. Will one of these static prediction strategies be always better than the other? ECE568/Koren Sample Midterm.1.12 Page 6

5. A 1 GHz pipelined processor takes 1 cycle for an ALU instruction but all other instructions take two cycles. A certain benchmark executes 900 millions instructions out of which 44% are ALU instructions. (a) Calculate the MIPS rating of the processor for this benchmark and the total execution time. (b) An optimized compiler has been used for the above benchmark and it managed to reduce the number of executed ALU instructions by 25% and the number of the other instructions by 12.5%. Repeat the calculations in (a). Is the MIPS rating an acceptable metric for the performance in this case? ECE568/Koren Sample Midterm.1.13 6. A given benchmark has a floating-point computation phase and an integer computation phase. To speed up the execution of this benchmark the floating-point performance of the processor has been enhanced by a factor of 10. With this enhancement, the benchmark spends 50% of its total execution time in its floating-point phase. (a) Calculate the speedup achieved by using the processor's floating-point enhancement. (b) Calculate the percentage of the benchmark's floating-point ECE568/Koren Sample Midterm.1.14 Page 7

7. The designer of a 6-stage pipelined processor (with the stages IF, ID, EX1,EX2,MEM,WB) has considered two alternatives for reducing branch penalties: a delayed branch with a single delay slot and a static Not-Taken prediction scheme. The processor supports conditional branches that compare the contents of two registers. The target address is calculated in EX1 and the two registers are compared in EX2. The average percentages of unconditional branches, Taken conditional branches and Not-Taken conditional branches are 5%, 9% and 6%, respectively. Assume that the CPI of the processor when ignoring the branches is 1.4. (a) Why isn't the branch condition checked in EX1 and the target address calculated in EX2? (b) Explain the decision to have a single delay slot. (c) Write in the table below the penalties (in clock cycles) of the two schemes. Note that the instruction inserted in the delay slot can be either a useful one or useless. ECE568/Koren Sample Midterm.1.15 (c) Assuming that in 65% of the cases the delay slot contains a useful instruction calculate the CPI contribution due to branches for the two schemes. What is the speedup of the better scheme over the other one? (d) (Bonus) The designer then decided to combine the 2 schemes as follows. The Instruction Set Architecture was extended to include 2 sets of branches, one that uses the delayed branch technique (with 1 delay slot) and one that uses the static Not-Taken prediction. If the compiler estimates the probability of filling the single delay slot with a useful instruction, to be higher than 90%, it uses a delayed branch instruction. Otherwise, the compiler uses instead a different branch instruction that uses the predict Not-Taken scheme. Assuming that the compiler uses the delayed branch scheme for 70% of the branches (and the predict Not- Taken for the remaining 30%) and that now the probability of having a useful instruction in the delay slot has increased to 0.80, calculate the CPI contribution due to branches of the combined technique. ECE568/Koren Sample Midterm.1.16 Page 8