
Computer Architecture Homework 3
2012-2013

Please state clearly any assumptions you make in solving the following problems.

1 Processors

Write a short report on at least five processors from at least three different companies. The points of comparison in your report must include, but are not limited to, the frequency of operation, the number of different pipelines used, and the number of pipeline stages in each pipeline. (7 points)

2 Branch prediction

1. Suppose that a machine with a 5-stage pipeline uses branch prediction (i.e., no branch delay slots). 15% of the instructions for a given test program are branches, of which 80% are correctly predicted. The other 20% of the branches suffer a 4-cycle mis-prediction penalty. (In other words, when the branch predictor predicts incorrectly, there are four instructions in the pipeline that must be discarded.) Assuming there are no other stalls, develop a formula for the number of cycles it will take to complete n lines of this program. (4 points)

Answer: $n(0.85 \times 1 + 0.15(0.8 \times 1 + 0.2 \times (1 + 4))) = n(1 + 0.03 \times 4) = 1.12n$

2. Now suppose you are given the option of replacing this processor's branch prediction scheme with a 1-cycle branch delay system (i.e., there is one branch delay slot after every branch). What percentage of the branch delay slots must be filled in order for the CPU with the branch-delay system to have better performance than the CPU described above? (3 points)

Answer: With $p$ as the required percentage (expressed as a fraction), the new number of cycles is $n(0.85 \times 1 + 0.15(p \times 1 + (1 - p) \times 2)) = n(1.15 - 0.15p)$. We want $1.15 - 0.15p < 1.12$, hence $p > 0.2$: more than 20% of the delay slots must be filled.

3 Re-ordering

Assume that the following code runs on a processor with a 5-stage pipeline: fetch, decode, execute, memory, write-back. If you have stalls in the code, can you re-order it to avoid the stalls? If yes, what is the new order? If no, explain why it cannot be re-ordered. (3 points)

I1: add R1, R3, R4
I2: ld  R6, 12(R1)
I3: sub R6, R6, R5
I4: ld  R7, 16(R1)
I5: mul R8, R7, R7

Answer: Yes,

I1: add R1, R3, R4
I2: ld  R6, 12(R1)
I4: ld  R7, 16(R1)
I3: sub R6, R6, R5
I5: mul R8, R7, R7
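The arithmetic in Section 2 and the re-ordering in Section 3 can be cross-checked with a short script. This is only a sketch: the function and constant names are illustrative, and the stall counter assumes full forwarding with a one-cycle load-use penalty, which the problem statement does not spell out.

```python
from fractions import Fraction

# --- Section 2: branch prediction versus a single branch delay slot ---
BRANCH_FRACTION = Fraction(15, 100)   # 15% of instructions are branches
PREDICT_ACCURACY = Fraction(80, 100)  # 80% of branches are predicted correctly
MISPREDICT_PENALTY = 4                # extra cycles on a misprediction


def cpi_with_prediction():
    """Average cycles per instruction with the branch predictor."""
    branch_cost = (PREDICT_ACCURACY * 1
                   + (1 - PREDICT_ACCURACY) * (1 + MISPREDICT_PENALTY))
    return (1 - BRANCH_FRACTION) * 1 + BRANCH_FRACTION * branch_cost


def cpi_with_delay_slot(p):
    """Average cycles per instruction when a fraction p of slots is filled."""
    return (1 - BRANCH_FRACTION) * 1 + BRANCH_FRACTION * (p * 1 + (1 - p) * 2)


def break_even_fill_fraction():
    """Fill fraction at which both schemes need the same number of cycles."""
    # cpi_with_delay_slot(p) simplifies to 1 + b * (1 - p); solve for p.
    return 1 - (cpi_with_prediction() - 1) / BRANCH_FRACTION


# --- Section 3: load-use stalls ---
# ASSUMPTION (not stated in the problem): full forwarding, so the only stall
# is one cycle when an instruction uses the result of the load just before it.
def load_use_stalls(program):
    """Count load-use stalls; program is a list of (opcode, dest, sources)."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        opcode, dest, _ = previous
        if opcode == "ld" and dest in current[2]:
            stalls += 1
    return stalls


ORIGINAL = [
    ("add", "R1", ("R3", "R4")),  # I1
    ("ld", "R6", ("R1",)),        # I2
    ("sub", "R6", ("R6", "R5")),  # I3
    ("ld", "R7", ("R1",)),        # I4
    ("mul", "R8", ("R7", "R7")),  # I5
]
# Order given in the answer: I1, I2, I4, I3, I5.
REORDERED = [ORIGINAL[0], ORIGINAL[1], ORIGINAL[3], ORIGINAL[2], ORIGINAL[4]]

print(cpi_with_prediction())       # 28/25, i.e. 1.12 cycles per instruction
print(break_even_fill_fraction())  # 1/5, i.e. p must exceed 0.2
print(load_use_stalls(ORIGINAL), load_use_stalls(REORDERED))  # 2 0
```

Exact fractions are used instead of floats so that the break-even point comes out as exactly 1/5 rather than a rounded value.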

4 Advanced pipelines

The following code is from the algorithms employed in the machine you are designing. Assume that your processor uses a pipeline with full bypassing (forwarding), the initial value of register R23 is much bigger than the initial value of R20, and all memory references hit in the caches, taking a single cycle. You may not re-order the instructions, and if an instruction stalls it stalls all the following instructions.

LOOP: lw    R10, X(R20)    ; load the first value into R10
      lw    R11, Y(R20)    ; load the second value into R11
      subu  R10, R10, R11  ; subtract
      sw    Z(R20), R10    ; store R10 into memory
      addiu R20, R20, 4    ; step the index
      subu  R5, R23, R20   ; check the limit
      bnez  R5, LOOP       ; branch if R5 is not equal to zero
      or    R20, R5, 0     ; start of block after the loop
      lw    R12, X(R20)    ; part of new block, load first value
      lw    R13, Y(R20)    ; load the second value into R13
      mul   R12, R12, R13  ; multiply
      ...

1. Fill the following table with a pipeline diagram of 2 iterations of the loop's execution on a standard 5-stage pipeline (similar to what we studied). Assume that the branch is completely resolved in the decode stage. Indicate clearly all the required stall cycles and the reasons for those stalls. Write the average number of cycles required to complete a single iteration of the loop. (8 points)

Cycle            01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
lw R10, X(R20)   F  D  X  M  W

Reasons for stalls:

Average cycles for a single iteration =

Answer: The diagram is

Cycle      01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
lw R10     F  D  X  M  W
lw R11        F  D  X  M  W
subu R10         F  D  st X  M  W
sw                  F  st D  X  M  W
addiu R20                 F  D  X  M  W
subu R5                      F  D  X  M  W
bnez                            F  D  D  X  M  W
or                                 F  st fl
lw R10                                   F  D  X  M  W
lw R11                                      F  D  X  M  W
subu R10                                       F  D  st X  M  W
sw                                                F  st D  X  M  W
addiu R20                                               F  D  X  M  W
subu R5                                                    F  D  X  M  W
bnez                                                          F  D  D  X  M  W
or                                                                 F  st fl

(F = IF, X = EX, W = WB; st marks a stall cycle and fl a flush.)

The subu R10 instruction is stalled waiting for the loaded value of R11, and the following sw is stalled as well for the same reason. The bnez cannot read the value of R5 during its original decode stage, since R5 is calculated within the same cycle by the previous instruction. A stall cycle must be used for the or instruction, and the branch evaluates the condition in the following cycle. The average number of cycles is 10: from the start of one iteration to the start of the following iteration. (Three points for the diagram, two points for each explanation, one point for the number of cycles.)

2. As an attempt to increase the performance, you consider doubling the frequency of operation and using the following stages:

IF1: Begin instruction fetch.
IF2: Complete instruction fetch.
ID: Instruction decode.
RF: Register fetch (including the fetching of registers for branch resolution).
EX1: ALU operation execution begins. Branch target calculation finishes. Memory address calculation. Branch condition resolution calculation begins.
EX2: Branch condition resolution finishes. Finish ALU ops. (But branch and memory address calculations finish in a single cycle during EX1.)
M1: First part of memory access, address sent to memory.
M2: Second part of memory access, data sent to memory for stores OR returned from memory for loads.
WB: Write back results to register file.

Assume that forwarding allows the RF stage of an instruction to complete in the same cycle in which a previous instruction produces the required value. Fill the following table with a pipeline diagram of two iterations of the loop on the new pipeline (a single iteration is from the fetch of the first load till that same instruction is fetched again). Indicate clearly any required stall cycles and the reasons for those stalls. Assume that any instructions following a branch are fetched in order and may move up to the register fetch stage but are not issued for execution until the condition of the branch is resolved. If the branch is taken, the following instructions are flushed and the target is fetched. Write the average number of cycles required for a single iteration. (12 points)

Cycle            01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
lw R10, X(R20)   F1 F2 D  RF X1 X2 M1 M2 W

Pipeline Load Delay =

Pipeline Branch Delay =

Average cycles for a single iteration =

Reasons for stalls:

Answer: The diagram is shown below, one table per iteration. Stage names are abbreviated as in the blank table (F1 = IF1, F2 = IF2, X1 = EX1, X2 = EX2, W = WB); st marks a stall cycle and fl a flush.

First iteration (cycles 1 to 20):

Cycle      01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20
lw R10     F1 F2 D  RF X1 X2 M1 M2 W
lw R11        F1 F2 D  RF X1 X2 M1 M2 W
subu R10         F1 F2 D  st st st RF X1 X2 M1 M2 W
sw                  F1 F2 st st st D  RF X1 X2 M1 M2 W
addiu R20              F1 st st st F2 D  RF X1 X2 M1 M2 W
subu R5                            F1 F2 D  st RF X1 X2 M1 M2 W
bnez                                  F1 F2 st D  st RF X1 X2 M1 M2 W
or                                       F1 st F2 st D  RF st fl
lw R12                                         F1 st F2 D  st fl
lw R13                                               F1 F2 st fl
mul R12                                                 F1 st fl
lw R10                                                        F1 F2 D  ...

Second iteration (cycles 18 to 37):

Cycle      18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
lw R10     F1 F2 D  RF X1 X2 M1 M2 W
lw R11        F1 F2 D  RF X1 X2 M1 M2 W
subu R10         F1 F2 D  st st st RF X1 X2 M1 M2 W
sw                  F1 F2 st st st D  RF X1 X2 M1 M2 W
addiu R20              F1 st st st F2 D  RF X1 X2 M1 M2 W
subu R5                            F1 F2 D  st RF X1 X2 M1 M2 W
bnez                                  F1 F2 st D  st RF X1 X2 M1 M2 W
or                                       F1 st F2 st D  RF st fl
lw R12                                         F1 st F2 D  st fl
lw R13                                               F1 F2 st fl
mul R12                                                 F1 st fl
lw R10                                                        F1 F2 D  ...

The subu R10 instruction is stalled waiting for the loaded value of R11, and the following sw is stalled as well for the same reason. The RF stage of subu R5 must wait for R20 from the previous instruction. The RF stage of bnez must wait for R5 from the previous instruction. All the instructions following bnez wait for the condition resolution.

Delay for load is 3 cycles. Delay for branch is 1 stall for its register fetch and 5 cycles till its condition resolution at the EX2 stage. The average number of cycles is 17: from the start of one iteration to the start of the following iteration. (Two points for the diagram, two points for each explanation, two points for each number of cycles.)
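One way to sanity-check the two iteration lengths (10 cycles on the 5-stage pipeline, 17 on the 9-stage one) is to add up the issue slots, the data stalls, and the fetch slots lost while the branch is resolved. The sketch below is an interpretation of the diagrams, not part of the original solution; the function name, the stall labels, and the rule "lost fetch slots = number of stages before the resolving stage" are assumptions made for illustration.

```python
FIVE_STAGE = ["IF", "ID", "EX", "MEM", "WB"]
NINE_STAGE = ["IF1", "IF2", "ID", "RF", "EX1", "EX2", "M1", "M2", "WB"]

LOOP_LENGTH = 7  # instructions from the first lw through bnez


def iteration_cycles(stages, resolve_stage, data_stalls):
    """Cycles from the fetch of one iteration to the fetch of the next.

    ASSUMPTION: the target is fetched the cycle after the branch leaves
    resolve_stage, so the number of lost fetch slots equals the number of
    stages that precede resolve_stage.
    """
    branch_bubbles = stages.index(resolve_stage)
    return LOOP_LENGTH + sum(data_stalls.values()) + branch_bubbles


# Stall counts read off the diagrams in the answers.
five_stage = iteration_cycles(
    FIVE_STAGE, "ID",
    {"subu R10 waits for loaded R11": 1, "bnez waits for R5": 1},
)
nine_stage = iteration_cycles(
    NINE_STAGE, "EX2",
    {"subu R10 waits for loaded R11": 3,
     "subu R5 waits for R20": 1,
     "bnez waits for R5": 1},
)

print(five_stage, nine_stage)  # 10 17
```

Both totals match the averages given in the answers, which suggests the stall reasons listed there account for every lost cycle.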

3. Based on this piece of code only, is it beneficial to double the frequency and use the new pipeline? Explain why or why not. (3 points)

Answer: The new pipeline takes 17 cycles to finish one iteration. Because the clock is twice as fast, those 17 cycles take the same time as 8.5 cycles of the old pipeline. This is fewer than the 10 cycles per iteration achieved on the old pipeline, and thus it is beneficial to double the frequency. (One point for the answer, two points for the analysis.)
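The comparison can be written out as a small calculation. This is a minimal sketch using the cycle counts from the answers above; the variable names are illustrative, and the speedup figure is derived here rather than stated in the original solution.

```python
from fractions import Fraction

OLD_CYCLES_PER_ITERATION = 10  # 5-stage pipeline, original clock
NEW_CYCLES_PER_ITERATION = 17  # 9-stage pipeline, doubled clock
FREQUENCY_RATIO = 2            # new clock frequency / old clock frequency

# Time for one iteration on the new pipeline, measured in old clock cycles.
new_time_in_old_cycles = Fraction(NEW_CYCLES_PER_ITERATION, FREQUENCY_RATIO)

# Speedup = old time / new time; a value above 1 favours the new pipeline.
speedup = OLD_CYCLES_PER_ITERATION / new_time_in_old_cycles

print(new_time_in_old_cycles)  # 17/2, i.e. 8.5 old cycles
print(speedup)                 # 20/17, roughly 1.18
```

Doubling the frequency therefore yields well under a 2x gain on this loop, because the deeper pipeline pays more cycles for the same load and branch hazards.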