CIS 662: Midterm


Name: ____________    Points: ____ / 100

First read all the questions carefully and note how many points each question carries and how difficult it is. You have 1 hour 15 minutes. Plan your time accordingly.

1. (20 points) A processor has the following stages in its pipeline: IF ID ALU1 MEM ALU2 WB. ALU instructions and load/store instructions can access memory. The ALU1 stage is used for effective address calculation for ALU operations, loads, and stores, and for target calculation for branches. The ALU2 stage is used for all other calculations and for branch resolution. The only supported addressing mode is displacement addressing. Assume all ALU operations (MUL, ADD, SUB) take one cycle in the ALU2 stage.

a) (10 points) Show the pipeline timing diagram (which instruction goes into which pipeline stage at each clock cycle) without forwarding. How many cycles does it take to execute the following instruction sequence? How many stalls have to be inserted?

ADD   R1, R3, 50(R2)
MUL   R6, R4, 100(R1)
STORE R6, 50(R2)
ADD   R1, R1, #100
SUB   R2, R2, #8

Cycle             1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16
ADD R1,R3,50(R2)  IF   ID   ALU1 MEM  ALU2 WB
MUL R6,R4,100(R1)      IF   S    S    S    ID   ALU1 MEM  ALU2 WB
STORE R6,50(R2)                            IF   S    S    S    ID   ALU1 MEM  ALU2 WB
ADD R1,R1,#100                                                 IF   ID   ALU1 MEM  ALU2 WB
SUB R2,R2,#8                                                        IF   ID   ALU1 MEM  ALU2 WB

Without forwarding, MUL cannot read R1 until ADD has written it back (the register file is written in WB and read in ID in the same cycle), so MUL decodes in cycle 6 after three stalls; STORE likewise waits three cycles for R6 from MUL.

16 cycles, 6 stalls.

b) (10 points) Show the pipeline timing diagram (which instruction goes into which pipeline stage at each clock cycle) with forwarding. How many cycles does it take to execute the following instruction sequence? Indicate the pipeline stages between which the information is forwarded. How many stalls are left?

A trailing asterisk (ALU2*) marks the stage that produces a forwarded value; a leading asterisk (*ALU1, *MEM) marks the stage that receives it.

Cycle             1     2     3     4     5     6     7     8     9     10    11    12    13
ADD R1,R3,50(R2)  IF    ID    ALU1  MEM   ALU2* WB
MUL R6,R4,100(R1)       IF    ID    S     S     *ALU1 MEM   ALU2* WB
STORE R6,50(R2)               IF    S     S     ID    ALU1  S     *MEM  ALU2  WB
ADD R1,R1,#100                                  IF    ID    S     ALU1  MEM   ALU2  WB
SUB R2,R2,#8                                          IF    S     ID    ALU1  MEM   ALU2  WB

Forwarding is done between the ALU2/WB and ID/ALU1 latches (R1 from ADD to MUL), and between the ALU2/WB and ALU1/MEM latches (R6 from MUL to STORE).

13 cycles, 3 stalls.
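The two totals above can be sanity-checked with the standard cycle-count formula for an in-order pipeline: the last of n instructions drains through s stages after n + s − 1 cycles, plus one cycle per stall bubble. A quick throwaway check (not part of the original solution):

```python
# Total cycles for an in-order pipeline: the last of n instructions
# finishes after n + s - 1 cycles, plus one cycle per stall bubble.
def total_cycles(n_instr, n_stages, stalls):
    return n_instr + n_stages - 1 + stalls

print(total_cycles(5, 6, 6))  # without forwarding: 16
print(total_cycles(5, 6, 3))  # with forwarding: 13
```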

2. (40 points) For the processor from question 1:

a) (5 points) How large is the branch penalty?

Branches are fully resolved only in the ALU2 stage, so the earliest we can fetch the next instruction is during the branch's WB stage. Ideally we would fetch it during the branch's ID stage. The penalty is therefore 4 cycles.

b) (5 points) Assume that we can introduce an optimization so that branches are resolved in the ALU1 stage, and the target address for branches is also calculated there. How large is the branch penalty now?

Branches are now fully resolved in the ALU1 stage, so we can fetch the next instruction during the branch's MEM stage. Ideally we would fetch it during the branch's ID stage. The penalty is therefore 2 cycles.

c) (18 points) We are considering the optimization from part b), where branches are resolved in the ALU1 stage. This optimization increases the clock cycle by 10% for all instructions and can be used for 65% of branches. Branches represent 35% of all instructions in our usual workload. Ideal CPI is 1. How large a speedup can we achieve with this optimization?

Old_CPU_time = Old_CPI * Old_IC * Old_CC
Old_CPI = 1 + 0.35 * branch_penalty = 1 + 0.35*4 = 2.4
Old_CPU_time = 2.4 * Old_IC * Old_CC

New_CPU_time = New_CPI * New_IC * New_CC
New_IC = Old_IC
New_CC = Old_CC * 1.1
New_CPI = 1 + 0.35 * branch_penalty = 1 + 0.35*(0.65*2 + 0.35*4) = 1.945
New_CPU_time = 1.945 * Old_IC * Old_CC * 1.1 = 2.1395 * Old_IC * Old_CC

Speedup = Old_CPU_time / New_CPU_time = 2.4 / 2.1395 ≈ 1.12
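The derivation above can be checked numerically. A minimal script with the exam's numbers hard-coded (IC cancels, so only CPI and clock cycle matter):

```python
# CPU time = CPI * IC * CC; IC is unchanged, so compare CPI * CC.
branch_frac = 0.35
old_cpi = 1 + branch_frac * 4                      # every branch pays 4 cycles
new_cpi = 1 + branch_frac * (0.65 * 2 + 0.35 * 4)  # 65% of branches now pay 2
speedup = old_cpi / (new_cpi * 1.1)                # clock cycle grew by 10%

print(round(new_cpi, 3), round(speedup, 2))  # 1.945 1.12
```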

d) (12 points) We are considering a branch prediction mechanism for the original architecture from question 1. This optimization does not increase clock cycle time. Branches represent 35% of all instructions in our usual workload. Ideal CPI is 1. If 50% of branches are not taken, 30% are taken, and 20% are jumps:

a. How large is the penalty for taken and not-taken branches with the predict-taken strategy?

We know the target in ALU1. For taken branches the penalty will be 2 cycles; for not-taken branches we have to wait until after ALU2, so the penalty will be 4 cycles.

b. How large is the penalty for taken and not-taken branches with the predict-not-taken strategy?

For taken branches the penalty will be 4 cycles; for not-taken branches it will be 0.

c. How large is the new average CPI with the predict-taken strategy?

Jumps are always taken, so 30% taken + 20% jumps = 50% of branches behave as taken.
CPI = 1 + 0.35*(0.5*4 + 0.5*2) = 2.05 cycles

d. How large is the new average CPI with the predict-not-taken strategy?

CPI = 1 + 0.35*(0.5*0 + 0.5*4) = 1.7 cycles
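The two CPI figures follow the same pattern and can be checked the same way (again a throwaway script with the exam's numbers hard-coded):

```python
# Average CPI = ideal CPI + branch fraction * expected branch penalty.
# Jumps are always taken, so "taken" below is 30% taken + 20% jumps = 50%.
branch_frac, taken, not_taken = 0.35, 0.5, 0.5

cpi_taken = 1 + branch_frac * (not_taken * 4 + taken * 2)      # predict-taken
cpi_not_taken = 1 + branch_frac * (not_taken * 0 + taken * 4)  # predict-not-taken

print(round(cpi_taken, 2), round(cpi_not_taken, 2))  # 2.05 1.7
```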

3. (20 points) Consider Tomasulo's algorithm (TA).

a) (10 points) Explain how TA handles WAW hazards and how this differs from the way scoreboarding handles WAW hazards.

TA: When an instruction is issued, the destination register is associated with the functional unit (reservation station) producing the result. If a later instruction wants to write to the same destination, the destination register is simply re-associated with the functional unit producing the later instruction's result. There is no delay in issuing the later instruction.

Scoreboarding: When an instruction is issued, the destination register is associated with the functional unit producing the result. If a later instruction wants to write to the same destination, its issue is stalled until the prior instruction writes to the destination register.
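The renaming effect can be sketched in a few lines. This is an illustrative toy, not a full Tomasulo model; the register status table and the reservation station tags ("Add1", "Mult1") are invented for the example:

```python
# Register status table: destination register -> tag of the reservation
# station that will produce its value.  A later write to the same
# register just overwrites the tag, so a WAW hazard never stalls issue.
register_status = {}

def issue_write(dest_reg, rs_tag):
    # Hypothetical issue step: rename dest_reg to its newest producer.
    register_status[dest_reg] = rs_tag

issue_write("F6", "Add1")   # first instruction will write F6
issue_write("F6", "Mult1")  # later write to F6 issues with no stall
print(register_status["F6"])  # Mult1 -- only the newest writer matters
```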

b) (10 points) Explain how TA handles WAR hazards and how this differs from the way scoreboarding handles WAR hazards.

TA: When an instruction is issued, its reservation station holds the available operand values. If a value is not yet available, the reservation station instead holds a reference to the functional unit that will produce it. A later instruction that wants to write to a register holding a source operand of the prior instruction is free to do so: the current value of that register (which the later instruction will overwrite) is either already copied into the reservation station of the instruction that uses it as a source operand, or will be delivered there directly once the producing functional unit finishes.

Scoreboarding: When an instruction is issued, the functional unit status table holds references to the registers holding its source operands. These registers are read only when all source operands are available. A later instruction that wants to write to a register holding a source operand of the prior instruction is stalled until the prior instruction has read its operands.
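Operand capture at issue time, which is what removes the WAR hazard, can be sketched similarly. Again a toy: the V/Q field names follow common textbook notation, but the structures are invented for the example:

```python
regs = {"R2": 10}      # architectural register file
register_status = {}   # reg -> tag of producing reservation station, if any

def capture_operand(src_reg):
    # At issue, copy the value now if it is ready; otherwise record the
    # tag of the producer that will deliver it to the reservation station.
    if src_reg in register_status:
        return {"Q": register_status[src_reg], "V": None}
    return {"Q": None, "V": regs[src_reg]}

rs_field = capture_operand("R2")  # earlier instruction reads R2 at issue
regs["R2"] = 99                   # later instruction overwrites R2 (WAR)
print(rs_field["V"])  # 10 -- the earlier instruction kept the old value
```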

4. (20 points) Explain what delayed branches are, how we decide the number of delay slots, what choices we have for populating these slots, and what considerations we must observe to keep program execution correct. Finally, describe the difference between delayed and nullifying branches.

Delayed branches are branches followed by specially chosen instructions that are executed while we wait for the branch to resolve. The number of delay slots equals the number of cycles we must wait for the branch to completely resolve, i.e., until both the condition and the target are known. Instructions placed into a delay slot are always executed, regardless of the branch outcome.

We can choose instructions to populate delay slots from before the branch, from the target, or from the fall-through path. When choosing from before, we must select instructions that do not affect the branch condition, so that they can be moved after the branch. When choosing from the target, we must select instructions that do not affect registers used in the fall-through part of the code. Similarly, when choosing from the fall-through, we must select instructions that do not affect registers used in the target part of the code.

Nullifying branches are similar to delayed branches: they are also followed by delay slots where we can place useful instructions. However, the compiler tries to predict a nullifying branch's outcome and chooses code from the fall-through for predict-not-taken branches or code from the target for predict-taken branches, and it encodes the prediction into the executable. While instructions in the delay slots of delayed branches are always executed, instructions following a nullifying branch are turned into no-ops at run time if the branch was mispredicted. Because of this opportunity to cancel the delay-slot instructions, the compiler can populate the slots of a nullifying branch with instructions that would be harmful if executed on the wrong path.