CS252 Prerequisite Quiz Solutions
Krste Asanovic
Fall 2007

Problem 1 (29 points)

The following are two code segments written in MIPS64 assembly language:

Segment A:

    Loop:   LD      r5, 0(r1)       # r5 ← Mem[r1+0]
            LD      r6, 0(r2)       # r6 ← Mem[r2+0]
            DADD    r5, r5, r6      # r5 ← r5 + r6
            SD      r5, 0(r3)       # Mem[r3+0] ← r5
            LD      r5, 0(r1)       # r5 ← Mem[r1+0]
            LD      r6, 0(r2)       # r6 ← Mem[r2+0]
            DSUB    r5, r5, r6      # r5 ← r5 - r6
            SD      r5, 0(r4)       # Mem[r4+0] ← r5
            DADDUI  r1, r1, 8       # r1 ← r1 + 8
            DADDUI  r2, r2, 8       # r2 ← r2 + 8
            DADDUI  r3, r3, 8       # r3 ← r3 + 8
            DADDUI  r4, r4, 8       # r4 ← r4 + 8
            BNE     r1, r9, Loop    # branch to Loop if r1 ≠ r9

Segment B:

    Loop:   LD      r5, 0(r1)       # r5 ← Mem[r1+0]
            LD      r6, 0(r2)       # r6 ← Mem[r2+0]
            DADD    r7, r5, r6      # r7 ← r5 + r6
            DSUB    r8, r5, r6      # r8 ← r5 - r6
            SD      r7, 0(r3)       # Mem[r3+0] ← r7
            SD      r8, 0(r4)       # Mem[r4+0] ← r8
            DADDUI  r1, r1, 8       # r1 ← r1 + 8
            DADDUI  r2, r2, 8       # r2 ← r2 + 8
            DADDUI  r3, r3, 8       # r3 ← r3 + 8
            DADDUI  r4, r4, 8       # r4 ← r4 + 8
            BNE     r1, r9, Loop    # branch to Loop if r1 ≠ r9

In both segments, assume r1, r2, r3, and r4 initially hold valid memory addresses. Register r9 is pre-computed to be 80 larger than the initial value of r1. All instructions operate on 64-bit doubleword values, and the memory address space is byte-addressable.

a) If both segments are expected to perform the same task, can you guess what the task is? You can write the answer in C-like pseudocode. (10 points)

Since r9 is 80 larger than the initial value of r1 and each iteration advances r1 by 8 bytes, the loop runs 80/8 = 10 times:

    for (i = 0; i < 10; i = i + 1) {
        c[i] = a[i] + b[i];
        d[i] = a[i] - b[i];
    }

b) In general, which segment do you expect to perform better when executed? (9 points)

Segment B is expected to perform better because it executes two fewer memory load instructions than segment A per iteration of the loop.

c) Do the two segments always produce the same results in all situations? If not, can you specify a situation that makes them behave differently? (10 points)

If the initial value of register r3 is the same as that of r1, a memory address alias occurs, and the two segments behave differently. When segment A is executed, the value of r5 used for the addition differs from the value of r5 used for the subtraction in the same iteration: the first store in the iteration, to the address pointed to by r3, overwrites the value at the address pointed to by r1, which changes the value returned by the second load from that address. When segment B is executed, the value of r5 used for the addition is the same as the value used for the subtraction. A similar scenario occurs when r3 has the same initial value as r2.
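The aliasing behavior in part (c) is easy to demonstrate in C. Below is a minimal sketch (the array names, initial values, and print harness are hypothetical choices, not part of the quiz) that replays each segment's load/store order with c aliased to a:

    #include <stdio.h>

    int main(void) {
        long a[10], b[10];
        long *c = a;                 /* alias: r3 starts equal to r1 */
        long d_a[10], d_b[10];

        /* Segment A's order: a[i] is reloaded after the store to c[i]. */
        for (int i = 0; i < 10; i++) { a[i] = i; b[i] = 1; }
        for (int i = 0; i < 10; i++) {
            c[i] = a[i] + b[i];      /* store to c[i] overwrites a[i] */
            d_a[i] = a[i] - b[i];    /* second load sees the new value */
        }

        /* Segment B's order: a[i] and b[i] are loaded once, up front. */
        for (int i = 0; i < 10; i++) { a[i] = i; b[i] = 1; }
        for (int i = 0; i < 10; i++) {
            long r5 = a[i], r6 = b[i];
            c[i] = r5 + r6;
            d_b[i] = r5 - r6;
        }

        /* Prints d=0 for segment A but d=-1 for segment B. */
        printf("i=0: segment A d=%ld, segment B d=%ld\n", d_a[0], d_b[0]);
        return 0;
    }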

Problem 2 (36 points)

The following figure shows a 5-stage pipelined processor. The pipelined processor should always compute the same results as an unpipelined processor. Answer the following questions for each of the instruction sequences below:

- Why does the sequence require special handling (what could go wrong)?
- What are the minimal hardware mechanisms required to ensure correct behavior?
- What additional hardware mechanisms, if any, could help preserve performance?

Assume that the architecture does not have any branch delay slots, and assume that branch conditions are computed by the ALU.

[Figure: 5-stage pipeline. Stages: Instruction Fetch (IF); Instruction Decode/Register Fetch (RF read); Execute/Address Calculation (ALU); Memory Access (Data Mem); Write Back (RF write).]

a) (12 points)

    BEQ r1, r0, 200    # branch to PC+200 if r1 == r0
    ADD r2, r3, r5     # r2 ← r3 + r5
    SUB r4, r5, r6     # r4 ← r5 - r6

To determine the outcome of the branch, registers r1 and r0 have to be read and compared for equality in the ALU stage. By this time, the add will be in the Register File stage, and the subtract will be in the Instruction Fetch stage. If the branch is taken, then these two instructions should not be executed. This is called a control hazard. We can achieve correct behavior, if the branch is taken, by killing the two following instructions and inserting bubbles into the pipeline in their place.

We can improve performance by using branch prediction to guess the outcome of the branch during the Register File stage. If we correctly predict a taken branch, we only have to kill one instruction instead of two. To further improve performance, we could use a branch target buffer to predict the address to fetch next during the Instruction Fetch stage (even before the branch is decoded in the Register File stage).

b) (12 points)

    ADD r1, r0, r2    # r1 ← r0 + r2
    SUB r4, r1, r2    # r4 ← r1 - r2

The add writes its result to register r1 when it reaches the Write Back stage, at which time the subtract will be in the Memory stage. However, the subtract needs to use register r1 to compute its result in the ALU stage. This is called a data hazard. We can achieve correct behavior by stalling the subtract in the Register File stage until register r1 gets written: we would stall the subtract for three cycles and insert bubbles into the pipeline. We can improve performance by adding bypass muxes to the pipeline and forwarding the result of the add at the end of the ALU stage to the subtract at the end of the Register File stage. With this solution, the pipeline does not need to stall.

c) (12 points)

    LD  r1, 0(r2)     # r1 ← Mem[r2+0]
    ADD r3, r1, r2    # r3 ← r1 + r2

This is a data hazard, similar to part (b) above. We can again achieve correct behavior by stalling the dependent instruction until register r1 gets written. We can improve performance by forwarding the result of the load at the end of the Memory stage to the add at the end of the Register File stage. However, we still must stall the add for one cycle.
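As a sketch (not part of the original solutions), the two costliest cases can be drawn as cycle-by-cycle pipeline diagrams using the stage names from the figure, where "--" marks a killed or stalled slot:

    Part (a): taken branch, no prediction (resolved at end of ALU)
    cycle:            1    2    3    4    5    6    7
    BEQ r1,r0,200     IF   RF   ALU  MEM  WB
    ADD r2,r3,r5           IF   RF   --   --   --    (killed)
    SUB r4,r5,r6                IF   --   --   --    (killed)
    branch target                    IF   RF   ALU  ...

    Part (c): load-use hazard with forwarding (MEM-to-ALU bypass)
    cycle:            1    2    3    4    5    6    7
    LD  r1,0(r2)      IF   RF   ALU  MEM  WB
    ADD r3,r1,r2           IF   RF   --   ALU  MEM  WB

The one-cycle bubble in part (c) cannot be removed by forwarding alone, because the load's data is not available until the end of its Memory stage.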

Problem 3 (15 points)

For each of the following statements about making a change to a cache design, circle True or False and provide a one-sentence explanation of your choice. Assume all cache parameters (capacity, associativity, line size) remain fixed except for the single change described in each question.

a) Doubling the line size halves the number of tags in the cache. (3 points)

True. Since capacity (the amount of data that can be stored in the cache) and associativity are fixed, doubling the line size halves the number of lines in the cache. Since there is one tag per line, halving the number of lines halves the number of tags.

b) Doubling the associativity doubles the number of tags in the cache. (3 points)

False. Since capacity and line size are fixed, doubling the associativity does not change the number of lines in the cache. Therefore the number of tags in the cache remains the same.

c) Doubling the cache capacity of a direct-mapped cache usually reduces conflict misses. (3 points)

True. Since line size remains the same, doubling the cache capacity doubles the number of lines, reducing the probability of a conflict miss.

d) Doubling the cache capacity of a direct-mapped cache usually reduces compulsory misses. (3 points)

False. Since line size remains the same, the amount of data loaded into the cache on each miss remains the same. Therefore the number of compulsory misses remains the same.

e) Doubling the line size usually reduces compulsory misses. (3 points)

True. Doubling the line size increases the amount of data loaded into the cache on each miss. If the program exhibits spatial locality, it will experience fewer compulsory misses.
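The counting arguments in parts (a) and (b) can be checked with a few lines of C. This is a sketch; the 32 KB capacity and 64-byte line size are assumed example values, not from the quiz:

    #include <stdio.h>

    /* One tag per line, so tags = lines = capacity / line_size.
       Associativity only regroups lines into sets (sets = lines / ways),
       so it does not change the number of tags. */
    static unsigned tags(unsigned capacity_bytes, unsigned line_bytes) {
        return capacity_bytes / line_bytes;
    }

    int main(void) {
        unsigned cap = 32 * 1024;   /* assumed: 32 KB capacity */
        unsigned line = 64;         /* assumed: 64-byte lines */

        printf("baseline:         %u tags\n", tags(cap, line));      /* 512 */
        printf("2x line size:     %u tags\n", tags(cap, 2 * line));  /* 256, halved (part a) */
        printf("2x associativity: %u tags\n", tags(cap, line));      /* 512, unchanged (part b) */
        return 0;
    }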