Computer Architecture EE 4720 Final Examination

Similar documents
Computer Architecture EE 4720 Final Examination

Computer Architecture EE 4720 Final Examination

Computer Architecture EE 4720 Final Examination

Computer Architecture EE 4720 Final Examination

Computer Architecture EE 4720 Final Examination

Computer Architecture EE 4720 Final Examination

Computer Architecture EE 4720 Midterm Examination

Computer Architecture EE 4720 Midterm Examination

These slides do not give detailed coverage of the material. See class notes and solved problems (last page) for more information.

Computer Architecture EE 4720 Midterm Examination

Computer Architecture EE 4720 Midterm Examination

Very Simple MIPS Implementation

Computer Architecture EE 4720 Midterm Examination

LSU EE 4720 Homework 5 Solution Due: 20 March 2017

Computer Architecture EE 4720 Practice Final Examination

Final Exam Fall 2007

Computer Architecture EE 4720 Midterm Examination

LSU EE 4720 Homework 1 Solution Due: 12 February 2016

Very Simple MIPS Implementation

CENG 3531 Computer Architecture Spring a. T / F A processor can have different CPIs for different programs.

Please state clearly any assumptions you make in solving the following problems.

Computer Architecture EE 4720 Midterm Examination

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

The University of Michigan - Department of EECS EECS 370 Introduction to Computer Architecture Midterm Exam 2 solutions April 5, 2011

GPU Microarchitecture Note Set 2 Cores

CS/CoE 1541 Mid Term Exam (Fall 2018).

Question 1: (20 points) For this question, refer to the following pipeline architecture.

Chapter 4. The Processor

CS/CoE 1541 Exam 1 (Spring 2019).

CS 351 Exam 2 Mon. 11/2/2015

CAD for VLSI 2 Pro ject - Superscalar Processor Implementation

CS 251, Winter 2018, Assignment % of course mark

CS 251, Winter 2019, Assignment % of course mark

CS433 Midterm. Prof Josep Torrellas. October 16, Time: 1 hour + 15 minutes

Final Exam Fall 2008

Pipeline design. Mehran Rezaei

Notes Interrupts and Exceptions. The book uses exception as a general term for all interrupts...

(1) Using a different mapping scheme will reduce which type of cache miss? (1) Which type of cache miss can be reduced by using longer lines?

CS2100 Computer Organisation Tutorial #10: Pipelining Answers to Selected Questions

LECTURE 3: THE PROCESSOR

CISC 662 Graduate Computer Architecture. Lecture 4 - ISA

3. (2 pts) Clock rates have grown by a factor of 1000 while power consumed has only grown by a factor of 30. How was this accomplished?

09-1 Multicycle Pipeline Operations

CS146 Computer Architecture. Fall Midterm Exam

Spring 2014 Midterm Exam Review

ISA Instruction Operation

CISC 662 Graduate Computer Architecture. Lecture 4 - ISA MIPS ISA. In a CPU. (vonneumann) Processor Organization

Very short answer questions. You must use 10 or fewer words. "True" and "False" are considered very short answers.

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

Complex Pipelines and Branch Prediction

1 Tomasulo s Algorithm

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

Very short answer questions. You must use 10 or fewer words. "True" and "False" are considered very short answers.

Final Exam Spring 2017

EXAM #1. CS 2410 Graduate Computer Architecture. Spring 2016, MW 11:00 AM 12:15 PM

Comprehensive Exams COMPUTER ARCHITECTURE. Spring April 3, 2006

Chapter 4. The Processor

are Softw Instruction Set Architecture Microarchitecture are rdw

LSU EE 4720 Dynamic Scheduling Study Guide Fall David M. Koppelman. 1.1 Introduction. 1.2 Summary of Dynamic Scheduling Method 3

This Set. Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Two Dynamic Scheduling Methods

/ : Computer Architecture and Design Fall Midterm Exam October 16, Name: ID #:

ECE/CS 552: Introduction to Computer Architecture ASSIGNMENT #1 Due Date: At the beginning of lecture, September 22 nd, 2010

Computer System Architecture Final Examination Spring 2002

Computer Architecture CS372 Exam 3

SOLUTION. Midterm #1 February 26th, 2018 Professor Krste Asanovic Name:

/ : Computer Architecture and Design Fall 2014 Midterm Exam Solution

Lecture 2: RISC V Instruction Set Architecture. Housekeeping

These slides do not give detailed coverage of the material. See class notes and solved problems (last page) for more information.

Outline. A pipelined datapath Pipelined control Data hazards and forwarding Data hazards and stalls Branch (control) hazards Exception

INSTRUCTION SET COMPARISONS

Instruction word R0 R1 R2 R3 R4 R5 R6 R8 R12 R31

COMPUTER ORGANIZATION AND DESIGN

Instruction Level Parallelism (Branch Prediction)

EE557--FALL 1999 MAKE-UP MIDTERM 1. Closed books, closed notes

Do not open this exam until instructed to do so. CS/ECE 354 Final Exam May 19, CS Login: QUESTION MAXIMUM SCORE TOTAL 115

IF1 --> IF2 ID1 ID2 EX1 EX2 ME1 ME2 WB. add $10, $2, $3 IF1 IF2 ID1 ID2 EX1 EX2 ME1 ME2 WB sub $4, $10, $6 IF1 IF2 ID1 ID2 --> EX1 EX2 ME1 ME2 WB

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

ECE 313 Computer Organization FINAL EXAM December 14, This exam is open book and open notes. You have 2 hours.

University of Toronto Faculty of Applied Science and Engineering

Full Datapath. Chapter 4 The Processor 2

4.1.3 [10] < 4.3>Which resources (blocks) produce no output for this instruction? Which resources produce output that is not used?

Anne Bracy CS 3410 Computer Science Cornell University. [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon]

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Basic Instruction Timings. Pipelining 1. How long would it take to execute the following sequence of instructions?

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

Course Administration

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

ECE550 PRACTICE Final

CS152 Computer Architecture and Engineering. Complex Pipelines

EN2910A: Advanced Computer Architecture Topic 02: Review of classical concepts

Pipelining. CSC Friday, November 6, 2015

CPU Pipelining Issues

Very short answer questions. "True" and "False" are considered short answers.

Machine Organization & Assembly Language

Midterm I March 21 st, 2007 CS252 Graduate Computer Architecture

Lecture 2: RISC V Instruction Set Architecture. James C. Hoe Department of ECE Carnegie Mellon University

See also cache study guide. Contents Memory and Caches. Supplement to material in section 5.2. Includes notation presented in class.

BRANCH PREDICTORS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

101 Assembly. ENGR 3410 Computer Architecture Mark L. Chang Fall 2009

Transcription:

Name Computer Architecture EE 4720 Final Examination 1 May 2017, 10:00 12:00 CDT Alias Problem 1 Problem 2 Problem 3 Problem 4 Problem 5 Exam Total (20 pts) (15 pts) (20 pts) (15 pts) (30 pts) (100 pts) Good Luck!

Problem 1: (20 pts) The diagram below, based on the solution to Homework 5, shows control logic that generates a stall signal when the value to be bypassed is too large for 12-bit bypass paths. The logic only works when the dependency is with the rt register of the consuming instruction and when the producing instruction is not a load. Modify the control logic so that it will generate a stall signal for a dependency to an rs register (first example below) and dependencies with loads. Pay attention to the load sizes. # Dependency through rs register (r1 in the sub). add r1, r2, r3 sub r4, r1, r5 # Producing instruction is a load word. lw r1, 2(r3) and r4, r5, r1 # Producing instruction is a load half. lh r1, 2(r3) and r4, r5, r1 # Producing instruction is a load byte. lb r1, 2(r3) and r4, r5, r1 Use next page for solution. 29:26 25:0 29:0 IF + ID = EX ME WB N N ALU +1 rsv ByME imm ByWB ALU D In D MD 2'b0 30 2 msb lsb format immed imm mx1 31:11 31:11 abig IR Decode Dest dst dst dst rt =' =' STALL = lb = lh = lw is Type I is Type R ByME 00 01 imm 10 lsb msb ByWB 11 Use next page for solution. 2

Problem 1, continued: Modify the control logic so that it also generates the stall signal for dependencies through the rs register. Modify the stall control logic for when loads lb, lh, and lw produce the value to bypass, account whether value can use the 12-bit bypasses. take into 29:26 25:0 + 29:0 IF ID = EX ME WB +1 N N rsv ByME imm ByWB ALU D In D ALU MD 30 2 msb 2'b0 lsb format immed imm mx1 31:11 31:11 abig IR Decode Dest dst dst dst rt =' =' STALL = lb = lh = lw is Type I is Type R ByME 00 01 imm 10 lsb msb ByWB 11 3

Problem 2: (15 pts) Illustrated below is a superscalar implementation taken from the solution to last year s final exam. Show the execution of the code sequences below on the illustrated superscalar MIPS implementation. Don t forget to check for dependencies. 31:2 2'b0 IF + ID EX ME WB npc Register File +8 rsv 0 0 rsv 1 1 addr D md 64 ir 0 ir 1 imm 0 imm 1 mp1 is is STALL_ID_1 (a) Show the execution of the code below on this implementation. Note that the address of the first instruction is 0x1000. Show execution of the following code sequence. Pay attention to ME in the diagram. Check for dependencies. START: ess is 0x1000. lw r1, 0(r2) lw r3, 4(r2) lw r4, 8(r2) add r5, r1, r5 add r5, r3, r5 add r5, r4, r5 4

Problem 2, continued: The illustration below is the same as the one on the previous page. 31:2 2'b0 IF + ID EX ME WB npc Register File +8 rsv 0 0 rsv 1 1 addr D md 64 ir 0 ir 1 imm 0 imm 1 mp1 is is STALL_ID_1 (b) Show the execution of the code below on the illustrated implementation when the branch is taken. Use the classroom default assumption: fetches are aligned. Show execution of the following code sequence. Check for dependencies. Show all instructions that enter the pipeline, even those that are squashed in IF. Pay attention to instruction addresses, such as 0x1000. # Branch is taken. 0x1000: bneq r1, r4 TARG 0x1004: sub r5, r2, r7 0x1008: xor r10, r11, r12 0x100c: lbu r9, 0(r5) 0x1010: andi r8, r9, 12 TARG: 0x1014: or r11, r5, r12 0x1018: sb r11, 0(r5) 5

Problem 2, continued: The illustration below is the same as on the previous page. 31:2 2'b0 IF + ID EX ME WB npc Register File +8 rsv 0 0 rsv 1 1 addr D md 64 ir 0 ir 1 imm 0 imm 1 mp1 is is STALL_ID_1 (c) Appearing below is an execution of MIPS code on the illustrated superscalar implementation shown for the first two iterations. Compute the CPI for a large number of iterations. If necessary extend the execution diagram. lw r1, 0(r2) IF ID EX ME WB LOOP: # Cycle 0 1 2 3 4 5 6 7 8 9 10 add r1, r1, r4 IF ID -> EX ME WB # First Iteration lw r1, 0(r2) IF ID -> EX ME WB bneq r2, r3 LOOP IF -> ID EX ME WB addi r2, r2, 4 IF -> ID EX ME WB LOOP: # Cycle 0 1 2 3 4 5 6 7 8 9 10 add r1, r1, r4 IF ID EX ME WB # Second Iteration lw r1, 0(r2) IF ID EX ME WB bneq r2, r3 LOOP IF ID EX ME WB addi r2, r2, 4 IF ID EX ME WB LOOP: # Cycle 0 1 2 3 4 5 6 7 8 9 10 CPI for a large number of iterations. 6

Problem 3: (20 pts) Answer the following branch prediction questions. (a) Code producing the branch patterns shown below is to run on three systems, each with a different branch predictor. All systems use a 2 12 entry BHT. One system has a bimodal predictor, one system has a local predictor with a 8-outcome local history, and one system has a global predictor with a 8-outcome global history. Branch B2 consists of a repeating pattern that starts with TNTT and is either followed by three not-taken outcomes, nnn, or four taken outcomes, tttt. (They are shown in lower case for clarity.) The nnn sequence occurs with probability.4, and is not correlated with anything. Answer each question below, the answers should be for predictors that have already warmed up. Show work or provide brief explanations. B1: T N T T N T N T T N T N T T N B2: T N T T n n n T N T T t t t t What is the accuracy of the bimodal predictor on branch B1? What is the accuracy of the bimodal predictor on branch B2? length. Account for the variable pattern What is the accuracy of the local predictor on branch B2? Account for the variable pattern length. What is the minimum local history size for which branch B1 and B2 will not interfere with each other? Explain. Note that an arrow ( ) points at an execution of B1. Show the value of the GHR at the time that that execution is being predicted. 7

Problem 3, continued: (b) Appearing below is a diagram of a bimodal predictor, showing in detail the logic for predicting the instruction in IF and for updating the predictor for the resolving branch. Modify the diagram so that it is a local predictor with an 8-outcome local history. Show the PHT, and connections for prediction and update. IF ID (resolve) -1 15:2 15:2 Post-resolve 2-b counter. BHT a d we a Target d in To Mux. To control logic. Target 2-bit counter 2 Prediction 1:1 Travels with instruction. +1 2 come (1=T, 0=N) Pre-resolve 2-b counter Target (resolve) Is-Branch (=1 if branch, =0 otherwise.) From ME (or stage where branch resolved). 8

Problem 4: (15 pts) The diagram below is for a 32MiB (2 25 B) four-way set-associative cache with a line size of 32B. (a) Answer the following, formulæ are fine as long as they consist of grade-time constants. Fill in the blanks in the diagram. CPU In 16 c hit 64 logic Tag Tag Tag Valid = Tag Valid = Complete the address bit categorization below. Label the sections appropriately. (Index, Offset, Tag.) ess: 0 ory Needed to Implement Indicate Unit!!: Show the bit categorization for a direct-mapped cache with the same line size and capacity as the cache above. ess: 9

Problem 4, continued: The problem on this page is not based on the cache from Part a. The code in the problem belows run on a cache with a line size of 1024B (2 10 B). Each code fragment starts with the cache empty; consider only accesses to the arrays. (b) Find the hit ratio executing the code below. int sum = 0; int *a = 0x2000000; // sizeof(int) == 4 int i; int ILIMIT = 1 << 11; // = 2 11 for (i=0; i<ilimit; i++) sum += a[ i ]; What is the hit ratio running the code above? Show formula and briefly justify. 10

Problem 5: (30 pts) Answer each question below. (a) Consider a 4-way superscalar system and a scalar system with a 4-lane vector unit. Both can compute arithmetic at a rate of 4 operations per cycle. The vector system is cheaper but the superscalar system is more flexible. Why is the vector system less costly? Show something the superscalar system can do that the vector system cannot. (b) Unlike MIPS, ARM A64 has pre-index and post-index load and store instructions. Show two code examples, one in A64 that uses a post-index load, and one in MIPS that does the same thing (but without a post-index load). The exact syntax of the ARM instructions is not important, use comments to clarify instructions. ARM code and equivalent MIPS code. (c) What additional hardware is needed to implement ARM A64 pre- and post-index loads when starting with something like our five-stage MIPS implementation. (Think about Homework 4.) Additional hardware for pre- and post-index loads. 11

(d) VLIW ISAs are supposed to do for superscalar implementations what RISC ISAs did for pipelined implementations. The diagram below shows our 2-way superscalar MIPS. Show how a 2-slot-bundle VLIW ISA (perhaps one a lot like MIPS) could simplify hardware in this implementation related to the sharing in ME. 31:2 2'b0 IF + ID EX ME WB npc Register File +8 rsv 0 0 rsv 1 1 addr D md 64 ir 0 ir 1 imm 0 imm 1 mp1 is is STALL_ID_1 Modify the hardware above. Explain the bundle slot restrictions based on modified hardware. Explain why the control logic driving STALL ID 1 would no longer be needed. 12

(e) When an exception occurs (or a trap instruction is executed) the processor switches from user mode into privileged mode (also called system mode). Explain how privileged mode affects instruction execution, including loads, compared to user mode. Affect of privileged mode on instruction execution. Affect of privileged mode on load instruction execution. (f) It s hard to choose a line size that makes everyone happy. Explain how a long line size might slow down some programs in a small cache in comparison to the right line size (for those programs). With a small cache large lines can slow some programs because: Describe the characteristics of code that works well with long lines. 13