University of California, Berkeley
College of Engineering
Computer Science Division, EECS

Spring 2007                                        John Kubiatowicz

Midterm I
March 21st, 2007
CS252 Graduate Computer Architecture

Your Name:
SID Number:

    Problem    Possible    Score
       1          16
       2          21
       3          19
       4          20
       5          24
     Total       100

[ This page left for π ]
3.141592653589793238462643383279502884197169399375105820974944

Question #1: Short Answer [16 pts]

Problem 1a[2pts]: What is simultaneous multithreading and why is it useful?

Problem 1b[2pts]: What is a data flow architecture? How would it work?

Problem 1c[3pts]: What technological forces have caused Intel, AMD, Sun, and others to start putting multiple processors on a chip?

Problem 1d[2pts]: Name two components of a modern superscalar architecture whose delay scales quadratically with the issue width.

Problem 1e[2pts]: Most branches in a program are highly biased, i.e. they can be predicted by a simple one-level predictor. What can the compiler do to increase the number of branches that fall into this category?

Problem 1f[3pts]: What is the difference between implicit and explicit register renaming? How is each implemented?

Problem 1g[2pts]: Why are vector processors more power efficient than superscalar processors when executing applications with a lot of data-level parallelism? Explain.

Problem #2: Superpipelining [21 pts]

Suppose that we have a single-issue, in-order pipeline with one fetch stage, one decode stage, multiple execution stages (which include memory access), and a single write-back stage. Assume that it has the following execution latencies (i.e. the number of stages that it takes to compute a value): multf (4 cycles), addf (3 cycles), divf (6 cycles), integer ops (1 cycle). Assume full bypassing and two cycles to perform memory accesses, i.e. loads and stores take a total of 3 cycles to execute (including address computation). Finally, branch conditions are computed by the first execution stage (integer execution unit).

Problem 2a[10pts]: Assume that this pipeline consists of a single linear sequence of stages in which later stages serve as no-ops for shorter operations. You should do the following on your diagram:

1. Draw each stage of the pipeline as a box and name each of the stages. Stages may serve multiple functions, e.g. an execute stage + memory op. You will have a total of 9 stages.
2. Describe what is computed in each stage (e.g. EX1: integer ops, address compute, first stage of multf/addf/divf).
3. Show all of the bypass paths (as arrows between stages). Your goal is to design a pipeline which never stalls unless a value is not ready. Label each of these arrows with the types of instructions that will forward their results along these paths (i.e. use M for multf, D for divf, A for addf, I for integer operations, Ld for loads).

[Hint: be careful to optimize for information feeding into store instructions!]

Problem 2b[3pts]: How many extra instructions are required between each of these instruction combinations to avoid stalls? (Assume that the second instruction uses a value from the first.) Be careful!

Between a divf and a store:
Between a load and a multf:
Between two integer instructions:
Between a multf and an addf:
Between an addf and a divf:
Between an integer op and a store:

Problem 2c[2pts]: How many branch delay slots does this machine have? Explain.

Problem 2d[2pts]: Could branch prediction increase the performance of this pipeline? Why or why not?

Problem 2e[2pts]: In the 5-stage pipeline that we discussed in class, a load into a register followed by an immediate store of that register to memory would not require any stalls, i.e. the following sequence could run without stalls:

    lw r4, 0(r2)
    sw r4, 0(r3)

Explain why this was true for the 5-stage pipeline.

Problem 2f[2pts]: Is this still true for your superpipelined processor? Explain.

Problem 3: Tomasulo Architecture [20 pts]

Problem 3a[5pts]: Consider a Tomasulo architecture with a reorder buffer. This architecture replaces the normal 5 stages of execution with 5 new stages: Fetch, Issue, Execute, Writeback, and Commit. Explain what happens to an instruction in each of them (be as complete as you can):

a) Fetch:
b) Issue:
c) Execute:
d) Writeback:
e) Commit:

Problem 3b[3pts]: Name each of the three types of data hazards and explain how the Tomasulo architecture removes them:

Problem 3c[3pts]: Name three structural hazards that this architecture exhibits. Explain your answer.
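(Reference, not part of the exam: a minimal Python sketch of the bookkeeping state such a machine keeps per reservation station and per reorder-buffer entry. The field names follow common textbook usage but are my own illustrative choices, not specified by the exam.)

    # Illustrative per-entry state for a Tomasulo machine with a reorder
    # buffer; field names are conventional, not specified by the exam.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ReservationStation:
        op: Optional[str] = None    # operation to perform (e.g. "addf")
        vj: Optional[float] = None  # first operand value, once available
        vk: Optional[float] = None  # second operand value, once available
        qj: Optional[int] = None    # ROB tag that will produce vj (None = ready)
        qk: Optional[int] = None    # ROB tag that will produce vk (None = ready)
        dest: Optional[int] = None  # ROB entry allocated for the result
        busy: bool = False

    @dataclass
    class ROBEntry:
        dest_reg: Optional[str] = None  # architectural register written at commit
        value: Optional[float] = None   # result, filled in at writeback
        done: bool = False              # set once the result is broadcast on the CDB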

Problem 3d[2pts]: Assume that you have a long chain of dependent instructions, such as the following:

    add $r1, $r2, $r3
    add $r3, $r1, $r4
    add $r7, $r3, $r5

Also assume that the integer execution unit takes one cycle for adds. What CPI would you achieve for this sequence with the basic Tomasulo architecture, assuming that each of the stages from (3a) is non-overlapped and takes a complete cycle?

Problem 3e[2pts]: Assume that associative matching on the CDB is a slow enough operation that it takes much of a cycle. How can you still get a throughput of one instruction per cycle for long dependent chains of operations such as those given in (3d)? Only well-thought-out answers will get credit.

Problem 3f[2pts]: The Tomasulo algorithm has one interesting bug in it. Consider the situation where one instruction uses a value from another. Suppose the first instruction is issued on the same cycle as the one that it depends on is in writeback:

    add $r1, $r2, $r3    ; the result is broadcast...
    add $r4, $r1, $r1    ; ...while this one is being issued

What is the problem? Can you fix it easily?

Problem 3g[3pts]: What changes would you have to make to the basic Tomasulo architecture (with reorder buffer) to enable it to average a CPI of 0.33?

Problem #4: Fixing the loops [21 pts]

For this problem, assume that we have a superpipelined architecture like that in problem (2) with the following use latencies (these are not the right answers for problem #2b!):

    Between a multf and an addf:   3 insts
    Between a load and a multf:    2 insts
    Between an addf and a divf:    1 inst
    Between a divf and a store:    6 insts
    Between an int op and a store: 0 insts
    Number of branch delay slots:  1

Consider the following loop, which performs a restricted rotation and projection operation. In this code, F0 and F1 contain sin(θ) and cos(θ) for rotation. The array based at register r1 contains pairs of single-precision (32-bit) values which represent x,y coordinates. The array based at register r2 receives a projected coordinate along the observer's horizontal direction:

    project:
        ldf   F3, 0(r1)
        multf F10, F3, F0
        ldf   F4, 4(r1)
        multf F11, F4, F1
        addf  F12, F10, F11
        divf  F13, F12, F2
        stf   0(r2), F13
        addi  r1, r1, #8
        addi  r2, r2, #4
        subi  r3, r3, #1
        bneq  r3, r0, project
        nop

Problem 4a[2pts]: How many cycles does this loop take per iteration? Indicate stalls in the above code by labeling each of them with a number of cycles of stall.

Problem 4b[4pts]: Reschedule this code to run with as few cycles per iteration as possible. Do not unroll it or software-pipeline it. How many cycles do you get per iteration of the loop now?
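(Reference, not part of the exam: a rough Python paraphrase of what the project loop computes, which can serve as a correctness check when rescheduling. The function and parameter names are illustrative; F0 and F1 hold sin(θ) and cos(θ) as stated, and F2 is a divisor whose meaning the exam leaves unspecified.)

    # Rough Python paraphrase of the "project:" loop (illustrative only).
    # sin_t and cos_t correspond to F0 and F1; divisor corresponds to F2.
    def project(xy_pairs, sin_t, cos_t, divisor):
        out = []
        for x, y in xy_pairs:                # ldf F3 / ldf F4: load x and y
            rotated = x * sin_t + y * cos_t  # multf, multf, addf
            out.append(rotated / divisor)    # divf, then stf stores the result
        return out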

Problem 4c[6pts]: Unroll the loop once and schedule it to run with as few cycles as possible per iteration of the original loop. How many cycles do you get per iteration now?

Problem 4e[3pts]: Your loop in (4c) will not run without stalls. Without going to the trouble of unrolling further, what is the minimum number of times that you would have to unroll this loop to avoid stalls? How many cycles would you get per iteration then?

Problem 4f[6pts]: Rewrite your code to utilize vector instructions and to run as fast as possible. Assume that the value in r3 is the vector length. Make sure to comment each instruction to say what it is doing. Assume full chaining, one-instruction-per-cycle issue, and delays for instructions/memory that are the same as on the non-vector processor. How long does this code take to execute? (You can use the original value of r3 in your expression.)

Problem 4g: [Extra Credit: 5pts] Assume that you have a Tomasulo architecture with functional units of the same execution latency (number of cycles) as our deeply pipelined processor (be careful to adjust use latencies to get the number of execution cycles!). Assume that it issues one instruction per cycle and has an unpipelined divider with a small number of reservation stations. Suppose the other functional units are duplicated, with many reservation stations, and that there are many CDBs. What is the minimum number of divide reservation stations needed to achieve one instruction per cycle with the optimized code of (4b)? Show your work. [Hint: assume that the maximum issue rate is sustained and look at the scheduling of a single iteration.]

Problem 5: Prediction [24 pts]

In this question, you will examine several different schemes for branch prediction, using the following code sequence for a MIPS-like ISA with no branch delay slots:

            addi r2, r0, #45     ; initialize r2 to 101101, binary
            addi r3, r0, #6      ; initialize r3 to 6, decimal
            addi r4, r0, #10000  ; initialize r4 to a big number
    top:    andi r1, r2, #1      ; extract the low-order bit of r2
    PC1-->  bnez r1, skip1       ; branch if the bit is set
            xor  r0, r0, r0      ; dummy instruction
    skip1:  srli r2, r2, #1      ; shift the pattern in r2
            subi r3, r3, #1      ; decrement r3
    PC2-->  bnez r3, skip2
            addi r2, r0, #45     ; reinitialize pattern
            addi r3, r0, #6
    skip2:  subi r4, r4, #1      ; decrement loop counter
    PC3-->  bnez r4, top

This sequence contains 3 branches, labeled PC1, PC2, and PC3.

Problem 5a[2pts]: Sketch a basic PAg predictor that might be used for prediction. Assume that we will be tracking 3 bits of history.

Problem 5b[2pts]: What is the minimum range of instruction address bits required to index the branch history table of your PAg predictor in order to avoid aliasing between PC1, PC2, and PC3? How many entries does this correspond to?

Problem 5c[6pts]: The following are the steady-state taken/not-taken patterns for each of the three branches (T indicates taken, N indicates not taken):

    PC1: TTNTNT TTNTNT ...
    PC2: NTNTNT NTNTNT ...
    PC3: TTTTTN TTTTTN ...

Using the PAg predictor of 5a and assuming no aliasing (i.e. a correct answer to 5b), what is the steady-state prediction success rate (that is, the ratio of correctly predicted branches to total branches) for each branch? Assume that all 2-bit predictors are initialized to zero. Hint: draw a table representing the values (T or N) fed to each entry of the pattern history table. Once you have a repeating stream for each entry, you will know how each 2-bit counter predicts.
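(Reference, not part of the exam: the 2-bit predictors in the hint are saturating counters. A minimal Python sketch under the usual convention that states 0-1 predict not-taken and states 2-3 predict taken, reading "initialized to zero" as state 0.)

    # Minimal 2-bit saturating counter, one per pattern-history-table entry.
    # States 0-1 predict not-taken, states 2-3 predict taken; initialized
    # to 0 ("strongly not-taken"), matching the problem statement.
    class TwoBitCounter:
        def __init__(self):
            self.state = 0

        def predict(self):
            return self.state >= 2  # True = predict taken

        def update(self, taken):
            if taken:
                self.state = min(3, self.state + 1)
            else:
                self.state = max(0, self.state - 1)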

Problem 5d[2pts]: Can you make a simple argument for why a version of PAg with 6 bits of history would have 100% prediction accuracy for this set of branch patterns?

Problem 5e[4pts]: Draw the following global predictors: GAg, GShare, GAs. What is the reason for using a GShare or GAs predictor instead of a GAg predictor?
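(Reference, not part of the exam: these three predictors differ only in how the pattern history table of 2-bit counters is indexed. A small Python sketch; the history width, PC bit count, and table sizes are arbitrary illustrative choices, not values from the exam.)

    # How GAg, GShare, and GAs index their pattern history table (PHT).
    # HISTORY_BITS and PC_BITS are arbitrary illustrative choices.
    HISTORY_BITS = 6
    PC_BITS = 2
    H_MASK = (1 << HISTORY_BITS) - 1

    def gag_index(history, pc):
        # GAg: global history alone; the PC is ignored, so branches with
        # identical histories share (and can interfere in) one counter.
        return history & H_MASK

    def gshare_index(history, pc):
        # GShare: XOR PC bits into the history, spreading branches with
        # identical histories across different entries of the same table.
        return (history ^ (pc >> 2)) & H_MASK

    def gas_index(history, pc):
        # GAs: select a per-set PHT with PC bits, then index it with the
        # global history (table has 2**(PC_BITS + HISTORY_BITS) entries).
        set_bits = (pc >> 2) & ((1 << PC_BITS) - 1)
        return (set_bits << HISTORY_BITS) | (history & H_MASK)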

Problem 5f[4pts]: What is the simplest type of predictor that can predict the following sequence of data values without errors after some startup period?

    1 4 7 10 13 16 19 22

Draw a hardware diagram for it. How many data values must it see before it starts predicting correctly?

Problem 5g[4pts]: What is the simplest type of predictor that can predict the following sequence of data values without errors after some startup period?

    1 3 3 7 10 1 3 3 7 10 1 3 3 7 10

Draw a hardware diagram for it. How many data values must it see before it starts predicting correctly?
