CS146 Computer Architecture, Fall 2002 Midterm Exam


This exam is worth a total of 100 points. Note the point breakdown below and budget your time wisely. To maximize partial credit, show your work and state any assumptions explicitly.

Name: ____________________

1.    ____ / 35
2.    ____ / 20
3.    ____ / 10
4.    ____ / 12
5.    ____ / 12
6.    ____ / 10
Total ____ / 100

1. Multiple Choice and Short Answer (35 pts)

1.1 (10 points) Amdahl's Law

a. (5 points) Consider a workload where 50% of the execution time consists of multimedia processing for which the MMX instruction set extensions might be helpful. According to Amdahl's law, what is the maximum speedup that can be achieved by implementing them?

b. (5 points) Now, say that you work at Intel and the MMX designers claim that multimedia code sequences will see a 3.5 times (3.5x) speedup by using the MMX extensions. What fraction of the execution time must be multimedia code in order to achieve an overall speedup of 1.8x?
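(For orientation, not the official answer key: Amdahl's law states Speedup = 1 / ((1 - f) + f/S), where f is the fraction of execution time the enhancement applies to and S is the speedup of that fraction. Part (a) is the limiting case S -> infinity with f = 0.5, so the speedup is bounded by 1 / (1 - 0.5) = 2. Part (b) solves 1.8 = 1 / ((1 - f) + f/3.5) for f, which rearranges to f = (1 - 1/1.8) / (1 - 1/3.5), roughly 0.62.)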

1.2 (9 points) CISC vs. RISC
When pipelined microprocessors were first becoming common (early-to-mid 1980s), designers believed that RISC instruction sets were easier to pipeline. Why? (Please give 3 reasons.)

1.3 (5 points) In spite of this, high-performance pipelined implementations of CISC instruction sets have been successfully built; various VAX and x86 implementations are examples, including Intel's P6 microarchitecture discussed in class. The most effective/common implementation strategy used by these machines has been to:

a. Pipeline the CISC instructions, despite their wide variability in instruction execution times, and use elaborate memory disambiguation techniques to avoid stalls due to memory address calculations.
b. Pipeline the CISC instructions and handle the extra structural hazards using aggressive scoreboarding techniques.
c. Use link-time techniques to break CISC instructions into small RISC-like operations that are more easily pipelined.
d. Use run-time techniques to break CISC instructions into easily-pipelined RISC-like operations.

1.4 (6 points) Pipelining Limits
We have seen how pipelining improves instruction throughput, increasing effective performance. Machines with deeper pipelines perform less work per pipestage but have more in-flight instructions processing at the same time, allowing instructions to complete at a higher rate. In class we discussed several reasons why the effectiveness of deeply pipelined machines can be limited: too much pipelining can be detrimental. Describe two reasons here.

1.5 (6 points) Limits of Loop Unrolling
We have seen how loop unrolling can significantly improve performance by removing loop overhead and providing a better opportunity for the compiler to generate an efficient static schedule. In class we discussed several reasons why loop unrolling cannot be performed indefinitely: too much loop unrolling can limit performance. Describe two reasons here.
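(To make the unrolling trade-off in 1.5 concrete, here is a minimal C sketch; the function and array names are illustrative, and the trip count is assumed to be a multiple of 4. Unrolling by 4 amortizes the index update and branch over four elements and exposes independent operations to the scheduler, but each further unroll grows code size and the number of live registers.)

#include <stddef.h>

/* Original loop: one multiply, plus index update and branch, per element. */
void scale_ref(float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] * 2.0f;
}

/* Unrolled by 4: loop overhead paid once per four elements; the four
 * independent multiplies can be scheduled freely, at the cost of larger
 * code and more registers holding live values. */
void scale_unrolled(float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        a[i]     = b[i]     * 2.0f;
        a[i + 1] = b[i + 1] * 2.0f;
        a[i + 2] = b[i + 2] * 2.0f;
        a[i + 3] = b[i + 3] * 2.0f;
    }
}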

2. Pipelining (20 Points)

For this question, consider the code segment below. Assume that full bypassing/forwarding has been implemented. Assume that the initial value of register R23 is much bigger than the initial value of register R20. Assume that all memory references hit in the caches and TLBs. Assume that both load-use hazards and branch hazards are handled with architecturally visible delay slots. You may not reorder instructions to fill such slots, but if a subsequent instruction is independent and is properly positioned, you may assume that it fills the slot. Otherwise, fill slots with additional noops as needed.

LOOP: lw    R10, X(R20)     ; R10 = Mem[X + R20]
      lw    R11, Y(R20)     ; R11 = Mem[Y + R20]
      subu  R10, R10, R11   ; R10 = R10 - R11
      sw    Z(R20), R10     ; Mem[Z + R20] = R10
      addiu R20, R20, 4     ; advance the index
      subu  R5, R23, R20    ; R5 = R23 - R20 (loop test)
      bnez  R5, LOOP        ; loop while R5 != 0
      nop                   ; 1 delay slot

a. (5 points) On the grid page at the end of the exam, draw a pipeline diagram of 2 iterations of its execution on a standard 5-stage MIPS pipeline. (You may want to turn it horizontally.) Assume that the branch is resolved using an ID control point. In the box below, write the total number of cycles required to complete 2 iterations of the loop.

Cycles = ______
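(A useful check when counting cycles on the grid: a k-stage pipeline completes N instructions, counting noops, in k + (N - 1) + S cycles, where S is the number of stall cycles; the first instruction needs k cycles and each later one finishes one cycle after its predecessor, plus any stalls.)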

b. (15 points) On the second grid page that follows, draw a pipeline diagram of 2 iterations of the loop on the pipeline below. Note that the loop as shown includes a single nop in the branch delay slot; you may need to add more. You cannot assume anything about the program's register usage before or after this code segment. Fill in the boxes below.

Pipeline Branch Delay = ______
Pipeline Load Delay = ______
Cycles = ______

Pipeline stages (the original diagram shows six overlapping instructions flowing through these stages):

IF1  IF2  ID  RF  EX1  EX2  M1  M2  WB

IF1: Begin instruction fetch.
IF2: Complete instruction fetch.
ID:  Instruction decode.
RF:  Register fetch.
EX1: ALU operation execution begins. Branch target calculation finishes. Memory address calculation. Branch condition resolution calculation begins.
EX2: Branch condition resolution finishes. ALU ops finish. (But branch and memory address calculations finish in a single cycle.)
M1:  First part of memory access; TLB access.
M2:  Second part of memory access; data sent to memory for stores or returned from memory for loads.
WB:  Write results back to the register file.
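(One common way to reason about the boxes above, assuming full bypassing: the load delay is the number of cycles between the stage where a load's data becomes available (end of M2 here) and the stage where a dependent instruction first consumes its operands (start of EX1 here); the branch delay is the number of instructions fetched between the branch and the first instruction fetched after its outcome is known.)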

3. Multimedia ISAs and Conditional MOVs (10 Points)

Absolute value is expressed as A = abs(B). In high-level code:

if (B < 0) { A = -B; } else { A = B; }

In MIPS-style code this would look something like the following (R2 = B, R1 = A):

      BLTZ R2, THEN   ; if R2 < 0, jump to THEN
      ADDI R1, R2, 0  ; R1 = R2 + 0 (else clause)
      JUMP END        ; skip over the then clause
THEN: SUBI R1, 0, R2  ; R1 = 0 - R2 (then clause)
END:

In class, we have seen that conditional branches are detrimental to performance, and we have seen two methods to remove conditional branches.

a. (5 points) Using the saturating arithmetic features of a multimedia ISA, code the absolute value function without using any branch or jump instructions. You can perform the computation with the following six instructions (you may not need all of them). You can perform the absolute value operation on subwords (i.e., don't worry about shifting or extracting).

HADD, HADD,us, HADD,ss, HSUB, HSUB,us, HSUB,ss

Here HADD,us uses unsigned saturating arithmetic and HADD,ss uses signed saturating arithmetic.
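(As a hint at the flavor of the trick, here is the related standard idiom, not necessarily the graded answer: with unsigned saturating subtraction, one of the two differences a - b and b - a always clamps to zero, so their sum is the absolute difference, with no branches. A C sketch, where the ternary merely models what HSUB,us does in one instruction:)

#include <stdint.h>

/* Models an unsigned saturating subtract: clamps at zero instead of wrapping. */
static uint16_t sub_us(uint16_t a, uint16_t b) {
    return a > b ? (uint16_t)(a - b) : 0;
}

/* Branch-free absolute difference: one of the two terms is always zero. */
uint16_t abs_diff(uint16_t a, uint16_t b) {
    return (uint16_t)(sub_us(a, b) + sub_us(b, a));
}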

b. (5 points) Using conditional move operations, code the absolute value function without using any branch or jump instructions. In this problem just perform the absolute value operation on a 64-bit integer value. You may use CMOVs of the form:

CMOVGTZ R1, R2, R3   // if (R1 > 0) R2 = R3
CMOVLTZ R1, R2, R3   // if (R1 < 0) R2 = R3
CMOVEQZ R1, R2, R3   // if (R1 == 0) R2 = R3
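(The conditional-move pattern can be checked against this C sketch; the register names in the comments follow the problem statement, and the ternary is exactly what a CMOV implements without a branch:)

#include <stdint.h>

/* abs(B) with a conditional move: compute both B and -B, then select. */
int64_t abs_cmov(int64_t b) {
    int64_t a   = b;           /* ADDI R1, R2, 0  : A = B              */
    int64_t neg = -b;          /* SUBI R3, 0, R2  : R3 = -B            */
    a = (b < 0) ? neg : a;     /* CMOVLTZ R2, R1, R3 : if (B<0) A = -B */
    return a;
}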

4. Branch Prediction (12 Points)

The following series of branch outcomes occurs for a single branch in a program (T means the branch is taken, N means the branch is not taken):

T T T N T N T T T N T N T

a. (4 points) Assume that we are trying to predict this sequence with a BHT using a 1-bit counter. The counters of the BHT are initialized to the N state. Which of the branches would be mispredicted? Use the following table. You may assume that this is the only branch in the program.

Predictor state before prediction:  N  _  _  _  _  _  _  _  _  _  _  _  _
Branch outcome:                     T  T  T  N  T  N  T  T  T  N  T  N  T
Misprediction?                      _  _  _  _  _  _  _  _  _  _  _  _  _

b. (8 points) Draw the state-transition diagram for a BHT scheme using 2-bit saturating counters. Then repeat the exercise from part (a) with a 2-bit saturating counter initialized to Weakly-Not-Taken.

Predictor state before prediction:  W-N  _  _  _  _  _  _  _  _  _  _  _  _
Branch outcome:                     T    T  T  N  T  N  T  T  T  N  T  N  T
Misprediction?                      _    _  _  _  _  _  _  _  _  _  _  _  _
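(A quick way to check both tables is to simulate the counters. This sketch assumes the usual conventions: a 1-bit counter predicts its last outcome, and a 2-bit saturating counter counts 0..3, predicts taken when the count is 2 or more, and encodes weakly-not-taken as 1.)

#include <stdio.h>

/* Simulate a saturating-counter predictor over an outcome string.
 * States run 0..max; predict taken when ctr >= threshold.
 * Returns the number of mispredictions. */
static int simulate(const char *outcomes, int ctr, int max, int threshold) {
    int miss = 0;
    for (const char *p = outcomes; *p; p++) {
        int taken = (*p == 'T');
        if ((ctr >= threshold) != taken) miss++;
        if (taken) { if (ctr < max) ctr++; }
        else       { if (ctr > 0)   ctr--; }
    }
    return miss;
}

int main(void) {
    const char *seq = "TTTNTNTTTNTNT";
    /* 1-bit counter: states 0 (N) and 1 (T), initialized to N. */
    printf("1-bit, init N:   %d mispredictions\n", simulate(seq, 0, 1, 1));
    /* 2-bit saturating counter: states 0..3, initialized weakly-not-taken. */
    printf("2-bit, init W-N: %d mispredictions\n", simulate(seq, 1, 3, 2));
    return 0;
}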

5. Tomasulo's Algorithm (12 pts)

The drawing below depicts the basic structure of Tomasulo's algorithm as described in the textbook and during class. This is the version without a reorder buffer; all renaming occurs in the reservation stations.

[Figure: the classic Tomasulo datapath. A floating point operation queue and the FP registers feed three FP adder reservation stations (each with Op, Tag, Tag, and value fields) and three FP multiplier reservation stations; six load buffers connect memory to the datapath; the FP adder and FP multiplier drive the Common Data Bus, which broadcasts results back to the reservation stations. Labels A and B mark fields/buses at the adder reservation stations, C marks the memory data path, and D marks the Common Data Bus.]

a. Where indicated by the letters A, B, C, and D, determine the width of the fields in the reservation stations, the width of the buses, etc. Please write your answers in the box below.

A = ______    B = ______    C = ______    D = ______

b. Now consider the three reservation stations associated with the FP adder. How many (and what size) comparators are needed in order to determine when relevant values are being broadcast on the Common Data Bus? Why?
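(To make the tag matching in part (b) concrete, here is a small C sketch of a reservation-station entry snooping the Common Data Bus. The field names follow the textbook's Qj/Qk/Vj/Vk convention; the 8-bit tag type and the 0-means-ready encoding are illustrative assumptions.)

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool    busy;
    uint8_t op;
    uint8_t qj, qk;   /* source operand tags; 0 means the value is present */
    double  vj, vk;   /* source operand values once captured */
} RSEntry;

/* Each CDB broadcast carries a tag and a value. Every outstanding tag field
 * compares itself against the broadcast tag -- in hardware, one tag-width
 * comparator per tag field, so two per station for a two-operand FP adder. */
void cdb_snoop(RSEntry rs[], int n, uint8_t cdb_tag, double cdb_value) {
    for (int i = 0; i < n; i++) {
        if (!rs[i].busy) continue;
        if (rs[i].qj == cdb_tag) { rs[i].vj = cdb_value; rs[i].qj = 0; }
        if (rs[i].qk == cdb_tag) { rs[i].vk = cdb_value; rs[i].qk = 0; }
    }
}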

6. Superscalar Microarchitectures (10 Points)

[Pipeline diagram: all instructions flow through Fetch, Transit, Map, and Queue stages, with the shaded Map/Queue region indicating where instruction queueing/reordering may occur. Integer ops then proceed through Register, Execute, Write (top row); memory ops through Register, Address, Cache1, Cache2, Write (second row); and floating-point ops through Register, FP1, FP2, FP3, FP4, Write (third row).]

The Alpha 21264 processor is designed to issue 4 integer (2 of which may be load/store) and 2 FP instructions per cycle; its pipeline diagram is shown above. All instructions execute in the first four stages, and then the pipeline differs for integer ops (top row), memory ops (second row), and floating-point ops (third row).

a. (5 points) Since the Alpha instruction set is similar to MIPS (RISC, load/store, etc.), how many register read and write ports would one expect to need, to avoid structural hazards, in a straightforward implementation of the integer register file?

b. (5 points) Since that number of ports is too difficult to implement, the chip designers used a trick instead. They divided the physical registers of the machine into two clusters. A group of functional units is associated with each cluster. Values written to registers in one cluster will eventually propagate to the other cluster, but there will be extra delay. In words, explain how this implementation choice affects both the machine's instruction dispatch/issue unit and compiler strategies for this chip.
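(A sanity check for part (a): a RISC integer instruction reads at most two registers and writes at most one, so sustaining 4 integer instructions per cycle naively calls for on the order of 4 x 2 = 8 read ports and 4 write ports on the integer register file; exactly how loads and stores are counted can shift the totals slightly.)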