University of California, Berkeley
College of Engineering, Computer Science Division, EECS
Spring 1998    D.A. Patterson

Quiz 1, March 4, 1998
CS252 Graduate Computer Architecture

You are allowed to use a calculator and one 8.5" by 11" double-sided page of notes. Show your work on all problems. Good luck!

Your name:            SID number:            Email address:

Question 1: /12    Question 2: /20    Question 3: /24    Total: /56

Question 1: Calculate Your Cache

A certain system with a 350 MHz clock uses separate data and instruction caches and a unified second-level cache. The first-level data cache is a direct-mapped, write-through, write-allocate cache with 8 KBytes of data total and 8-Byte blocks, and has a perfect write buffer (never causes any stalls). The first-level instruction cache is a direct-mapped cache with 4 KBytes of data total and 8-Byte blocks. The second-level cache is a two-way set-associative, write-back, write-allocate cache with 2 MBytes of data total and 32-Byte blocks.

The first-level instruction cache has a miss rate of 2%. The first-level data cache has a miss rate of 15%. The unified second-level cache has a local miss rate of 10%. Assume that 40% of all instructions are data memory accesses; 60% of those are loads, and 40% are stores. Assume that 50% of the blocks in the second-level cache are dirty at any time. Assume that there is no optimization for fast reads on an L1 or L2 cache miss. All first-level cache hits cause no stalls. The second-level hit time is 10 cycles. (That means that the L1 miss penalty, assuming a hit in the L2 cache, is 10 cycles.) Main memory access time is 100 cycles to the first bus width of data; after that, the memory system can deliver consecutive bus widths of data on each following cycle. Outstanding non-consecutive memory requests cannot overlap; an access to one memory location must complete before an access to another memory location can begin. There is a 128-bit bus from memory to the L2 cache, and a 64-bit bus from both L1 caches to the L2 cache. Assume a perfect TLB for this problem (never causes any stalls).

a) What percent of all data memory references cause a main memory access (main memory is accessed before the memory request is satisfied)? First show the equation, then the numeric result.

b) How many bits are used to index each of the caches? Assume the caches are presented physical addresses.
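For part (b), the index width follows from the number of sets: a cache with C bytes of data, B-Byte blocks, and associativity A has C/(B×A) sets, and hence log2(C/(B×A)) index bits. A minimal sketch of that calculation in Python (illustrative only, not an official solution; the helper name index_bits is mine, and the parameters are taken from the problem statement above):

    from math import log2

    def index_bits(cache_bytes, block_bytes, ways):
        # Index bits = log2(number of sets)
        #            = log2(capacity / (block size * associativity)).
        sets = cache_bytes // (block_bytes * ways)
        return int(log2(sets))

    print(index_bits(8 * 1024, 8, 1))          # L1 data (direct-mapped)
    print(index_bits(4 * 1024, 8, 1))          # L1 instruction (direct-mapped)
    print(index_bits(2 * 1024 * 1024, 32, 2))  # L2 (two-way set associative)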

Question 1 (continued)

c) How many cycles can the longest possible data memory access take? Describe (briefly) the events that occur during this access.

d) What is the average memory access time in cycles (including instruction and data memory references)? First show the equation, then the numeric result.
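For part (d), the standard two-level decomposition of average memory access time is the natural starting point (a setup aid only, not a worked answer; note that the memory miss penalty must also cover the dirty-block write-backs described above):

    AMAT = HitTime_L1 + MissRate_L1 × (HitTime_L2 + MissRate_L2 × MissPenalty_memory)

The overall average weights instruction and data references by their frequencies: every instruction makes one instruction fetch, and 40% of instructions add one data reference.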

Question 2: Tomasulo's Revenge

Using the DLX code shown below, show the state of the reservation stations, reorder buffers, and FP register status for a speculative processor implementing Tomasulo's algorithm as presented in the book. Assume the following:

- Only one instruction can issue per cycle.
- The reorder buffer has 8 slots.
- The reorder buffer implements the functionality of the load buffers and store buffers.
- All FUs are fully pipelined.
- There are 2 FP multiply reservation stations.
- There are 3 FP add reservation stations.
- There are 3 integer reservation stations, which also execute load and store instructions.
- No exceptions occur during the execution of this code.
- All integer operations require 1 execution cycle. Memory requests occur and complete in this cycle. (For this problem, assume that, barring structural hazards, loads issue in one cycle, execute in the next, write in the third, and a dependent instruction can start execution on the fourth.)
- All FP multiply operations require 4 execution cycles.
- All FP addition operations require 2 execution cycles.
- On a CDB write conflict, the instruction issued earlier gets priority.
- Execution for a dependent instruction can begin on the cycle after its operand is broadcast on the CDB.
- If any item changes from "Busy" to "Not Busy", you should update the "Busy" column to reflect this, but you should not erase any other information in the row (unless another instruction then overwrites that information).
- Assume that all reservation stations, reorder buffers, and functional units were empty and not busy when the code shown below began execution.
- The "Value" column gets updated when the value is broadcast on the CDB.

Integer registers are not shown, and you do not have to show their state. (A small sketch of the per-entry bookkeeping these tables track follows below.)
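As a reading aid for the tables that follow, here is a minimal sketch of the per-entry state a speculative Tomasulo machine tracks, using the same field names as the tables below (an illustration of the bookkeeping only; the class names and types are mine, not part of the exam):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ReservationStation:
        busy: bool = False
        op: Optional[str] = None    # e.g. "ADDD", "MULTD", "LD", "SUBI"
        Vj: Optional[float] = None  # value of operand j, once available
        Vk: Optional[float] = None  # value of operand k, once available
        Qj: Optional[int] = None    # ROB entry that will produce operand j
        Qk: Optional[int] = None    # ROB entry that will produce operand k
        dest: Optional[int] = None  # ROB entry this instruction will write

    @dataclass
    class ROBEntry:
        busy: bool = False
        instruction: str = ""
        state: str = ""                # Issue, Execute, Write, or Commit
        destination: str = ""          # architectural register updated at commit
        value: Optional[float] = None  # filled when broadcast on the CDB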

Question 2 (continued)

a) The tables below show the state after the cycle in which the second SUBI from the code below issued. Show the state after the next cycle.

Lp: LD    F0, 0(R1)
    LD    F2, 0(R2)
    MULTD F4, F0, F2
    ADDD  F6, F0, F0
    SUBI  R1, R1, 8
    SUBI  R2, R2, 8
    ADDI  R3, R3, 1

Reservation stations

Name   Busy  Op     Vj   Vk   Qj   Qk   Dest
Add1   Y     ADDD   F0   F0             #4
Add2
Add3
Mult1  Y     MULTD  F0   F2             #3
Mult2
Int1   Y     SUBI   R2   8              #6
Int2   N     LD     R2   0              #2
Int3   Y     SUBI   R1   8              #5

Reorder buffer

Entry  Busy  Instruction        State    Destination  Value
1      N     LD F0, 0(R1)       Commit   F0           Mem[0(R1)]
2      N     LD F2, 0(R2)       Commit   F2           Mem[0(R2)]
3      Y     MULTD F4, F0, F2   Execute  F4
4      Y     ADDD F6, F0, F0    Execute  F6
5      Y     SUBI R1, R1, 8     Execute  R1
6      Y     SUBI R2, R2, 8     Issue    R2
7
8

FP register status

Field      F0   F2   F4   F6   F8   F10  F12  ...  F30
Reorder #  1    2    3    4
Busy       N    N    Y    Y

Question 2 (continued)

b) The tables below show the state during the cycle in which the second MULTD from the code below issued. Show the state after the next two cycles.

MULTD F0, F2, F4
ADDD  F6, F6, F0
ADDD  F2, F2, F8
LD    F4, 0(R2)
ADDI  R1, R1, 8
MULTD F8, F10, F12
ADDD  F4, F4, F10
ADDI  R2, R2, 1

Reservation stations

Name   Busy  Op     Vj   Vk   Qj   Qk   Dest
Add1   Y     ADDD   F6             #1   #2
Add2   Y     ADDD   F2   F8             #3
Add3
Mult1  N     MULTD  F2   F4             #1
Mult2  Y     MULTD  F10  F12            #6
Int1   Y     LD     R2   0              #4
Int2   Y     ADDI   R1   8              #5
Int3

Reorder buffer

Entry  Busy  Instruction         State    Destination  Value
1      Y     MULTD F0, F2, F4    Write    F0           F2 × F4
2      Y     ADDD F6, F6, F0     Issue    F6
3      Y     ADDD F2, F2, F8     Execute  F2
4      Y     LD F4, 0(R2)        Execute  F4
5      Y     ADDI R1, R1, 8      Execute  R1
6      Y     MULTD F8, F10, F12  Issue    F8
7
8

FP register status

Field      F0   F2   F4   F6   F8   F10  F12  ...  F30
Reorder #  1    3    4    2    6
Busy       N    Y    Y    Y    Y

Question 2 (continued)

c) The tables below show the state after the cycle in which the SUBD from the code below issued. Show the state after the next four cycles.

ADDD  F4, F0, F0
SUBD  F4, F4, F2
ADDI  R2, R2, 1
ADDI  R3, R3, 1
ADDD  F2, F6, F8
MULTD F0, F6, F8

Reservation stations

Name   Busy  Op    Vj   Vk   Qj   Qk   Dest
Add1   Y     ADDD  F0   F0             #1
Add2   Y     SUBD       F2   #1        #2
Add3
Mult1
Mult2
Int1
Int2
Int3

Reorder buffer

Entry  Busy  Instruction       State    Destination  Value
1      Y     ADDD F4, F0, F0   Execute  F4
2      Y     SUBD F4, F4, F2   Issue    F4
3
4
5
6
7
8

FP register status

Field      F0   F2   F4   F6   F8   F10  F12  ...  F30
Reorder #            2
Busy                 Y

Question 3: Vector vs. DSP Showdown

Examine the two architectures below.

The first architecture is a 25 MHz 3-stage DSP processor. A block diagram showing some of the fully-bypassed datapath is shown below. The three stages are fetch, decode (where branches are evaluated and the PC updated), and execute (where memory and register writes also occur).

[Figure 1: The DSP block diagram — registers W, X, Y, Z; control; instruction RAM; multiplier; ALU; shifter with shift amount S; accumulator]

The processor is able to multiply, accumulate, and shift during its execute stage. It has the same load, store, and branch instructions as DLX. It also includes a LT instruction, which loads a value into a register from memory and decrements the base register to the next element. The arithmetic operations are slightly different:

- Register 0 always contains the value 0.
- Register 1 always contains the value 1.
- The result from the shifter is always written to the accumulator on arithmetic operations.
- Operations can be specified as MAC W, X, Y, Z, S, where W is the register to be written; X and Y are registers that go to the multiplier; Z is the register that goes to the ALU; and S specifies the amount to right-shift the result.
- Operations can also be specified as MACA W, X, Y, S, where W is the register to be written; X and Y are registers that go to the multiplier; the accumulator goes to the ALU; and S specifies the amount to right-shift the result.

(A small executable model of these MAC/MACA semantics follows below.)

The second architecture is a 100 MHz vector processor with an MVL of 64 elements. It has one FP add/subtract FU, one FP multiply/divide FU, and a single memory FU. The startup overhead is 5 cycles for add, subtract, multiply, and divide instructions, and 10 cycles for memory instructions. It supports flexible chaining but not tailgating.
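To make the MAC/MACA semantics concrete before reading the code, here is a minimal executable model of the execute-stage arithmetic described above (my own sketch, assuming the multiplier output feeds the ALU, the shifter right-shifts the ALU result, and the shifted result always updates the accumulator, as the bullets state):

    class DSP:
        # Toy model of the MAC/MACA arithmetic described in the problem.
        def __init__(self):
            self.reg = [0] * 16
            self.reg[1] = 1   # R0 is hard-wired to 0, R1 to 1
            self.acc = 0      # the accumulator

        def _write(self, w, result):
            self.acc = result           # shifter output always updates acc
            if w not in (0, 1):         # R0/R1 are constant registers
                self.reg[w] = result

        def mac(self, w, x, y, z, s):
            # MAC W, X, Y, Z, S:  W = (X*Y + Z) >> S
            result = (self.reg[x] * self.reg[y] + self.reg[z]) >> s
            self._write(w, result)

        def maca(self, w, x, y, s):
            # MACA W, X, Y, S:  W = (X*Y + acc) >> S
            result = (self.reg[x] * self.reg[y] + self.acc) >> s
            self._write(w, result)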

Question 3 (continued)

Here is the code for the DSP:

LP: LT   R2, 0(R5)          ; Load R2 with new value
    MAC  R0, R10, R2, R0, 0 ; Perform the calculation
    MACA R0, R11, R2, 0
    MACA R2, R12, R2, 0
    BNEZ R5, LP
    SW   R2, 4(R5)          ; Delayed branch

a) What is the peak performance, in results per second, of the above three-tap filter? Count each tap as a result.

b) What would be the peak performance, in results per second, of the above code if it were a five-tap filter? Count each tap as a result.
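One plausible setup for part (a), assuming the 3-stage pipeline sustains one instruction per cycle and the delayed branch slot is filled as shown (a sketch of the method, not an official answer):

    results/second = (taps per iteration / cycles per iteration) × clock rate
                   = (3 / 6) × 25 MHz = 12.5 M results/second

Each iteration issues 6 instructions (LT, MAC, two MACAs, BNEZ, and the SW in the delay slot) and produces 3 taps.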

Question 3 (continued)

c) Translate the following DLX code to code that will operate on this DSP. Assume that all floating-point calculations below can be done in fixed-point on the DSP. Do not worry about round-off error from converting between floating point and fixed point. Assume that for the DLX code, F0 has been initialized to 0, F2 to 0.5, and F4 to 2.0. Assume that for the DSP code you will write, R2 has been initialized to 2. Assume for both that register R5 contains the correct initial loop count.

LP: LW      R3, 0(R5)   ; Load R3 with the new value
    MOVI2FP F6, R3      ; Move the value from R3 to F6
    CVTI2F  F6, F6      ; Convert integer value to floating point
    MULTF   F8, F4, F6  ; Multiply by 2
    ADDD    F10, F8, F0 ; Add accumulator to value
    MULTF   F0, F10, F2 ; Divide value by 2
    CVTF2I  F12, F0     ; Convert to integer representation
    MOVFP2I R3, F12     ; Move it to integer registers
    SW      R3, 0(R5)   ; Store value back to mem
    ADDI    R5, R5, -4  ; Point to next element
    BNEZ    R5, LP      ; If not done, branch back
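Before translating, it helps to pin down what the DLX loop computes. Here is a minimal Python restatement under my reading of the code (the function name is mine, and the truncating int() conversion is my approximation of CVTF2I):

    def dlx_loop(values):
        # Model of the DLX loop above: F0 (the accumulator) starts at 0,
        # F2 holds 0.5, and F4 holds 2.0. Each iteration computes
        # acc = (2*x + acc) * 0.5 and stores it back over x.
        # R5 counts down, so elements are processed from the end backward.
        acc = 0.0                        # F0
        for i in reversed(range(len(values))):
            x = float(values[i])         # LW, MOVI2FP, CVTI2F
            acc = (2.0 * x + acc) * 0.5  # MULTF; ADDD; MULTF (divide by 2)
            values[i] = int(acc)         # CVTF2I, MOVFP2I, SW
        return values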

Question 3 (continued)

Here is the code for the vector machine:

LP: LV     V1, R5      ; Load V1 with new value
    MULTSV V2, F0, V1  ; Perform the calculation
    MULTSV V3, F1, V1
    ADDV   V2, V2, V3
    MULTSV V3, F2, V1
    ADDV   V2, V2, V3
    SV     V2, R5
    SUBI   R5, R5, 8
    BNEZ   R5, LP

d) Show the convoys of vector instructions for the above code. Follow the timing examples in the book. Draw lines to show the convoys on the existing code shown above.

e) Show the execution time in clock cycles of this loop with n elements (T_n); assume T_loop = 15. Show the equation, and give the numeric value of the execution time for n = 64.

f) What is R_∞ for this loop? Show the equation, and give the numeric value.
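For parts (e) and (f), the textbook's vector performance model supplies the framework (quoted here as a setup aid; the chime count and the total start-up overhead must be read off the convoys from part (d)):

    T_n = ceil(n / MVL) × (T_loop + T_start) + n × T_chime

    R_∞ = lim (n → ∞) [ (operations per iteration × clock rate) / (clock cycles per iteration) ]

Here T_chime is the number of convoys (chimes) executed per element, and T_start is the sum of the start-up latencies paid once per strip-mined loop iteration.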

Question 3 (continued)

g) List 6 characteristics of DSP instruction set architectures that differ from general-purpose microprocessors.

h) Which of those characteristics are supported in vector architectures?

i) Which of the unsupported characteristics could be handled in software?

j) What changes to the hardware would you make to handle the remaining characteristics?