

EE 4720 Homework 4 Solution    Due: 22 April 2002

To solve Problem 3 and the next assignment a paper has to be read. Do not
leave the reading to the last minute; however, try attempting the first
problem below before reading the paper.

Problem 1: The pipeline below was derived from the five-stage statically
scheduled MIPS implementation by splitting each stage (except writeback)
into two stages. Each pair of stages (say, IF1 and IF2) does the same thing
as the original stage (say, IF), but because it is broken into two stages
it takes two clock cycles rather than one. The diagram shows only a few
details. Bypass connections into the ALU are available from all stages from
MEM1 to WB.

  IF1  IF2  ID1  ID2  EX1  EX2  MEM1  MEM2  WB
  PC

The advantage of this pipeline is that the clock frequency can be doubled.
(Actually, not quite times two.) Perfect execution is shown in the diagram
below:

add $1, $2, $3     IF1 IF2 ID1 ID2 EX1 EX2 ME1 ME2 WB
sub $4, $5, $6         IF1 IF2 ID1 ID2 EX1 EX2 ME1 ME2 WB
and $7, $8, $9             IF1 IF2 ID1 ID2 EX1 EX2 ME1 ME2 WB

(a) Suppose the old five-stage system ran at a clock frequency of 1 GHz and
the new system runs at 2 GHz. How does the execution time compare on the
new system when execution is perfect?

It's half! That is, performance scaled with clock frequency.

(b) Show a pipeline execution diagram of the code below on the new
pipeline. Note dependencies through registers $10 and $11.

add $10, $2, $3    IF1 IF2 ID1 ID2 EX1 EX2 ME1 ME2 WB
sub $4, $10, $6        IF1 IF2 ID1 ID2 --> EX1 EX2 ME1 ME2 WB
and $11, $8, $9            IF1 IF2 ID1 --> ID2 EX1 EX2 ME1 ME2 WB
or $20, $21, $22               IF1 IF2 --> ID1 ID2 EX1 EX2 ME1 ME2 WB
xor $7, $11, $0                    IF1 --> IF2 ID1 ID2 EX1 EX2 ME1 ME2 WB

(c) In the previous part there should be a stall on the new pipeline that
does not occur on the original pipeline. (It's not too late to change your
answer!) How does that affect the usefulness of splitting pipeline stages?

Assuming not all adjacent instructions are truly dependent, splitting is
still useful, but performance does not scale with clock frequency. (It
never does.)

(d) (Optional, complete before reading the paper.) To get that
I'm-so-clever feeling, answer the following: Suppose there is no way a
32-bit add can be completed in less than two cycles. Is there any way to
perform addition so that results can be bypassed to an immediately
following instruction, as in the example above, but without stalling? The
technique must work when adding any two 32-bit numbers. Hint: The adder
would have to be redesigned. (A question in the next homework assignment
revisits the issue.)

Split the ALU into sixteen-bit parts and bypass the low and high parts
separately.
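The split-adder idea can be illustrated in software. In the sketch below
(an illustration of the technique, not the actual hardware design), the low
16 bits of the sum are produced in the first cycle and the high 16 bits,
together with the carry out of the low half, in the second. Because a
dependent instruction also needs only the low halves of its operands in its
first execute cycle, the low half of a result can be bypassed to it without
a stall.

```python
# Illustrative sketch of a 32-bit add split into two 16-bit halves
# computed in consecutive cycles. The low half of the sum is available one
# cycle early, so it can be bypassed to the low-half (EX1) stage of an
# immediately following dependent add.

MASK16 = 0xFFFF

def add_low(a, b):
    """Cycle 1 (EX1): add the low 16 bits; return (low 16 bits, carry out)."""
    s = (a & MASK16) + (b & MASK16)
    return s & MASK16, s >> 16

def add_high(a, b, carry_in):
    """Cycle 2 (EX2): add the high 16 bits plus the carry from the low half."""
    s = (a >> 16) + (b >> 16) + carry_in
    return s & MASK16

def split_add(a, b):
    """Full 32-bit add performed as two staggered 16-bit adds."""
    lo, c = add_low(a, b)    # available at the end of the first cycle
    hi = add_high(a, b, c)   # available at the end of the second cycle
    return (hi << 16) | lo

# The split result matches an ordinary 32-bit (modulo 2^32) add:
assert split_add(0x0000FFFF, 0x00000001) == 0x00010000  # carry across halves
assert split_add(0xFFFFFFFF, 0x00000001) == 0x00000000  # wraps around
```

The key point is the stagger: when instruction i+1 enters its first execute
cycle, instruction i has already finished its low half, so the low-half
bypass carries valid data even though the full 32-bit result is not yet
complete.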

Problem 2: Note: The following problem is similar to one given in the Fall
2001 semester; see http://www.ece.lsu.edu/ee4720/2001f/hw03.pdf (the
problem) and http://www.ece.lsu.edu/ee4720/2001f/hw03_sol.pdf (the
solution). For best results do not look at the solutions until you're
really stuck. The problem below uses MIPS instead of DLX and is for Method
3 instead of Method 1.

The code below executes on a dynamically scheduled four-way superscalar
MIPS implementation using Method 3, physical register numbers. Loads and
stores use the load/store unit, which consists of segments L1 and L2. The
floating-point multiply unit is fully pipelined and consists of six
segments, M1 to M6. The usual number of instructions (for a 4-way
superscalar machine) can be fetched, decoded, and committed per cycle. An
unlimited number of instructions can complete (but not commit) per cycle.
(Not realistic, but it makes the solution easier.) There are an unlimited
number of reservation stations, reorder buffer entries, and physical
registers. The target of a branch is fetched in the cycle after the branch
is in ID, whether or not the branch condition is available. (We'll cover
that later.)

(a) Show a pipeline execution diagram for the code below until the
beginning of the fourth iteration. Show where instructions commit.

See the diagram below.

(b) What is the CPI for a large number of iterations? Hint: There should be
fewer than six cycles per iteration.

The CPI is 3/6 = 0.5.

(c) Show the entries in the ID and commit register maps for registers f0
and $1 for each cycle in the first two iterations. If several values are
assigned in the same cycle, show each one separated by commas.

(d) Show the values in the physical register file for f0 and $1 for the
first two iterations. Use a ] to show when a physical register is removed
from the free list and use a [ to show when it is put back in the free
list.

See the pipeline execution diagram below.

! Solution
LOOP: ! Instructions shown in dynamic order. (Instructions repeated.)
! Cycle            0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
ldc1  f0, 0($1)    IF ID Q  L1 L2 WC
ldc1  f0, 0($1)             IF ID Q  L1 L2 WB C
ldc1  f0, 0($1)                      IF ID Q  L1 L2 WB C
ldc1  f0, 0($1)                               IF ID Q  L1 L2 WB
mul.d f0, f0, f2   IF ID Q  M1 M2 M3 M4 M5 ...

ID Map
f0  99  97,96  94,93  91,90
$1  98  95     92     89
# In cycle one first 97 is assigned to f0, then 96 (replacing 97). The
# same sort of replacement occurs in cycles 4 and 7.

Commit Map
f0  99  97  96  94  93  91  90
$1  98  95  92  89

! Cycle  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Physical Register File
99  1.0 ]
98  0x1000 ]
97  [ 10 ]
96  [ 11 ]
95  [ 0x1008 ]
94  [ 20 ]
93  [ 2.2 ]
92  [ 0x1010 ]
! Cycle  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
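The bookkeeping behind the two register maps and the free list can be
mimicked with a small software model. The sketch below is illustrative
(the class name, method names, and the particular register numbers are
chosen to resemble the solution's 99/97/96... sequence, not taken from the
hardware): at decode, each destination gets a fresh physical register from
the free list (the ] in the diagram) and the ID map is updated; at commit,
the physical register holding the previous value of that destination is
returned to the free list (the [ in the diagram) and the commit map is
updated.

```python
# A minimal sketch of Method-3-style renaming with a free list, an ID
# (frontend) map, and a commit map. Names and numbering are illustrative;
# this is not the exact pipeline hardware.

from collections import deque

class Renamer:
    def __init__(self, initial_map, free_phys):
        # Each architected register starts mapped to some physical
        # register; the remaining physical registers are on the free list.
        self.id_map = dict(initial_map)      # updated at decode (ID)
        self.commit_map = dict(initial_map)  # updated at commit
        self.free = deque(free_phys)

    def decode(self, dst):
        """Rename destination dst: allocate a new physical register."""
        prev = self.id_map[dst]       # remembered so it can be freed at commit
        new = self.free.popleft()     # ']' in the diagram: leaves free list
        self.id_map[dst] = new
        return new, prev

    def commit(self, dst, new, prev):
        """Commit: free the physical register holding the old value."""
        self.free.append(prev)        # '[' in the diagram: back on free list
        self.commit_map[dst] = new

# First iteration's two writes to f0 (ldc1 then mul.d):
r = Renamer({'f0': 99, '$1': 98}, [97, 96, 95, 94, 93, 92, 91, 90])
a = r.decode('f0')    # ldc1 f0:  f0 -> 97
b = r.decode('f0')    # mul.d f0: f0 -> 96 (97 now holds a doomed-to-die value)
r.commit('f0', *a)    # ldc1 commits:  99 returns to the free list
r.commit('f0', *b)    # mul.d commits: 97 returns to the free list
print(r.id_map['f0'], r.commit_map['f0'])  # prints: 96 96
```

Note how the ID map runs ahead of the commit map (97,96 in the ID map while
the commit map still shows 99), which is exactly the pattern in the
solution's map tables above.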

The following is an introduction to the next few problems.

As mentioned several times in class, many of the performance-enhancing
microarchitectural features that came into wide use in the closing decades
of the twentieth century (I love the way that sounds!) are much easier to
apply to RISC ISAs than CISC ISAs. Bound by golden handcuffs to the CISCish
IA-32 ISA, Intel was forced to get RISC-level performance from IA-32. (Not
just Intel: DEC [now Compaq, perhaps soon HP] faced the problem with the
VAX ISA, and IBM with the 360.) The solution chosen by Intel (and also DEC)
was to translate individual IA-32 instructions into one or more µops
(micro-operations). Each µop is something like a RISC instruction, and so
the parts of the hardware beyond the IA-32-to-µop translation can employ
the same techniques used to implement RISC ISAs.

The paper at http://www.intel.com/technology/itj/q12001/articles/art_2.htm
and http://www.ece.lsu.edu/ee4720/s/hinton_p4.pdf (password needed off
campus, will be given in class) describes the Pentium 4 implementation of
IA-32, including µops (which are typeset using u instead of the Greek
letter µ, except occasionally in figures). This paper was not written for a
senior-level computer architecture class four weeks from the end of the
semester, and so it includes material which we have not yet covered (caches
and TLBs) and some material not covered at all. Some things in the paper
are not explained (how they do branch prediction, or what the Pentium 4
pipeline segments in Figure 3 mean); some of this can be figured out, other
things have to be found out elsewhere (but not for this assignment).

Read the paper and answer the question below. The next homework assignment
will include additional questions on the paper. For this initial reading
skip or lightly read the material on the L2 cache, L1 data cache, and the
ITLB. Questions on the cache material might be asked in a later assignment.
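The translation described above can be pictured with a toy decomposition.
The µop names, operand encoding, and temporary registers below are
hypothetical (the actual Pentium 4 µop set is not documented at this
level); the point is only the pattern: a register-destination CISC
instruction maps to a single RISC-like µop, while a memory-destination
(read-modify-write) instruction expands into separate load, compute, and
store µops.

```python
# Toy illustration of CISC -> uop translation. All uop names ('uLD',
# 'uADD', 'uST') and the temporary registers are invented for this sketch.

def translate(instr):
    """Expand a simplified IA-32-style (op, dst, src) instruction into
    RISC-like uops, each a tuple of (uop, destination, sources...)."""
    op, dst, src = instr
    uops = []
    if dst.startswith('['):          # memory destination: read-modify-write
        addr = dst.strip('[]')
        uops.append(('uLD', 'tmp0', addr))                    # load old value
        uops.append(('u' + op.upper(), 'tmp1', 'tmp0', src))  # compute
        uops.append(('uST', addr, 'tmp1'))                    # store result
    else:                            # register destination: a single uop
        uops.append(('u' + op.upper(), dst, dst, src))
    return uops

print(translate(('add', 'eax', 'ebx')))   # one uop
print(translate(('add', '[mem]', 'ecx'))) # three uops: load, add, store
```

Everything downstream of such a translator sees only simple fixed-format
operations, which is what lets the out-of-order core apply the same
renaming and scheduling techniques used for RISC ISAs.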
Problem 3: What does the paper call the following actions and components?
(That is, translate from the terminology used in class to the terminology
used in the paper.)

Class terminology           Paper terminology
Commit                      Retire
ID Register Map             Frontend RAT
Commit Register Map         Retirement RAT
Physical Register File      Physical Register File