Course on Advanced Computer Architectures


SOLUTION
Politecnico di Milano, July 9, 2018
Course on Advanced Computer Architectures
Prof. D. Sciuto, Prof. C. Silvano

Surname (Cognome) / Name (Nome) / POLIMI ID Number / Signature (Firma)

Grading: EX1 (5 points), EX2 (5 points), EX3 (6 points), Q1 (5 points), Q2 (5 points), SIX QUESTIONS (6 points), TOTAL (32 points)

EXERCISE 1 - MULTI-CYCLE PIPELINE (5 points)

In this problem we will examine the execution of the following code segment on a single-issue out-of-order processor with these characteristics:
- All functional units are pipelined
- ALU operations take 1 clock cycle
- Memory operations take 2 clock cycles (including the time in the ALU)
- FP ALU operations take 2 clock cycles
- The Write Back unit has a single write port
- There is no register renaming
- Instructions are fetched, decoded and issued in order
- An instruction enters the ISSUE stage only if it does not cause a WAR or WAW hazard
- Only one instruction can be issued at a time; when multiple instructions are ready, the oldest one goes first

Fill in the following table pointing out, for each instruction, the pipeline stages activated in each clock cycle and the type of hazards:

Instruction             C1   C2   C3   C4   C5   C6   C7   C8   C9   C10   Type of Hazards
lw.d   $f1,basea($r1)   IF   ID   IS   ALU  M    WB
addi.d $f1,$f1,k1            IF   ID   IS   FA1  FA2  WB                   RAW $f1, WAW $f1
add.d  $f3,$f2,$f1                IF   ID   IS   FA1  FA2  WB              RAW $f1
add.d  $f1,$f2,$f3                     IF   ID   IS   FA1  FA2  WB        RAW $f3, WAW $f1, WAR $f1
sw.d   $f1,basea($r1)                       IF   ID   IS   ALU  M    WB   RAW $f1

Insert in the empty pipeline scheme the pipeline stages and stalls needed to solve the previous data, control and structural hazards.

Solution:

     C1   C2   C3   C4   C5   C6   C7   C8   C9   C10  C11  C12  C13  C14  C15  C16  C17  C18  C19  C20
I1   IF   ID   IS   ALU  M    WB
I2        IF   ID   S    S    S    IS   FA1  FA2  WB
I3             IF   S    S    S    ID   IS   S    S    FA1  FA2  WB
I4                                 IF   ID   S    S    S    S    S    IS   FA1  FA2  WB
I5                                      IF   S    S    S    S    S    ID   IS   S    S    ALU  M    WB

Fill in the following table pointing out the NUMBER OF STALLS to be inserted for each instruction to solve the hazards:

Num. of Stalls   Instruction             Type of Hazards
0                lw.d   $f1,basea($r1)
3                addi.d $f1,$f1,k1       (RAW $f1) WAW $f1
2                add.d  $f3,$f2,$f1      RAW $f1
3                add.d  $f1,$f2,$f3      (RAW $f3) WAW $f1 WAR $f1
2                sw.d   $f1,basea($r1)   RAW $f1
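The RAW/WAR/WAW classification above can be cross-checked mechanically. The sketch below is not part of the exam: it compares each instruction's write/read register sets (transcribed from the code segment; the labels add.d1/add.d2 are shorthand for the two add.d instructions) against every later instruction.

```python
# Hypothetical helper: classify data hazards from read/write register sets.
def find_hazards(instructions):
    """instructions: list of (name, writes, reads), with registers as sets."""
    hazards = []
    for i, (_, w_i, r_i) in enumerate(instructions):
        for j in range(i + 1, len(instructions)):
            name_j, w_j, r_j = instructions[j]
            for reg in w_i & r_j:      # later instruction reads an earlier result
                hazards.append((name_j, "RAW", reg))
            for reg in w_i & w_j:      # both instructions write the same register
                hazards.append((name_j, "WAW", reg))
            for reg in r_i & w_j:      # later instruction overwrites an earlier source
                hazards.append((name_j, "WAR", reg))
    return hazards

code = [
    ("lw.d",   {"$f1"}, {"$r1"}),
    ("addi.d", {"$f1"}, {"$f1"}),
    ("add.d1", {"$f3"}, {"$f2", "$f1"}),
    ("add.d2", {"$f1"}, {"$f2", "$f3"}),
    ("sw.d",   set(),   {"$f1", "$r1"}),
]
hz = find_hazards(code)
```

Running it reproduces the hazard column of the table, e.g. RAW/WAW on $f1 for addi.d, RAW $f3 and WAR $f1 for the second add.d, and RAW $f1 for sw.d.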

EXERCISE 2 - TOMASULO (5 points)

Assume that the following code has been executed on a CPU with dynamic scheduling based on the TOMASULO algorithm.

1. List all the possible conflicts in the code (use the HAZARDS TYPE column):

Instruction                Issue   Execution Start   Write Result   Hazards Type
I1: LW.D  $F6, 32(R2)      1       2                 8
I2: ADDD  $F2, $F6, $F4    2       9                 11             RAW $F6
I3: MULTD $F0, $F4, $F2    3       12                32             RAW $F2
I4: SUBD  $F12, $F2, $F6   12      13                15             STRUCT RSADDSUB, (RAW $F2), (RAW $F6)
I5: ADDD  $F0, $F12, $F2   16      17                19             STRUCT RSADDSUB, (RAW $F12), (RAW $F2)

2. Is there a hardware configuration that can respect the shown execution timing?
YES: Tomasulo with a ROB, to solve the WAW hazard on $F0 between I5 and I3.

3. If it is possible, describe the architecture in terms of Functional Units and Reservation Stations (number, type and latency of the Functional Units, and number of Reservation Stations for each FU):
- 1 Load/Store Unit with latency 6 clock cycles + 1 Reservation Station
- 1 ADD/SUB Unit with latency 2 clock cycles + 1 Reservation Station
- 1 MUL Unit with latency 20 clock cycles + 1 Reservation Station

The delayed ISSUEs of I4 and I5 imply that there is only one reservation station for ADDD/SUBD, which generates the STRUCTURAL HAZARDs.
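The functional-unit latencies in point 3 can be read directly off the timing table. The sketch below assumes the timing convention used by the solution: a unit with latency L that starts executing at cycle S writes its result at cycle S + L.

```python
# Timing from the exam table: (issue, execution start, write result).
# Assumed convention: latency = write-result cycle - execution-start cycle.
timing = [
    ("I1: LW.D  $F6, 32(R2)",    1,  2,  8),
    ("I2: ADDD  $F2, $F6, $F4",  2,  9, 11),
    ("I3: MULTD $F0, $F4, $F2",  3, 12, 32),
    ("I4: SUBD  $F12, $F2, $F6", 12, 13, 15),
    ("I5: ADDD  $F0, $F12, $F2", 16, 17, 19),
]
latency = {name[:2]: write - start for name, _, start, write in timing}
# Load/Store: 6 cycles (I1), ADD/SUB: 2 cycles (I2, I4, I5), MUL: 20 cycles (I3)
```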

EXERCISE 3: VLIW (6 points)

In this problem, you will port code to a simple 3-issue VLIW machine and schedule it to improve performance. Details about the 3-issue VLIW machine with 3 fully pipelined functional units:
- Integer ALU with 1-cycle latency to the next Integer/FP instruction and 2-cycle latency to the next Branch
- Memory Unit with 3-cycle latency
- Floating Point Unit with 3-cycle latency (the FPU can complete one add or one multiply per clock cycle)
- Branch completed with 1 branch delay slot (branch resolved in the ID stage)
- No interlocks

C Code:
for (int i = 0; i < n; i++)
    A[i] = A[i] + A[i];

Assembly Code:
loop: ld   f1, 0(r6)
      ld   f2, 0(r6)
      fadd f2, f2, f1
      st   f2, 0(r6)
      addi r6, r6, 4
      bne  r6, r7, loop

EXERCISE 3.A
Considering one iteration of the loop, schedule the assembly code for the 3-issue VLIW machine in the following table using list-based scheduling, without software pipelining, loop unrolling, or modification of the loop indexes. You do not need to write in the NOPs (slots can be left blank).

      Integer ALU          Memory Unit        FPU
C0                         ld f1, 0(r6)
C1                         ld f2, 0(r6)
C2
C3
C4                                            fadd f2, f2, f1
C5
C6
C7    addi r6, r6, 4       st f2, 0(r6)
C8
C9    bne r6, r7, loop
C10   (branch delay slot)

How long is the Critical Path? 11 cycles
What performance did you achieve in FP ops per cycle? 1/11
What performance did you achieve in cycles per loop iteration? 11
What code efficiency did you achieve? 6/33 (6 useful operations in 11 x 3 = 33 slots)
What CPI did you achieve? CPI = #VLIW_cycles / IC = 11 / 6 = 1.83
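The figures above all follow from three numbers: 11 VLIW cycles per iteration, 6 useful operations, 3 issue slots per cycle. A quick numeric check, with definitions assumed to match the solution's:

```python
from fractions import Fraction

# Assumed inputs from the EX 3.A schedule: 11 VLIW cycles per iteration,
# 6 useful operations (IC), 3 issue slots per cycle, 1 FP op per iteration.
cycles, ops, slots, fp_ops = 11, 6, 3, 1

cpi = cycles / ops                          # VLIW cycles per instruction
efficiency = Fraction(ops, cycles * slots)  # useful ops / total slots = 6/33
fp_per_cycle = Fraction(fp_ops, cycles)     # one fadd every 11 cycles
```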

EXERCISE 3.B
Consider the unrolling of two iterations of the loop:

Assembly Code:
loop: ld   f1, 0(r6)
      ld   f2, 0(r6)
      fadd f2, f2, f1
      st   f2, 0(r6)
      ld   f3, 4(r6)
      ld   f4, 4(r6)
      fadd f4, f4, f3
      st   f4, 4(r6)
      addi r6, r6, 8
      bne  r6, r7, loop

Considering one iteration of the unrolled loop, schedule the assembly code using list-based scheduling, without software pipelining, further unrolling, or modification of the loop indexes. You do not need to write in the NOPs (slots can be left blank).

      Integer ALU          Memory Unit        FPU
C0                         ld f1, 0(r6)
C1                         ld f2, 0(r6)
C2                         ld f3, 4(r6)
C3                         ld f4, 4(r6)
C4                                            fadd f2, f2, f1
C5
C6                                            fadd f4, f4, f3
C7                         st f2, 0(r6)
C8
C9    addi r6, r6, 8       st f4, 4(r6)
C10
C11   bne r6, r7, loop
C12   (branch delay slot)

How long is the Critical Path? 13 cycles
What performance did you achieve in FP ops per cycle? 2/13
What performance did you achieve in cycles per loop iteration? 13/2 = 6.5
What code efficiency did you achieve? 10/39 (10 useful operations in 13 x 3 = 39 slots)
What CPI did you achieve? CPI = #VLIW_cycles / IC = 13 / 10 = 1.3
How much is the Speedup with respect to EXERCISE 3.A? SpeedUp = CPI_A / CPI_B = 1.83 / 1.3 = 1.41, i.e. a 41% speedup
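The speedup can be verified numerically. The sketch below computes it the way the solution does, as a ratio of CPIs; this is a valid comparison here because unrolling by two leaves the work per array element unchanged.

```python
# CPIs from the two schedules: EX 3.A (11 cycles, 6 instructions) and
# EX 3.B (13 cycles, 10 instructions after unrolling by two).
cpi_a = 11 / 6
cpi_b = 13 / 10
speedup = cpi_a / cpi_b   # ~1.41, i.e. about 41% faster
```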

QUESTION 1 (5 points)
1. Superscalar processors today use an approach that exploits both Instruction Level Parallelism and Thread Level Parallelism. Briefly explain this approach.
Answer: SMT (Simultaneous Multithreading).
2. Which HW mechanisms of dynamically scheduled processors intrinsically support this approach?
3. Which modifications to a generic architecture of dynamically scheduled superscalar processors are required?

QUESTION 2 (5 points)
1. Explain the concept of the Reorder Buffer and the motivations for introducing it in dynamically scheduled superscalar architectures.
2. In speculative Tomasulo, how is the Reorder Buffer used, and during which stage is the ROB entry assigned?

SIX QUESTIONS (6 points)

Answer True or False to the following questions:

Q1) In MIMD architectures with a physically centralized memory organization, the memory address space is shared.
Answer: False
Feedback: the address space does not depend on the physical memory organization.

Q2) Increasing the issue rate of modern processors (for instance up to 12 instructions per clock cycle) provides a huge improvement in the processor clock frequency.
Answer: False
Feedback: increasing the issue rate of a processor means increasing the memory accesses, branch resolutions, register renamings and register accesses per clock cycle. Due to the complexity of implementing such an architecture, we would sacrifice the maximum clock rate. This choice therefore has important disadvantages: notice that the processor with the highest issue rate is also the one with the slowest clock cycle.

Mark the True answers to the following questions (some questions might have multiple True answers):

Q3) Considering the MESI write-invalidate write-back protocol, a cached block can be in one of four states: Modified, Exclusive, Shared, Invalid. For which of these states is the memory up to date?
Answer 1: Only Exclusive. (False)
Answer 2: Only Shared. (False)
Answer 3: Shared and Exclusive. (TRUE)
Answer 4: Shared and Modified. (False)
Feedback: the correct answer is Shared and Exclusive. When the block is in the Shared state, no write has been performed on the block: both the memory and the caches hold the latest value. When the state is Exclusive, only one cache holds a copy of the block and it has not performed any write on it (i.e., both the memory and the single cache holding the copy have the latest value of the block).

Q4) In case of the delayed branch technique, which of the following assertions are true?
Answer 1: The instruction in the branch delay slot is always executed. (TRUE)
Answer 2: The instruction in the branch delay slot is always a NOP. (False)
Answer 3: The instruction in the branch delay slot is always an instruction from before the branch that is independent of it. (False)
Answer 4: It helps the compiler to statically predict the outcome of the branch. (False)
Feedback: the instruction in the branch delay slot is always executed, whether or not the branch is taken. The job of the compiler is to make the instruction placed in the branch delay slot valid and useful; the remaining slots are filled with NOPs.

Q5) Which complications are introduced by out-of-order execution and out-of-order completion?
Answer 1: WAR and WAW hazards (TRUE)
Answer 2: RAW hazards (False)
Answer 3: Exception handling (TRUE)
Answer 4: Control hazards (False)
Answer 5: Instruction reordering (False)
Feedback: out-of-order execution introduces the possibility of WAR and WAW hazards, which do not exist in the five-stage integer pipeline. Out-of-order completion creates major complications in handling exceptions: dynamic scheduling with out-of-order completion must preserve exception behavior, in the sense that exactly those exceptions that would arise if the program were executed in strict program order actually do arise.

Q6) Which of these modifications would you introduce on a cache to decrease the conflict misses?
Answer 1: Increase the cache size (TRUE)
Answer 2: Increase the cache associativity (TRUE)
Answer 3: Increase the size of the blocks (False)
Answer 4: Add a second level of cache (False)
Feedback: the best and most effective way of reducing conflict misses is moving to a higher-associativity cache organization. An increase in cache size also leads to a reduction of conflict misses, as well as of capacity misses. It is wrong, instead, that increasing the block size reduces conflict misses: it is actually the other way around, namely decreasing the block size produces a decrease in conflict misses. See also pages 12-13 of the slide set "Memory performance improvement techniques".
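As an illustration of Q6 (not part of the exam), a toy LRU cache simulation shows how associativity removes conflict misses: a trace that ping-pongs between two blocks mapping to the same set misses on every access in a direct-mapped cache, but only twice (the compulsory misses) in a 2-way cache of the same total size.

```python
from collections import OrderedDict

def misses(trace, num_sets, ways):
    """Count misses for a block-address trace on an LRU set-associative cache."""
    sets = [OrderedDict() for _ in range(num_sets)]
    count = 0
    for block in trace:
        s = sets[block % num_sets]         # set index = block mod num_sets
        if block in s:
            s.move_to_end(block)           # hit: refresh LRU order
        else:
            count += 1                     # miss
            if len(s) == ways:
                s.popitem(last=False)      # evict the least recently used block
            s[block] = True
    return count

# Blocks 0 and 8 map to the same set in both configurations (equal total size:
# 8 blocks). Direct-mapped: 8 sets of 1 way; 2-way: 4 sets of 2 ways.
trace = [0, 8, 0, 8, 0, 8]
direct_mapped = misses(trace, num_sets=8, ways=1)  # conflict miss on every access
two_way = misses(trace, num_sets=4, ways=2)        # only the 2 compulsory misses
```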