Static, multiple-issue (superscalar) pipelines

Static, multiple-issue (superscalar) pipelines

- Start more than one instruction in the same cycle.

[Figure: a static two-issue datapath -- two instructions are fetched per cycle from the PC, and each flows through its own register-file read ports and EX + MEM + WB path.]

Forwarding paths in a superscalar

[Figure: the static two-issue datapath with forwarding paths added between the two pipelines.]

Superscalar execution (two pipelines)

- A "super-instruction" is actually two MIPS instructions: one to execute on the ALU/branch pipe and the other on the load/store pipe.
- The compiler packs super-instructions; it may use a no-op in a super-instruction if it cannot find two suitable instructions.
- The compiler should make sure that there are no data hazards (by inserting no-ops):
  - Fewer no-ops are inserted if the architecture supports forwarding.
  - More no-ops are inserted if the hardware does not support forwarding.
- In each cycle, a super-instruction is fetched and its two instructions are pushed through the two pipelines.
- Effectiveness depends on the ability of the compiler to pack two instructions into every super-instruction.
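The packing step described above can be sketched in software. This is a hypothetical illustration, not the slides' algorithm: a greedy packer that pairs each ALU/branch instruction with a following load/store into one two-issue packet, filling empty slots with no-ops. Hazard checking is deliberately omitted for brevity; a real packer would also insert no-ops between dependent instructions.

```python
# Illustrative sketch (not from the slides): greedy packing of a MIPS
# instruction stream into two-issue packets, one ALU/branch slot and one
# load/store slot per cycle. No-ops fill slots that can't be used.

ALU_BRANCH = {"add", "addi", "sub", "bne", "beq"}
LOAD_STORE = {"lw", "sw"}

def pack(instructions):
    """Pair instructions into (alu_slot, mem_slot) packets, preserving order.
    NOTE: data-hazard checks are omitted here for brevity."""
    packets = []
    i = 0
    while i < len(instructions):
        op = instructions[i].split()[0]
        if op in ALU_BRANCH:
            alu = instructions[i]
            # try to fill the load/store slot with the next instruction
            if i + 1 < len(instructions) and instructions[i + 1].split()[0] in LOAD_STORE:
                mem = instructions[i + 1]
                i += 2
            else:
                mem = "nop"
                i += 1
        else:  # load/store comes first: the ALU slot goes empty this cycle
            alu, mem = "nop", instructions[i]
            i += 1
        packets.append((alu, mem))
    return packets

packets = pack(["lw $t0, 0($s1)",
                "add $t0, $t0, $s2",
                "sw $t0, 0($s1)",
                "addi $s1, $s1, -4",
                "bne $s1, $zero, Loop"])
for alu, mem in packets:
    print(f"{alu:24s} | {mem}")
```

On this example the packer emits four packets, so an un-scheduled iteration would already fit in four issue cycles; whether those cycles are hazard-free is exactly what the scheduling discussion below addresses.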

Loop scheduling on a super-scalar pipeline

Loop: lw   $t0, 0($s1)        // $t0 = array element
      add  $t0, $t0, $s2      // add scalar in $s2
      sw   $t0, 0($s1)        // store result
      addi $s1, $s1, -4       // decrement pointer
      bne  $s1, $zero, Loop   // branch if $s1 != 0

An equivalent code:

Loop: lw   $t0, 0($s1)        // $t0 = array element
      addi $s1, $s1, -4       // decrement pointer
      add  $t0, $t0, $s2      // add scalar in $s2
      sw   $t0, 4($s1)        // store result
      bne  $s1, $zero, Loop   // branch if $s1 != 0

A schedule on two pipelines (with hardware support for forwarding):

      ALU or bne instructions   lw/sw instructions
Loop: addi $s1, $s1, -4         lw $t0, 0($s1)       cycle 1
                                                     cycle 2
      add  $t0, $t0, $s2                             cycle 3
      bne  $s1, $zero, Loop     sw $t0, 4($s1)       cycle 4

Takes 4 cycles to execute one iteration (assuming perfect branch prediction).

Loop unrolling

Original (5 instructions per iteration):

Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -4
      add  $t0, $t0, $s2
      sw   $t0, 4($s1)
      bne  $s1, $zero, Loop

Unrolled once (8 instructions per two iterations):

Loop: lw   $t0, 0($s1)
      lw   $t1, -4($s1)
      addi $s1, $s1, -8
      add  $t0, $t0, $s2
      add  $t1, $t1, $s2
      sw   $t0, 8($s1)
      sw   $t1, 4($s1)
      bne  $s1, $zero, Loop

- Duplicate the body of the loop (lw, add, sw) using $t1, a register different from $t0.
- Update the loop index only once (subtract 8 rather than 4 from $s1).
- Change the constants (offsets) to reflect the new values of the loop index.
- Advantages: fewer total executed instructions (less overhead for loop control).
- Disadvantages: uses more registers.
- Problem: what if the number of iterations is not even?
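The transformation above, and the "odd number of iterations" problem it raises, can be mirrored in software. This is an illustrative sketch, not the slides' code: the loop adds a scalar to every array element, once with a plain loop and once unrolled by two, with a cleanup step for an odd trip count.

```python
# Illustrative sketch (not from the slides): software analogue of the MIPS
# loop above, which adds a scalar to each array element.

def add_scalar(a, s):
    # straightforward loop: one element per iteration
    return [x + s for x in a]

def add_scalar_unrolled2(a, s):
    out = list(a)
    n = len(out)
    i = 0
    while i + 1 < n:          # two elements per iteration: half the loop overhead
        out[i] += s
        out[i + 1] += s
        i += 2
    if i < n:                 # cleanup iteration when the trip count is odd
        out[i] += s
    return out

data = [3, 1, 4, 1, 5]        # odd length exercises the cleanup path
assert add_scalar_unrolled2(data, 10) == add_scalar(data, 10)
print(add_scalar_unrolled2(data, 10))   # -> [13, 11, 14, 11, 15]
```

The cleanup step is the standard answer to the trip-count question: unroll by k, then finish the remaining (n mod k) iterations with the original one-element body.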

Scheduling the unrolled loop

Loop: lw   $t0, 0($s1)
      lw   $t1, -4($s1)
      addi $s1, $s1, -8
      add  $t0, $t0, $s2
      add  $t1, $t1, $s2
      sw   $t0, 8($s1)
      sw   $t1, 4($s1)
      bne  $s1, $zero, Loop

      ALU or bne instructions   lw/sw instructions
Loop:                           lw $t0, 0($s1)       cycle 1
      addi $s1, $s1, -8         lw $t1, -4($s1)      cycle 2
      add  $t0, $t0, $s2                             cycle 3
      add  $t1, $t1, $s2        sw $t0, 8($s1)       cycle 4
      bne  $s1, $zero, Loop     sw $t1, 4($s1)       cycle 5

5 cycles per two iterations (ignoring control hazards).

Unrolling 4 times

      ALU or bne instructions   lw/sw instructions
Loop: addi $s1, $s1, -16        lw $t0, 0($s1)       cycle 1
                                lw $t1, 12($s1)      cycle 2
      add  $t0, $t0, $s2        lw $t2, 8($s1)       cycle 3
      add  $t1, $t1, $s2        lw $t3, 4($s1)       cycle 4
      add  $t2, $t2, $s2        sw $t0, 16($s1)      cycle 5
      add  $t3, $t3, $s2        sw $t1, 12($s1)      cycle 6
                                sw $t2, 8($s1)       cycle 7
      bne  $s1, $zero, Loop     sw $t3, 4($s1)       cycle 8

8 cycles per four iterations (ignoring control hazards).

- Is there a limitation on the number of times we can unroll?
- How will the schedule change if the hardware does not support forwarding and stalling?
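The payoff of the schedules above is easiest to see as cycles per original iteration. This short worked calculation uses only the cycle counts stated in the slides and shows the diminishing return as unrolling amortizes the addi/bne overhead:

```python
# Worked example using the cycle counts from the schedules above:
# cycles per original loop iteration on the two-issue pipeline.

schedules = {
    1: 4,   # scheduled but not unrolled: 4 cycles per iteration
    2: 5,   # unrolled twice: 5 cycles per 2 iterations
    4: 8,   # unrolled 4 times: 8 cycles per 4 iterations
}

for unroll, cycles in schedules.items():
    per_iter = cycles / unroll
    print(f"unroll x{unroll}: {cycles} cycles / {unroll} iterations "
          f"= {per_iter} cycles per iteration")   # 4.0, 2.5, 2.0
```

Per iteration, the cost drops from 4.0 to 2.5 to 2.0 cycles; it can never drop below 2.0 here, since each iteration needs one lw and one sw and the machine has a single load/store pipe.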

Hazards in the Dual-Issue MIPS

More instructions executing in parallel cause more hazards.

Data hazards (even with forwarding paths between the two pipelines):
- Can't use an ALU result in a load/store in the same packet:
      add $t0, $s0, $s1
      lw  $s2, 0($t0)
- Load-use hazard:
      lw  $s2, 0($t0)
      add $t0, $s2, $s1
- Schedule the two instructions in two packets separated by at least one cycle.

Control hazards:
- The penalty for a wrong branch is proportional to the issue width.
- Example: if the branch is resolved in the EX stage, then 4 instructions have to be squashed in case of a branch misprediction; 5 instructions have to be squashed if the instruction after the branch is issued in the same cycle as the branch.

Dynamic Scheduling

Static scheduling:
- Schedule instructions at compile time to get the best execution time.

Dynamic scheduling:
- Hardware changes the instruction execution sequence in order to minimize execution time (out-of-order execution).
- Dependences must be honored (read after write, write after read, and write after write).
- For the best result, control dependences must be tackled.

Components of dynamic scheduling:
- Check for dependences: do we have ready instructions?
- Select ready instructions and map them to multiple functional units.

Instruction window:
- When we look for parallel instructions, we want to consider many instructions (in the "instruction window") for the best result.
- Branches hinder forming a large, accurate window.
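The three dependence kinds a dynamic scheduler must honor can be made concrete with a small classifier. This is an illustrative sketch with a made-up instruction representation (destination register plus set of source registers), not hardware the slides describe:

```python
# Illustrative sketch (hypothetical representation, not from the slides):
# classify the dependences from an earlier instruction to a later one.
# An instruction is modeled as (destination register, set of source registers).

def dependences(first, second):
    """Return the dependence types from `first` (earlier) to `second` (later)."""
    dst1, srcs1 = first
    dst2, srcs2 = second
    found = set()
    if dst1 in srcs2:
        found.add("RAW")   # read after write: true dependence
    if dst2 in srcs1:
        found.add("WAR")   # write after read: anti-dependence
    if dst1 == dst2:
        found.add("WAW")   # write after write: output dependence
    return found

# add $t0, $s0, $s1  followed by  lw $s2, 0($t0): RAW on $t0
print(dependences(("$t0", {"$s0", "$s1"}), ("$s2", {"$t0"})))
```

Only RAW is a true dependence on data; WAR and WAW are name dependences, which out-of-order machines remove by register renaming rather than by stalling.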

The Opteron X4 Microarchitecture

Pipeline stages: Instruction Fetch, Decode, Dispatch, Execute, Mem, Write Back.

The ARM Cortex-A8 architecture

FIGURE 4.75 The A8 pipeline. The first three stages fetch instructions into a 12-entry instruction fetch buffer. The Address Generation Unit (AGU) uses a Branch Target Buffer (BTB), Global History Buffer (GHB), and a Return Stack (RS) to predict branches to try to keep the fetch queue full. Instruction decode is five stages and instruction execution is six stages.