In-order vs. Out-of-order Execution
(CSE 471, Spring 2010: Multiple Instruction Width)

In-order instruction execution
- instructions are fetched, executed & committed in compiler-generated order
- if one instruction stalls, all instructions behind it stall
- instructions are statically scheduled by the hardware
  - scheduled in compiler-generated order
  - the hardware decides how many of the next n instructions can be issued, where n is the superscalar issue width
  - superscalars can have structural & data hazards within the n instructions
- advantages of in-order instruction scheduling:
  - simpler implementation
  - faster clock cycle
  - fewer transistors
  - faster design/development/debug time

Out-of-order instruction execution
- instructions are fetched in compiler-generated order
- instruction completion may be in-order (today) or out-of-order (older computers)
- in between, they may be executed in some other order
- instructions are dynamically scheduled by the hardware
  - the hardware decides in what order instructions can be executed
  - instructions behind a stalled instruction can pass it if they do not depend on it
- advantages:
  - higher performance
  - better at hiding latencies, less processor stalling
  - higher utilization of functional units
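To make the difference concrete, here is a minimal, hypothetical MIPS-style sequence (the registers and the cache miss are assumed purely for illustration). An in-order machine stalls everything behind the dependent addu; an out-of-order machine can let the independent instruction pass the stalled one.

        lw   R1, 0(R5)       # long-latency load (say, a cache miss)
        addu R2, R1, R6      # depends on R1: must wait for the load
        addu R3, R4, R7      # independent of the load:
                             #   in-order: stalls behind the addu above
                             #   out-of-order: may execute while the load
                             #   is still outstanding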

In-order instruction issue: Alpha 21164

Two styles of static instruction scheduling:
- dispatch buffer & instruction slotting (Alpha 21164)
- shift register model (UltraSPARC-1)

Instruction slotting:
- can issue up to 4 instructions per cycle
- completely empties the instruction buffer before filling it again
- the compiler can pad with nops so that a conflicting instruction is issued with the following instructions, not alone
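A sketch of the nop-padding idea, in the MIPS-style notation of the later slides (the structural conflict between the two loads is assumed only for illustration, e.g. a single load port):

        lw   R1, 0(R5)       # fetch group of 4
        addu R2, R2, R6
        addu R3, R3, R7
        lw   R4, 8(R5)       # conflicts with the first lw, so it would
                             # issue alone after the rest of the group

        lw   R1, 0(R5)       # padded version: the nop pushes the second
        addu R2, R2, R6      # lw into the next fetch group, where it can
        addu R3, R3, R7      # issue together with the instructions that
        nop                  # follow it instead of by itself
        lw   R4, 8(R5)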

21164 Instruction Unit Pipeline: Fetch & Issue

S0: instruction fetch
- branch prediction bits read

S1: opcode decode
- target address calculation
- if predicted taken, redirect the fetch

S2: instruction slotting
- decide which of the next 4 instructions can be issued
- intra-cycle structural hazard check
- intra-cycle data hazard check

S3: instruction dispatch
- inter-cycle load-use hazard check
- register read
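A small, hypothetical example of the inter-cycle load-use hazard that S3 checks (register names are illustrative):

        # issued in cycle i
        lw   R1, 0(R5)       # load result is not ready immediately
        addu R2, R3, R4      # independent, issues alongside the load

        # candidate for cycle i+1
        addu R6, R1, R7      # uses R1: dispatch holds it back until the
                             # load's result is available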

In-order instruction issue: UltraSPARC-1

Shift register model:
- can issue up to 4 instructions per cycle
- shift in new instructions after every group of instructions is issued

Code Scheduling on Superscalars

Original code:

Loop:   lw   R1, 0(R5)
        addu R1, R1, R6
        sw   R1, 0(R5)
        addi R5, R5, -4
        bne  R5, R0, Loop

With load-latency-hiding code:

Loop:   lw   R1, 0(R5)
        addi R5, R5, -4
        addu R1, R1, R6
        sw   R1, 4(R5)
        bne  R5, R0, Loop

Two-wide issue schedule:

        ALU/branch instructions     memory instructions     clock cycle
Loop:                                                        1
                                                             2
                                                             3
                                                             4

Code Scheduling on Superscalars: Loop Unrolling

        ALU/branch instruction      Data transfer instruction     clock cycle
Loop:   addi R5, R5, -16            lw   R1, 0(R5)                1
                                    lw   R2, 12(R5)               2
        addu R1, R1, R6             lw   R3, 8(R5)                3
        addu R2, R2, R6             lw   R4, 4(R5)                4
        addu R3, R3, R6             sw   R1, 16(R5)               5
        addu R4, R4, R6             sw   R2, 12(R5)               6
                                    sw   R3, 8(R5)                7
        bne  R5, R0, Loop           sw   R4, 4(R5)                8

What are the cycles per iteration? What is the IPC?
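One plausible way to fill in the two-wide schedule for the rescheduled loop, assuming one ALU/branch slot plus one memory slot per cycle and a one-cycle load-use delay (a sketch, not the only valid schedule):

        ALU/branch instructions     memory instructions     clock cycle
Loop:                               lw   R1, 0(R5)           1
        addi R5, R5, -4                                      2
        addu R1, R1, R6                                      3
        bne  R5, R0, Loop           sw   R1, 4(R5)           4

Under those assumptions, 5 instructions complete every 4 cycles: 4 cycles per iteration and an IPC of 5/4 = 1.25. Reading the unrolled schedule the same way, 14 instructions (6 ALU/branch + 8 memory) complete in 8 cycles, so 8/4 = 2 cycles per original iteration and an IPC of 14/8 = 1.75.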

Code Scheduling on Superscalars: Loop Unrolling

Advantages: + +
Disadvantages: - -

Superscalars

Hardware impact:
- more & pipelined functional units
- multi-ported registers for multiple register accesses
- more buses from the register file to the additional functional units
- multiple decoders
- more hazard detection logic
- more bypass logic
- wider instruction fetch
- multi-banked L1 data cache
Without these, the processor has structural hazards (due to an unbalanced design) and stalls.

There are restrictions on the instruction types that can be issued together, to reduce the amount of hardware. Static (compiler) scheduling helps.