Superscalar Organization

Similar documents
ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti

Pipelining to Superscalar

EE382A Lecture 3: Superscalar and Out-of-order Processor Basics

Beyond Pipelining. CP-226: Computer Architecture. Lecture 23 (19 April 2013) CADSL

Pipeline Processor Design

Limitations of Scalar Pipelines

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs

Superscalar Organization

Superscalar Processor

EECS 470 Lecture 4. Pipelining & Hazards II. Fall 2018 Jon Beaumont

EN2910A: Advanced Computer Architecture Topic 03: Superscalar core architecture

EN2910A: Advanced Computer Architecture Topic 03: Superscalar core architecture Prof. Sherief Reda School of Engineering Brown University

ECE/CS 552: Introduction to Superscalar Processors

Day 1: Introduction Course: Superscalar Architecture

John P. Shen Microprocessor Research Intel Labs March 19, 2002

EECS 470. Control Hazards and ILP. Lecture 3 Winter 2014

ECE 552: Introduction To Computer Architecture 1. Scalar upper bound on throughput. Instructor: Mikko H Lipasti. University of Wisconsin-Madison

Architectures for Instruction-Level Parallelism

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

E0-243: Computer Architecture

Multiple Instruction Issue. Superscalars

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design

EN164: Design of Computing Systems Lecture 24: Processor / ILP 5

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)

Hardware-Based Speculation

EECC551 Exam Review 4 questions out of 6 questions

Case Study IBM PowerPC 620

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

Advanced Computer Architecture

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

The Processor: Instruction-Level Parallelism

LIMITS OF ILP. B649 Parallel Architectures and Programming

Computer Architecture Spring 2016

ECE 587 Advanced Computer Architecture I

Advanced Instruction-Level Parallelism

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

Complex Pipelines and Branch Prediction

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

Processor (IV) - advanced ILP. Hwansoo Han

CS433 Homework 2 (Chapter 3)

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

Super Scalar. Kalyan Basu March 21,

For this problem, consider the following architecture specifications: Functional Unit Type Cycles in EX Number of Functional Units

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

Instruction-Level Parallelism and Its Exploitation

Exploitation of instruction level parallelism

Multiple Instruction Issue and Hardware Based Speculation

CS 152, Spring 2012 Section 8

Hardware-based Speculation

Copyright 2012, Elsevier Inc. All rights reserved.

TDT 4260 lecture 7 spring semester 2015

CS433 Homework 2 (Chapter 3)

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

Instruction Level Parallelism

EC 513 Computer Architecture

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , )

Dynamic Scheduling. CSE471 Susan Eggers 1

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

Computer Science 146. Computer Architecture

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

CS425 Computer Systems Architecture

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

COMPUTER ORGANIZATION AND DESI

Lecture-13 (ROB and Multi-threading) CS422-Spring

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)

Lecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections )

Four Steps of Speculative Tomasulo cycle 0

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation

Handout 2 ILP: Part B

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Advanced issues in pipelining

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

Lecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ

CS 152, Spring 2011 Section 8

CS425 Computer Systems Architecture

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU

Instruction Level Parallelism

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Lecture 21: Parallelism ILP to Multicores. Parallel Processing 101

Complex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

Multi-cycle Instructions in the Pipeline (Floating Point)

EITF20: Computer Architecture Part3.2.1: Pipeline - 3

ECE 571 Advanced Microprocessor-Based Design Lecture 4

Getting CPI under 1: Outline

Exploiting ILP with SW Approaches. Aleksandar Milenković, Electrical and Computer Engineering University of Alabama in Huntsville

5008: Computer Architecture

Course on Advanced Computer Architectures

Chapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,

Wide Instruction Fetch

15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

Transcription:

Superscalar Organization Nima Honarmand

Instruction-Level Parallelism (ILP) Recall: Parallelism is the number of independent tasks available ILP is a measure of inter-dependencies between insns. Average ILP = num. instruction / num. cyc required in an ideal machine code1: ILP = 1 i.e. must execute serially code2: ILP = 3 code1: r1 r2 + 1 r3 r1 / 17 r4 r0 - r3 i.e. can execute at the same time code2: r1 r2 + 1 r3 r9 / 17 r4 r0 - r10

ILP!= IPC ILP usually assumes Infinite resources Perfect fetch Unit-latency for all instructions ILP is a property of the program dataflow IPC is the real observed metric How many insns. are executed per cycle ILP is an upper-bound on the attainable IPC Specific to a particular program

Purported Limits on ILP Weiss and Smith [1984] 1.58 Sohi and Vajapeyam [1987] 1.81 Tjaden and Flynn [1970] 1.86 Tjaden and Flynn [1973] 1.96 Uht [1986] 2.00 Smith et al. [1989] 2.00 Jouppi and Wall [1988] 2.40 Johnson [1991] 2.50 Acosta et al. [1986] 2.79 Wedig [1982] 3.00 Butler et al. [1991] 5.8 Melvin and Patt [1991] 6 Wall [1991] 7 Kuck et al. [1972] 8 Riseman and Foster [1972] 51 Nicolau and Fisher [1984] 90

ILP Limits of Scalar Pipelines (1) Scalar upper bound on throughput Limited to CPI >= 1 Solution: superscalar pipelines with multiple insns at each stage Prefetch U-pipe Decode1 V-pipe Decode2 Decode2 Pentium Pipeline Execute Execute Writeback Writeback

ILP Limits of Scalar Pipelines (2) Inefficient unified pipeline Lower resource utilization and longer instruction latency Solution: diversified pipelines IF ID RD EX ALU MEM1 FP1 BR MEM2 FP2 FP3 WB

ILP Limits of Scalar Pipelines (3) Rigid pipeline stall policy A stalled instruction stalls all newer instructions Solution 1: out-of-order execution IF ID RD EX WB Dispatch Buffer Reorder Buffer ALU MEM1 FP1 BR MEM2 FP2 FP3 ( in order ) ( out of order ) ( out of order ) ( in order )

ILP Limits of Scalar Pipelines (3) Rigid pipeline stall policy A stalled instruction stalls all newer instructions Solution 1: out-of-order execution Solution 2: interstage buffers In Program Order Out of Order In Program Order Fetch Decode Dispatch Execute Complete Instruction Buffer Dispatch Buffer Issuing Buffer Completion Buffer Store Buffer Retire

ILP Limits of Scalar Pipelines (4) Instruction dependencies limit parallelism Frequent stalls due to data and control dependencies Solution 1: renaming for WAR and WAW register dependences Solution 2: speculation for control dependences and memory dependences

ILP Limits of Scalar Pipelines (Summary) 1. Scalar upper bound on throughput Limited to CPI >= 1 Solution: superscalar pipelines with multiple insns at each stage 2. Inefficient unified pipeline Lower resource utilization and longer instruction latency Solution: diversified pipelines 3. Rigid pipeline stall policy A stalled instruction stalls all newer instructions Solution: out-of-order execution and inter-stage buffers 4. Instruction dependencies limit parallelism Frequent stalls due to data and control dependencies Solutions: renaming and speculation State of the art: Out-of-Order Superscalar Pipelines

Overall Picture Fetch issues: Fetch multiple isns Branches Branch target mis-alignment Decode issues: Identify insns Find dependences Execution issues: Dispatch insns Resolve dependences Bypass networks Multiple outstanding memory accesses Completion issues: Out-of-order completion Speculative instructions Precise exceptions Register Data Flow Branch Predictor Reorder Buffer (ROB) Store Queue I-cache FETCH DECODE COMMIT Instruction Buffer Instruction Flow Integer Floating-point Media Memory EXECUTE D-cache Memory Data Flow State of the art: Out-of-Order Superscalar Pipelines