Limitations of Scalar Pipelines

Similar documents
Superscalar Processor

ECE/CS 552: Introduction to Superscalar Processors

ECE 552: Introduction To Computer Architecture 1. Scalar upper bound on throughput. Instructor: Mikko H Lipasti. University of Wisconsin-Madison

Superscalar Organization

Superscalar Organization

E0-243: Computer Architecture

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines

EN2910A: Advanced Computer Architecture Topic 03: Superscalar core architecture

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multiple Instruction Issue. Superscalars

Architectures for Instruction-Level Parallelism

EN2910A: Advanced Computer Architecture Topic 03: Superscalar core architecture Prof. Sherief Reda School of Engineering Brown University

EE382A Lecture 3: Superscalar and Out-of-order Processor Basics

Hardware-Based Speculation

Complex Pipelines and Branch Prediction

Hardware-based Speculation

ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti

DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)

Pipelining to Superscalar

Dynamic Scheduling. CSE471 Susan Eggers 1

Processor (IV) - advanced ILP. Hwansoo Han

Case Study IBM PowerPC 620

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design

Complex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

CS152 Computer Architecture and Engineering. Complex Pipelines

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

Day 1: Introduction Course: Superscalar Architecture

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation

Lecture 16: Core Design. Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

Advanced issues in pipelining

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.

EN164: Design of Computing Systems Lecture 24: Processor / ILP 5

Pipelining. CSC Friday, November 6, 2015

Handout 2 ILP: Part B

Tutorial 11. Final Exam Review

CS 2410 Mid term (fall 2015) Indicate which of the following statements is true and which is false.

Four Steps of Speculative Tomasulo cycle 0

EECC551 Exam Review 4 questions out of 6 questions

COMPUTER ORGANIZATION AND DESI

CMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1

Lecture-13 (ROB and Multi-threading) CS422-Spring

PowerPC 620 Case Study

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

CS146 Computer Architecture. Fall Midterm Exam

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

Pipelining and Vector Processing

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

Chapter 9. Pipelining Design Techniques

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

Chapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,

Current Microprocessors. Efficient Utilization of Hardware Blocks. Efficient Utilization of Hardware Blocks. Pipeline

Superscalar Processors

The Processor: Instruction-Level Parallelism

November 7, 2014 Prediction

HARDWARE SPECULATION. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)

Instruction Level Parallelism

CS 152 Computer Architecture and Engineering

Lecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ

5008: Computer Architecture

CS433 Homework 2 (Chapter 3)

CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Lecture 18: Instruction Level Parallelism -- Dynamic Superscalar, Advanced Techniques,

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

Load1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1

Superscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency?

Superscalar Processors Ch 14

Keywords and Review Questions

Advanced Computer Architecture

Chapter 4. The Processor

CS 152, Spring 2012 Section 8

Computer System Architecture Final Examination Spring 2002

Static vs. Dynamic Scheduling

HW1 Solutions. Type Old Mix New Mix Cost CPI

Pipelined Processor Design. EE/ECE 4305: Computer Architecture University of Minnesota Duluth By Dr. Taek M. Kwon

Lecture 9: Dynamic ILP. Topics: out-of-order processors (Sections )

15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University

CS433 Homework 2 (Chapter 3)

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

Processor: Superscalars Dynamic Scheduling

Superscalar Processor Design

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs

Advanced Instruction-Level Parallelism

Pentium IV-XEON. Computer architectures M

DYNAMIC SPECULATIVE EXECUTION

Super Scalar. Kalyan Basu March 21,

Transcription:

Limitations of Scalar Pipelines Superscalar Organization Modern Processor Design: Fundamentals of Superscalar Processors Scalar upper bound on throughput IPC <= 1 or CPI >= 1 Inefficient unified pipeline Long latency for each instruction Rigid pipeline stall policy One stalled instruction stalls all newer instructions Parallel Pipelines Intel Pentium Parallel Pipeline D1 D1 D1 (a) No Parallelism (b) Temporal Parallelism D2 D2 D2 (d) Parallel Pipeline (c) Spatial Parallelism U - Pipe V - Pipe 1

Diversified Pipelines ID RD ALU MEM1 FP1 Power4 Diversified Pipelines FP FX/LD 1 Fetch Q Decode FX/LD 2 I-Cache /CR PC Scan Predict Reorder MEM2 FP2 FP3 FP1 FP2 FX1 LD1 LD2 FX2 CR StQ D-Cache Rigid Pipeline Stall Policy Dynamic Pipelines ID Bypassing of Stalled Not Allowed Stalled Backward Propagation of Stalling RD ALU MEM1 FP1 MEM2 FP2 FP3 ( in order ) ( out of order ) Reorder ( out of order ) ( in order ) 2

Interstage s Stage i (1) Stage i + 1 Stage i (>_ n) Stage i + 1 (a) 1 1 Stage i (n) Stage i +1 (c) n ( in order ) n ( in order ) (b) ( in order ) ( out of order ) Superscalar Pipeline Stages In Program Order Out of Order In Program Order Fetch Decode Execute Complete Retire Issuing Completion Store Limitations of Scalar Pipelines Scalar upper bound on throughput IPC <= 1 or CPI >= 1 Solution: wide (superscalar) pipeline Inefficient unified pipeline Long latency for each instruction Solution: diversified, specialized pipelines Rigid pipeline stall policy One stalled instruction stalls all newer instructions Solution: Out-of-order execution, distributed execution pipelines Impediments to High IPC Register Data Branch Predictor Reorder (ROB) Store Queue I-cache FETCH DECODE COMMIT Integer Floating-point Media Memory ECUTE D-cache Memory Data 3

Superscalar Pipeline Design Fetching Issues Decoding Issues ing Issues Execution Issues Completion & Retiring Issues Objective: Fetch multiple instructions per cycle Challenges: Branches: control dependences Branch target misalignment cache misses Solutions Code alignment (static vs.dynamic) Prediction/speculation PC Memory 3 instructions fetched I-Cache Organization Fetch Alignment Address Address PC XX00001 Row Decoder Cache Line R o w De c oder Cache Line Row decoder 000 001 111 00 01 10 11 1 cache line = 1 physical row 1 cache line = 2 physical rows Fetch group Row width 4

Odd directory sets A & B Even directory sets A & B AR TLB hit and buffer control Interlock, dispatch, branch, execution RIOS-I I Fetch Hardware T 255 0 A0 1 A4 2 A8 3 A12 D B0 B4 B8 mux B12 n T 255 0 A1 1 A5 2 A9 3 A13 D B1 B5 B9 mux B13 T 255 0 A2 1 A6 2 A10 3 A14 buffer network n 1 D B2 B6 mux B10 B14 n 2 255 0 A3 1 A7 2 A11 3 A15 D B3 B7 mux B11 B15 n 3 Issues in Decoding Primary Tasks Identify individual instructions (!) Determine instruction types Determine dependences between instructions Two important factors set architecture Pipeline width and Issue Parallel pipeline Centralized instruction fetch Centralized instruction decode Diversified pipeline Distributed instruction execution Necessity of fetching decoding dispatching FU 1 FU 2 FU 3 FU n execution 5

Centralized Reservation Station (issue) Centralized reservation station (dispatch buffer) Distributed Reservation Station Issue buffer Distributed reservation stations Execute Execute Completion buffer Finish Complete Completion buffer Issues in Execution Bypass Networks Fetch Q I-Cache PC Scan Current trends More parallelism bypassing very challenging Deeper pipelines More diversity Functional unit types Integer Floating point Load/store most difficult to make parallel Branch Specialized units (media) FP FP1 FP2 FX/LD 1 Decode /CR Predict Reorder O(n 2 ) interconnect from/to FU inputs and outputs Associative tag-match to find operands Solutions (hurt IPC, help cycle time) Use RF only (Power4) with no bypass network Decompose into clusters (21264) FX1 LD1 StQ FX/LD 2 LD2 D-Cache FX2 CR 6

Specialized units Issues in Completion/Retirement Shift ALU 2 ALU 0 ALU C (a) A B (A B) C Round/Normalize (b) Out-of-order execution ALU instructions Load/store instructions In-order completion/retirement Precise exceptions Memory coherence and consistency Solutions Reorder buffer Store buffer Load queue snooping (later) A Dynamic Superscalar Processor Fetch /decode buffer Impediments to High IPC Decode In order Out of order In order Issue Execute Finish Complete buffer Reservation stations Reorder/completion buffer Store buffer I-cache Branch FETCH Predictor DECODE Integer Floating-point Media Memory Memory Data ECUTE Reorder Register (ROB) Data COMMIT Store D-cache Queue Retire 7

Superscalar Summary flow Branches, jumps, calls: predict target, direction Fetch alignment cache misses Register data flow Register renaming: RAW/WAR/WAW Memory data flow In-order stores: WAR/WAW Store queue: RAW Data cache misses 8