
ECE/CS 552: Pipelining to Superscalar
Prof. Mikko Lipasti
Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith

Pipelining to Superscalar: Forecast
- Real pipelines
- IBM RISC experience
- The case for superscalar
- Instruction-level parallel machines
- Superscalar pipeline organization
- Superscalar pipeline design

MIPS R2000/R3000 Pipeline

Stage  Phase  Function performed
IF     φ1     Translate virtual instr. addr. using TLB
       φ2     Access I-cache
RD     φ1     Return instruction from I-cache, check tags & parity
       φ2     Read RF; if branch, generate target (separate adder)
ALU    φ1     Start ALU op; if branch, check condition
       φ2     Finish ALU op; if ld/st, translate addr
MEM    φ1     Access D-cache
       φ2     Return data from D-cache, check tags & parity
WB     φ1     Write RF

Intel i486 5-stage Pipeline

Stage  Function performed
IF     Fetch instruction from 32B prefetch buffer (separate fetch unit fills and flushes prefetch buffer)
ID-1   Translate instr. into control signals or microcode address; initiate address generation and memory access
ID-2   Access microcode memory; send microinstruction(s) to execute unit
EX     Execute ALU and memory operations
WB     Write back to RF

Prefetch queue holds 2 x 16B of instructions.

IBM RISC Experience [Agerwala and Cocke 1987]
Internal IBM study: limits of a scalar pipeline?
Memory bandwidth: fetch 1 instr/cycle from I-cache; 40% of instructions are load/store (access D-cache).
Code characteristics (dynamic):
- Loads: 25%
- Stores: 15%
- ALU/RR: 40%
- Branches & jumps: 20% (1/3 unconditional (always taken), 1/3 conditional taken, 1/3 conditional not taken)

IBM Experience: Cache Performance
Assume 100% hit ratio (upper bound); cache latency: I = D = 1 cycle default.
Load and branch scheduling:
- Loads: 25% cannot be scheduled (delay slot empty); 65% can be moved back 1 or 2 instructions; 10% can be moved back 1 instruction
- Branches & jumps: unconditional 100% schedulable (fill one delay slot); conditional 50% schedulable (fill one delay slot)

CPI Optimizations
Goal and impediments: CPI = 1, prevented by pipeline stalls.
No cache bypass of RF, no load/branch scheduling:
- Load penalty: 2 cycles: 0.25 x 2 = 0.5 CPI
- Branch penalty: 2 cycles: 0.2 x 2/3 x 2 = 0.27 CPI
- Total CPI: 1 + 0.5 + 0.27 = 1.77 CPI
Bypass, no load/branch scheduling:
- Load penalty: 1 cycle: 0.25 x 1 = 0.25 CPI
- Total CPI: 1 + 0.25 + 0.27 = 1.52 CPI
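The stall arithmetic above is easy to check mechanically. Below is a minimal Python sketch (my own, not from the slides) that recomputes both totals from the Agerwala/Cocke instruction mix; the function name and structure are illustrative only.

```python
# Instruction mix from the IBM study: 25% loads, 20% branches & jumps,
# of which 2/3 (unconditional + conditional-taken) actually redirect fetch.
LOADS, BRANCHES, TAKEN_FRACTION = 0.25, 0.20, 2.0 / 3.0

def stall_cpi(load_penalty, branch_penalty):
    """Base CPI of 1 plus expected load and taken-branch stall cycles."""
    return (1.0
            + LOADS * load_penalty
            + BRANCHES * TAKEN_FRACTION * branch_penalty)

print(stall_cpi(load_penalty=2, branch_penalty=2))  # ~1.77 (no bypass)
print(stall_cpi(load_penalty=1, branch_penalty=2))  # ~1.52 (bypass into RF)
```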

More CPI Optimizations
Bypass, scheduling of loads/branches:
- Load penalty: 65% + 10% = 75% moved back, no penalty; remaining 25% => 1-cycle penalty: 0.25 x 0.25 x 1 = 0.0625 CPI
- Branch penalty:
  - 1/3 unconditional, 100% schedulable => 1 cycle
  - 1/3 conditional not-taken => no penalty (predict not-taken)
  - 1/3 conditional taken, 50% schedulable => 1 cycle
  - 1/3 conditional taken, 50% unschedulable => 2 cycles
  - 0.20 x [1/3 x 1 + 1/3 x 0.5 x 1 + 1/3 x 0.5 x 2] = 0.167 CPI
- Total CPI: 1 + 0.063 + 0.167 = 1.23 CPI
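Continuing the same sketch (again my own code, under the same assumptions), the scheduled case reproduces the 1.23 CPI figure:

```python
LOADS, BRANCHES = 0.25, 0.20

# Loads: 75% are scheduled into delay slots; the remaining 25% stall 1 cycle.
load_cpi = LOADS * 0.25 * 1.0                                      # 0.0625

# Branches: 1/3 unconditional cost 1 cycle; 1/3 conditional-taken cost
# 1 cycle if schedulable (50%) else 2 cycles; 1/3 not-taken are free.
branch_cpi = BRANCHES * ((1/3) * 1 + (1/3) * (0.5 * 1 + 0.5 * 2))  # ~0.167

print(1.0 + load_cpi + branch_cpi)  # ~1.23 CPI
```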

Simplify Branches
Assume 90% of branches can be PC-relative: no register indirect, no register access; separate adder (like MIPS R3000) => branch penalty reduced.

PC-relative  Schedulable  Penalty
Yes (90%)    Yes (50%)    0 cycles
Yes (90%)    No (50%)     1 cycle
No (10%)     Yes (50%)    1 cycle
No (10%)     No (50%)     2 cycles

Total CPI: 1 + 0.063 + 0.085 = 1.15 CPI = 0.87 IPC
=> 15% overhead from program dependences

Processor Performance
Processor Performance = Time / Program
  = (Instructions / Program) x (Cycles / Instruction) x (Time / Cycle)
  = code size x CPI x cycle time

In the 1980s (decade of pipelining): CPI went from 5.0 to 1.15.
In the 1990s (decade of superscalar): CPI went from 1.15 to 0.5 (best case).
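As a quick illustration of the iron law (my own numbers, chosen only for round figures), cutting CPI cuts execution time proportionally when instruction count and clock are held fixed:

```python
# Iron law: time/program = (instructions/program) x CPI x (time/cycle)
INSTRUCTIONS = 1e9   # hypothetical 1-billion-instruction program
CYCLE_TIME = 1e-9    # hypothetical 1 ns clock, for round numbers only

for label, cpi in (("circa 1980", 5.0), ("circa 1990", 1.15), ("1990s best case", 0.5)):
    print(label, INSTRUCTIONS * cpi * CYCLE_TIME, "seconds")  # 5.0, 1.15, 0.5 s
```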

Revisit Amdahl's Law
[Figure: time breakdown on an N-processor machine; serial fraction h = 1 - f runs on one processor, vectorizable fraction f runs at speedup v]
h = fraction of time in serial code
f = fraction that is vectorizable
v = speedup for f
Overall speedup: Speedup = 1 / ((1 - f) + f/v)

Revisit Amdahl's Law: Sequential Bottleneck
Even if v is infinite:
  lim (v -> infinity) 1 / ((1 - f) + f/v) = 1 / (1 - f)
Performance is limited by the non-vectorizable portion (1 - f).
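A quick numeric sketch (mine, not from the slides) makes the bottleneck concrete: with f = 0.8, speedup can never exceed 1/(1 - f) = 5, no matter how large v gets.

```python
def amdahl(f, v):
    """Overall speedup when a fraction f of the work is sped up by v."""
    return 1.0 / ((1.0 - f) + f / v)

for v in (2, 10, 100, 1_000_000):
    print(v, amdahl(0.8, v))  # 1.67, 3.57, 4.81, ~5.0: saturates at 1/(1-f)
```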

Pipelined Performance Model
N = pipeline depth
g = fraction of time pipeline is filled
1 - g = fraction of time pipeline is not filled (stalled)


Pipelined Performance Model: Tyranny of Amdahl's Law [Bob Colwell]
When g is even slightly below 100%, a big performance hit will result. Stalled cycles are the key adversary and must be minimized as much as possible.
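Reading the model above as a direct analogue of Amdahl's law, with g playing the role of f and pipeline depth N the role of v (an interpretation of mine rather than a formula stated on the slide), a few data points show how sharp the tyranny is for a 10-stage pipeline:

```python
def pipelined_speedup(g, n):
    """Amdahl-style model: depth-n pipeline, filled a fraction g of the time."""
    return 1.0 / ((1.0 - g) + g / n)

for g in (1.00, 0.99, 0.95, 0.90):
    print(g, pipelined_speedup(g, n=10))  # 10.0, 9.2, 6.9, 5.3
```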

Motivation for Superscalar [Agerwala and Cocke]
With N = 6 and f = 0.8, speedup jumps from 3 to 4.3 when s = 2 instead of s = 1 (scalar).
[Figure: speedup vs. vectorizable fraction f; typical range of f marked]

Superscalar Proposal
Moderate the tyranny of Amdahl's Law:
- Ease the sequential bottleneck
- More generally applicable
- Robust (less sensitive to f)
Revised Amdahl's Law: Speedup = 1 / ((1 - f)/s + f/v)
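The revised formula reproduces the speedup jump quoted two slides back; a one-line check (my code):

```python
def revised_amdahl(f, v, s):
    """Speedup when the sequential fraction itself runs s-wide (superscalar)."""
    return 1.0 / ((1.0 - f) / s + f / v)

print(revised_amdahl(f=0.8, v=6, s=1))  # 3.0  (scalar baseline)
print(revised_amdahl(f=0.8, v=6, s=2))  # ~4.3 (2-wide sequential code)
```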

Limits on Instruction-Level Parallelism (ILP)
Weiss and Smith [1984]: 1.58
Sohi and Vajapeyam [1987]: 1.81
Tjaden and Flynn [1970]: 1.86 (Flynn's bottleneck)
Tjaden and Flynn [1973]: 1.96
Uht [1986]: 2.00
Smith et al. [1989]: 2.00
Jouppi and Wall [1988]: 2.40
Johnson [1991]: 2.50
Acosta et al. [1986]: 2.79
Wedig [1982]: 3.00
Butler et al. [1991]: 5.8
Melvin and Patt [1991]: 6
Wall [1991]: 7 (Jouppi disagreed)
Kuck et al. [1972]: 8
Riseman and Foster [1972]: 51 (no control dependences)
Nicolau and Fisher [1984]: 90 (Fisher's optimism)

Superscalar Proposal
Go beyond the single-instruction pipeline and achieve IPC > 1:
- Dispatch multiple instructions per cycle
- Provide a more generally applicable form of concurrency (not just vectors)
- Geared toward sequential code that is hard to parallelize otherwise
- Exploit fine-grained or instruction-level parallelism (ILP)

Classifying ILP Machines [Jouppi, DECWRL 1991]
Baseline scalar RISC:
- Issue parallelism IP = 1 instruction / cycle
- Operation latency OP = 1 cycle
- Peak IPC = 1
[Figure: successive instructions flowing through IF, DE, EX, WB, one per cycle of the baseline machine]

Classifying ILP Machines [Jouppi, DECWRL 1991]
Superpipelined: cycle time = 1/m of baseline
- Issue parallelism IP = 1 instruction / minor cycle
- Operation latency OP = m minor cycles
- Peak IPC = m instructions / major cycle (m x speedup?)

Classifying ILP Machines [Jouppi, DECWRL 1991]
Superscalar:
- Issue parallelism IP = n instructions / cycle
- Operation latency OP = 1 cycle
- Peak IPC = n instructions / cycle (n x speedup?)

Classifying ILP Machines [Jouppi, DECWRL 1991]
VLIW: Very Long Instruction Word
- Issue parallelism IP = n instructions / cycle (encoded in one long word)
- Operation latency OP = 1 cycle
- Peak IPC = n instructions / cycle = 1 VLIW / cycle

Classifying ILP Machines [Jouppi, DECWRL 1991]
Superpipelined-Superscalar:
- Issue parallelism IP = n instructions / minor cycle
- Operation latency OP = m minor cycles
- Peak IPC = n x m instructions / major cycle
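All four classes are fixed by the two parameters IP and OP; a toy table (my own framing, with example widths n = m = 3) makes the peak-throughput comparison explicit by counting issue slots per major cycle:

```python
# (instructions issued per minor cycle, minor cycles per major cycle)
machines = {
    "baseline scalar":                        (1, 1),
    "superpipelined (m=3)":                   (1, 3),
    "superscalar (n=3)":                      (3, 1),
    "superpipelined-superscalar (n=3, m=3)":  (3, 3),
}

for name, (per_minor, minors_per_major) in machines.items():
    print(f"{name}: peak IPC = {per_minor * minors_per_major} per major cycle")
```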

Superscalar vs. Superpipelined
Roughly equivalent performance: if n = m, both have about the same peak IPC; the difference is parallelism exposed in space (superscalar) vs. time (superpipelined).
[Figure: side-by-side timing diagrams of superscalar and superpipelined execution; key: IFetch, Dcode, Execute, Writeback; time in cycles of the base machine]

Superscalar Challenges
[Figure: superscalar pipeline organization — FETCH (I-cache, branch predictor, instruction buffer), DECODE, EXECUTE (integer, floating-point, media, and memory units; reorder buffer (ROB), store queue, D-cache), COMMIT]
Three flows must be managed: instruction flow, register data flow, and memory data flow.