Pipelining to Superscalar


Pipelining to Superscalar
ECE/CS 752, Fall 2017
Prof. Mikko H. Lipasti, University of Wisconsin-Madison

Pipelining to Superscalar: Forecast
- Limits of pipelining
- The case for superscalar
- Instruction-level parallel machines
- Superscalar pipeline organization
- Superscalar pipeline design

Generic Instruction Processing Pipeline

Phase   ALU          LOAD          STORE        BRANCH       Unified stage
IF:     I-cache, PC  I-cache, PC   I-cache, PC  I-cache, PC  1 IF
ID:     decode       decode        decode       decode       2 ID
OF:     rd. reg.     rd. reg.      rd. reg.     rd. reg.     3 RD
EX:     ALU op       addr. gen.,   addr. gen.   addr. gen.   4 ALU
                     rd. mem.                                5 MEM
OS:     wr. reg.     wr. reg.      wr. mem.     wr. PC       6 WB

IBM RISC Experience [Agerwala and Cocke 1987]
Internal IBM study: limits of a scalar pipeline?
Memory bandwidth: fetch 1 instr/cycle from I-cache; 40% of instructions are load/store (D-cache)
Code characteristics (dynamic):
- Loads 25%
- Stores 15%
- ALU/RR 40%
- Branches & jumps 20%: 1/3 unconditional (always taken), 1/3 conditional taken, 1/3 conditional not taken

IBM Experience: Cache Performance
Assume 100% hit ratio (upper bound); cache latency: I = D = 1 cycle default
Load and branch scheduling:
- Loads: 25% cannot be scheduled (delay slot empty), 65% can be moved back 1 or 2 instructions, 10% can be moved back 1 instruction
- Branches & jumps: unconditional 100% schedulable (fill one delay slot); conditional 50% schedulable (fill one delay slot)

CPI Optimizations
Goal and impediments: CPI = 1, prevented by pipeline stalls
No cache bypass of RF, no load/branch scheduling:
- Load penalty: 2 cycles: 0.25 x 2 = 0.5 CPI
- Branch penalty: 2 cycles: 0.2 x 2/3 x 2 = 0.27 CPI
- Total CPI: 1 + 0.5 + 0.27 = 1.77 CPI
Bypass, no load/branch scheduling:
- Load penalty: 1 cycle: 0.25 x 1 = 0.25 CPI
- Total CPI: 1 + 0.25 + 0.27 = 1.52 CPI
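
This arithmetic can be checked directly. A minimal Python sketch (not part of the original slides) reproducing both CPI estimates from the Agerwala/Cocke instruction mix:

```python
# CPI sketch using the dynamic instruction mix from the IBM study.
LOAD_FRAC   = 0.25   # fraction of dynamic instructions that are loads
BRANCH_FRAC = 0.20   # branches & jumps
TAKEN_FRAC  = 2/3    # 1/3 unconditional + 1/3 conditional taken

# No bypass, no scheduling: 2-cycle load and taken-branch penalties.
cpi = 1 + LOAD_FRAC * 2 + BRANCH_FRAC * TAKEN_FRAC * 2
print(f"no bypass:   {cpi:.2f} CPI")   # 1.77

# Register-file bypass cuts the load penalty to 1 cycle.
cpi = 1 + LOAD_FRAC * 1 + BRANCH_FRAC * TAKEN_FRAC * 2
print(f"with bypass: {cpi:.2f} CPI")   # 1.52
```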

More CPI Optimizations
Bypass, scheduling of loads/branches:
- Load penalty: 65% + 10% = 75% moved back, no penalty; 25% => 1 cycle penalty: 0.25 x 0.25 x 1 = 0.0625 CPI
- Branch penalty: 1/3 unconditional, 100% schedulable => 1 cycle; 1/3 cond. not-taken => no penalty (predict not-taken); 1/3 cond. taken, 50% schedulable => 1 cycle; 1/3 cond. taken, 50% unschedulable => 2 cycles
  0.20 x [1/3 x 1 + 1/3 x 0.5 x 1 + 1/3 x 0.5 x 2] = 0.167 CPI
- Total CPI: 1 + 0.063 + 0.167 = 1.23 CPI

Simplify Branches
Assume 90% can be PC-relative: no register indirect, no register access; separate adder (like MIPS R3000) => branch penalty reduced:

PC-relative   Schedulable   Penalty
Yes (90%)     Yes (50%)     0 cycles
Yes (90%)     No (50%)      1 cycle
No (10%)      Yes (50%)     1 cycle
No (10%)      No (50%)      2 cycles

Total CPI: 1 + 0.063 + 0.085 = 1.15 CPI = 0.87 IPC
15% overhead from program dependences
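
A sketch continuing the same calculation through scheduling and PC-relative branches; the 0.085 CPI branch penalty for the PC-relative case is quoted from the slide rather than re-derived:

```python
# Continuing the CPI sketch: load/branch scheduling, then PC-relative branches.
# Load penalty: 75% of loads are scheduled away; 25% keep a 1-cycle penalty.
load_pen = 0.25 * 0.25 * 1
# Branch penalty with delay-slot scheduling, per the slide above.
branch_pen = 0.20 * (1/3 * 1 + 1/3 * 0.5 * 1 + 1/3 * 0.5 * 2)
print(f"scheduled:   {1 + load_pen + branch_pen:.2f} CPI")   # 1.23

# With 90% PC-relative branches and a separate adder, the branch
# penalty drops to 0.085 CPI (value taken from the slide).
print(f"pc-relative: {1 + load_pen + 0.085:.2f} CPI")        # 1.15 -> 0.87 IPC
```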

Limits of Pipelining
IBM RISC experience: control and data dependences add 15%; best-case CPI of 1.15, IPC of 0.87
Deeper pipelines (higher frequency) magnify dependence penalties
This analysis assumes 100% cache hit rates; hit rates approach 100% for some programs, but many important programs have much worse hit rates (later!)

CPU, circa 1986: MIPS R2000

Stage  Phase  Function performed
IF     φ1     Translate virtual instr. addr. using TLB
       φ2     Access I-cache
RD     φ1     Return instruction from I-cache, check tags & parity
       φ2     Read RF; if branch, generate target (separate adder)
ALU    φ1     Start ALU op; if branch, check condition
       φ2     Finish ALU op; if ld/st, translate addr.
MEM    φ1     Access D-cache
       φ2     Return data from D-cache, check tags & parity
WB     φ1     Write RF

"MIPS R2000, ~ most elegant pipeline ever devised" - J. Larus
Enablers: RISC ISA, pipelining, on-chip cache memory
Source: https://imgtec.com

Processor Performance

Processor Performance = Time / Program
                      = (Instructions / Program) x (Cycles / Instruction) x (Time / Cycle)
                        [code size]                [CPI]                    [cycle time]

In the 1980s (decade of pipelining): CPI: 5.0 => 1.5
In the 1990s (decade of superscalar): CPI: 1.5 => 0.5 (best case)
In the 2000s (decade of multicore): focus on thread-level parallelism, CPI => 0.33 (best case)
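
A small sketch of this "iron law" decomposition; the program size and clock rate below are hypothetical, chosen only for illustration:

```python
# Iron law: time/program = (instructions/program) x CPI x (time/cycle).
def exec_time_s(instructions: float, cpi: float, cycle_time_s: float) -> float:
    return instructions * cpi * cycle_time_s

# Hypothetical 1-billion-instruction program on a 100 MHz (10 ns cycle) machine:
print(exec_time_s(1e9, 1.77, 10e-9))   # 17.7 s at the unoptimized 1.77 CPI
print(exec_time_s(1e9, 1.15, 10e-9))   # 11.5 s at the best-case 1.15 CPI
```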

Amdahl's Law
[Figure: execution time on N processors, split between the serial portion (fraction 1-f) and the vectorizable portion (fraction f)]
h = fraction of time in serial code
f = fraction that is vectorizable
v = speedup for f

Overall speedup = 1 / ((1 - f) + f/v)
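
A one-function sketch of the law; the f = 0.8 example values are illustrative, not from the slides:

```python
def amdahl_speedup(f: float, v: float) -> float:
    """Overall speedup when fraction f of the work is sped up by factor v."""
    return 1.0 / ((1.0 - f) + f / v)

print(amdahl_speedup(0.8, 6))     # 3.0
print(amdahl_speedup(0.8, 1e9))   # ~5.0: even infinite v is capped by 1/(1-f)
```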

Revisit Amdahl's Law
Sequential bottleneck: even if v is infinite,

lim(v -> infinity) 1 / ((1 - f) + f/v) = 1 / (1 - f)

so performance is limited by the nonvectorizable portion (1 - f).

Pipelined Performance Model
Same picture as Amdahl's law, with N = pipeline depth:
g = fraction of time pipeline is filled
1 - g = fraction of time pipeline is not filled (stalled)
By analogy, Speedup = 1 / ((1 - g) + g/N)

Pipelined Performance Model: Tyranny of Amdahl's Law [Bob Colwell]
When g is even slightly below 100%, a big performance hit results. Stalled cycles are the key adversary and must be minimized as much as possible.
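
To see the tyranny numerically, a sketch evaluating the pipelined-speedup formula for a hypothetical 6-deep pipeline:

```python
# Pipelined speedup has the same form as Amdahl's law with N = pipeline depth.
def pipeline_speedup(g: float, depth: int) -> float:
    return 1.0 / ((1.0 - g) + g / depth)

for g in (1.00, 0.99, 0.95, 0.90):
    print(f"g = {g:.2f}: {pipeline_speedup(g, 6):.2f}x")
# g = 1.00: 6.00x, 0.99: 5.71x, 0.95: 4.80x, 0.90: 4.00x
```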

Motivation for Superscalar [Agerwala and Cocke]
[Figure: speedup vs. vectorizable fraction f, with the typical range of f marked]
Speedup jumps from 3 to 4.3 for N = 6, f = 0.8, with s = 2 instead of s = 1 (scalar)

Superscalar Proposal
Moderate the tyranny of Amdahl's Law: ease the sequential bottleneck; more generally applicable; robust (less sensitive to f)
Revised Amdahl's Law:

Speedup = 1 / ((1 - f)/s + f/v)
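
A sketch of the revised law, reproducing the 3 => 4.3 jump cited two slides up:

```python
def revised_amdahl(f: float, v: float, s: float) -> float:
    """Speedup when the serial fraction (1-f) itself runs with speedup s."""
    return 1.0 / ((1.0 - f) / s + f / v)

print(revised_amdahl(0.8, 6, 1))   # 3.0  (scalar pipeline, s = 1)
print(revised_amdahl(0.8, 6, 2))   # ~4.3 (2-wide superscalar on serial code)
```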

Limits on Instruction Level Parallelism (ILP)

Weiss and Smith [1984]       1.58
Sohi and Vajapeyam [1987]    1.81
Tjaden and Flynn [1970]      1.86 (Flynn's bottleneck)
Tjaden and Flynn [1973]      1.96
Uht [1986]                   2.00
Smith et al. [1989]          2.00
Jouppi and Wall [1988]       2.40
Johnson [1991]               2.50
Acosta et al. [1986]         2.79
Wedig [1982]                 3.00
Butler et al. [1991]         5.8
Melvin and Patt [1991]       6
Wall [1991]                  7 (Jouppi disagreed)
Kuck et al. [1972]           8
Riseman and Foster [1972]    51 (no control dependences)
Nicolau and Fisher [1984]    90 (Fisher's optimism)

Superscalar Proposal
Go beyond the single-instruction pipeline and achieve IPC > 1: dispatch multiple instructions per cycle. Provide a more generally applicable form of concurrency (not just vectors), geared for sequential code that is hard to parallelize otherwise. Exploit fine-grained or instruction-level parallelism (ILP).

Classifying ILP Machines [Jouppi, DECWRL 1991]
Baseline scalar RISC:
- Issue parallelism IP = 1 inst / cycle
- Operation latency OP = 1 cycle
- Peak IPC = 1
[Figure: successive instructions 1-6 flowing through IF/DE/EX/WB, one per cycle, over cycles 0-9 of the baseline machine]

Classifying ILP Machines [Jouppi, DECWRL 1991]
Superpipelined: cycle time = 1/m of baseline
- Issue parallelism IP = 1 inst / minor cycle
- Operation latency OP = m minor cycles
- Peak IPC = m instr / major cycle (m x speedup?)
[Figure: instructions 1-6 entering IF/DE/EX/WB at minor-cycle intervals]

Classifying ILP Machines [Jouppi, DECWRL 1991]
Superscalar:
- Issue parallelism IP = n inst / cycle
- Operation latency OP = 1 cycle
- Peak IPC = n instr / cycle (n x speedup?)
[Figure: groups of n instructions entering IF/DE/EX/WB together each cycle]

Classifying ILP Machines [Jouppi, DECWRL 1991]
VLIW: Very Long Instruction Word
- Issue parallelism IP = n inst / cycle
- Operation latency OP = 1 cycle
- Peak IPC = n instr / cycle = 1 VLIW / cycle
[Figure: one long instruction word per cycle through IF/DE/EX/WB]

Classifying ILP Machines [Jouppi, DECWRL 1991]
Superpipelined-Superscalar:
- Issue parallelism IP = n inst / minor cycle
- Operation latency OP = m minor cycles
- Peak IPC = n x m instr / major cycle
[Figure: n instructions issued per minor cycle, m minor cycles per major cycle]
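
A sketch tabulating peak throughput under Jouppi's IP/OP model; the widths n = 3 and m = 3 are arbitrary examples, not from the slides:

```python
# Peak instructions completed per *baseline* (major) cycle for each class:
# ip = instructions issued per (minor) cycle, m = minor cycles per major cycle.
machines = {
    "baseline scalar":            (1, 1),
    "superpipelined (m=3)":       (1, 3),
    "superscalar (n=3)":          (3, 1),
    "superpipelined-superscalar": (3, 3),
}
for name, (ip, m) in machines.items():
    print(f"{name}: peak {ip * m} instr / major cycle")
```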

Superscalar vs. Superpipelined
Roughly equivalent performance: if n = m, both have about the same IPC; the difference is parallelism exposed in space vs. in time.
[Figure: superscalar vs. superpipelined execution of IFetch/Dcode/Execute/Writeback over cycles 0-13 of the base machine]

Superpipelining: Result Latency
"Superpipelining" [Jouppi, 1989] essentially describes a pipelined execution stage.
[Figure: Jouppi's base machine vs. underpipelined vs. superpipelined machine]
Underpipelined machines cannot issue instructions as fast as they are executed.
Note: the key characteristic of superpipelined machines is that results are not available to the M-1 successive instructions.

Superscalar Challenges
[Figure: superscalar pipeline organization: I-cache and branch predictor feed FETCH into an instruction buffer; DECODE dispatches to integer, floating-point, media, and memory pipelines (EXECUTE); a reorder buffer (ROB) and store queue feed COMMIT and the D-cache]
Three flows to manage: instruction flow, register data flow, and memory data flow.