Unit 8: Superscalar Pipelines


CIS 501: Computer Architecture
Unit 8: Superscalar Pipelines
Prof. Milo Martin

Slides developed by Milo Martin & Amir Roth at the University of Pennsylvania, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.

A Key Theme: Parallelism

Previously: pipeline-level parallelism
- Work on the execute of one instruction in parallel with the decode of the next
Next: instruction-level parallelism (ILP)
- Execute multiple independent instructions fully in parallel
Then: static & dynamic scheduling
- Extract much more ILP
Data-level parallelism (DLP)
- Single instruction, multiple data (one insn., four 64-bit adds)
Thread-level parallelism (TLP)
- Multiple software threads running on multiple cores

This Unit: (In-Order) Superscalar Pipelines

- The idea of instruction-level parallelism
- Superscalar hardware issues
  - Bypassing and the register file
  - Stall logic
  - Fetch
- Superscalar vs. VLIW/EPIC

Readings

- Textbook (MA:FSPTCM): Sections 3.1, 3.2 (but not the Sidebar in 3.2), 3.5.1; Sections 4.2, 4.3, 5.3.3

Scalar Pipeline & the Flynn Bottleneck

So far we have looked at scalar pipelines
- One instruction per stage
- With control speculation, bypassing, etc.
The performance limit (aka the "Flynn Bottleneck") is CPI = IPC = 1
- The limit is never even achieved (hazards)
- Diminishing returns from super-pipelining (hazards + overhead)

An Opportunity

But consider:
  ADD r1, r2 -> r3
  ADD r4, r5 -> r6
Why not execute them at the same time? (We can!)
What about:
  ADD r1, r2 -> r3
  ADD r4, r3 -> r6
In this case, dependences prevent parallel execution.
What about three instructions at a time? Or four instructions at a time?

What Checking Is Required?

- For two instructions: 2 checks
- For three instructions: 6 checks
- For four instructions: 12 checks
  - The fourth instruction, ADD src1_4, src2_4 -> dest_4, alone needs 6 checks: its two source registers against each of the three earlier destinations
- Plus checking for load-to-use stalls from the prior N loads
A small sketch of this cross-check follows below.
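To make the quadratic growth concrete, here is a minimal Python sketch of the decode-stage cross-check, under an assumed (dest, src1, src2) instruction encoding that is not the slides' notation. Each instruction compares its two sources against the destination of every earlier instruction in the group, for N*(N-1) comparisons in an N-wide group.

    # Minimal sketch: RAW cross-check within an issue group.
    # Assumed encoding: each instruction is (dest, src1, src2),
    # with registers as small integers.

    def cross_check(group):
        """Return (instructions that must stall, number of comparisons).

        Instruction j checks its 2 sources against the destinations of
        instructions 0..j-1: 2 * (0+1+...+(N-1)) = N*(N-1) checks total.
        """
        stalled, checks = set(), 0
        for j, (_, src1, src2) in enumerate(group):
            for i in range(j):
                dest_i = group[i][0]
                checks += 2                      # src1 and src2 vs. dest_i
                if dest_i in (src1, src2):
                    stalled.add(j)               # j depends on i: split group
        return stalled, checks

    # ADD r1,r2 -> r3 ; ADD r4,r3 -> r6: the second insn reads r3
    print(cross_check([(3, 1, 2), (6, 4, 3)]))   # ({1}, 2)
    # Four independent instructions: 12 checks, none stall
    print(cross_check([(10 + k, 1, 2) for k in range(4)]))  # (set(), 12)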

Multiple-Issue or Superscalar Pipeline

Overcome this limit using multiple issue
- Also called superscalar
- Two instructions per stage at once, or three, or four, or eight
- "Instruction-Level Parallelism (ILP)" [Fisher, IEEE TC '81]
Today, typically 4-wide (Intel Core i7, AMD Opteron)
- Some more (Power5 is 5-issue; Itanium is 6-issue)
- Some less (dual-issue is common for simple cores)

How do we build such superscalar hardware?

A Typical Dual-Issue Pipeline (1 of 2)

Fetch an entire 16- or 32-byte cache block
- 4 to 8 instructions (assuming a 4-byte average instruction length)
- Predict a single branch per cycle
Parallel decode
- Need to check for conflicting instructions: is the output register of I1 an input register to I2? (see the sketch after these slides)
- Other stalls, too (for example, the load-use delay)

A Typical Dual-Issue Pipeline (2 of 2)

Multi-ported register file
- Larger area, latency, power, cost, complexity
Multiple execution units
- Simple adders are easy, but bypass paths are expensive
Memory unit
- A single load per cycle (stall at decode) is probably okay for dual issue
- Alternative: add a read port to the data cache
  - Larger area, latency, power, cost, complexity
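Continuing the sketch above, the two decode-stage checks the dual-issue slide names (the I1-output-vs-I2-input conflict, and the load-use delay) might look as follows; the dict fields, including the is_load flag, are illustrative assumptions.

    # Sketch of the two dual-issue decode checks (assumed fields).

    def can_issue_together(i1, i2):
        """I2 may issue beside I1 only if I1's output feeds no I2 input."""
        return i1["dest"] not in (i2["src1"], i2["src2"])

    def load_use_stall(older, newer):
        """Even in a later cycle, a load's consumer stalls one extra cycle:
        the loaded value is not available until after memory access."""
        return older["is_load"] and older["dest"] in (newer["src1"], newer["src2"])

    i1 = {"dest": 3, "src1": 8, "src2": 0, "is_load": True}   # LW        -> r3
    i2 = {"dest": 6, "src1": 4, "src2": 3, "is_load": False}  # ADD r4,r3 -> r6
    print(can_issue_together(i1, i2))  # False: split the pair at decode
    print(load_use_stall(i1, i2))      # True: and stall one further cycle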

Superscalar Implementation Challenges

Superscalar Challenges - Front End

Superscalar instruction fetch
- Modest: fetch multiple instructions per cycle
- Aggressive: buffer instructions and/or predict multiple branches
Superscalar instruction decode
- Replicate decoders
Superscalar instruction issue
- Determine when instructions can proceed in parallel
- More complex stall logic: order N^2 for an N-wide machine
- Not all combinations of instruction types are possible
Superscalar register read
- One port for each register read (4-wide superscalar -> 8 read ports)
- Each port needs its own set of address and data wires
- Latency & area grow with (#ports)^2

Superscalar Challenges - Back End

Superscalar instruction execution
- Replicate arithmetic units (but not all of them, say, the integer divider)
- Perhaps multiple cache ports (slower access, higher energy)
  - Only for 4-wide or larger (why? only ~35% of insns are loads/stores)
Superscalar bypass paths
- More possible sources for data values
- Order N^2 * P for an N-wide machine with execute-pipeline depth P
Superscalar instruction register writeback
- One write port per instruction that writes a register
- Example: 4-wide superscalar -> 4 write ports
Fundamental challenge: the amount of ILP (instruction-level parallelism) in the program
- The compiler must schedule code and extract parallelism

Superscalar Bypass

N^2 bypass network
- (N+1)-input muxes at each ALU input
- N^2 point-to-point connections
- Routing lengthens wires
- Heavy capacitive load
And this is just one bypass stage (MX)!
- There is also WX bypassing
- Even more for deeper pipelines
One of the big problems of superscalar
- Why? It sits on the critical path of the single-cycle bypass & execute loop
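As a rough feel for why the bypass network dominates, the following sketch tallies mux inputs and point-to-point wires for an N-wide machine with P bypass-producing stages (P = 2 for MX plus WX). The counting convention (one forwarded value per ALU per stage, plus the register-file input) is an assumption for illustration.

    # Rough bookkeeping for an N-wide, depth-P bypass network.

    def bypass_cost(n_wide, stages):
        # Each ALU input muxes the register-file value plus one
        # forwarded value per ALU per bypass stage; with stages == 1
        # this is the slide's "(N+1)-input mux at each ALU input".
        mux_inputs = 1 + n_wide * stages
        wires = (n_wide ** 2) * stages        # point-to-point connections
        return mux_inputs, wires

    for n in (1, 2, 4, 8):
        print(n, bypass_cost(n, stages=2))
    # 1 (3, 2) / 2 (5, 8) / 4 (9, 32) / 8 (17, 128): wires grow as N^2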

Not All N^2 Created Equal

N^2 bypass vs. N^2 stall logic & dependence cross-check
- Which is the bigger problem?
N^2 bypass, by far
- 64-bit quantities (vs. 5-bit register specifiers)
- Multiple levels (MX, WX) of bypass (vs. 1 level of stall logic)
- Must fit in one clock period with the ALU (vs. not)
The dependence cross-check is not even the 2nd biggest N^2 problem
- The register file is also an N^2 problem (think latency, where N is #ports)
- And it is more serious than the cross-check

Mitigating N^2 Bypass & Register File

Clustering: mitigates N^2 bypass
- Group ALUs into K clusters
- Full bypassing within a cluster
- Limited bypassing between clusters, with a 1- or 2-cycle delay
- Can hurt IPC, but allows a faster clock
- (N/K) + 1 inputs at each mux; (N/K)^2 bypass paths in each cluster
Steering: the key to performance
- Steer dependent insns to the same cluster
Cluster the register file, too
- Replicate: a register file per cluster
- All register writes update all replicas
- Fewer read ports; each replica serves only its own cluster

Mitigating the N^2 RegFile: Clustering++

[Figure: a two-cluster pipeline, cluster 0 with register file RF0 and cluster 1 with RF1, sharing the data memory]

Clustering: split the N-wide execution pipeline into K clusters
- With a centralized register file: 2N read ports and N write ports
Clustered register file: extend clustering to the register file
- Replicate the register file (one replica per cluster)
- Each register file supplies register operands to just its cluster
- All register writes go to all register files (to keep them in sync)
- Advantage: fewer read ports per register file!
- K register files, each with 2N/K read ports and N write ports

Another Challenge: Superscalar Fetch

What is involved in fetching multiple instructions per cycle?
In the same cache block? No problem
- A 64-byte cache block is 16 instructions (~4 bytes per instruction)
- Favors a larger block size (independent of hit rate)
What if the next instruction is the last instruction in a block?
- Fetch only one instruction that cycle
- Or, some processors allow fetching from 2 consecutive blocks
What about taken branches?
- How many instructions can be fetched on average?
- Average number of instructions per taken branch?
  - Assume 20% branches, of which 50% are taken: ~10 instructions
Consider a 5-instruction loop on a 4-issue processor
- Without smarter fetch, ILP is limited to 2.5 (not 4, which is bad)
A worked version of both calculations follows below.
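Both numbers on the fetch slide fall out of two lines of arithmetic, sketched here for concreteness:

    # Average instructions per taken branch: 20% branches * 50% taken
    # means 10% of instructions redirect fetch, i.e., one taken branch
    # every 1 / 0.10 = 10 instructions on average.
    branch_frac, taken_frac = 0.20, 0.50
    print(1 / (branch_frac * taken_frac))     # 10.0

    # 5-instruction loop on a 4-wide fetch that stops at a taken branch:
    # cycle 1 fetches insns 1-4, cycle 2 fetches insn 5 (the back branch),
    # so each iteration needs ceil(5/4) = 2 cycles.
    loop_len, width = 5, 4
    cycles_per_iteration = -(-loop_len // width)   # ceiling division: 2
    print(loop_len / cycles_per_iteration)         # 2.5 insns per cycle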

Increasing Superscalar Fetch Rate

[Figure: an instruction queue between fetch and decode, which also serves as the loop stream detector]

Option #1: over-fetch and buffer
- Add an instruction queue between fetch and decode (18 entries in the Intel Core 2)
- Compensates for cycles that fetch fewer than the maximum number of instructions
- Decouples the front end (fetch) from the back end (execute); see the toy model after these slides
Option #2: loop stream detector (Core 2, Core i7)
- Put the entire loop body into a small cache
- Core 2: 18 macro-ops, up to four taken branches
- Core i7: 28 micro-ops (avoids re-decoding the macro-ops!)
- Any branch mis-prediction requires a normal re-fetch
Other options: next-next-block prediction, trace cache

Multiple-Issue Implementations

Statically scheduled (in-order) superscalar
- What we've talked about thus far
+ Executes unmodified sequential programs
- Hardware must figure out what can be done in parallel
- E.g., Pentium (2-wide), UltraSPARC (4-wide), Alpha 21164 (4-wide)
Very Long Instruction Word (VLIW)
- The compiler identifies independent instructions; requires a new ISA
+ Hardware can be simple and perhaps lower power
- E.g., Transmeta Crusoe (4-wide)
Variant: Explicitly Parallel Instruction Computing (EPIC)
- A somewhat more flexible encoding & some hardware to help the compiler
- E.g., Intel Itanium (6-wide)
Dynamically scheduled superscalar (next topic)
- Hardware extracts more ILP by on-the-fly reordering
- E.g., Core 2, Core i7 (4-wide), Alpha 21264 (4-wide)

Trends in Single-Processor Multiple Issue

            486   Pentium  PentiumII  Pentium4  Itanium  ItaniumII  Core2
    Year    1989  1993     1998       2001      2002     2004       2006
    Width   1     2        3          3         3        6          4

Issue width has saturated at 4-6 for high-performance cores
- The canceled Alpha 21464 was to be 8-wide issue
- There is not enough ILP to justify going wider
- Hardware or compiler scheduling is needed to exploit 4-6-wide issue effectively
- More on this in the next unit
For high performance-per-watt cores (say, smartphones)
- Typically 2-wide superscalar (but increasing each generation)

Multiple Issue Redux

Multiple issue
- Exploits instruction-level parallelism (ILP) beyond pipelining
- Improves IPC, but perhaps at some clock & energy penalty
- 4-6-wide issue is about the peak issue width currently justifiable
- Low-power implementations today are typically 2-wide superscalar
Problem spots
- N^2 bypass & register file -> clustering
- Fetch + branch prediction -> buffering, loop streaming, trace cache
- N^2 dependence check -> VLIW/EPIC (but it is unclear how key this is)
Implementations
- Superscalar vs. VLIW/EPIC
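To close, here is a toy model of Option #1 from the fetch-rate slide above: over-fetching a whole cache block into a queue lets decode sustain its full width even though every other cycle fetches just one instruction (a taken branch). The queue size matches the Core 2's 18 entries, but the fetch trace and widths are illustrative assumptions, not measured behavior.

    # Toy model: an instruction queue decouples fetch from decode.
    from collections import deque

    QUEUE_SIZE, DECODE_WIDTH = 18, 4
    fetch_trace = [8, 1, 8, 1, 8, 1]   # whole block, then a taken branch

    queue, decoded = deque(), 0
    for fetched in fetch_trace:
        # Enqueue what fits (back-pressure on fetch when the queue is full)
        queue.extend([None] * min(fetched, QUEUE_SIZE - len(queue)))
        for _ in range(min(DECODE_WIDTH, len(queue))):
            queue.popleft()                 # decode drains up to 4 per cycle
            decoded += 1

    no_queue = sum(min(f, DECODE_WIDTH) for f in fetch_trace)
    print(decoded / len(fetch_trace))       # 4.0 insns/cycle with the queue
    print(no_queue / len(fetch_trace))      # 2.5 insns/cycle without it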

This Unit: (In-Order) Superscalar Pipelines

- The idea of instruction-level parallelism
- Superscalar hardware issues
  - Bypassing and the register file
  - Stall logic
  - Fetch
- Superscalar vs. VLIW/EPIC