CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines

Scalar Pipeline Limitations (Shen + Lipasti 4.1)

1. Bounded Performance

   P = 1/T = 1/(IC × CPI × cycletime) = (IPC × frequency)/IC

   - IPC = instructions per cycle, limited to 1 in a scalar pipeline
   - need to start multiple instructions per cycle for IPC > 1
   - IC is fixed by the ISA
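As a quick numeric illustration of the bounded-performance equation (the program size, IPC, and clock rate below are made-up values, not from the notes):

```c
#include <stdio.h>

/* Evaluate P = (IPC x frequency) / IC for one hypothetical workload:
   a 10^9-instruction program on a 2 GHz scalar pipeline with IPC = 0.8. */
int main(void) {
    double IC   = 1e9;   /* instruction count: fixed by the ISA/program */
    double IPC  = 0.8;   /* instructions per cycle, at most 1 if scalar */
    double freq = 2e9;   /* clock frequency in Hz */

    double T = IC / (IPC * freq);   /* execution time: 0.625 s */
    double P = 1.0 / T;             /* performance: 1.6 runs/s */
    printf("T = %.3f s, P = %.2f runs/s\n", T, P);
    return 0;
}
```

Doubling IPC (a 2-wide pipeline at the same clock) halves T, which is the whole motivation for starting more than one instruction per cycle.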

Scalar Pipeline Limitations

   - frequency increases are limited by dynamic power consumption
   - stage sizes cannot get much smaller and still achieve useful work
     (currently around 10 gates)
   - need a parallel pipeline

2. Inefficient Unification

   - subcomputations vary in speed: e.g. integer add is fast (1/2 cycle);
     f.p. division and memory operations are slow
   - need specialized execution units
   - need a diversified pipeline

Scalar Pipeline Limitations

3. Rigid Sequencing

   - if instruction i stalls due to a dependency, all following
     instructions i+1, i+2, ... also stall
   - instructions i+1, i+2, ... may not share the dependency, e.g.

        fmul f3,f1,f2
        fadd f5,f3,f4    (RAW on f3: stalls until fmul finishes)
        add  r3,r1,r2    (no f.p. dependency, but stalls behind fadd)
        sub  r3,r3,#1    (depends only on add, yet stalls too)

   [diagram: the fadd's stall propagates back through add and sub]

Scalar Pipeline Limitations

   - allowing out-of-order execution can hide stall penalties
   - need a dynamic pipeline

Superscalar Concepts

1. Pipeline Parallelism

   - temporal parallelism
   - spatial parallelism

   Shen + Lipasti, Fig 4.2(d)

Superscalar Concepts

   Intel Pentium pipeline (s=2, in-order), 1993

   - requires added register ports
   - uses an 8-way interleaved cache for parallel access
     (accesses to the same bank are serialized)
   - the V pipeline handles only simple instructions; the U pipeline
     handles all instructions

   Shen + Lipasti, Fig 4.4(b)

Superscalar Concepts

2. Pipeline Diversification

   symmetric EX stages
   - all instructions incur the maximum penalty
   - requires more forwarding paths, or stalls

   asymmetric EX stages
   - the mix of unit types should match the dynamic instruction mix
   - enough units to exploit the program's ILP
   - e.g. CDC 6600 (1964) has 10 functional units

Superscalar Concepts

   e.g. Motorola 88110 (1992) has 10 functional units:

      2 integer       single cycle
      1 bit-field     single cycle
      2 graphics      2 cycles (pipelined)
      1 load/store    3 cycles (pipelined)
      1 multiplier    3 cycles (pipelined)
      1 f.p. add      3 cycles (pipelined)
      1 divide        3 cycles (not pipelined)
      1 branch        N/A

Superscalar Concepts

3. Dynamic Pipeline

   - scalar pipeline: interstage buffers hold one instruction,
     typically for 1 cycle
   - superscalar pipeline: interstage buffers hold n instructions

   Shen + Lipasti, Fig 4.8(b)

Superscalar Concepts

   if n entries proceed in lock-step:
   - a stall for one instruction stalls all n entries
     (plus all preceding stages)

Superscalar Concepts

   if entries are independent:
   - an instruction may stall without affecting other instructions
     in the buffer
   - if following instructions are to proceed, the buffer size must
     exceed n
   - instructions may now exit the buffer out of order

   Shen + Lipasti, Fig 4.8(c)

Superscalar Concepts

   example: dynamic pipeline (s=3)
   - instructions enter the dispatch buffer in order
   - the reorder buffer (ROB) ensures writeback is performed in
     program order (necessary for precise exceptions)
   - reorder buffer entries are allocated at dispatch

   Shen + Lipasti, Fig 4.9

Superscalar Pipeline Structure

   subtasks ("stages"):
   1. fetch
   2. decode
   3. dispatch
   4. execute
   5. complete (update machine state, i.e. registers)
   6. retire (update memory)

1. fetch

   fetch S (the pipeline width) instructions per cycle from the I-cache

   Shen + Lipasti, Fig 4.11

   requires S instructions per I-cache row

   challenges:
   - misalignment (fetch group spans rows: requires multiple cycles)
   - CISC instructions (variable length)
   - control-flow instructions

   Shen + Lipasti, Fig 4.12

   misalignment solutions

   software: compiler aligns branch targets
   - makes object code tuned to a specific cache organization

   hardware: added logic to support wrapping at the end of rows
   (but not at the end of cache lines)

   example: RS/6000 I-cache (1990)
   - 4 instructions/row, 4 rows/line
   - instructions interleaved across 4 sub-arrays

   Shen + Lipasti, Fig 4.13

   T-logic (one per sub-array) detects a misaligned address and
   increments the index

   - e.g. IFAR (instruction fetch address register) indexes to A4:
     all 4 instructions (A4, A5, A6, A7) come from the same row
   - e.g. IFAR indexes to A10: two instructions (A10, A11) from row 2,
     two instructions (A12, A13) from row 3
   - can't cross cache line boundaries
   - two-way set associative (A and B blocks)

   calculating average instructions fetched per cycle

   16 possible start addresses in a cache line:
   - A0-A12: 4 per cycle
   - A13: 3 per cycle
   - A14: 2 per cycle
   - A15: 1 per cycle

   avg instr/cycle = ((13 × 4) + (1 × 3) + (1 × 2) + (1 × 1)) / 16
                   = 58/16 = 3.625
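The same average can be checked by brute force over all 16 start slots; this small sketch just restates the slide's calculation:

```c
#include <stdio.h>

/* Brute-force check of the RS/6000 fetch-bandwidth average:
   16 possible start slots in a cache line; fetch up to 4 instructions,
   but never past the end of the line (slot 15). */
int main(void) {
    int total = 0;
    for (int start = 0; start < 16; start++) {
        int fetched = 16 - start;       /* instructions left in the line */
        if (fetched > 4) fetched = 4;   /* fetch width is 4 */
        total += fetched;
    }
    printf("average = %d/16 = %.3f instructions/cycle\n",
           total, total / 16.0);        /* average = 58/16 = 3.625 */
    return 0;
}
```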

   control-flow instructions

   - branch instructions in a fetch group may result in discarding the
     following instructions, which reduces fetch bandwidth

   solution: profiling
   - a profiling or JIT compiler re-organizes basic blocks so that
     fall-through (branch not taken) is the most common case
     (a C-level sketch of the idea follows below)
   - doesn't help unconditional branches
   - other techniques: branch folding, trace cache (more later)
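A minimal C-level sketch of the layout idea, using GCC/Clang's `__builtin_expect` hint to stand in for profile feedback (the function and data here are invented for illustration):

```c
#include <stdio.h>

/* Tell the compiler which way a branch usually goes so it can make the
   hot path the fall-through path and move the cold block out of line.
   __builtin_expect is a GCC/Clang extension; a profiling or JIT
   compiler derives the same information from measured branch outcomes. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

int process(const int *buf, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (UNLIKELY(buf[i] < 0))   /* rare error path, laid out off the hot path */
            return -1;
        sum += buf[i];              /* common case stays on the fall-through */
    }
    return sum;
}

int main(void) {
    int data[4] = {1, 2, 3, 4};
    printf("%d\n", process(data, 4));  /* prints 10 */
    return 0;
}
```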

2. decode

   tasks: identify instruction boundaries, instruction types,
   interdependencies

   RISC
   - fixed-length instructions: identifying boundaries is easy
   - regular instruction format: a common op-code field makes
     identifying instruction types easy
   - detecting RAW register hazards within a fetch group (sketch below):

        # comparators = sum_{i=1..S} 2(i-1) = S(S-1) = O(S^2)

   - the number of register ports and operand busses increases
     linearly with S
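A sketch of that O(S^2) dependence check for a 4-wide fetch group; the instruction encoding is invented for illustration:

```c
#include <stdio.h>

#define S 4  /* fetch-group / pipeline width */

struct instr { int dst, src1, src2; };

/* Count the RAW checks a decoder does within one fetch group:
   instruction i compares its 2 source registers against the
   destinations of all i-1 earlier instructions, for a total of
   sum 2(i-1) = S(S-1) = O(S^2) comparators. */
int main(void) {
    struct instr g[S] = {
        {3, 1, 2},   /* add r3,r1,r2 */
        {5, 3, 4},   /* add r5,r3,r4  (RAW on r3) */
        {6, 1, 1},   /* independent */
        {7, 5, 6},   /* RAW on r5 and r6 */
    };
    int comparisons = 0;
    for (int i = 1; i < S; i++) {
        for (int j = 0; j < i; j++) {
            comparisons += 2;
            if (g[i].src1 == g[j].dst)
                printf("RAW: instr %d src1 <- instr %d\n", i, j);
            if (g[i].src2 == g[j].dst)
                printf("RAW: instr %d src2 <- instr %d\n", i, j);
        }
    }
    printf("%d comparators for S=%d (S(S-1)=%d)\n", comparisons, S, S*(S-1));
    return 0;
}
```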

   CISC
   - decode takes multiple cycles/stages, e.g. 5 stages for the Intel P6
     microarchitecture (Pentium Pro, ...)
   - variable-length instructions: must examine multiple bytes in parallel
   - instructions are translated to an internal 3-address RISC
     instruction set for pipelining (sketched below)
     - e.g. VAX, 1985
     - e.g. AMD K5: ROPs = RISC operations
     - e.g. Intel P6: μops = micro-operations
     - 1 IA32 instruction -> 1.5 to 2 μops (on average)
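A hedged illustration of the translation: how a read-modify-write IA32 instruction might crack into three 3-address micro-ops. The micro-op format below is invented for the sketch; real P6 μops and K5 ROPs differ in detail.

```c
#include <stdio.h>

/* Illustrative only: "add [ebx], eax" cracked into load / ALU / store
   micro-ops, each in 3-address RISC form with a temporary register. */
enum op { UOP_LOAD, UOP_ADD, UOP_STORE };

struct uop { enum op op; const char *dst, *src1, *src2; };

int main(void) {
    struct uop crack[3] = {
        { UOP_LOAD,  "t1",    "[ebx]", ""    },  /* t1 <- mem[ebx] */
        { UOP_ADD,   "t2",    "t1",    "eax" },  /* t2 <- t1 + eax */
        { UOP_STORE, "[ebx]", "t2",    ""    },  /* mem[ebx] <- t2 */
    };
    for (int i = 0; i < 3; i++)
        printf("uop %d: op=%d %s, %s, %s\n", i, crack[i].op,
               crack[i].dst, crack[i].src1, crack[i].src2);
    return 0;
}
```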

   e.g. Intel P6 decode unit

   Shen + Lipasti, Fig 4.14

   - decoders 1 & 2: simple instructions only
   - decoder 0: all instruction types, can generate up to 4 μops per cycle
   - if more than 4 μops are needed, the μROM is used to emit the sequence
   - up to 6 μops per cycle in total
   - μops go to the reorder buffer (ROB) for dispatch; the ROB can hold
     up to 40 μops

   complex decoding
   - requires more depth in the decode stage
   - increases branch penalties

   pre-decoding
   - extra information is added to instructions as they are stored in
     the I-cache
   - speeds up decode
   - leverages temporal locality of instruction fetches

   e.g. AMD K5 pre-decode

   Shen + Lipasti, Fig 4.15

   AMD K5 pre-decode

   - 8 bytes fetched in parallel
   - adds 5 bits per instruction byte, identifying:
     - start and end bytes of an IA32 instruction
     - # of ROPs needed
     - op-code and prefix byte locations
   - decode can generate 4 ROPs per cycle
   - increases the I-cache miss penalty (a high hit rate helps)
   - increases I-cache data size ~50% (tags and prediction bits
     don't change)
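One way to picture 5 pre-decode bits per byte is as a C bit-field; the field assignment below is an assumption for illustration, not the actual K5 encoding:

```c
#include <stdio.h>

/* Sketch of per-byte pre-decode information (hypothetical layout):
   the slide says 5 bits per byte marking instruction start/end,
   opcode/prefix locations, and the ROP count. */
struct predecode {
    unsigned start  : 1;  /* first byte of an IA32 instruction */
    unsigned end    : 1;  /* last byte of an IA32 instruction  */
    unsigned opcode : 1;  /* this byte holds the opcode        */
    unsigned prefix : 1;  /* this byte is a prefix             */
    unsigned rop    : 1;  /* part of the ROP-count encoding    */
};

int main(void) {
    /* pre-decode for a 2-byte instruction starting at byte 0 */
    struct predecode line[2] = {
        { .start = 1, .opcode = 1 },
        { .end = 1 },
    };
    printf("byte0 start=%d, byte1 end=%d\n", line[0].start, line[1].end);
    return 0;
}
```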

   pre-decode is used for some RISC processors
   - lesser gains than for CISC
   - identify branches early
   - identify independent instructions
   - e.g. PowerPC 620, UltraSPARC, MIPS R10000, HP PA-8000

   alternative: cache fully-decoded instructions
   - e.g. Intel NetBurst (P4) trace cache (more later)
   - e.g. Intel Sandy Bridge (Core i) μop cache

   Intel Sandy Bridge
   - 32KB L1 I-cache
   - pre-decode sits between the I-cache and the instruction queue
   - 1.5K μop cache has an 80% hit rate

   David Kanter, http://www.realworldtech.com/sandy-bridge/4/, Fig 3

3. dispatch

   - collect operands and distribute instructions to functional units
   - fetch and decode are centralized: the fetch group is treated as
     a unit; dispatch de-centralizes execution
   - instructions pending execution are held in reservation stations,
     together with their (available) operands

   centralized reservation stations

   Shen + Lipasti, Fig 4.17

   distributed reservation stations

   Shen + Lipasti, Fig 4.18

   centralized
   - best utilization of reservation station entries
   - increased hardware complexity: control, and a multi-ported buffer
     for insert (by dispatch) and removal (by the functional units)
   - slower
   - Intel P6 through Haswell microarchitectures have centralized
     (unified) reservation stations: 6 issue ports in Sandy Bridge and
     Ivy Bridge, 8 in Haswell

   distributed
   - lower overall utilization: can't share empty entries between
     functional units
   - simpler control, and single-ported insert and remove
   - PowerPC 620 has distributed reservation stations

   hybrid: clustered reservation stations
   - e.g. MIPS R10000

   Solihin et al, 1999, doi=10.1.1.24.8528, Fig 3

   terminology
   - dispatch: associate an instruction with a functional unit
   - issue: start execution in the functional unit
   - dispatch and issue are combined in a centralized R.S., and are
     separate steps in a distributed R.S. (see the sketch below)
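A minimal reservation-station sketch to make the dispatch/issue split concrete: dispatch allocates an entry in program order (operands may still be missing); issue later picks any entry whose operands are ready. All names and sizes are illustrative.

```c
#include <stdio.h>
#include <stdbool.h>

#define RS_SIZE 4

struct rs_entry {
    bool busy;
    bool ready1, ready2;   /* operand valid bits */
    int  op;               /* opcode, illustrative */
};

static struct rs_entry rs[RS_SIZE];

/* dispatch: bind an instruction to a free entry, in program order */
int dispatch(int op, bool r1, bool r2) {
    for (int i = 0; i < RS_SIZE; i++)
        if (!rs[i].busy) {
            rs[i] = (struct rs_entry){ true, r1, r2, op };
            return i;
        }
    return -1;             /* RS full: dispatch stalls */
}

/* issue: pick any entry whose operands are ready (out of order) */
int issue(void) {
    for (int i = 0; i < RS_SIZE; i++)
        if (rs[i].busy && rs[i].ready1 && rs[i].ready2) {
            rs[i].busy = false;
            return i;
        }
    return -1;             /* nothing ready this cycle */
}

int main(void) {
    dispatch(0, false, true);   /* still waiting on an operand */
    dispatch(1, true,  true);   /* ready immediately */
    printf("issued entry %d first\n", issue());  /* entry 1 issues ahead */
    return 0;
}
```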

4. execute

   specialized function units improve performance:
   - e.g. Intel NetBurst: a double-pumped integer unit could execute
     two integer instructions/cycle
   - e.g. Intel Sandy Bridge: 256-bit FADD unit does 8 single-precision
     FP adds every cycle
   - e.g. Intel Haswell: 256-bit FMA/FADD unit; a fused multiply-add
     (1 μop) has the same latency (5 cycles) as 1 FMUL instruction
   - FMA speeds up dot product, matrix multiply, and Horner's method
     for polynomial evaluation (example below)
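For example, a dot product written around fused multiply-add; C99's fma() computes a*b + c with a single rounding, which is the scalar version of what one FMA μop does (the hardware does 4 or 8 of these per cycle across a 256-bit vector):

```c
#include <stdio.h>
#include <math.h>   /* link with -lm */

/* Dot product with fused multiply-add: one fma() per element pair. */
double dot(const double *x, const double *y, int n) {
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc = fma(x[i], y[i], acc);   /* acc += x[i]*y[i], fused */
    return acc;
}

int main(void) {
    double a[3] = {1, 2, 3}, b[3] = {4, 5, 6};
    printf("%.1f\n", dot(a, b, 3));   /* 4 + 10 + 18 = 32.0 */
    return 0;
}
```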

   instruction mix
   - ideally matched by the functional unit mix
   - in reality, the number of functional units must exceed the
     pipeline width to avoid stalls waiting on a particular unit
   - e.g. Intel Haswell: 8 μops wide, ~20 execution units;
     forwarding paths for integer, SIMD integer, and FP (scalar or
     SIMD) are kept separate (fewer ports, less ...)

5. complete

   update machine state

   - architected registers: those registers the programmer knows about,
     i.e. those specified in the ISA: general-purpose registers, f.p.
     registers, condition-code register, control/status register,
     program counter
   - instructions are marked finished in the reorder buffer when the
     function unit finishes them (out of order)
   - instructions exit the ROB in order (sketched below)
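A sketch of the complete/retire mechanics as a circular buffer: finish is recorded out of order, retirement only ever advances the head, and an exception flag is held until the entry reaches the head (field names are illustrative; this also previews the exception handling on the slides below).

```c
#include <stdio.h>
#include <stdbool.h>

#define ROB_SIZE 8

struct rob_entry {
    bool valid;
    bool finished;   /* set (out of order) when the FU completes */
    bool exception;  /* deferred until the entry reaches the head */
    int  pc;
};

static struct rob_entry rob[ROB_SIZE];
static int head, tail, count;

int rob_dispatch(int pc) {                 /* allocate at dispatch, in order */
    if (count == ROB_SIZE) return -1;      /* ROB full: dispatch stalls */
    rob[tail] = (struct rob_entry){ true, false, false, pc };
    int idx = tail;
    tail = (tail + 1) % ROB_SIZE;
    count++;
    return idx;
}

void rob_finish(int idx) { rob[idx].finished = true; }  /* any order */

void rob_retire(void) {                    /* head only: program order */
    while (count > 0 && rob[head].finished) {
        if (rob[head].exception) {
            /* precise point: discard younger entries, invoke the ISR,
               resume at rob[head].pc (sketched, not implemented) */
            break;
        }
        printf("retire pc=%d\n", rob[head].pc);
        rob[head].valid = false;
        head = (head + 1) % ROB_SIZE;
        count--;
    }
}

int main(void) {
    int a = rob_dispatch(100), b = rob_dispatch(104);
    rob_finish(b);      /* younger instruction finishes first...      */
    rob_retire();       /* ...but nothing retires until pc=100 is done */
    rob_finish(a);
    rob_retire();       /* now retires pc=100 then pc=104, in order   */
    (void)a;
    return 0;
}
```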

6. retire

   update memory state (usually the D-cache)

   interrupts and exceptions [1] alter program flow

   - interrupts are generated by hardware outside the CPU
   - exceptions are generated within the processor
     - processor-detected: page fault, f.p. overflow, ...
     - program-generated: trap instructions (used for OS calls)
   - for interrupts and program-generated exceptions, the fetch unit
     stops and the instructions in the pipeline are finished before
     servicing

   [1] Michal Ludvig, http://www.logix.cz/michal/doc/i386/chp09-00.htm

   interrupts
   - instruction fetch stops and the instructions in the pipeline are
     finished (drained) before servicing

   processor-detected exceptions
   - the instruction can't complete and usually needs OS intervention
   - the excepting instruction is tagged in the ROB
   - when it reaches the head of the ROB:
     - some machine state is checkpointed (e.g. PC, status register)
     - remaining instructions in the ROB are discarded (precise exceptions)
     - the ISR is invoked
   - execution resumes at the excepting instruction