CS425 Computer Systems Architecture


CS425 Computer Systems Architecture, Fall 2017
Multiple Issue: Superscalar and VLIW
CS425 - Vassilis Papaefstathiou

Example: Dynamic Scheduling in the PowerPC 604 and Pentium Pro
In-order issue, out-of-order execution, in-order commit.

Multiple Issue
CPI = CPI_IDEAL + Stalls_STRUC + Stalls_RAW + Stalls_WAR + Stalls_WAW + Stalls_CONTROL
We have to maintain data flow and exception behavior.
Dynamic instruction scheduling (HW):
- Scoreboard (reduce RAW stalls)
- Register renaming (reduce WAR & WAW stalls): Tomasulo, reorder buffer
- Branch prediction (reduce control stalls)
- Multiple issue (CPI < 1)
- Multithreading (CPI < 1)
Static instruction scheduling (SW/compiler):
- Loop unrolling
- SW pipelining
- Trace scheduling
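The CPI decomposition above can be turned into a tiny model. This is a sketch of mine, not the lecture's, and all stall values are made-up illustrative numbers; it shows how register renaming removes the WAR and WAW terms:

```python
def pipeline_cpi(cpi_ideal, struct, raw, war, waw, control):
    # CPI = CPI_ideal plus structural, RAW, WAR, WAW, and control stalls,
    # each expressed as average stall cycles per instruction.
    return cpi_ideal + struct + raw + war + waw + control

# Illustrative numbers only: renaming (Tomasulo / reorder buffer)
# zeroes the WAR and WAW contributions.
before = pipeline_cpi(1.0, 0.1, 0.4, 0.2, 0.2, 0.3)  # 2.2
after  = pipeline_cpi(1.0, 0.1, 0.4, 0.0, 0.0, 0.3)  # 1.8
```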

Beyond CPI = 1
The initial goal was to achieve CPI = 1. Can we improve beyond this?
Superscalar: a varying number of instructions per cycle (1 to 8), i.e. 1-way, 2-way, ..., 8-way superscalar;
- scheduled by the compiler (statically scheduled) or by hardware (dynamically scheduled);
- e.g. IBM PowerPC, Sun UltraSPARC, DEC Alpha, HP 8000;
- the most successful approach (to date) for general-purpose computing.
This led to the use of Instructions Per Cycle (IPC) instead of CPI.
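Once CPI drops below 1, its reciprocal is the more readable metric. A one-line illustration (mine, not the slide's):

```python
def ipc(cpi):
    # Instructions per cycle is simply the reciprocal of
    # cycles per instruction.
    return 1.0 / cpi

assert ipc(0.5) == 2.0   # an ideal 2-way superscalar sustains 2 IPC
assert ipc(2.0) == 0.5   # a stalling scalar pipeline: CPI 2 means IPC 0.5
```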

Beyond CPI = 1
Alternative approach: Very Long Instruction Words (VLIW): a fixed number of instructions (4-16) scheduled by the compiler, which puts the operations into wide templates.
Currently more successful in DSP and multimedia applications.
Joint HP/Intel agreement in 1999/2000: Intel Architecture-64 (Merced/IA-64), 64-bit address.
Style: Explicitly Parallel Instruction Computing (EPIC).

Getting CPI < 1: Issuing Multiple Instructions/Cycle
Superscalar DLX: 2 instructions, 1 FP & 1 anything else.
- Fetch 64 bits/clock cycle; integer on the left, FP on the right.
- Can only issue the 2nd instruction if the 1st instruction issues.
- More ports on the FP registers are needed to do an FP load & FP op in a pair.

Type              Pipe stages
Int. instruction  IF ID EX MEM WB
FP instruction    IF ID EX MEM WB
Int. instruction     IF ID EX MEM WB
FP instruction       IF ID EX MEM WB
Int. instruction        IF ID EX MEM WB
FP instruction          IF ID EX MEM WB

The 1-cycle load delay expands to 3 instructions in the superscalar machine: the instruction in the right half of the pair can't use the loaded value, nor can the instructions in the next issue slot.
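The pairing rule (integer/memory op on the left, FP op on the right, in order) can be captured in a toy issue model. This is a sketch of mine, not the lecture's, and it ignores hazards; it only counts issue cycles under the slot constraint:

```python
def issue_cycles(instrs):
    # Toy in-order 2-issue model: each cycle issues the next instruction,
    # optionally pairing it with a following FP op. An FP op whose
    # predecessor did not issue in the same cycle's left slot goes alone.
    cycles, i = 0, 0
    while i < len(instrs):
        cycles += 1
        first = instrs[i]
        i += 1
        # The right slot is used only when an int op is followed by an FP op.
        if first == 'int' and i < len(instrs) and instrs[i] == 'fp':
            i += 1
    return cycles

assert issue_cycles(['int', 'fp', 'int', 'fp']) == 2  # perfect pairing
assert issue_cycles(['int', 'int', 'fp', 'fp']) == 3  # pairs break up
```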

In-Order Superscalar Pipeline
[Pipeline diagram: PC and instruction memory fetch two instructions; dual decode; GPRs feed the integer pipeline (X1, X2 with data memory, X3, W); FPRs feed the FP pipelines (pipelined FAdd and FMul over X1-X3, an unpipelined FDiv); commit point before writeback.]
Fetch two instructions per cycle; issue both simultaneously if one is integer/memory and the other is floating point.
An inexpensive way of increasing throughput; examples include the Alpha 21064 (1992) and the MIPS R5000 series (1996).
The same idea can be extended to wider issue by duplicating functional units (e.g. the 4-issue UltraSPARC), but register file ports and bypassing costs grow quickly.

Superscalar Pipeline (PowerPC- and enhanced Tomasulo-scheme)
Stages: instruction fetch -> instruction decode and rename -> instruction window -> issue -> reservation stations -> execution -> retire and write back.
Instructions in the instruction window are free from control dependences due to branch prediction, and free from name dependences due to register renaming. Only (true) data dependences and structural conflicts remain to be resolved at issue.

Superpipelined Machines: MIPS R4000
The machine issues instructions faster than they are executed.
Advantage: an increase in the number of instructions that can be in the pipeline at one time, and hence in the level of parallelism; the clock frequency is high.
Disadvantage: the larger number of instructions "in flight" (i.e. in some part of the pipeline) at any time increases the potential for data and control dependences to introduce stalls.

Sequential ISA Bottleneck
Sequential source code: a = foo(b); for (i=0, i<
  -> Superscalar compiler: find independent operations, schedule operations
  -> Sequential machine code
  -> Superscalar processor: check instruction dependencies, schedule execution

Review: Unrolled Loop in Scalar
Latencies: LD to ADDD, 1 cycle; ADDD to SD, 2 cycles.

 1 Loop: LD   F0,0(R1)
 2       LD   F6,-8(R1)
 3       LD   F10,-16(R1)
 4       LD   F14,-24(R1)
 5       ADDD F4,F0,F2
 6       ADDD F8,F6,F2
 7       ADDD F12,F10,F2
 8       ADDD F16,F14,F2
 9       SD   0(R1),F4
10       SD   -8(R1),F8
11       SD   -16(R1),F12
12       SUBI R1,R1,#32
13       BNEZ R1,LOOP
14       SD   8(R1),F16   ; 8-32 = -24

14 clock cycles, or 3.5 per iteration.
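The per-iteration figure follows directly from the schedule, since the loop body was unrolled 4 times (four LD/ADDD/SD groups):

```python
# Checking the slide's arithmetic: 14 instructions issue in 14 cycles on
# the scalar pipeline, covering 4 unrolled iterations of the original loop.
cycles, iterations = 14, 4
assert cycles / iterations == 3.5   # cycles per original iteration
```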

Loop Unrolling in Superscalar

Clock  Integer instruction   FP instruction
 1     Loop: LD F0,0(R1)
 2     LD F6,-8(R1)
 3     LD F10,-16(R1)        ADDD F4,F0,F2
 4     LD F14,-24(R1)        ADDD F8,F6,F2
 5     LD F18,-32(R1)        ADDD F12,F10,F2
 6     SD 0(R1),F4           ADDD F16,F14,F2
 7     SD -8(R1),F8          ADDD F20,F18,F2
 8     SD -16(R1),F12
 9     SD -24(R1),F16
10     SUBI R1,R1,#40
11     BNEZ R1,LOOP
12     SD -32(R1),F20

Unrolled 5 times to avoid delays: 12 clocks, or 2.4 clocks per iteration (1.5x).
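The 1.5x speedup claim compares cycles per original loop iteration across the two schedules:

```python
# Spelling out the slide's comparison: the scalar schedule takes 14 cycles
# for 4 iterations; the 2-issue schedule takes 12 cycles for 5 iterations.
scalar = 14 / 4        # 3.5 cycles per iteration
superscalar = 12 / 5   # 2.4 cycles per iteration
assert round(scalar / superscalar, 2) == 1.46   # the slide rounds to 1.5x
```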

Superscalar Advantages and Challenges
The potential advantages of a superscalar processor over a vector or VLIW processor are its ability to extract some parallelism from less structured code (i.e. not only loops) and its ability to easily cache all forms of data.
While the integer/FP split is simple for the hardware, we get a CPI of 0.5 only for programs with exactly 50% FP operations and no hazards.
If more instructions issue at the same time, decode and issue become harder: even a 2-way superscalar must examine 2 opcodes and 6 register specifiers, and decide whether 1 or 2 instructions can issue.

Example Desktop Processor: Intel Core 2 (superpipelined & 4-way superscalar)

Example Mobile Processor: ARM Cortex-A72

Example Server Processor: IBM POWER8

All in One: 2-way Superscalar + OoO + Branch Prediction + Reorder Buffer (Speculation)
(The read cycle is shown for clarification.)
We need 2 issues/cycle and 2 commits/cycle; otherwise the reorder buffer will overflow.

Alternative Solutions
- Very Long Instruction Word (VLIW)
- Explicitly Parallel Instruction Computing (EPIC)
- Simultaneous Multithreading (SMT), next lecture
- Multi-core processors, ~last lecture
VLIW trades instruction space for simple decoding:
- The long instruction word has room for many operations.
- By definition, all the operations the compiler puts in the long instruction word are independent, so they execute in parallel.
- E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch; at 16 to 24 bits per field, that is 7x16 = 112 bits to 7x24 = 168 bits wide.
- Intel Itanium 1 and 2 contain 6 operations per instruction packet.
- Needs a compiling technique that schedules across several branches.
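The width arithmetic in the example is worth checking explicitly:

```python
# The slide's VLIW width: 7 operation fields (2 int + 2 FP + 2 memory
# + 1 branch) of 16 to 24 bits each.
FIELDS = 7
assert FIELDS * 16 == 112   # narrowest encoding
assert FIELDS * 24 == 168   # widest encoding
```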

VLIW: Very Long Instruction Word
Instruction format: Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2
- Multiple operations packed into one instruction.
- Each operation slot is for a fixed function.
- Constant operation latencies are specified: two integer units (single-cycle latency), two load/store units (three-cycle latency), two floating-point units (four-cycle latency).
The architecture requires a guarantee of:
- Parallelism within an instruction: no cross-operation RAW checks.
- No data use before data ready: no data interlocks.

VLIW Compiler Responsibilities
- Schedule operations to maximize parallel execution.
- Guarantee intra-instruction parallelism.
- Schedule to avoid data hazards (no interlocks); typically separates dependent operations with explicit NOPs.
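The NOP-padding responsibility can be sketched in a few lines. This is a minimal illustration of mine, not the lecture's, with a hypothetical six-slot format matching the earlier slide:

```python
# Hypothetical slot layout: 2 integer, 2 memory, 2 FP slots per long word.
SLOTS = ('int1', 'int2', 'mem1', 'mem2', 'fp1', 'fp2')

def pack(ops):
    # ops maps a slot name to an operation string; every slot the compiler
    # could not fill with an independent op becomes an explicit NOP.
    return tuple(ops.get(slot, 'NOP') for slot in SLOTS)

# Two independent operations fill only 2 of 6 slots; the rest are padding.
word = pack({'mem1': 'LD F10,-16(R1)', 'fp1': 'ADDD F4,F0,F2'})
```

This also makes the later code-size complaint concrete: every unused slot still occupies encoding space.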

Typical VLIW Processor

Loop Unrolling in VLIW

Clock  Memory reference 1  Memory reference 2  FP operation 1   FP operation 2   Int. op / branch
1      LD F0,0(R1)         LD F6,-8(R1)
2      LD F10,-16(R1)      LD F14,-24(R1)
3      LD F18,-32(R1)      LD F22,-40(R1)      ADDD F4,F0,F2    ADDD F8,F6,F2
4      LD F26,-48(R1)                          ADDD F12,F10,F2  ADDD F16,F14,F2
5                                              ADDD F20,F18,F2  ADDD F24,F22,F2
6      SD 0(R1),F4         SD -8(R1),F8        ADDD F28,F26,F2
7      SD -16(R1),F12      SD -24(R1),F16
8      SD -32(R1),F20      SD -40(R1),F24                                        SUBI R1,R1,#48
9      SD -48(R1),F28                                                            BNEZ R1,LOOP

Unrolled 7 times to avoid delays: 9 clocks, or 1.3 clocks per iteration (1.8x vs superscalar).
Average: 2.5 ops per clock, 50% efficiency.
Note: VLIW needs more registers (15 vs. 6 in the superscalar version).
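The utilization numbers follow from counting operations against available slots:

```python
# Checking the slide's claim: 23 operations (7 LD, 7 ADDD, 7 SD, plus SUBI
# and BNEZ) spread over 9 cycles with 5 slots per cycle.
ops = 7 + 7 + 7 + 2
assert ops == 23
assert round(ops / 9, 2) == 2.56        # ~2.5 ops per clock, as stated
assert round(ops / (9 * 5), 2) == 0.51  # ~50% of slots do useful work
```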

Advantages of VLIW
The compiler prepares fixed packets of multiple operations that give the full "plan of execution":
- dependencies are determined by the compiler and used to schedule according to functional unit latencies;
- functional units are assigned by the compiler and correspond to positions within the instruction packet ("slotting");
- the compiler produces fully-scheduled, hazard-free code;
- the hardware doesn't have to "rediscover" dependencies or schedule.

Disadvantages of VLIW
- Object-code compatibility: all code must be recompiled for every machine, even for two machines in the same generation.
- Object code size: instruction padding wastes instruction memory/cache; loop unrolling/software pipelining replicates code.
- Scheduling variable-latency memory operations: caches and/or memory bank conflicts impose statically unpredictable variability; as the issue rate and the number of memory references grow, this synchronization restriction becomes unacceptable.
- Knowing branch probabilities: profiling requires a significant extra step in the build process.
- Scheduling for statically unpredictable branches: the optimal schedule varies with the branch path.

What If There Are Not Many Loops?
Branches limit basic block size in control-flow-intensive, irregular code, making it difficult to find ILP within individual basic blocks.

Trace Scheduling [Fisher, Ellis]
- Trace selection: pick a string of basic blocks, a trace, that represents the most frequent branch path; use profiling feedback or compiler heuristics to find common branch paths.
- Trace compaction: schedule the whole trace at once, packing the operations into few wide instructions; add fixup code to cope with branches jumping out of the trace.
Effective for certain classes of programs; the key assumption is that the trace is much more probable than the alternatives.
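The trace-selection step can be illustrated with a toy greedy walk over profiled branch probabilities. This is a sketch of mine under simplified assumptions (every edge has a profiled probability, and we stop at back-edges), not the Fisher/Ellis algorithm itself:

```python
def pick_trace(start, succs, prob):
    # Greedily follow the most probable successor of each basic block,
    # as a profile-driven heuristic would.
    trace, blk, seen = [start], start, {start}
    while succs.get(blk):
        nxt = max(succs[blk], key=lambda s: prob[(blk, s)])
        if nxt in seen:      # stop rather than loop on a back-edge
            break
        trace.append(nxt)
        seen.add(nxt)
        blk = nxt
    return trace

# A diamond CFG where the b0 -> b1 path is hot (90%):
succs = {'b0': ['b1', 'b2'], 'b1': ['b3'], 'b2': ['b3'], 'b3': []}
prob = {('b0', 'b1'): 0.9, ('b0', 'b2'): 0.1,
        ('b1', 'b3'): 1.0, ('b2', 'b3'): 1.0}
# The trace b0 -> b1 -> b3 gets compacted; b2 needs fixup code.
```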

Intel Itanium, EPIC, IA-64
- EPIC is the style of architecture (cf. CISC, RISC): Explicitly Parallel Instruction Computing (really just VLIW).
- IA-64 (the Intel Itanium architecture) is Intel's chosen ISA (cf. x86, MIPS); IA-64 = Intel Architecture 64-bit. An object-code-compatible VLIW.
- Merced was the first Itanium implementation (cf. 8086); first customer shipment was expected in 1997 (actually 2001). McKinley, the second implementation, shipped in 2002. A recent version, Poulson: eight cores, 32nm, 2012.
- Different instruction format from classic VLIW architectures, using stop/grouping indicators.
- Support for SW speculation.

Eight-Core Itanium Poulson [Intel 2012]
- 8 cores
- 1-cycle 16KB L1 I&D caches
- 9-cycle 512KB L2 I-cache
- 8-cycle 256KB L2 D-cache
- 32 MB shared L3 cache
- 544 mm² in 32nm CMOS; over 3 billion transistors
- Cores are 2-way multithreaded
- 6 instructions/cycle fetch: two 128-bit bundles (3 instructions/bundle)
- Up to 12 instructions/cycle execute

IA-64 Registers
- 128 general-purpose 64-bit integer registers
- 128 general-purpose 82-bit floating-point registers
- 64 x 1-bit predicate registers
- 8 x 64-bit branch registers
Register stack mechanism and rotation: GPRs rotate to reduce code size for software-pipelined loops. Rotation is a simple form of register renaming, allowing one instruction to address different physical registers on each loop iteration; the register stack similarly remaps registers across procedure calls.
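Register rotation can be modeled as base-plus-offset renaming. This is a deliberately simplified model of mine (the real IA-64 rotates only part of the register file and manages the base in hardware), just to show why each software-pipelined iteration sees fresh registers:

```python
def physical_reg(logical, rrb, size=96):
    # A rotating register base (rrb) is added to the logical register
    # number, modulo the size of the rotating region, so the same
    # instruction names a different physical register each iteration.
    return (logical + rrb) % size

# After the base advances, logical register 32 maps elsewhere:
assert physical_reg(32, 0) != physical_reg(32, 1)
assert physical_reg(32, 0) == physical_reg(32, 96)  # wraps around
```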

Execution Units

IA-64 Instruction Format
A 128-bit instruction bundle holds three 41-bit instructions plus a 5-bit template (41*3 + 5 = 128):
  Instruction 2 | Instruction 1 | Instruction 0 | Template
The template bits describe the types of the instructions and their grouping with instructions in adjacent bundles. Each group contains instructions that can execute in parallel, and the boundary of a group (a "stop") is indicated by the template. Groups can span bundle boundaries: bundles j-1, j, j+1, j+2 may be carved into groups i-1, i, i+1, i+2.
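The "(41*3+5)" annotation can be verified directly:

```python
# Three 41-bit instruction slots plus a 5-bit template fill one
# 128-bit IA-64 bundle exactly.
INSTR_BITS, SLOTS_PER_BUNDLE, TEMPLATE_BITS = 41, 3, 5
assert INSTR_BITS * SLOTS_PER_BUNDLE + TEMPLATE_BITS == 128
```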

IA-64 Template CS425 - Vassilis Papaefstathiou 32

IA-64 Basic Architecture CS425 - Vassilis Papaefstathiou 33

IA-64 Predicated Execution
Problem: mispredicted branches limit ILP.
Solution: eliminate hard-to-predict branches with predicated execution. Almost all IA-64 instructions can be executed conditionally under a predicate; an instruction becomes a NOP if its predicate register is false.

Before (four basic blocks):
  b0: Inst 1
      Inst 2
      br a==b, b2
  b1: Inst 3        ; else
      Inst 4
      jmp b3
  b2: Inst 5        ; then
      Inst 6
  b3: Inst 7
      Inst 8

After predication (one basic block):
      Inst 1
      Inst 2
      p1 = (a!=b), p2 = (a==b)
 (p1) Inst 3
 (p2) Inst 5
 (p1) Inst 4
 (p2) Inst 6
      Inst 7
      Inst 8

Mahlke et al., ISCA '95: on average >50% of branches removed.
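The semantics of if-conversion can be sketched in plain code. This is an illustration of mine, not IA-64 code: both arms always execute, but each result commits only under its predicate, so the branch disappears:

```python
def pred_op(pred, new, old):
    # A predicated operation: commits `new` only if its predicate is true;
    # otherwise it behaves as a NOP and `old` survives.
    return new if pred else old

def if_converted(a, b, x):
    p1, p2 = (a != b), (a == b)   # complementary predicates, set together
    else_val = x + 3              # the (p1) arm, always computed
    then_val = x * 2              # the (p2) arm, always computed
    r = pred_op(p1, else_val, x)  # at most one of the two commits
    r = pred_op(p2, then_val, r)
    return r

assert if_converted(1, 1, 10) == 20   # a == b: the then-arm commits
assert if_converted(1, 2, 10) == 13   # a != b: the else-arm commits
```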

Branch Predication

Branch Predication Example