Multiple Issue ILP Processors. Summary of discussions


Summary of discussions

ILP processors come in two families: VLIW/EPIC and superscalar.
- A superscalar has hardware logic for extracting parallelism; solutions for stalls etc. must be provided in hardware.
- Stalls play an even greater role in ILP processors.
- Software solutions, such as code scheduling through code movement, can lead to improved execution times, but more sophisticated techniques are needed. Asking whether we can provide some hardware support to help the compiler leads to EPIC/VLIW.

In a statically scheduled superscalar, instructions issue in order, and all pipeline hazards are checked at issue time.
- An instruction causing a hazard forces subsequent instructions to stall.

In a statically scheduled VLIW, the compiler generates multiple-issue packets of instructions.
- During instruction fetch, the pipeline receives a number of instructions from the IF stage: the issue packet.
- The issue unit examines each instruction in the packet: if there is no hazard it issues, else it waits.
- This complexity implies further splitting of the issue stage.

Getting CPI < 1: Issuing Multiple Instructions/Cycle

Vector processing: explicit coding of independent loops as operations on large vectors of numbers.
- Multimedia instructions are being added to many processors.

Superscalar: a varying number of instructions/cycle (1 to 8), scheduled by the compiler or by hardware (Tomasulo).
- IBM PowerPC, Sun UltraSPARC, DEC Alpha, Pentium III/4.

(Very) Long Instruction Word (V)LIW: a fixed number of instructions (4-16) scheduled by the compiler; operations are placed into wide templates.
- Intel Architecture-64 (IA-64): 64-bit addresses; renamed Explicitly Parallel Instruction Computing (EPIC).

The anticipated success of multiple issue led to speaking of Instructions Per Clock cycle (IPC) rather than CPI.

Getting CPI < 1: Superscalar MIPS

Superscalar MIPS: 2 instructions per cycle, 1 FP and 1 of anything else.
- Fetch 64 bits/clock cycle; the integer instruction on the left, the FP instruction on the right.
- Can only issue the 2nd instruction if the 1st instruction issues.
- More ports on the FP register file are needed to do an FP load and an FP op in a pair.

Type              Pipe stages
Int. instruction  IF ID EX MEM WB
FP  instruction   IF ID EX MEM WB
Int. instruction     IF ID EX MEM WB
FP  instruction      IF ID EX MEM WB
Int. instruction        IF ID EX MEM WB
FP  instruction         IF ID EX MEM WB

The 1-cycle load delay expands to 3 instruction slots in a superscalar: the instruction in the right half of the load's pair cannot use the result, nor can the instructions in the next issue slot.
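The dual-issue pairing rule above (one integer/memory op in the left slot, one FP op in the right, checked for dependences at issue) can be sketched in software. This is an illustrative model, not a real MIPS implementation; the opcode sets and the (op, dest, sources) encoding are invented for the example:

```python
# Illustrative model of the dual-issue check described above: the 2nd
# instruction of a fetch pair issues only if the slot mix is right and
# it has no RAW dependence on the 1st.

LEFT_OPS = {"add", "sub", "lw", "sw", "l.d", "s.d"}   # int ALU + all memory ops
RIGHT_OPS = {"add.d", "sub.d", "mul.d", "div.d"}      # FP ALU ops

def issues_in_same_cycle(first, second):
    """Return True if both instructions of the pair can issue together."""
    op1, dest1, _srcs1 = first
    op2, _dest2, srcs2 = second
    if op1 not in LEFT_OPS or op2 not in RIGHT_OPS:
        return False        # structural hazard: wrong slot mix
    if dest1 is not None and dest1 in srcs2:
        return False        # RAW hazard inside the issue packet
    return True

# Independent pair: FP load + unrelated FP add -> both issue.
print(issues_in_same_cycle(("l.d", "f2", ["r1"]), ("add.d", "f0", ["f4", "f6"])))  # True
# Dependent pair: the add reads f2, which the load writes -> the 2nd waits.
print(issues_in_same_cycle(("l.d", "f2", ["r1"]), ("add.d", "f0", ["f2", "f4"])))  # False
```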

Multiple Issue Issues

Issue packet: a group of instructions from the fetch unit that could potentially issue in 1 clock.
- If an instruction causes a structural hazard, or a data hazard due either to an earlier instruction in execution or to an earlier instruction in the issue packet, then the instruction does not issue.
- 0 to N instructions issue per clock cycle for an N-issue processor.
- Performing the issue checks in 1 cycle could limit clock cycle time: O(n^2 - n) comparisons.
  => the issue stage is usually split and pipelined: the 1st stage decides how many instructions from within this packet can issue; the 2nd stage examines hazards among the selected instructions and those that have already issued.
  => higher branch penalties => prediction accuracy is important.

Multiple Issue Challenges

While the integer/FP split is simple for the hardware, it gets a CPI of 0.5 only for programs with exactly 50% FP operations AND no hazards. If more instructions issue at the same time, decode and issue become harder:
- Even a 2-scalar must examine 2 opcodes and 6 register specifiers and decide whether 1 or 2 instructions can issue (N-issue: ~O(N^2 - N) comparisons).
- Register file: need 2x reads and 1x writes per cycle.
- Rename logic: must be able to rename the same register multiple times in one cycle! For instance, consider 4-way issue:

    add r1, r2, r3     add p11, p4, p7
    sub r4, r1, r2     sub p22, p11, p4
    lw  r1, 4(r4)      lw  p23, 4(p22)
    add r5, r1, r2     add p12, p23, p4

  Imagine doing this transformation in a single cycle!
- Result buses: need to complete multiple instructions per cycle, so either multiple buses with associated matching logic at every reservation station, or multiple forwarding paths.

Summary of Course: Performance and Cost

Amdahl's Law:

  Speedup_overall = ExTime_old / ExTime_new
                  = 1 / ( (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced )

CPI Law:

  CPU time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)

Designing to last through trends:

  Technology   Capacity        Speed
  Logic        2x in 3 years   2x in 3 years
  DRAM         4x in 4 years   2x in 10 years
  Disk         4x in 3 years   2x in 5 years
  Processor    ?               2x every 1.5 years?
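The 4-way renaming example above can be traced with a small map-table sketch. Real rename hardware computes all four mappings in parallel within one cycle; the sequential loop, the map table, and the free-list names here are illustrative only:

```python
# Sequential trace of the 4-way renaming example above. Hardware does
# all four mappings in parallel in one cycle; this loop only shows the
# data flow through the (illustrative) map table and free list.

def rename(instructions, map_table, free_list):
    """instructions: (op, dest, sources) tuples over architectural regs."""
    renamed = []
    for op, dest, srcs in instructions:
        # Sources read the *current* mapping, possibly set by an earlier
        # instruction in the same issue group.
        phys_srcs = [map_table.get(s, s) for s in srcs]
        phys_dest = free_list.pop(0)   # fresh physical register for dest
        map_table[dest] = phys_dest    # later readers of dest see this one
        renamed.append((op, phys_dest, phys_srcs))
    return renamed

group = [("add", "r1", ["r2", "r3"]),
         ("sub", "r4", ["r1", "r2"]),
         ("lw",  "r1", ["r4"]),
         ("add", "r5", ["r1", "r2"])]
out = rename(group, {"r2": "p4", "r3": "p7"}, ["p11", "p22", "p23", "p12"])
print(out)  # r1 is renamed twice (to p11, then p23) within one group
```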

Performance and Cost

Thumb rules for performance:
- Make the common case fast
- Locality
- Parallelism
- Critical path
Cost vs. price: can the PC industry support the engineering/research investment? For better or worse, benchmarks shape a field.
Interested in learning more on performance? CS 238 Computer Systems Performance (don't ask me when).

Goodbye to Performance and Cost

Will we sustain 2X every 1.5 years?
- Can integrated circuits improve below 1.8 micron in speed as well as capacity?
- 5-6 years to a PhD => 16X CPU speed, 10X DRAM capacity, 25X disk capacity? (20 GHz CPU, 10 GB DRAM, 2 TB disk?)

Processor Architecture and ISA

What does the ISA look like to the pipeline?
- Cray: load/store machine; registers; simple instruction format
- RISC: making an ISA that supports pipelined execution
- 80x86: the importance of being there first
- VLIW/EPIC: the compiler controls Instruction Level Parallelism (ILP)
Interested in learning more on compilers and ISA? CS 246 Compiler Optimization.

Superscalar Execution (Chapter 3)

- Dynamic scheduling with Tomasulo: why does it work, and why is it the de facto method today?
- Multiple instruction issue: what are the challenges? Did ILP limits really restrict practical machines to 4-issue, 4-commit? Did we ever really get CPI below 1.0?
- Branch prediction: how accurate did it become? For real programs, how much better than a 2-bit table?
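The "years to a PhD" projections above follow from the trend rates by compounding, growth = factor ** (years / period). A quick check over a 6-year horizon (the slide's 10X and 25X figures are rounder and assume a slightly longer horizon):

```python
# Compounding the trend table's rates over a 6-year PhD.

def growth(factor, period_years, elapsed_years):
    return factor ** (elapsed_years / period_years)

print(round(growth(2, 1.5, 6)))  # CPU speed, 2x/1.5yr over 6 yrs -> 16
print(round(growth(4, 4.0, 6)))  # DRAM capacity, 4x/4yr over 6 yrs -> 8
print(round(growth(4, 3.0, 6)))  # disk capacity, 4x/3yr over 6 yrs -> 16
```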

Software Scheduling

Instruction Level Parallelism (ILP) is found either by the compiler or by the hardware. Loop-level parallelism is the easiest to see:
- Software dependences and compiler sophistication determine whether the compiler can unroll loops.
- Memory dependences are the hardest to determine => memory disambiguation.
- Very sophisticated transformations are available; trace scheduling parallelizes across if statements.
Superscalar and VLIW both reach CPI < 1 (IPC > 1):
- Dynamic issue vs. static issue.
- More instructions issuing at the same time => larger hazard penalty.
- The limit is often the number of instructions that you can successfully fetch and decode per cycle.

EPIC/VLIW

- What did IA-64/EPIC do well besides floating-point programs? Was the only difference the 64-bit address vs. the 32-bit address?
- What happened to the AMD 64-bit-address 80x86 proposal? What happened with EPIC code size vs. x86?
- Did anybody propose anything at the ISA level to help with software quality? Availability? Security?

Hardware versus Software Speculation Mechanisms

- To speculate extensively, one must be able to disambiguate memory references; this is much easier in HW than in SW for code with pointers.
- HW-based speculation works better when control flow is unpredictable and when HW-based branch prediction is superior to SW-based branch prediction done at compile time; mispredictions mean wasted speculation.
- HW-based speculation maintains a precise exception model even for speculated instructions.
- HW-based speculation does not require compensation or bookkeeping code.
- Compiler-based approaches may benefit from the ability to see further ahead in the code sequence, resulting in better code scheduling.
- HW-based speculation with dynamic scheduling does not require different code sequences to achieve good performance on different implementations of an architecture; this may be the most important advantage in the long run.
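Why is memory disambiguation "much easier in HW than in SW for code with pointers"? Hardware compares the actual run-time addresses, while the compiler must prove that two references can never alias for any input. A minimal illustration (the function and variable names are hypothetical):

```python
# The two references below may or may not name the same location, and
# only the run-time addresses (here, indices) settle it, so a compiler
# cannot in general hoist the load above the store.

def store_then_load(a, b, i, j):
    a[i] = 42      # store through reference a
    return b[j]    # load through reference b: may alias the store

x = [0, 0]
print(store_then_load(x, x, 0, 1))  # distinct locations -> 0 (reordering safe)
print(store_then_load(x, x, 0, 0))  # aliased -> 42 (load must see the store)
```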

IA-64 EPIC vs. Classic VLIW

Similarities:
- Compiler-generated wide instructions
- Static detection of dependences
- ILP encoded in the binary (a group)
- Large number of architected registers
Differences:
- Instructions in a bundle can have dependences
- Hardware interlocks between dependent instructions
- Accommodates varying numbers of functional units and latencies
- Allows dynamic scheduling and functional-unit binding
Static schedules are suggestive rather than absolute:
- Code is compatible across generations, but software won't run at top speed until it is recompiled, so a shrink-wrap binary might need to include multiple builds.

EPIC and Compiler Optimization

EPIC requires dependence-free scheduled code: the burden of extracting parallelism falls on the compiler, so the success of EPIC architectures depends on the efficiency of compilers! We therefore provide an overview of compiler optimization techniques as they apply to EPIC/ILP.

Introduction to Compiler Optimization: the Hardware-Software Interface

Machine:
- Available resources are statically fixed
- Designed to support a wide variety of programs
- Interested in running many programs fast
Program:
- Required resources vary dynamically
- Designed to run well on a variety of machines
- Interested in having itself run fast
Performance = t_cyc x CPI x code size; it reflects how well the machine resources match the program requirements.
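The "ILP encoded in the binary (a group)" point can be sketched: the compiler marks boundaries between groups of instructions it asserts are independent, and hardware of any width issues each group as fast as its resources allow. The tuple encoding below is invented for illustration; real IA-64 packs instructions into 128-bit bundles with template and stop bits:

```python
# Splitting an EPIC-style instruction stream into issue groups at
# compiler-inserted stops. The (op, dest, sources, stop_after)
# encoding is invented for this sketch.

def split_into_groups(stream):
    groups, current = [], []
    for op, dest, srcs, stop_after in stream:
        current.append((op, dest, srcs))
        if stop_after:            # compiler asserts a group boundary here
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups

stream = [
    ("add", "r1", ["r2", "r3"], False),
    ("sub", "r4", ["r5", "r6"], True),   # stop: the mul depends on r1, r4
    ("mul", "r7", ["r1", "r4"], True),
]
print(split_into_groups(stream))  # two groups: [add, sub] then [mul]
```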

Compiler Tasks

Code translation: source language -> target language.
- Examples: FORTRAN -> C; C -> MIPS, PowerPC, or Alpha machine code; MIPS binary -> Alpha binary.
Code optimization:
- Make the code run faster.
- Match dynamic code behavior to the static machine structure.

Compiler Structure (IR = intermediate representation)

  high-level source code -> Front End -> IR -> Optimizer -> IR -> Back End -> machine code
                                               (with a Dependence Analyzer)

The front end and optimizer are machine independent; the back end is machine dependent.

Midterm Exam

When: Tuesday, October 28th, 6:15pm; 2 hours.
What: Chapters 1, 2, 3; Appendices A, B, G (all material up to and including EPIC/VLIW); Homeworks 1, 2, 3.
What does it look like? 3 parts: (1) multiple choice, (2) short answers (2 sentences), (3) detailed/analytical questions (like the homeworks?).
If you miss the exam there is a makeup IF you have a valid excuse (sickness, your company sent you out on travel, others?). NOT valid excuses:
- You need to get your car serviced at the same time as the exam.
- You have a party to attend the previous night.