Lecture: Static ILP. Topics: compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2)


Static vs Dynamic Scheduling

Arguments against dynamic scheduling:
- requires complex structures to identify independent instructions (scoreboards, issue queues)
- high power consumption
- low clock speed
- high design and verification effort
- the compiler can easily compute instruction latencies and dependences
- complex software is always preferred to complex hardware (?)

ILP

Instruction-level parallelism: overlap among instructions, through pipelining or multiple instruction execution.

What determines the degree of ILP?
- dependences: a property of the program
- hazards: a property of the pipeline

Loop Scheduling

The compiler's job is to minimize stalls. Focus on loops: they account for most cycles and are relatively easy to analyze and optimize.

Assumptions

- Load: 2 cycles (1 stall cycle for the consumer)
- FP ALU: 4 cycles (3 stall cycles for the consumer; 2 stall cycles if the consumer is a store)
- Int ALU: 1 cycle (no stall for the consumer; 1 stall cycle if the consumer is a branch)
- One branch delay slot

Summary of stalls between a producer and its consumer:
- LD -> any: 1 stall
- FPALU -> any: 3 stalls
- FPALU -> ST: 2 stalls
- IntALU -> BR: 1 stall

Loop Example

Source code:

for (i = 1000; i > 0; i--)
    x[i] = x[i] + s;

Assembly code:

Loop: L.D    F0, 0(R1)     ; F0 = array element
      ADD.D  F4, F0, F2    ; add scalar (in F2)
      S.D    F4, 0(R1)     ; store result
      DADDUI R1, R1, #-8   ; decrement address pointer
      BNE    R1, R2, Loop  ; branch if R1 != R2
      NOP                  ; branch delay slot

Loop Example

Source code:

for (i = 1000; i > 0; i--)
    x[i] = x[i] + s;

Stalls: LD -> any: 1; FPALU -> any: 3; FPALU -> ST: 2; IntALU -> BR: 1

10-cycle schedule (stalls inserted according to the latencies above):

Loop: L.D    F0, 0(R1)     ; cycle 1: F0 = array element
      stall                ; cycle 2
      ADD.D  F4, F0, F2    ; cycle 3: add scalar
      stall                ; cycle 4
      stall                ; cycle 5
      S.D    F4, 0(R1)     ; cycle 6: store result
      DADDUI R1, R1, #-8   ; cycle 7: decrement address pointer
      stall                ; cycle 8
      BNE    R1, R2, Loop  ; cycle 9: branch if R1 != R2
      stall                ; cycle 10: branch delay slot (NOP)

Smart Schedule

Original:

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

Re-ordered (stalls: LD -> any: 1; FPALU -> any: 3; FPALU -> ST: 2; IntALU -> BR: 1):

Loop: L.D    F0, 0(R1)     ; cycle 1
      DADDUI R1, R1, #-8   ; cycle 2
      ADD.D  F4, F0, F2    ; cycle 3
      stall                ; cycle 4
      BNE    R1, R2, Loop  ; cycle 5
      S.D    F4, 8(R1)     ; cycle 6: delay slot; offset adjusted for the moved DADDUI

By re-ordering instructions, each iteration takes 6 cycles instead of 10. We were able to violate the anti-dependence between S.D and DADDUI easily because only an immediate offset had to change (0 becomes 8).

Loop overhead (instructions that do book-keeping for the loop): 2 (DADDUI, BNE)
Actual work (the L.D, ADD.D, and S.D): 3 instructions

Can we somehow get execution time down to 3 cycles per iteration?

Loop Unrolling

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)
      DADDUI R1, R1, #-32
      BNE    R1, R2, Loop

Loop overhead: 2 instructions; work: 12 instructions
How long will the above schedule take to complete?

Scheduled and Unrolled Loop

Loop: L.D    F0, 0(R1)
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F4, F0, F2
      ADD.D  F8, F6, F2
      ADD.D  F12, F10, F2
      ADD.D  F16, F14, F2
      S.D    F4, 0(R1)
      S.D    F8, -8(R1)
      DADDUI R1, R1, #-32
      S.D    F12, 16(R1)   ; offset adjusted: R1 has already been decremented
      BNE    R1, R2, Loop
      S.D    F16, 8(R1)    ; delay slot; offset adjusted likewise

Stalls: LD -> any: 1; FPALU -> any: 3; FPALU -> ST: 2; IntALU -> BR: 1

Execution time: 14 cycles, or 3.5 cycles per original iteration

Loop Unrolling

- Increases program size
- Requires more registers
- To unroll an n-iteration loop by degree k, we need n/k iterations of the larger loop, followed by n mod k iterations of the original loop (see the C sketch below)
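To make the n/k plus n mod k decomposition concrete, here is a minimal C sketch (not from the slides; the function name and trip-count parameter are illustrative) of the example loop unrolled by degree k = 4, with a cleanup loop for the leftover iterations:

/* Hypothetical C-level view of unrolling by k = 4.
   n is the trip count (1000 in the slide example). */
void add_scalar_unrolled(double *x, double s, int n)
{
    int i = n;
    /* Main unrolled loop: executes n/4 times. */
    for (; i >= 4; i -= 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
    /* Cleanup loop: the remaining n mod 4 original iterations. */
    for (; i > 0; i--)
        x[i] = x[i] + s;
}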

Automating Loop Unrolling

- Determine the dependences across iterations: in the example, we knew that loads and stores in different iterations did not conflict and could be re-ordered
- Determine if unrolling will help: possible only if iterations are independent
- Determine address offsets for the different loads/stores
- Use dependence analysis to schedule code without introducing hazards; eliminate name dependences by using additional registers (sketched below)
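As an illustration of the last point, a hedged C sketch (illustrative names, not from the slides) of why renaming matters: with a single temporary, every unrolled copy is serialized by name dependences on that temporary; giving each copy its own temporary, as the unrolled assembly does with F0/F6/F10/F14, frees the scheduler to hoist all the loads above the adds:

/* Hypothetical sketch: removing name dependences by renaming.
   With one temporary t, copy 2 must wait for copy 1 to finish with t
   (WAR/WAW name dependences):
       t = x[i];     x[i]     = t + s;
       t = x[i - 1]; x[i - 1] = t + s;
   With distinct temporaries (the analogue of F0/F6/F10/F14),
   the loads can all be scheduled before the adds: */
void unrolled_renamed(double *x, double s, int i)
{
    double t0 = x[i],     t1 = x[i - 1];
    double t2 = x[i - 2], t3 = x[i - 3];
    x[i]     = t0 + s;
    x[i - 1] = t1 + s;
    x[i - 2] = t2 + s;
    x[i - 3] = t3 + s;
}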

Superscalar Pipelines

- Integer pipeline: handles L.D, S.D, DADDUI, BNE
- FP pipeline: handles ADD.D

What is the schedule with an unroll degree of 4?

Superscalar Pipelines

      Integer pipeline          FP pipeline
Loop: L.D    F0, 0(R1)
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)       ADD.D F4, F0, F2
      L.D    F14, -24(R1)       ADD.D F8, F6, F2
      L.D    F18, -32(R1)       ADD.D F12, F10, F2
      S.D    F4, 0(R1)          ADD.D F16, F14, F2
      S.D    F8, -8(R1)         ADD.D F20, F18, F2
      S.D    F12, -16(R1)
      DADDUI R1, R1, #-40
      S.D    F16, 16(R1)
      BNE    R1, R2, Loop
      S.D    F20, 8(R1)

We need to unroll by degree 5 to eliminate stalls; the schedule takes 12 cycles for 5 original iterations, or 2.4 cycles per iteration.

The compiler may specify instructions that can be issued as one packet. If the compiler specifies a fixed number of instructions in each packet, we have a Very Long Instruction Word (VLIW) machine.

Software Pipeline?!

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

[Diagram: successive iterations of the loop overlapped in time; at any point, the L.D of one iteration, the ADD.D of an earlier iteration, and the DADDUI/BNE of still earlier iterations are in flight.]

Software Pipeline

[Diagram: original iterations 1-4 overlapped; each new iteration (new iter 1-4) is assembled from the L.D of one original iteration, the ADD.D of the previous one, and the S.D of the one before that.]

Software Pipelining

Original:

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

Software-pipelined:

Loop: S.D    F4, 16(R1)    ; store the result from two iterations back
      ADD.D  F4, F0, F2    ; add for the value loaded one iteration back
      L.D    F0, 0(R1)     ; load for the current iteration
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

Advantages:
- achieves nearly the same effect as loop unrolling, but without the code expansion
- an unrolled loop may have inefficiencies at the start and end of each iteration, while a software-pipelined loop is almost always in steady state
- a software-pipelined loop can also be unrolled to further reduce loop overhead

Disadvantages:
- does not reduce loop overhead
- may require more registers
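For reference, a minimal C rendering of the same transformation (assumed names, not from the slides): a prologue starts the first two iterations, the steady-state body stores the result from two iterations back, adds to the value loaded one iteration back, and loads the current element, and an epilogue drains the pipeline:

/* Hypothetical C sketch of the software-pipelined loop; assumes n >= 2. */
void add_scalar_swp(double *x, double s, int n)
{
    double loaded, added;
    loaded = x[n];               /* prologue: iteration 1's load     */
    added  = loaded + s;         /* prologue: iteration 1's add      */
    loaded = x[n - 1];           /* prologue: iteration 2's load     */
    /* Steady state, mirroring S.D / ADD.D / L.D in the assembly. */
    for (int i = n - 2; i > 0; i--) {
        x[i + 2] = added;        /* store result from two iters back */
        added    = loaded + s;   /* add for the previous iteration   */
        loaded   = x[i];         /* load for the current iteration   */
    }
    x[2] = added;                /* epilogue: drain the pipeline     */
    x[1] = loaded + s;
}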
