
Lecture 10: Static ILP Basics
Topics: loop unrolling, static branch prediction, VLIW (Sections 4.1-4.4)

Static vs Dynamic Scheduling
Arguments against dynamic scheduling:
- requires complex structures to identify independent instructions (scoreboards, issue queue)
- high power consumption
- low clock speed
- high design and verification effort
- the compiler can easily compute instruction latencies and dependences
- complex software is always preferred to complex hardware (?)

Loop Scheduling
- Revert back to the 5-stage in-order pipeline
- The compiler's job is to minimize stalls
- Focus on loops: they account for most cycles and are relatively easy to analyze and optimize
- Recall the latencies: a load has a two-cycle latency (1 stall cycle for a consumer that immediately follows), an FP ALU op feeding another FP ALU op needs 3 intervening cycles, and an FP ALU op feeding a store needs 2

Loop Example

Source code:
for (i=1000; i>0; i--)
    x[i] = x[i] + s;

Assembly code:
Loop: L.D    F0, 0(R1)     ; F0 = array element
      ADD.D  F4, F0, F2    ; add scalar
      S.D    F4, 0(R1)     ; store result
      DADDUI R1, R1, #-8   ; decrement address pointer
      BNE    R1, R2, Loop  ; branch if R1 != R2

10-cycle schedule: with the latencies above, this straightforward code takes 10 cycles per iteration (one plausible breakdown is sketched below).
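Where the 10 cycles go: the total comes from the slide, but the exact placement of the stall cycles below is a reconstruction assuming the latencies listed earlier (one stall after the load, two between the FP add and the store, one before the branch that consumes the just-decremented R1, and one lost after the branch itself):

1   Loop: L.D    F0, 0(R1)
2         <stall>             ; ADD.D must wait one cycle for the loaded value
3         ADD.D  F4, F0, F2
4         <stall>             ; FP add -> store needs two intervening cycles
5         <stall>
6         S.D    F4, 0(R1)
7         DADDUI R1, R1, #-8
8         <stall>             ; branch waits on the new R1
9         BNE    R1, R2, Loop
10        <stall>             ; cycle lost to the control hazard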

Smart Schedule

Original:
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

Re-ordered:
Loop: L.D    F0, 0(R1)
      DADDUI R1, R1, #-8
      ADD.D  F4, F0, F2
      BNE    R1, R2, Loop
      S.D    F4, 8(R1)

- By re-ordering instructions, the loop takes 6 cycles per iteration instead of 10
- We were able to violate an anti-dependence easily because an immediate was involved: the DADDUI now writes R1 before the S.D reads it, so the store's offset is adjusted from 0 to 8 to compensate
- Loop overhead (instrs that do book-keeping for the loop): 2 instrs; actual work (the L.D, ADD.D, and S.D): 3 instrs
- Can we somehow get execution time down to 3 cycles per iteration?
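One plausible accounting for the 6 cycles (the slide gives only the total): the DADDUI hides the load delay, the BNE covers one of the two cycles the S.D must wait after the ADD.D, and the S.D executes in the slot after the branch:

1   Loop: L.D    F0, 0(R1)
2         DADDUI R1, R1, #-8   ; fills the load delay slot
3         ADD.D  F4, F0, F2
4         BNE    R1, R2, Loop
5         <stall>              ; S.D still needs one more cycle for F4
6         S.D    F4, 8(R1)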

Loop Unrolling

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)
      DADDUI R1, R1, #-32
      BNE    R1, R2, Loop

Loop overhead: 2 instrs; Work: 12 instrs
How long will the above schedule take to complete?
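A rough answer, assuming the same stall pattern as in the single-iteration schedule: each of the four load/add/store groups takes 6 cycles (load, 1 stall, add, 2 stalls, store), and the DADDUI/BNE overhead adds roughly 4 more (including a stall before the branch and the cycle after it), for about 4 x 6 + 4 = 28 cycles, i.e., roughly 7 cycles per original iteration. Unrolling alone removes most of the loop overhead but not the stalls; the rescheduled version on the next slide does better.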

Scheduled and Unrolled Loop

Loop: L.D    F0, 0(R1)
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F4, F0, F2
      ADD.D  F8, F6, F2
      ADD.D  F12, F10, F2
      ADD.D  F16, F14, F2
      S.D    F4, 0(R1)
      S.D    F8, -8(R1)
      DADDUI R1, R1, #-32
      S.D    F12, 16(R1)
      BNE    R1, R2, Loop
      S.D    F16, 8(R1)

Execution time: 14 cycles, or 3.5 cycles per original iteration
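The 14-cycle figure follows because the re-ordering separates every producer from its consumer by at least the required gap: each load is several instructions ahead of its ADD.D, each ADD.D is at least two instructions ahead of its S.D, and the DADDUI is not immediately followed by the BNE. The 14 instructions therefore issue one per cycle with no stalls, giving 14 cycles for 4 iterations, or 3.5 cycles per original iteration.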

Loop Unrolling
- Increases program size
- Requires more registers
- To unroll an n-iteration loop by degree k, we need floor(n/k) iterations of the larger loop, followed by (n mod k) iterations of the original loop (see the sketch below)
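A minimal sketch of the transformation in C on the running example, with unroll degree k = 4 (the function name scale_add and the 1-based indexing follow the source loop; both are only illustrative):

/* Unrolled by k = 4: the first loop runs floor(n/4) times over the larger
   body, the second runs the n mod 4 leftover iterations of the original. */
void scale_add(double *x, int n, double s)
{
    int i;
    for (i = n; i >= 4; i -= 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
    for (; i > 0; i--)          /* cleanup loop */
        x[i] = x[i] + s;
}

When this is done at the instruction level, the compiler must also rename registers in the unrolled body so the four copies do not create name dependences, as the next slide discusses.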

Automating Loop Unrolling
- Determine the dependences across iterations: in the example, we knew that loads and stores in different iterations did not conflict and could be re-ordered
- Determine if unrolling will help: possible only if iterations are independent
- Determine address offsets for the different loads/stores
- Dependence analysis to schedule the code without introducing hazards; eliminate name dependences by using additional registers

Superscalar Pipelines

      Integer pipeline           FP pipeline
Loop: L.D    F0, 0(R1)
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)        ADD.D F4, F0, F2
      L.D    F14, -24(R1)        ADD.D F8, F6, F2
      L.D    F18, -32(R1)        ADD.D F12, F10, F2
      S.D    F4, 0(R1)           ADD.D F16, F14, F2
      S.D    F8, -8(R1)          ADD.D F20, F18, F2
      S.D    F12, -16(R1)
      DADDUI R1, R1, #-40
      S.D    F16, 16(R1)
      BNE    R1, R2, Loop
      S.D    F20, 8(R1)

- The compiler may specify instructions that can be issued as one packet
- The compiler may specify a fixed number of instructions in each packet: Very Long Instruction Word (VLIW)
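A rough count, not stated on the slide: the loop is now unrolled five times, and if each row above issues in one cycle with no stalls, the 17 instructions complete in 12 cycles, or about 2.4 cycles per original iteration.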

Loop Dependences
- If a loop only has dependences within an iteration, the loop is considered parallel: multiple iterations can be executed together so long as order within an iteration is preserved
- If a loop has dependences across iterations, it is not parallel, and these dependences are referred to as loop-carried
- Not all loop-carried dependences imply a lack of parallelism

Examples

for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s;

for (i=1; i<=100; i=i+1) {
    A[i+1] = A[i] + C[i];      S1
    B[i+1] = B[i] + A[i+1];    S2
}

for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];        S1
    B[i+1] = C[i] + D[i];      S2
}

for (i=1000; i>0; i=i-1)
    x[i] = x[i-3] + s;         S1

Examples

for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s;
No dependences

for (i=1; i<=100; i=i+1) {
    A[i+1] = A[i] + C[i];      S1
    B[i+1] = B[i] + A[i+1];    S2
}
S2 depends on S1 in the same iteration
S1 depends on S1 from the previous iteration
S2 depends on S2 from the previous iteration

for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];        S1
    B[i+1] = C[i] + D[i];      S2
}
S1 depends on S2 from the previous iteration

for (i=1000; i>0; i=i-1)
    x[i] = x[i-3] + s;         S1
S1 depends on S1 from 3 previous iterations; referred to as a recursion
Dependence distance 3; limited parallelism

Constructing Parallel Loops
If the loop-carried dependences are not cyclic (S1 depending on S1 is cyclic), loops can be restructured to be parallel

for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];        S1
    B[i+1] = C[i] + D[i];      S2
}
S1 depends on S2 from the previous iteration

A[1] = A[1] + B[1];
for (i=1; i<=99; i=i+1) {
    B[i+1] = C[i] + D[i];          S3
    A[i+1] = A[i+1] + B[i+1];      S4
}
B[101] = C[100] + D[100];
S4 depends on S3 of the same iteration

Static Branch Prediction
- To avoid stalls, we may hoist instructions above branches; this helps if the hoisted instructions are along the correct path
- Static branch prediction enables such scheduling decisions by determining the commonly executed path
- Simple heuristics: always predict taken; backward branches are usually taken (loops); forward branches are usually not taken; etc. Most such heuristics have mispredict ratios of 30-40%
- Profiling helps because branch behavior usually has a bimodal distribution: branches are highly biased as taken or not taken, and profile-based prediction achieves mispredict ratios of 5-22%
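As a concrete illustration (not from the slide): profile information or programmer knowledge can be fed to the compiler as a static hint. GCC and Clang expose this through __builtin_expect; the likely/unlikely macros and the process function below are only an example of how the common path gets marked so the compiler can lay out and schedule code along it.

#include <stdio.h>

/* Tell the compiler which outcome of the condition is expected. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

static int process(int value)
{
    if (unlikely(value < 0)) {           /* assumed-rare error path */
        fprintf(stderr, "bad value\n");
        return -1;
    }
    return value + 1;                    /* common, fall-through path */
}

int main(void)
{
    return process(41) == 42 ? 0 : 1;
}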
