
Simone Campanoni simonec@eecs.northwestern.edu Loop transformations

Outline Simple loop transformations Loop invariants Induction variables Complex loop transformations

Simple loop transformations

Simple loop transformations are used to:
- increase performance/energy savings, and/or
- unblock other transformations, e.g., increase the number of constant propagations, extract thread-level parallelism from sequential code, or generate vector instructions

Loop unrolling

Source:
    for (a = 0; a < 4; a++){
      // Body
    }

Before:
    %a = 0
    %a = cmp %a, 4
    branch %a
    Body
    %a = add %a, 1

After, with unrolling factor 2 (two copies of Body per back edge):
    %a = 0
    %a = cmp %a, 4
    branch %a
    Body
    Body
    %a = add %a, 2

Loop unrolling in LLVM Dependences
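The slide's code listing did not survive the transcription. Under LLVM's legacy pass manager, a pass that wants to unroll loops might declare its analysis dependences like this (a sketch; `CAT` is a hypothetical pass name, and the exact set of required analyses varies by LLVM version):

    void CAT::getAnalysisUsage(AnalysisUsage &AU) const override {
      // Analyses consumed by the unrolling utility (assumed set)
      AU.addRequired<LoopInfoWrapperPass>();
      AU.addRequired<ScalarEvolutionWrapperPass>();
      AU.addRequired<DominatorTreeWrapperPass>();
      AU.addRequired<AssumptionCacheTracker>();
    }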

Loop unrolling in LLVM Headers
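The header list itself is missing from the transcript; the declarations used in the following sketches typically come from these LLVM headers (paths as in the LLVM source tree):

    #include "llvm/Analysis/AssumptionCache.h"
    #include "llvm/Analysis/LoopInfo.h"
    #include "llvm/Analysis/ScalarEvolution.h"
    #include "llvm/IR/Dominators.h"
    #include "llvm/Transforms/Utils/UnrollLoop.h"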

Loop unrolling in LLVM Get the results of the required analyses
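A minimal sketch of fetching those results inside `runOnFunction(Function &F)` of a legacy `FunctionPass` (the names `LI`, `SE`, `DT`, `AC` are local variables introduced here):

    auto &LI = getAnalysis<LoopInfoWrapperPass>().getLoopInfo();
    auto &SE = getAnalysis<ScalarEvolutionWrapperPass>().getSE();
    auto &DT = getAnalysis<DominatorTreeWrapperPass>().getDomTree();
    auto &AC = getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);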

Fetch a loop
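With `LI` in hand, loops can be fetched by iterating `LoopInfo` (a sketch; iterating `LI` directly yields only the outermost loops, while `getLoopsInPreorder()` also visits the nested ones):

    for (Loop *L : LI) {
      // L is a top-level loop of F
    }
    for (Loop *L : LI.getLoopsInPreorder()) {
      // visits each outer loop before the loops nested inside it
    }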

Loop unrolling in LLVM API
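The unrolling utility itself lives in llvm/Transforms/Utils/UnrollLoop.h. A hedged sketch of a call in a recent LLVM (the option fields and the parameter list have changed across versions, so treat this as illustrative, not definitive):

    UnrollLoopOptions ULO;
    ULO.Count = 2;       // the unrolling factor
    ULO.Force = false;   // do not bypass the safety/profitability checks
    ULO.Runtime = true;  // allow runtime checks when the trip count is unknown
    ULO.AllowExpensiveTripCount = false;
    ULO.UnrollRemainder = false;

    LoopUnrollResult R =
        UnrollLoop(L, ULO, &LI, &SE, &DT, &AC,
                   /*TTI=*/nullptr,   // parameter absent in older releases
                   /*ORE=*/nullptr, /*PreserveLCSSA=*/true);
    if (R == LoopUnrollResult::FullyUnrolled) {
      // the loop object has been deleted; do not touch L anymore
    }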

Loop unrolling in LLVM: example (figure: the unrolled loop contains Body, Body, Body)

Loop unrolling: the trip count

The trip count is the number of times the loop body executes (in LLVM terms, the backedge-taken count plus one). Unrolling is simplest when the trip count is known at compile time.

Loop unrolling: the trip multiple

The trip multiple is a constant guaranteed to divide the trip count; when the unrolling factor divides the trip multiple, no remainder loop is needed.

Loop unrolling in LLVM: the runtime checks

Loop unrolling in LLVM: example 2

Before:
    i = 0
    If (argc > 0):
      i++; Body; if (i == argc) exit
    return r

After (each unrolled copy keeps its own exit check):
    i = 0
    If (argc > 0):
      i++; Body; if (i == argc) exit
      i++; Body; if (i == argc) exit
    return r

Loop unrolling in LLVM: the runtime checks

Loop unrolling in LLVM: example 3

Before:
    i = 0
    If (argc > 0):
      i++; Body; if (i == argc) exit
    return r

After (unrolling factor 4, with runtime checks and a remainder loop):
    If (argc > 0):
      // Runtime checks
      i_rest = i & 3
      i_mul = i - i_rest
      If (i_mul > 0):
        auto n = 0
        for (; n < i_mul; n += 4){
          Body Body Body Body
        }
      for (auto m = 0; m < i_rest; m++){
        Body
      }
    return r

Loop unrolling in LLVM: example 4

    trip = argc
    i = 0
    If (trip > 0):
      i++; Body; if (i < trip) repeat
    return r

Loop unrolling in LLVM: example 4

Before:
    trip = argc
    i = 0
    If (trip > 0):
      i++; Body; if (i < trip) repeat
    return r

After (unrolling factor 2; each copy keeps its exit check):
    trip = argc
    i = 0
    If (trip > 0):
      i++; Body; if (i < trip) continue
      i++; Body; if (i < trip) repeat
    return r

Loop peeling

Before:
    %a = cmp %a, 10
    branch %a
    Body
    %a = add %a, 1

After, with peeling factor 1 (the first iteration is copied in front of the loop):
    Body
    %a = add %a, 1
    %a = cmp %a, 10
    branch %a
    Body
    %a = add %a, 1

Loop peeling in LLVM API: no trip count and no flags are required; peeling is (almost) always possible.
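A sketch of the corresponding call (assuming the same analyses as in the unrolling sketch; in newer releases `peelLoop` moved to llvm/Transforms/Utils/LoopPeel.h and takes extra parameters such as a ValueToValueMapTy, so check your version's header):

    // Peel one iteration off the front of L, if legal.
    if (canPeel(L))
      peelLoop(L, /*PeelCount=*/1, &LI, &SE, &DT, &AC, /*PreserveLCSSA=*/true);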

Loop peeling in LLVM: example

Loop unrolling and peeling together

Outline Simple loop transformations Loop invariants Induction variables Complex loop transformations

Optimizations in small, hot loops

Most programs spend ~90% of their time in a few small, hot loops:

    while (...){
      statement 1
      statement 2
      statement 3
    }

Deleting a single statement from a small, hot loop might have a big impact (100 seconds -> 70 seconds).

Loop example

    1: if (N > 5){ k = 1; z = 4; }
    2: else { k = 2; z = 3; }
       do {
    3:   a = 1;
    4:   y = x + N;
    5:   b = k + z;
    6:   c = a * 3;
    7:   if (N < 0){
    8:     m = 5;
    9:     break;
         }
    10:  x++;
    11: } while (x < N);

Observation: each statement in that loop contributes to the program execution time.
Idea: what about moving statements from inside the loop to outside it?
- Which statements can be moved outside our loop?
- How to identify them automatically? (code analysis)
- How to move them? (code transformation)

Loop-invariant computations

Let d be the following definition:
    (d) t = x op y
d is loop-invariant in a loop L if (assuming x, y do not escape):
- x and y are constants, or
- all reaching definitions of x and y are outside the loop, or
- only one definition reaches x (or y), and that definition is loop-invariant

Loop example

    1: if (N > 5){ k = 1; z = 4; }
    2: else { k = 2; z = 3; }
       do {
    3:   a = 1;
    4:   y = x + N;
    5:   b = k + z;
    6:   c = a * 3;
    7:   if (N < 0){
    8:     m = 5;
    9:     break;
         }
    10:  x++;
    11: } while (x < N);

d is loop-invariant in a loop L if:
- x and y are constants, or
- all reaching definitions of x and y are outside the loop, or
- only one definition reaches x (or y), and that definition is loop-invariant

Which statements in our loop satisfy these conditions?

Loop-invariant computations in LLVM
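LLVM's Loop class already answers the base cases of this definition: `Loop::isLoopInvariant(V)` checks whether a value is defined outside the loop, and `Loop::hasLoopInvariantOperands(I)` checks all operands of an instruction at once. A minimal detection sketch (one pass over the body; iterating to a fixed point would also catch invariants that depend on other invariants):

    std::vector<Instruction *> invariants;
    for (BasicBlock *BB : L->blocks())
      for (Instruction &I : *BB)
        // all operands are constants or defined outside L, and the
        // instruction can be executed speculatively (no side effects)
        if (L->hasLoopInvariantOperands(&I) && isSafeToSpeculativelyExecute(&I))
          invariants.push_back(&I);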

Hoisting code

In order to hoist a loop-invariant computation out of a loop, we need a place to put it. We could copy it to all immediate predecessors of the loop header (n1, n2, n3 in the slide's CFG):

    for (auto it = pred_begin(h); it != pred_end(h); ++it){
      BasicBlock *pbb = *it;
      p = pbb->getTerminator();
      inv->moveBefore(p);
    }

Is it correct? ...but we can avoid code duplication (and bugs) by taking advantage of loop normalization, which guarantees the existence of the pre-header.

Hoisting code

In order to hoist a loop-invariant computation out of a loop, we need a place to put it. Thanks to loop normalization the pre-header always exists, so instead of copying the computation to every predecessor of the header we can hoist it once:

    pbb = loop->getLoopPreheader();
    p = pbb->getTerminator();
    inv->moveBefore(p);

(Normalized loop shape: Pre-header -> Header -> Body -> Latch, plus an Exit node.)

Can we hoist all invariant instructions of a loop L into the pre-header of L?

    for (inv : invariants(loop)){
      pbb = loop->getLoopPreheader();
      p = pbb->getTerminator();
      inv->moveBefore(p);
    }

Hoisting conditions

For a loop-invariant definition
    (d) t = x op y
we can hoist d into the loop's pre-header if:
1. d dominates all loop exits at which t is live-out, and
2. there is only one definition of t in the loop, and
3. t is not live-out of the pre-header

Outline Simple loop transformations Loop invariants Induction variables Complex loop transformations

Loop example

    1: if (N > 5){ k = 1; z = 4; }
    2: else { k = 2; z = 3; }
       do {
    3:   a = 1;
    4:   y = x + N;
    5:   b = k + z;
    6:   c = a * 3;
    7:   if (N < 0){
    8:     m = 5;
    9:     break;
         }
    10:  x++;
    11: } while (x < N);

Assuming a, b, c, m are used after our code:
- Do we have to execute 4 for every iteration?
- Do we have to execute 10 for every iteration?

Loop example

    1: if (N > 5){ k = 1; z = 4; }
    2: else { k = 2; z = 3; }
       do {
    3:   a = 1;
    4:   y = x + N;
    5:   b = k + z;
    6:   c = a * 3;
    7:   if (N < 0){
    8:     m = 5;
    9:     break;
         }
    10:  x++;
    11: } while (x < N);

Do we have to execute 4 for every iteration? Compute manually the values of x and y for every iteration (x starts at 0, so y starts at N). What do you see?
Do we have to execute 10 for every iteration?

Loop example

    1: if (N > 5){ k = 1; z = 4; }
    2: else { k = 2; z = 3; }
       y = N;
       do {
    3:   a = 1;
    4:
    5:   b = k + z;
    6:   c = a * 3;
    7:   if (N < 0){
    8:     m = 5;
    9:     break;
         }
    10:  x++; y++;
    11: } while (x < N);

Do we have to execute 4 for every iteration?
Do we have to execute 10 for every iteration?

Loop example

    1: if (N > 5){ k = 1; z = 4; }
    2: else { k = 2; z = 3; }
       y = N;
       do {
    3:   a = 1;
    4:
    5:   b = k + z;
    6:   c = a * 3;
    7:   if (N < 0){
    8:     m = 5;
    9:     break;
         }
    10:  x++; y++;
    11: } while (y < (2*N));

Do we have to execute 4 for every iteration?
Do we have to execute 10 for every iteration?

Loop example

    1: if (N > 5){ k = 1; z = 4; }
    2: else { k = 2; z = 3; }
       y = N;
       do {
    3:   a = 1;
    4:
    5:   b = k + z;
    6:   c = a * 3;
    7:   if (N < 0){
    8:     m = 5;
    9:     break;
         }
    10:  y++;
    11: } while (y < (2*N));

Do we have to execute 4 for every iteration?
Do we have to execute 10 for every iteration?

Loop example

    1: if (N > 5){ k = 1; z = 4; }
    2: else { k = 2; z = 3; }
       y = N; tmp = 2*N;
       do {
    3:   a = 1;
    4:
    5:   b = k + z;
    6:   c = a * 3;
    7:   if (N < 0){
    8:     m = 5;
    9:     break;
         }
    10:  y++;
    11: } while (y < tmp);

Do we have to execute 4 for every iteration? Do we have to execute 10 for every iteration?
x and y are induction variables.

Is the code transformation worth it?

Transformed:
    1: if (N > 5){ k = 1; z = 4; }
    2: else { k = 2; z = 3; }
    A: y = N; tmp = 2*N;
       do {
    3:   a = 1;
    5:   b = k + z;
    6:   c = a * 3;
    7:   if (N < 0){
    8:     m = 5;
    9:     break;
         }
    10:  y++;
    11: } while (y < tmp);

versus the original loop shown earlier (with 4: y = x + N and 10: x++).

... and after Loop Invariant Code Motion ...

    1: if (N > 5){ k = 1; z = 4; }
    2: else { k = 2; z = 3; }
    A: y = N; tmp = 2*N;
    3: a = 1;
    5: b = k + z;
    6: c = a * 3;
       do {
    7:   if (N < 0){
    8:     m = 5;
    9:     break;
         }
    10:  y++;
    11: } while (y < tmp);

versus the original loop shown earlier.

... and with a better Loop Invariant Code Motion ...

    1: if (N > 5){ k = 1; z = 4; }
    2: else { k = 2; z = 3; }
    A: y = N; tmp = 2*N;
    3: a = 1;
    5: b = k + z;
    6: c = a * 3;
    7: if (N < 0){
    8:   m = 5;
       }
       do {
    10:  y++;
    11: } while (y < tmp);

versus the original loop shown earlier.

... and after dead code elimination ...

    1: if (N > 5){ k = 1; z = 4; }
    2: else { k = 2; z = 3; }
    3: a = 1;
    5: b = k + z;
    6: c = a * 3;
    7: if (N < 0){
    8:   m = 5;
       }

Assuming a, b, c, m are used after our code, the loop disappears entirely (versus the original loop shown earlier).

Induction variable observation

Observation: some variables change by a constant amount on each loop iteration:
- x is initialized at 0 and increments by 1
- y is initialized at N and increments by 1

These are all induction variables. How can we identify them?

Induction variable elimination

Suppose we have a loop variable i, initially set to i0 and incremented each iteration (i = i + 1), and a variable that linearly depends on it:
    x = i * c1 + c2
We can then compute x incrementally, where c1 and c2 are constants:
- initialize x = i0 * c1 + c2
- increment x by c1 each iteration

Is it faster?

Before:
    1: i = i0
    2: do {
    3:   i = i + 1;
         ...
    A:   x = i * c1 + c2
    B: } while (i < maxI);

After:
    1:  i = i0
    N1: x = i0 * c1 + c2
    2:  do {
    3:    i = i + 1;
          ...
    A:    x = x + c1
    B:  } while (i < maxI);

On some hardware, adds are much faster than multiplies: strength reduction.

Induction variables

- Basic induction variables: i = i op c, where c is loop-invariant (a.k.a. independent induction variables)
- Derived induction variables: j = i * c1 + c2 (a.k.a. dependent induction variables)

Identify induction variables: step 1

Find the basic IVs:
1. Scan the loop body for defs of the form x = x + c, where c is loop-invariant
2. Record these basic IVs as x = (x, 1, c); this represents the IV x = x * 1 + c

Identify induction variables: step 2

Find the derived IVs:
1. Scan for derived IVs of the form k = i * c1 + c2, where i is a basic IV and this is the only definition of k in the loop
2. Record as k = (i, c1, c2); we say k is in the family of i

Identified induction variables

- i: basic
- j: basic
- q: derived from i
- k: derived from i
- x: derived from j
- z: derived from k

A forest of induction variables.

Many optimizations rely on IVs

Like the induction variable elimination we have seen before, or loop unrolling, which needs IVs to compute the trip count.

Induction variable elimination: step 1

1. Iterate over IVs k = j * c1 + c2, where IV j = (i, a, b), this is the only def of k in the loop, and there is no def of i between the def of j and the def of k:
       i = ...
       j = i ...
       k = j ...
2. Record as k = (i, a*c1, b*c1 + c2)

Induction variable elimination: step 2

For an induction variable k = (i, c1, c2):
1. Initialize k = i * c1 + c2 in the pre-header
2. Replace k's def in the loop with k = k + c1 (make sure to do this after i's definition)

Normalize induction variables

Basic induction variables are normalized to:
- initial value 0
- incremented by 1 at each iteration
- explicit final value (either constant or loop-invariant)

A normalized basic IV: x = x*1 + 1, recorded as (x, 1, 1).

Normalize induction variables in LLVM

- indvars: normalizes induction variables and highlights the basic induction variables
- scalar-evolution: scalar evolution analysis
  - represents scalar expressions (e.g., x = y op z)
  - supports induction variables (e.g., x = x + 1)
  - lowers the burden of explicitly handling the composition of expressions
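For instance (assuming the new pass manager syntax of recent opt releases; older releases use `-indvars` and `-analyze -scalar-evolution` instead), the two passes can be exercised from the command line:

    opt -passes='indvars' in.ll -S -o normalized.ll
    opt -passes='print<scalar-evolution>' normalized.ll -disable-output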

LLVM scalar evolution example

SCEV: {A,B,C}<flags><%D>
- A: the initial value
- B: the operator
- C: the operand (the step)
- D: the basic block where it gets defined

For example, {0,+,1}<nuw><nsw><%for.body> describes a value that starts at 0 and grows by 1 on every iteration of the loop headed by %for.body.

LLVM scalar evolution example: pass deps

A use of the Scalar Evolution pass

Problem: we need to infer the number of iterations that a loop will execute:

    for (int i = 0; i < 10; i++){
      BODY
    }

Solution for this loop: 10. Any idea how to compute it in general?
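One way to answer this with Scalar Evolution (a sketch; `SE` and `L` are as in the unrolling sketches, and `getSmallConstantTripCount` returns 0 when the trip count is not a compile-time constant):

    // Ask ScalarEvolution for the loop's constant trip count, if any.
    if (unsigned tripCount = SE.getSmallConstantTripCount(L)) {
      // tripCount == 10 for the loop above
    }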

Scalar evolution in LLVM

Analysis used by:
- induction variable substitution
- strength reduction
- vectorization

SCEVs are modeled by the llvm::SCEV class:
- there is a sub-class for each kind of SCEV (e.g., llvm::SCEVAddExpr)
- a SCEV is a tree of SCEVs
- leaves: constant (llvm::SCEVConstant, e.g., 1) and unknown (llvm::SCEVUnknown)
- to iterate over a tree: llvm::SCEVVisitor
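For example, an induction variable shows up as an add-recurrence SCEV whose start and step can be inspected directly (a sketch; `V` is a hypothetical Value taken from the loop, and the expression classes are declared in llvm/Analysis/ScalarEvolutionExpressions.h):

    const SCEV *S = SE.getSCEV(V);
    if (const auto *AR = dyn_cast<SCEVAddRecExpr>(S)) {
      const SCEV *start = AR->getStart();             // e.g., 0
      const SCEV *step  = AR->getStepRecurrence(SE);  // e.g., 1
    }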

Outline Simple loop transformations Loop invariants Induction variables Complex loop transformations

Loop transformations

- Restructure a loop to expose more optimization opportunities and/or reduce the loop overhead: loop unrolling, loop peeling, ...
- Reorganize a loop to improve memory utilization: cache blocking, skewing, loop reversal
- Distribute a loop over cores/processors: DOACROSS, DOALL, DSWP, HELIX

Loop transformations for memory optimizations

How many clock cycles will this take?

    varx = array[5];
    ...

Goal: improve cache performance

- Temporal locality: a resource that has just been referenced will more likely be referenced again in the near future
- Spatial locality: the likelihood of referencing a resource is higher if a resource near it was just referenced

Ideally, a compiler generates code with high temporal and spatial locality for the target architecture. What to minimize: bad replacement decisions.

What a compiler can do

- Time: when is an object accessed?
- Space: where does an object exist in the address space?

These are the two knobs a compiler can manipulate.

Manipulating time and space

- Time: reordering computation. Determine when an object will be accessed, and predict a better time to access it.
- Space: changing data layout. Determine an object's shape and location, and determine a better layout.

First understand cache behavior...

- When do cache misses occur? Use locality analysis.
- Can we change the visitation order to produce better behavior? Evaluate costs.
- Does the new visitation order still produce correct results? Use dependence analysis.

... and then rely on loop transformations:
- loop interchange
- cache blocking
- loop fusion
- loop reversal
- ...

Code example

    double A[N][N], B[N][N];
    for i = 0 to N-1{
      for j = 0 to N-1{
        ... = A[i][j] ...
      }
    }

(figure: the iteration space for A)

Loop interchange

Before:
    for i = 0 to N-1
      for j = 0 to N-1
        ... = A[j][i]

After:
    for j = 0 to N-1
      for i = 0 to N-1
        ... = A[j][i]

Assumptions: N is large; A is row-major; 2 elements per cache line. What is A[][] in C? In Java?

Java (similar in C)

To create a matrix:
    double [][] A = new double[3][3];
A is an array of arrays; A is not a 2-dimensional array!

Java (similar in C)

To create a matrix:
    double [][] A = new double[3][];
    A[0] = new double[3];
    A[1] = new double[3];
    A[2] = new double[3];

Java (similar in C)

To create a matrix:
    double [][] A = new double[3][];
    A[0] = new double[10];
    A[1] = new double[5];
    A[2] = new double[42];
A is a jagged array.

C#: [][] vs. [,]

    double [][] A = new double[3][];
    A[0] = new double[3];
    A[1] = new double[3];
    A[2] = new double[3];

    double [,] A = new double[3,3];

With [,] the compiler can easily choose between row-major and column-major layout.

Cache blocking (a.k.a. tiling)

Before:
    for i = 0 to N-1
      for j = 0 to N-1
        f(A[i], A[j])

After:
    for JJ = 0 to N-1 by B
      for i = 0 to N-1
        for j = JJ to min(N-1, JJ+B-1)
          f(A[i], A[j])
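To make the cache effect concrete, here is a self-contained C++ sketch of the same pattern (N and the block size B are illustrative values, not from the slides; B should be tuned so that B elements fit comfortably in cache):

    #include <cstddef>

    static void f(double, double) {}   // stand-in for the slide's f

    constexpr std::size_t N = 1024;    // assumed problem size
    constexpr std::size_t B = 64;      // assumed block size
    static double A[N];

    // Tiled version of: for i, for j, f(A[i], A[j]).
    // For a fixed block [JJ, JJ+B), every iteration of i reuses the same
    // B elements A[JJ..JJ+B-1], so they stay resident in the cache.
    void pairwise_tiled() {
      for (std::size_t JJ = 0; JJ < N; JJ += B)
        for (std::size_t i = 0; i < N; ++i)
          for (std::size_t j = JJ; j < JJ + B && j < N; ++j)
            f(A[i], A[j]);
    }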

Loop fusion

Before:
    for i = 0 to N-1
      C[i] = A[i]*2 + B[i]
    for i = 0 to N-1
      D[i] = A[i] * 2

After:
    for i = 0 to N-1
      C[i] = A[i] * 2 + B[i]
      D[i] = A[i] * 2

Benefits:
- reduce loop overhead
- improve locality by combining loops that reference the same array
- increase the granularity of work done in a loop

Locality analysis

- Reuse: accessing a location that has been accessed previously
- Locality: accessing a location that is in the cache

Observe: locality only occurs when there is reuse! But reuse does not imply locality.

Steps in locality analysis

1. Find data reuse
2. Determine the localized iteration space: the set of inner loops where the data accessed by an iteration is expected to fit within the cache
3. Find data locality: reuse within the localized iteration space yields locality