Loops. Lather, Rinse, Repeat. CS4410: Spring 2013


Program Loops
Reading: Appel Ch. 18
Loop = a computation repeatedly executed until a terminating condition is reached.
High-level loop constructs:
While loop: while (e) s;
For loop: for (i=0; i<u; i+=c) s;

Program Loops
Why are loops important? Most of the execution time is spent in loops. Typically the 90/10 rule applies: 90% of the running time is spent in 10% of the code, and that 10% is usually loops. Therefore, loops are important targets of optimization.

Loop Optimizations: So we want techniques for improving them.
Low-level optimization: moving around code in a single loop; usually performed at the 3-addr code stage or later; e.g., loop invariant removal, induction variable strength reduction & elimination, loop unrolling.
High-level optimization: restructuring loops, often affecting multiple loops; e.g., loop fusion, loop interchange, loop tiling.

Example: invariant removal
L0: t := 0
L1: i := i + 1
    t := a + b
    *i := t
    if i<n goto L1 else L2
L2: x := t

Example: invariant removal
L0: t := 0
    t := a + b
L1: i := i + 1
    *i := t
    if i<n goto L1 else L2
L2: x := t

Example: induction variable
L0: i := 0          /* s=0; */
    s := 0          /* for(i=0;i<100;i++) */
    jump L2         /*   s += a[i]; */
L1: t1 := i*4
    t2 := a+t1
    t3 := *t2
    s := s + t3
    i := i+1
L2: if i < 100 goto L1 else goto L3
L3: ...

Example: induction variable
L0: i := 0          /* s=0; */
    s := 0          /* for(i=0;i<100;i++) */
    jump L2         /*   s += a[i]; */
L1: t1 := i*4       ; note: t1 == i*4 at each point in the loop
    t2 := a+t1
    t3 := *t2
    s := s + t3
    i := i+1
L2: if i < 100 goto L1 else goto L3
L3: ...

Example: induction variable
L0: i := 0
    s := 0
    t1 := 0
    jump L2
L1: t2 := a+t1
    t3 := *t2
    s := s + t3
    i := i+1
    t1 := t1+4
L2: if i < 100 goto L1 else goto L3
L3: ...

Example: induction variable
L0: i := 0
    s := 0
    t1 := 0
    jump L2
L1: t2 := a+t1      ; t2 == a+t1 == a+i*4
    t3 := *t2
    s := s + t3
    i := i+1
    t1 := t1+4
L2: if i < 100 goto L1 else goto L3
L3: ...

Example: induction variable
L0: i := 0
    s := 0
    t1 := 0
    t2 := a
    jump L2
L1: t3 := *t2       ; notice t1 no longer used!
    s := s + t3
    i := i+1
    t1 := t1+4
    t2 := t2+4      ; t2 == a+t1 == a+i*4
L2: if i < 100 goto L1 else goto L3
L3: ...

Example: induction variable
L0: i := 0
    s := 0
    t2 := a
    jump L2
L1: t3 := *t2
    s := s + t3
    i := i+1
    t2 := t2+4
L2: if i < 100 goto L1 else goto L3
L3: ...

Example: induction variable
L0: i := 0
    s := 0
    t2 := a
    t5 := t2+400
    jump L2
L1: t3 := *t2
    s := s + t3
    i := i+1
    t2 := t2+4
L2: if t2 < t5 goto L1 else goto L3
L3: ...

Example: induction variable
L0: s := 0
    t2 := a
    t5 := t2+400
    jump L2
L1: t3 := *t2
    s := s + t3
    t2 := t2+4
L2: if t2 < t5 goto L1 else goto L3
L3: ...
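The net effect of the whole sequence above can be sanity-checked with a sketch in a high-level language (Python here; the concrete array and the unit-stride "pointer" model are illustrative assumptions, standing in for the 4-byte address arithmetic):

```python
# Original loop: s = sum of a[0..99], indexing through i.
def sum_indexed(a):
    s = 0
    for i in range(100):
        s += a[i]              # t1 := i*4; t2 := a+t1; t3 := *t2
    return s

# Transformed loop: the index i is gone; a "pointer" p walks the
# array directly, and the bound is a precomputed end position.
def sum_pointer(a):
    s = 0
    p, end = 0, 100            # p plays the role of t2; end of t5 (a+400)
    while p < end:
        s += a[p]
        p += 1                 # t2 := t2+4 (one element per step)
    return s

a = list(range(100))
print(sum_indexed(a) == sum_pointer(a))  # both compute the same sum
```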

Gotta find loops first: What is a loop? We can't just "look" at graphs; we're going to assume some additional structure.
Defn: a loop is a subset S of nodes where:
- there is a distinguished header node h
- you can get from h to any node in S
- you can get from any node in S to h
- there is no edge from a node outside S to any node in S other than h.

Examples: (figures of control-flow graphs, not transcribed)

Consider this three-node graph (a, b, c): does it have a "loop"?

This graph is called irreducible.
- a can't be header: no edge from c or b to it.
- b can't be header: c has an outside edge from a.
- c can't be header: b has an outside edge from a.
According to our definition, there is no loop. But obviously, there's a cycle.

Reducible Flow Graphs
So why did we define loops this way?
- The header gives us a "handle" for the loop, e.g., a good spot for hoisting invariant statements.
- Structured control-flow only produces reducible graphs: graphs where all cycles are loops according to our definition. Java: only reducible graphs. C/C++: goto can produce irreducible graphs.
- Many analyses & loop optimizations depend upon having reducible graphs.

Finding Loops
Defn: node d dominates node n if every path from the start node to n must go through d.
Defn: an edge from n to a dominator d is called a back-edge.
Defn: the natural loop of a back-edge n -> d is the set of nodes x such that d dominates x and there is a path from x to n not including d.
So that's how we find loops!
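The natural-loop definition turns directly into a backward walk over predecessors: collect every node that can reach n without passing through d. A Python sketch (the graph encoding is an illustrative assumption, and it assumes n -> d really is a back-edge, i.e. d dominates n):

```python
def natural_loop(preds, n, d):
    """Natural loop of back-edge n -> d: d, n, and every node that can
    reach n without passing through d, found by walking predecessors
    backward from n and stopping at d."""
    loop = {d, n}
    work = [n]
    while work:
        x = work.pop()
        for p in preds.get(x, []):
            if p not in loop:
                loop.add(p)
                work.append(p)
    return loop

# Tiny CFG: a -> b -> c -> b.  Edge c -> b is a back-edge (b dominates c).
preds = {'b': ['a', 'c'], 'c': ['b']}
print(sorted(natural_loop(preds, 'c', 'b')))  # ['b', 'c']
```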

Example: (figure: a CFG with nodes a-h)
a dominates a,b,c,d,e,f,g,h; b dominates b,c,d,e,f,g,h; c dominates c,e; d dominates d; e dominates e; f dominates f,g,h; g dominates g,h; h dominates h.
Back-edges? f->b, g->a. Loops?

Calculating Dominators:
D[n] : the set of nodes that dominate n.
D[n0] = {n0}
D[n] = {n} ∪ (D[p1] ∩ D[p2] ∩ … ∩ D[pm])   where pred[n] = {p1, p2, …, pm}
It's pretty easy to solve this equation:
- start off assuming D[n0] = {n0} (where n0 is the start node, with no predecessors) and D[n] = all nodes (where n is not the start node)
- iteratively update D[n] based on predecessors until you reach a fixed point.
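That fixed-point iteration can be sketched directly in Python (the graph encoding and names here are illustrative assumptions):

```python
def dominators(succ, n0):
    """Iteratively solve D[n] = {n} ∪ (⋂ over predecessors p of D[p]),
    starting from D[n0] = {n0} and D[n] = all nodes elsewhere."""
    nodes = set(succ) | {s for ss in succ.values() for s in ss}
    preds = {n: set() for n in nodes}
    for n, ss in succ.items():
        for s in ss:
            preds[s].add(n)
    D = {n: set(nodes) for n in nodes}
    D[n0] = {n0}
    changed = True
    while changed:
        changed = False
        for n in nodes - {n0}:
            new = {n} | (set.intersection(*(D[p] for p in preds[n]))
                         if preds[n] else set())
            if new != D[n]:
                D[n], changed = new, True
    return D

# a -> b -> {c, d}, with c -> b closing a cycle
succ = {'a': ['b'], 'b': ['c', 'd'], 'c': ['b'], 'd': []}
D = dominators(succ, 'a')
print(sorted(D['c']))  # ['a', 'b', 'c']
```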

Representing Dominators
We don't actually need to keep around the set of all dominators for each node. Instead, we construct a dominator tree.
If both d and e dominate n, then either d dominates e or vice versa. That tells us there is a "closest" or immediate dominator.
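Given the dominator sets, the immediate dominator of each node is the one strict dominator that all the others dominate. A small Python sketch (taking precomputed dominator sets as input; the example graph is an illustrative assumption):

```python
def idoms(D, n0):
    """Immediate dominator of n: the strict dominator of n that is
    itself dominated by every other strict dominator of n."""
    idom = {}
    for n, doms in D.items():
        if n == n0:
            continue                     # the start node has no idom
        strict = doms - {n}
        idom[n] = next(d for d in strict
                       if all(o in D[d] for o in strict))
    return idom

# Dominator sets for the straight-line graph a -> b -> c
D = {'a': {'a'}, 'b': {'a', 'b'}, 'c': {'a', 'b', 'c'}}
print(idoms(D, 'a'))  # {'b': 'a', 'c': 'b'}
```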

Example: Immediate Dominator Tree (figure: the CFG with nodes a-h and its immediate dominator tree)

Nested Loops
If loops A & B have headers a & b such that a != b, a dominates b, and all of the nodes in B are a subset of the nodes in A, then we say B is nested within A. We usually concentrate our attention on nested loops first (since we spend more time in them).

Disjoint and Nested Loops
Property: for any two natural loops in a flow graph, one of the following is true:
1. They are disjoint
2. They are nested
3. They have the same header
Eliminate alternative 3: if two loops have the same header and neither is nested within the other, combine all nodes into a single loop.

Loop Preheader
Several optimizations add code before the header. Insert a new basic block (called the preheader) in the CFG to hold this code.

Loop Optimizations
Now we know the loops. Next: optimize these loops.
- Loop invariant code motion
- Strength reduction of induction variables
- Induction variable elimination

Loop Invariant Computation
A definition x := e reaches a control-flow point if there is a path from the assignment to that point that contains no other assignment to x.
An assignment x := v1 ⊕ v2 is invariant for a loop if for both operands v1 & v2 either:
- they are constant, or
- all of their definitions that reach the assignment are outside the loop, or
- only one definition reaches the assignment, and it is loop invariant.
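Because the last clause refers back to invariance, the test is itself a fixed-point computation. A Python sketch (the statement/definition encoding is an illustrative assumption, and "only one reaching definition" is simplified to "only one definition in the loop"):

```python
def invariant_stmts(loop_stmts, loop_defs):
    """Mark statements x := v1 op v2 invariant iff every operand is a
    constant, has no definition inside the loop, or has exactly one
    in-loop definition that is already marked invariant."""
    invariant = set()
    changed = True
    while changed:
        changed = False
        for i, (x, ops) in enumerate(loop_stmts):
            if i in invariant:
                continue
            def ok(v):
                if isinstance(v, int):           # constant operand
                    return True
                defs = loop_defs.get(v, [])      # defining stmt ids in loop
                return (not defs or              # defined only outside loop
                        (len(defs) == 1 and defs[0] in invariant))
            if all(ok(v) for v in ops):
                invariant.add(i)
                changed = True
    return invariant

# stmt 0: t := a + b (a, b defined outside); stmt 1: u := t + i
stmts = [('t', ['a', 'b']), ('u', ['t', 'i'])]
defs = {'t': [0], 'u': [1], 'i': [99]}   # i is redefined in the loop (id 99)
print(invariant_stmts(stmts, defs))  # {0}
```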

Example:
L0: t := 0
L1: i := i + 1
    t := a + b
    *i := t
    if i<n goto L1 else L2
L2: x := t

Calculating Reaching Defn's:
Assign a unique id to each definition. Define defs(x) to be the set of all definitions of the temp x.

Statement              Gen    Kill
d : x := v1 ⊕ v2       {d}    defs(x) - {d}
d : x := v             {d}    defs(x) - {d}
<everything else>      { }    { }

DefIn[n] = DefOut[p1] ∪ … ∪ DefOut[pm]   where Pred[n] = {p1, …, pm}
DefOut[n] = Gen[n] ∪ (DefIn[n] - Kill[n])
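These two equations solve by the same iterate-to-fixed-point recipe as dominators. A Python sketch (the block/definition encoding is an illustrative assumption):

```python
def reaching_defs(blocks, succ, gen, kill):
    """Solve DefIn[n] = ⋃ DefOut[p] over predecessors p, and
    DefOut[n] = Gen[n] ∪ (DefIn[n] - Kill[n]), to a fixed point."""
    preds = {b: [] for b in blocks}
    for b in blocks:
        for s in succ.get(b, []):
            preds[s].append(b)
    IN = {b: set() for b in blocks}
    OUT = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            IN[b] = set().union(*(OUT[p] for p in preds[b]))
            out = gen[b] | (IN[b] - kill[b])
            if out != OUT[b]:
                OUT[b], changed = out, True
    return IN, OUT

# d1: x := 1 in B1;  d2: x := 2 in B2;  B1 -> B2
blocks = ['B1', 'B2']
succ = {'B1': ['B2']}
gen = {'B1': {'d1'}, 'B2': {'d2'}}
kill = {'B1': {'d2'}, 'B2': {'d1'}}
IN, OUT = reaching_defs(blocks, succ, gen, kill)
print(sorted(OUT['B2']))  # ['d2']  (d1 is killed by the redefinition of x)
```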

Hoisting / Code Motion
We would like to hoist invariant computations out of the loop. But this is trickier than it sounds:
- We have already dealt with the problem of where to place the hoisted statements by introducing preheader nodes.
- Even then, we can run into trouble.

Valid Hoisting:
L0: t := 0
L1: i := i + 1
    t := a + b
    *i := t
    if i<n goto L1 else L2
L2: x := t

Valid Hoisting:
L0: t := 0
    t := a + b
L1: i := i + 1
    *i := t
    if i<n goto L1 else L2
L2: x := t

Invalid Hoisting:
L0: t := 0
L1: i := i + 1
    *i := t        ; t's definition below is loop invariant, but
    t := a + b     ; hoisting it conflicts with this use of the old t
    if i<n goto L1 else L2
L2: x := t

Conditions for Safe Hoisting:
An invariant assignment d: x := v1 ⊕ v2 is safe to hoist if:
- d dominates all loop exits at which x is live-out, and
- there is only one definition of x in the loop, and
- x is not live-out at the entry point for the loop (the preheader).

Induction Variables
An induction variable is a variable in a loop whose value is a function of the loop iteration number: v = f(i). In compilers, this is a linear function: f(i) = c*i + d.
Observation: linear combinations of linear functions are linear functions.
Consequence: linear combinations of induction variables are induction variables.

Families of Induction Variables
Basic induction variable: a variable whose only definition in the loop body is of the form i = i + c (where c is loop invariant).
Derived induction variables: each basic induction variable i defines a family of induction variables Fam(i):
- i is in Fam(i)
- k is in Fam(i) if there is only one defn of k in the loop body, and it has the form k = j*c or k = j+c, where j is in Fam(i), c is loop invariant, the only defn of j that reaches the defn of k is in the loop, and there is no defn of i between the defns of j and k.

Induction Variables
    s := 0
    i := 0
L1: if i >= n goto L2
    j := i*4
    k := j+a
    x := *k
    s := s+x
    i := i+1
L2: ...
We can express j & k as linear functions of i:
j = 4*i + 0
k = 4*i + a
where the coefficients are either constants or loop-invariant.

Induction Variables
    s := 0
    i := 0
L1: if i >= n goto L2
    j := i*4
    k := j+a
    x := *k
    s := s+x
    i := i+1
L2: ...
So let's represent them as triples of the form (t, e0, e1), standing for e0 + e1*t:
j = (i, 0, 4)
k = (i, a, 4)
i = (i, 0, 1)

Induction Variables
    s := 0
    i := 0
L1: if i >= n goto L2
    j := i*4
    k := j+a
    x := *k
    s := s+x
    i := i+1
L2: ...
Note that i only changes by the same amount each iteration of the loop. We say that i is a linear induction variable. So it's easy to express the change in j & k.

Induction Variables
    s := 0
    i := 0
L1: if i >= n goto L2
    j := i*4
    k := j+a
    x := *k
    s := s+x
    i := i+1
L2: ...
If i changes by c, then since:
j = 4*i + 0
k = 4*i + a
we know that j & k change by 4*c.

Finding Induction Variables
Scan the loop body to find all basic induction variables.
do
- Scan the loop to find all variables k with one assignment of the form k = j*b, where j is an induction variable with triple <i,c,d>, and make k an induction variable with triple <i,c*b,d*b>.
- Scan the loop to find all variables k with one assignment of the form k = j+/-b, where j is an induction variable with triple <i,c,d>, and make k an induction variable with triple <i,c,d+/-b>.
until no more induction variables are found
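The inner scan can be sketched in Python (the statement encoding is an illustrative assumption; a triple <i,c,d> stands for c*i + d, as in the definition f(i) = c*i + d above, and b is assumed loop invariant):

```python
def find_derived(basic, stmts):
    """Propagate induction-variable triples: stmts maps k to
    ('*'|'+', j, b) for the single assignment to k in the loop body.
    k = j*b turns <i,c,d> into <i,c*b,d*b>; k = j+b into <i,c,d+b>."""
    fam = dict(basic)                    # start from the basic variables
    changed = True
    while changed:
        changed = False
        for k, (op, j, b) in stmts.items():
            if k in fam or j not in fam:
                continue
            i, c, d = fam[j]
            fam[k] = (i, c * b, d * b) if op == '*' else (i, c, d + b)
            changed = True
    return fam

basic = {'i': ('i', 1, 0)}                        # i = 1*i + 0
stmts = {'j': ('*', 'i', 4), 'k': ('+', 'j', 10)} # j := i*4; k := j+10
print(find_derived(basic, stmts))
# i -> ('i', 1, 0), j -> ('i', 4, 0), k -> ('i', 4, 10)
```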

Strength Reduction
For each derived induction variable j of the form (i, e0, e1), make a fresh temp j'.
At the loop preheader, initialize j' to e0.
After each i := i+c, define j' := j' + (e1*c). Note that e1*c can be computed in the preheader (i.e., it's loop invariant).
Replace the unique assignment of j in the loop with j := j'.

Example
    s := 0
    i := 0
    j' := 0
    k' := a
L1: if i >= n goto L2
    j := i*4
    k := j+a
    x := *k
    s := s+x
    i := i+1
L2: ...

Example
    s := 0
    i := 0
    j' := 0
    k' := a
L1: if i >= n goto L2
    j := i*4
    k := j+a
    x := *k
    s := s+x
    i := i+1
    j' := j'+4
    k' := k'+4
L2: ...

Example
    s := 0
    i := 0
    j' := 0
    k' := a
L1: if i >= n goto L2
    j := j'
    k := k'
    x := *k
    s := s+x
    i := i+1
    j' := j'+4
    k' := k'+4
L2: ...
Copy-propagation or coalescing will eliminate the distinction between j/j' and k/k'.

Useless Variables
    s := 0
    i := 0
    j' := 0
    k' := a
L1: if i >= n goto L2
    x := *k'
    s := s+x
    i := i+1
    j' := j'+4
    k' := k'+4
L2: ...
A variable is useless for a loop L if it is not live-out at all exits from L and its only use is in a definition of itself. For example, j' is useless. We can delete useless variables from loops.

Useless Variables
    s := 0
    i := 0
    j' := 0
    k' := a
L1: if i >= n goto L2
    x := *k'
    s := s+x
    i := i+1
    k' := k'+4
L2: ...
DCE will pick up the dead initialization in the preheader.

Almost Useless Variables
    s := 0
    i := 0
    k' := a
L1: if i >= n goto L2
    x := *k'
    s := s+x
    i := i+1
    k' := k'+4
L2: ...
The variable i is almost useless: it would be useless if it weren't used in the comparison. See Appel for how to determine when/how it's safe to rewrite this test in terms of other induction variables in the family of i.

High-Level Loop Optimizations
Require restructuring loops or sets of loops:
- Combining two loops (loop fusion)
- Switching the order of a nested loop (loop interchange)
- Completely changing the traversal order (loop tiling)
These sorts of high-level optimizations usually take place at the AST level (where loop structure is obvious).

Cache Behavior
Most loop transformations target cache behavior, attempting to increase spatial or temporal locality. Locality can be exploited when there is reuse of data (for temporal locality) or recent access of nearby data (for spatial locality). Loops are a good opportunity for this: many loops iterate through matrices or arrays. Consider the matrix-vector multiply example.

Cache Behavior
Loops are a good opportunity for this: many loops iterate through matrices or arrays. Consider the matrix-vector multiply example y = Ax:
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    y[i] += A[i][j] * x[j]
Multiple traversals of the vector x: opportunity for spatial and temporal locality. Regular access to the array A: opportunity for spatial locality.

Loop Fusion
Combine two loops together into a single loop. Why is this useful? Is it always legal?
for (i=1; i<=n; i++)
  c[i] = a[i];
for (i=1; i<=n; i++)
  b[i] = a[i];
becomes
for (i=1; i<=n; i++) {
  c[i] = a[i];
  b[i] = a[i];
}
(figure: array sections a[1:n], b[1:n], c[1:n], showing the reuse of a[1:n])

Loop Interchange
Change the order of a nested loop. This is not always legal: it changes the order in which elements are accessed. Consider matrix-vector multiply when A is stored in column-major order (i.e., each column is stored in contiguous memory):
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    y[i] += A[i][j] * x[j]

Loop Interchange
Change the order of a nested loop. This is not always legal: it changes the order in which elements are accessed. Consider matrix-vector multiply when A is stored in column-major order (i.e., each column is stored in contiguous memory):
for (j = 0; j < N; j++)
  for (i = 0; i < N; i++)
    y[i] += A[i][j] * x[j]

Loop Tiling
Also called loop blocking. Goal: break up a loop into smaller pieces to get spatial & temporal locality. One of the more complex loop transformations. Create new inner loops so the data accessed in the inner loops fits in cache. Also changes iteration order, so it may not be legal.
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    y[i] += A[i][j] * x[j]
becomes
for (ii = 0; ii < N; ii += B)
  for (jj = 0; jj < N; jj += B)
    for (i = ii; i < ii+B; i++)
      for (j = jj; j < jj+B; j++)
        y[i] += A[i][j] * x[j]
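Since tiling only reorders iterations, the tiled loop must compute the same result as the original. A quick Python check (the tile size B and the test matrices are arbitrary choices; min() handles an N that isn't a multiple of B):

```python
def matvec(A, x):
    """Straightforward y = Ax."""
    y = [0.0] * len(A)
    for i in range(len(A)):
        for j in range(len(x)):
            y[i] += A[i][j] * x[j]
    return y

def matvec_tiled(A, x, B=4):
    """Same computation over B x B tiles; only the iteration order
    changes, so the result is identical (up to floating-point
    reassociation, exact here since the values are small integers)."""
    N = len(A)
    y = [0.0] * N
    for ii in range(0, N, B):
        for jj in range(0, N, B):
            for i in range(ii, min(ii + B, N)):
                for j in range(jj, min(jj + B, N)):
                    y[i] += A[i][j] * x[j]
    return y

N = 6
A = [[i + j for j in range(N)] for i in range(N)]
x = [1.0] * N
print(matvec(A, x) == matvec_tiled(A, x))  # True
```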


Loop Optimizations
Loop transformations can have dramatic effects on performance. Transforming loops correctly and automatically is very difficult! Researchers have developed many techniques to determine the legality of loop transformations and to automatically transform loops.