Autotuning. John Cavazos. University of Delaware.


Autotuning John Cavazos University of Delaware

What is Autotuning? Searching for the best code parameters, code transformations, system configuration settings, etc. The search can be quasi-intelligent (genetic algorithms, hill climbing) or random (which is often quite good!).

Parameters to tune exist at every layer of the stack: application, compiler, runtime system, operating system, virtualization, and hardware.

Traditional Compilers. A one-size-fits-all approach, tuned for average performance; aggressive optimizations are often turned off, and the target is hard to model analytically.

Solution: Random Search! Identify a large set of optimizations to search over. Some optimizations require parameter values, so search over those values as well. This approach out-performs a state-of-the-art compiler.

Optimization Sequence Representation. Use a random number generator to construct the sequence. Example: -LNO:interchange=1:prefetch=2:blocking_size=32:fusion=1: Each of these parameter values is generated with a random number generator. Note: a range of interesting values must be defined a priori.
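A minimal sketch of this idea in C is shown below. The parameter names and value ranges mirror the -LNO example above but are otherwise illustrative assumptions; a real autotuner would pass the generated string to the compiler, run the benchmark, and record its execution time.

  /* Sketch: build a random optimization sequence such as
     "-LNO:interchange=1:prefetch=2:blocking_size=32:fusion=1".
     Parameter names and ranges are illustrative; define the ranges
     of interesting values a priori. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>

  int main(void) {
      struct { const char *name; int lo, hi; } params[] = {
          { "interchange",   0,   1 },
          { "prefetch",      0,   3 },
          { "blocking_size", 8, 128 },
          { "fusion",        0,   2 },
      };
      char seq[256] = "-LNO";
      char buf[64];
      srand((unsigned)time(NULL));
      for (size_t i = 0; i < sizeof params / sizeof params[0]; i++) {
          int v = params[i].lo + rand() % (params[i].hi - params[i].lo + 1);
          snprintf(buf, sizeof buf, ":%s=%d", params[i].name, v);
          strncat(seq, buf, sizeof seq - strlen(seq) - 1);
      }
      printf("%s\n", seq);   /* hand this string to the compiler driver */
      return 0;
  }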

Case Study: Random vs. State-of-the-Art. PathScale compiler, compared against its highest optimization level; 121 compiler flags; AMD Athlon processor (a real machine, not simulation); 57 benchmarks: SPEC (INT 95, INT/FP 2000), MiBench, and Polyhedral.

Evaluated Search Strategies. RAND: randomly select 500 optimization sequences. Combined Elimination [CGO 2006]: a pure search technique that evaluates optimizations one at a time and eliminates the negative ones in one go; it out-performed other pure search techniques. PC Model [CGO 2007]: a machine-learning model that uses performance counters, mapping counter values to good optimizations.

Performance vs. Evaluations: Random (17%), PC Model (17%), Combined Elimination (12%).

Some recommendations. Use small input sizes that are representative, but be careful: tuning on small inputs may not give you the best performance on regular (larger) inputs. Reduce the application to its most important kernel(s) and tune those. If kernels can be mapped to highly-tuned library implementations, use those!

Some recommendations. No optimization search will help a bad algorithm: choose the correct algorithm first!

Using quasi-intelligent search. You can easily set up a genetic algorithm or hill climbing to search the optimization space (we can help you set this up), but random search often does just as well!
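As a concrete illustration, here is a minimal hill-climbing sketch in C over a set of on/off optimization flags. The evaluate() function is a stand-in assumption, not part of any real tool: in practice it would compile the benchmark with the selected flags, run it, and return the measured execution time.

  /* Hill climbing over on/off optimization flags (sketch). */
  #include <stdio.h>

  #define NFLAGS 8

  /* Stand-in cost model for demonstration only: pretends some flags help
     and some hurt. Replace with compile-run-measure in a real tuner. */
  static double evaluate(const int flags[NFLAGS]) {
      static const double effect[NFLAGS] =
          { -0.3, 0.1, -0.2, 0.05, -0.1, 0.2, -0.05, 0.0 };
      double t = 10.0;
      for (int i = 0; i < NFLAGS; i++)
          if (flags[i]) t += effect[i];
      return t;
  }

  int main(void) {
      int flags[NFLAGS] = { 0 };            /* start with everything off */
      double best = evaluate(flags);
      int improved = 1;
      while (improved) {                    /* stop at a local optimum */
          improved = 0;
          for (int i = 0; i < NFLAGS; i++) {
              flags[i] ^= 1;                /* flip one flag */
              double t = evaluate(flags);
              if (t < best) { best = t; improved = 1; }
              else          { flags[i] ^= 1; }   /* undo a bad move */
          }
      }
      printf("best predicted time: %.2f\n", best);
      return 0;
  }

A genetic algorithm would instead keep a population of flag settings and recombine the best ones; random search simply evaluates independently drawn settings and keeps the best one seen.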

Performance counters can be used to narrow the search to a particular set of optimizations. Lots of cache misses may call for loop restructuring, e.g., blocking; lots of resource stalls may call for instruction scheduling.

Most Informative Perf Counters 1. L1 Cache Accesses 2. L1 Dcache Hits 3. TLB Data Misses 4. Branch Instructions 5. Resource Stalls 6. Total Cycles 7. L2 Icache Hits 8. Vector Instructions 9. L2 Dcache Hits 10. L2 Cache Accesses 11. L1 Dcache Accesses 12. Hardware Interrupts 13. L2 Cache Hits 14. L1 Cache Hits 15. Branch Misses
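If you want to measure a few of these counters around your own kernel, one option is the PAPI library. The sketch below assumes PAPI's classic high-level calls (PAPI_start_counters / PAPI_stop_counters), which exist in PAPI 5.x but were replaced by a different high-level API in later releases; treat it as a sketch under that assumption, and link with -lpapi.

  /* Read a few hardware counters around a kernel with classic PAPI. */
  #include <stdio.h>
  #include <papi.h>

  #define N (1 << 20)
  static double a[N], b[N];

  int main(void) {
      int events[3] = { PAPI_L1_DCM, PAPI_RES_STL, PAPI_TOT_CYC };
      long long values[3];

      if (PAPI_start_counters(events, 3) != PAPI_OK) {
          fprintf(stderr, "PAPI_start_counters failed\n");
          return 1;
      }
      for (int i = 0; i < N; i++)          /* the kernel being measured */
          a[i] = a[i] + b[i];
      if (PAPI_stop_counters(values, 3) != PAPI_OK) {
          fprintf(stderr, "PAPI_stop_counters failed\n");
          return 1;
      }
      printf("L1 D-cache misses: %lld\n", values[0]);
      printf("resource stalls:   %lld\n", values[1]);
      printf("total cycles:      %lld\n", values[2]);
      return 0;
  }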

Dependence Analysis and Loop Transformations John Cavazos University of Delaware

Lecture Overview Very Brief Introduction to Dependences Loop Transformations

The Big Picture. What are our goals? The simple goal: make execution time as small as possible. This leads us to try to execute many (all, in the best case) instructions in parallel, and therefore to find independent instructions.

Dependences. We will concentrate on data dependences. A simple example of data dependence:
  S1: PI = 3.14
  S2: R = 5.0
  S3: AREA = PI * R ** 2
Statement S3 cannot be moved before either S1 or S2 without compromising correct results.

Dependences. Formally: there is a data dependence from statement S1 to statement S2 (S2 depends on S1) if: 1. both statements access the same memory location and at least one of them stores to it, and 2. there is a feasible run-time execution path from S1 to S2.

Load-Store Classification. A quick review of dependences classified in terms of load-store order: 1. true dependence (RAW hazard), 2. antidependence (WAR hazard), 3. output dependence (WAW hazard).
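A minimal straight-line C example of the three classes (the variable names are arbitrary):

  /* The three load-store dependence classes in straight-line code. */
  int a, b, c, d, e;

  void deps(void) {
      a = b + c;   /* S1 */
      d = a * 2;   /* S2: true dependence on S1 (RAW): S2 reads a after S1 writes it */
      b = d + 1;   /* S3: antidependence on S1 (WAR): S3 writes b, which S1 read */
      a = e + 4;   /* S4: output dependence on S1 (WAW): S4 rewrites a, which S1 wrote */
  }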

Dependence in Loops. Let us look at two different loops:
  DO I = 1, N
    S1: A(I+1) = A(I) + B(I)
  ENDDO
and
  DO I = 1, N
    S1: A(I+2) = A(I) + B(I)
  ENDDO
In both cases, statement S1 depends on itself.

Transformations. We call a transformation safe if the transformed program has the same "meaning" as the original program. But what is the "meaning" of a program? For our purposes, two computations are equivalent if, on the same inputs, they produce the same outputs in the same order.

Reordering Transformations. A reordering transformation is any program transformation that changes the order of execution of the code without adding or deleting any executions of any statements.

Properties of Reordering Transformations. A reordering transformation does not eliminate dependences; however, it can change the ordering of a dependence, which would lead to incorrect behavior. A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence.

Loop Transformations. Compilers have always focused on loops: they have higher execution counts and repeated, related operations, and much of the real work takes place in loops.

Several effects to attack. Overhead: decrease the control-structure cost per iteration. Locality: spatial locality (use of co-resident data) and temporal locality (reuse of the same data). Parallelism: execute independent iterations of the loop in parallel.

Eliminating Overhead. Loop unrolling (the oldest trick in the book): to reduce overhead, replicate the loop body.
  do i = 1 to 100 by 1
    a(i) = a(i) + b(i)
  end
becomes (unroll by 4)
  do i = 1 to 100 by 4
    a(i) = a(i) + b(i)
    a(i+1) = a(i+1) + b(i+1)
    a(i+2) = a(i+2) + b(i+2)
    a(i+3) = a(i+3) + b(i+3)
  end
Sources of improvement: less overhead per useful operation, and longer basic blocks for local optimization.

Eliminating Overhead. Loop unrolling with unknown bounds: generate guard loops.
  do i = 1 to n by 1
    a(i) = a(i) + b(i)
  end
becomes (unroll by 4)
  i = 1
  do while (i+3 <= n)
    a(i) = a(i) + b(i)
    a(i+1) = a(i+1) + b(i+1)
    a(i+2) = a(i+2) + b(i+2)
    a(i+3) = a(i+3) + b(i+3)
    i = i + 4
  end
  do while (i <= n)
    a(i) = a(i) + b(i)
    i = i + 1
  end

Eliminating Overhead. One other use for loop unrolling: eliminate copies at the end of a loop.
  t1 = b(0)
  do i = 1 to 100
    t2 = b(i)
    a(i) = a(i) + t1 + t2
    t1 = t2
  end
becomes (unroll + rename)
  t1 = b(0)
  do i = 1 to 100 by 2
    t2 = b(i)
    a(i) = a(i) + t1 + t2
    t1 = b(i+1)
    a(i+1) = a(i+1) + t2 + t1
  end

Loop Unswitching. Hoist invariant control flow out of the loop nest: replicate the loop and specialize it. The result has no tests or branches in the loop body and longer segments of straight-line code.

Loop Unswitching.
  loop
    statements
    if test then
      then part
    else
      else part
    endif
    more statements
  endloop
becomes (unswitch)
  if test then
    loop
      statements
      then part
      more statements
    endloop
  else
    loop
      statements
      else part
      more statements
    endloop
  endif

Loop Unswitching.
  do i = 1 to 100
    a(i) = a(i) + b(i)
    if (expression) then d(i) = 0
  end
becomes (unswitch)
  if (expression) then
    do i = 1 to 100
      a(i) = a(i) + b(i)
      d(i) = 0
    end
  else
    do i = 1 to 100
      a(i) = a(i) + b(i)
    end
  endif

Loop Fusion. Two loops over the same iteration space become one loop. This is safe if it does not change the values used or defined by any statement in either loop (i.e., it does not violate dependences).
  do i = 1 to n
    c(i) = a(i) + b(i)
  end
  do j = 1 to n
    d(j) = a(j) * e(j)
  end
becomes (fuse)
  do i = 1 to n
    c(i) = a(i) + b(i)
    d(i) = a(i) * e(i)
  end
For big arrays, a(i) may no longer be in the cache when the second loop reads it; after fusion, a(i) will be found in the cache.

Loop Fusion Advantages. Enhances temporal locality; reduces control overhead; creates longer blocks for local optimization and scheduling; can convert inter-loop reuse to intra-loop reuse.

Loop Fusion of Parallel Loops. Fusing parallel loops is legal if the dependences are loop independent, i.e., the source and target of each flow dependence map to the same loop iteration.

Loop distribution (fission). A single loop with independent statements becomes multiple loops. Start by constructing a statement-level dependence graph. Distribution is safe if there are no cycles in the dependence graph; statements that form a cycle in the dependence graph must be put in the same loop.

Loop distribution (fission). The original loop reads b, c, e, f, h, and k and writes a, d, and g:
  do i = 1 to n
    a(i) = b(i) + c(i)
    d(i) = e(i) * f(i)
    g(i) = h(i) - k(i)
  end
becomes (fission)
  do i = 1 to n
    a(i) = b(i) + c(i)     (reads b & c, writes a)
  end
  do i = 1 to n
    d(i) = e(i) * f(i)     (reads e & f, writes d)
  end
  do i = 1 to n
    g(i) = h(i) - k(i)     (reads h & k, writes g)
  end

Loop distribution (fission).
  (1) for I = 1 to N do
  (2)   A[I] = A[I] + B[I-1]
  (3)   B[I] = C[I-1]*X + C
  (4)   C[I] = 1/B[I]
  (5)   D[I] = sqrt(C[I])
  (6) endfor
This loop has a dependence graph (figure not shown) in which statements (3) and (4) form a cycle: (4) uses B[I] from (3) in the same iteration, and (3) uses C[I-1] from (4) in the previous iteration.

Loop distribution (fission).
  (1) for I = 1 to N do
  (2)   A[I] = A[I] + B[I-1]
  (3)   B[I] = C[I-1]*X + C
  (4)   C[I] = 1/B[I]
  (5)   D[I] = sqrt(C[I])
  (6) endfor
becomes (fission)
  (1) for I = 1 to N do
  (2)   A[I] = A[I] + B[I-1]
  (3) endfor
  (4) for I = 1 to N do
  (5)   B[I] = C[I-1]*X + C
  (6)   C[I] = 1/B[I]
  (7) endfor
  (8) for I = 1 to N do
  (9)   D[I] = sqrt(C[I])
  (10) endfor

Loop Fission Advantages. Enables other transformations, e.g., vectorization; the resulting loops have smaller cache footprints, so more reuse hits in the cache.

Loop Interchange.
  do i = 1 to 50
    do j = 1 to 100
      a(i,j) = b(i,j) * c(i,j)
    end
  end
becomes (interchange)
  do j = 1 to 100
    do i = 1 to 50
      a(i,j) = b(i,j) * c(i,j)
    end
  end
Swap the inner and outer loops to rearrange the iteration space. Effect: improves reuse by using more elements per cache line; the goal is to get as much reuse into the inner loop as possible.

Loop Interchange Effect. If one loop carries all the dependence relations, swap it to the outermost position so that all inner loops can be executed in parallel. If the outer loop iterates many times and the inner loop only a few, swap the outer and inner loops to reduce startup overhead. Interchange improves reuse by using more elements per cache line; the goal is to get as much reuse into the inner loop as possible.

Reordering Loops for Locality. In Fortran's column-major order, a(4,4) would lay out in memory as (1,1) (2,1) (3,1) (4,1) (1,2) (2,2) (3,2) (4,2) (1,3) (2,3) (3,3) (4,3) (1,4) (2,4) (3,4) (4,4). If the inner loop walks along a row, as little as one element of each cache line is used; after interchange, the direction of iteration is changed so that it runs down the cache line. In row-major order, the opposite loop ordering causes the same effects.

Loop permutation. Interchange is the degenerate case of two perfectly nested loops; the more general problem is called permutation. Safety: a permutation is safe iff no data dependences are reversed, i.e., the flow of data from definitions to uses is preserved.

Loop Permutation Effects. Changes the order of access and the order of computation: moving accesses closer in time increases temporal locality; moving computations farther apart covers pipeline latencies.

Strip Mining. Splits a loop into two loops.
  do j = 1 to 100
    do i = 1 to 50
      a(i,j) = b(i,j) * c(i,j)
    end
  end
becomes (strip mine)
  do j = 1 to 100
    do ii = 1 to 50 by 8
      do i = ii to min(ii+7,50)
        a(i,j) = b(i,j) * c(i,j)
      end
    end
  end
Note: this is always safe, but used by itself it is not profitable!

Strip Mining Effects. May slow down the code (an extra loop); enables vectorization.

Loop Tiling (blocking) Want to exploit temporal locality in loop nest.


Loop Tiling Effects. Reduces the volume of data touched between reuses; works on one tile at a time (tile size is B by B); the choice of tile size is crucial.
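A minimal C sketch of a tiled loop nest, here a blocked matrix multiply with a B-by-B tile (this particular kernel, the problem size, and the tile size of 32 are illustrative choices, not taken from the slides):

  /* Tiled (blocked) matrix multiply: C += A * Bm, all N x N.
     One B x B tile of the iteration space is finished before moving on,
     so the working set kept in cache is roughly 3 * B * B elements.
     Assumes N is a multiple of B. */
  #define N 512
  #define B 32          /* tile size: the crucial, machine-dependent choice */

  void matmul_tiled(const double A[N][N], const double Bm[N][N], double C[N][N])
  {
      for (int ii = 0; ii < N; ii += B)
          for (int jj = 0; jj < N; jj += B)
              for (int kk = 0; kk < N; kk += B)
                  /* work on one B x B tile at a time */
                  for (int i = ii; i < ii + B; i++)
                      for (int j = jj; j < jj + B; j++) {
                          double sum = C[i][j];
                          for (int k = kk; k < kk + B; k++)
                              sum += A[i][k] * Bm[k][j];
                          C[i][j] = sum;
                      }
  }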

Scalar Replacement. Register allocators never keep a subscripted value such as c(i) in a register, but we can trick the allocator by rewriting the references. The plan: locate patterns of consistent reuse, make loads and stores use a temporary scalar variable, and replace the references with the temporary's name.

Scalar Replacement.
  do i = 1 to n
    do j = 1 to n
      a(i) = a(i) + b(j)
    end
  end
becomes (scalar replacement)
  do i = 1 to n
    t = a(i)
    do j = 1 to n
      t = t + b(j)
    end
    a(i) = t
  end
Almost any register allocator can get t into a register.

Scalar Replacement Effects. Decreases the number of loads and stores; keeps reused values in names that can be allocated to registers. In essence, this exposes the reuse of a(i) to subsequent passes.