Loop Nest Optimizer of GCC. Sebastian Pop. Avgust, 2006

Similar documents
GRAPHITE: Polyhedral Analyses and Optimizations

Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time

The New Framework for Loop Nest Optimization in GCC: from Prototyping to Evaluation

The Polyhedral Model Is More Widely Applicable Than You Think

CS671 Parallel Programming in the Many-Core Era

Tiling: A Data Locality Optimizing Algorithm

Autotuning. John Cavazos. University of Delaware UNIVERSITY OF DELAWARE COMPUTER & INFORMATION SCIENCES DEPARTMENT

The Polyhedral Model (Transformations)

Semi-Automatic Composition of Loop Transformations for Deep Parallelism and Memory Hierarchies

Facilitating the Search for Compositions of Program Transformations

Control flow graphs and loop optimizations. Thursday, October 24, 13

Polly Polyhedral Optimizations for LLVM

Tiling: A Data Locality Optimizing Algorithm

Iterative Optimization in the Polyhedral Model: Part II, Multidimensional Time

Introducing the GCC to the Polyhedron Model

UNIVERSITÉ DE PARIS-SUD 11 U.F.R. SCIENTIFIQUE D ORSAY

Polyhedral Compilation Foundations

Compiling for Advanced Architectures

Code Merge. Flow Analysis. bookkeeping

Loop Transformations! Part II!

Review. Loop Fusion Example

Introduction. No Optimization. Basic Optimizations. Normal Optimizations. Advanced Optimizations. Inter-Procedural Optimizations

KernelGen a toolchain for automatic GPU-centric applications porting. Nicolas Lihogrud Dmitry Mikushin Andrew Adinets

Lecture 11 Loop Transformations for Parallelism and Locality

Polyèdres et compilation

UNIVERSITÉ DE PARIS-SUD 11 U.F.R. SCIENTIFIQUE D ORSAY

Polyhedral Operations. Algorithms needed for automation. Logistics

Loop Transformations, Dependences, and Parallelization

Compiler Optimizations. Chapter 8, Section 8.5 Chapter 9, Section 9.1.7

High-Level Loop Optimizations for GCC

The SGI Pro64 Compiler Infrastructure - A Tutorial

Parametric Multi-Level Tiling of Imperfectly Nested Loops*

GCC Internals Control and data flow support

Coarse-Grained Parallelism

TOBIAS GROSSER Parkas Group, Computer Science Department, Ècole Normale Supèrieure / INRIA 45 Rue d Ulm, Paris, 75005, France

Overview of a Compiler

Module 13: INTRODUCTION TO COMPILERS FOR HIGH PERFORMANCE COMPUTERS Lecture 25: Supercomputing Applications. The Lecture Contains: Loop Unswitching

An introduction to DSP s. Examples of DSP applications Why a DSP? Characteristics of a DSP Architectures

Compiler Optimizations. Chapter 8, Section 8.5 Chapter 9, Section 9.1.7

Overview of a Compiler

Overview of a Compiler

CS 293S Parallelism and Dependence Theory

An Overview to. Polyhedral Model. Fangzhou Jiao

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory

Module 18: Loop Optimizations Lecture 36: Cycle Shrinking. The Lecture Contains: Cycle Shrinking. Cycle Shrinking in Distance Varying Loops

Interval Polyhedra: An Abstract Domain to Infer Interval Linear Relationships

More Data Locality for Static Control Programs on NUMA Architectures

Legal and impossible dependences

Simone Campanoni Loop transformations

The Apron Library. Bertrand Jeannet and Antoine Miné. CAV 09 conference 02/07/2009 INRIA, CNRS/ENS

Iterative Optimization in the Polyhedral Model

Supercomputing in Plain English Part IV: Henry Neeman, Director

An Integrated Program Representation for Loop Optimizations

Far Fetched Prefetching? Tomofumi Yuki INRIA Rennes Antoine Morvan ENS Cachan Bretagne Steven Derrien University of Rennes 1

Polyhedral Compilation Foundations

Louis-Noël Pouchet

Auto-Vectorization of Interleaved Data for SIMD

Affine Loop Optimization using Modulo Unrolling in CHAPEL

Enabling more optimizations in GRAPHITE: ignoring memory-based dependences

Array Optimizations in OCaml

Polly - Polyhedral optimization in LLVM

Tiling: A Data Locality Optimizing Algorithm

FADA : Fuzzy Array Dataflow Analysis

Compiling for HSA accelerators with GCC


Multithreading: Exploiting Thread-Level Parallelism within a Processor

Solving the Live-out Iterator Problem, Part I

RECAP. B649 Parallel Architectures and Programming

Revisiting Loop Transformations with X10 Clocks. Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

Barbara Chapman, Gabriele Jost, Ruud van der Pas

Exploring Parallelism At Different Levels

Performance Analysis and Optimization MAQAO Tool

Vector Processors. Kavitha Chandrasekar Sreesudhan Ramkumar

Program transformations and optimizations in the polyhedral framework

ABSTRACT AFFINE LOOP OPTIMIZATION BASED ON MODULO UNROLLING IN CHAPEL

Dynamic Control Hazard Avoidance

Where We Are. Lexical Analysis. Syntax Analysis. IR Generation. IR Optimization. Code Generation. Machine Code. Optimization.

Essential constraints: Data Dependences. S1: a = b + c S2: d = a * 2 S3: a = c + 2 S4: e = d + c + 2

Structured Parallel Programming Patterns for Efficient Computation

A Formally-Verified C static analyzer

A Crash Course in Compilers for Parallel Computing. Mary Hall Fall, L2: Transforms, Reuse, Locality

Polly Documentation. Release 8.0-devel. The Polly Team

Static WCET Analysis: Methods and Tools

MULTI-CORE PROGRAMMING. Dongrui She December 9, 2010 ASSIGNMENT

Affine Loop Optimization Based on Modulo Unrolling in Chapel

Fourier-Motzkin and Farkas Questions (HW10)

Formal verification of a static analyzer based on abstract interpretation

A polyhedral loop transformation framework for parallelization and tuning

Static Single Assignment Form in the COINS Compiler Infrastructure - Current Status and Background -

Loops. Lather, Rinse, Repeat. CS4410: Spring 2013

Compilation for Heterogeneous Platforms

Topic I (d): Static Single Assignment Form (SSA)

Structured Parallel Programming

The Polyhedral Compilation Framework

Hugh Leather, Edwin Bonilla, Michael O'Boyle

The GNU Compiler Collection

Loop Tiling in the Presence of Exceptions

Recap. Practical Compiling for Modern Machines (Special Topics in Programming Languages)

AMD S X86 OPEN64 COMPILER. Michael Lai AMD

A Dynamic Periodicity Detector: Application to Speedup Computation

Transcription:

Loop Nest Optimizer of GCC CRI / Ecole des mines de Paris Avgust, 26

Architecture of GCC and Loop Nest Optimizer C C++ Java F95 Ada GENERIC GIMPLE Analyses aliasing data dependences number of iterations GIMPLE + CFG + SSA + Loops LNO Machine description RTL ASM

Future Plans for the LNO GRAPHITE: extension of linear transforms 2 parallel code generation (via libgomp) 3 machine models and abstract simulators 4 static profitability analyses 5 hybrid analyses (compress static analysis + dynamic part)

Problems with Classical LNO Transforms Motivations for GRAPHITE: source to source modifies the compiled program difficult to undo order of transforms fixed once for all invalidated data deps: ad-hoc correction or rebuild difficult to compose

Problems with Classical LNO Transforms Motivations for GRAPHITE: source to source modifies the compiled program difficult to undo order of transforms fixed once for all invalidated data deps: ad-hoc correction or rebuild difficult to compose solved in WRaP-IT(from 22 at INRIA on ORC/Open64) GRAPHITE = WRaP-IT for GCC

GRAPHITE: Intuitive Idea

GRAPHITE: Intuitive Idea Boat Castle

GRAPHITE: Intuitive Idea Boat Basic Bricks Castle

GRAPHITE: Intuitive Idea Boat Basic Bricks Castle C,C++,F95,... GIMPLE GRAPHITE (programming languages) (basic imperative language) (geometrical language)

GRAPHITE : Representation on Top of Gimple-SSA Statements + parametric affine inequalities a domain = bounds of enclosing loops 2 a list of access functions 3 a schedule = execution time (static + dynamic) for (i=; i<m; i++) for (j=5; j<n; j++) A[2*i][j+] =... 2 6 4 i j m n cst 5 3 7 5 i i + m j 5 j + n

GRAPHITE : Representation on Top of Gimple-SSA Statements + parametric affine inequalities a domain = bounds of enclosing loops 2 a list of access functions 3 a schedule = execution time (static + dynamic) for (i=; i<m; i++) for (j=5; j<n; j++) A[2*i][j+] =... 2 4 i j m n cst 2 3 5 2 i j +

GRAPHITE : Representation on Top of Gimple-SSA Statements + parametric affine inequalities a domain = bounds of enclosing loops 2 a list of access functions 3 a schedule = execution time (static + dynamic) GRAPHITE(, 2, 3) extends LAMBDA(, 2) GRAPHITE: Gimple Represented As Polyhedra (with interchangeable envelopes)

GRAPHITE versus LAMBDA common part: unimodular transform data and iteration order transform regions: extended from loops to SCoP static control parts : sequences, affine conditions and loops GRAPHITE knows about the sequence! enables more loop transforms: fusion, fission, tiling, software pipelining, scheduling

Compose Transforms Small set of primitives (basic operations on matrices) motion 2 interchange 3 strip-mine 4 insert, delete 5 shift Composed transforms fission, fusion: tiling: 2 + 3 6 skew, reversal, reindexing 7 privatize

Optimal Transform? Find sequences of transforms based on size of loops cache misses simulation Automatic selection of transforms amounts to choosing a point in a vector space hard part (open questions) WRaP-IT uses directives

Results From WRaP-IT on Top of PathScale EKOPath swim from SPEC CPU2 32% speedup on AthlonXP wrt. peak EKOPath (V2.) 38% speedup for Athlon64 wrt. peak EKOPath (V2.) principal SCoP: 42 lines of code apply 3 transforms to principal SCoP fusion, tiling, peeling, unrolling, interchange, strip-mining result 2267 LOC 39 sec source to assembly on AthlonXP 2.8GHz 22 sec in the backend 2 sec polyhedral data deps 4 sec polyhedral code gen

Static Estimation of Runtime Properties How hard is it to simulate a processor? DSP: almost deterministic superscalar: hard to predict processor transforms VLIW: hard to predict compilers future decisions Need to simulate exact behavior?

Static Estimation of Runtime Properties How hard is it to simulate a processor? DSP: almost deterministic superscalar: hard to predict processor transforms VLIW: hard to predict compilers future decisions Need to simulate exact behavior? No! Idea: abstract simulation.

Abstract Simulation Program Semantics + Precise Machine Description Simulator

Abstract Simulation Program Semantics Abstraction + Precise Machine Description Abstraction Simulator Abstract Program + Abstract Machine Abstract simulator

Hybrid Analyses (Static + Dynamic) Properties for validating a transform: Static decidable Dynamic decidable collect failed static problems symbolically compress When static analysis fails, instrument code (instantiate at run time) code generation problems (code size + completing static analysis overhead)

GRAPHITE: Road Map select SCoPs filter out difficult codes (Alexandru Plesco) 2 extend LAMBDA build schedule functions, GLooG 3 cost models more static analyzers, and transform selection 4 array regions improve data deps in interproc mode 5 lib integration PolyLib, PiPLib, Omega, lib-apron 2 From GIMPLE GIMPLE Generation GIMPLE Data Dependences 4 Array Regions GRAPHITE 3 3 Transform Selection Cost Models GIMPLE 5 Numerical Domains Common Interface Omega PolyLib PIPlib Octagons Intervals Congruences

Questions?

lib-apron: interchange envelopes limit computation complexity = restrict expressivity use coarser representations Octagons Polyhedra Boxes (4 constraints) (n constraints) (8 constraints)