The Polyhedral Model Is More Widely Applicable Than You Think


Mohamed-Walid Benabderrahmane (1), Louis-Noël Pouchet (1,2), Albert Cohen (1), Cédric Bastoul (1)
(1) ALCHEMY group, INRIA Saclay / University of Paris-Sud 11, France
(2) The Ohio State University, USA
March 26, 2010

Introduction (CC 2010)
Motivation: High Level Optimization

Complex program transformations

To exhibit and to exploit parallelism:

    Type                  Implicit/explicit  Extraction
    Instruction pipeline  implicit           hardware + compiler
    Superscalar           implicit           hardware + compiler
    VLIW-EPIC             explicit           compiler
    Vector                explicit           compiler
    Multithreading        explicit           compiler + system

To benefit from data locality:

    Type               Implicit/explicit                    Extraction
    Temporal locality  implicit (except on local memories)  compiler
    Spatial locality   implicit (except on some DSPs)       compiler

Introduction
Finding & Applying Transformations

Very hard in general:
- Which transformations, in which order?
- Is the semantics preserved?
- Is it profitable (performance, energy...)?

Much easier within the scope of the polyhedral model:
- Complex sequences of optimizations in a single step
- Exact data dependence analysis
- Many existing optimizing algorithms
- But restricted to static control codes

Contributions:
- Extending the polyhedral model to handle full functions
- Revisiting the framework to support these extensions
- Demonstrating that codes with data-dependent control flow may benefit from existing techniques, even with conservative dependence approximations

Introduction
Outline

1. The Polyhedral Framework, Principles and Limitations
2. Extending the Polyhedral Model
   - Analysis
   - Transformations
   - Code Generation
3. Experimental Results
4. Conclusion

The Polyhedral Framework
Polyhedral Representation

For each program statement, capture its control and array access semantics through parametrized affine (in)equalities:

1. A domain $\mathcal{D}: A\vec{x} + \vec{a} \geq \vec{0}$, giving the bounds of the enclosing loops
2. A list of access functions $f(\vec{x}) = F\vec{x} + \vec{f}$, describing array references
3. A schedule $\theta(\vec{x}) = T\vec{x} + \vec{t}$, an affine function assigning logical dates to iterations

Example:

    for (i = 1; i <= n; i++)
      for (j = 1; j <= n; j++)
        if (i <= n-j+2)
    S1:   M[2*i+1][i-j+n] = 0;

Iteration domain of S1:

$$
\mathcal{D}_{S1}:\quad
\begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \\ -1 & -1 \end{pmatrix}
\begin{pmatrix} i \\ j \end{pmatrix}
+
\begin{pmatrix} -1 \\ n \\ -1 \\ n \\ n+2 \end{pmatrix}
\geq \vec{0}
$$

Subscript function of M[f(x)]:

$$
f_{S1,M}\begin{pmatrix} i \\ j \end{pmatrix}
=
\begin{pmatrix} 2 & 0 \\ 1 & -1 \end{pmatrix}
\begin{pmatrix} i \\ j \end{pmatrix}
+
\begin{pmatrix} 1 \\ n \end{pmatrix}
$$

Identity schedule:

$$
\theta_{S1}\begin{pmatrix} i \\ j \end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} i \\ j \end{pmatrix}
+
\begin{pmatrix} 0 \\ 0 \end{pmatrix}
$$
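To make the matrix form concrete, the following is a minimal sketch (not taken from the slides) that evaluates the five affine constraints of the iteration domain of S1 and compares them with the original loop nest. The function names `in_domain_S1`, `count_loop_nest`, `count_polyhedron`, and `access_M` are hypothetical, and the bounding box scanned by `count_polyhedron` is an assumption chosen only to be large enough to contain the domain.

```c
#include <stdbool.h>

/* Constraint matrix A of D_S1: each row r encodes A[r][0]*i + A[r][1]*j + a[r] >= 0. */
static const int A[5][2] = { {1, 0}, {-1, 0}, {0, 1}, {0, -1}, {-1, -1} };

/* Returns true when (i, j) satisfies all five affine constraints for parameter n. */
bool in_domain_S1(int i, int j, int n)
{
    const int a[5] = { -1, n, -1, n, n + 2 };  /* parametric constant vector */
    for (int r = 0; r < 5; r++)
        if (A[r][0] * i + A[r][1] * j + a[r] < 0)
            return false;
    return true;
}

/* Counts how many times S1 executes in the original loop nest. */
int count_loop_nest(int n)
{
    int count = 0;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            if (i <= n - j + 2)
                count++;
    return count;
}

/* Counts the integer points of the polyhedron inside an enclosing box
 * (the box [-1, n+3] x [-1, n+3] is an illustrative assumption). */
int count_polyhedron(int n)
{
    int count = 0;
    for (int i = -1; i <= n + 3; i++)
        for (int j = -1; j <= n + 3; j++)
            if (in_domain_S1(i, j, n))
                count++;
    return count;
}

/* Subscript function of M: out = F * (i, j)^T + f with F = [[2, 0], [1, -1]], f = (1, n). */
void access_M(int i, int j, int n, int out[2])
{
    out[0] = 2 * i + 0 * j + 1;
    out[1] = 1 * i - 1 * j + n;
}
```

Both counting functions should agree for any n, since the polyhedron is just another description of the same set of iterations.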

The Polyhedral Framework
Polyhedral Model Constraints

Strict control constraints to be eligible (static control):
- Affine bounds (for)
- Affine conditions (if)

Does it mean that more general codes cannot benefit from a polyhedral compilation framework?

The Polyhedral Framework
Motivating Transformation: Loop Fusion

    // 2strings: count occurrences of two words in the same string
    nb1 = 0;
    for (i = 0; i < size_string - size_word1; i++) {
      match1 = 0;
      while (word1[match1] == string[i+match1] && match1 <= size_word1)
        match1++;
      if (match1 == size_word1)
        nb1++;
    }
    nb2 = 0;
    for (i = 0; i < size_string - size_word2; i++) {
      match2 = 0;
      while (word2[match2] == string[i+match2] && match2 <= size_word2)
        match2++;
      if (match2 == size_word2)
        nb2++;
    }

- Loop fusion would improve data locality
- Tough by hand
- Trivial transformation if expressed in the polyhedral domain
- But while loops and non-static if conditions here...
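As an illustration of what the fused form of 2strings could look like, here is a hand-written sketch, not the output of any tool discussed in the slides. The `struct counts` type and the functions `match_len`, `count_separate`, and `count_fused` are hypothetical names. One deliberate difference from the slide: the inner loop tests the bound first and uses `<`, so that it never reads past the end of the word. The outer loop bounds are kept as on the slide.

```c
#include <string.h>

/* Result pair; the type and field names are illustrative only. */
struct counts { int nb1; int nb2; };

/* Number of leading characters of word that match string starting at position i.
 * The bound is tested before the character comparison. */
static int match_len(const char *word, int size_word, const char *string, int i)
{
    int match = 0;
    while (match < size_word && word[match] == string[i + match])
        match++;
    return match;
}

/* Two separate scans of the string, with the loop bounds used on the slide. */
struct counts count_separate(const char *string, const char *word1, const char *word2)
{
    int size_string = (int)strlen(string);
    int size_word1 = (int)strlen(word1);
    int size_word2 = (int)strlen(word2);
    struct counts c = { 0, 0 };
    for (int i = 0; i < size_string - size_word1; i++)
        if (match_len(word1, size_word1, string, i) == size_word1)
            c.nb1++;
    for (int i = 0; i < size_string - size_word2; i++)
        if (match_len(word2, size_word2, string, i) == size_word2)
            c.nb2++;
    return c;
}

/* Fused scan: one pass over the string, each body guarded by its own bound. */
struct counts count_fused(const char *string, const char *word1, const char *word2)
{
    int size_string = (int)strlen(string);
    int size_word1 = (int)strlen(word1);
    int size_word2 = (int)strlen(word2);
    int n1 = size_string - size_word1;
    int n2 = size_string - size_word2;
    int n = n1 > n2 ? n1 : n2;
    struct counts c = { 0, 0 };
    for (int i = 0; i < n; i++) {
        if (i < n1 && match_len(word1, size_word1, string, i) == size_word1)
            c.nb1++;
        if (i < n2 && match_len(word2, size_word2, string, i) == size_word2)
            c.nb2++;
    }
    return c;
}
```

The fusion is legal here because the two original loops only read `string` and write disjoint counters, so interleaving their iterations cannot change either result.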

The Polyhedral Framework
Revisiting The Polyhedral Framework

1. Program analysis

    for (i = 1; i <= 3; i++)
      for (j = 1; j <= 3; j++)
        A[i+j] = ...

2. Affine transformation

[Figure: the iteration domain plotted on the (i, j) axes, and the transformed domain plotted on the (t, i) axes]

3. Code generation

    for (t = 2; t <= 6; t++)
      for (i = max(1,t-3); i <= min(t-1,3); i++)
        A[t] = ...
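The generated loop nest can be checked against the original one with a small sketch. It assumes the affine transformation is the schedule t = i + j, which is consistent with `A[i+j]` becoming `A[t]` and with the generated bounds, but is not stated explicitly on the slide. The names `scan_original`, `scan_transformed`, and `same_point_set` are hypothetical.

```c
#include <stdbool.h>

#define MAX_POINTS 9  /* the 3 x 3 domain has nine iterations */

static int max_int(int a, int b) { return a > b ? a : b; }
static int min_int(int a, int b) { return a < b ? a : b; }

/* Original nest: records each (i, j) in execution order, returns the point count. */
int scan_original(int points[MAX_POINTS][2])
{
    int count = 0;
    for (int i = 1; i <= 3; i++)
        for (int j = 1; j <= 3; j++) {
            points[count][0] = i;
            points[count][1] = j;
            count++;
        }
    return count;
}

/* Generated nest: scans (t, i) and recovers j from the assumed schedule t = i + j. */
int scan_transformed(int points[MAX_POINTS][2])
{
    int count = 0;
    for (int t = 2; t <= 6; t++)
        for (int i = max_int(1, t - 3); i <= min_int(t - 1, 3); i++) {
            points[count][0] = i;
            points[count][1] = t - i;
            count++;
        }
    return count;
}

/* True when both lists have the same length and every point of a occurs in b.
 * Each scan visits distinct points, so this implies the two sets are equal. */
bool same_point_set(int a[][2], int na, int b[][2], int nb)
{
    if (na != nb)
        return false;
    for (int p = 0; p < na; p++) {
        bool found = false;
        for (int q = 0; q < nb; q++)
            if (a[p][0] == b[q][0] && a[p][1] == b[q][1])
                found = true;
        if (!found)
            return false;
    }
    return true;
}
```

The two scans visit the same nine iterations in a different order: the transformed nest groups together all iterations that touch the same element `A[t]`.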

Revisiting the Polyhedral Framework: Analysis
Extension to while Loops

Extend iteration domain to support predication tags:
- (Virtually) convert while loops into infinite for loops
- Tag statement iteration domains with exit predicates

(a) Original code:

    while (condition)
      S();

(b) Equivalent code:

    for (i = 0;; i++) {
      ep = condition;
      if (ep)
        S();
      else
        break;
    }

(c) Iteration domain of S: $\{\, i \geq 0,\ (ep = \text{condition}) \,\}$
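The equivalence between forms (a) and (b) can be tried on a concrete loop. In this sketch the condition and the statement S are illustrative choices that do not come from the slides: the condition is "the current element is non-zero" and S accumulates that element. The input array is assumed to be zero-terminated, and the function names are hypothetical.

```c
/* Form (a): while (condition) S(); */
int sum_until_zero_while(const int *v)
{
    int sum = 0;
    int k = 0;
    while (v[k] != 0) {   /* condition */
        sum += v[k];      /* S() */
        k++;
    }
    return sum;
}

/* Form (b): infinite for loop with an explicit exit predicate ep. */
int sum_until_zero_for(const int *v)
{
    int sum = 0;
    for (int i = 0;; i++) {
        int ep = (v[i] != 0);   /* exit predicate evaluation */
        if (ep)
            sum += v[i];        /* S(), executed only while ep holds */
        else
            break;
    }
    return sum;
}
```

In form (b) the while loop counter becomes an ordinary iterator `i` with a lower bound and no upper bound, which is what lets the statement be given an iteration domain.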

Revisiting the Polyhedral Framework: Analysis
Extension to Non-Static if Conditionals

Extend iteration domain to support predication tags:
- Tag statement iteration domains with control predicates

(a) Original code:

    for (i = 0; i < N; i++)
      if (condition)
        S();

(b) Equivalent code:

    for (i = 0; i < N; i++) {
      cp = condition;
      if (cp)
        S();
    }

(c) Iteration domain of S: $\{\, i \geq 0,\ i < N,\ (cp = \text{condition}) \,\}$

Revisiting the Polyhedral Framework: Analysis
A Conservative Approach

Problem: exact data dependence analysis is not always possible.

Conservative escape: it is safe to consider extra dependences.
- Non-static control is over-approximated (predicates are considered always true)
- Non-static references are over-approximated (e.g. arrays are considered as single variables)
- Predicate evaluations are considered as plain statements
- Predicated statements depend on their predicate definitions

This is sufficient for data dependence analysis, but not for some more advanced analyses (see paper).

Revisiting the Polyhedral Framework: Analysis
A Conservative Approach: Example (Outer Product Kernel)

Original kernel:

    for (i = 0; i < N; i++) {
      if (x[i] == 0) {
        for (j = 0; j < M; j++) {
          A[i][j] = 0;
        }
      } else {
        for (j = 0; j < M; j++) {
          A[i][j] = x[i] * y[j];
        }
      }
    }

Control predication:

    for (i = 0; i < N; i++) {
      cp = (x[i] == 0);
      for (j = 0; j < M; j++) {
        if (cp) {
          A[i][j] = 0;
        }
      }
      for (j = 0; j < M; j++) {
        if (!cp) {
          A[i][j] = x[i] * y[j];
        }
      }
    }

Abstract program for data dependence analysis:

    for (i = 0; i < N; i++) {
      S0: Write = {cp}, Read = {x[i]}
      for (j = 0; j < M; j++) {
        S1: Write = {A[i][j]}, Read = {cp}
      }
      for (j = 0; j < M; j++) {
        S2: Write = {A[i][j]}, Read = {x[i], y[j], cp}
      }
    }
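The original and predicated forms of the outer product kernel can be compared directly. The sketch below stores A as a row-major array of N x M doubles, which is an assumption made to keep the functions self-contained; the function names `outer_original` and `outer_predicated` are hypothetical.

```c
/* Original kernel: the data-dependent if encloses the two inner loops.
 * A holds n * m doubles in row-major order. */
void outer_original(int n, int m, const double *x, const double *y, double *A)
{
    for (int i = 0; i < n; i++) {
        if (x[i] == 0) {
            for (int j = 0; j < m; j++)
                A[i * m + j] = 0;
        } else {
            for (int j = 0; j < m; j++)
                A[i * m + j] = x[i] * y[j];
        }
    }
}

/* Predicated kernel: the condition is evaluated once into cp, and each
 * statement is guarded individually inside a loop with static bounds. */
void outer_predicated(int n, int m, const double *x, const double *y, double *A)
{
    for (int i = 0; i < n; i++) {
        int cp = (x[i] == 0);            /* control predicate evaluation */
        for (int j = 0; j < m; j++)
            if (cp)
                A[i * m + j] = 0;
        for (int j = 0; j < m; j++)
            if (!cp)
                A[i * m + j] = x[i] * y[j];
    }
}
```

After predication every loop has affine bounds, and the only non-static part left is the guard on each statement, which is what the abstract program treats as a read of `cp`.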

Revisiting the Polyhedral Framework: Transformations
Transformation Expressiveness Recovery

Problem: manipulating unbounded domains is not easy (how can while loops be distributed with a one-dimensional schedule?)

Solution: an artificial parameter, w (meaning ω, or while)
- The upper bound of every unbounded loop is w
- w is strictly greater than all upper bounds
- w is only used in affine transformations
- w is removed during the code generation process

w allows any existing polyhedral transformation technique to be used in the extended model.

Revisiting the Polyhedral Framework: Code Generation
Quilleré-Rajopadhye-Wilde Algorithm

- Direct use of polyhedral operations [Quilleré et al. IJPP00]
- Depth recursion with direct optimization of conditionals:
  - Projection onto outer dimensions
  - Separation into disjoint polyhedra

[Figure: iteration domains of S1 and S2 plotted on the (i, j) axes, each axis running from 1 to n]

    for (i = 1; i <= 6; i += 2) {
      for (j = 1; j <= 7-i; j++) {
        S1(i, j);
        S2(i, j);
      }
      for (j = 8-i; j <= n; j++)
        S1(i, j);
    }
    for (i = 7; i <= n; i += 2)
      for (j = 1; j <= n; j++)
        S1(i, j);

Revisiting the Polyhedral Framework: Code Generation
Predicate Post-Processing

- Usual QRW code generation for predicated domains
- Exit and control predicates are post-processed
- The target code is modified according to the situation
- Post-pass insertion of predicate evaluations

First scenario: same exit predicates

(a) Intermediate code:

    for (i = 0; i < w; i++) {
      S1();  {ep1}
      S2();  {ep1}
    }

(b) Post-processed code:

    while (ep1) {
      S1();
      S2();
    }

Second scenario: different exit predicates

(a) Intermediate code:

    for (i = 0; i < w; i++) {
      S1();  {ep1}
      S2();  {ep2}
    }

(b) Post-processed code:

    while (ep1 && ep2) {
      S1();
      S2();
    }
    while (ep1)
      S1();
    while (ep2)
      S2();

Third scenario: exit predicate inside a regular loop

(a) Intermediate code:

    for (i = 0; i < w; i++) {
      S1();
      S2();  {ep1}
    }

(b) Post-processed code:

    stop1 = 0;
    for (i = 0; i < N; i++) {
      S1();
      if (ep1 && !stop1)
        S2();
      else
        stop1 = 1;
    }

Additional optimizations:
- Hoisting predicate evaluations
- Privatization of predicate variables
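The second scenario (different exit predicates) can be exercised with a toy program. Everything concrete in this sketch is an assumption made for illustration: the `struct state` type, the two counters, and the choice of exit predicates "counter below its limit". S1 and S2 are taken to be independent of each other, which is what makes running them in the same loop legal.

```c
/* Hypothetical program state: two independent counters with their own limits. */
struct state { int a; int b; int limit_a; int limit_b; };

static int ep1(const struct state *s) { return s->a < s->limit_a; }  /* exit predicate of S1 */
static int ep2(const struct state *s) { return s->b < s->limit_b; }  /* exit predicate of S2 */
static void S1(struct state *s) { s->a++; }
static void S2(struct state *s) { s->b++; }

/* Reference behaviour: two separate while loops, one per statement. */
struct state run_reference(int limit_a, int limit_b)
{
    struct state s = { 0, 0, limit_a, limit_b };
    while (ep1(&s))
        S1(&s);
    while (ep2(&s))
        S2(&s);
    return s;
}

/* Post-processed form of the fused loop: a common part that runs while both
 * predicates hold, followed by one remainder loop per statement. */
struct state run_post_processed(int limit_a, int limit_b)
{
    struct state s = { 0, 0, limit_a, limit_b };
    while (ep1(&s) && ep2(&s)) {
        S1(&s);
        S2(&s);
    }
    while (ep1(&s))
        S1(&s);
    while (ep2(&s))
        S2(&s);
    return s;
}
```

At most one of the two remainder loops does any work, depending on which exit predicate becomes false first.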

Experimental Results

State-of-the-art polyhedral optimization techniques applied to (partially) irregular programs:
- LeTSeE [Pouchet et al. PLDI08]
- Pluto [Bondhugula et al. PLDI08]

                Speedup regular    Speedup extended   Compilation time penalty
    Benchmark   LeTSeE   Pluto     LeTSeE   Pluto     LeTSeE   Pluto
    2strings    N/A      N/A       1.18     1         N/A      N/A
    Sat-add     1        1.08      1.51     1.61      1.22     1.35
    QR          1.04     1.09      1.04     8.66      9.56     2.10
    ShortPath   N/A      N/A       1.53     5.88      N/A      N/A
    TransClos   N/A      N/A       1.43     2.27      N/A      N/A
    Givens      1        1         1.03     7.02      21.23    15.39
    Dither      N/A      N/A       1        5.42      N/A      N/A
    Svdvar      1        3.54      1        3.82      1.93     1.33
    Svbksb      1        1         1        1.96      2        1.66
    Gauss-J     1        1.46      1        1.77      2.51     1.22
    PtIncluded  1        1         1        1.44      10.12    1.44

Setup: Intel Core 2 Quad Q6600
Backend compiler (and baseline): ICC 11.0, invoked as icc -fast -parallel -openmp

Conclusion

The limitation to static control programs is mostly artificial.

Slight and natural extension to consider irregular codes:
- Infinite loops plus exit and control predication
- w parameter to preserve affine schedule expressiveness
- Code generation with predicate support

Benefit from unmodified existing techniques for both analysis and optimization:
- Currently rely on a conservative dependence analysis

New extensions should be investigated:
- Minimizing the conservative aspects (inspection and speculation for control dependences)
- Designing optimizations in the context of full functions (algorithmic complexity issues)