Automatic Polyhedral Optimization of Stencil Codes

1 Automatic Polyhedral Optimization of Stencil Codes
ExaStencils 2014
Stefan Kronawitter, Armin Größlinger, Christian Lengauer

2 The Need for Different Optimizations
3D 1st-grade Jacobi smoother
[Figure: speedup over number of threads, for Ivy Bridge (6 MB cache) and BlueGene/Q (32 MB cache); curves: Naive Implementation, Basic Transformations, Temporal Blocking, Temp&Spatial Blocking]

3 Why Use the Polyhedron Model?
Transformations / optimizations: simple if considered in isolation, but rather complex in combination.
Benefits of the polyhedron model: easy composition of transformations; correctness, even on the boundary.

7 Manual Transformations?
Input: fdtd-2d
  for (t = 0; t < tmax; t++) {
    for (j = 0; j < ny; j++)
      ey[0][j] = _edge_[t];
    for (i = 1; i < nx; i++)
      for (j = 0; j < ny; j++)
        ey[i][j] = ey[i][j] - 0.5*(hz[i][j]-hz[i-1][j]);
    for (i = 0; i < nx; i++)
      for (j = 1; j < ny; j++)
        ex[i][j] = ex[i][j] - 0.5*(hz[i][j]-hz[i][j-1]);
    for (i = 0; i < nx - 1; i++)
      for (j = 0; j < ny - 1; j++)
        hz[i][j] = hz[i][j] - 0.7*(ex[i][j+1] - ex[i][j] + ey[i+1][j] - ey[i][j]);
  }

8 Automatic Transformations! Optimized: fdtd-2d tiled [code from Luis-Noël Pouchet] for (c0 = 0; c0 <= (((ny + 2 * tmax + -3) * 32 < 0?((32 < 0?-((-(ny + 2 * tmax + -3) ) / 32) : -((-(ny + 2 * tmax + -3) ) / 32))) : (ny + 2 * tmax + -3) / 32)); ++c0) { #pragma omp parallel for private(c3, c4, c2, c5) for (c1 = (((c0 * 2 < 0?-(-c0 / 2) : ((2 < 0?(-c ) / -2 : (c ) / 2)))) > (((32 * c0 + -tmax + 1) * 32 < 0?-(-(32 * c0 + -tmax + 1) / 32) : ((32 < 0?(-(32 * c0 + -tmax + 1) ) / -32 : (32 * c0 + -tmax ) / 32))))?((c0 * 2 < 0?-(-c0 / 2) : ((2 < 0?(-c ) / -2 : (c ) / 2)))) : (((32 * c0 + -tmax + 1) * 32 < 0?-(-(32 * c0 + -tmax + 1) / 32) : ((32 < 0?(-(32 * c0 + -tmax + 1) ) / -32 : (32 * c0 + -tmax ) / 32))))); c1 <= (((((((ny + tmax + -2) * 32 < 0?((32 < 0?-((-(ny + tmax + -2) ) / 32) : -((-(ny + tmax + -2) ) / 32))) : (ny + tmax + -2) / 32)) < (((32 * c0 + ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) ) / 64) : -((-(32 * c0 + ny + 30) ) / 64))) : (32 * c0 + ny + 30) / 64))?(((ny + tmax + -2) * 32 < 0?((32 < 0?-((-(ny + tmax + -2) ) / 32) : -((-(ny + tmax + -2) ) / 32))) : (ny + tmax + -2) / 32)) : (((32 * c0 + ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) ) / 64) : -((-(32 * c0 + ny + 30) ) / 64))) : (32 * c0 + ny + 30) / 64)))) < c0?(((((ny + tmax + -2) * 32 < 0?((32 < 0?-((-(ny + tmax + -2) ) / 32) : -((-(ny + tmax + -2) ) / 32))) : (ny + tmax + -2) / 32)) < (((32 * c0 + ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) ) / 64) : -((-(32 * c0 + ny + 30) ) / 64))) : (32 * c0 + ny + 30) / 64))?(((ny + tmax + -2) * 32 < 0?((32 < 0?-((-(ny + tmax + -2) ) / 32) : -((-(ny + tmax + -2) ) / 32))) : (ny + tmax + -2) / 32)) : (((32 * c0 + ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) ) / 64) : -((-(32 * c0 + ny + 30) ) / 64))) : (32 * c0 + ny + 30) / 64)))) : c0)); ++c1) { for (c2 = c0 + -c1; c2 <= (((((tmax + nx + -2) * 32 < 0?((32 < 0?-((-(tmax + nx + -2) ) / 32) : -((-(tmax + nx + -2) ) / 32))) : (tmax + nx + -2) / 32)) < (((32 * c * c1 + nx + 30) * 32 < 
0?((32 < 0?-((-(32 * c * c1 + nx + 30) ) / 32) : -((-(32 * c * c1 + nx + 30) ) / 32))) : (32 * c * c1 + nx + 30) / 32))?(((tmax + nx + -2) * 32 < 0?((32 < 0?-((-(tmax + nx + -2) ) / 32) : -((-(tmax + nx + -2) ) / 32))) : (tmax + nx + -2) / 32)) : (((32 * c * c1 + nx + 30) * 32 < 0?((32 < 0?-((-(32 * c * c1 + nx + 30) ) / 32) : -((-(32 * c * c1 + nx + 30) ) / 32))) : (32 * c * c1 + nx + 30) / 32)))); ++c2) { if (c0 == 2 * c1 && c0 == 2 * c2) { for (c3 = 16 * c0; c3 <= ((tmax + -1 < 16 * c0 + 30?tmax + -1 : 16 * c0 + 30)); ++c3) if (c0 % 2 == 0) (ey[0])[0] = (_edge_[c3]);... (200 more lines!) 4 / 20

13 Polyhedron Model
Source loop nest:
  for (i=1; i<=n; ++i)
    for (j=1; j<=n-i+1; ++j)
      A[i][j] = A[i-1][j] + A[i][j-1];
Iteration domain: $1 \le i \le n$, $1 \le j \le n - i + 1$
Dependences: $(i, j) \to (i + 1, j)$ and $(i, j) \to (i, j + 1)$
Transformation: $t = i + j - 1$, $p = j$
Transformed domain: $1 \le t \le n$, $1 \le p \le t$
Transformed dependences: $(t, p) \to (t + 1, p)$ and $(t, p) \to (t + 1, p + 1)$
Target loop nest:
  for (t=1; t<=n; ++t)
    #pragma omp parallel for
    for (p=1; p<=t; ++p)
      A[t-p+1][p] =...;
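The transformation on this slide can be checked with a small experiment. The following is a minimal sketch, assuming a fixed size `N`, integer data and arbitrary boundary values in row 0 and column 0; the function names and the initialisation are hypothetical and not taken from the talk. It runs the source nest and the skewed nest (with $i = t - p + 1$, $j = p$) and compares the results.

```c
#include <string.h>

#define N 8 /* hypothetical problem size, chosen only for this sketch */

/* Fill row 0 and column 0 with arbitrary boundary values, zero elsewhere. */
static void init(long A[N + 2][N + 2]) {
    memset(A, 0, sizeof(long) * (N + 2) * (N + 2));
    for (int k = 0; k < N + 2; ++k) {
        A[0][k] = k + 1;
        A[k][0] = k + 1;
    }
}

/* Source loop nest from the slide. */
static void sweep_original(long A[N + 2][N + 2]) {
    for (int i = 1; i <= N; ++i)
        for (int j = 1; j <= N - i + 1; ++j)
            A[i][j] = A[i - 1][j] + A[i][j - 1];
}

/* Transformed nest: t = i + j - 1 enumerates wavefronts, p = j.
   All points on one wavefront depend only on wavefront t - 1,
   so the inner loop may run in parallel. */
static void sweep_wavefront(long A[N + 2][N + 2]) {
    for (int t = 1; t <= N; ++t) {
        #pragma omp parallel for
        for (int p = 1; p <= t; ++p) {
            int i = t - p + 1; /* inverse of the transformation */
            int j = p;
            A[i][j] = A[i - 1][j] + A[i][j - 1];
        }
    }
}

/* Returns 1 if both traversal orders produce the same array. */
int wavefront_matches_original(void) {
    static long A[N + 2][N + 2], B[N + 2][N + 2];
    init(A);
    init(B);
    sweep_original(A);
    sweep_wavefront(B);
    return memcmp(A, B, sizeof A) == 0;
}
```

Without OpenMP enabled, the pragma is ignored and the wavefront loop simply runs sequentially, which still exercises the reordered schedule.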

14 Constraints on the Input
Data structures: only scalars and arrays are allowed; alias information must be available.
  for (i=0; i<n; ++i) {
    P = P->next;    // bad: linked list
    *c = A[i];      // does c point inside A or B?
    A[i] = B[i+3];  // do A and B alias?
  }
For the variables highlighted in blue on the slide, additional information is essential (are there hidden dependences?).

15 Constraints on Input
Loop bounds and array subscripts must be affine in the surrounding loop variables and parameters, so that the iteration domain is a Z-polyhedron.
  for (i=0; i<n; ++i)
    for (j=0; j<=i; ++j) {
      A[(i*i+i)/2+j] = B[2*j][i-j];  // linearized triangle
      C[n*i] = D[C[j]];
    }
The parts marked on the slide are the ones relevant for model extraction. They are either affine (the loop bounds, 2*j, i-j) or non-affine ((i*i+i)/2+j, n*i, and the subscript C[j] of D).

16 Stencil Optimizations
[Diagram labels: L1: Continuous Domain & Continuous Model; L2: Discrete Domain & Discrete Model; L3: Algorithmic Components & Parameters; L4: Complete Program Specification; IR: Intermediate Representation; Polyhedral Model]
Extraction point: the representation is already executable (C-like) but still contains some abstract elements, e.g. an abstract communication node and a loop node for grid traversal.

17 Optimizations
Red-black Gauss-Seidel smoother: first all red, then all black points are updated in place. Each traversal updates every second element only, which means more pressure on the cache and reduced memory bandwidth. More colors are possible for larger stencils.
Generate color splitting.
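To illustrate what color splitting means for the loop nest, here is a minimal sketch for a 2D 5-point stencil. The grid sizes, the averaging stencil and all function names are assumptions made for this example, not code from ExaStencils. The guarded version tests the color of every point; the split version visits only the points of the current color by starting at the right offset and stepping by 2.

```c
#include <string.h>

#define NX 10 /* hypothetical grid sizes, boundary layer included */
#define NY 12

/* Straightforward version: visit every point, update only the current color. */
void rb_sweep_guarded(double u[NX][NY]) {
    for (int color = 0; color < 2; ++color)
        for (int i = 1; i < NX - 1; ++i)
            for (int j = 1; j < NY - 1; ++j)
                if ((i + j) % 2 == color)
                    u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                      u[i][j - 1] + u[i][j + 1]);
}

/* Color-split version: the guard is folded into the loop bounds. */
void rb_sweep_split(double u[NX][NY]) {
    for (int color = 0; color < 2; ++color)
        for (int i = 1; i < NX - 1; ++i) {
            /* first j >= 1 with (i + j) % 2 == color */
            int j0 = 1 + ((i + 1 + color) % 2);
            for (int j = j0; j < NY - 1; j += 2)
                u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                  u[i][j - 1] + u[i][j + 1]);
        }
}

/* Returns 1 if both sweeps give identical grids on a small test input. */
int rb_sweeps_agree(void) {
    static double a[NX][NY], b[NX][NY];
    for (int i = 0; i < NX; ++i)
        for (int j = 0; j < NY; ++j)
            a[i][j] = b[i][j] = (double)((i * 7 + j * 3) % 5);
    rb_sweep_guarded(a);
    rb_sweep_split(b);
    return memcmp(a, b, sizeof a) == 0;
}
```

Points of one color read only points of the other color, so the order of updates within a color does not matter and both versions compute the same values.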

18 Temporal Blocking
Normal update
[Figure: x-y grid showing input and result values]

24 Temporal Blocking
Combination of two subsequent updates
[Figure: x-y grid showing input, intermediate and result values]
Block size can be determined automatically.
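The idea of combining two subsequent updates can be sketched in one dimension. The code below is an illustration under stated assumptions, not the scheme generated by ExaStencils: it uses a hypothetical 3-point weighted-average stencil, fixed boundary values, and overlapped blocks that recompute one halo cell of the intermediate step on each side. `N`, `BLOCK` and all function names are chosen only for this example.

```c
#define N 64    /* hypothetical grid size */
#define BLOCK 8 /* hypothetical block size */

/* One Jacobi step with fixed boundary values. */
static void jacobi_step(const double *in, double *out) {
    out[0] = in[0];
    out[N - 1] = in[N - 1];
    for (int i = 1; i < N - 1; ++i)
        out[i] = 0.25 * in[i - 1] + 0.5 * in[i] + 0.25 * in[i + 1];
}

/* Normal update, applied twice: the whole grid is traversed two times. */
void two_steps_plain(const double *in, double *out) {
    double tmp[N];
    jacobi_step(in, tmp);
    jacobi_step(tmp, out);
}

/* Blocked variant: per block, compute the intermediate values for the block
   plus one halo cell on each side, then the final values, while the block
   is still in cache. */
void two_steps_blocked(const double *in, double *out) {
    double tmp[BLOCK + 2];
    out[0] = in[0];
    out[N - 1] = in[N - 1];
    for (int b = 1; b < N - 1; b += BLOCK) {
        int end = (b + BLOCK < N - 1) ? b + BLOCK : N - 1; /* block is [b, end) */
        for (int k = b - 1; k <= end; ++k) /* intermediate for cells b-1..end */
            tmp[k - (b - 1)] = (k == 0 || k == N - 1)
                ? in[k]
                : 0.25 * in[k - 1] + 0.5 * in[k] + 0.25 * in[k + 1];
        for (int i = b; i < end; ++i) {
            const double *t = tmp + (i - (b - 1)); /* t[0] is intermediate of cell i */
            out[i] = 0.25 * t[-1] + 0.5 * t[0] + 0.25 * t[1];
        }
    }
}

/* Returns 1 if the blocked variant reproduces the plain result exactly. */
int blocked_matches_plain(void) {
    double in[N], a[N], b[N];
    for (int i = 0; i < N; ++i)
        in[i] = (double)((i * i) % 7);
    two_steps_plain(in, a);
    two_steps_blocked(in, b);
    for (int i = 0; i < N; ++i)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

The redundant halo computation is one of several ways to respect the dependences between the two steps; skewed or diamond-shaped blocks avoid it at the cost of more complex loop bounds.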

25 Vectorization
Find transformations that allow SIMD parallelism: the innermost loop must be parallel, and subsequent iterations must access neighbouring grid elements.
[Figure: i-j iteration space]
  for (i =..)
    for (j =..; ++j)
      A[..][..+j] =.. B[..][..+j];
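As a small illustration of the two conditions, the following sketch contrasts a loop order whose innermost loop walks down a column (stride `N` in a row-major C array) with the interchanged order, where consecutive iterations touch neighbouring elements. The kernel, the size `N` and the function names are hypothetical; `#pragma omp simd` is used only as a hint and is ignored when OpenMP is not enabled.

```c
#define N 64 /* hypothetical grid size */

/* Innermost loop over i: consecutive iterations are N elements apart. */
void scale_strided(float A[N][N], float B[N][N]) {
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            A[i][j] = 2.0f * B[i][j];
}

/* After loop interchange: innermost loop over j is parallel and unit-stride,
   which is the shape a SIMD unit needs. */
void scale_unit_stride(float A[N][N], float B[N][N]) {
    for (int i = 0; i < N; ++i) {
        #pragma omp simd
        for (int j = 0; j < N; ++j)
            A[i][j] = 2.0f * B[i][j];
    }
}

/* Returns 1 if both loop orders compute the same result. */
int interchange_preserves_result(void) {
    static float B[N][N], A1[N][N], A2[N][N];
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            B[i][j] = (float)(i + j);
    scale_strided(A1, B);
    scale_unit_stride(A2, B);
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            if (A1[i][j] != A2[i][j])
                return 0;
    return 1;
}
```

The interchange is legal here because no iteration depends on another; in general the polyhedral dependence analysis has to confirm that.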

27 Domain-Specific Extensions
The polyhedron model already allows: color splitting, (temporal) blocking, vectorization.
Required extensions:
  reductions, e.g. for (i=0; i<n; ++i) sum = sum + A[i];
  non-cuboid grids

30 Support for Reductions
Example: sum reduction
  for (i=0; i<n; ++i)
    sum = sum + A[i];
sum is updated in each iteration.
Extracted model: purely sequential execution order.
Domain knowledge tells: because + is associative, the iterations need not run in this sequential order.
[Figure: schedule over time t and processor p]
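One way such domain knowledge can be expressed in the target code is an OpenMP reduction clause. This is a minimal sketch, assuming OpenMP as the target; the function name `sum_reduction` is hypothetical. Note that floating-point addition is only approximately associative, so a reordered sum may differ from the sequential one in the last bits.

```c
/* Sum of A[0..n-1]. The reduction clause tells the compiler that the
   updates of sum may be reordered and combined per thread. */
double sum_reduction(const double *A, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; ++i)
        sum = sum + A[i];
    return sum;
}
```

Compiled without OpenMP support, the pragma is ignored and the loop runs in the original sequential order.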

33 Support for non-cuboid Grids
Example: triangular grid [Figure: triangular grid in x and y]; the grid is stored contiguously in memory.
Usage:
  for (i=..)
    for (j=..)
      .. A[(i*i+i)/2+j] ..;
The array access is not affine.
Domain knowledge tells:
  for (i=..)
    for (j=..)
      .. A[i][j] ..;
is equivalent to the access above (with respect to the dependences).
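The equivalence rests on the fact that the linearized subscript maps distinct points (i, j) of the triangle to distinct memory cells. The sketch below checks this for a lower triangle with $0 \le j \le i < n$, which is an assumption about the index ranges; the function names are hypothetical.

```c
/* Linearized position of element (i, j) in a triangle where row i
   holds the entries j = 0..i. */
int tri_index(int i, int j) {
    return (i * i + i) / 2 + j;
}

/* Returns 1 if tri_index enumerates 0, 1, 2, ... without gaps or
   collisions when the triangle is traversed row by row. */
int tri_index_is_contiguous(int n) {
    int expected = 0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j <= i; ++j)
            if (tri_index(i, j) != expected++)
                return 0;
    return expected == n * (n + 1) / 2;
}
```

Because the mapping is one-to-one, two accesses touch the same cell of the linearized array exactly when they use the same (i, j), so dependences computed for the affine access A[i][j] carry over.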

34 (De-)Serialization for Communication
[Diagram labels: L1: Continuous Domain & Continuous Model; L2: Discrete Domain & Discrete Model; L3: Algorithmic Components & Parameters; L4: Complete Program Specification; IR: Intermediate Representation; Polyhedral Model]
Generation point of the polyhedral model: late in the transformation process, shortly before target code generation.

35 Generation of (De-)Serialization Code
Communication [Figure: node with north neighbour and east neighbour]
Data exchange with all direct neighbours is required, and some elements must be sent to multiple nodes.
Avoid loading these elements repeatedly from main memory: fill the send buffers simultaneously.

38 Serialization Code
2D example [Figure: grid with boundary regions N, W, E, S]
Model representation for N:
  [n] -> { N[x,y]->[x,y]: x=1 and 1<=y<=n-2 }
S, E and W analogously.
Generated code:
  W(1, 1);
  for (int x=1; x<n-1; ++x)
    N(1, x);
  E(1, n-2);
  for (int y=2; y<n-2; ++y) {
    W(y, 1);
    E(y, n-2);
  }
  W(n-2, 1);
  for (int x=1; x<n-1; ++x)
    S(n-2, x);
  E(n-2, n-2);
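To make the traversal order above concrete, here is a minimal sketch that fills four send buffers in a single pass over the grid in memory order. The `SendBuffers` struct, the grid size `N`, the buffer layout and the function name are hypothetical; the slide only specifies the order in which the statements N, W, E and S are executed.

```c
#define N 6 /* hypothetical local grid size, including one outer layer */

/* One send buffer per direct neighbour, N - 2 elements each. */
typedef struct {
    double north[N - 2], south[N - 2], west[N - 2], east[N - 2];
} SendBuffers;

/* Copy the outermost inner rows and columns of g into the send buffers,
   visiting the grid row by row as in the generated code on the slide. */
void pack_boundaries(double g[N][N], SendBuffers *s) {
    const int lo = 1, hi = N - 2;

    /* first inner row: W, all of N, E */
    s->west[0] = g[lo][lo];
    for (int x = lo; x <= hi; ++x)
        s->north[x - lo] = g[lo][x];
    s->east[0] = g[lo][hi];

    /* middle rows: only the W and E elements */
    for (int y = lo + 1; y < hi; ++y) {
        s->west[y - lo] = g[y][lo];
        s->east[y - lo] = g[y][hi];
    }

    /* last inner row: W, all of S, E */
    s->west[hi - lo] = g[hi][lo];
    for (int x = lo; x <= hi; ++x)
        s->south[x - lo] = g[hi][x];
    s->east[hi - lo] = g[hi][hi];
}
```

Corner elements such as g[1][1] end up in two buffers (north and west here), which is the case where a single combined traversal avoids loading the same element twice.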

40 Summary
Stencil optimizations: the polyhedral representation is extracted from the IR. Domain knowledge used: grid size, number of pre- and post-smoothing steps. Transformations and their ordering, with correct treatment of boundaries, can be performed completely automatically.
(De-)Serialization: the polyhedral representation is created directly. Generating optimal grid traversal code according to the memory layout can be performed completely automatically.

41 End
Thank you for your attention! Any questions?


Parallelizing Adaptive Triangular Grids with Refinement Trees and Space Filling Curves Parallelizing Adaptive Triangular Grids with Refinement Trees and Space Filling Curves Daniel Butnaru butnaru@in.tum.de Advisor: Michael Bader bader@in.tum.de JASS 08 Computational Science and Engineering

More information

smooth coefficients H. Köstler, U. Rüde

smooth coefficients H. Köstler, U. Rüde A robust multigrid solver for the optical flow problem with non- smooth coefficients H. Köstler, U. Rüde Overview Optical Flow Problem Data term and various regularizers A Robust Multigrid Solver Galerkin

More information

(Sparse) Linear Solvers

(Sparse) Linear Solvers (Sparse) Linear Solvers Ax = B Why? Many geometry processing applications boil down to: solve one or more linear systems Parameterization Editing Reconstruction Fairing Morphing 2 Don t you just invert

More information

HPC Algorithms and Applications

HPC Algorithms and Applications HPC Algorithms and Applications Dwarf #5 Structured Grids Michael Bader Winter 2012/2013 Dwarf #5 Structured Grids, Winter 2012/2013 1 Dwarf #5 Structured Grids 1. dense linear algebra 2. sparse linear

More information

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh

More information

Code Optimizations for High Performance GPU Computing

Code Optimizations for High Performance GPU Computing Code Optimizations for High Performance GPU Computing Yi Yang and Huiyang Zhou Department of Electrical and Computer Engineering North Carolina State University 1 Question to answer Given a task to accelerate

More information

Trajectory Pattern Mining. Figures and charts are from some materials downloaded from the internet.

Trajectory Pattern Mining. Figures and charts are from some materials downloaded from the internet. Trajectory Pattern Mining Figures and charts are from some materials downloaded from the internet. Outline Spatio-temporal data types Mining trajectory patterns Spatio-temporal data types Spatial extension

More information

Parallel Programming. March 15,

Parallel Programming. March 15, Parallel Programming March 15, 2010 1 Some Definitions Computational Models and Models of Computation real world system domain model - mathematical - organizational -... computational model March 15, 2010

More information

Lecture 2: Introduction to OpenMP with application to a simple PDE solver

Lecture 2: Introduction to OpenMP with application to a simple PDE solver Lecture 2: Introduction to OpenMP with application to a simple PDE solver Mike Giles Mathematical Institute Mike Giles Lecture 2: Introduction to OpenMP 1 / 24 Hardware and software Hardware: a processor

More information

Polyhedral Operations. Algorithms needed for automation. Logistics

Polyhedral Operations. Algorithms needed for automation. Logistics Polyhedral Operations Logistics Intermediate reports late deadline is Friday March 30 at midnight HW6 (posted) and HW7 (posted) due April 5 th Tuesday April 4 th, help session during class with Manaf,

More information

PERFORMANCE OPTIMISATION

PERFORMANCE OPTIMISATION PERFORMANCE OPTIMISATION Adrian Jackson a.jackson@epcc.ed.ac.uk @adrianjhpc Hardware design Image from Colfax training material Pipeline Simple five stage pipeline: 1. Instruction fetch get instruction

More information

Challenges in Fully Generating Multigrid Solvers for the Simulation of non-newtonian Fluids

Challenges in Fully Generating Multigrid Solvers for the Simulation of non-newtonian Fluids Challenges in Fully Generating Multigrid Solvers for the Simulation of non-newtonian Fluids Sebastian Kuckuk FAU Erlangen-Nürnberg 18.01.2016 HiStencils 2016, Prague, Czech Republic Outline Outline Scope

More information

Offload acceleration of scientific calculations within.net assemblies

Offload acceleration of scientific calculations within.net assemblies Offload acceleration of scientific calculations within.net assemblies Lebedev A. 1, Khachumov V. 2 1 Rybinsk State Aviation Technical University, Rybinsk, Russia 2 Institute for Systems Analysis of Russian

More information

Towards Generating Solvers for the Simulation of non-newtonian Fluids. Harald Köstler, Sebastian Kuckuk FAU Erlangen-Nürnberg

Towards Generating Solvers for the Simulation of non-newtonian Fluids. Harald Köstler, Sebastian Kuckuk FAU Erlangen-Nürnberg Towards Generating Solvers for the Simulation of non-newtonian Fluids Harald Köstler, Sebastian Kuckuk FAU Erlangen-Nürnberg 22.12.2015 Outline Outline Scope and Motivation Project ExaStencils The Application

More information

Week 7 Convex Hulls in 3D

Week 7 Convex Hulls in 3D 1 Week 7 Convex Hulls in 3D 2 Polyhedra A polyhedron is the natural generalization of a 2D polygon to 3D 3 Closed Polyhedral Surface A closed polyhedral surface is a finite set of interior disjoint polygons

More information

CSL 860: Modern Parallel

CSL 860: Modern Parallel CSL 860: Modern Parallel Computation Hello OpenMP #pragma omp parallel { // I am now thread iof n switch(omp_get_thread_num()) { case 0 : blah1.. case 1: blah2.. // Back to normal Parallel Construct Extremely

More information

Parallel Programming. Exploring local computational resources OpenMP Parallel programming for multiprocessors for loops

Parallel Programming. Exploring local computational resources OpenMP Parallel programming for multiprocessors for loops Parallel Programming Exploring local computational resources OpenMP Parallel programming for multiprocessors for loops Single computers nowadays Several CPUs (cores) 4 to 8 cores on a single chip Hyper-threading

More information

OpenMP Tutorial. Seung-Jai Min. School of Electrical and Computer Engineering Purdue University, West Lafayette, IN

OpenMP Tutorial. Seung-Jai Min. School of Electrical and Computer Engineering Purdue University, West Lafayette, IN OpenMP Tutorial Seung-Jai Min (smin@purdue.edu) School of Electrical and Computer Engineering Purdue University, West Lafayette, IN 1 Parallel Programming Standards Thread Libraries - Win32 API / Posix

More information

Introduction to Multigrid and its Parallelization

Introduction to Multigrid and its Parallelization Introduction to Multigrid and its Parallelization! Thomas D. Economon Lecture 14a May 28, 2014 Announcements 2 HW 1 & 2 have been returned. Any questions? Final projects are due June 11, 5 pm. If you are

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #7 2/5/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class

More information

Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time

Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time Louis-Noël Pouchet, Cédric Bastoul, Albert Cohen and Nicolas Vasilache ALCHEMY, INRIA Futurs / University of Paris-Sud XI March

More information

Performance Engineering - Case study: Jacobi stencil

Performance Engineering - Case study: Jacobi stencil Performance Engineering - Case study: Jacobi stencil The basics in two dimensions (2D) Layer condition in 2D From 2D to 3D OpenMP parallelization strategies and layer condition in 3D NT stores Prof. Dr.

More information

Performance Comparison Between Patus and Pluto Compilers on Stencils

Performance Comparison Between Patus and Pluto Compilers on Stencils Louisiana State University LSU Digital Commons LSU Master's Theses Graduate School 214 Performance Comparison Between Patus and Pluto Compilers on Stencils Pratik Prabhu Hanagodimath Louisiana State University

More information

Transforming Imperfectly Nested Loops

Transforming Imperfectly Nested Loops Transforming Imperfectly Nested Loops 1 Classes of loop transformations: Iteration re-numbering: (eg) loop interchange Example DO 10 J = 1,100 DO 10 I = 1,100 DO 10 I = 1,100 vs DO 10 J = 1,100 Y(I) =

More information

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Optimizing Data Locality for Iterative Matrix Solvers on CUDA Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,

More information

ECE 563 Spring 2012 First Exam

ECE 563 Spring 2012 First Exam ECE 563 Spring 2012 First Exam version 1 This is a take-home test. You must work, if found cheating you will be failed in the course and you will be turned in to the Dean of Students. To make it easy not

More information

Jane Li. Assistant Professor Mechanical Engineering Department, Robotic Engineering Program Worcester Polytechnic Institute

Jane Li. Assistant Professor Mechanical Engineering Department, Robotic Engineering Program Worcester Polytechnic Institute Jane Li Assistant Professor Mechanical Engineering Department, Robotic Engineering Program Worcester Polytechnic Institute (3 pts) How to generate Delaunay Triangulation? (3 pts) Explain the difference

More information

Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature. Intel Software Developer Conference London, 2017

Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature. Intel Software Developer Conference London, 2017 Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature Intel Software Developer Conference London, 2017 Agenda Vectorization is becoming more and more important What is

More information

Lesson 1 1 Introduction

Lesson 1 1 Introduction Lesson 1 1 Introduction The Multithreaded DAG Model DAG = Directed Acyclic Graph : a collection of vertices and directed edges (lines with arrows). Each edge connects two vertices. The final result of

More information

Parallel Programming Patterns

Parallel Programming Patterns Parallel Programming Patterns Pattern-Driven Parallel Application Development 7/10/2014 DragonStar 2014 - Qing Yi 1 Parallelism and Performance p Automatic compiler optimizations have their limitations

More information

Program Transformations for the Memory Hierarchy

Program Transformations for the Memory Hierarchy Program Transformations for the Memory Hierarchy Locality Analysis and Reuse Copyright 214, Pedro C. Diniz, all rights reserved. Students enrolled in the Compilers class at the University of Southern California

More information

OpenMP: Vectorization and #pragma omp simd. Markus Höhnerbach

OpenMP: Vectorization and #pragma omp simd. Markus Höhnerbach OpenMP: Vectorization and #pragma omp simd Markus Höhnerbach 1 / 26 Where does it come from? c i = a i + b i i a 1 a 2 a 3 a 4 a 5 a 6 a 7 a 8 + b 1 b 2 b 3 b 4 b 5 b 6 b 7 b 8 = c 1 c 2 c 3 c 4 c 5 c

More information

The ECM (Execution-Cache-Memory) Performance Model

The ECM (Execution-Cache-Memory) Performance Model The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited Loop Kernels. Proceedings of the Workshop Memory issues on Multi- and Manycore

More information

Polyhedral Optimizations of Explicitly Parallel Programs

Polyhedral Optimizations of Explicitly Parallel Programs Habanero Extreme Scale Software Research Group Department of Computer Science Rice University The 24th International Conference on Parallel Architectures and Compilation Techniques (PACT) October 19, 2015

More information

Lecture Notes on Cache Iteration & Data Dependencies

Lecture Notes on Cache Iteration & Data Dependencies Lecture Notes on Cache Iteration & Data Dependencies 15-411: Compiler Design André Platzer Lecture 23 1 Introduction Cache optimization can have a huge impact on program execution speed. It can accelerate

More information

Supercomputing in Plain English Part IV: Henry Neeman, Director

Supercomputing in Plain English Part IV: Henry Neeman, Director Supercomputing in Plain English Part IV: Henry Neeman, Director OU Supercomputing Center for Education & Research University of Oklahoma Wednesday September 19 2007 Outline! Dependency Analysis! What is

More information

Administrative. Optimizing Stencil Computations. March 18, Stencil Computations, Performance Issues. Stencil Computations 3/18/13

Administrative. Optimizing Stencil Computations. March 18, Stencil Computations, Performance Issues. Stencil Computations 3/18/13 Administrative Optimizing Stencil Computations March 18, 2013 Midterm coming April 3? In class March 25, can bring one page of notes Review notes, readings and review lecture Prior exams are posted Design

More information