Parametric Multi-Level Tiling of Imperfectly Nested Loops*
1 Parametric Multi-Level Tiling of Imperfectly Nested Loops*
Albert Hartono (1), Cedric Bastoul (2,3), Sriram Krishnamoorthy (4), J. Ramanujam (6), Muthu Baskaran (1), Albert Cohen (2), Boyana Norris (5), P. Sadayappan (1)
(1) Ohio State University, (2) INRIA Saclay, (3) Paris-Sud 11 University, (4) Pacific Northwest National Laboratory, (5) Argonne National Laboratory, (6) Louisiana State University
* Funded by NSF
2 One Slide Summary
- Imperfectly nested loops are common in practice
- A parametric tiled loop generator can provide valuable compiler support for auto-tuning
- Current general solutions for tiled code generation:
  - Parametric tiling of perfect loop nests
  - Non-parametric tiling of imperfect loop nests
- Both use the polyhedral model and ILP machinery
  - Constraint: the inequalities of the loop bounds must be linear in terms of loop iterators and problem sizes => problem with parametric tile sizes
- We have recently developed a hybrid solution for parametric tiling of imperfect loop nests
3 Loop Tiling
Key loop transformation for both:
- Efficient coarse-grained parallel execution
- Data locality optimization
Original loop:
  for (i=1; i<=7; i++)
    for (j=1; j<=6; j++)
      S(i,j);
Tiled loop (it and jt are the inter-tile loops; i and j are the intra-tile loops):
  for (it=1; it<=7; it+=Ti)
    for (jt=1; jt<=6; jt+=Tj)
      for (i=it; i<=min(7,it+Ti-1); i++)
        for (j=jt; j<=min(6,jt+Tj-1); j++)
          S(i,j);
[Figures: iteration space over i and j, before and after tiling]
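4 for (i=1; i<n; i++)
    for (j=2; j<n; j++)
      S1: a[i][j] = a[j][i] + a[i][j-1];
Iteration vector and iteration domain of S1:
$$x_{S1} = \begin{pmatrix} i \\ j \end{pmatrix}, \qquad \mathcal{I}_{S1} = \left\{ x_{S1} \;\middle|\; \begin{pmatrix} 1 & 0 & 0 & -1 \\ -1 & 0 & 1 & -1 \\ 0 & 1 & 0 & -2 \\ 0 & -1 & 1 & -1 \end{pmatrix} \begin{pmatrix} x_{S1} \\ n \\ 1 \end{pmatrix} \ge 0 \right\}$$
that is, $1 \le i \le n-1$ and $2 \le j \le n-1$.
Statement instances = integer points in polyhedra = systems of linear inequalities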
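The correspondence between loop iterations and integer points can be checked with a small sketch. The bounds of S1 (1 <= i <= n-1, 2 <= j <= n-1) are written as rows of a constraint matrix applied to (i, j, n, 1); the names `A_S1`, `in_domain_s1`, `count_domain_points` and `count_loop_iterations` are hypothetical and introduced only for this illustration.

```c
/* Rows of the constraint matrix, applied to the vector (i, j, n, 1).
   A point belongs to the domain when every row evaluates to >= 0. */
static const int A_S1[4][4] = {
    { 1,  0, 0, -1},   /* i - 1     >= 0 */
    {-1,  0, 1, -1},   /* n - 1 - i >= 0 */
    { 0,  1, 0, -2},   /* j - 2     >= 0 */
    { 0, -1, 1, -1},   /* n - 1 - j >= 0 */
};

/* Returns 1 if (i, j) is an instance of S1 for problem size n, else 0. */
int in_domain_s1(int i, int j, int n)
{
    const int v[4] = {i, j, n, 1};
    for (int r = 0; r < 4; r++) {
        int dot = 0;
        for (int c = 0; c < 4; c++)
            dot += A_S1[r][c] * v[c];
        if (dot < 0)
            return 0;
    }
    return 1;
}

/* Counts the integer points of the polyhedron inside the box [0,n] x [0,n]. */
int count_domain_points(int n)
{
    int count = 0;
    for (int i = 0; i <= n; i++)
        for (int j = 0; j <= n; j++)
            count += in_domain_s1(i, j, n);
    return count;
}

/* Counts the iterations executed by the original loop nest. */
int count_loop_iterations(int n)
{
    int count = 0;
    for (int i = 1; i < n; i++)
        for (int j = 2; j < n; j++)
            count++;
    return count;
}
```

For any n, the two counting functions should agree, since both enumerate the same set of statement instances.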
5 N=4 M=3
  for (i=0; i<N; i++) {
    for (j=0; j<N; j++)
      for (k=0; k<N; k++)
        S1;
    for (p=0; p<M; p++)
      S2;
  }
- Uniform, powerful abstraction for imperfect loop nests
- Uniform, powerful handling of parametric loop bounds
- Loop transformation == affine scheduling functions
  => Arbitrary sequence of transformations == change of affine coefficients
6 Input Program => Loops -> Polyhedra => Data Dependence Analysis => Transforms (Affine Functions) => Code Generation: Polyhedra -> Loops => Output Program
7 Parametric Tiled Code Generation
Original loop:
  for (i=1; i<=N; i++)
    for (j=1; j<=N; j++)
      S(i,j);
Tile loop i with tile size Ti and loop j with tile size Tj:
  for (it=1; it<=N; it+=Ti)
    for (jt=1; jt<=N; jt+=Tj)
      for (i=it; i<=min(N,it+Ti-1); i++)
        for (j=jt; j<=min(N,jt+Tj-1); j++)
          S(i,j);
- Tiled code generation is straightforward for rectangular, perfectly nested loops
- But tiled code generation is more challenging if:
  - Inner loop bounds depend on outer loops
  - Data dependences make rectangular tiling illegal
  - Loops are imperfectly nested
- The polyhedral compilation model enables tiled code generation for arbitrary affine codes with imperfectly nested loops
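A minimal sketch of the rectangular case, assuming inclusive intra-tile upper bounds (`i <= min(N, it+Ti-1)`) and tile sizes of at least 1. The function name `tiled_covers_exactly_once` and the visit-counter array are hypothetical; the counter stands in for the statement S(i,j) so that the tiled nest can be checked against the original iteration space.

```c
#include <stdlib.h>

static int imin(int a, int b) { return a < b ? a : b; }

/* Returns 1 if the parametrically tiled nest executes every point of
   [1,n] x [1,n] exactly once for tile sizes Ti x Tj (both >= 1). */
int tiled_covers_exactly_once(int n, int Ti, int Tj)
{
    int *visits = calloc((size_t)(n + 1) * (size_t)(n + 1), sizeof *visits);
    if (!visits)
        return 0;

    for (int it = 1; it <= n; it += Ti)                        /* inter-tile */
        for (int jt = 1; jt <= n; jt += Tj)
            for (int i = it; i <= imin(n, it + Ti - 1); i++)   /* intra-tile */
                for (int j = jt; j <= imin(n, jt + Tj - 1); j++)
                    visits[i * (n + 1) + j]++;     /* stands in for S(i,j) */

    int ok = 1;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            if (visits[i * (n + 1) + j] != 1)
                ok = 0;

    free(visits);
    return ok;
}
```

Because Ti and Tj are ordinary run-time arguments, the same compiled code can be reused for every tile size an auto-tuner wants to try.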
8 Loop Code Generation from Polyhedra
- Code generation in a polyhedral compiler framework: the process of converting a polyhedral representation of computations back into loop structures
- CLooG: state-of-the-art polyhedral code generator
  - Takes statement domains and affine schedules to generate transformed code
  - Uses an efficient polyhedral scanning algorithm to generate imperfectly nested loops that scan a union of polyhedra (corresponding to statement domains)
9 Loop Code Generation from Polyhedra (cont.)
Input loops:
  for (i=1; i<=N; i++)
    for (j=i; j<=N; j++)
      S1(i,j);
  for (i=1; i<=M; i++) /* M<N */
    for (j=1; j<=N; j++)
      S2(i,j);
Generated code scanning the union of both statement domains:
  for (j=1; j<=N; j++) { S1(1,j); S2(1,j); }
  for (i=2; i<=M; i++) {
    for (j=1; j<=i-1; j++) S2(i,j);
    for (j=i; j<=N; j++) { S1(i,j); S2(i,j); }
  }
  for (i=M+1; i<=N; i++)
    for (j=i; j<=N; j++)
      S1(i,j);
[Figure: domains of S1 and S2 in the (i,j) plane, with axes marked 1, 2, M, N]
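The scanning code can be checked against the two input loops with a short sketch. It assumes 1 <= M < N and bounds the middle loop by M, so that rows M+1..N are handled only by the final S1-only loop. The `trace_t` type and the function names are hypothetical; the checksums are order-independent, so the sketch compares which instances run, not the order in which they run.

```c
typedef struct {
    long count1, count2;   /* number of executed S1 / S2 instances */
    long sum1, sum2;       /* order-independent checksums over (i, j) */
} trace_t;

static void S1(trace_t *t, int i, int j) { t->count1++; t->sum1 += 1000L * i + j; }
static void S2(trace_t *t, int i, int j) { t->count2++; t->sum2 += 1000L * i + j; }

/* The two input loop nests, executed one after the other. */
trace_t run_original(int N, int M)
{
    trace_t t = {0, 0, 0, 0};
    for (int i = 1; i <= N; i++)
        for (int j = i; j <= N; j++)
            S1(&t, i, j);
    for (int i = 1; i <= M; i++)
        for (int j = 1; j <= N; j++)
            S2(&t, i, j);
    return t;
}

/* Imperfectly nested code scanning the union of both domains. */
trace_t run_scanned(int N, int M)
{
    trace_t t = {0, 0, 0, 0};
    for (int j = 1; j <= N; j++) { S1(&t, 1, j); S2(&t, 1, j); }
    for (int i = 2; i <= M; i++) {
        for (int j = 1; j <= i - 1; j++) S2(&t, i, j);
        for (int j = i; j <= N; j++) { S1(&t, i, j); S2(&t, i, j); }
    }
    for (int i = M + 1; i <= N; i++)
        for (int j = i; j <= N; j++)
            S1(&t, i, j);
    return t;
}

/* Returns 1 if both versions execute the same counts and checksums. */
int same_instances(int N, int M)
{
    trace_t a = run_original(N, M), b = run_scanned(N, M);
    return a.count1 == b.count1 && a.count2 == b.count2 &&
           a.sum1 == b.sum1 && a.sum2 == b.sum2;
}
```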
10 Tiled Code Generation in Polyhedral Model
Tile sizes = 32 x 32
Original loop:
  for (i=1; i<=N; i++)
    for (j=1; j<=N; j++)
      S(i,j);
Statement domain:
$$\{ (i,j) \mid 1 \le i \le N,\ 1 \le j \le N \}$$
Tiled loop:
  for (it=0; it<=floord(N,32); it++)
    for (jt=0; jt<=floord(N,32); jt++)
      for (i=max(1,32*it); i<=min(N,32*it+31); i++)
        for (j=max(1,32*jt); j<=min(N,32*jt+31); j++)
          S(i,j);
Tiled statement domain:
$$\{ (it,jt,i,j) \mid 0 \le i - 32\,it \le 31,\ 0 \le j - 32\,jt \le 31,\ 1 \le i \le N,\ 1 \le j \le N \}$$
Affine schedule:
$$\theta(it,jt,i,j) = (it,\ jt,\ i,\ j)$$
Constraint of the polyhedral model and ILP machinery: the inequalities of the loop bounds must be linear in terms of loop iterators and symbolic parameters
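The linearity constraint is what rules out symbolic tile sizes, which can be seen by comparing the intra-tile bounds for the fixed size 32 with those for a tile-size parameter. The symbol $T_i$ below is introduced for illustration only.

$$32\,it \;\le\; i \;\le\; 32\,it + 31 \qquad \text{(affine in } it \text{ and } i\text{)}$$

$$T_i \cdot it \;\le\; i \;\le\; T_i \cdot it + T_i - 1 \qquad \text{(contains the product } T_i \cdot it\text{)}$$

With a fixed size, 32 is just a constant coefficient of the iterator $it$. With a parametric size, the bound multiplies a parameter by an iterator, which is not a linear term, so the tiled domain is no longer a polyhedron in the iterators and parameters.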
11 Parametric Tiling: Perfectly Nested Loop
Input loop:
  for (i=lbi; i<=ubi; i++)
    for (j=lbj(i); j<=ubj(i); j++)
      S(i,j);
Output pseudocode:
  for it {
    [compute lbv]
    [compute ubv]
    if (lbv<ubv) {
      [prolog j]
      [full tiles j]
      [epilog j]
    } else {
      [untiled j]
    }
  }
  [epilog i]
[Figure: iteration space over i and j, marking full tiles and a partial tile along loop i, and the regions along j with full tiles and with no full tiles]
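The prolog / full tiles / epilog structure can be sketched for one concrete case, a triangular nest with lbj(i) = i and ubj(i) = n. This is an illustrative hand-written instance under those assumptions, not the code PrimeTile emits; the function name and the visit counters are hypothetical. Here lbv is the largest lower bound and ubv the smallest upper bound of j over the i range of a tile, so every j in [lbv, ubv] is valid for all i in the tile.

```c
#include <stdlib.h>

/* Stands in for statement S(i,j): records one visit of point (i,j). */
static void S(int *visits, int n, int i, int j) { visits[i * (n + 1) + j]++; }

/* Runs the tiled triangular nest (1 <= i <= j <= n) with tile sizes
   Ti x Tj (both >= 1). Returns the number of iterations executed inside
   full tiles, or -1 if some point is not executed exactly once. */
long tiled_triangular_full_iterations(int n, int Ti, int Tj)
{
    int *visits = calloc((size_t)(n + 1) * (size_t)(n + 1), sizeof *visits);
    if (!visits)
        return -1;
    long in_full = 0;
    int it;
    for (it = 1; it + Ti - 1 <= n; it += Ti) {   /* full tiles along i */
        int lbv = it + Ti - 1;   /* max of lbj(i) = i over the tile */
        int ubv = n;             /* min of ubj(i) = n over the tile */
        if (lbv < ubv) {
            for (int i = it; i < it + Ti; i++)           /* prolog j */
                for (int j = i; j < lbv; j++)
                    S(visits, n, i, j);
            int jt;
            for (jt = lbv; jt + Tj - 1 <= ubv; jt += Tj) /* full tiles j */
                for (int i = it; i < it + Ti; i++)
                    for (int j = jt; j < jt + Tj; j++) { /* no min/max */
                        S(visits, n, i, j);
                        in_full++;
                    }
            for (int i = it; i < it + Ti; i++)           /* epilog j */
                for (int j = jt; j <= ubv; j++)
                    S(visits, n, i, j);
        } else {
            for (int i = it; i < it + Ti; i++)           /* untiled j */
                for (int j = i; j <= n; j++)
                    S(visits, n, i, j);
        }
    }
    for (int i = it; i <= n; i++)                        /* epilog i */
        for (int j = i; j <= n; j++)
            S(visits, n, i, j);

    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            if (visits[i * (n + 1) + j] != (j >= i ? 1 : 0))
                in_full = -1;
    free(visits);
    return in_full;
}
```

The full-tile loops have constant trip counts Ti and Tj and no min/max in their bounds, which is what makes them suitable targets for later optimizations such as unrolling.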
12 Parametric Tiling: Imperfectly Nested Loops
Input loop:
  for (i=lbi; i<=ubi; i++) {
    for (j1=lbj1(i); j1<=ubj1(i); j1++) S1(i,j1);
    for (j2=lbj2(i); j2<=ubj2(i); j2++) S2(i,j2);
  }
Output pseudocode:
  for it {
    [compute lbv1,ubv1,lbv2,ubv2]
    if (lbv1<ubv1) {
      [prolog j1]
      [full tiles j1]
      if (lbv2<ubv2) {
        [epilog j1 + prolog j2]
        [full tiles j2]
        [epilog j2]
      } else {
        [epilog j1 + untiled j2]
      }
    } else {
      if (lbv2<ubv2) {
        [untiled j1 + prolog j2]
        [tiled j2]
        [epilog j2]
      } else {
        [untiled j1 + untiled j2]
      }
    }
  }
  [epilog i]
Steps joined by "+" are combined and interleaved.
[Figure: one tile segment along the i dimension, showing the statement domains of S1 and S2, the bounds lbv1, ubv1, lbv2, ubv2, and the regions S1a, S1b, S2a, S2b]
13 Multi-Level Tiling
- Essential for: exploiting data locality in deep multi-level memory hierarchies
- Approach: boundary tiles can be recursively tiled using smaller tile sizes
[Figure: iteration space over i and j with 3 levels of tiling]
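A minimal two-level sketch on a rectangular nest shows the basic shape of multi-level tiling: tiles of size T1 are themselves tiled with a smaller size T2. It clamps bounds with min() instead of separating full and boundary tiles, so it illustrates the loop structure only, not PrimeTile's boundary handling; the function name and counters are hypothetical.

```c
#include <stdlib.h>

static int imin(int a, int b) { return a < b ? a : b; }

/* Returns 1 if two-level tiling (outer size T1, inner size T2, both >= 1)
   executes every point of [1,n] x [1,n] exactly once. */
int two_level_tiling_covers_exactly_once(int n, int T1, int T2)
{
    int *visits = calloc((size_t)(n + 1) * (size_t)(n + 1), sizeof *visits);
    if (!visits)
        return 0;

    for (int it1 = 1; it1 <= n; it1 += T1) {              /* level-1 tiles */
        int iend1 = imin(n, it1 + T1 - 1);
        for (int jt1 = 1; jt1 <= n; jt1 += T1) {
            int jend1 = imin(n, jt1 + T1 - 1);
            for (int it2 = it1; it2 <= iend1; it2 += T2) {   /* level-2 tiles */
                int iend2 = imin(iend1, it2 + T2 - 1);
                for (int jt2 = jt1; jt2 <= jend1; jt2 += T2) {
                    int jend2 = imin(jend1, jt2 + T2 - 1);
                    for (int i = it2; i <= iend2; i++)       /* points */
                        for (int j = jt2; j <= jend2; j++)
                            visits[i * (n + 1) + j]++;
                }
            }
        }
    }

    int ok = 1;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            if (visits[i * (n + 1) + j] != 1)
                ok = 0;
    free(visits);
    return ok;
}
```

T2 does not need to divide T1 or n; the clamped bounds absorb the leftovers at each level.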
14 Implementation: PrimeTile
A Parametric Multi-Level Tiler for Imperfect Loop Nests
Pipeline (as recoverable from the slide diagram):
- Input: loop nest sequence
- Pre-process: Pluto produces iteration space polyhedra + affine schedules for rectangular tileability; a modified CLooG code generator emits rectangularly tileable loop code (with complete embedding information: all statements in a loop nest have the same number of surrounding loops)
- Parser + AST Generator: produces loop ASTs
- Loop Tiling Transformer: produces parametric multi-level tiled loop ASTs
- Output: parametric multi-level tiled code
15 Experiments
- Machine: Xeon workstation with dual quad-core E5462 Xeon processors (8 cores total) running at 2.8 GHz (1600 MHz FSB), with 32 KB L1 cache, 12 MB of L2 cache (6 MB shared per core pair), and 16 GB of DDR2 FBDIMM RAM, running Linux (x86-64)
- Compiler: GCC, options: -O3
Comparisons with other tiled-code generators:
  Tiled code generator   Tile sizes   Loop nest structure
  HiTLOG                 Parametric   Perfect
  Pluto                  Fixed        Imperfect
  PrimeTile              Parametric   Imperfect
16 Benchmarks
  Name       Description                               Imperfect nest  Require skewing  Input problem size
  LU         LU factorization                          Yes             No               N=25
  2D FDTD    2D Finite Difference Time Domain method   Yes             Yes              T=2, N=2
  1D Jacobi  1D Jacobi method                          Yes             Yes              T=2, N=6x1 6
  Cholesky   Cholesky factorization                    Yes             No               N=5
  TriSolver  Triangular solver                         Yes             No               N=3
  Seidel     3D Gauss-Seidel                           No              Yes              T=2, N=2
  DSYRK      Symmetric rank-k update                   No              No               N=3
  DTRMM      Triangular matrix multiplication          No              No               N=3
17 Efficiency of Code Generation
[Figures: generation time (seconds) vs. levels of tiling for LU and Cholesky; series: Pluto, PrimeTile (full), PrimeTile (no boundary tiling)]
18 Efficiency of Code Generation (cont.)
[Figures: generation time (seconds) vs. levels of tiling for DSYRK and DTRMM; series: Pluto, PrimeTile (full), PrimeTile (no boundary tiling), HiTLOG]
- Fully polyhedral fixed tiled code generation does not scale
- Double benefit of PrimeTile: better scalability and parametric tiling
19 Performance of Generated Tiled Code
[Figure: execution time (seconds) for LU, 2D FDTD, 1D Jacobi, Cholesky, TriSolver, Seidel, DSYRK, DTRMM; series: Pluto, PrimeTile, HiTLOG]
- Parametric tiled code efficiency is comparable to or better than fixed tiled code
20 Impact of Separation of Partial and Full Tiles
[Figure: execution time (seconds) for LU, 2D FDTD, 1D Jacobi, Cholesky, TriSolver; series: Pluto, PrimeTile, Pluto (unroll/jam), PrimeTile (unroll), PrimeTile (regtile)]
21 Impact of Separation of Partial and Full Tiles
[Figure: execution time (seconds) for Seidel, DSYRK, DTRMM; series: Pluto, PrimeTile, HiTLOG, Pluto (unroll/jam), PrimeTile (unroll), PrimeTile (regtile), HiTLOG (unroll), HiTLOG (regtile)]
- Identification of full-tile loops enables downstream optimization (e.g., register tiling)
22 Summary
- Developed an effective general approach to parametric multi-level tiling of imperfectly nested affine loops
- Achieved separation of partial tiles from full tiles, thereby enabling optimizations such as register tiling
- Ongoing/follow-up work targets parallel parametric tiling of affine imperfect loop nests
- Software download:
  1. A beta release of PrimeTile
  2. A modified version of CLooG
23 Thank You!
More informationModule 18: Loop Optimizations Lecture 36: Cycle Shrinking. The Lecture Contains: Cycle Shrinking. Cycle Shrinking in Distance Varying Loops
The Lecture Contains: Cycle Shrinking Cycle Shrinking in Distance Varying Loops Loop Peeling Index Set Splitting Loop Fusion Loop Fission Loop Reversal Loop Skewing Iteration Space of The Loop Example
More informationTransformations Techniques for extracting Parallelism in Non-Uniform Nested Loops
Transformations Techniques for extracting Parallelism in Non-Uniform Nested Loops FAWZY A. TORKEY, AFAF A. SALAH, NAHED M. EL DESOUKY and SAHAR A. GOMAA ) Kaferelsheikh University, Kaferelsheikh, EGYPT
More informationThe Challenges of Non-linear Parameters and Variables in Automatic Loop Parallelisation
The Challenges of Non-linear Parameters and Variables in Automatic Loop Parallelisation Armin Größlinger December 2, 2009 Rigorosum Fakultät für Informatik und Mathematik Universität Passau Automatic Loop
More informationarxiv: v1 [cs.pl] 1 Feb 2018
PCOT: Cache Oblivious Tiling of Polyhedral Programs Waruna Ranasinghe Nirmal Prajapati Colorado State University Department of Computer Science Fort Collins, CO 8523, USA Tomofumi Yuki INRIA Rennes, France
More informationData-centric Transformations for Locality Enhancement
Data-centric Transformations for Locality Enhancement Induprakas Kodukula Keshav Pingali September 26, 2002 Abstract On modern computers, the performance of programs is often limited by memory latency
More informationPolyhedral Compilation Foundations
Polyhedral Compilation Foundations Louis-Noël Pouchet pouchet@cse.ohio-state.edu Dept. of Computer Science and Engineering, the Ohio State University Feb 15, 2010 888.11, Class #4 Introduction: Polyhedral
More informationCluster Computing Paul A. Farrell 9/15/2011. Dept of Computer Science Kent State University 1. Benchmarking CPU Performance
Many benchmarks available MHz (cycle speed of processor) MIPS (million instructions per second) Peak FLOPS Whetstone Stresses unoptimized scalar performance, since it is designed to defeat any effort to
More informationCompiling Affine Loop Nests for a Dynamic Scheduling Runtime on Shared and Distributed Memory
Compiling Affine Loop Nests for a Dynamic Scheduling Runtime on Shared and Distributed Memory ROSHAN DATHATHRI, RAVI TEJA MULLAPUDI, and UDAY BONDHUGULA, Department of Computer Science and Automation,
More informationBenchmarking CPU Performance. Benchmarking CPU Performance
Cluster Computing Benchmarking CPU Performance Many benchmarks available MHz (cycle speed of processor) MIPS (million instructions per second) Peak FLOPS Whetstone Stresses unoptimized scalar performance,
More informationAdaptive Scientific Software Libraries
Adaptive Scientific Software Libraries Lennart Johnsson Advanced Computing Research Laboratory Department of Computer Science University of Houston Challenges Diversity of execution environments Growing
More informationLoop Transformations, Dependences, and Parallelization
Loop Transformations, Dependences, and Parallelization Announcements HW3 is due Wednesday February 15th Today HW3 intro Unimodular framework rehash with edits Skewing Smith-Waterman (the fix is in!), composing
More information