Polyhedral Optimizations of Explicitly Parallel Programs


Habanero Extreme Scale Software Research Group, Department of Computer Science, Rice University. The 24th International Conference on Parallel Architectures and Compilation Techniques (PACT), October 19, 2015. 1

Introduction and Motivation. Moving towards extreme-scale and exascale computing systems: billions of billions of operations per second. Enabling applications to fully exploit these systems is not easy! How? 2

Introduction and Motivation. Two approaches from past work: 1) Manually parallelize using explicitly-parallel programming models (e.g., CAF, Cilk, Habanero, MPI, OpenMP, UPC). Optimizations are performed by the programmer, not the compiler; tedious, but can deliver good performance with sufficient effort. 2) Automatically parallelize sequential programs. Done by compilers, not humans; easy, but limitations exist. 3

Introduction and Motivation - Motivation and Our Approach. Motivation: the programmer expresses the logical parallelism in the application and then lets the compiler perform optimizations accordingly. Our approach: automatically optimize explicitly-parallel programs. 4

Glimpse of benefits Introduction and Motivation 5

Background 1 Introduction and Motivation 2 Background 3 Our framework 4 Evaluation 5 Related Work 6 Conclusions and Future work 6

Background: Explicit Parallelism - Loop-level parallelism. Major difference between sequential and parallel programs: sequential programs define a total execution order, while parallel programs define a partial execution order. Loop-level parallelism (since OpenMP 1.0): a loop is annotated with omp parallel for, and its iterations can be run in parallel.

#pragma omp parallel for
for (i loop) {
  S1;
  S2;
  S3;
}
7
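For concreteness, a compilable sketch of the same pattern (the array names and statement bodies here are hypothetical placeholders for S1, S2, S3, not from the slides):

#include <stdio.h>
#define N 1000
int a[N], b[N], c[N];

int main(void) {
    #pragma omp parallel for        // iterations may execute in parallel
    for (int i = 0; i < N; i++) {
        a[i] = i;                   // S1
        b[i] = 2 * i;               // S2
        c[i] = a[i] + b[i];         // S3
    }
    printf("c[N-1] = %d\n", c[N - 1]);
    return 0;
}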

Background: Explicit Parallelism - Task-level parallelism. Task-level parallelism (OpenMP 3.0 & 4.0): a region of code is annotated with omp task. Synchronization: between parent and children via omp taskwait; between siblings via the depend clause.

#pragma omp task depend(out: A)  // T1
{ S1; }
#pragma omp task depend(in: A)   // T2
{ S2; }
#pragma omp task                 // T3
{ S3; }
#pragma omp taskwait             // Tw
8
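Similarly, a compilable sketch of the task pattern (the variables A and B and the statement bodies are hypothetical stand-ins for S1, S2, S3):

#include <stdio.h>

int main(void) {
    int A = 0, B = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: A)   // T1: produces A
        A = 42;
        #pragma omp task depend(in: A)    // T2: consumes A, so T1 happens before T2
        B = A + 1;
        #pragma omp task                  // T3: unordered with respect to T1 and T2
        printf("T3 can run concurrently with T1 and T2\n");
        #pragma omp taskwait              // Tw: wait for all child tasks
    }
    printf("B = %d\n", B);
    return 0;
}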

Background: Explicit Parallelism - Happens-before relation. The happens-before relation specifies the partial order among dynamic statement instances: HB(S1, S2) = true means S1 must happen before S2, where S1 and S2 are statement instances. In the worksharing loop above, HB(S1(i), S2(i)) = true within each iteration i; in the task example, HB(S1, S2) = true (the depend clauses on A order T1 before T2), while HB(S2, S3) = false (T3 is unordered with respect to T1 and T2). 9

Background: Explicit Parallelism - Serial-elision property. Serial-elision property: removal of all parallel constructs results in a sequential program that is a valid (albeit inefficient) implementation of the parallel program semantics.

#pragma omp task depend(out: A)  // T1
{ S1; }
#pragma omp task depend(in: A)   // T2
{ S2; }
#pragma omp task                 // T3
{ S3; }
#pragma omp taskwait             // Tw

Original task graph -> removing parallel constructs -> satisfies serial elision. 10
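For illustration, the serial elision of the task example above is simply the task bodies in program order (a sketch; S1, S2, S3 remain placeholder statements):

{ S1; }  // body of T1
{ S2; }  // body of T2
{ S3; }  // body of T3
// the taskwait disappears along with the other parallel constructs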

Background: Polyhedral Compilation Techniques. Compiler techniques for the analysis and transformation of codes with nested loops; an algebraic framework for affine program optimizations. Advantages over AST-based frameworks: reasoning at the statement-instance level, and unifying many complex loop transformations. 11

Background: Polyhedral Representation (SCoP). A statement S in a Static Control Part (SCoP) of the program is represented by: 1) Iteration domain (D_S): the set of instances of statement S. 2) Schedule (Θ_S): assigns a logical time stamp to each instance of S, giving the ordering information between statement instances; it captures the sequential execution order of the program, and statement instances are executed in increasing order of their schedules. 3) Access functions (A_S): the array subscripts appearing in statement S. 12
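As a worked illustration (a hypothetical loop nest, not from the slides), the three SCoP components for a simple statement might look like:

for (i = 0; i < N; i++)
  for (j = 1; j < M; j++)
S:  A[i][j] = A[i][j-1] + B[j];

D_S = { (i, j) : 0 <= i < N, 1 <= j < M }        // iteration domain
Θ_S(i, j) = (i, j)                               // sequential schedule (lexicographic order)
A_S : writes A[i][j]; reads A[i][j-1], B[j]      // access functions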

Background: Polyhedral Compilation Techniques - Summary. Advantages: precise data-dependence computation; a unified formulation of a complex set of loop transformations. Limitations: array subscripts must be affine (but conservative approaches exist); control flow must be static and affine (control dependences are modeled in the same way as data dependences); the input is assumed to be a sequential program, so the framework is unaware of the happens-before relation in an input parallel program. 13

Background Automatic parallelization of sequential programs 14

Our framework 1 Introduction and Motivation 2 Background 3 Our framework 4 Evaluation 5 Related Work 6 Conclusions and Future work 15

Our framework Polyhedral optimizations of Parallel Programs (PoPP) 16

Our framework: PoPP - Program Analysis. Step 1: compute dependences based on the sequential order (use the serial elision and ignore the parallel constructs).

#pragma omp parallel
#pragma omp single
{
  for (int it = itold + 1; it <= itnew; it++) {
    for (int i = 0; i < nx; i++) {
      #pragma omp task depend(out: u[i]) \
                       depend(in: unew[i])                        // T1
      for (int j = 0; j < ny; j++)
S1:     u[i][j] = unew[i][j];
    }
    for (int i = 0; i < nx; i++) {
      #pragma omp task depend(out: unew[i]) \
                       depend(in: f[i], u[i-1], u[i], u[i+1])     // T2
      for (int j = 0; j < ny; j++)
S2:     cpd(i, j, unew, u, f);
    }
  }
  #pragma omp taskwait  // Tw
}

The analysis is conservative, but may still capture vectorization possibilities. [Figure: (S2 -> S1) dependences across the it and i loops.] 17
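For reference, the serial elision analyzed in Step 1 is the same code with all OpenMP pragmas removed (a sketch; cpd is assumed to be the benchmark's per-point stencil update routine):

for (int it = itold + 1; it <= itnew; it++) {
  for (int i = 0; i < nx; i++)
    for (int j = 0; j < ny; j++)
      u[i][j] = unew[i][j];           // S1
  for (int i = 0; i < nx; i++)
    for (int j = 0; j < ny; j++)
      cpd(i, j, unew, u, f);          // S2
}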

Our framework: PoPP - Program Analysis. Step 1: compute dependences based on the sequential order (use the serial elision and ignore the parallel constructs). Step 2: compute the happens-before relation (transitive closure). [Same code as in Step 1. Figure: (S2 -> S1) HB edges across the it and i loops.] 18
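A minimal formal sketch of Step 2 (notation mine, not the paper's): let E be the set of direct ordering edges between dynamic statement instances implied by program order within a task, task-creation order, the depend clauses, and taskwait; the happens-before relation is then the transitive closure

  HB = E^+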

Our framework: PoPP - Program Analysis. Step 1: compute dependences. Step 2: compute the happens-before relation (transitive closure). Step 3: intersect 1 and 2 (gives the best of both worlds). [Figure, it-i iteration space: conservative dependences P(S2 -> S1) ∩ HB relation HB(S2 -> S1) = refined dependences P'(S2 -> S1).] 19

Our framework: PoPP - Program Analysis. Step 1: compute dependences. Step 2: compute the happens-before relation (transitive closure). Step 3: intersect 1 and 2 (gives the best of both worlds). [Figure, i-j iteration space: conservative dependences P1(S1 -> S1) (the j-loop is parallel for S1) ∩ HB relation HB1(S1 -> S1) (the i-loop is parallel for S1) = refined dependences P'1(S1 -> S1) (no dependences remain for S1).] 20
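Putting Steps 1-3 together (a sketch in the slides' notation): for each pair of statements, the refined dependence relation is the intersection of the conservative sequential dependences with the happens-before relation,

  P_refined = P_conservative ∩ HB

so a dependence survives only if its two instances are dependent under the serial elision and are also ordered by the parallel constructs.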

Our framework: PoPP - Program Transformations. Step 4: use the refined dependences in existing optimizations. [Figure, it-i iteration space with the refined dependences P'(S2 -> S1): skewing and tiling the iteration space.] 21

Our framework: PoPP - Code Generation. Step 5: generate optimized code using fine-grained synchronization - doacross loop synchronization (OpenMP 4.1).

#pragma omp parallel for private(c3, c5) ordered(2)
for (c1 = itold + 1; c1 <= itnew; c1++) {
  for (c3 = 2*c1; c3 <= 2*c1 + nx; c3++) {
    #pragma omp ordered depend(sink: c1-1, c3) depend(sink: c1, c3-1)
    if (c3 <= 2*c1 + nx - 1) {
      for (c5 = 0; c5 < ny; c5++)
S1:     u[-2*c1 + c3][c5] = unew[-2*c1 + c3][c5];
    }
    if (c3 >= 2*c1 + 1) {
      for (c5 = 0; c5 < ny; c5++)
S2:     cpd(-2*c1 + c3 - 1, c5, unew, u, f);
    }
    #pragma omp ordered depend(source)
  }
}
22
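To see the doacross synchronization in isolation, here is a minimal, self-contained sketch of a 2D wavefront using the same ordered(2) / depend(sink) / depend(source) mechanism (standard OpenMP 4.5 syntax; the array, bounds, and update are hypothetical, not the tool's output):

#include <omp.h>
#define N 1024
static double A[N][N];

void wavefront(void) {
    int i, j;
    #pragma omp parallel for private(j) ordered(2)
    for (i = 1; i < N; i++) {
        for (j = 1; j < N; j++) {
            // wait until the north (i-1, j) and west (i, j-1) iterations have completed
            #pragma omp ordered depend(sink: i-1, j) depend(sink: i, j-1)
            A[i][j] = 0.25 * (A[i-1][j] + A[i][j-1]);
            // signal that iteration (i, j) is done
            #pragma omp ordered depend(source)
        }
    }
}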

Our framework PoPP - Workflow (in ROSE Compiler) 23

Our framework: PoPP - Transformations & Code Generation. Transformations: the PolyAST framework [Shirako et al., SC 2014] performs the loop optimizations, using a hybrid approach of polyhedral and AST-based transformations; it detects reduction, doacross, and doall parallelism from the dependences. Code generation: doall parallelism maps to omp parallel for; doacross parallelism maps to omp ordered depend, which allows fine-grained synchronization in multidimensional loop nests. 24
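For the reduction case, a doall loop with a detected reduction could map to code like the following (an illustrative OpenMP sketch, not the tool's exact output; dot, a, b, and n are hypothetical names):

double dot(const double *a, const double *b, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+: sum)  // detected sum reduction
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}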

Our framework: Extensions to Polyhedral Frameworks. Correctness of the intersection approach: the serial-elision property makes it correct. Computing conservative dependences: extended access functions to support non-affine subscripts, unknown function calls, non-affine conditionals, etc. Extracting and encoding task-related constructs in the polyhedral representation (SCoP): constructed a task SCoP to compute the HB relation. 25

Evaluation 1 Introduction and Motivation 2 Background 3 Our framework 4 Evaluation 5 Related Work 6 Conclusions and Future work 26

Evaluation: Benchmarks and Platforms.

                   Intel Xeon 5660 (Westmere)   IBM Power 8E (Power 8)
Microarchitecture  Westmere                     PowerPC
Clock speed        2.80 GHz                     3.02 GHz
Cores/socket       6                            12
Total cores        12                           24
Compiler           gcc 4.9.2                    gcc 4.9.2
Compiler flags     -O3 (-fast with icc)         -O3

KASTORS - task parallel (3): Jacobi, Jacobi-blocked, Sparse LU.
RODINIA - loop parallel (8): Back Propagation, CFD Solver, Hotspot, Kmeans, LUD, Needleman-Wunsch, Particle Filter, Path Finder.
Unanalyzable data access patterns: 7 benchmarks. 27

Evaluation: Variants in the Experiments. 1) Original OpenMP program (blue bars): written by the programmer. 2) Automatic parallelization and optimization of the serial-elision version of the OpenMP program (green bars): existing automatic optimizers. 3) Optimized OpenMP program with the intersection approach (yellow bars): our framework (PoPP). 28

Evaluation KASTORS suite + Intel Westmere (12 cores) Geometric mean improvement - 2.02x 29

Evaluation KASTORS suite + IBM Power8 (24 cores) Geometric mean improvement - 7.32x 30

Evaluation RODINIA suite + Intel Westmere (12 cores) Geometric mean improvement - 1.48x 31

Evaluation RODINIA suite + IBM Power8 (24 cores) Geometric mean improvement - 1.89x 32

Related Work 1 Introduction and Motivation 2 Background 3 Our framework 4 Evaluation 5 Related Work 6 Conclusions and Future work 33

Related Work. Dataflow analysis of explicitly parallel programs: extensions to data-parallel / task-parallel languages [J.-F. Collard et al., Euro-Par 96]; extensions to X10 programs with async-finish constructs [T. Yuki et al., PPoPP 13]. The above work is limited to analysis, whereas we also focus on transformations. PENCIL - Platform-Neutral Compute Intermediate Language [Baghdadi et al., PACT 15]: prunes the data-dependence relation on parallel loops; no support for task-parallel constructs as yet; enforces certain coding restrictions related to aliasing, recursion, etc. 34

Related Work (contd). Polyhedral optimization framework for DFGL [Sbirlea et al., LCPC 15]: a dataflow programming model that is implicitly parallel; optimizations via a polyhedral and AST-based framework. Preliminary approach to optimizing parallel programs [Pop and Cohen, CPC 10]: extracts parallel semantics into the compiler IR and performs polyhedral optimizations; envisaged considering OpenMP streaming extensions. 35

Conclusions and Future work 1 Introduction and Motivation 2 Background 3 Our framework 4 Evaluation 5 Related Work 6 Conclusions and Future work 36

Conclusions and Future Work - PoPP. Conclusions: our approach reduced the spurious dependences from the conservative analysis by intersecting them with the HB relation; broadened the range of legal transformations for parallel programs; integrated the HB relation from task-parallel constructs into polyhedral frameworks; and achieved geometric mean performance improvements of 1.62x on Intel Westmere and 2.75x on IBM Power8, with larger improvements on individual benchmarks. Future work: parallel constructs that don't satisfy the serial-elision property; extension to distributed-memory programming models (e.g., MPI); using the happens-before relation for debugging; going beyond the polyhedral model. 37

Finally: optimizing explicitly parallel programs is a new direction for Parallel Architectures and Compilation Techniques (PACT)! Acknowledgments: supported in part by the DOE Office of Science Advanced Scientific Computing Research program through collaborative agreement DE-SC0008882; the Rice Habanero Extreme Scale Software Research Group; the PACT 2015 program committee. Thank you! 38