Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory


Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory
Roshan Dathathri, Thejas Ramashekar, Chandan Reddy, Uday Bondhugula
Department of Computer Science and Automation, Indian Institute of Science
{roshan,chandan.g,thejas,uday}@csa.iisc.ernet.in
Multicore Computing Lab (CSA, IISc)
September 11, 2013

Parallelizing code for distributed-memory architectures

OpenMP code for shared-memory systems:

    for (i = 1; i <= N; i++) {
        #pragma omp parallel for
        for (j = 1; j <= N; j++) {
            <computation>
        }
    }

MPI code for distributed-memory systems:

    for (i = 1; i <= N; i++) {
        set_of_j_s = dist(1, N, processor_id);
        for each j in set_of_j_s {
            <computation>
        }
        <communication>
    }

Explicit communication is required between:
- devices in a heterogeneous system with CPUs and multiple GPUs.
- nodes in a distributed-memory cluster.
Hence, it is tedious to program by hand.
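The dist() call above is pseudocode. A minimal sketch of what such a helper might compute, assuming a simple block distribution of the j loop across processors (the function name, signature and policy are illustrative, not the actual runtime API):

    /* Hypothetical block distribution of iterations [lb, ub] over nprocs
       processors: each processor gets one contiguous chunk of j values. */
    #include <stdio.h>

    static void dist(int lb, int ub, int nprocs, int pid, int *my_lb, int *my_ub)
    {
        int n = ub - lb + 1;                    /* total iterations  */
        int chunk = (n + nprocs - 1) / nprocs;  /* ceiling division  */
        *my_lb = lb + pid * chunk;
        *my_ub = *my_lb + chunk - 1;
        if (*my_ub > ub) *my_ub = ub;           /* last chunk may be smaller */
    }

    int main(void)
    {
        int lo, hi;
        for (int pid = 0; pid < 4; pid++) {
            dist(1, 10, 4, pid, &lo, &hi);
            printf("processor %d executes j = %d..%d\n", pid, lo, hi);
        }
        return 0;
    }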

Affine loop nests

Arbitrarily nested loops with affine bounds and affine array accesses.
Form the compute-intensive core of scientific computations such as:
- stencil-style computations,
- linear algebra kernels,
- alternating direction implicit (ADI) integrations.
Can be analyzed by the polyhedral model.
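As a concrete illustration (in the spirit of the heat-2d benchmark used later, though the actual benchmark may differ in details), every loop bound and every array subscript below is an affine function of the loop iterators and the problem-size symbols, which is what makes the nest amenable to polyhedral analysis:

    /* Illustrative affine loop nest: a Jacobi-style 2-D stencil.
       Bounds and subscripts are affine in t, i, j and the symbols T, N. */
    void heat_2d(int T, int N, double A[N][N], double B[N][N])
    {
        for (int t = 0; t < T; t++) {
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < N - 1; j++)
                    B[i][j] = 0.25 * (A[i-1][j] + A[i+1][j]
                                    + A[i][j-1] + A[i][j+1]);
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < N - 1; j++)
                    A[i][j] = B[i][j];
        }
    }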

Example iteration space

[Figure: a 2-D iteration space (i, j) with dependences (1,0), (0,1) and (1,1); the space is partitioned into tiles, and the tiles are grouped into parallel phases.]

Automatic Data Movement

For affine loop nests:
- Statically determine the data to be transferred between compute devices, with the goal of moving only those values that must be moved to preserve program semantics.
- Generate data movement code that is:
  - parametric in the problem-size symbols and the number of compute devices.
  - valid for any computation placement.

Communication is parameterized on a tile

- A tile represents an iteration of the innermost distributed loop.
- It may or may not be the result of loop tiling.
- A tile is executed atomically by a compute device.

Existing flow-out (FO) scheme

Flow-out set: the values of a tile that need to be communicated to other tiles.
- Union of the per-dependence flow-out sets of all RAW dependences.
[Figure: the flow-out set of a tile highlighted in the example iteration space.]
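In set notation (a sketch consistent with the definition above; the notation is ours: D_RAW is the set of RAW dependence relations and w_d is the write access function of the source statement of dependence d):

    FO(T) \;=\; \bigcup_{d \in \mathcal{D}_{\mathrm{RAW}}} w_d\bigl(\{\, s \in T \;:\; \exists\, t \notin T,\ (s,t) \in d \,\}\bigr)

i.e., the values written inside tile T that are read, via some RAW dependence, by an iteration outside T.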

Existing flow-out (FO) scheme

Receiving tiles: the set of tiles that require the flow-out set.

Existing flow-out (FO) scheme

Communication:
- Not every element in the flow-out set is required by every one of its receiving tiles.
- The scheme only ensures that each receiver requires at least one element of the communicated set.
- It could therefore transfer unnecessary elements.

Our first scheme

Motivation: every element in the data communicated should be required by the receiver.
Key idea: determine the data that needs to be sent from one tile to another, parameterized on a sending tile and a receiving tile.

Flow-in (FI) set

Flow-in set: the values of a tile that need to be received from other tiles.
- Union of the per-dependence flow-in sets of all RAW dependences.
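Dually to the flow-out set (same illustrative notation as before):

    FI(T) \;=\; \bigcup_{d \in \mathcal{D}_{\mathrm{RAW}}} w_d\bigl(\{\, s \notin T \;:\; \exists\, t \in T,\ (s,t) \in d \,\}\bigr)

i.e., the values written outside tile T that some iteration inside T reads via a RAW dependence.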

Flow-out intersection flow-in (FOIFI) scheme

Flow set: parameterized on two tiles.
- The values that need to be communicated from a sending tile to a receiving tile.
- Intersection of the flow-out set of the sending tile and the flow-in set of the receiving tile.
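That is, for a sending tile T_s and a receiving tile T_r:

    \mathrm{Flow}(T_s, T_r) \;=\; FO(T_s) \,\cap\, FI(T_r)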

Flow-out intersection flow-in (FOIFI) scheme

Communication:
- Precise when each receiving tile is executed by a different compute device.
- Could lead to huge duplication when multiple receiving tiles are executed by the same compute device.

Comparison with virtual-processor-based schemes

- Some existing schemes use a virtual-processor-to-physical-processor mapping to handle symbolic problem sizes and numbers of compute devices.
- Tiles can be considered as virtual processors in FOIFI.
- FOIFI has less redundant communication than prior works that use virtual processors, since it:
  - uses exact dataflow information.
  - combines the data to be moved due to multiple dependences.

Our main scheme

Motivation: partition the communication set such that all elements within each partition are required by all receivers of that partition.
Key idea: partition the dependences in a particular way, and determine communication sets and their receivers based on those partitions.

Flow-out partitioning (FOP) scheme

Source-distinct partitioning of dependences partitions the dependences such that:
- all dependences in a partition communicate the same set of values.
- any two dependences in different partitions communicate disjoint sets of values.
Determine the communication set and the receiving tiles for each partition.

Flow-out partitioning (FOP) scheme

- Communication sets of different partitions are disjoint.
- The union of the communication sets of all partitions yields the flow-out set.
- Hence, the flow-out set of a tile is partitioned.
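In other words, writing FO_p(T) for the communication set of partition p for tile T (illustrative notation):

    FO(T) \;=\; \biguplus_{p} FO_p(T), \qquad FO_p(T) \cap FO_q(T) = \emptyset \ \text{ for } p \neq q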

Source-distinct partitioning of dependences

Initially, each dependence ((1,0), (0,1) and (1,1) in the example) is:
- restricted to those constraints which are inter-tile, and
- put in a separate partition.

Source-distinct partitioning of dependences

For each pair of dependences in two different partitions:
- Find the source iterations that access the same region of data (source-identical iterations).
- Get new dependences by restricting the original dependences to the source-identical iterations.
- Subtract the new dependences from the original dependences.
The set of new dependences formed becomes a new partition.

Source-distinct partitioning of dependences

Stop when no new partitions can be formed.
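To make the fixed-point structure of the refinement concrete, here is a deliberately simplified, self-contained sketch. It is not the actual polyhedral implementation, which operates on dependence relations and access functions; instead, the data each partition communicates is modeled as a bitmask over a tiny universe of array elements, so the intersect-and-subtract step can be shown in a few lines. The initial masks and all names are illustrative assumptions.

    /* Toy model of source-distinct partitioning: the data communicated by
       each partition is a bitmask over a small universe of array elements.
       (Illustrative only; the real scheme works on polyhedral relations.) */
    #include <stdio.h>
    #include <stdint.h>

    #define MAX_PARTS 16

    typedef struct {
        uint64_t data;   /* set of array elements this partition communicates */
    } partition_t;

    int main(void)
    {
        /* Three dependences with overlapping communicated data (made-up masks). */
        partition_t parts[MAX_PARTS] = {
            { 0x0F },    /* dependence 1: elements 0-3 */
            { 0x3C },    /* dependence 2: elements 2-5 */
            { 0x30 },    /* dependence 3: elements 4-5 */
        };
        int nparts = 3;

        /* Pairwise refinement until a fixed point: data common to two
           partitions is split out into a new partition of its own. */
        int changed = 1;
        while (changed) {
            changed = 0;
            for (int p = 0; p < nparts && nparts < MAX_PARTS; p++) {
                for (int q = p + 1; q < nparts && nparts < MAX_PARTS; q++) {
                    uint64_t common = parts[p].data & parts[q].data;
                    if (common == 0)
                        continue;
                    parts[p].data &= ~common;      /* subtract from the originals */
                    parts[q].data &= ~common;
                    parts[nparts++].data = common; /* common data: new partition  */
                    changed = 1;
                }
            }
        }

        /* The resulting partitions communicate pairwise-disjoint sets of values. */
        for (int p = 0; p < nparts; p++)
            if (parts[p].data)
                printf("partition %d communicates elements 0x%02llx\n",
                       p, (unsigned long long)parts[p].data);
        return 0;
    }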

Flow-out partitioning (FOP) scheme: at runtime

For each partition and each tile executed, one of the following is chosen:
- multicast-pack: the partitioned communication set from this tile is copied to the buffer of its receivers.
- unicast-pack: the partitioned communication set from this tile to a receiving tile is copied to the buffer of that receiver.
unicast-pack is chosen only if each receiving tile is executed by a different receiver.
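A minimal sketch of that decision, assuming a hypothetical per-partition descriptor that records how many receiving tiles there are and how many distinct receivers execute them (the names and the structure are illustrative, not the generated code's actual data structures):

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical runtime descriptor for one flow-out partition of one tile. */
    typedef struct {
        int num_receiving_tiles;     /* tiles that need this partition's data  */
        int num_distinct_receivers;  /* compute devices executing those tiles  */
    } fop_partition_info;

    static bool use_unicast_pack(const fop_partition_info *p)
    {
        /* unicast-pack only when every receiving tile is executed by a different
           receiver; otherwise multicast-pack avoids copying the same data to the
           same device more than once. */
        return p->num_distinct_receivers == p->num_receiving_tiles;
    }

    int main(void)
    {
        fop_partition_info a = { 4, 4 };  /* each receiving tile on its own device */
        fop_partition_info b = { 4, 2 };  /* receiving tiles share devices         */
        printf("%s\n", use_unicast_pack(&a) ? "unicast-pack" : "multicast-pack");
        printf("%s\n", use_unicast_pack(&b) ? "unicast-pack" : "multicast-pack");
        return 0;
    }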

Flow-out partitioning (FOP) scheme

Communication:
- Reduces the granularity at which receivers are determined.
- Reduces the granularity at which the conditions for choosing between multicast-pack and unicast-pack are applied.
- Minimizes communication of both duplicate and unnecessary elements.

Another example - dependences

Let (k, i, j) be a source iteration and (k', i', j') a target iteration.
- Dependence1: k' = k + 1, i' = i, j' = j
- Dependence2: k' = k + 1, i' = i, j' = k + 1
[Figure: a 3-D iteration space with Dependence1 and Dependence2; the plane j = k + 1 is marked.]

Another example - FO scheme

[Figure: the flow-out set of a tile under FO for this example.]

Another example - FOIFI scheme

[Figure: the flow sets from a tile to its receiving tiles under FOIFI for this example.]

Another example - FOP scheme

[Figure: the flow-out partitions of a tile under FOP for this example.]

Implementation

- Implemented as part of the PLUTO framework.
- Input is sequential C code, which is tiled and parallelized using the PLUTO algorithm.
- Data movement code is then generated automatically using our scheme.

Implementation - distributed-memory systems

- Code for distributed-memory systems is generated automatically using existing techniques.
- Asynchronous MPI primitives are used to communicate between nodes in a distributed-memory system.
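For illustration, the kind of non-blocking MPI communication this relies on looks roughly as follows (a sketch, not the generated code; the buffer names, counts and tags are placeholders):

    /* Sketch of asynchronous point-to-point communication with MPI. */
    #include <mpi.h>

    void exchange(double *send_buf, int send_count, int dest,
                  double *recv_buf, int recv_count, int src)
    {
        MPI_Request reqs[2];

        /* Post the receive and the send without blocking... */
        MPI_Irecv(recv_buf, recv_count, MPI_DOUBLE, src, /*tag=*/0,
                  MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(send_buf, send_count, MPI_DOUBLE, dest, /*tag=*/0,
                  MPI_COMM_WORLD, &reqs[1]);

        /* ...so packing or computation can overlap here, then complete both. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }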

Implementation - heterogeneous systems

- For heterogeneous systems, the host CPU acts both as a compute device and as the orchestrator of data movement between compute devices, while each GPU acts only as a compute device.
- The OpenCL functions clEnqueueReadBufferRect() and clEnqueueWriteBufferRect() are used for data movement.
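As a rough illustration of why the Rect variants are convenient here, the sketch below copies a 2-D sub-block of a device buffer (e.g. a communication set that is not contiguous in memory) into host memory in a single call; the queue, buffer and geometry parameters are placeholders, and error handling is omitted.

    #include <CL/cl.h>

    /* Read a rows x cols sub-block of a 2-D array of doubles, starting at
       element (row0, col0) of the device buffer, into host memory.
       dev_row_elems is the number of doubles per row of the device array. */
    cl_int read_subblock(cl_command_queue queue, cl_mem dev_buf,
                         size_t row0, size_t col0, size_t rows, size_t cols,
                         size_t dev_row_elems, double *host_block)
    {
        size_t buffer_origin[3] = { col0 * sizeof(double), row0, 0 };
        size_t host_origin[3]   = { 0, 0, 0 };
        size_t region[3]        = { cols * sizeof(double), rows, 1 };

        return clEnqueueReadBufferRect(queue, dev_buf, CL_TRUE,
                                       buffer_origin, host_origin, region,
                                       dev_row_elems * sizeof(double), 0,  /* buffer pitches */
                                       cols * sizeof(double), 0,           /* host pitches   */
                                       host_block, 0, NULL, NULL);
    }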

Experimental evaluation: distributed-memory cluster

- 32-node InfiniBand cluster.
- Each node has two quad-core Intel Xeon E5430 2.66 GHz processors.
- MVAPICH2-1.8 is used as the MPI implementation.

Benchmarks

- Floyd-Warshall (floyd).
- LU decomposition (lu).
- Alternating Direction Implicit solver (adi).
- 2-D Finite-Difference Time-Domain kernel (fdtd-2d).
- 2-D heat equation (heat-2d).
- 3-D heat equation (heat-3d).
The first four are from the PolyBench/C 3.2 suite, while heat-2d and heat-3d are widely used stencil computations.

Comparison of FOP, FOIFI and FO

- All three use the same parallelizing transformation, and hence have the same frequency of communication.
- They differ only in communication volume.
- Comparing execution times therefore directly compares their efficiency.

Comparison of FOP with FO

- Communication volume reduced by a factor of 1.4x to 63.5x.
- The reduction translates to significant speedup, except for heat-2d.
- Speedup of up to 15.9x; mean speedup of 1.55x.

Comparison of FOP with FOIFI

- Similar behavior for stencil-style codes.
- For floyd and lu: communication volume reduced by a factor of 1.5x to 31.8x; speedup of up to 1.84x.
- Mean speedup of 1.11x.

OMPD - OpenMP to MPI

- Takes OpenMP code as input and generates MPI code.
- Primarily a runtime dataflow-analysis technique.
- Handles only those affine loop nests that have a repetitive communication pattern: communication must not vary with the outer sequential loop.
- Cannot handle floyd, lu, or time-tiled (outer sequential dimension tiled) stencil-style codes.

Comparison of FOP with OMPD

- For heat-2d and heat-3d: significant speedup over OMPD.
  - The computation time is much lower: better load balance and locality due to advanced transformations, which OMPD cannot handle.
- For adi: significant speedup over OMPD.
  - Same volume of communication; better performance due to loop tiling and lower runtime overhead.
- Mean speedup of 3.06x.

Unified Parallel C (UPC)

- A unified programming model for both shared-memory and distributed-memory systems.
- All benchmarks were manually ported to UPC:
  - sharing data only if it may be accessed remotely.
  - applying UPC-specific optimizations such as localized array accesses, block copies, and one-sided communication.

Comparison of FOP with UPC

- For lu, heat-2d and heat-3d: significant speedup over UPC.
  - Better load balance and locality due to advanced transformations; it is difficult to write such transformed code manually.
  - The UPC model is not suitable when the same data element could be written by different nodes in different parallel phases.
- For adi: significant speedup over UPC.
  - Same computation time and communication volume, but the data to be communicated is not contiguous in memory; UPC incurs a huge runtime overhead for such multiple shared-memory requests to non-contiguous data.
- For fdtd-2d and floyd: UPC performs slightly better.
  - Same computation time and communication volume; the data to be communicated is contiguous in memory, so UPC has no additional runtime overhead.
- Mean speedup of 2.19x.

Results: distributed-memory cluster

[Figure: FOP strong scaling on the distributed-memory cluster (speedup vs. number of nodes, 1 to 32, for floyd, fdtd-2d, heat-3d, lu, heat-2d and adi).]
[Figure: floyd speedup of FOP, FOIFI, FO and hand-optimized UPC code over seq on the distributed-memory cluster (speedup vs. number of nodes, 1 to 32).]

For the transformations and computation placement chosen, FOP achieves the minimum communication volume.

Experimental evaluation: heterogeneous systems

Intel-NVIDIA system:
- Intel Xeon multicore server with 12 Xeon E5645 cores.
- 4 NVIDIA Tesla C2050 graphics processors connected over the PCI Express bus.
- NVIDIA driver version 304.64 supporting OpenCL 1.1.

Comparison of FOP with FO

- Communication volume reduced by a factor of 11x to 83x.
- The reduction translates to significant speedup.
- Speedup of up to 3.47x; mean speedup of 1.53x.

Results: heterogeneous systems

[Figure: FOP strong scaling on the Intel-NVIDIA system (speedup vs. device combination: 1GPU, 1CPU+1GPU, 2GPUs, 4GPUs, for floyd, fdtd-2d, heat-3d, lu and heat-2d).]

For the transformations and computation placement chosen, FOP achieves the minimum communication volume.

Acknowledgements

Conclusions

- The proposed framework frees programmers from the burden of moving data.
- Partitioning of dependences enables precise determination of the data to be moved.
- Our tool is the first to parallelize affine loop nests for a combination of CPUs and GPUs while providing data-movement precision at the granularity of array elements.
- If implemented in compilers, our techniques can provide OpenMP-like programmer productivity for distributed-memory and heterogeneous architectures.
- Publicly available: http://pluto-compiler.sourceforge.net/

Results: distributed-memory cluster

- Mean speedup of FOP over FO: 1.55x
- Mean speedup of FOP over OMPD: 3.06x
- Mean speedup of FOP over UPC: 2.19x

Results: heterogeneous systems

- Mean speedup of FOP over FO: 1.53x
