Experiences with Achieving Portability across Heterogeneous Architectures


Experiences with Achieving Portability across Heterogeneous Architectures
Lukasz G. Szafaryn+, Todd Gamblin++, Bronis R. de Supinski++ and Kevin Skadron+
+ University of Virginia
++ Lawrence Livermore National Laboratory

Outline
- Introduction
- Portability Problems
- Approach
- Molecular Dynamics Code
- Feature Tracking Code
- Conclusions
- Future Work

Introduction
- Cluster nodes are becoming heterogeneous (multi-core CPU and GPU) in order to increase their compute capability.
- Codes executing at the level of a node need to be ported to new architectures/accelerators for better performance.
- Most existing codes are written for sequential execution, over-optimized for a particular architecture, and difficult to port.
- We look at 2 representative applications; for each of them, we want to arrive at and maintain a single source code base that can run on different architectures.

Portability Problems: Code Structure
- Sequential form
  - Does not express parallelism well; needs effort to extract
  - Many small, independent loops and tasks that could run concurrently
  - Optimizations such as function inlining and loop unrolling
- Monolithic and coupled structure
  - Hard to identify sections that can be parallelized (and/or offloaded)
  - Hard to determine the range of accessed data due to nested pointer structures
- Architecture-specific optimizations (illustrated below)
  - Specifying native instructions to ensure desired execution
  - Memory alignment via padding
  - Orchestrating computation according to cache line size
- Prospect of significant code changes delays porting
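
For illustration only (this fragment is not from either of the studied applications), the same computation written with manual unrolling and padding versus a plain loop; the "optimized" form obscures the fact that every iteration is independent and ties the code to one architecture's cache line size:

    #include <stdio.h>

    #define N   1024
    #define PAD 16                      /* cache-line padding, architecture-specific */

    static float a[N + PAD], b[N + PAD];

    int main(void) {
        for (int i = 0; i < N; i++) b[i] = (float)i;

        /* Over-optimized form: manual 4-way unrolling tied to one architecture. */
        for (int i = 0; i < N; i += 4) {
            a[i]     = 2.0f * b[i];
            a[i + 1] = 2.0f * b[i + 1];
            a[i + 2] = 2.0f * b[i + 2];
            a[i + 3] = 2.0f * b[i + 3];
        }

        /* Portable form: one simple loop whose iterations are clearly independent. */
        for (int i = 0; i < N; i++)
            a[i] = 2.0f * b[i];

        printf("%f\n", a[N - 1]);
        return 0;
    }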

Portability Problems: Parallel Framework
Current frameworks are inconvenient or inadequate:
- Cilk, TBB: APIs with runtime calls; work only with CPUs
- OpenMP: convenient annotation-based framework; works only with CPUs
- OpenCL: portable CPU-GPU API with runtime calls; low-level and too difficult to learn
- CUDA, CUDA-lite: convenient GPU APIs with runtime calls; vendor-specific
- Cetus: OpenMP-to-CUDA translator; limited to OpenMP; relies heavily on code analysis
- PGI Accelerator API: convenient annotation-based framework; works only with GPUs; relies on code analysis
- Future OpenMP with accelerator support: convenient annotation-based framework; expected to fully support CPUs and GPUs

Portability Problems: Summary
Difficulties:
- Achieving portable code written in a convenient framework is not currently possible
- Porting across architectures requires learning a new, usually architecture-specific, language; scientists are reluctant to do that
- Serial code, usually over-optimized for a specific architecture, must be converted to parallel code; only skilled programmers can do that
Outcomes:
- Degraded performance after initial porting; parallelism is difficult to extract for improvement
- Divergent code bases that are difficult to maintain
- Instead, researchers will continue using current languages and corresponding architectures as long as they provide at least some incremental benefits

Approach: Code Structure
- Clearly expose the parallel characteristics of the code to facilitate portability and tradeoffs across architectures
- Create a consolidated loop structure that maximizes the width of available parallelism
- Organize code in a modular fashion that clearly delineates parallel sections of the code
- Remove architecture-specific optimizations

Before:
    for() {                  (independent iter.)
      dependent task;
    }
    for() {                  (independent iter.)
      dependent task;
      arch-specific code;
    }

After:
    for() {                  (all independent iter.)
      dependent task;
      dependent task;
      generic code;
    }
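
A minimal before/after in C mirroring the diagram above (hypothetical, not taken from the studied codes): two small loops over the same independent iterations are consolidated into one loop whose full width of parallelism is visible to a compiler or directive.

    #include <stdio.h>

    #define N 1000
    static float x[N], y[N];

    int main(void) {
        /* Before: two small loops, each exposing only part of the parallel work.
         * for (int i = 0; i < N; i++) x[i] = 0.5f * i;
         * for (int i = 0; i < N; i++) y[i] = 2.0f * i;
         */

        /* After: one consolidated loop in which all iterations are independent. */
        for (int i = 0; i < N; i++) {
            x[i] = 0.5f * i;
            y[i] = 2.0f * i;
        }

        printf("%f %f\n", x[N - 1], y[N - 1]);
        return 0;
    }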

Approach: Parallel Framework
Use a portable directive-based approach to:
- Be compatible with current languages
- Minimize changes to code
- Eliminate run-time calls in favor of annotations
- Achieve best-effort parallelization with high-level directives
- Tune performance with optional lower-level directives

Runtime-call style (GPU API):
    dependent task;
    device allocate();
    device transfer();
    kernel();
    device transfer back();
    device free();
    dependent task;

    kernel {                 (independent iter.)
      dependent task;
    }

Directive style:
    # pragma_for parallel
    for() {
      # pragma_for vector
      for() {                (independent iter.)
        dependent task;
      }
    }
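
As a concrete point of comparison, the directive style expressed in standard OpenMP (one of the backends used later in the talk); a minimal sketch that compiles with or without OpenMP enabled:

    #include <stdio.h>
    #include <math.h>

    #define N 100000
    static double src[N], dst[N];

    int main(void) {
        for (int i = 0; i < N; i++) src[i] = (double)i;

        /* High-level directive: best-effort parallelization, no runtime calls.
         * Optional lower-level clauses (e.g. schedule(static)) can be added
         * later to tune performance without restructuring the code. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            dst[i] = sqrt(src[i]);

        printf("%f\n", dst[N - 1]);
        return 0;
    }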

Approach: Portable Implementation
- Demonstrate the approach of writing a single code with syntactic and performance portability
- Follow the approach of the PGI Accelerator API and create a higher-level generic framework that can be translated to OpenMP and the PGI Accelerator API

Our generic framework:
    #define MCPU  // or GPU
    dependent task;
    # pragma api for parallel
    for() {
      # pragma api for vector
      for() {                (independent iter.)
        dependent task;
      }
    }

OpenMP:
    dependent task;
    # pragma omp parallel for
    for() {
      for() {                (independent iter.)
        dependent task;
      }
    }

PGI Accelerator API:
    dependent task;
    # pragma acc for parallel
    for() {
      # pragma acc for vector
      for() {                (independent iter.)
        dependent task;
      }
    }
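
The framework itself is applied by source-to-source translation, as shown for the two applications below. Purely as an illustration of the idea, the same selection can be sketched with the C preprocessor; the macro name API_FOR_PARALLEL is invented here and is not part of the actual framework:

    #include <stdio.h>

    #define MCPU                       /* or GPU */

    #ifdef MCPU
      #define API_FOR_PARALLEL _Pragma("omp parallel for")   /* multi-core CPU */
    #else
      #define API_FOR_PARALLEL _Pragma("acc for parallel")   /* GPU (PGI Accelerator) */
    #endif

    #define N 100000
    static float d[N];

    int main(void) {
        API_FOR_PARALLEL
        for (int i = 0; i < N; i++)
            d[i] = 0.5f * i;

        printf("%f\n", d[N - 1]);
        return 0;
    }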

Molecular Dynamics Code: Overview
- Particle space is divided into large boxes (or cells; they could be any shape), where each large box is divided into small boxes.
- Large boxes are assigned to different cluster nodes for parallel computation.
- Inside a large box, small boxes with their neighbors are processed in parallel by a multi-core CPU or GPU.
- The code computes interactions between the particles in any given small box and those in its neighbor small boxes.
[Slide figure: loop nest over # of home boxes -> # of neighbor boxes (select neighbor box) -> # of home particles (select home particle) -> # of neighbor particles (interactions with neighbor particles).]

Molecular Dynamics Code: Code Translation

Our generic framework:
    #define MCPU  // or GPU
    #pragma api data copyin(box[0:#_boxes-1]) \
      copyin(pos[0:#_par.-1]) copyin(chr[0:#_par.-1]) \
      copyout(dis[0:#_par.-1])
    {
      #pragma api compute
      {
        #pragma api parallel(30) private() cache(home_box)
        for(i=0; i<#_home_boxes; i++){
          Home box setup
          #pragma api sequential
          for(j=0; j<#_neighbor_boxes; j++){
            Neighbor box setup
            #pragma api vector(128) private() cache(neighbor_box)
            for(k=0; k<#_home_particles; k++){
              #pragma api sequential
              for(l=0; l<#_neighbor_particles; l++){
                Calculation of interactions
              }
            }
          }
        }
      }
    }

Source translation to OpenMP:
    omp_set_num_threads(30);
    #pragma omp parallel for
    for(i=0; i<#_home_boxes; i++){
      Home box setup
      for(j=0; j<#_neighbor_boxes; j++){
        Neighbor box setup
        for(k=0; k<#_home_particles; k++){
          for(l=0; l<#_neighbor_particles; l++){
            Calculation of interactions
          }
        }
      }
    }

Source translation to the PGI Accelerator API:
    #pragma acc data region copyin(box[0:#_boxes-1]) \
      copyin(pos[0:#_par.-1]) copyin(chr[0:#_par.-1]) \
      copyout(dis[0:#_par.-1])
    {
      #pragma acc region
      {
        #pragma acc parallel(30) independent private() cache(home_box)
        for(i=0; i<#_home_boxes; i++){
          Home box setup
          #pragma acc sequential
          for(j=0; j<#_neighbor_boxes; j++){
            Neighbor box setup
            #pragma acc vector(128) private() cache(neighbor_box)
            for(k=0; k<#_home_particles; k++){
              #pragma acc sequential
              for(l=0; l<#_neighbor_particles; l++){
                Calculation of interactions
              }
            }
          }
        }
      }
    }
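
Below is a compilable skeleton of the OpenMP translation above; the box and particle counts and the interaction body are placeholders, not the actual molecular dynamics kernel:

    #include <stdio.h>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    #define HOME_BOXES     64
    #define NEIGHBOR_BOXES  6
    #define PARTICLES      32

    static float dis[HOME_BOXES][PARTICLES];

    int main(void) {
    #ifdef _OPENMP
        omp_set_num_threads(30);                        /* as in the translated code */
    #endif
        #pragma omp parallel for                        /* home boxes across CPU cores */
        for (int i = 0; i < HOME_BOXES; i++)
            for (int j = 0; j < NEIGHBOR_BOXES; j++)    /* neighbor boxes: sequential */
                for (int k = 0; k < PARTICLES; k++)     /* home particles */
                    for (int l = 0; l < PARTICLES; l++) /* neighbor particles */
                        dis[i][k] += 0.001f * (float)(k - l) * (float)(j + 1);

        printf("%f\n", dis[0][0]);
        return 0;
    }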

Molecular Dynamics Code: Workload Mapping
[Slide figure: within each time step, home boxes are distributed across processors 1-30; on each processor, vector units 1-n compute the home box-home box particle interactions and the home box-neighbor box 1..n particle interactions; a global SYNC separates time step 1 from time step 2.]
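
The two-level mapping in the figure can be sketched on the CPU side with OpenMP; as an assumption for illustration, omp simd stands in for the vector-unit level, while the talk's GPU mapping instead goes through the PGI Accelerator backend:

    #include <stdio.h>

    #define TIME_STEPS   2
    #define HOME_BOXES  30
    #define PARTICLES  128

    static float dis[HOME_BOXES][PARTICLES];

    int main(void) {
        for (int t = 0; t < TIME_STEPS; t++) {   /* time steps run one after another */
            #pragma omp parallel for             /* home boxes -> processors 1..30 */
            for (int i = 0; i < HOME_BOXES; i++) {
                #pragma omp simd                 /* particle work -> vector units 1..n */
                for (int k = 0; k < PARTICLES; k++)
                    dis[i][k] += 0.5f * (float)(k + t);
            }                                    /* implicit barrier = global SYNC */
        }
        printf("%f\n", dis[0][PARTICLES - 1]);
        return 0;
    }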

Molecular Dynamics Code: Performance

 # | Arch.      | Code Feature        | Framework              | Kernel Length [lines] | Exec. Time [s] | Speedup [x]
 1 | 1-core CPU | Original            | C                      | 47                    | 56.19          | 1
 2 | 1-core CPU | Structured          | C                      | 34                    | 73.73          | 0.76
 3 | 8-core CPU | Structured          | OpenMP                 | 37                    | 10.73          | 5.23
 4 | GPU        | GPU-specific        | CUDA                   | 59                    | 5.96           | 9.43
 5 |            |                     |                        | 96                    |                | 9.43
 6 | 1-core CPU | Structured          | Our Portable Framework | 47                    | 73.72          | 0.76
 7 | 8-core CPU | Structured          | Our Portable Framework | 47                    | 10.73          | 5.23
 8 | GPU        | Structured          | Our Portable Framework | 47                    | 7.11           | 7.91
 9 |            |                     |                        | 47                    |                | 7.91
10 | GPU        | Init/Trans Overhead | CUDA                   | ---                   | 0.87           | ---

- Our implementation achieves over a 50% reduction in code length compared to OpenMP+CUDA
- GPU performance of our implementation is within 17% of CUDA

Feature Tracking Code: Overview
- The code tracks the movement of heart walls (by tracking sample points) throughout a sequence of frames.
- Movement of points between frames is determined by comparing the areas around the points.
- Sets of frames may be assigned to different cluster nodes for parallel computation.
- Frames in a set are processed in series on each node, while feature detection proceeds in parallel on a multi-core CPU or GPU.
[Slide figure: loop nest over # of frames (read frame, I/O) -> # of sample points (select inner/outer sample point) -> # of pixels (detect point displacement).]
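
A structural sketch (simplified; not the authors' heart-wall tracking code) of the frame and sample-point loop nest described above:

    #include <stdio.h>

    #define FRAMES         4
    #define SAMPLE_POINTS 51
    #define AREA          81            /* pixels compared around each sample point */

    static float displacement[FRAMES][SAMPLE_POINTS];

    int main(void) {
        for (int f = 0; f < FRAMES; f++) {          /* frames processed in series (I/O) */
            #pragma omp parallel for                /* sample points tracked in parallel */
            for (int p = 0; p < SAMPLE_POINTS; p++) {
                float score = 0.0f;
                for (int a = 0; a < AREA; a++)      /* compare the area around the point */
                    score += 0.01f * (float)(a + p + f);   /* placeholder correlation */
                displacement[f][p] = score;
            }
        }
        printf("%f\n", displacement[FRAMES - 1][0]);
        return 0;
    }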

Feature Tracking Code: Code Translation

Our generic framework:
    #define MCPU  // or GPU
    for(i=0; i<#_frames; i++){
      Read frame
      #pragma api data copyin(frm[0:frm_siz-1]) \
        copyin(ini_loc[0:#_smp_pnts.-1]) \
        local(con/cor[0:#_pixels]) copyout(fin_loc[0:#_.-1])
      {
        #pragma api compute
        {
          #pragma api parallel(30) independent
          for(j=0; j<#_sample_points; j++){
            #pragma api vector(512) private()
            Convolving/correlating with templates
            #pragma api vector(512) private()
            Determining displacement
          }
        }
      }
    }

Source translation to OpenMP:
    omp_set_num_threads(30);
    for(i=0; i<#_frames; i++){
      Read frame
      #pragma omp parallel for
      for(j=0; j<#_sample_points; j++){
        Convolving/correlating with templates
        Determining displacement
      }
    }

Source translation to the PGI Accelerator API:
    for(i=0; i<#_frames; i++){
      Read frame
      #pragma acc data region copyin(frm[0:frm_siz-1]) \
        copyin(ini_loc[0:#_smp_pnts.-1]) \
        local(con/cor[0:#_pixels]) copyout(fin_loc[0:#_.-1])
      {
        #pragma acc region
        {
          #pragma acc parallel(30) independent
          for(j=0; j<#_sample_points; j++){
            #pragma acc vector(512) independent private()
            Convolving/correlating with templates
            #pragma acc vector(512) independent private()
            Determining displacement
          }
        }
      }
    }
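
A compilable sketch of the same per-frame structure written with modern OpenACC directives, which roughly correspond to the PGI Accelerator "data region"/"region" syntax shown above; the sizes and the correlation body are placeholders:

    #include <stdio.h>

    #define FRAMES         4
    #define FRAME_SIZE    (640 * 480)
    #define SAMPLE_POINTS 51
    #define AREA          1024

    static float frame[FRAME_SIZE];
    static float fin_loc[SAMPLE_POINTS];

    int main(void) {
        for (int i = 0; i < FRAMES; i++) {              /* frames processed in series */
            for (int p = 0; p < FRAME_SIZE; p++)        /* "Read frame" placeholder */
                frame[p] = (float)((i + p) % 255);

            #pragma acc data copyin(frame[0:FRAME_SIZE]) copyout(fin_loc[0:SAMPLE_POINTS])
            {
                #pragma acc parallel loop               /* sample points in parallel */
                for (int j = 0; j < SAMPLE_POINTS; j++) {
                    float sum = 0.0f;
                    #pragma acc loop vector reduction(+:sum)   /* pixels -> vector units */
                    for (int p = 0; p < AREA; p++)
                        sum += frame[(j * 97 + p) % FRAME_SIZE];  /* placeholder correlation */
                    fin_loc[j] = sum;
                }
            }
        }
        printf("%f\n", fin_loc[0]);
        return 0;
    }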

Feature Tracking Code: Workload Mapping
[Slide figure: for each frame, sample points are distributed across processors 1-30; on each processor, vector units 1-n perform the convolution, correlation, shifting, reduction, and statistical steps before determining the point displacement; a global sync separates frame 1 from frame 2.]

Feature Tracking Code: Performance

 # | Arch.      | Code Feature        | Framework              | Kernel Length [lines] | Exec. Time [s] | Speedup [x]
 1 | 1-core CPU | Original            | C                      | 132                   | 117.21         | 1
 2 | 1-core CPU | Structured          | C                      | 124                   | 112.44         | 1.04
 3 | 8-core CPU | Structured          | OpenMP                 | 138                   | 15.84          | 7.40
 4 | GPU        | GPU-specific        | CUDA                   | 156                   | 6.54           | 17.92
 5 |            |                     |                        | 294                   |                | 17.92
 6 | 1-core CPU | Structured          | Our Portable Framework | 144                   | 112.44         | 1.14
 7 | 8-core CPU | Structured          | Our Portable Framework | 144                   | 15.84          | 8.09
 8 | GPU        | Structured          | Our Portable Framework | 144                   | 7.21           | 16.25
 9 |            |                     |                        | 144                   |                | 16.25
10 | GPU        | Init/Trans Overhead | CUDA                   | ---                   | 1.13           | ---

- Our implementation achieves over a 50% reduction in code length compared to OpenMP+CUDA
- GPU performance of our implementation is within 10% of CUDA

Conclusions
- A common high-level directive-based framework can support efficient execution across architecture types.
- Correct use of this framework, with the support of current state-of-the-art parallelizing compilers, can yield performance comparable to custom, low-level code.
- Our approach results in increased programmability across architectures and decreased code maintenance cost.

Future Work
- Test our solution with a larger number of applications
- Propose several annotations that allow (a hypothetical sketch follows below):
  - Assignment to particular devices
  - Concurrent execution of loops, each limited to a subset of processing elements
- Extend the translator with code analysis and code translation to a lower-level language (CUDA) to implement these features for the GPU
- Source code for the translator will be released shortly at lava.cs.virginia.edu
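
Purely as a hypothetical sketch of what such annotations might look like; the device() and units() clauses are invented here for illustration and are not defined in the talk, and a standard C compiler simply ignores the unknown pragmas:

    #include <stdio.h>

    #define N 1000
    static float x[N], y[N];

    int main(void) {
        /* Hypothetical device-assignment annotation: run this loop on GPU 0. */
        #pragma api for parallel device(gpu:0)
        for (int i = 0; i < N; i++)
            x[i] = 0.5f * i;

        /* Hypothetical concurrent-execution annotation: the clause would express
         * running this loop at the same time, restricted to a subset of the
         * CPU's processing elements. */
        #pragma api for parallel device(cpu) units(0:15)
        for (int i = 0; i < N; i++)
            y[i] = 2.0f * i;

        printf("%f %f\n", x[N - 1], y[N - 1]);
        return 0;
    }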

Thank you