Data Parallel Algorithmic Skeletons with Accelerator Support


Slide 1/22: Data Parallel Algorithmic Skeletons with Accelerator Support
Steffen Ernsting and Herbert Kuchen
Westfälische Wilhelms-Universität Münster
July 2, 2015

Slide 2/22: Agenda

- Hardware accelerators: GPU and Xeon Phi
- Skeleton implementation (C++ vs. Java)
  - Parallelization
  - Providing the user function
  - Adding additional arguments
- Performance comparison: 4 benchmark applications (matrix multiplication, N-Body, shortest paths, ray tracing)

Slide 3/22: Hardware Accelerators

Graphics Processing Units (GPUs)
- K20x: 2688 CUDA cores
- C++ plus CUDA/OpenCL
- Offload compute-intensive kernels

Intel Xeon Phi
- ~60 x86 cores (240 hardware threads)
- C++ plus pragmas (plus SIMD intrinsics)
- Offload and native programming models
- Intel's pitch: "recompile and run"

Slide 4/22: Accelerator Support in Java

Various projects aim to add GPU support to Java, following two approaches: bindings vs. JIT (byte)code translation.
- Bindings: e.g. JCUDA, JOCL, JavaCL
- Code translation: e.g. Aparapi, Rootbeer, Java-GPU
- Java 9...?

Why choose accelerated Java over accelerated C++?
- Write and compile once, run everywhere
- Huge JDK: lots of little helpers
- Automated memory management

Slide 5/22: Aparapi

- Created by Gary Frost (AMD), released as open source
- JIT compilation: Java bytecode to OpenCL
- Execution modes: GPU, JTP (Java Thread Pool), SEQ, PHI (experimental)
- Offers good programmability
- Restriction: primitive types only

Plain Java code:

    int[] a = {1, 2, 3, 4};
    int[] b = {5, 6, 7, 8};
    int[] c = new int[a.length];

    for (int i = 0; i < c.length; i++) {
        c[i] = a[i] + b[i];
    }

Aparapi code:

    Kernel k = new Kernel() {
        public void run() {
            int i = getGlobalId();
            c[i] = a[i] + b[i];
        }
    };
    k.execute(Range.create(c.length));

Slide 6/22: The Muenster Skeleton Library (Muesli)

- C++ skeleton library
- Target architectures: multi-core clusters, possibly equipped with accelerators such as GPUs and Xeon Phis
- Parallelization: MPI, OpenMP, and CUDA
- Data parallel skeletons (with accelerator support): map, zip, fold, and variants
- Task parallel skeletons: pipe, farm, divide & conquer, branch & bound
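To make the skeleton idea concrete, here is a self-contained sketch (not Muesli code; all names are illustrative) of a data parallel map skeleton: the skeleton owns the loop and its parallelization, while the user supplies only the element-wise function.

    // Self-contained sketch of a data parallel map skeleton (not Muesli
    // code; names are illustrative). The skeleton owns the loop and its
    // parallelization; the user only supplies the element-wise function.
    #include <cstdio>
    #include <vector>

    template <typename T, typename F>
    void mapSkeleton(std::vector<T>& data, F f) {
        #pragma omp parallel for   // intra-node parallelization via OpenMP
        for (long i = 0; i < static_cast<long>(data.size()); i++) {
            data[i] = f(data[i]);
        }
    }

    int main() {
        std::vector<int> a = {1, 2, 3, 4};
        mapSkeleton(a, [](int x) { return x + 1; });
        for (int x : a) std::printf("%d ", x);   // prints: 2 3 4 5
        return 0;
    }

In Muesli itself the same division of labor applies, but the data structure is distributed across nodes and the loop may run under OpenMP, CUDA, or Aparapi, as the next slide shows.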

Slide 7/22: Parallelization

Inter-node parallelization (MPI): distributed data structures (DArray<T>, DMatrix<T>)

[Figure: (a) a DMatrix M = new DMatrix(...) distributed blockwise over processes P0-P3; (b) the result of M.map(add1), with every element of each local partition incremented by 1]

Intra-node parallelization:
- C++: OpenMP, CUDA
- Java: Aparapi

(c) Fold skeleton (pseudocode):

    T fold(FoldFunction f) {
        // Intra-node: OpenMP, CUDA, or Aparapi
        T localResult = localFold(f, localPartition);

        // Inter-node: MPI
        T[] localResults = new T[numProcs];
        allgather(localResult, localResults);

        return localFold(f, localResults);
    }
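The two-level structure of (c) can be spelled out in plain MPI. The following is a self-contained sketch (not Muesli code) of the pattern: fold the local partition, allgather the partial results, then fold the partials on every process.

    // Self-contained MPI sketch of the two-level fold pattern above
    // (not Muesli code): local fold, allgather of partials, fold of
    // partials. Compile with mpic++ and run with mpirun.
    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    static int localFold(const std::vector<int>& part) {
        int acc = 0;
        for (int x : part) acc += x;   // fold function: addition
        return acc;
    }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, numProcs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

        std::vector<int> localPartition(4, rank + 1);  // dummy local data
        int localResult = localFold(localPartition);

        std::vector<int> localResults(numProcs);
        MPI_Allgather(&localResult, 1, MPI_INT,
                      localResults.data(), 1, MPI_INT, MPI_COMM_WORLD);

        int result = localFold(localResults);          // fold the partials
        if (rank == 0) std::printf("fold result: %d\n", result);
        MPI_Finalize();
        return 0;
    }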

Slide 8/22: The User Function (C++)

- Pass the user function as a skeleton argument
- But: (host) function pointers cannot be used in device code
- Use functors instead:

    template <typename IN, typename OUT>
    class MapFunctor : public FunctorBase
    {
    public:
        // To be implemented by the user.
        MSL_UFCT virtual OUT operator()(IN value) const = 0;

        virtual ~MapFunctor();
    };
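For illustration, a user-defined functor against this interface might look as follows. This is a hypothetical example, not slide material; in particular, the guess that MSL_UFCT marks the operator as callable from device code (e.g. __host__ __device__ under CUDA) is an assumption.

    // Hypothetical user functor for the MapFunctor interface above:
    // squares each element. MSL_UFCT presumably expands to something
    // like __host__ __device__ under CUDA (assumption, not shown on
    // the slide).
    template <typename T>
    class Square : public MapFunctor<T, T>
    {
    public:
        MSL_UFCT T operator()(T value) const override {
            return value * value;
        }
    };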

Slide 9/22: The User Function (Java)

- No function pointers in Java
- Analogous to C++: use functors

    public abstract class MapKernel extends Kernel {
        protected int[] in, out;

        // To be implemented by the user.
        public abstract int mapFunction(int value);

        public void run() {
            int gid = getGlobalId();
            out[gid] = mapFunction(in[gid]);
        }

        // Called by the skeleton implementation.
        public void init(DIArray in, DIArray out) {
            this.in = in.getLocalPartition();
            this.out = out.getLocalPartition();
        }
    }

Slide 10/22: Additional Arguments (C++)

- The arguments to the user function are determined by the skeleton
- Add additional arguments as data members
- But: this works for primitives and PODs, not for classes with pointer data members
- Pointer data + multi-GPU: one pointer is needed for each GPU
- Solution: observer pattern

[Diagram: FunctorBase, with notify() and addArgument(...), is specialized by MapFunctor with operator()(...); FunctorBase notifies registered Argument objects of type ArgumentType via update()]
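A minimal sketch of that observer pattern follows; only FunctorBase, addArgument, notify, and update appear on the slide, so everything else here is an illustrative assumption. The idea: before a skeleton executes the functor, it calls notify(), and each registered argument refreshes its state, e.g. sets the device pointer for the GPU about to be used.

    // Minimal observer-pattern sketch for additional arguments (names
    // beyond FunctorBase/addArgument/notify/update are assumptions).
    #include <vector>

    class ArgumentBase {
    public:
        virtual ~ArgumentBase() = default;
        // E.g. upload data and select the device pointer for the
        // GPU that is about to execute the functor.
        virtual void update() = 0;
    };

    class FunctorBase {
        std::vector<ArgumentBase*> arguments;
    public:
        void addArgument(ArgumentBase* arg) { arguments.push_back(arg); }
        // Called by the skeleton implementation before each execution.
        void notify() {
            for (ArgumentBase* arg : arguments) arg->update();
        }
    };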

Slide 11/22: Additional Arguments (Java)

- Add additional arguments as data members
- Aparapi handles memory allocation and data transfer to GPU memory
- Limitation: restricted to (arrays of) primitive types

    public class AddN extends MapKernel {
        protected int n;

        public AddN(int n) {
            this.n = n;
        }

        public int mapFunction(int value) {
            return value + n;
        }
    }

    DIArray A = new DIArray(...);
    A.map(new AddN(2));

Slide 12/22: Benchmarks

4 benchmark applications: matrix multiplication, N-Body, shortest paths, ray tracing

2 test systems:
1. GPU cluster: 2 Xeon E5-2450 CPUs (16 cores) + 2 K20x GPUs per node
2. Xeon Phi system with 8 Xeon Phi 5110P coprocessors

6 configurations:
- 2 CPU: C++ CPU, Java CPU
- 3 GPU: C++ GPU, C++ multi-GPU, Java GPU
- 1 Xeon Phi: C++ Xeon Phi

Slide 13/22: Case Study: Matrix Multiplication

- Cannon's algorithm for square matrix multiplication
- Checkerboard block decomposition (2D torus)

[Figure: submatrix blocks of A, B, and C laid out on a 2x2 process grid p(0,0)..p(1,1), each process holding an mLocal x mLocal block; (a) initial shifting of A and B; (b) submatrix multiplication followed by stepwise shifting]

Slide 14/22: Matrix Multiplication

Main algorithm:

    template <typename T>
    DMatrix<T>& matmult(DMatrix<T>& A, DMatrix<T>& B, DMatrix<T>* C) {
        // Initial shifting.
        A.rotateRows(&negate);
        B.rotateCols(&negate);

        for (int i = 0; i < A.getBlocksInRow(); i++) {
            DotProduct<T> dp(A, B);
            // Submatrix multiplication.
            C->mapIndexInPlace(dp);

            // Stepwise shifting.
            A.rotateRows(-1);
            B.rotateCols(-1);
        }
        return *C;
    }

Slide 15/22: Matrix Multiplication

Map functor (C++):

    template <typename T>
    struct DotProduct : public MapIndexFunctor<T, T> {
        LMatrix<T> A, B;

        DotProduct(DMatrix<T>& A_, DMatrix<T>& B_)
            : A(A_), B(B_)
        {
        }

        MSL_UFCT T operator()(int row, int col, T Cij) const
        {
            T sum = Cij;
            for (int k = 0; k < this->mLocal; k++) {
                sum += A[row][k] * B[k][col];
            }
            return sum;
        }
    };

Slide 16/22: Matrix Multiplication

Map functor (Java):

    class DotProduct extends MapIndexInPlaceKernel {
        protected float[] A, B;

        public DotProduct(DFMatrix A, DFMatrix B) {
            super();
            this.A = A.getLocalPartition();
            this.B = B.getLocalPartition();
        }

        public float mapIndexFunction(int row, int col, float Cij) {
            float sum = Cij;
            for (int k = 0; k < mLocal; k++) {
                sum += A[row * mLocal + k] * B[k * mLocal + col];
            }
            return sum;
        }
    }

Slide 17/22: Results: Matrix Multiplication

[Figure: run time in seconds (log scale) over 1, 4, and 16 nodes for C++ CPU, Java CPU, C++ GPU, C++ multi-GPU, Java GPU, and C++ Xeon Phi]

- C++ (multi-)GPU performs best (30-160x speedup vs. CPU)
- Java GPU and C++ Xeon Phi are on a similar level
- C++ CPU and Java CPU are on a similar level
- Superlinear speedups due to cache effects

Slide 18/22: Results: N-Body

[Figure: run time in seconds (log scale) over 1, 2, 4, 8, and 16 nodes for the same six configurations]

- C++ (multi-)GPU performs best (10-13x speedup vs. CPU)
- C++ GPU delivers better scalability than Java GPU at higher node counts
- The CPU versions are on the same level
- C++ Xeon Phi performance lies between CPU and GPU performance

Slide 19/22: Results: Shortest Paths

[Figure: run time in seconds (log scale) over 1, 4, and 16 nodes for the same six configurations]

- C++ (multi-)GPU performs best (20-160x speedup vs. CPU)
- Java GPU and C++ Xeon Phi are on a similar level
- C++ CPU and Java CPU are on a similar level

Slide 20/22: Results: Ray Tracing

[Figure: run time in seconds (log scale) over 1, 2, 4, 8, and 16 nodes for C++ CPU, C++ GPU, C++ multi-GPU, and C++ Xeon Phi]

- C++ (multi-)GPU performs best (5-10x speedup vs. CPU)
- Xeon Phi performance is only close to CPU performance: no auto-vectorization

Slide 21/22: Conclusion & Future Work

- The C++ and Java versions offer comparable performance
- Still, restrictions remain on the Java side: (arrays of) primitive types only, and no multi-GPU support (yet); both will be addressed in future work
- The Xeon Phi can be considered in-between CPUs and GPUs, both performance-wise and in terms of programmability

Slide 22/22: Thank you for your attention! Questions?