DATA PARALLEL PROGRAMMING IN HASKELL


DATA PARALLEL PROGRAMMING IN HASKELL: An Overview
Manuel M T Chakravarty, University of New South Wales
Includes joint work with Gabriele Keller, Sean Lee, Roman Leshchinskiy, Ben Lippmeier, Trevor McDonell, and Simon Peyton Jones

Ubiquitous Parallelism

[Slide images: a venerable supercomputer, a multicore CPU, and a multicore GPU. Sources: http://en.wikipedia.org/wiki/file:cray_y-mp_gsfc.jpg, http://en.wikipedia.org/wiki/file:intel_cpu_core_i7_2600k_sandy_bridge_top.jpg, http://en.wikipedia.org/wiki/file:geforce_gt_545_ddr3.jpg]
It's parallelism... but not as we know it!

Our goals
Exploit parallelism of commodity hardware easily: performance is important, but productivity is more important.
Semi-automatic parallelism: the programmer supplies a parallel algorithm, but there is no explicit concurrency (no concurrency control, no races, no deadlocks).

Graphics [Haskell 2012a] Ray tracing

Computer Vision [Haskell 2011] Edge detection

[Chart: Canny on 2x quad-core 2.0 GHz Intel Harpertown; runtime (ms) versus threads (1-8, on 8 PEs) for 512x512, 768x768, and 1024x1024 images; series: Safe Unrolled Stencil and single-threaded OpenCV.] Runtimes for Canny edge detection (100 iterations). OpenCV uses SIMD instructions, but only one thread.

[Chart: Canny (512x512); time (ms) versus number of CPU threads (1-8), comparing the CPU (as before) against a GPU (NVIDIA Tesla T10).] Canny edge detection: CPU versus GPU parallelism. The GPU version performs its post-processing on the CPU.

Physical Simulation [Haskell 2012a] Fluid flow

[Chart: relative runtime versus matrix width (64 to 1024) for C Jacobi, C Gauss-Seidel, and Repa with -N1, -N2, and -N8.] Runtimes for Jos Stam's Fluid Flow Solver. We can beat C!

[Chart: Fluid flow (1024x1024); time (ms) versus number of CPU threads (1, 2, 4, 8), CPU versus GPU.] Jos Stam's Fluid Flow Solver: CPU versus GPU performance. The GPU beats the CPU (this includes all transfer times).

Medical Imaging [Haskell 2012a] Interpolation of a slice through a 256 x 256 x 109 16-bit data volume of an MRI image

Functional Parallelism

Our ingredients
Control effects, not concurrency (Haskell is pure by default)
Types guide data representation and behaviour (declarative control of operational pragmatics)
Bulk-parallel aggregate operations (data parallelism)

Purity & Types

Purity and parallelism

processList :: [Int] -> ([Int], Int)          -- argument type -> result type
processList list = (sort list, maximum list)  -- function, argument, function body

Purity: the result of a function depends only on its arguments.
Parallelism: execution order is constrained only by explicit data dependencies.
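Because processList is pure, the runtime is free to evaluate its two components in any order, including in parallel. A minimal sketch of exercising that freedom explicitly, assuming the parallel package's Control.Parallel (this example is mine, not from the slides):

import Control.Parallel (par, pseq)
import Data.List (sort)

-- 'par' sparks evaluation of 'sorted' (to weak head normal form) on another
-- core, while 'pseq' forces 'largest' before the pair is returned.
processList :: [Int] -> ([Int], Int)
processList list = sorted `par` (largest `pseq` (sorted, largest))
  where
    sorted  = sort list
    largest = maximum list

In practice one would reach for Control.Parallel.Strategies to force more than weak head normal form, but the point stands: purity is what makes this reordering safe.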

By default pure := types track purity

Pure (no effects), e.g. Int:
  processList :: [Int] -> ([Int], Int)
  processList list = (sort list, maximum list)

Impure (may have effects), e.g. IO Int:
  readFile :: FilePath -> IO String
  copyFile fn1 fn2 = do contents <- readFile fn1
                        writeFile fn2 contents

Types guide execution

Datatypes for parallelism
For bulk-parallel, aggregate operations, we introduce a new datatype:
  Array r sh e
where r is the representation, sh the shape, and e the element type.

Array r sh e
Representation: determined by a type index, e.g.
  D  delayed array (represented as a function)
  U  unboxed array (a manifest, C-style array)
Shape: the dimensionality of the array
  DIM0, DIM1, DIM2, and so on
Element type: the type of the values stored in the array
  primitive types (Int, Float, etc.) and tuples
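A small sketch (not from the slides) of how these type indices look in Repa 3 code, assuming the Data.Array.Repa API (fromListUnboxed, R.map, computeS):

import Data.Array.Repa as R

-- A manifest, unboxed 2-D array: representation index U.
grid :: Array U DIM2 Int
grid = fromListUnboxed (Z :. 3 :. 4) [1 .. 12]

-- A delayed array: representation index D, i.e. a shape plus a function,
-- with no storage until it is computed.
doubled :: Array D DIM2 Int
doubled = R.map (* 2) grid

-- Force the delayed array back to a manifest one (computeS is sequential;
-- computeP is the parallel counterpart and lives in a monad).
result :: Array U DIM2 Int
result = computeS doubled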

zipWith :: (Shape sh, Source r1 a, Source r2 b)
        => (a -> b -> c)          -- pure function to be used in parallel
        -> Array r1 sh a          -- processed arrays
        -> Array r2 sh b
        -> Array D sh c           -- delayed result

type PC5 = P C (P (S D) (P (S D) (P (S D) (P (S D) X))))

mapStencil2 :: Source r a
            => Boundary a
            -> Stencil DIM2 a
            -> Array r DIM2 a
            -> Array PC5 DIM2 a   -- partitioned result
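To illustrate mapStencil2, here is a sketch of a 3x3 box blur; it assumes Repa 3's stencil support (the stencil2 quasiquoter and the BoundClamp boundary treatment) and is not taken from the slides:

{-# LANGUAGE QuasiQuotes #-}
import Data.Array.Repa              as R
import Data.Array.Repa.Stencil
import Data.Array.Repa.Stencil.Dim2

-- Sum each 3x3 neighbourhood in parallel, then divide to get the mean.
blur :: Monad m => Array U DIM2 Double -> m (Array U DIM2 Double)
blur img
 = do summed <- computeP
              $ mapStencil2 BoundClamp
                  [stencil2| 1 1 1
                             1 1 1
                             1 1 1 |]
                  img
      computeP (R.map (/ 9) (summed :: Array U DIM2 Double))

The partitioned PC5 result lets the borders, where the boundary condition applies, be filled separately from the fast inner region.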

A simple example: dot product

type Vector r e = Array r DIM1 e

dotp :: (Num e, Source r1 e, Source r2 e)    -- elements: any type of numbers,
     => Vector r1 e -> Vector r2 e -> e      -- suitable to be read from an array
dotp v w = sumAll (zipWith (*) v w)
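A sketch (mine, not from the slides) of running the dot product on concrete arrays, specialised to Double; sumAllS is, I believe, the Repa 3 name for the sequential reduction written sumAll above (sumAllP is its parallel, monadic counterpart):

import Data.Array.Repa as R

dotp :: Array U DIM1 Double -> Array U DIM1 Double -> Double
dotp v w = sumAllS (R.zipWith (*) v w)

main :: IO ()
main = do
  let v = fromListUnboxed (Z :. 5) [1, 2, 3, 4, 5]
      w = fromListUnboxed (Z :. 5) [5, 4, 3, 2, 1]
  -- zipWith builds a delayed array; sumAllS consumes it without
  -- materialising the intermediate products.
  print (dotp v w)   -- 35.0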

The abstraction ladder: Concurrency, then Parallelism, then Data Parallelism.
Parallelism is safe for pure functions (i.e., functions without external effects).
Data parallelism: collective operations have a single conceptual thread of control.

Types & Embedded Computations

Coarse-grained versus fine-grained parallelism: 12 threads on a Core i7 970 CPU versus 24,576 threads on an NVIDIA GF100 GPU.
GPUs require careful program tuning:
SIMD: groups of threads execute in lock step (warps), so be careful about control flow.
Latency hiding: optimised for regular memory access patterns, so optimise memory access.

Code generation for embedded code [DAMP 2011]
Embedded DSL: restricted control flow (limited control structures), first-order GPU code.
Generative approach based on combinator templates (hand-tuned access patterns).

Embedded GPU computations: dot product

dotp :: Vector Float -> Vector Float -> Acc (Scalar Float)
dotp xs ys = let xs' = use xs
                 ys' = use ys
             in
             fold (+) 0 (zipWith (*) xs' ys')

Vector Float is an ordinary Haskell array, while Acc (Scalar Float) is an embedded array, i.e. a description of an array computation. use lifts Haskell arrays into the EDSL and may trigger a host-to-device transfer; fold and zipWith here are embedded array computations.
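A sketch of actually running this embedded computation (my example, not from the slides): the reference interpreter that ships with the accelerate package evaluates the Acc description directly, while a GPU backend such as accelerate-cuda compiles it for the device.

import Data.Array.Accelerate             as A
import Data.Array.Accelerate.Interpreter as I
-- For a GPU: import the CUDA backend and use its run instead.

dotp :: Vector Float -> Vector Float -> Acc (Scalar Float)
dotp xs ys = let xs' = use xs
                 ys' = use ys
             in
             fold (+) 0 (A.zipWith (*) xs' ys')

main :: IO ()
main = do
  let xs = A.fromList (Z :. 5) [1, 2, 3, 4, 5] :: Vector Float
      ys = A.fromList (Z :. 5) [5, 4, 3, 2, 1] :: Vector Float
  print (I.run (dotp xs ys))   -- a Scalar holding 35.0

A.zipWith is the embedded zipWith, qualified here only to avoid clashing with the Prelude; dotp itself only builds a description of the computation, and nothing runs until run is called.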

Nested Data Parallelism

Modularity
The standard (Fortran, CUDA, etc.) is flat, regular parallelism. The same holds for our libraries for functional data parallelism on multicore CPUs (Repa) and GPUs (Accelerate). But we want more.

smvm :: SparseMatrix -> Vector -> Vector
smvm sm v = [: sumP (dotp sv v) | sv <- sm :]

[: ... | ... :] is a parallel array comprehension and sumP a parallel reduction/fold, so this is nested parallelism. Moreover, dotp is defined in a library: is it internally parallel? Does it nest parallelism itself? Regular, flat data parallelism is not sufficient!
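For context, a sketch of the sparse-matrix types and the dotp this slide presupposes, written in Data Parallel Haskell's parallel-array notation ([: ... :] is the parallel array type, !: parallel indexing, sumP the parallel sum). It is modelled on the standard DPH sparse-matrix example; the pragmas and imports are what I believe a vectorised DPH module requires and should be treated as approximate:

{-# LANGUAGE ParallelArrays #-}
{-# OPTIONS_GHC -fvectorise #-}
module SMVM (smvm) where

import qualified Prelude
import Prelude (Int)                        -- Int appears only as an index type
import Data.Array.Parallel
import Data.Array.Parallel.Prelude.Double   -- Double, (*), sumP for vectorised code

type Vector       = [:Double:]
type SparseVector = [:(Int, Double):]       -- (column index, value) pairs
type SparseMatrix = [:SparseVector:]        -- one sparse row per matrix row

-- Element-wise products of a sparse row with a dense vector; smvm sums
-- them with sumP, and running dotp inside the outer comprehension is
-- exactly the nested parallelism discussed above.
dotp :: SparseVector -> Vector -> [:Double:]
dotp sv v = [: x * (v !: i) | (i, x) <- sv :]

smvm :: SparseMatrix -> Vector -> Vector
smvm sm v = [: sumP (dotp sv v) | sv <- sm :]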

Nested parallelism is modular and supports irregular, nested data structures (sparse structures, tree structures), hierarchical decomposition, and nesting to arbitrary, dynamic depth (divide & conquer). It takes a lot of compiler work and is still very experimental! [FSTTCS 2008, ICFP 2012, Haskell 2012b]

The design space has two axes: flat versus nested (nested is more expressive, but harder to implement) and embedded versus full language (embedded is more restrictive, but suits special hardware).
Accelerate: flat, embedded. Repa: flat, full. Nested Data Parallel Haskell: nested, full.

Types are at the centre of everything we are doing:
Types separate pure from effectful code.
Types guide operational behaviour (data representation, use of parallelism, etc.).
Types identify restricted code for specialised hardware, such as GPUs.
Types guide parallelising program transformations.

Summary
Core ingredients: control purity, not concurrency; types guide representations and behaviours; bulk-parallel operations.
Get it: the latest Glasgow Haskell Compiler (GHC) and the Repa, Accelerate & DPH packages from Hackage (the Haskell library repository).
Blog: http://justtesting.org/ Twitter: @TacticalGrace

Thank you! This research has in part been funded by the Australian Research Council and by Microsoft Corporation.

[EuroPar 2001] Nepal -- Nested Data-Parallelism in Haskell. Chakravarty, Keller, Lechtchinsky & Pfannenstiel. In "Euro-Par 2001: Parallel Processing, 7th Intl. Euro-Par Conference", 2001.
[FSTTCS 2008] Harnessing the Multicores: Nested Data Parallelism in Haskell. Peyton Jones, Leshchinskiy, Keller & Chakravarty. In "IARCS Annual Conf. on Foundations of Software Technology & Theoretical Computer Science", 2008.
[ICFP 2010] Regular, Shape-Polymorphic, Parallel Arrays in Haskell. Keller, Chakravarty, Leshchinskiy, Peyton Jones & Lippmeier. In Proceedings of "ICFP 2010: The 15th ACM SIGPLAN Intl. Conf. on Functional Programming", 2010.
[DAMP 2011] Accelerating Haskell Array Codes with Multicore GPUs. Chakravarty, Keller, Lee, McDonell & Grover. In "Declarative Aspects of Multicore Programming", 2011.

[Haskell 2011] Efficient Parallel Stencil Convolution in Haskell. Lippmeier & Keller. In Proceedings of "ACM SIGPLAN Haskell Symposium 2011", ACM Press, 2011.
[ICFP 2012] Work Efficient Higher-Order Vectorisation. Lippmeier, Chakravarty, Keller, Leshchinskiy & Peyton Jones. In Proceedings of "ICFP 2012: The 17th ACM SIGPLAN Intl. Conf. on Functional Programming", 2012. Forthcoming.
[Haskell 2012a] Guiding Parallel Array Fusion with Indexed Types. Lippmeier, Chakravarty, Keller & Peyton Jones. In Proceedings of "ACM SIGPLAN Haskell Symposium 2012", ACM Press, 2012. Forthcoming.
[Haskell 2012b] Vectorisation Avoidance. Keller, Chakravarty, Lippmeier, Leshchinskiy & Peyton Jones. In Proceedings of "ACM SIGPLAN Haskell Symposium 2012", ACM Press, 2012. Forthcoming.