GPGPU Programming in Haskell with Accelerate
1 GPGPU Programming in Haskell with Accelerate
Trevor L. McDonell, University of New South Wales
2 Preliminaries
Get it from Hackage:
- Debugging support enables some extra options to see what is going on
cabal install accelerate -fdebug
cabal install accelerate-cuda -fdebug
Need to import both the base library as well as a specific backend
- Import qualified to avoid name clashes with the Prelude
import Prelude as P
import Data.Array.Accelerate as A
import Data.Array.Accelerate.CUDA
-- or import Data.Array.Accelerate.Interpreter
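Putting the imports and a backend together, a complete program might look like the following sketch. It assumes the accelerate and accelerate-cuda packages are installed, and uses the dot-product example that appears later in the talk:

```haskell
-- Minimal Accelerate program: a sketch, assuming the accelerate
-- and accelerate-cuda packages are installed.
import Prelude                    as P
import Data.Array.Accelerate      as A
import Data.Array.Accelerate.CUDA ( run )   -- or .Interpreter

-- Dot product expressed as collective operations in Acc
dotp :: Vector Float -> Vector Float -> Scalar Float
dotp xs ys = run $ A.fold (+) 0
                 $ A.zipWith (*) (use xs) (use ys)

main :: IO ()
main = print (dotp (fromList (Z:.3) [1,2,3])
                   (fromList (Z:.3) [4,5,6]))
```

Swapping the CUDA import for the Interpreter backend runs the same computation on the CPU, which is useful when no GPU is available.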
3 Accelerate
Accelerate is a Domain-Specific Language for GPU programming
The pipeline:
- Haskell/Accelerate program
- Transform Accelerate program into CUDA code
- Compile with NVIDIA's compiler & load onto the GPU
- Copy result back to Haskell
4 Accelerate
Accelerate computations take place on arrays
- Parallelism is introduced in the form of collective operations over arrays: arrays in, Accelerate computation, arrays out
Arrays have two type parameters
data Array sh e
- The shape of the array, or dimensionality
- The element type of the array: Int, Float, etc.
5 Accelerate
Accelerate is split into two worlds: Acc and Exp
- Acc represents collective operations over instances of Arrays
- Exp is a scalar computation on things of type Elt
Collective operations in Acc comprise many scalar operations in Exp, executed in parallel over Arrays
- Scalar operations cannot contain collective operations
This stratification excludes nested data parallelism
6 Accelerate
To execute an Accelerate computation (on the GPU):
run :: Arrays a => Acc a -> a
- run comes from whichever backend we have chosen (CUDA)
To get arrays into Acc land:
use :: Arrays a => a -> Acc a
- This may involve copying data to the GPU
Using Accelerate is about everything in between: using combinators of type Acc to build an AST that will be turned into CUDA code and executed by run
9 Arrays
data Array sh e
Create an array from a list:
fromList :: (Shape sh, Elt e) => sh -> [e] -> Array sh e
- Generates a multidimensional array by consuming elements from the list and adding them to the array in row-major order
Example:
ghci> fromList (Z:.10) [1..10]
- Defaulting does not apply, because Shape is not a standard class:
<interactive>:3:1:
    No instance for (Shape (Z :. head0)) arising from a use of `fromList'
    The type variable `head0' is ambiguous
    Possible fix: add a type signature that fixes these type variable(s)
    Note: there is a potential instance available:
      instance Shape sh => Shape (sh :. Int)
        -- Defined in `Data.Array.Accelerate.Array.Sugar'
<interactive>:3:14:
    No instance for (Num head0) arising from the literal `10'
    The type variable `head0' is ambiguous
    Possible fix: add a type signature that fixes these type variable(s)
    Note: there are several potential instances:
      instance Num Double -- Defined in `GHC.Float'
      instance Num Float -- Defined in `GHC.Float'
      instance Integral a => Num (GHC.Real.Ratio a) -- Defined in `GHC.Real'
      ...plus 12 others
Number 1 tip: Add type signatures
13 Arrays
data Array sh e
Create an array from a list:
> fromList (Z:.10) [1..10] :: Vector Float
Array (Z :. 10) [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0]
Multidimensional arrays are similar:
- Elements are filled along the right-most dimension first
> fromList (Z:.3:.5) [1..] :: Array DIM2 Int
Array (Z :. 3 :. 5) [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
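Row-major order means element (r, c) of a Z:.rows:.cols array sits at linear position r*cols + c. A small plain-Haskell sketch of this index arithmetic (the helper names here are illustrative, not Accelerate API):

```haskell
-- Row-major layout: a (Z :. rows :. cols) array laid out as a flat list.
rowMajorIndex :: Int -> Int -> Int -> Int
rowMajorIndex cols r c = r * cols + c

-- Recover the element at (r, c) from the flat element list
elemAt :: Int -> [a] -> Int -> Int -> a
elemAt cols xs r c = xs !! rowMajorIndex cols r c

main :: IO ()
main = do
  let mat = [1..15] :: [Int]   -- fromList (Z:.3:.5) [1..]
  print (elemAt 5 mat 1 2)     -- row 1, column 2: prints 8
```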
15 Accelerate by example Password recovery
16 MD5 Algorithm
Aim:
- Implement one round of MD5: unsalted, single 512-bit block
- Apply to an array of words
- Compare hashes to some unknown hash - i.e. standard dictionary attack
17 MD5 Algorithm
The algorithm operates on a 4-word state: A, B, C, and D
There are 4 × 16 rounds, using the non-linear functions F, G, H, and I
- Mi is a word from the input message
- Ki is a constant
- <<<s is left rotate, by some constant ri
Each round operates on the 512-bit message block, modifying the state
18 MD5 Algorithm in Accelerate
Accelerate is a meta-programming language
- Use regular Haskell to generate the expression for each step of the round
- Produces an unrolled loop
type ABCD = (Exp Word32, Exp Word32, ... )
md5round :: Acc (Vector Word32) -> ABCD
md5round msg = P.foldl round (a0,b0,c0,d0) [0..63]
  where
    round :: ABCD -> Int -> ABCD
    round (a,b,c,d) i = ...
19 MD5 Algorithm in Accelerate
The constants ki and ri can be embedded directly
- The simple list lookup would be death in standard Haskell
- Generating the expression need not be performant, only executing it
k :: Int -> Exp Word32
k i = constant (ks P.!! i)
  where
    ks = [ 0xd76aa478, 0xe8c7b756, 0x242070db, 0xc1bdceee, ...
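The meta-programming point can be seen with a toy expression type standing in for Exp (my own sketch, not the Accelerate implementation): the `!!` lookup runs while the expression is being built, so the generated code contains only literals.

```haskell
import Data.Word (Word32)

-- A toy expression type standing in for Accelerate's Exp
data Expr = Lit Word32 | Add Expr Expr deriving (Show, Eq)

-- Analogue of: k i = constant (ks P.!! i)
k :: Int -> Expr
k i = Lit (ks !! i)          -- list lookup happens at generation time
  where ks = [0xd76aa478, 0xe8c7b756, 0x242070db, 0xc1bdceee]

main :: IO ()
main = print (Add (k 0) (k 2))   -- an expression of plain literals
```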
20 MD5 Algorithm in Accelerate
The message M is stored as an array, so we need array indexing
- Be wary, arbitrary array indexing can kill performance...
(!) :: (Shape ix, Elt e) => Acc (Array ix e) -> Exp ix -> Exp e
Get the right word of the message for the given round:
m :: Int -> Exp Word32
m i | i < 16    = msg ! index1 (constant i)
    | i < 32    = msg ! index1 (constant ((5*i + 1) `rem` 16))
    ...
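The full message-word schedule for all four rounds is fixed by the MD5 specification (RFC 1321), and since it is computed at expression-generation time it can be written as an ordinary Haskell function:

```haskell
-- MD5 message-word schedule (per RFC 1321): which of the 16
-- message words step i (0..63) reads.
schedule :: Int -> Int
schedule i
  | i < 16    = i
  | i < 32    = (5*i + 1) `rem` 16
  | i < 48    = (3*i + 5) `rem` 16
  | otherwise = (7*i)     `rem` 16

main :: IO ()
main = print (map schedule [0..63])
```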
22 MD5 Algorithm in Accelerate
Finally, the non-linear functions F, G, H, and I
round :: ABCD -> Int -> ABCD
round (a,b,c,d) i
  | i < 16    = shfl (f b c d)
  ...
  where
    shfl x  = (d, b + ((a + x + k i + m i) `rotateL` r i), b, c)
    f x y z = (x .&. y) .|. (complement x .&. z)
    ...
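All four non-linear functions are simple bit-twiddling on Word32 and work the same way in plain Haskell as inside Exp; per RFC 1321 they are:

```haskell
import Data.Bits (complement, xor, (.&.), (.|.))
import Data.Word (Word32)

-- The four MD5 non-linear functions (per RFC 1321)
f, g, h, i :: Word32 -> Word32 -> Word32 -> Word32
f x y z = (x .&. y) .|. (complement x .&. z)
g x y z = (x .&. z) .|. (y .&. complement z)
h x y z = x `xor` y `xor` z
i x y z = y `xor` (x .|. complement z)

main :: IO ()
main = print (f 0xffffffff 0x12345678 0, h 1 2 4)
```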
23 MD5 Algorithm in Accelerate
MD5 applied to a single 16-word vector: no parallelism here
Lift this operation to an array of n words
- Process many words in parallel to compare against the unknown
- Need to use generate, the most general form of array construction. Equivalently, the most easily misused
generate :: (Shape sh, Elt e) => Exp sh -> (Exp sh -> Exp e) -> Acc (Array sh e)
24 Problem Solving Accelerate
Hints for when things don't work as you expect
25 MD5 Algorithm in Accelerate
As always, data layout is important
- Accelerate arrays are stored in row-major order
- For a CPU, this means the input word would be read on a cache line:
c o r r e c t
h o r s e
b a t t e r y
s t a p l e
26 MD5 Algorithm in Accelerate
However in a parallel context many threads must work together
- generate uses one thread per element
- For best performance CUDA threads need to index adjacent memory
- This only works if the dictionary is stored column-major instead:
c h b s
o o a t
r r t a
r s t p
e e e l
c   r e
t   y
Pay attention to data layout if you use indexing operators
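The reason adjacent indexing matters falls out of the offset arithmetic. A plain-Haskell sketch (illustrative helpers, not Accelerate API) of the offsets consecutive threads touch at a given step:

```haskell
-- At step j, thread t reads offset
--   row-major:    t * len + j   (consecutive threads are 'len' apart)
--   column-major: j * n   + t   (consecutive threads are adjacent)
rowMajor, colMajor :: Int -> Int -> Int -> Int -> Int
rowMajor len _ t j = t * len + j
colMajor _   n t j = j * n   + t

main :: IO ()
main = do
  -- offsets touched by threads 0..3 at step j = 2, with len = 16, n = 4
  print [ rowMajor 16 4 t 2 | t <- [0..3] ]   -- strided: [2,18,34,50]
  print [ colMajor 16 4 t 2 | t <- [0..3] ]   -- adjacent: [8,9,10,11]
```

With the column-major layout, one memory transaction can serve all four threads; in the row-major layout each thread hits a different cache line.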
31 Nested Data-Parallelism
Accelerate is a language for flat data-parallel computations
- The allowed array element types do not include Array: simple types only
- Lots of types to statically exclude nested parallelism: Acc vs. Exp
- But it doesn't always succeed, and the error is uninformative:
*** Exception:
*** Internal error in package accelerate ***
*** Please submit a bug report at
(convertSharingExp): inconsistent shared 'Exp' tree with stable name 54; env = [56]
*** Exception: Cyclic definition of a value of type 'Exp' (sa = 45)
35 Nested Data-Parallelism
Matrix-vector multiplication
- Define vector dot product:
dotp :: Acc (Vector e) -> Acc (Vector e) -> Acc (Scalar e)
dotp u v = fold (+) 0 ( zipWith (*) u v )
36 Nested Data-Parallelism
Matrix-vector multiplication
- Want to apply dotp to every row of the matrix
- Extract a row from the matrix:
takeRow :: Exp Int -> Acc (Array DIM2 e) -> Acc (Vector e)
takeRow n mat =
  let Z :. _ :. cols = unlift (shape mat) :: Z :. Exp Int :. Exp Int
  in  backpermute (index1 cols) (\ix -> index2 n (unindex1 ix)) mat
- At each element in the output array, where in the input do I read from?
- [un]index1 converts an Int into a (Z:.Int) 1D index (and vice versa)
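The backpermute idea, "for each output index, where do I read from in the input?", can be sketched at the list level, together with takeRow defined in terms of it (illustrative names, not the Accelerate API):

```haskell
-- backpermute, list-style: build an output of the given size by
-- reading the input at the position the index function names.
backpermuteL :: Int -> (Int -> Int) -> [a] -> [a]
backpermuteL size from xs = [ xs !! from i | i <- [0 .. size-1] ]

-- Extract row n from a row-major rows x cols matrix
takeRowL :: Int -> Int -> [a] -> [a]
takeRowL cols n = backpermuteL cols (\j -> n*cols + j)

main :: IO ()
main = print (takeRowL 5 1 [1..15 :: Int])   -- second row: [6,7,8,9,10]
```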
39 Nested Data-Parallelism
Matrix-vector multiplication
- Apply dot product to each row of the matrix:
mvm :: Acc (Array DIM2 e) -> Acc (Vector e) -> Acc (Vector e)
mvm mat vec =
  let Z :. rows :. _ = unlift (shape mat) :: Z :. Exp Int :. Exp Int
  in  generate (index1 rows)
               (\ix -> the (vec `dotp` takeRow (unindex1 ix) mat))
- the extracts the element from a singleton array:
the :: Acc (Scalar e) -> Exp e
- Problem: this indexes an array in a way that depends on the index given by generate
43 Nested Data-Parallelism
The problem is attempting to execute many separate dot products in parallel
dotp :: Acc (Vector e) -> Acc (Vector e) -> Acc (Scalar e)
dotp u v = fold (+) 0 ( zipWith (*) u v )
We need a way to execute this step as a single collective operation
44 Reductions
Folding (+) over a vector produces a sum
- The result is a one-element array (scalar). Why?
> let xs = fromList (Z:.10) [1..] :: Vector Int
> run $ fold (+) 0 (use xs)
Array (Z) [55]
Fold has an interesting type:
fold :: (Shape sh, Elt a)
     => (Exp a -> Exp a -> Exp a)
     -> Exp a
     -> Acc (Array (sh:.Int) a)   -- input array
     -> Acc (Array sh a)          -- outer dimension removed
48 Reductions
Fold occurs over the outer dimension of the array
> let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
> run $ fold (+) 0 (use mat)
Array (Z :. 3) [15,40,65]
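Removing the outer dimension means a 3×5 array folds down to 3 sums, one per row. The same reduction over a row-major flat list, in plain Haskell (names my own):

```haskell
import Data.List (unfoldr)

-- Split a row-major flat list into rows of the given width, then
-- reduce each row: the list analogue of Accelerate's fold removing
-- the innermost (right-most) dimension.
foldInner :: (a -> a -> a) -> a -> Int -> [a] -> [a]
foldInner f z cols = map (foldl f z) . chunks cols
  where
    chunks n = unfoldr (\xs -> if null xs then Nothing
                               else Just (splitAt n xs))

main :: IO ()
main = print (foldInner (+) 0 5 [1..15 :: Int])   -- [15,40,65]
```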
50 Matrix-Vector Multiplication
The trick is that we can use this to do all of our dot-products in parallel
54 Replicate
Multidimensional replicate introduces new magic
replicate :: (Slice slix, Elt e)
          => Exp slix
          -> Acc (Array (SliceShape slix) e)
          -> Acc (Array (FullShape slix) e)
Type hackery!
> let vec = fromList (Z:.5) [1..] :: Vector Float
> run $ replicate (constant (Z :. (3::Int) :. All)) (use vec)
Array (Z :. 3 :. 5) [1.0,2.0,3.0,4.0,5.0,1.0,2.0,3.0,4.0,...]
- All indicates the entirety of the existing dimension
- The (3::Int) gives the number of new rows
60 Matrix-Vector Multiplication
Looks very much like vector dot product:
mvm :: Acc (Array DIM2 e) -> Acc (Vector e) -> Acc (Vector e)
mvm mat vec =
  let Z :. rows :. _ = unlift (shape mat) :: Z :. Exp Int :. Exp Int
      vec'           = A.replicate (lift (Z :. rows :. All)) vec
  in  fold (+) 0 (A.zipWith (*) vec' mat)
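That the replicate/zipWith/fold formulation computes the same thing as one dot product per row can be checked at the list level (a sketch with plain lists, not Accelerate arrays):

```haskell
-- List-level sketch: replicating the vector across each row,
-- multiplying pointwise, and reducing each row is exactly a dot
-- product per row.
dotp :: Num e => [e] -> [e] -> e
dotp u v = sum (zipWith (*) u v)

mvm :: Num e => [[e]] -> [e] -> [e]
mvm mat vec = map (sum . zipWith (*) vec) mat   -- vec "replicated" per row

main :: IO ()
main = do
  let mat = [[1,2],[3,4],[5,6]] :: [[Int]]
      vec = [10,100]
  print (mvm mat vec)          -- [210,430,650]
  print (map (dotp vec) mat)   -- same result
```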
61 N-Body Simulation
Calculate all interactions between a vector of particles:
calcAccels :: Exp R -> Acc (Vector Body) -> Acc (Vector Accel)
calcAccels epsilon bodies =
  let n    = A.size bodies
      rows = A.replicate (lift $ Z :. n :. All) bodies
      cols = A.replicate (lift $ Z :. All :. n) bodies
  in
  A.fold (.+.) (vec 0) ( A.zipWith (accel epsilon) rows cols )
Turn nested computations into segmented or higher-dimensional operations
63 Accelerate
Recall that Accelerate computations take place on arrays: arrays in, Accelerate computation, arrays out
Accelerate evaluates the expression passed to run to generate a series of CUDA kernels
- Each piece of CUDA code must be compiled and loaded
- The goal is to make kernels that can be reused
- If we don't, the overhead of compilation can ruin performance
68 Embedded Scalars
Consider drop, which yields all but the first n elements of a vector:
drop :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e)
drop n arr =
  let n' = the (unit n)
  in  backpermute (ilift1 (subtract n') (shape arr)) (ilift1 (+ n')) arr
- unit lifts an expression into a singleton array:
unit :: Exp e -> Acc (Scalar e)
- the extracts the element of a singleton array:
the :: Acc (Scalar e) -> Exp e
72 Embedded Scalars
Check the expression Accelerate sees when it evaluates run:
> let vec = fromList (Z:.10) [1..] :: Vector Int
> drop 4 (use vec)
let a0 = use (Array (Z :. 10) [0,1,2,3,4,5,6,7,8,9]) in
let a1 = unit 4
in backpermute
     (let x0 = Z in x0 :. (indexHead (shape a0)) - (a1!x0))
     (\x0 -> let x1 = Z in x1 :. (indexHead x0) + (a1!x1))
     a0
- a1 corresponds to the array we created for n
- Critically, this is outside the call to backpermute
76 Embedded Scalars
Check the expression Accelerate sees when the scalar is used directly, without unit:
drop' :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e)
drop' n arr =
  backpermute (ilift1 (subtract n) (shape arr)) (ilift1 (+ n)) arr
> let vec = fromList (Z:.10) [1..] :: Vector Int
> drop' 4 (use vec)
let a0 = use (Array (Z :. 10) [0,1,2,3,4,5,6,7,8,9])
in backpermute
     (Z :. (indexHead (shape a0)))
     (\x0 -> Z :. 4 + (indexHead x0))
     a0
- The constant 4 is now baked into the expression
- This will defeat Accelerate's caching of compiled kernels
Make sure any arguments that change are passed as Arrays
88 Embedded Scalars
Inspect the expression directly, as done here
Alternatively: Accelerate has some debugging options
- Run the program with the -ddump-cc command line switch
- Make sure you have installed Accelerate with -fdebug
89 Executing Computations
Recall: to actually execute a computation we use run
run :: Arrays a => Acc a -> a
This entails quite a bit of work setting up the computation to run on the GPU
- And sometimes the computation never changes
90-91 Executing Computations Alternative: execute an array program of one argument

run1 :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> a -> b

What is the difference?
- This version can be partially applied with the (Acc a -> Acc b) argument
- Returns a new function (a -> b)
- Key new thing: behind the scenes, everything other than the final execution is already done; the AST is annotated with the object code required for execution
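The benefit of run1 can be mimicked in ordinary Haskell: a partial application captures the work done before the last argument arrives, so it is paid once and shared across calls. A minimal sketch, plain Haskell with no Accelerate; `compile` and its cost model are invented for illustration:

```haskell
import Data.List (foldl')

-- "Expensive compilation": build a result once, behind a partial
-- application, so only the cheap final step runs per input.
compile :: Int -> (Int -> Int)
compile n =
  let table = foldl' (+) 0 [1 .. n]   -- pretend this is costly setup
  in \x -> x + table                  -- cheap per-call work

main :: IO ()
main = do
  let go = compile 1000000   -- setup is associated with `go` once
  print (go 1)               -- forces the setup, then adds 1
  print (go 2)               -- reuses the already-forced setup
```

This mirrors `let edges = run1 canny in map edges images`: partially apply once, call many times.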
92-94 Executing Computations

canny :: Acc (Array DIM2 RGBA) -> Acc (Array DIM2 Float)
canny = ...

No magic:
edges :: Array DIM2 RGBA -> Array DIM2 Float
edges img = run ( canny (use img) )

Magic:
edges :: Array DIM2 RGBA -> Array DIM2 Float
edges img = run1 canny img
95 Executing Computations

canny :: Acc (Array DIM2 RGBA) -> Acc (Array DIM2 Float)
canny = ...

Criterion benchmarks:

benchmarking canny/run
collecting 100 samples, 1 iterations each, in estimated … s
mean: … ms, lb … ms, ub … ms, ci …
std dev: … ms, lb … ms, ub … ms, ci …
variance introduced by outliers: …% (variance is severely inflated by outliers)

benchmarking canny/run1
mean: … ms, lb … ms, ub … ms, ci …
std dev: … ms, lb … ms, ub … ms, ci …
variance introduced by outliers: …% (variance is severely inflated by outliers)
97-100 Floyd-Warshall Algorithm Find the shortest paths in a weighted graph

shortestPath :: Graph -> Vertex -> Vertex -> Vertex -> Weight
shortestPath g i j 0 = weight g i j
shortestPath g i j k = min (shortestPath g i j (k-1))
                           (shortestPath g i k (k-1) + shortestPath g k j (k-1))

- k=0: the path between vertices i and j is the direct case only
- Otherwise the shortest path from i to j either passes through k, or it does not
- Exponential in the number of calls: instead traverse bottom-up to make the results at (k-1) available
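The bottom-up traversal can be sketched in plain Haskell over a dense list-of-lists matrix, with a large constant standing in for "no edge". This is an illustrative reference implementation with invented names, not the talk's code:

```haskell
type Weight = Int
type Graph  = [[Weight]]   -- g !! i !! j = weight of the edge i -> j

inf :: Weight
inf = 10 ^ 6   -- stand-in for "no edge", large enough not to overflow here

-- One pass: after `step k`, every entry holds the shortest path using
-- only intermediate vertices drawn from {0..k}.
step :: Int -> Graph -> Graph
step k g =
  [ [ min (g !! i !! j) (g !! i !! k + g !! k !! j)
    | j <- [0 .. n - 1] ]
  | i <- [0 .. n - 1] ]
  where n = length g

-- Run the passes bottom-up, k = 0 .. n-1, so results at k-1 are available.
shortestPaths :: Graph -> Graph
shortestPaths g0 = foldl (flip step) g0 [0 .. length g0 - 1]

main :: IO ()
main = mapM_ print (shortestPaths example)
  where
    example = [ [0,   1,   inf]
              , [inf, 0,   2  ]
              , [4,   inf, 0  ] ]
-- prints [0,1,3], [6,0,2], [4,5,0]
```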
101-102 Floyd-Warshall Algorithm Instead of a sparse graph, we'll use a dense adjacency matrix
- 2D array where each element is the distance between the two vertices

type Weight = Int32
type Graph  = Array DIM2 Weight

step :: Acc (Scalar Int) -> Acc Graph -> Acc Graph
step k g = generate (shape g) sp
  where
    k' = the k
    sp :: Exp DIM2 -> Exp Weight
    sp ix = let (Z :. i :. j) = unlift ix
            in min (g ! (index2 i j)) (g ! (index2 i k') + g ! (index2 k' j))

- Must pass the iteration number k as a scalar array
- sp is the core of the algorithm
103-108 Floyd-Warshall Algorithm

shortestPathsAcc :: Int -> Acc Graph -> Acc Graph
shortestPathsAcc n g0 = foldl1 (.) steps g0
  where
    steps :: [ Acc Graph -> Acc Graph ]
    steps = [ step (unit (constant k)) | k <- [0 .. n-1] ]

- Construct a list of steps, applying k in sequence 0 .. n-1
- Compose the sequence together

Let's try a 1000 x 1000 matrix:

$ ./fwaccel +RTS -s
fwaccel: *** Internal error in package accelerate ***
*** Please submit a bug report at
(unhandled): CUDA Exception: out of memory
6,627,846,360 bytes allocated in the heap
2,830,827,472 bytes copied during GC
733,091,960 bytes maximum residency (26 sample(s))
31,575,272 bytes maximum slop
812 MB total memory in use (54 MB lost due to fragmentation)
...
Total time 7.16s ( 11.42s elapsed)
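The compose-a-list-of-steps trick is ordinary Haskell. A small sketch with plain functions standing in for the Acc steps; note that (.) composes right-to-left, so foldl1 (.) applies the last element of the list first:

```haskell
-- Stand-ins for `step k`: each pass adds its iteration number.
steps :: [Int -> Int]
steps = [ (+ k) | k <- [0 .. 4] ]

-- foldl1 (.) steps = (+0) . (+1) . (+2) . (+3) . (+4)
pipeline :: Int -> Int
pipeline = foldl1 (.) steps

main :: IO ()
main = print (pipeline 0)   -- 0+1+2+3+4 = 10
```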
109-111 Pipelining To sequence computations we have a special operator:

(>->) :: (Arrays a, Arrays b, Arrays c)
      => (Acc a -> Acc b) -> (Acc b -> Acc c) -> Acc a -> Acc c

- Operationally, the two computations will not share any subcomputations with each other or the environment
- Intermediate arrays from the first can be GC'd when the second begins

$ ./fwaccel +RTS -s
Array (Z) [ ]
6,211,776,544 bytes allocated in the heap
914,044,096 bytes copied during GC
24,953,768 bytes maximum residency (211 sample(s))
2,023,496 bytes maximum slop
63 MB total memory in use (0 MB lost due to fragmentation)
...
Total time 3.86s ( 3.92s elapsed)

(>->) can help control memory use & startup time
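As a plain function, (>->) is just left-to-right composition. A sketch of that shape in ordinary Haskell (the operator below is defined locally for illustration; Accelerate's real (>->) additionally breaks sharing between the stages so the first stage's intermediates can be freed, which this sketch cannot reproduce):

```haskell
-- Left-to-right composition, the plain-function shadow of (>->).
(>->) :: (a -> b) -> (b -> c) -> a -> c
f >-> g = g . f

-- Unlike foldl1 (.), foldl1 (>->) applies the list's first step first.
runSteps :: [a -> a] -> a -> a
runSteps = foldl1 (>->)

main :: IO ()
main = do
  print (runSteps   [(+1), (*2)] 3)   -- (3+1)*2 = 8
  print (foldl1 (.) [(+1), (*2)] 3)   -- 3*2+1 = 7
```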
112 Questions?
113 Extra slides
114-115 Stencils A stencil is a map with access to the neighbourhood around each element
- Useful in many scientific & image processing algorithms

laplace :: Stencil3x3 Int -> Exp Int
laplace ((_,t,_),(l,c,r),(_,b,_)) = t + b + l + r - 4*c

(t above, l c r across the middle row, b below)

- Boundary conditions specify how to handle out-of-bounds neighbours

> let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
> run $ stencil laplace (Constant 0) (use mat)
Array (Z :. 3 :. 5) [4,3,2,1,-6,-5,0,0,0,-11,-26,-17,-18,-19,-36]
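The same stencil can be written over plain lists to check the numbers above. A reference sketch with invented names (no Accelerate), where the Constant 0 boundary becomes an explicit bounds check:

```haskell
-- Plain-list 3x3 Laplace stencil with a constant-0 boundary,
-- mirroring `stencil laplace (Constant 0)`.
laplace :: [[Int]] -> [[Int]]
laplace mat =
  [ [ at (i-1) j + at (i+1) j + at i (j-1) + at i (j+1) - 4 * at i j
    | j <- [0 .. cols - 1] ]
  | i <- [0 .. rows - 1] ]
  where
    rows = length mat
    cols = length (head mat)
    at i j | i < 0 || j < 0 || i >= rows || j >= cols = 0   -- boundary
           | otherwise                                 = mat !! i !! j

main :: IO ()
main = print (concat (laplace [[1..5], [6..10], [11..15]]))
-- [4,3,2,1,-6,-5,0,0,0,-11,-26,-17,-18,-19,-36], matching the slide
```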
117 Debugging options
-dverbose
-ddump-sharing
-ddump-simpl-stats, -ddump-simpl-iterations
-ddump-cc, -ddebug-cc
-ddump-exec
-ddump-gc
-fflush-cache
More informationCopyright 2007 Pearson Addison-Wesley. All rights reserved. A. Levitin Introduction to the Design & Analysis of Algorithms, 2 nd ed., Ch.
Iterative Improvement Algorithm design technique for solving optimization problems Start with a feasible solution Repeat the following step until no improvement can be found: change the current feasible
More informationCSE351 Spring 2018, Final Exam June 6, 2018
CSE351 Spring 2018, Final Exam June 6, 2018 Please do not turn the page until 2:30. Last Name: First Name: Student ID Number: Name of person to your left: Name of person to your right: Signature indicating:
More informationCS 1313 Spring 2000 Lecture Outline
1. What is a Computer? 2. Components of a Computer System Overview of Computing, Part 1 (a) Hardware Components i. Central Processing Unit ii. Main Memory iii. The Bus iv. Loading Data from Main Memory
More informationCPSC 3740 Programming Languages University of Lethbridge. Data Types
Data Types A data type defines a collection of data values and a set of predefined operations on those values Some languages allow user to define additional types Useful for error detection through type
More information5.12 EXERCISES Exercises 263
5.12 Exercises 263 5.12 EXERCISES 5.1. If it s defined, the OPENMP macro is a decimal int. Write a program that prints its value. What is the significance of the value? 5.2. Download omp trap 1.c from
More informationCSE 554 Lecture 6: Fairing and Simplification
CSE 554 Lecture 6: Fairing and Simplification Fall 2012 CSE554 Fairing and simplification Slide 1 Review Iso-contours in grayscale images and volumes Piece-wise linear representations Polylines (2D) and
More informationCS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. VLIW, Vector, and Multithreaded Machines
CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture VLIW, Vector, and Multithreaded Machines Assigned 3/24/2019 Problem Set #4 Due 4/5/2019 http://inst.eecs.berkeley.edu/~cs152/sp19
More informationKampala August, Agner Fog
Advanced microprocessor optimization Kampala August, 2007 Agner Fog www.agner.org Agenda Intel and AMD microprocessors Out Of Order execution Branch prediction Platform, 32 or 64 bits Choice of compiler
More informationOpenACC Fundamentals. Steve Abbott November 15, 2017
OpenACC Fundamentals Steve Abbott , November 15, 2017 AGENDA Data Regions Deep Copy 2 while ( err > tol && iter < iter_max ) { err=0.0; JACOBI ITERATION #pragma acc parallel loop reduction(max:err)
More informationLecture 9: Group Communication Operations. Shantanu Dutt ECE Dept. UIC
Lecture 9: Group Communication Operations Shantanu Dutt ECE Dept. UIC Acknowledgement Adapted from Chapter 4 slides of the text, by A. Grama w/ a few changes, augmentations and corrections Topic Overview
More informationWYSE Academic Challenge Computer Science Test (State) 2015 Solution Set
WYSE Academic Challenge Computer Science Test (State) 2015 Solution Set 1. Correct Answer: E The following shows the contents of the stack (a lifo last in first out structure) after each operation. Instruction
More informationCS377P Programming for Performance GPU Programming - II
CS377P Programming for Performance GPU Programming - II Sreepathi Pai UTCS November 11, 2015 Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work Outline
More informationECE 550D Fundamentals of Computer Systems and Engineering. Fall 2017
ECE 550D Fundamentals of Computer Systems and Engineering Fall 2017 The Operating System (OS) Prof. John Board Duke University Slides are derived from work by Profs. Tyler Bletsch and Andrew Hilton (Duke)
More information