GPGPU Programming in Haskell with Accelerate
1 GPGPU Programming in Haskell with Accelerate
Trevor L. McDonell, University of New South Wales
2 Preliminaries
Get it from Hackage:
- Debugging support enables some extra options to see what is going on
cabal install accelerate -fdebug
cabal install accelerate-cuda -fdebug
Need to import both the base library as well as a specific backend
- Import qualified to avoid name clashes with the Prelude
import Prelude as P
import Data.Array.Accelerate as A
import Data.Array.Accelerate.CUDA
-- or import Data.Array.Accelerate.Interpreter
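Putting the imports and a backend together, a complete program might look like the following sketch. It assumes the accelerate and accelerate-cuda packages are installed, and uses the dot-product example that appears later in the talk:

```haskell
-- Minimal Accelerate program: a sketch, assuming the accelerate
-- and accelerate-cuda packages are installed.
import Prelude                    as P
import Data.Array.Accelerate      as A
import Data.Array.Accelerate.CUDA ( run )   -- or .Interpreter

-- Dot product expressed as collective operations in Acc
dotp :: Vector Float -> Vector Float -> Scalar Float
dotp xs ys = run $ A.fold (+) 0
                 $ A.zipWith (*) (use xs) (use ys)

main :: IO ()
main = print (dotp (fromList (Z:.3) [1,2,3])
                   (fromList (Z:.3) [4,5,6]))
```

Swapping the CUDA import for the Interpreter backend runs the same computation on the CPU, which is useful when no GPU is available.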
3 Accelerate
Accelerate is a Domain-Specific Language for GPU programming
The pipeline:
- Haskell/Accelerate program
- Transform Accelerate program into CUDA code
- Compile with NVIDIA's compiler & load onto the GPU
- Copy result back to Haskell
4 Accelerate
Accelerate computations take place on arrays
- Parallelism is introduced in the form of collective operations over arrays: arrays in, Accelerate computation, arrays out
Arrays have two type parameters
data Array sh e
- The shape of the array, or dimensionality
- The element type of the array: Int, Float, etc.
5 Accelerate
Accelerate is split into two worlds: Acc and Exp
- Acc represents collective operations over instances of Arrays
- Exp is a scalar computation on things of type Elt
Collective operations in Acc comprise many scalar operations in Exp, executed in parallel over Arrays
- Scalar operations cannot contain collective operations
This stratification excludes nested data parallelism
6 Accelerate
To execute an Accelerate computation (on the GPU):
run :: Arrays a => Acc a -> a
- run comes from whichever backend we have chosen (CUDA)
To get arrays into Acc land:
use :: Arrays a => a -> Acc a
- This may involve copying data to the GPU
Using Accelerate is about everything in between: using combinators of type Acc to build an AST that will be turned into CUDA code and executed by run
9 Arrays
data Array sh e
Create an array from a list:
fromList :: (Shape sh, Elt e) => sh -> [e] -> Array sh e
- Generates a multidimensional array by consuming elements from the list and adding them to the array in row-major order
Example:
ghci> fromList (Z:.10) [1..10]
- Defaulting does not apply, because Shape is not a standard class:
<interactive>:3:1:
    No instance for (Shape (Z :. head0)) arising from a use of `fromList'
    The type variable `head0' is ambiguous
    Possible fix: add a type signature that fixes these type variable(s)
    Note: there is a potential instance available:
      instance Shape sh => Shape (sh :. Int)
        -- Defined in `Data.Array.Accelerate.Array.Sugar'
<interactive>:3:14:
    No instance for (Num head0) arising from the literal `10'
    The type variable `head0' is ambiguous
    Possible fix: add a type signature that fixes these type variable(s)
    Note: there are several potential instances:
      instance Num Double -- Defined in `GHC.Float'
      instance Num Float -- Defined in `GHC.Float'
      instance Integral a => Num (GHC.Real.Ratio a) -- Defined in `GHC.Real'
      ...plus 12 others
Number 1 tip: Add type signatures
13 Arrays
data Array sh e
Create an array from a list:
> fromList (Z:.10) [1..10] :: Vector Float
Array (Z :. 10) [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0]
Multidimensional arrays are similar:
- Elements are filled along the right-most dimension first
> fromList (Z:.3:.5) [1..] :: Array DIM2 Int
Array (Z :. 3 :. 5) [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
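Row-major order means element (r, c) of a Z:.rows:.cols array sits at linear position r*cols + c. A small plain-Haskell sketch of this index arithmetic (the helper names here are illustrative, not Accelerate API):

```haskell
-- Row-major layout: a (Z :. rows :. cols) array laid out as a flat list.
rowMajorIndex :: Int -> Int -> Int -> Int
rowMajorIndex cols r c = r * cols + c

-- Recover the element at (r, c) from the flat element list
elemAt :: Int -> [a] -> Int -> Int -> a
elemAt cols xs r c = xs !! rowMajorIndex cols r c

main :: IO ()
main = do
  let mat = [1..15] :: [Int]   -- fromList (Z:.3:.5) [1..]
  print (elemAt 5 mat 1 2)     -- row 1, column 2: prints 8
```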
15 Accelerate by example Password recovery
16 MD5 Algorithm
Aim:
- Implement one round of MD5: unsalted, single 512-bit block
- Apply to an array of words
- Compare hashes to some unknown hash - i.e. standard dictionary attack
17 MD5 Algorithm
The algorithm operates on a 4-word state: A, B, C, and D
There are 4 × 16 rounds, using the non-linear functions F, G, H, and I
- Mi is a word from the input message
- Ki is a constant
- <<<s is left rotate, by some constant ri
Each round operates on the 512-bit message block, modifying the state
18 MD5 Algorithm in Accelerate
Accelerate is a meta-programming language
- Use regular Haskell to generate the expression for each step of the round
- Produces an unrolled loop
type ABCD = (Exp Word32, Exp Word32, ... )
md5round :: Acc (Vector Word32) -> ABCD
md5round msg = P.foldl round (a0,b0,c0,d0) [0..63]
  where
    round :: ABCD -> Int -> ABCD
    round (a,b,c,d) i = ...
19 MD5 Algorithm in Accelerate
The constants ki and ri can be embedded directly
- The simple list lookup would be death in standard Haskell
- Generating the expression need not be performant, only executing it
k :: Int -> Exp Word32
k i = constant (ks P.!! i)
  where
    ks = [ 0xd76aa478, 0xe8c7b756, 0x242070db, 0xc1bdceee, ...
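The meta-programming point can be seen with a toy expression type standing in for Exp (my own sketch, not the Accelerate implementation): the `!!` lookup runs while the expression is being built, so the generated code contains only literals.

```haskell
import Data.Word (Word32)

-- A toy expression type standing in for Accelerate's Exp
data Expr = Lit Word32 | Add Expr Expr deriving (Show, Eq)

-- Analogue of: k i = constant (ks P.!! i)
k :: Int -> Expr
k i = Lit (ks !! i)          -- list lookup happens at generation time
  where ks = [0xd76aa478, 0xe8c7b756, 0x242070db, 0xc1bdceee]

main :: IO ()
main = print (Add (k 0) (k 2))   -- an expression of plain literals
```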
20 MD5 Algorithm in Accelerate
The message M is stored as an array, so we need array indexing
- Be wary, arbitrary array indexing can kill performance...
(!) :: (Shape ix, Elt e) => Acc (Array ix e) -> Exp ix -> Exp e
Get the right word of the message for the given round:
m :: Int -> Exp Word32
m i | i < 16    = msg ! index1 (constant i)
    | i < 32    = msg ! index1 (constant ((5*i + 1) `rem` 16))
    ...
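The full message-word schedule for all four rounds is fixed by the MD5 specification (RFC 1321), and since it is computed at expression-generation time it can be written as an ordinary Haskell function:

```haskell
-- MD5 message-word schedule (per RFC 1321): which of the 16
-- message words step i (0..63) reads.
schedule :: Int -> Int
schedule i
  | i < 16    = i
  | i < 32    = (5*i + 1) `rem` 16
  | i < 48    = (3*i + 5) `rem` 16
  | otherwise = (7*i)     `rem` 16

main :: IO ()
main = print (map schedule [0..63])
```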
22 MD5 Algorithm in Accelerate
Finally, the non-linear functions F, G, H, and I
round :: ABCD -> Int -> ABCD
round (a,b,c,d) i
  | i < 16    = shfl (f b c d)
  ...
  where
    shfl x  = (d, b + ((a + x + k i + m i) `rotateL` r i), b, c)
    f x y z = (x .&. y) .|. (complement x .&. z)
    ...
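All four non-linear functions are simple bit-twiddling on Word32 and work the same way in plain Haskell as inside Exp; per RFC 1321 they are:

```haskell
import Data.Bits (complement, xor, (.&.), (.|.))
import Data.Word (Word32)

-- The four MD5 non-linear functions (per RFC 1321)
f, g, h, i :: Word32 -> Word32 -> Word32 -> Word32
f x y z = (x .&. y) .|. (complement x .&. z)
g x y z = (x .&. z) .|. (y .&. complement z)
h x y z = x `xor` y `xor` z
i x y z = y `xor` (x .|. complement z)

main :: IO ()
main = print (f 0xffffffff 0x12345678 0, h 1 2 4)
```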
23 MD5 Algorithm in Accelerate
MD5 applied to a single 16-word vector: no parallelism here
Lift this operation to an array of n words
- Process many words in parallel to compare against the unknown
- Need to use generate, the most general form of array construction. Equivalently, the most easily misused
generate :: (Shape sh, Elt e) => Exp sh -> (Exp sh -> Exp e) -> Acc (Array sh e)
24 Problem Solving Accelerate
Hints for when things don't work as you expect
25 MD5 Algorithm in Accelerate
As always, data layout is important
- Accelerate arrays are stored in row-major order
- For a CPU, this means the input word would be read on a cache line:
c o r r e c t
h o r s e
b a t t e r y
s t a p l e
26 MD5 Algorithm in Accelerate
However in a parallel context many threads must work together
- generate uses one thread per element
- For best performance CUDA threads need to index adjacent memory
- This only works if the dictionary is stored column-major instead:
c h b s
o o a t
r r t a
r s t p
e e e l
c   r e
t   y
Pay attention to data layout if you use indexing operators
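The reason adjacent indexing matters falls out of the offset arithmetic. A plain-Haskell sketch (illustrative helpers, not Accelerate API) of the offsets consecutive threads touch at a given step:

```haskell
-- At step j, thread t reads offset
--   row-major:    t * len + j   (consecutive threads are 'len' apart)
--   column-major: j * n   + t   (consecutive threads are adjacent)
rowMajor, colMajor :: Int -> Int -> Int -> Int -> Int
rowMajor len _ t j = t * len + j
colMajor _   n t j = j * n   + t

main :: IO ()
main = do
  -- offsets touched by threads 0..3 at step j = 2, with len = 16, n = 4
  print [ rowMajor 16 4 t 2 | t <- [0..3] ]   -- strided: [2,18,34,50]
  print [ colMajor 16 4 t 2 | t <- [0..3] ]   -- adjacent: [8,9,10,11]
```

With the column-major layout, one memory transaction can serve all four threads; in the row-major layout each thread hits a different cache line.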
31 Nested Data-Parallelism
Accelerate is a language for flat data-parallel computations
- The allowed array element types do not include Array: simple types only
- Lots of types to statically exclude nested parallelism: Acc vs. Exp
- But it doesn't always succeed, and the error is uninformative:
*** Exception:
*** Internal error in package accelerate ***
*** Please submit a bug report at
(convertSharingExp): inconsistent shared 'Exp' tree with stable name 54; env = [56]
*** Exception: Cyclic definition of a value of type 'Exp' (sa = 45)
35 Nested Data-Parallelism
Matrix-vector multiplication
- Define vector dot product:
dotp :: Acc (Vector e) -> Acc (Vector e) -> Acc (Scalar e)
dotp u v = fold (+) 0 ( zipWith (*) u v )
36 Nested Data-Parallelism
Matrix-vector multiplication
- Want to apply dotp to every row of the matrix
- Extract a row from the matrix:
takeRow :: Exp Int -> Acc (Array DIM2 e) -> Acc (Vector e)
takeRow n mat =
  let Z :. _ :. cols = unlift (shape mat) :: Z :. Exp Int :. Exp Int
  in  backpermute (index1 cols) (\ix -> index2 n (unindex1 ix)) mat
- At each element in the output array, where in the input do I read from?
- [un]index1 converts an Int into a (Z:.Int) 1D index (and vice versa)
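The backpermute idea, "for each output index, where do I read from in the input?", can be sketched at the list level, together with takeRow defined in terms of it (illustrative names, not the Accelerate API):

```haskell
-- backpermute, list-style: build an output of the given size by
-- reading the input at the position the index function names.
backpermuteL :: Int -> (Int -> Int) -> [a] -> [a]
backpermuteL size from xs = [ xs !! from i | i <- [0 .. size-1] ]

-- Extract row n from a row-major rows x cols matrix
takeRowL :: Int -> Int -> [a] -> [a]
takeRowL cols n = backpermuteL cols (\j -> n*cols + j)

main :: IO ()
main = print (takeRowL 5 1 [1..15 :: Int])   -- second row: [6,7,8,9,10]
```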
39 Nested Data-Parallelism
Matrix-vector multiplication
- Apply dot product to each row of the matrix:
mvm :: Acc (Array DIM2 e) -> Acc (Vector e) -> Acc (Vector e)
mvm mat vec =
  let Z :. rows :. _ = unlift (shape mat) :: Z :. Exp Int :. Exp Int
  in  generate (index1 rows)
               (\ix -> the (vec `dotp` takeRow (unindex1 ix) mat))
- the extracts the element from a singleton array:
the :: Acc (Scalar e) -> Exp e
- Problem: this indexes an array in a way that depends on the index given by generate
43 Nested Data-Parallelism
The problem is attempting to execute many separate dot products in parallel
dotp :: Acc (Vector e) -> Acc (Vector e) -> Acc (Scalar e)
dotp u v = fold (+) 0 ( zipWith (*) u v )
We need a way to execute this step as a single collective operation
44 Reductions
Folding (+) over a vector produces a sum
- The result is a one-element array (scalar). Why?
> let xs = fromList (Z:.10) [1..] :: Vector Int
> run $ fold (+) 0 (use xs)
Array (Z) [55]
Fold has an interesting type:
fold :: (Shape sh, Elt a)
     => (Exp a -> Exp a -> Exp a)
     -> Exp a
     -> Acc (Array (sh:.Int) a)   -- input array
     -> Acc (Array sh a)          -- outer dimension removed
48 Reductions
Fold occurs over the outer dimension of the array
> let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
> run $ fold (+) 0 (use mat)
Array (Z :. 3) [15,40,65]
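Removing the outer dimension means a 3×5 array folds down to 3 sums, one per row. The same reduction over a row-major flat list, in plain Haskell (names my own):

```haskell
import Data.List (unfoldr)

-- Split a row-major flat list into rows of the given width, then
-- reduce each row: the list analogue of Accelerate's fold removing
-- the innermost (right-most) dimension.
foldInner :: (a -> a -> a) -> a -> Int -> [a] -> [a]
foldInner f z cols = map (foldl f z) . chunks cols
  where
    chunks n = unfoldr (\xs -> if null xs then Nothing
                               else Just (splitAt n xs))

main :: IO ()
main = print (foldInner (+) 0 5 [1..15 :: Int])   -- [15,40,65]
```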
50 Matrix-Vector Multiplication
The trick is that we can use this to do all of our dot-products in parallel
54 Replicate
Multidimensional replicate introduces new magic
replicate :: (Slice slix, Elt e)
          => Exp slix
          -> Acc (Array (SliceShape slix) e)
          -> Acc (Array (FullShape slix) e)
Type hackery!
> let vec = fromList (Z:.5) [1..] :: Vector Float
> run $ replicate (constant (Z :. (3::Int) :. All)) (use vec)
Array (Z :. 3 :. 5) [1.0,2.0,3.0,4.0,5.0,1.0,2.0,3.0,4.0,...]
- All indicates the entirety of the existing dimension
- The (3::Int) gives the number of new rows
60 Matrix-Vector Multiplication
Looks very much like vector dot product:
mvm :: Acc (Array DIM2 e) -> Acc (Vector e) -> Acc (Vector e)
mvm mat vec =
  let Z :. rows :. _ = unlift (shape mat) :: Z :. Exp Int :. Exp Int
      vec'           = A.replicate (lift (Z :. rows :. All)) vec
  in  fold (+) 0 (A.zipWith (*) vec' mat)
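That the replicate/zipWith/fold formulation computes the same thing as one dot product per row can be checked at the list level (a sketch with plain lists, not Accelerate arrays):

```haskell
-- List-level sketch: replicating the vector across each row,
-- multiplying pointwise, and reducing each row is exactly a dot
-- product per row.
dotp :: Num e => [e] -> [e] -> e
dotp u v = sum (zipWith (*) u v)

mvm :: Num e => [[e]] -> [e] -> [e]
mvm mat vec = map (sum . zipWith (*) vec) mat   -- vec "replicated" per row

main :: IO ()
main = do
  let mat = [[1,2],[3,4],[5,6]] :: [[Int]]
      vec = [10,100]
  print (mvm mat vec)          -- [210,430,650]
  print (map (dotp vec) mat)   -- same result
```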
61 N-Body Simulation
Calculate all interactions between a vector of particles:
calcAccels :: Exp R -> Acc (Vector Body) -> Acc (Vector Accel)
calcAccels epsilon bodies =
  let n    = A.size bodies
      rows = A.replicate (lift $ Z :. n :. All) bodies
      cols = A.replicate (lift $ Z :. All :. n) bodies
  in
  A.fold (.+.) (vec 0) ( A.zipWith (accel epsilon) rows cols )
Turn nested computations into segmented or higher-dimensional operations
63 Accelerate
Recall that Accelerate computations take place on arrays: arrays in, Accelerate computation, arrays out
Accelerate evaluates the expression passed to run to generate a series of CUDA kernels
- Each piece of CUDA code must be compiled and loaded
- The goal is to make kernels that can be reused
- If we don't, the overhead of compilation can ruin performance
68 Embedded Scalars
Consider drop, which yields all but the first n elements of a vector:
drop :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e)
drop n arr =
  let n' = the (unit n)
  in  backpermute (ilift1 (subtract n') (shape arr)) (ilift1 (+ n')) arr
- unit lifts an expression into a singleton array:
unit :: Exp e -> Acc (Scalar e)
- the extracts the element of a singleton array:
the :: Acc (Scalar e) -> Exp e
72 Embedded Scalars
Check the expression Accelerate sees when it evaluates run:
> let vec = fromList (Z:.10) [1..] :: Vector Int
> drop 4 (use vec)
let a0 = use (Array (Z :. 10) [0,1,2,3,4,5,6,7,8,9]) in
let a1 = unit 4
in backpermute
     (let x0 = Z in x0 :. (indexHead (shape a0)) - (a1!x0))
     (\x0 -> let x1 = Z in x1 :. (indexHead x0) + (a1!x1))
     a0
- a1 corresponds to the array we created for n
- Critically, this is outside the call to backpermute
76 Embedded Scalars
Check the expression Accelerate sees when the scalar is used directly, without unit:
drop' :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e)
drop' n arr =
  backpermute (ilift1 (subtract n) (shape arr)) (ilift1 (+ n)) arr
> let vec = fromList (Z:.10) [1..] :: Vector Int
> drop' 4 (use vec)
let a0 = use (Array (Z :. 10) [0,1,2,3,4,5,6,7,8,9])
in backpermute
     (Z :. (indexHead (shape a0)))
     (\x0 -> Z :. 4 + (indexHead x0))
     a0
- The constant 4 is now baked into the expression
- This will defeat Accelerate's caching of compiled kernels
Make sure any arguments that change are passed as Arrays
88 Embedded Scalars
Inspect the expression directly, as done here
Alternatively: Accelerate has some debugging options
- Run the program with the -ddump-cc command line switch
- Make sure you have installed Accelerate with -fdebug
89 Executing Computations
Recall: to actually execute a computation we use run
run :: Arrays a => Acc a -> a
This entails quite a bit of work setting up the computation to run on the GPU
- And sometimes the computation never changes
90-91 Executing Computations Alternative: execute an array program of one argument

run1 :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> a -> b

What is the difference?
- This version can be partially applied with the (Acc a -> Acc b) argument
- Returns a new function (a -> b)
- Key new thing: behind the scenes, everything other than the final execution is already done; the AST is annotated with the object code required for execution
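The benefit of run1 can be mimicked in ordinary Haskell: a partial application captures the work done before the last argument arrives, so it is paid once and shared across calls. A minimal sketch, plain Haskell with no Accelerate; `compile` and its cost model are invented for illustration:

```haskell
import Data.List (foldl')

-- "Expensive compilation": build a result once, behind a partial
-- application, so only the cheap final step runs per input.
compile :: Int -> (Int -> Int)
compile n =
  let table = foldl' (+) 0 [1 .. n]   -- pretend this is costly setup
  in \x -> x + table                  -- cheap per-call work

main :: IO ()
main = do
  let go = compile 1000000   -- setup is associated with `go` once
  print (go 1)               -- forces the setup, then adds 1
  print (go 2)               -- reuses the already-forced setup
```

This mirrors `let edges = run1 canny in map edges images`: partially apply once, call many times.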
92-94 Executing Computations

canny :: Acc (Array DIM2 RGBA) -> Acc (Array DIM2 Float)
canny = ...

No magic:
edges :: Array DIM2 RGBA -> Array DIM2 Float
edges img = run ( canny (use img) )

Magic:
edges :: Array DIM2 RGBA -> Array DIM2 Float
edges img = run1 canny img
95 Executing Computations

canny :: Acc (Array DIM2 RGBA) -> Acc (Array DIM2 Float)
canny = ...

Criterion benchmarks:

benchmarking canny/run
collecting 100 samples, 1 iterations each, in estimated … s
mean: … ms, lb … ms, ub … ms, ci …
std dev: … ms, lb … ms, ub … ms, ci …
variance introduced by outliers: …% (variance is severely inflated by outliers)

benchmarking canny/run1
mean: … ms, lb … ms, ub … ms, ci …
std dev: … ms, lb … ms, ub … ms, ci …
variance introduced by outliers: …% (variance is severely inflated by outliers)
97-100 Floyd-Warshall Algorithm Find the shortest paths in a weighted graph

shortestPath :: Graph -> Vertex -> Vertex -> Vertex -> Weight
shortestPath g i j 0 = weight g i j
shortestPath g i j k = min (shortestPath g i j (k-1))
                           (shortestPath g i k (k-1) + shortestPath g k j (k-1))

- k=0: the path between vertices i and j is the direct case only
- Otherwise the shortest path from i to j either passes through k, or it does not
- Exponential in the number of calls: instead traverse bottom-up to make the results at (k-1) available
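The bottom-up traversal can be sketched in plain Haskell over a dense list-of-lists matrix, with a large constant standing in for "no edge". This is an illustrative reference implementation with invented names, not the talk's code:

```haskell
type Weight = Int
type Graph  = [[Weight]]   -- g !! i !! j = weight of the edge i -> j

inf :: Weight
inf = 10 ^ 6   -- stand-in for "no edge", large enough not to overflow here

-- One pass: after `step k`, every entry holds the shortest path using
-- only intermediate vertices drawn from {0..k}.
step :: Int -> Graph -> Graph
step k g =
  [ [ min (g !! i !! j) (g !! i !! k + g !! k !! j)
    | j <- [0 .. n - 1] ]
  | i <- [0 .. n - 1] ]
  where n = length g

-- Run the passes bottom-up, k = 0 .. n-1, so results at k-1 are available.
shortestPaths :: Graph -> Graph
shortestPaths g0 = foldl (flip step) g0 [0 .. length g0 - 1]

main :: IO ()
main = mapM_ print (shortestPaths example)
  where
    example = [ [0,   1,   inf]
              , [inf, 0,   2  ]
              , [4,   inf, 0  ] ]
-- prints [0,1,3], [6,0,2], [4,5,0]
```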
101-102 Floyd-Warshall Algorithm Instead of a sparse graph, we'll use a dense adjacency matrix
- 2D array where each element is the distance between the two vertices

type Weight = Int32
type Graph  = Array DIM2 Weight

step :: Acc (Scalar Int) -> Acc Graph -> Acc Graph
step k g = generate (shape g) sp
  where
    k' = the k
    sp :: Exp DIM2 -> Exp Weight
    sp ix = let (Z :. i :. j) = unlift ix
            in min (g ! (index2 i j)) (g ! (index2 i k') + g ! (index2 k' j))

- Must pass the iteration number k as a scalar array
- sp is the core of the algorithm
103-108 Floyd-Warshall Algorithm

shortestPathsAcc :: Int -> Acc Graph -> Acc Graph
shortestPathsAcc n g0 = foldl1 (.) steps g0
  where
    steps :: [ Acc Graph -> Acc Graph ]
    steps = [ step (unit (constant k)) | k <- [0 .. n-1] ]

- Construct a list of steps, applying k in sequence 0 .. n-1
- Compose the sequence together

Let's try a 1000 x 1000 matrix:

$ ./fwaccel +RTS -s
fwaccel: *** Internal error in package accelerate ***
*** Please submit a bug report at
(unhandled): CUDA Exception: out of memory
6,627,846,360 bytes allocated in the heap
2,830,827,472 bytes copied during GC
733,091,960 bytes maximum residency (26 sample(s))
31,575,272 bytes maximum slop
812 MB total memory in use (54 MB lost due to fragmentation)
...
Total time 7.16s ( 11.42s elapsed)
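The compose-a-list-of-steps trick is ordinary Haskell. A small sketch with plain functions standing in for the Acc steps; note that (.) composes right-to-left, so foldl1 (.) applies the last element of the list first:

```haskell
-- Stand-ins for `step k`: each pass adds its iteration number.
steps :: [Int -> Int]
steps = [ (+ k) | k <- [0 .. 4] ]

-- foldl1 (.) steps = (+0) . (+1) . (+2) . (+3) . (+4)
pipeline :: Int -> Int
pipeline = foldl1 (.) steps

main :: IO ()
main = print (pipeline 0)   -- 0+1+2+3+4 = 10
```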
109-111 Pipelining To sequence computations we have a special operator:

(>->) :: (Arrays a, Arrays b, Arrays c)
      => (Acc a -> Acc b) -> (Acc b -> Acc c) -> Acc a -> Acc c

- Operationally, the two computations will not share any subcomputations with each other or the environment
- Intermediate arrays from the first can be GC'd when the second begins

$ ./fwaccel +RTS -s
Array (Z) [ ]
6,211,776,544 bytes allocated in the heap
914,044,096 bytes copied during GC
24,953,768 bytes maximum residency (211 sample(s))
2,023,496 bytes maximum slop
63 MB total memory in use (0 MB lost due to fragmentation)
...
Total time 3.86s ( 3.92s elapsed)

(>->) can help control memory use & startup time
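As a plain function, (>->) is just left-to-right composition. A sketch of that shape in ordinary Haskell (the operator below is defined locally for illustration; Accelerate's real (>->) additionally breaks sharing between the stages so the first stage's intermediates can be freed, which this sketch cannot reproduce):

```haskell
-- Left-to-right composition, the plain-function shadow of (>->).
(>->) :: (a -> b) -> (b -> c) -> a -> c
f >-> g = g . f

-- Unlike foldl1 (.), foldl1 (>->) applies the list's first step first.
runSteps :: [a -> a] -> a -> a
runSteps = foldl1 (>->)

main :: IO ()
main = do
  print (runSteps   [(+1), (*2)] 3)   -- (3+1)*2 = 8
  print (foldl1 (.) [(+1), (*2)] 3)   -- 3*2+1 = 7
```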
112 Questions?
113 Extra slides
114-115 Stencils A stencil is a map with access to the neighbourhood around each element
- Useful in many scientific & image processing algorithms

laplace :: Stencil3x3 Int -> Exp Int
laplace ((_,t,_),(l,c,r),(_,b,_)) = t + b + l + r - 4*c

(t above, l c r across the middle row, b below)

- Boundary conditions specify how to handle out-of-bounds neighbours

> let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
> run $ stencil laplace (Constant 0) (use mat)
Array (Z :. 3 :. 5) [4,3,2,1,-6,-5,0,0,0,-11,-26,-17,-18,-19,-36]
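The same stencil can be written over plain lists to check the numbers above. A reference sketch with invented names (no Accelerate), where the Constant 0 boundary becomes an explicit bounds check:

```haskell
-- Plain-list 3x3 Laplace stencil with a constant-0 boundary,
-- mirroring `stencil laplace (Constant 0)`.
laplace :: [[Int]] -> [[Int]]
laplace mat =
  [ [ at (i-1) j + at (i+1) j + at i (j-1) + at i (j+1) - 4 * at i j
    | j <- [0 .. cols - 1] ]
  | i <- [0 .. rows - 1] ]
  where
    rows = length mat
    cols = length (head mat)
    at i j | i < 0 || j < 0 || i >= rows || j >= cols = 0   -- boundary
           | otherwise                                 = mat !! i !! j

main :: IO ()
main = print (concat (laplace [[1..5], [6..10], [11..15]]))
-- [4,3,2,1,-6,-5,0,0,0,-11,-26,-17,-18,-19,-36], matching the slide
```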
117 Debugging options
-dverbose
-ddump-sharing
-ddump-simpl-stats, -ddump-simpl-iterations
-ddump-cc, -ddebug-cc
-ddump-exec
-ddump-gc
-fflush-cache
More informationCopyright 2007 Pearson Addison-Wesley. All rights reserved. A. Levitin Introduction to the Design & Analysis of Algorithms, 2 nd ed., Ch.
Iterative Improvement Algorithm design technique for solving optimization problems Start with a feasible solution Repeat the following step until no improvement can be found: change the current feasible
More informationCSE351 Spring 2018, Final Exam June 6, 2018
CSE351 Spring 2018, Final Exam June 6, 2018 Please do not turn the page until 2:30. Last Name: First Name: Student ID Number: Name of person to your left: Name of person to your right: Signature indicating:
More informationCS 1313 Spring 2000 Lecture Outline
1. What is a Computer? 2. Components of a Computer System Overview of Computing, Part 1 (a) Hardware Components i. Central Processing Unit ii. Main Memory iii. The Bus iv. Loading Data from Main Memory
More informationCPSC 3740 Programming Languages University of Lethbridge. Data Types
Data Types A data type defines a collection of data values and a set of predefined operations on those values Some languages allow user to define additional types Useful for error detection through type
More information5.12 EXERCISES Exercises 263
5.12 Exercises 263 5.12 EXERCISES 5.1. If it s defined, the OPENMP macro is a decimal int. Write a program that prints its value. What is the significance of the value? 5.2. Download omp trap 1.c from
More informationCSE 554 Lecture 6: Fairing and Simplification
CSE 554 Lecture 6: Fairing and Simplification Fall 2012 CSE554 Fairing and simplification Slide 1 Review Iso-contours in grayscale images and volumes Piece-wise linear representations Polylines (2D) and
More informationCS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. VLIW, Vector, and Multithreaded Machines
CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture VLIW, Vector, and Multithreaded Machines Assigned 3/24/2019 Problem Set #4 Due 4/5/2019 http://inst.eecs.berkeley.edu/~cs152/sp19
More informationKampala August, Agner Fog
Advanced microprocessor optimization Kampala August, 2007 Agner Fog www.agner.org Agenda Intel and AMD microprocessors Out Of Order execution Branch prediction Platform, 32 or 64 bits Choice of compiler
More informationOpenACC Fundamentals. Steve Abbott November 15, 2017
OpenACC Fundamentals Steve Abbott , November 15, 2017 AGENDA Data Regions Deep Copy 2 while ( err > tol && iter < iter_max ) { err=0.0; JACOBI ITERATION #pragma acc parallel loop reduction(max:err)
More informationLecture 9: Group Communication Operations. Shantanu Dutt ECE Dept. UIC
Lecture 9: Group Communication Operations Shantanu Dutt ECE Dept. UIC Acknowledgement Adapted from Chapter 4 slides of the text, by A. Grama w/ a few changes, augmentations and corrections Topic Overview
More informationWYSE Academic Challenge Computer Science Test (State) 2015 Solution Set
WYSE Academic Challenge Computer Science Test (State) 2015 Solution Set 1. Correct Answer: E The following shows the contents of the stack (a lifo last in first out structure) after each operation. Instruction
More informationCS377P Programming for Performance GPU Programming - II
CS377P Programming for Performance GPU Programming - II Sreepathi Pai UTCS November 11, 2015 Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work Outline
More informationECE 550D Fundamentals of Computer Systems and Engineering. Fall 2017
ECE 550D Fundamentals of Computer Systems and Engineering Fall 2017 The Operating System (OS) Prof. John Board Duke University Slides are derived from work by Profs. Tyler Bletsch and Andrew Hilton (Duke)
More information