GPGPU Programming in Haskell with Accelerate

1 GPGPU Programming in Haskell with Accelerate Trevor L. McDonell University of New South Wales

2 Preliminaries Get it from Hackage: cabal install accelerate -fdebug cabal install accelerate-cuda -fdebug - Debugging support (-fdebug) enables some extra options to see what is going on Need to import both the base library as well as a specific backend - Import qualified to avoid name clashes with the Prelude import Prelude as P import Data.Array.Accelerate as A import Data.Array.Accelerate.CUDA -- or import Data.Array.Accelerate.Interpreter

3 Accelerate Accelerate is a Domain-Specific Language for GPU programming [pipeline diagram: Haskell/Accelerate program -> transform Accelerate program into CUDA code -> compile with NVIDIA's compiler & load onto the GPU -> copy result back to Haskell]

4 Accelerate Accelerate computations take place on arrays - Parallelism is introduced in the form of collective operations over arrays [diagram: Arrays in -> Accelerate computation -> Arrays out] Arrays have two type parameters: data Array sh e - The shape of the array, or dimensionality - The element type of the array: Int, Float, etc.

5 Accelerate Accelerate is split into two worlds: Acc and Exp - Acc represents collective operations over instances of Arrays - Exp is a scalar computation on things of type Elt Collective operations in Acc comprise many scalar operations in Exp, executed in parallel over Arrays - Scalar operations can not contain collective operations This stratification excludes nested data parallelism
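As a minimal sketch of this split (illustrative names, not from the slides; assumes the imports from the Preliminaries slide): the per-element function lives in Exp, and the collective operation that applies it across an array lives in Acc.
square :: Exp Float -> Exp Float                       -- scalar computation, in the Exp world
square x = x * x
squareAll :: Acc (Vector Float) -> Acc (Vector Float)  -- collective operation, in the Acc world
squareAll = A.map square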

8 Accelerate To execute an Accelerate computation (on the GPU): run :: Arrays a => Acc a -> a - run comes from whichever backend we have chosen (CUDA) To get arrays into Acc land: use :: Arrays a => a -> Acc a - This may involve copying data to the GPU Using Accelerate we focus on everything in between: using combinators of type Acc to build an AST that will be turned into CUDA code and executed by run
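Putting run and use together, a minimal end-to-end sketch (assumes the CUDA backend and the imports from the Preliminaries slide; main and xs are illustrative names):
main :: IO ()
main = do
  let xs = fromList (Z :. 10) [1..10] :: Vector Float
  -- use copies xs to the GPU, A.map doubles each element there, run copies the result back
  print $ run (A.map (*2) (use xs))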

12 Arrays data Array sh e Create an array from a list: fromList :: (Shape sh, Elt e) => sh -> [e] -> Array sh e - Generates a multidimensional array by consuming elements from the list and adding them to the array in row-major order Example: ghci> fromList (Z:.10) [1..10] - Defaulting does not apply, because Shape is not a standard class <interactive>:3:1: No instance for (Shape (Z :. head0)) arising from a use of `fromList' The type variable `head0' is ambiguous Possible fix: add a type signature that fixes these type variable(s) Note: there is a potential instance available: instance Shape sh => Shape (sh :. Int) -- Defined in `Data.Array.Accelerate.Array.Sugar' Possible fix: add an instance declaration for (Shape (Z :. head0)) In the expression: fromList (Z :. 10) [1 .. 10] In an equation for `it': it = fromList (Z :. 10) [1 .. 10] <interactive>:3:14: No instance for (Num head0) arising from the literal `10' The type variable `head0' is ambiguous Possible fix: add a type signature that fixes these type variable(s) Note: there are several potential instances: instance Num Double -- Defined in `GHC.Float' instance Num Float -- Defined in `GHC.Float' instance Integral a => Num (GHC.Real.Ratio a) -- Defined in `GHC.Real' ...plus 12 others In the second argument of `(:.)', namely `10' In the first argument of `fromList', namely `(Z :. 10)' Number 1 tip: Add type signatures

13 Arrays data Array sh e Create an array from a list: > fromList (Z:.10) [1..10] :: Vector Float Array (Z :. 10) [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0]

14 Arrays data Array sh e Create an array from a list: > fromList (Z:.10) [1..10] :: Vector Float Array (Z :. 10) [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0] Multidimensional arrays are similar: - Elements are filled along the right-most dimension first > fromList (Z:.3:.5) [1..] :: Array DIM2 Int Array (Z :. 3 :. 5) [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
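Host-side arrays can also be inspected without entering Acc; for example indexArray (from Data.Array.Accelerate) reads one element of a plain Array at a given index. A quick ghci sketch using the matrix above:
> let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
> indexArray mat (Z:.2:.1)
12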

15 Accelerate by example Password recovery

16 MD5 Algorithm Aim: - Implement one round of MD5: unsalted, single 512-bit block - Apply to an array of words - Compare hashes to some unknown hash - i.e. standard dictionary attack

17 MD5 Algorithm The algorithm operates on a 4-word state: A, B, C, and D There are 4 x 16 rounds: F, G, H, and I - M i is a word from the input message - K i is a constant - <<< s is left rotate, by some constant r i Each round operates on the 512-bit message block, modifying the state

18 MD5 Algorithm in Accelerate Accelerate is a meta-programming language - Use regular Haskell to generate the expression for each step of the round - Produces an unrolled loop type ABCD = (Exp Word32, Exp Word32,... ) md5round :: Acc (Vector Word32) -> ABCD md5round msg = P.foldl round (a0,b0,c0,d0) [0..63] where round :: ABCD -> Int -> ABCD round (a,b,c,d) i =...
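To make the meta-programming point concrete, a small sketch with a hypothetical helper (not from the talk); the Haskell-level foldl runs while the expression is being constructed, so the backend sees one large unrolled Exp rather than a loop:
import Prelude               as P
import Data.Array.Accelerate as A
import Data.Word             (Word32)
-- Sum the first 16 words of the message; the P.foldl unrolls into 16 indexing
-- operations inside a single scalar expression.
unrolledSum :: Acc (Vector Word32) -> Exp Word32
unrolledSum msg = P.foldl (\acc i -> acc + msg A.! index1 (constant i)) 0 [0 .. 15 :: Int]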

19 MD5 Algorithm in Accelerate The constants k i and r i can be embedded directly - The simple list lookup would be death in standard Haskell - Generating the expression need not be performant, only executing it k :: Int -> Exp Word32 k i = constant (ks P.!! i) where ks = [ 0xd76aa478, 0xe8c7b756, 0x242070db, 0xc1bdceee,...

21 MD5 Algorithm in Accelerate The message M is stored as an array, so we need array indexing - Be wary, arbitrary array indexing can kill performance... (!) :: (Shape ix, Elt e) => Acc (Array ix e) -> Exp ix -> Exp e Get the right word of the message for the given round m :: Int -> Exp Word32 m i | i < 16 = msg ! index1 (constant i) | i < 32 = msg ! index1 (constant ((5*i + 1) `rem` 16))...

22 MD5 Algorithm in Accelerate Finally, the non-linear functions F, G, H, and I round :: ABCD -> Int -> ABCD round (a,b,c,d) i | i < 16 = shfl (f b c d)... where shfl x = (d, b + ((a + x + k i + m i) `rotateL` r i), b, c) f x y z = (x .&. y) .|. (complement x .&. z)...

23 MD5 Algorithm in Accelerate MD5 applied to a single 16-word vector: no parallelism here Lift this operation to an array of n words - Process many words in parallel to compare against the unknown - Need to use generate, the most general form of array construction. Equivalently, the most easily misused generate :: (Shape sh, Elt e) => Exp sh -> (Exp sh -> Exp e) -> Acc (Array sh e)
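As a standalone sketch of generate (illustrative, not from the slides): build an n-element vector whose i-th element is i by mapping every output index to a value.
enumFromZero :: Exp Int -> Acc (Vector Int)
enumFromZero n = generate (index1 n) unindex1   -- each index (Z :. i) is turned into the value i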

24 Problem Solving Accelerate Hints for when things don't work as you expect

25 MD5 Algorithm in Accelerate As always, data layout is important - Accelerate arrays are stored in row-major order - For a CPU, this means the input word would be read on a cache line [diagram: the dictionary entries "correct", "horse", "battery", "staple" stored row-major, one word per row]

30 MD5 Algorithm in Accelerate However in a parallel context many threads must work together - generate uses one thread per element - For best performance CUDA threads need to index adjacent memory - This only works if the dictionary is stored column-major instead [diagram: the same dictionary stored column-major, so consecutive threads read characters from adjacent memory] Pay attention to data layout if you use indexing operators

34 Nested Data-Parallelism Accelerate is a language for flat data-parallel computations - The allowed array element types do not include Array: simple types only - Lots of types to statically exclude nested parallelism: Acc vs. Exp - But, it doesn't always succeed, and the error is uninformative *** Exception: *** Internal error in package accelerate *** *** Please submit a bug report at (convertSharingExp): inconsistent shared 'Exp' tree with stable name 54; env = [56] *** Exception: Cyclic definition of a value of type 'Exp' (sa = 45)

35 Nested Data-Parallelism Matrix-vector multiplication - Define vector dot product dotp :: Acc (Vector e) -> Acc (Vector e) -> Acc (Scalar e) dotp u v = fold (+) 0 ( zipWith (*) u v )
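A quick ghci check of dotp, as a sketch (assuming the numeric constraints on e that the slide elides):
> let xs = fromList (Z:.5) [1..] :: Vector Float
> run $ dotp (use xs) (use xs)
Array (Z) [55.0]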

38 Nested Data-Parallelism Matrix-vector multiplication - Want to apply dotp to every row of the matrix - Extract a row from the matrix takeRow :: Exp Int -> Acc (Array DIM2 e) -> Acc (Vector e) takeRow n mat = let Z :. _ :. cols = unlift (shape mat) :: Z :. Exp Int :. Exp Int in backpermute (index1 cols) (\ix -> index2 n (unindex1 ix)) mat - At each element in the output array, where in the input do I read from? - [un]index1 converts an Int into a (Z:.Int) 1D index (and vice versa)

42 Nested Data-Parallelism Matrix-vector multiplication - Apply dot product to each row of the matrix mvm :: Acc (Array DIM2 e) -> Acc (Vector e) -> Acc (Vector e) mvm mat vec = let Z :. rows :. _ = unlift (shape mat) :: Z :. Exp Int :. Exp Int in generate (index1 rows) (\ix -> the (vec `dotp` takeRow (unindex1 ix) mat)) - the extracts the element from a singleton array: the :: Acc (Scalar e) -> Exp e - Note that takeRow here indexes an array that depends on the index given by generate

43 Nested Data-Parallelism The problem is attempting to execute many separate dot products in parallel dotp :: Acc (Vector e) -> Acc (Vector e) -> Acc (Scalar e) dotp u v = fold (+) 0 ( zipWith (*) u v ) We need a way to execute this step as a single collective operation

47 Reductions Folding (+) over a vector produces a sum - The result is a one-element array (scalar). Why? > let xs = fromList (Z:.10) [1..] :: Vector Int > run $ fold (+) 0 (use xs) Array (Z) [55] Fold has an interesting type: fold :: (Shape sh, Elt a) => (Exp a -> Exp a -> Exp a) -> Exp a -> Acc (Array (sh:.Int) a) -> Acc (Array sh a) - The input array has shape (sh :. Int); the outer dimension is removed in the result

48 Reductions Fold occurs over the outer dimension of the array > let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int > run $ fold (+) 0 (use mat) Array (Z :. 3) [15,40,65]

50 Matrix-Vector Multiplication The trick is that we can use this to do all of our dot-products in parallel

59 Replicate Multidimensional replicate introduces new magic replicate :: (Slice slix, Elt e) => Exp slix -> Acc (Array (SliceShape slix) e) -> Acc (Array (FullShape slix) e) Type hackery > let vec = fromList (Z:.5) [1..] :: Vector Float > run $ replicate (constant (Z :. (3::Int) :. All)) (use vec) Array (Z :. 3 :. 5) [1.0,2.0,3.0,4.0,5.0,1.0,2.0,3.0,4.0,...] - All indicates the entirety of the existing dimension - (3::Int) gives the number of new rows

60 Matrix-Vector Multiplication Looks very much like vector dot product mvm :: Acc (Array DIM2 e) -> Acc (Vector e) -> Acc (Vector e) mvm mat vec = let Z :. rows :. _ = unlift (shape mat) :: Z :. Exp Int :. Exp Int vec' = A.replicate (lift (Z :. rows :. All)) vec in fold (+) 0 (A.zipWith (*) vec' mat)
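A quick ghci check of this mvm, as a sketch (assuming it is in scope with the numeric constraints the slide elides):
> let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Float
> let vec = fromList (Z:.5) [1..] :: Vector Float
> run $ mvm (use mat) (use vec)
Array (Z :. 3) [55.0,130.0,205.0]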

62 N-Body Simulation Calculate all interactions between a vector of particles calcAccels :: Exp R -> Acc (Vector Body) -> Acc (Vector Accel) calcAccels epsilon bodies = let n = A.size bodies rows = A.replicate (lift $ Z :. n :. All) bodies cols = A.replicate (lift $ Z :. All :. n) bodies in A.fold (.+.) (vec 0) ( A.zipWith (accel epsilon) rows cols ) Turn nested computations into segmented or higher-dimensional operations

67 Accelerate Recall that Accelerate computations take place on arrays [diagram: Arrays in -> Accelerate computation -> Arrays out] Accelerate evaluates the expression passed to run to generate a series of CUDA kernels - Each piece of CUDA code must be compiled and loaded - The goal is to make kernels that can be reused - If we don't, the overhead of compilation can ruin performance

71 Embedded Scalars Consider drop, which yields all but the first n elements of a vector drop :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e) drop n arr = let n' = the (unit n) in backpermute (ilift1 (subtract n') (shape arr)) (ilift1 (+ n')) arr - unit lifts an expression into a singleton array: unit :: Exp e -> Acc (Scalar e) - the extracts the element of a singleton array: the :: Acc (Scalar e) -> Exp e
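Running this drop behaves as expected; a quick ghci sketch (assuming the drop defined above is in scope):
> let vec = fromList (Z:.10) [1..] :: Vector Int
> run $ drop 4 (use vec)
Array (Z :. 6) [5,6,7,8,9,10]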

72 Embedded Scalars Check the expression Accelerate sees when it evaluates run > let vec = fromList (Z:.10) [1..] :: Vector Int > drop 4 (use vec) let a0 = use (Array (Z :. 10) [0,1,2,3,4,5,6,7,8,9]) in let a1 = unit 4 in backpermute (let x0 = Z in x0 :. (indexHead (shape a0)) - (a1!x0)) (\x0 -> let x1 = Z in x1 :. (indexHead x0) + (a1!x1)) a0 - a1 corresponds to the array we created for n - Critically, this is outside the call to backpermute

87 Embedded Scalars Check the expression Accelerate sees when it evaluates run drop' :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e) drop' n arr = backpermute (ilift1 (subtract n) (shape arr)) (ilift1 (+ n)) arr > let vec = fromList (Z:.10) [1..] :: Vector Int > drop' 4 (use vec) let a0 = use (Array (Z :. 10) [0,1,2,3,4,5,6,7,8,9]) in backpermute (Z :. (indexHead (shape a0))) (\x0 -> Z :. 4 + (indexHead x0)) a0 - The constant 4 is now embedded in the expression - This will defeat Accelerate's caching of compiled kernels Make sure any arguments that change are passed as Arrays

88 Embedded Scalars Inspect the expression directly, as done here Alternatively: Accelerate has some debugging options - Run the program with the -ddump-cc command line switch - Make sure you have installed Accelerate with -fdebug

89 Executing Computations Recall: to actually execute a computation we use run run :: Arrays a => Acc a -> a This entails quite a bit of work setting up the computation to run on the GPU - And sometimes the computation never changes

91 Executing Computations Alternative: execute an array program of one argument run1 :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> a -> b What is the difference? - This version can be partially applied with the (Acc a -> Acc b) argument - Returns a new function (a -> b) - Key new thing: behind the scenes everything other than final execution is already done. The AST is annotated with the object code required for execution.
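A small sketch of the idiom (illustrative names; assumes the CUDA backend's run1): partially apply run1 once, outside any loop, and reuse the resulting function so the set-up work is paid only on the first call.
import Data.Array.Accelerate.CUDA (run1)
double :: Vector Float -> Vector Float
double = run1 (A.map (*2))        -- compilation and code generation are shared across calls
-- e.g. mapM_ (print . double) [xs0, xs1, xs2]   -- xs0, xs1, xs2 are hypothetical inputs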

94 Executing Computations canny :: Acc (Array DIM2 RGBA) -> Acc (Array DIM2 Float) canny =... No magic: edges :: Array DIM2 RGBA -> Array DIM2 Float edges img = run ( canny (use img) ) Magic: edges :: Array DIM2 RGBA -> Array DIM2 Float edges img = run1 canny img

95 Executing Computations canny :: Acc (Array DIM2 RGBA) - > Acc (Array DIM2 Float) canny =... Criterion benchmarks benchmarking canny/run collecting 100 samples, 1 iterations each, in estimated s mean: ms, lb ms, ub ms, ci std dev: ms, lb ms, ub ms, ci variance introduced by outliers: % variance is severely inflated by outliers benchmarking canny/run1 mean: ms, lb ms, ub ms, ci std dev: ms, lb ms, ub ms, ci variance introduced by outliers: % variance is severely inflated by outliers

100 Floyd-Warshall Algorithm Find the shortest paths in a weighted graph shortestPath :: Graph -> Vertex -> Vertex -> Vertex -> Weight shortestPath g i j 0 = weight g i j shortestPath g i j k = min (shortestPath g i j (k-1)) (shortestPath g i k (k-1) + shortestPath g k j (k-1)) - k=0: the path between vertices i and j is the direct case only - Otherwise the shortest path from i to j passes through k, or it does not - Exponential in the number of calls: instead traverse bottom-up to make the results at (k-1) available

101 Floyd-Warshall Algorithm Instead of a sparse graph, we'll use a dense adjacency matrix - 2D array where the element is the distance between the two vertices type Weight = Int32 type Graph = Array DIM2 Weight step :: Acc (Scalar Int) -> Acc Graph -> Acc Graph step k g = generate (shape g) sp where k' = the k sp :: Exp DIM2 -> Exp Weight sp ix = let (Z :. i :. j) = unlift ix in min (g ! (index2 i j)) (g ! (index2 i k') + g ! (index2 k' j)) - Must pass the iteration number k as a scalar array - sp is the core of the algorithm

107 Floyd-Warshall Algorithm shortestPathsAcc :: Int -> Acc Graph -> Acc Graph shortestPathsAcc n g0 = foldl1 (.) steps g0 where steps :: [ Acc Graph -> Acc Graph ] steps = [ step (unit (constant k)) | k <- [0 .. n-1] ] - Construct a list of steps, applying k in sequence 0 .. n-1 - Compose the sequence together Let's try a 1000 x 1000 matrix $ ./fwaccel +RTS -s fwaccel: *** Internal error in package accelerate *** *** Please submit a bug report at (unhandled): CUDA Exception: out of memory 6,627,846,360 bytes allocated in the heap 2,830,827,472 bytes copied during GC 733,091,960 bytes maximum residency (26 sample(s)) 31,575,272 bytes maximum slop 812 MB total memory in use (54 MB lost due to fragmentation)... Total time 7.16s ( 11.42s elapsed)

111 Pipelining To sequence computations we have a special operator (>->) :: (Arrays a, Arrays b, Arrays c) => (Acc a -> Acc b) -> (Acc b -> Acc c) -> Acc a -> Acc c - Operationally, the two computations will not share any subcomputations with each other or the environment - Intermediate arrays from the first can be GC'd when the second begins $ ./fwaccel +RTS -s Array (Z) [ ] 6,211,776,544 bytes allocated in the heap 914,044,096 bytes copied during GC 24,953,768 bytes maximum residency (211 sample(s)) 2,023,496 bytes maximum slop 63 MB total memory in use (0 MB lost due to fragmentation)... Total time 3.86s ( 3.92s elapsed) (>->) can help control memory use & startup time
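The improved numbers above presumably come from composing the Floyd-Warshall steps with (>->) instead of (.); a sketch of that change, reusing step and Graph from the earlier slides:
shortestPathsAcc :: Int -> Acc Graph -> Acc Graph
shortestPathsAcc n g0 = foldl1 (>->) steps g0
  where
    steps :: [ Acc Graph -> Acc Graph ]
    steps =  [ step (unit (constant k)) | k <- [0 .. n-1] ]
-- Each step's intermediate graph can now be released before the next step runs.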

112 Questions?

113 Extra slides

115 Stencils A stencil is a map with access to the neighbourhood around each element - Useful in many scientific & image processing algorithms laplace :: Stencil3x3 Int -> Exp Int laplace ((_,t,_),(l,c,r),(_,b,_)) = t + b + l + r - 4*c [diagram: the 3x3 stencil, centre c with neighbours t, l, r, b] - Boundary conditions specify how to handle out-of-bounds neighbours > let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int > run $ stencil laplace (Constant 0) (use mat) Array (Z :. 3 :. 5) [4,3,2,1,-6,-5,0,0,0,-11,-26,-17,-18,-19,-36]

117 Debugging options -dverbose -ddump-sharing -ddump-simpl-stats, -ddump-simpl-iterations -ddump-cc, -ddebug-cc -ddump-exec -ddump-gc -fflush-cache


More information

Blocks, Grids, and Shared Memory

Blocks, Grids, and Shared Memory Blocks, Grids, and Shared Memory GPU Course, Fall 2012 Last week: ax+b Homework Threads, Blocks, Grids CUDA threads are organized into blocks Threads operate in SIMD(ish) manner -- each executing same

More information

An Introduction to OpenAcc

An Introduction to OpenAcc An Introduction to OpenAcc ECS 158 Final Project Robert Gonzales Matthew Martin Nile Mittow Ryan Rasmuss Spring 2016 1 Introduction: What is OpenAcc? OpenAcc stands for Open Accelerators. Developed by

More information

Scheduling Image Processing Pipelines

Scheduling Image Processing Pipelines Lecture 15: Scheduling Image Processing Pipelines Visual Computing Systems Simple image processing kernel int WIDTH = 1024; int HEIGHT = 1024; float input[width * HEIGHT]; float output[width * HEIGHT];

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

GPU programming CUDA C. GPU programming,ii. COMP528 Multi-Core Programming. Different ways:

GPU programming CUDA C. GPU programming,ii. COMP528 Multi-Core Programming. Different ways: COMP528 Multi-Core Programming GPU programming,ii www.csc.liv.ac.uk/~alexei/comp528 Alexei Lisitsa Dept of computer science University of Liverpool a.lisitsa@.liverpool.ac.uk Different ways: GPU programming

More information

CS61C Machine Structures. Lecture 4 C Pointers and Arrays. 1/25/2006 John Wawrzynek. www-inst.eecs.berkeley.edu/~cs61c/

CS61C Machine Structures. Lecture 4 C Pointers and Arrays. 1/25/2006 John Wawrzynek. www-inst.eecs.berkeley.edu/~cs61c/ CS61C Machine Structures Lecture 4 C Pointers and Arrays 1/25/2006 John Wawrzynek (www.cs.berkeley.edu/~johnw) www-inst.eecs.berkeley.edu/~cs61c/ CS 61C L04 C Pointers (1) Common C Error There is a difference

More information

Module Memory and Data Locality

Module Memory and Data Locality GPU Teaching Kit Accelerated Computing Module 4.5 - Memory and Data Locality Handling Arbitrary Matrix Sizes in Tiled Algorithms Objective To learn to handle arbitrary matrix sizes in tiled matrix multiplication

More information

Georgia Institute of Technology Center for Signal and Image Processing Steve Conover February 2009

Georgia Institute of Technology Center for Signal and Image Processing Steve Conover February 2009 Georgia Institute of Technology Center for Signal and Image Processing Steve Conover February 2009 Introduction CUDA is a tool to turn your graphics card into a small computing cluster. It s not always

More information

VLC : Language Reference Manual

VLC : Language Reference Manual VLC : Language Reference Manual Table Of Contents 1. Introduction 2. Types and Declarations 2a. Primitives 2b. Non-primitives - Strings - Arrays 3. Lexical conventions 3a. Whitespace 3b. Comments 3c. Identifiers

More information

Lecture 8: Summary of Haskell course + Type Level Programming

Lecture 8: Summary of Haskell course + Type Level Programming Lecture 8: Summary of Haskell course + Type Level Programming Søren Haagerup Department of Mathematics and Computer Science University of Southern Denmark, Odense October 31, 2017 Principles from Haskell

More information

CSC Advanced Scientific Computing, Fall Numpy

CSC Advanced Scientific Computing, Fall Numpy CSC 223 - Advanced Scientific Computing, Fall 2017 Numpy Numpy Numpy (Numerical Python) provides an interface, called an array, to operate on dense data buffers. Numpy arrays are at the core of most Python

More information

Blis: Better Language for Image Stuff Project Proposal Programming Languages and Translators, Spring 2017

Blis: Better Language for Image Stuff Project Proposal Programming Languages and Translators, Spring 2017 Blis: Better Language for Image Stuff Project Proposal Programming Languages and Translators, Spring 2017 Abbott, Connor (cwa2112) Pan, Wendy (wp2213) Qinami, Klint (kq2129) Vaccaro, Jason (jhv2111) [System

More information

Index. object lifetimes, and ownership, use after change by an alias errors, use after drop errors, BTreeMap, 309

Index. object lifetimes, and ownership, use after change by an alias errors, use after drop errors, BTreeMap, 309 A Arithmetic operation floating-point arithmetic, 11 12 integer numbers, 9 11 Arrays, 97 copying, 59 60 creation, 48 elements, 48 empty arrays and vectors, 57 58 executable program, 49 expressions, 48

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

CSC 8301 Design & Analysis of Algorithms: Warshall s, Floyd s, and Prim s algorithms

CSC 8301 Design & Analysis of Algorithms: Warshall s, Floyd s, and Prim s algorithms CSC 8301 Design & Analysis of Algorithms: Warshall s, Floyd s, and Prim s algorithms Professor Henry Carter Fall 2016 Recap Space-time tradeoffs allow for faster algorithms at the cost of space complexity

More information

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management Hideyuki Shamoto, Tatsuhiro Chiba, Mikio Takeuchi Tokyo Institute of Technology IBM Research Tokyo Programming for large

More information

Accelerating image registration on GPUs

Accelerating image registration on GPUs Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining

More information

Computational Graphics: Lecture 15 SpMSpM and SpMV, or, who cares about complexity when we have a thousand processors?

Computational Graphics: Lecture 15 SpMSpM and SpMV, or, who cares about complexity when we have a thousand processors? Computational Graphics: Lecture 15 SpMSpM and SpMV, or, who cares about complexity when we have a thousand processors? The CVDLab Team Francesco Furiani Tue, April 3, 2014 ROMA TRE UNIVERSITÀ DEGLI STUDI

More information

CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012

CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 Overview 1. Memory Access Efficiency 2. CUDA Memory Types 3. Reducing Global Memory Traffic 4. Example: Matrix-Matrix

More information

Engine Support System. asyrani.com

Engine Support System. asyrani.com Engine Support System asyrani.com A game engine is a complex piece of software consisting of many interacting subsystems. When the engine first starts up, each subsystem must be configured and initialized

More information

CS 179: GPU Computing. Lecture 2: The Basics

CS 179: GPU Computing. Lecture 2: The Basics CS 179: GPU Computing Lecture 2: The Basics Recap Can use GPU to solve highly parallelizable problems Performance benefits vs. CPU Straightforward extension to C language Disclaimer Goal for Week 1: Fast-paced

More information

Costless Software Abstractions For Parallel Hardware System

Costless Software Abstractions For Parallel Hardware System Costless Software Abstractions For Parallel Hardware System Joel Falcou LRI - INRIA - MetaScale SAS Maison de la Simulation - 04/03/2014 Context In Scientific Computing... there is Scientific Applications

More information

PFAC Library: GPU-Based String Matching Algorithm

PFAC Library: GPU-Based String Matching Algorithm PFAC Library: GPU-Based String Matching Algorithm Cheng-Hung Lin Lung-Sheng Chien Chen-Hsiung Liu Shih-Chieh Chang Wing-Kai Hon National Taiwan Normal University, Taipei, Taiwan National Tsing-Hua University,

More information

Programming for Electrical and Computer Engineers. Pointers and Arrays

Programming for Electrical and Computer Engineers. Pointers and Arrays Programming for Electrical and Computer Engineers Pointers and Arrays Dr. D. J. Jackson Lecture 12-1 Introduction C allows us to perform arithmetic addition and subtraction on pointers to array elements.

More information

1/25/12. Administrative

1/25/12. Administrative Administrative L3: Memory Hierarchy Optimization I, Locality and Data Placement Next assignment due Friday, 5 PM Use handin program on CADE machines handin CS6235 lab1 TA: Preethi Kotari - Email:

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Proposal for parallel sort in base R (and Python/Julia)

Proposal for parallel sort in base R (and Python/Julia) Proposal for parallel sort in base R (and Python/Julia) Directions in Statistical Computing 2 July 2016, Stanford Matt Dowle Initial timings https://github.com/rdatatable/data.table/wiki/installation See

More information

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function Chen-Ting Chang, Yu-Sheng Chen, I-Wei Wu, and Jyh-Jiun Shann Dept. of Computer Science, National Chiao

More information

UNIVERSITY OF CALIFORNIA, SANTA CRUZ BOARD OF STUDIES IN COMPUTER ENGINEERING

UNIVERSITY OF CALIFORNIA, SANTA CRUZ BOARD OF STUDIES IN COMPUTER ENGINEERING UNIVERSITY OF CALIFORNIA, SANTA CRUZ BOARD OF STUDIES IN COMPUTER ENGINEERING CMPE13/L: INTRODUCTION TO PROGRAMMING IN C SPRING 2012 Lab 3 Matrix Math Introduction Reading In this lab you will write a

More information

Center for Computational Science

Center for Computational Science Center for Computational Science Toward GPU-accelerated meshfree fluids simulation using the fast multipole method Lorena A Barba Boston University Department of Mechanical Engineering with: Felipe Cruz,

More information

Processors, Performance, and Profiling

Processors, Performance, and Profiling Processors, Performance, and Profiling Architecture 101: 5-Stage Pipeline Fetch Decode Execute Memory Write-Back Registers PC FP ALU Memory Architecture 101 1. Fetch instruction from memory. 2. Decode

More information

Problems and Solutions to the January 1994 Core Examination

Problems and Solutions to the January 1994 Core Examination Problems and Solutions to the January 1994 Core Examination Programming Languages PL 1. Consider the following program written in an Algol-like language: begin integer i,j; integer array A[0..10]; Procedure

More information

Copyright 2007 Pearson Addison-Wesley. All rights reserved. A. Levitin Introduction to the Design & Analysis of Algorithms, 2 nd ed., Ch.

Copyright 2007 Pearson Addison-Wesley. All rights reserved. A. Levitin Introduction to the Design & Analysis of Algorithms, 2 nd ed., Ch. Iterative Improvement Algorithm design technique for solving optimization problems Start with a feasible solution Repeat the following step until no improvement can be found: change the current feasible

More information

CSE351 Spring 2018, Final Exam June 6, 2018

CSE351 Spring 2018, Final Exam June 6, 2018 CSE351 Spring 2018, Final Exam June 6, 2018 Please do not turn the page until 2:30. Last Name: First Name: Student ID Number: Name of person to your left: Name of person to your right: Signature indicating:

More information

CS 1313 Spring 2000 Lecture Outline

CS 1313 Spring 2000 Lecture Outline 1. What is a Computer? 2. Components of a Computer System Overview of Computing, Part 1 (a) Hardware Components i. Central Processing Unit ii. Main Memory iii. The Bus iv. Loading Data from Main Memory

More information

CPSC 3740 Programming Languages University of Lethbridge. Data Types

CPSC 3740 Programming Languages University of Lethbridge. Data Types Data Types A data type defines a collection of data values and a set of predefined operations on those values Some languages allow user to define additional types Useful for error detection through type

More information

5.12 EXERCISES Exercises 263

5.12 EXERCISES Exercises 263 5.12 Exercises 263 5.12 EXERCISES 5.1. If it s defined, the OPENMP macro is a decimal int. Write a program that prints its value. What is the significance of the value? 5.2. Download omp trap 1.c from

More information

CSE 554 Lecture 6: Fairing and Simplification

CSE 554 Lecture 6: Fairing and Simplification CSE 554 Lecture 6: Fairing and Simplification Fall 2012 CSE554 Fairing and simplification Slide 1 Review Iso-contours in grayscale images and volumes Piece-wise linear representations Polylines (2D) and

More information

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. VLIW, Vector, and Multithreaded Machines

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. VLIW, Vector, and Multithreaded Machines CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture VLIW, Vector, and Multithreaded Machines Assigned 3/24/2019 Problem Set #4 Due 4/5/2019 http://inst.eecs.berkeley.edu/~cs152/sp19

More information

Kampala August, Agner Fog

Kampala August, Agner Fog Advanced microprocessor optimization Kampala August, 2007 Agner Fog www.agner.org Agenda Intel and AMD microprocessors Out Of Order execution Branch prediction Platform, 32 or 64 bits Choice of compiler

More information

OpenACC Fundamentals. Steve Abbott November 15, 2017

OpenACC Fundamentals. Steve Abbott November 15, 2017 OpenACC Fundamentals Steve Abbott , November 15, 2017 AGENDA Data Regions Deep Copy 2 while ( err > tol && iter < iter_max ) { err=0.0; JACOBI ITERATION #pragma acc parallel loop reduction(max:err)

More information

Lecture 9: Group Communication Operations. Shantanu Dutt ECE Dept. UIC

Lecture 9: Group Communication Operations. Shantanu Dutt ECE Dept. UIC Lecture 9: Group Communication Operations Shantanu Dutt ECE Dept. UIC Acknowledgement Adapted from Chapter 4 slides of the text, by A. Grama w/ a few changes, augmentations and corrections Topic Overview

More information

WYSE Academic Challenge Computer Science Test (State) 2015 Solution Set

WYSE Academic Challenge Computer Science Test (State) 2015 Solution Set WYSE Academic Challenge Computer Science Test (State) 2015 Solution Set 1. Correct Answer: E The following shows the contents of the stack (a lifo last in first out structure) after each operation. Instruction

More information

CS377P Programming for Performance GPU Programming - II

CS377P Programming for Performance GPU Programming - II CS377P Programming for Performance GPU Programming - II Sreepathi Pai UTCS November 11, 2015 Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work Outline

More information

ECE 550D Fundamentals of Computer Systems and Engineering. Fall 2017

ECE 550D Fundamentals of Computer Systems and Engineering. Fall 2017 ECE 550D Fundamentals of Computer Systems and Engineering Fall 2017 The Operating System (OS) Prof. John Board Duke University Slides are derived from work by Profs. Tyler Bletsch and Andrew Hilton (Duke)

More information