NumbaPro CUDA Python. Square matrix multiplication

Avice McLaughlin
6 years ago
1 NumbaPro Enables parallel programming in Python. Supports various entry points: a low-level (CUDA-C like) programming language, a high-level array-oriented interface, and CUDA library bindings. Also supports multicore CPUs, with more hardware architectures planned for the future.

2 NumbaPro CUDA Python Square matrix multiplication

3 NumbaPro CUDA Python Determine thread Identity

4 NumbaPro CUDA Python Map threads to matrix coordinate

5 NumbaPro CUDA Python Thread inside matrix?

6 NumbaPro CUDA Python Compute one element. Launch NxN threads for NxN matrix
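The steps on slides 3 to 6 (determine thread identity, map it to a matrix coordinate, check that the thread is inside the matrix, compute one element) can be sketched on the CPU. This is a pure-Python emulation, not NumbaPro code: the names `matmul_thread` and `launch` are hypothetical, and the nested loops merely stand in for threads that a GPU would run in parallel.

```python
# CPU-only emulation of the thread-to-element mapping; not NumbaPro code.

def matmul_thread(tx, ty, bx, by, bw, bh, A, B, C, n):
    """Work done by one emulated thread: compute at most one element of C."""
    # Map the thread identity (thread index + block index * block size)
    # to a matrix coordinate.
    x = tx + bx * bw  # column index
    y = ty + by * bh  # row index
    # Threads that fall outside the n x n matrix do nothing.
    if x >= n or y >= n:
        return
    acc = 0.0
    for k in range(n):
        acc += A[y][k] * B[k][x]
    C[y][x] = acc


def launch(grid, block, A, B, C, n):
    """Run every (block, thread) pair sequentially, standing in for a launch."""
    gw, gh = grid
    bw, bh = block
    for by in range(gh):
        for bx in range(gw):
            for ty in range(bh):
                for tx in range(bw):
                    matmul_thread(tx, ty, bx, by, bw, bh, A, B, C, n)


n = 3
A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
B = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity matrix
C = [[0.0] * n for _ in range(n)]

# A 2 x 2 grid of 2 x 2 blocks gives 4 x 4 threads, enough to cover 3 x 3.
launch(grid=(2, 2), block=(2, 2), A=A, B=B, C=C, n=n)
```

The bounds check matters here because 4 x 4 threads are launched for a 3 x 3 matrix; the extra threads return without touching `C`.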

7 Launch CUDA Kernel Launch (100 x 32)^2 = 3200^2 threads for 3200 x 3200 matrix
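The arithmetic behind this launch configuration can be checked with a few lines of plain Python. The helper name `grid_size` is an assumption made for illustration; it rounds up so that the grid still covers the matrix when the size is not a multiple of the block size.

```python
def grid_size(n, threads_per_block):
    """Blocks needed per dimension so that blocks * threads_per_block >= n."""
    return (n + threads_per_block - 1) // threads_per_block


n = 3200                  # matrix is n x n
threads_per_block = 32    # threads per block, per dimension

blocks = grid_size(n, threads_per_block)      # blocks per dimension
threads_per_dim = blocks * threads_per_block  # threads per dimension
total_threads = threads_per_dim ** 2          # one thread per matrix element
```

With these values there are 100 blocks of 32 threads in each dimension, which matches the (100 x 32)^2 = 3200^2 figure on the slide.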

8 Equivalent CUDA-C

9 Higher-level Entry Points So far, the API is quite low-level. We will go through some higher-level entry points in the lessons.

10 Lesson 1 SAXPY with Vectorize

11 @vectorize Creates an elementwise operation from a scalar function. Produces a NumPy universal function (ufunc); numpy.add is an example of a ufunc. Eliminates most CUDA-specific details: griddim and blockdim are computed for you.

12 The Scalar Function Core All arguments are scalar Returns a scalar value as the output

13 Writing a SAXPY function SAXPY computes a * X + Y, where a is a scalar and X and Y are vectors of equal length.
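To make the split between the scalar core and the elementwise application concrete, here is a minimal pure-Python sketch. It does not use NumbaPro: `apply_elementwise` is a hypothetical stand-in for what a vectorized function does when called on arrays, and plain lists take the place of NumPy arrays.

```python
def saxpy(a, x, y):
    """Scalar core: all arguments are scalars, and a scalar is returned."""
    return a * x + y


def apply_elementwise(func, a, xs, ys):
    """Apply the scalar core to each pair of elements (hypothetical helper)."""
    if len(xs) != len(ys):
        raise ValueError("X and Y must have equal length")
    return [func(a, x, y) for x, y in zip(xs, ys)]


X = [1.0, 2.0, 3.0]
Y = [10.0, 20.0, 30.0]
result = apply_elementwise(saxpy, 2.0, X, Y)
```

The scalar function knows nothing about arrays or threads; that separation is what lets a decorator generate code for different targets from the same core.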

14 @vectorize List of function type signatures

15 @vectorize Code generation target: cpu, parallel, gpu

16 @vectorize A scalar function Args: a, x, y are float32 Returns a float32

17 Calling a vectorize function Use it as a regular NumPy ufunc. It applies to regular NumPy arrays, transfers data host->device and device->host automatically, and calculates griddim and blockdim automatically.

18 SAXPY in CUDA Python

19 Memory transfer Explicit memory transfer is optional. Host->Device: device_array = cuda.to_device(host_array). Device allocation: device_array = cuda.device_array_like(device_or_host_array); note that this behaves like numpy.empty_like. Device->Host: host_array = device_array.copy_to_host().

20 Controlling Memory Transfer host -> device device -> host

21 Controlling Memory Transfer device -> host

22 Controlling Memory Transfer

23 Why manual transfer? As an optimization: it gives control over device memory usage and allows device memory to be reused.

24 Lesson 2 cufft convolution

25 FFT Convolution Image filter using FFT convolution with cuFFT. convolved = IFFT(FFT(image) * FFT(response))
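The formula above can be checked on a small 1-D signal without a GPU. This sketch makes several assumptions: it swaps in a naive O(n^2) DFT written with `cmath` in place of cuFFT, it uses a 1-D list instead of an image, and all function names are hypothetical. Note that multiplying in the frequency domain yields a circular convolution.

```python
import cmath


def dft(values, inverse=False):
    """Naive discrete Fourier transform (stand-in for an FFT library call)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = 0j
        for t, v in enumerate(values):
            total += v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        # The inverse transform carries the 1/n normalization.
        out.append(total / n if inverse else total)
    return out


def fft_convolve(image, response):
    """convolved = IFFT(FFT(image) * FFT(response)), elementwise product."""
    f_image = dft(image)
    f_response = dft(response)
    product = [a * b for a, b in zip(f_image, f_response)]
    return [v.real for v in dft(product, inverse=True)]


def direct_circular_convolve(image, response):
    """Reference result computed directly from the definition."""
    n = len(image)
    return [
        sum(image[j] * response[(i - j) % n] for j in range(n))
        for i in range(n)
    ]


image = [1.0, 2.0, 3.0, 4.0]
response = [0.5, 0.5, 0.0, 0.0]  # two-tap averaging filter
convolved = fft_convolve(image, response)
reference = direct_circular_convolve(image, response)
```

Up to floating-point rounding, `convolved` and `reference` agree, which is the property the FFT-based image filter relies on.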

26 cufft API The cufft object (`cufft` in the code) has: Forward FFT: cufft.fft(in_array, out_array) and cufft.fft_inplace(inout_array). Inverse FFT: cufft.ifft(in_array, out_array) and cufft.ifft_inplace(inout_array).

27 Doing an In-place Convolution Forward FFT of the image and response arrays. Elementwise multiply the image and response arrays in the frequency domain. Inverse FFT the product.

28 Doing an In-place Convolution Elementwise multiply the image and response arrays in the frequency domain. Inverse FFT the product.

29 Doing an In-place Convolution Inverse FFT the product.

30 Doing an In-place Convolution

31 Lesson 3 JIT Linking

32 CUDA JIT Linking Use CUDA-C code inside NumbaPro. Compile the CUDA-C code into relocatable device code. NumbaPro uses the CUDA JIT linker to combine its generated code with a precompiled library.

33 Use of JIT Linking Connect to missing features (NumbaPro is still young). Connect to CUDA-C only features. Reuse existing CUDA-C code.

34 NumbaPro Python code

35 NumbaPro Python code Declare external device function in Python

36 NumbaPro Python code Precompiled object file

37 NumbaPro Python code Add library dependencies to the CUDA kernel

38 NumbaPro Python code Use external function

39 CUDA-C code

40 CUDA-C code NumbaPro expects return value to be passed as the first argument

41 CUDA-C code Actual arguments follow

42 CUDA-C code Return value indicates status. Return 0 for success. Other return codes are possible to indicate builtin errors.
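The calling convention described on slides 40 to 42 (result slot first, actual arguments next, status code as the return value) can be mimicked in plain Python. This is only an emulation under stated assumptions: the function name `bar`, the constant `STATUS_OK`, and the one-element list standing in for a C pointer are all hypothetical and are not taken from the slides.

```python
STATUS_OK = 0  # hypothetical name; 0 signals success per the slide


def bar(out, a, b):
    """Emulates a device function following the described convention."""
    # First argument: the slot where the result is written
    # (a one-element list stands in for a pointer here).
    out[0] = a + b
    # The return value is a status code, not the computed result.
    return STATUS_OK


result_slot = [0]
status = bar(result_slot, 2, 3)  # actual arguments follow the result slot
```

The caller reads the computed value from `result_slot[0]` and inspects `status` separately to detect errors.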

43 How to compile nvcc -arch=sm_20 -dc yourcode.cu Only compute capability (CC) 2.0 or above is supported. The -dc flag triggers compilation to relocatable device code.

44 Example

45 Q & A

46 Thank You NumbaPro is Part of Anaconda Accelerate. Visit continuum.io
