PyCUDA and PyUblas: Hybrid HPC in Python made easy
PyCUDA and PyUblas: Hybrid HPC in Python made easy. Applied Mathematics, Brown University. March 5, 2009.
Thanks: Jan Hesthaven (Brown), Tim Warburton (Rice), Lucas Wilcox (UT Austin), Akil Narayan (Brown), the PyCUDA contributors, and the Nvidia Corporation.
Outline
1 PyCUDA: What and Why? (Scripting for GPUs; (Very) brief GPU 101)
2 PyCUDA: Overview (Whetting your Appetite; Working with PyCUDA)
3 PyUblas (A Toolset for Building Hybrid Codes)
4 Conclusions
Scripting for GPUs. PyCUDA: Why do scripting for GPUs? GPUs are everything that Python is not:
Highly parallel
Very architecture-sensitive
Built for maximum FP/memory throughput
The two therefore complement each other. The CPU is largely restricted to control tasks (on the order of 1000 per second), and for those, Python is fast enough.
Realize a promise: use Python from first prototype to full-scale production code.
Scripting for GPUs. Python: Speed. The usual answer to the speed question: hybrid ("mixed") code.
Plays to the strengths of each language.
But: introduces (some) complexity.
Observation: GPU code is already hybrid.
Consequence: hybrid code adds no extra complexity.
(Very) brief GPU 101. GPUs: What?
Design target for CPUs: make a single thread very fast; hide latency through large caches; predict and speculate.
Design target for GPUs: throughput matters, single threads do not; hide latency through massive parallelism; let the programmer deal with the raw storage hierarchy.
(Very) brief GPU 101. What is CUDA? CUDA is Nvidia's proprietary compute abstraction. Its main merit: a well-balanced model of GPU computing, abstract enough not to be hardware-specific, yet concrete enough to expose most hardware features. It is a (very) close semantic relative of OpenCL. Python + CUDA = PyCUDA.
(Very) brief GPU 101. GPUs: Execution Model. [Figure: a computational grid of blocks; each block is itself a grid of threads. Image credit: Johan S. Seland, Sintef.]
Multi-tiered parallelism: grid and block.
Only threads within a block can communicate.
Each block is assigned to a physical execution unit.
Grids and blocks replace the outer loops in an algorithm.
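The last point can be made concrete with a pure-Python sketch (no GPU required): the two serial loops below are exactly what the grid and block tiers replace on the device, where every (block, thread) pair runs the kernel body concurrently. The names mirror CUDA's built-in indices, but the emulation itself is only an illustration, not PyCUDA API.

```python
import numpy

def launch_emulated(kernel, grid_dim, block_dim, *args):
    # On the GPU, both of these loops disappear: each (block, thread)
    # pair executes the kernel body in parallel.
    for block_idx in range(grid_dim):          # replaced by the grid
        for thread_idx in range(block_dim):    # replaced by the block
            kernel(block_idx, block_dim, thread_idx, *args)

def double_kernel(block_idx, block_dim, thread_idx, a, out):
    i = block_idx * block_dim + thread_idx     # the usual global index
    out[i] = 2 * a[i]

a = numpy.arange(12, dtype=numpy.float32)
out = numpy.empty_like(a)
launch_emulated(double_kernel, 3, 4, a, out)   # grid of 3 blocks of 4 threads
assert (out == 2 * a).all()
```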
Whetting your Appetite. Whetting your appetite:

import pycuda.driver as cuda
import pycuda.autoinit
import numpy

a = numpy.random.randn(4, 4).astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

[This is examples/demo.py in the PyCUDA distribution.]
Whetting your Appetite. Whetting your appetite, continued. The compute kernel is handed to SourceModule as a string of CUDA C:

mod = cuda.SourceModule("""
__global__ void doublify(float *a)
{
    int idx = threadIdx.x + threadIdx.y * 4;
    a[idx] *= 2;
}
""")

func = mod.get_function("doublify")
func(a_gpu, block=(4, 4, 1))

a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print a_doubled
print a
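To see exactly what the doublify kernel computes, here is a CPU re-enactment of its index arithmetic in plain numpy (not PyCUDA API): the 4x4 block of threads covers the 16 elements of the flattened array, each thread doubling one entry.

```python
import numpy

a = numpy.random.randn(4, 4).astype(numpy.float32)
expected = 2 * a

flat = a.ravel()  # view onto a's memory, like the GPU's linear buffer
for ty in range(4):            # threadIdx.y
    for tx in range(4):        # threadIdx.x
        idx = tx + ty * 4      # threadIdx.x + threadIdx.y * 4
        flat[idx] *= 2         # a[idx] *= 2, one element per thread

assert numpy.allclose(a, expected)
```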
Whetting your Appetite. Whetting your appetite, Part II. Did somebody say "abstraction is good"?
Whetting your Appetite. Whetting your appetite, Part II:

import numpy
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

a_gpu = gpuarray.to_gpu(
    numpy.random.randn(4, 4).astype(numpy.float32))
a_doubled = (2 * a_gpu).get()
print a_doubled
print a_gpu
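The point of the abstraction is that, apart from the to_gpu()/.get() boundary calls, the gpuarray version reads exactly like numpy. The pure-numpy equivalent (runnable without a GPU) is:

```python
import numpy

# Same computation as the gpuarray snippet, minus the device transfer:
# the arithmetic expression is identical on both sides.
a = numpy.random.randn(4, 4).astype(numpy.float32)
a_doubled = 2 * a

assert a_doubled.dtype == numpy.float32
assert numpy.allclose(a_doubled, a + a)
```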
Working with PyCUDA. PyCUDA Philosophy:
Provide complete access.
Automatically manage resources.
Provide abstractions.
Allow interactive use.
Check for and report errors automatically.
Integrate tightly with numpy.
Working with PyCUDA. PyCUDA: Completeness. PyCUDA exposes all of CUDA. For example: arrays and textures, page-locked host memory, memory transfers (asynchronous, structured), streams and events, device queries, (GL interop). PyCUDA also supports every OS that CUDA supports: Linux, Windows, OS X.
Working with PyCUDA. PyCUDA: Documentation.
Working with PyCUDA. PyCUDA: Workflow. The edit-run cycle works as follows: you edit your script and run it; a SourceModule("...") call hands the kernel source to PyCUDA; PyCUDA checks its compiler cache; on a miss it invokes nvcc, which produces a .cubin binary that is then cached; the binary is uploaded to the GPU and run there.
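The cache step above can be sketched in a few lines of Python. This is a hypothetical model, not PyCUDA's internals: fake_nvcc, source_module, and the md5-keyed dictionary are stand-ins chosen to show why an unchanged kernel source skips the compile step on the second run.

```python
import hashlib

_cache = {}
compile_count = [0]

def fake_nvcc(source):
    # Stand-in for invoking nvcc; returns a pretend .cubin binary.
    compile_count[0] += 1
    return b"CUBIN:" + source.encode()

def source_module(source):
    key = hashlib.md5(source.encode()).hexdigest()
    if key not in _cache:            # Cache? -> miss: compile with nvcc
        _cache[key] = fake_nvcc(source)
    return _cache[key]               # Cache! -> reuse the stored binary

src = "__global__ void doublify(float *a) { /* ... */ }"
bin1 = source_module(src)            # compiles
bin2 = source_module(src)            # cache hit, no second compile
assert bin1 == bin2
assert compile_count[0] == 1
```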
Working with PyCUDA. Automatic Cleanup.
Reachable objects (memory, streams, ...) are never destroyed.
Once unreachable, they are released at an unspecified future time.
Scarce resources (memory) can be explicitly freed (obj.free()).
Cleanup correctly deals with multiple contexts and dependencies.
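A toy illustration of that contract (not PyCUDA's internals; the Buffer class here is invented for the sketch): an object stays valid while reachable, may be released implicitly some time after becoming unreachable, and a scarce resource can be handed back deterministically with an explicit free().

```python
released = []

class Buffer:
    def __init__(self, name):
        self.name, self.freed = name, False
    def free(self):                 # explicit release, like obj.free()
        if not self.freed:
            self.freed = True
            released.append(self.name)
    def __del__(self):              # implicit release, timing unspecified
        self.free()

b = Buffer("a_gpu")
b.free()                            # deterministic: memory is scarce
assert released == ["a_gpu"]

c = Buffer("tmp")
del c                               # CPython happens to free promptly;
                                    # in general, "at an unspecified
                                    # future time" is all that is promised
assert "tmp" in released
```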
Working with PyCUDA. gpuarray: Simple Linear Algebra. pycuda.gpuarray is meant to look and feel just like numpy.
gpuarray.to_gpu(numpy_array)
numpy_array = gpuarray.get()
Not yet supported: indexing, slicing, etc.
Supported: +, -, *, /, fill, sin, exp, rand, take, ...
Mixed types (int32 + float32 = float64).
print gpuarray for debugging.
The memory behind a gpuarray is available as its .gpudata attribute, for use as kernel arguments, textures, etc.
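Because gpuarray mimics numpy, the operations listed above can be previewed in pure numpy (runnable without a GPU); the gpuarray version would read the same, modulo to_gpu()/.get() at the boundaries. The mixed-type rule quoted on the slide is numpy's own promotion rule, checked below.

```python
import numpy

a = numpy.linspace(0, 1, 8).astype(numpy.float32)
b = numpy.full(8, 2.0, dtype=numpy.float32)

c = a + b - a * b / 2          # elementwise +, -, *, /
s = numpy.sin(a) + numpy.exp(b)

# Mixed-type promotion as stated above: int32 + float32 -> float64
i = numpy.arange(8, dtype=numpy.int32)
assert (i + a).dtype == numpy.float64
assert c.shape == s.shape == (8,)
```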
Working with PyCUDA. gpuarray: Elementwise Expressions. Avoiding extra store-fetch cycles for elementwise math:

from pycuda.curandom import rand as curand
a_gpu = curand((50,))
b_gpu = curand((50,))

from pycuda.elementwise import ElementwiseKernel
lin_comb = ElementwiseKernel(
    "float a, float *x, float b, float *y, float *z",
    "z[i] = a*x[i] + b*y[i]")

c_gpu = gpuarray.empty_like(a_gpu)
lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

assert la.norm((c_gpu - (5*a_gpu + 6*b_gpu)).get()) < 1e-5
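A CPU reference for that kernel (pure numpy, no GPU needed) makes the fusion explicit: the kernel body z[i] = a*x[i] + b*y[i] runs once per element in a single pass, instead of the separate scale-scale-add passes the unfused expression would take. The loop below plays the role of one GPU thread per index i.

```python
import numpy
from numpy import linalg as la

def lin_comb_cpu(a, x, b, y):
    # Same fused single pass as the ElementwiseKernel above.
    z = numpy.empty_like(x)
    for i in range(len(x)):        # one GPU thread per i
        z[i] = a * x[i] + b * y[i]
    return z

x = numpy.random.rand(50).astype(numpy.float32)
y = numpy.random.rand(50).astype(numpy.float32)
z = lin_comb_cpu(5, x, 6, y)
assert la.norm(z - (5 * x + 6 * y)) < 1e-5
```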
Working with PyCUDA. PyCUDA: Vital Information. Available at software/pycuda. X Consortium License (no warranty, free for all use). Requires: numpy, Boost C++, Python. Support via mailing list.
Working with PyCUDA. PyCUDA: Metaprogramming. In PyCUDA, GPU code does not need to be a compile-time constant (unlike with the native C interface).
Working with PyCUDA. Shameless Plug. How would you use this stuff to write a Discontinuous Galerkin solver? How well does that work? MS134: Large-Scale Computing with High-Order Mth. Tomorrow (Friday), 10:00-10:25, Concerto A.
A Toolset for Building Hybrid Codes. Making C++ and Python talk to each other. Making C++ and Python a workable combination: Boost.Python bridges the two languages; numpy provides arrays on the Python side and Boost.Ublas on the C++ side; PyUblas ties numpy and Boost.Ublas together across that bridge.
A Toolset for Building Hybrid Codes. PyUblas: Features. PyUblas exploits the pluggable storage backends in Ublas:
numpy_vector<T>, numpy_matrix<T>: normal citizens of the Ublas universe (they take part in expression templates etc.).
Type-safe, zero-copy language transition.
OK anywhere: argument, return value, data member.
Strided storage: numpy_strided_vector<T>.
Not OK: non-contiguous storage.
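"Zero-copy" means both languages operate on the same underlying buffer; no data is duplicated at the boundary. The idea can be illustrated entirely within numpy (no C++ involved here; this is a conceptual sketch, not PyUblas itself): a view shares the original storage, so a write on one side is immediately visible on the other.

```python
import numpy

a = numpy.zeros(5)
view = a[:]                     # shares a's storage, like numpy_vector
                                # wrapping the same buffer on the C++ side
view[0] = 3.0                   # "C++ side" writes...
assert a[0] == 3.0              # ...the "Python side" sees it immediately
assert view.base is a           # no copy was ever made
```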
A Toolset for Building Hybrid Codes. Example. C++:

#include <pyublas/numpy.hpp>

pyublas::numpy_vector<double> triple(pyublas::numpy_vector<double> x)
{
    return 3 * x;
}

BOOST_PYTHON_MODULE(sample_ext)
{
    boost::python::def("triple", triple);
}

Python:

import numpy
import sample_ext
import pyublas

vec = numpy.ones((5,), dtype=float)
print sample_ext.triple(vec)
Conclusions. GPUs and Python: a good combination.
Been considering playing with GPUs? PyCUDA makes that easier.
It allows full access to the low-level bits, if needed.
It provides many nice abstractions to make life easier.
It enables just-in-time GPU code generation. (See tomorrow's talk.)
Questions? Thank you for your attention! MS134: Tomorrow (Friday), 10:00-10:25, Concerto A.
Image Credits. C870 GPU: Nvidia Corp. Snail: flickr.com/hadi fooladi. GT200 die shot: Nvidia Corp. Old Books: flickr.com/ppdigital. Adding Machine: flickr.com/thomashawk. Floppy disk: flickr.com/ethanhein.
More informationGPUfs: Integrating a file system with GPUs
GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications OS CPU
More informationG P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G
Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty
More informationParallel Programming and Debugging with CUDA C. Geoff Gerfin Sr. System Software Engineer
Parallel Programming and Debugging with CUDA C Geoff Gerfin Sr. System Software Engineer CUDA - NVIDIA s Architecture for GPU Computing Broad Adoption Over 250M installed CUDA-enabled GPUs GPU Computing
More informationDense Linear Algebra. HPC - Algorithms and Applications
Dense Linear Algebra HPC - Algorithms and Applications Alexander Pöppl Technical University of Munich Chair of Scientific Computing November 6 th 2017 Last Tutorial CUDA Architecture thread hierarchy:
More informationInformation Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY
Introduction to CUDA Ingemar Ragnemalm Information Coding, ISY This lecture: Programming model and language Introduction to memory spaces and memory access Shared memory Matrix multiplication example Lecture
More informationiennacl GPU-accelerated Linear Algebra at the Convenience of the C++ Boost Libraries Karl Rupp
GPU-accelerated Linear Algebra at the Convenience of the C++ Boost Libraries Karl Rupp Mathematics and Computer Science Division Argonne National Laboratory based on previous work at Technische Universität
More informationHigh Performance and GPU Computing in MATLAB
High Performance and GPU Computing in MATLAB Jan Houška houska@humusoft.cz http://www.humusoft.cz 1 About HUMUSOFT Company: Humusoft s.r.o. Founded: 1990 Number of employees: 18 Location: Praha 8, Pobřežní
More informationDIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka
USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods
More informationSupporting Data Parallelism in Matcloud: Final Report
Supporting Data Parallelism in Matcloud: Final Report Yongpeng Zhang, Xing Wu 1 Overview Matcloud is an on-line service to run Matlab-like script on client s web browser. Internally it is accelerated by
More informationGPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten
GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,
More informationCUDA Lecture 2. Manfred Liebmann. Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17
CUDA Lecture 2 Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17 manfred.liebmann@tum.de December 15, 2015 CUDA Programming Fundamentals CUDA
More informationGPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum
GPU & High Performance Computing (by NVIDIA) CUDA Compute Unified Device Architecture 29.02.2008 Florian Schornbaum GPU Computing Performance In the last few years the GPU has evolved into an absolute
More informationIntroduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator
Introduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator What is CUDA? Programming language? Compiler? Classic car? Beer? Coffee? CUDA Parallel Computing Platform www.nvidia.com/getcuda Programming
More informationCUDA programming model. N. Cardoso & P. Bicudo. Física Computacional (FC5)
CUDA programming model N. Cardoso & P. Bicudo Física Computacional (FC5) N. Cardoso & P. Bicudo CUDA programming model 1/23 Outline 1 CUDA qualifiers 2 CUDA Kernel Thread hierarchy Kernel, configuration
More informationECE 574 Cluster Computing Lecture 17
ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 March 2019 HW#8 (CUDA) posted. Project topics due. Announcements 1 CUDA installing On Linux
More informationAdvanced CUDA Optimizations. Umar Arshad ArrayFire
Advanced CUDA Optimizations Umar Arshad (@arshad_umar) ArrayFire (@arrayfire) ArrayFire World s leading GPU experts In the industry since 2007 NVIDIA Partner Deep experience working with thousands of customers
More informationNVIDIA Fermi Architecture
Administrivia NVIDIA Fermi Architecture Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Assignment 4 grades returned Project checkpoint on Monday Post an update on your blog beforehand Poster
More informationCUDA GPGPU Workshop CUDA/GPGPU Arch&Prog
CUDA GPGPU Workshop 2012 CUDA/GPGPU Arch&Prog Yip Wichita State University 7/11/2012 GPU-Hardware perspective GPU as PCI device Original PCI PCIe Inside GPU architecture GPU as PCI device Traditional PC
More informationMulti-Processors and GPU
Multi-Processors and GPU Philipp Koehn 7 December 2016 Predicted CPU Clock Speed 1 Clock speed 1971: 740 khz, 2016: 28.7 GHz Source: Horowitz "The Singularity is Near" (2005) Actual CPU Clock Speed 2 Clock
More informationCS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming. Lecturer: Alan Christopher
CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming Lecturer: Alan Christopher Overview GP-GPU: What and why OpenCL, CUDA, and programming GPUs GPU Performance
More informationKSTAR tokamak. /
KSTAR tokamak / spinhalf@nfri.re.kr !!! Data parallelism CUDA programming python! pycuda GPU Development tools Python 2.6+ Scientific libraries as python package for interactive computing (Numpy, Scipy..)
More informationMotivation Hardware Overview Programming model. GPU computing. Part 1: General introduction. Ch. Hoelbling. Wuppertal University
Part 1: General introduction Ch. Hoelbling Wuppertal University Lattice Practices 2011 Outline 1 Motivation 2 Hardware Overview History Present Capabilities 3 Programming model Past: OpenGL Present: CUDA
More informationGPU Programming for Mathematical and Scientific Computing
GPU Programming for Mathematical and Scientific Computing Ethan Kerzner and Timothy Urness Department of Mathematics and Computer Science Drake University Des Moines, IA 50311 ethan.kerzner@gmail.com timothy.urness@drake.edu
More informationLecture 11: GPU programming
Lecture 11: GPU programming David Bindel 4 Oct 2011 Logistics Matrix multiply results are ready Summary on assignments page My version (and writeup) on CMS HW 2 due Thursday Still working on project 2!
More informationJosef Pelikán, Jan Horáček CGG MFF UK Praha
GPGPU and CUDA 2012-2018 Josef Pelikán, Jan Horáček CGG MFF UK Praha pepca@cgg.mff.cuni.cz http://cgg.mff.cuni.cz/~pepca/ 1 / 41 Content advances in hardware multi-core vs. many-core general computing
More informationIntroduction to CUDA (1 of n*)
Administrivia Introduction to CUDA (1 of n*) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Paper presentation due Wednesday, 02/23 Topics first come, first serve Assignment 4 handed today
More informationIntroduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model
Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel
More informationGPU Computing: A VFX Plugin Developer's Perspective
.. GPU Computing: A VFX Plugin Developer's Perspective Stephen Bash, GenArts Inc. GPU Technology Conference, March 19, 2015 GenArts Sapphire Plugins Sapphire launched in 1996 for Flame on IRIX, now works
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationMemory. Lecture 2: different memory and variable types. Memory Hierarchy. CPU Memory Hierarchy. Main memory
Memory Lecture 2: different memory and variable types Prof. Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre Key challenge in modern computer architecture
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationTechnische Universität München. GPU Programming. Rüdiger Westermann Chair for Computer Graphics & Visualization. Faculty of Informatics
GPU Programming Rüdiger Westermann Chair for Computer Graphics & Visualization Faculty of Informatics Overview Programming interfaces and support libraries The CUDA programming abstraction An in-depth
More informationGPU Programming Using CUDA
GPU Programming Using CUDA Michael J. Schnieders Depts. of Biomedical Engineering & Biochemistry The University of Iowa & Gregory G. Howes Department of Physics and Astronomy The University of Iowa Iowa
More informationECE 8823: GPU Architectures. Objectives
ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading
More information