Scripting CUDA (using python, R and MATLAB)


1 Scripting CUDA (using python, R and MATLAB) Ferdinand Jamitzky jamitzky@lrz.de

2 Why parallel programming?
End of the free lunch: Moore's law no longer means faster processors, only more of them.
But beware! 2 x 3 GHz < 6 GHz (cache consistency, multi-threading, etc.)


3 The future is parallel
Moore's law is still valid: the number of transistors doubles every 2 years.
Clock speed saturates at 3 to 4 GHz.
  multi-core processors vs many-core processors
  grid/cloud computing
  clusters
  GPGPUs
(intel 2005)

4 Supercomputer scaling

5 Supercomputer: SMP
SMP machine:
  shared memory
  typically 10s of cores
  threaded programs
  bus interconnect
  in R: library(multicore) and inlined code
Example: gvs1
  128 GB RAM, 16 cores
Example: uv2/3
  [...] GB RAM, [...] cores

6 Supercomputer: MPI
Cluster of machines:
  distributed memory
  typically 100s of cores
  message passing interface
  infiniband interconnect
  in R: library(Rmpi) and inlined code
Example: linux MPP cluster
  2752 GB RAM, 2752 cores
Example: supermuc
  340,000 GB RAM, 155,656 Intel cores

7 Supercomputer: GPGPU
Graphics card:
  shared memory
  typically 1000s of cores
  CUDA or OpenCL
  on-chip interconnect
  in R: library(gputools) and inlined code
Example: Tesla K20X
  6 GB RAM, 2688 threads
Example: Titan ORNL
  [...] GB RAM, 18,688 GPU cards, 50,233,344 threads

8 The future is massively parallel
Connection Machine CM-1 (1983)
  12-D hypercube
  [...] bit cores (AND, OR, NOT)
  Rmax: 20 GFLOP/s

9 The future is massively parallel
JUGENE Blue Gene/P (2007)
  3-D torus or tree
  [...] bit cores (PowerPC 450)
  Rmax: 222 TFLOP/s
  now: 1 PFLOP/s, [...] cores

10 Levels of Parallelism
Node level (e.g. SuperMUC has approx. [...] nodes)
  each node has 2 sockets
Socket level
  each socket contains 8 cores
Core level
  each core has 16 vector registers
Vector level (e.g. lxgp1 GPGPU has 480 vector registers)
Pipeline level (how many simultaneous pipelines)
  hyperthreading
Instruction level (instructions per cycle)
  out-of-order execution, branch prediction

11 Problems: Access Times
Getting data from:         Getting some food from:
CPU register   1 ns        fridge                10 s
L2 cache       10 ns       microwave             100 s   ~ 2 min
memory         80 ns       pizza service         800 s   ~ 15 min
network (ib)   200 ns      city mall             2000 s  ~ 0.5 h
GPU (PCIe)     [...] ns    mum sends cake        [...] s ~ 1 week
hard disk      [...] ns    grown in own garden   5 Ms    ~ 2 months
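The analogy rescales every latency by one constant factor: 1 ns of machine time is drawn as 10 s of human time. A minimal sketch of that conversion, using only the four rows whose values are legible on the slide; the names `NS_TO_HUMAN_SECONDS` and `human_seconds` are illustrative and not from the slides.

```python
# Scale factor implied by the first row of the slide: 1 ns corresponds to 10 s.
NS_TO_HUMAN_SECONDS = 10.0

# Latencies in nanoseconds, taken from the legible rows of the slide.
latencies_ns = {
    "CPU register": 1,
    "L2 cache": 10,
    "memory": 80,
    "network (ib)": 200,
}


def human_seconds(latency_ns):
    """Translate a hardware latency in ns to the 'getting food' scale in seconds."""
    return latency_ns * NS_TO_HUMAN_SECONDS


for name, ns in latencies_ns.items():
    print(f"{name:>13}: {ns:>4} ns -> {human_seconds(ns):>6.0f} s")
```

The same factor applies to the remaining rows once their nanosecond values are known.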

12 Amdahl's law
Computing time for N processors:
  T(N) = T(1)/N + Tserial + Tcomm * N
Acceleration factor:
  T(1)/T(N) = N / (1 + Tserial/T(1)*N + Tcomm/T(1)*N^2)
small N: T(1)/T(N) ~ N
large N: T(1)/T(N) ~ 1/N
saturation point!
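The saturation point follows from setting dT/dN = 0 in the computing-time formula, which gives N = sqrt(T(1)/Tcomm). A small sketch with T(1) normalised to 1; the function names are illustrative, and the default values 0.01 and 0.001 are the ones used in the plot on the next slide.

```python
import math


def speedup(n, t_serial=0.01, t_comm=0.001):
    """Acceleration factor T(1)/T(N), with T(1) normalised to 1."""
    return n / (1.0 + t_serial * n + t_comm * n * n)


def saturation_point(t_comm=0.001):
    """N that minimises T(N) = 1/N + t_serial + t_comm*N (from dT/dN = 0)."""
    return math.sqrt(1.0 / t_comm)


n_best = saturation_point()
print("saturation at N =", round(n_best, 1))
for n in (1, 10, 32, 100, 1000):
    print(n, round(speedup(n), 2))
```

Beyond the saturation point the communication term dominates and adding processors makes the run slower, which is the 1/N regime on the slide.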

13 Amdahl's law III
Parameters: Tserial=0.01, Tcomm=0.001
    > plot(n,type="l")
    > lines(n/(1+0.01*n),col="red")
    > lines(n/(1+0.01*n+0.001*n**2),col="green")

14 How are High-Performance Codes constructed?
Traditional construction of high-performance codes:
  C/C++/Fortran
  Libraries
Alternative construction of high-performance codes:
  Scripting for brains
  GPUs for inner loops
Play to the strengths of each programming environment.

15 Hierarchical architecture of hardware vs software
Hardware                                     Software
accelerators (GPUs, Xeon Phi)                CUDA
in-core vectorisation (AVX)                  intrinsics, vectorisation pragmas
multicore nodes (QPI, PCI bus)               OpenMP
strongly coupled nodes (infiniband, 10GE)    MPI
weakly coupled clusters (cloud)              workflow middleware

16 Why Scripting?
Do you:
  want to reuse CUDA code easily (e.g. as a library)?
  want to dynamically determine whether CUDA is available?
  want to use multi-threading (painlessly)?
  want to use MPI (painlessly)?
  want to use loose coupling (grid computing)?
  want dynamic exception handling and fallbacks?
  want dynamic compilation of CUDA code?
If you answered "yes" to one of these questions, you should consider a scripting language.
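Two of these points, detecting CUDA at run time and falling back when it is missing, can be sketched in a few lines of Python. This is a minimal illustration under assumptions: `detect_backend` and `doubled` are hypothetical helper names, and the GPU path uses only `pycuda.autoinit`, `gpuarray.to_gpu` and `.get()`.

```python
def detect_backend():
    """Return "gpu" if PyCUDA can initialise a device, otherwise "cpu"."""
    try:
        import pycuda.autoinit  # noqa: F401  (creates a CUDA context on import)
        import pycuda.gpuarray  # noqa: F401
        return "gpu"
    except Exception:
        # PyCUDA not installed, no driver, or no device: use the CPU path.
        return "cpu"


def doubled(values, backend=None):
    """Multiply every value by 2 on the GPU if available, else on the CPU."""
    backend = backend or detect_backend()
    if backend == "gpu":
        import numpy
        import pycuda.gpuarray as gpuarray
        a_gpu = gpuarray.to_gpu(numpy.asarray(values, dtype=numpy.float32))
        return (2 * a_gpu).get().tolist()
    return [2 * v for v in values]


print(detect_backend(), doubled([1.0, 2.5, 4.0]))
```

The caller never needs to know which path ran, which is the kind of dynamic fallback the slide asks about.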

17 Parallel Tools in python, R and MATLAB
                                    R                           python                     MATLAB
SMP (multicore parallelism)         doMC, doSMP, pnmath, BLAS   multiprocessing, futures   parfor, spmd
                                    (no max cores)                                         (max 8 cores)
MPP (massive parallel processing)   doSNOW, doMPI, doRedis      parallel python, mpi4py    jobs, pmode
GPGPU (CUDA, OpenCL)                rgpu, gputools              pycuda, pyopencl           gpuArray

18 Scripting CUDA
CUDA
  Compiler: PGI Fortran, NumbaPro (python)
  Interpreter: pycuda (python), rgpu (R), MATLAB

19 MATLAB GPU

20 MATLAB GPU
# load matlab module and start command line version
    module load cuda
    module load matlab/r2011a
    matlab -nodesktop

21 MATLAB gpuArray
Copy data to the GPGPU and return a handle on the object.
All operations on the handle are performed on the GPGPU.
    x=rand(100);
    gx=gpuArray(x);
How to compute the GFlop/s (the matrix dimension is np*1000, so 2*np^3 already counts in units of 10^9 operations):
    tic;
    M=gpuArray(rand(np*1000));
    gather(sum(sum(M*M)));
    2*np^3/toc
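The GFlop/s expression on this slide relies on a dense n x n matrix product costing about 2*n^3 floating point operations. The sketch below checks that with n = np*1000 the general formula reduces to 2*np^3 divided by the elapsed time; the value of `np_` and the timing are made-up inputs for illustration, not measurements.

```python
def matmul_flops(n):
    """Operation count of an n x n matrix product: n**3 multiplies plus n**3 adds."""
    return 2 * n ** 3


def gflops(n, seconds):
    """Sustained rate in GFlop/s for one n x n product that took `seconds`."""
    return matmul_flops(n) / seconds / 1e9


np_ = 4        # hypothetical scale factor, playing the role of np on the slide
elapsed = 0.5  # hypothetical timing in seconds, standing in for toc

general = gflops(np_ * 1000, elapsed)
shortcut = 2 * np_ ** 3 / elapsed  # the expression used on the slide
print(general, shortcut)
```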

22 pycuda
Gives you the following advantages:
  1. Combining Two Strong Tools
  2. Scripting CUDA
  3. Run-Time Code Generation
Special thanks to A. Klöckner

23 LRZ
Log in to lxgp1
    $ module load python
    $ module load cuda
    $ module load boost
    $ python
    Python (r261:67515, Apr , 17:25:25) [GCC (SUSE Linux)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>>

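24 Simple Example
    from numpy import *
    import pycuda.autoinit
    import pycuda.gpuarray as gpu
    # copy a random 4x4 single-precision matrix to the GPU
    a_gpu = gpu.to_gpu(random.randn(4,4).astype(float32))
    # the multiplication runs on the GPU; get() copies the result back
    a_doubled = (2 * a_gpu).get()
    print(a_doubled)
    print(a_gpu)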

25 gpuarray class
pycuda.gpuarray: Meant to look and feel just like numpy.
  gpuarray.to_gpu(numpy_array)
  numpy_array = gpuarray.get()
  +, -, *, /, fill, sin, exp, rand, basic indexing, norm, inner product
  Mixed types (int32 + float32 = float64)
  print gpuarray for debugging.
  Allows access to raw bits: use as kernel arguments, textures, etc.

26 gpuarray: Elementwise expressions
Avoiding extra store-fetch cycles for elementwise math:
    import numpy.linalg as la
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    from pycuda.curandom import rand as curand
    from pycuda.elementwise import ElementwiseKernel
    a_gpu = curand((50,))
    b_gpu = curand((50,))
    # argument list and per-element C expression are passed as strings
    lin_comb = ElementwiseKernel(
        "float a, float *x, float b, float *y, float *z",
        "z[i] = a*x[i] + b*y[i]")
    c_gpu = gpuarray.empty_like(a_gpu)
    lin_comb(5, a_gpu, 6, b_gpu, c_gpu)
    assert la.norm((c_gpu - (5*a_gpu+6*b_gpu)).get()) < 1e-5
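The "extra store-fetch cycles" come from evaluating a*x + b*y one operator at a time, which writes and re-reads intermediate arrays. A CPU-only sketch of the difference in plain Python, with hypothetical function names; it imitates the access pattern only and says nothing about actual GPU timings.

```python
def lin_comb_unfused(a, x, b, y):
    """Three passes and two temporaries, like evaluating a*x + b*y operator by operator."""
    ax = [a * xi for xi in x]               # temporary 1: stored, then fetched again
    by = [b * yi for yi in y]               # temporary 2: stored, then fetched again
    return [p + q for p, q in zip(ax, by)]  # third pass combines the temporaries


def lin_comb_fused(a, x, b, y):
    """One pass, no temporaries: the per-index body z[i] = a*x[i] + b*y[i]."""
    return [a * xi + b * yi for xi, yi in zip(x, y)]


x = [0.5, 1.0, 1.5]
y = [2.0, 4.0, 6.0]
print(lin_comb_unfused(5, x, 6, y))
print(lin_comb_fused(5, x, 6, y))
```

Both functions return the same values; the fused form reads each input element once and writes each output element once.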

27 gpuarray: Reduction made easy
Example: A scalar product calculation
    import numpy
    import pycuda.autoinit
    from pycuda.reduction import ReductionKernel
    from pycuda.curandom import rand as curand
    # map_expr is applied per element, reduce_expr combines two partial results
    dot = ReductionKernel(dtype_out=numpy.float32, neutral="0",
                          reduce_expr="a+b", map_expr="x[i]*y[i]",
                          arguments="const float *x, const float *y")
    x = curand(( ), dtype=numpy.float32)
    y = curand(( ), dtype=numpy.float32)
    x_dot_y = dot(x,y).get()
    x_dot_y_cpu = numpy.dot(x.get(), y.get())
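The three ingredients of the reduction (a neutral element, a map expression per index, and a binary reduce expression) have a direct CPU analogue. The following sketch mirrors that structure with `functools.reduce`; `reduction_kernel` is a hypothetical helper, not part of PyCUDA, and it combines values strictly left to right, whereas a GPU reduction is free to combine partial results in another order.

```python
from functools import reduce


def reduction_kernel(neutral, reduce_expr, map_expr):
    """Build a function that maps every index and then folds the mapped values."""
    def run(*arrays):
        mapped = (map_expr(i, *arrays) for i in range(len(arrays[0])))
        return reduce(reduce_expr, mapped, neutral)
    return run


# Scalar product: map is x[i]*y[i], reduce is a+b, neutral element is 0.
dot = reduction_kernel(
    neutral=0.0,
    reduce_expr=lambda a, b: a + b,
    map_expr=lambda i, x, y: x[i] * y[i],
)

print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```

Because the order of combination is not fixed on a GPU, the reduce expression should be associative, as addition is here up to rounding.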

28 CUDA Kernels in pycuda
    import pycuda.autoinit
    import pycuda.driver as drv
    import numpy
    from pycuda.compiler import SourceModule
    # the CUDA C source is compiled at run time
    mod = SourceModule("""
    __global__ void multiply_them(float *dest, float *a, float *b)
    {
      const int i = threadIdx.x;
      dest[i] = a[i] * b[i];
    }
    """)
    multiply_them = mod.get_function("multiply_them")
    a = numpy.random.randn(400).astype(numpy.float32)
    b = numpy.random.randn(400).astype(numpy.float32)
    dest = numpy.zeros_like(a)
    # drv.In/drv.Out copy the arrays to and from the device around the call
    multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b),
        block=(400,1,1))
    print(dest-a*b)

29 Completeness
PyCUDA exposes all of CUDA. For example:
  Arrays and Textures
  Pagelocked host memory
  Memory transfers (asynchronous, structured)
  Streams and Events
  Device queries
  GL Interop
And furthermore:
  Allow interactive use
  Integrate tightly with numpy

30 pycuda showcase
  Agent-based Models
  Computational Visual Neuroscience
  Discontinuous Galerkin Finite Element PDE Solvers
  Estimating the Entropy of Natural Scenes
  Facial Image Database Search
  Filtered Backprojection for Radar Imaging
  LINGO Chemical Similarities
  Recurrence Diagrams
  Sailfish: Lattice Boltzmann Fluid Dynamics
  Selective Embedded Just In Time Specialization
  Simulation of spiking neural networks

31 NumbaPro
Generate CUDA kernels using a just-in-time compiler
    from numbapro import cuda
    @cuda.jit('void(float32[:], float32[:], float32[:])')
    def sum(a, b, result):
        i = cuda.grid(1)  # equals threadIdx.x + blockIdx.x * blockDim.x
        result[i] = a[i] + b[i]
    # Invoke like:
    sum[grid_dim, block_dim](big_input_1, big_input_2, result_array)
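The global index threadIdx.x + blockIdx.x * blockDim.x can be illustrated without a GPU by running the kernel body once per (block, thread) pair. This is a sequential emulation in plain Python under stated assumptions: `launch` and `add_kernel` are hypothetical names, and the bounds check is an addition that the slide's kernel does not have.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Sequentially emulate a 1-D CUDA launch: call the kernel once per thread."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(thread_idx, block_idx, block_dim, *args)


def add_kernel(thread_idx, block_idx, block_dim, a, b, result):
    i = thread_idx + block_idx * block_dim  # the global 1-D thread index
    if i < len(result):                     # guard: the grid may overshoot the array
        result[i] = a[i] + b[i]


n = 10
a = list(range(n))
b = [10 * v for v in a]
result = [None] * n

block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # round up so every element is covered
launch(add_kernel, grid_dim, block_dim, a, b, result)
print(grid_dim, result)
```

With 10 elements and 4 threads per block, 3 blocks are launched and the last 2 threads do nothing; without the guard they would index past the end of the arrays.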

32 The Language R

33 R in a nutshell
    module load cuda/2.3
    module load R/serial/2.13
    > x=1:10
    > y=x**2
    > str(y)
    > print(x)
    > times2 = function(x) 2*x
graphics!
    > plot(x,y)
= and <- are interchangeable

34 rgpu
A set of functions for loading data to a GPU and manipulating the data there:
  exportgpu(x)
  evalgpu(x+y)
  lsgpu()
  rmgpu("x")
  sumgpu(x), meangpu(x), gemmgpu(a,b)
  cos, sin, .., +, -, *, /, **, %*%

35 Example
load the correct R module
    $ module load R/serial/2.13
start R
    $ R
    R version ( )
    Copyright (C) 2011 The R Foundation for Statistical Computing
    ISBN
load rgpu library
    > library(rgpu)
    > help(package="rgpu")
    > rgpudetails()

36 Data on the GPGPU
one million random uniform numbers
    > x=runif(1000000)
send data to gpu
    > exportgpu(x)
do some calculations
    > evalgpu(sumgpu(sin(x)+cos(x)+tan(x)+exp(x)))
do some timing comparisons (GPU vs CPU):
    > system.time(evalgpu(sumgpu(sin(x)+cos(x)+tan(x)+exp(x))))
    > system.time(sum(sin(x)+cos(x)+tan(x)+exp(x)))

37 real world examples: gputools
gputools is a package of precompiled CUDA functions for statistics, linear algebra and machine learning
  chooseGpu(), getGpuId()
  gpuCor, gpuaucestimate
  gpuDist, gpuDistClust, gpuHclust, gpuFastICA
  gpuGlm, gpuLm
  gpuGranger, gpuMi
  gpuMatMult, gpuQr, gpuSvd, gpuSolve
  gpuLsfit
  gpuSvmPredict, gpuSvmTrain
  gpuTtest

38 Example: Matrix Inversion
    np <- [...]
    x <- matrix(runif(np**2), np, np)
    system.time(gpuSolve(x))
    system.time(solve(x))

39 Example: Hierarchical Clustering
    numVectors <- 5
    dimension <- 10
    Vectors <- matrix(runif(numVectors*dimension), numVectors, dimension)
    distMat <- gpuDist(Vectors, "euclidean")
    myClust <- gpuHclust(distMat, "single")
    plot(myClust)
for other examples try: example(hclust)

40 Fortran 90 Example
    program myprog
      ! simulate harmonic oscillator
      integer, parameter :: np=1000, nstep=1000
      real :: x(np), v(np), dx(np), dv(np), dt=0.01
      integer :: i,j
      forall(i=1:np) x(i)=i
      forall(i=1:np) v(i)=i
      do j=1,nstep
        dx=v*dt; dv=-x*dt
        x=x+dx; v=v+dv
      end do
      print*, " total energy: ",sum(x**2+v**2)
    end program

41 PGI Compiler
log in to lxgp1
    $ module load fortran/pgi/11.8
    $ pgf90 -o myprog.exe myprog.f90
    $ time ./myprog.exe
exercise for you:
  compute MFlop/s (floating point operations: 4 * np * nstep)
  optimize (hint: -Minfo, -fast, -O3)
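For the MFlop/s exercise, the count 4 * np * nstep comes from two multiplications and two additions per array element per time step. A plain-Python port of the oscillator loop, written as a sketch for checking the arithmetic; `simulate` is a hypothetical name, and the rate it reports is that of interpreted Python, not of the compiled Fortran program.

```python
import time


def simulate(n_points=1000, n_steps=1000, dt=0.01):
    """Explicit Euler steps for n_points independent oscillators; returns sum(x^2 + v^2)."""
    x = [float(i) for i in range(1, n_points + 1)]
    v = [float(i) for i in range(1, n_points + 1)]
    for _ in range(n_steps):
        dx = [vi * dt for vi in v]                # n_points multiplications
        dv = [-xi * dt for xi in x]               # n_points multiplications
        x = [xi + dxi for xi, dxi in zip(x, dx)]  # n_points additions
        v = [vi + dvi for vi, dvi in zip(v, dv)]  # n_points additions
    return sum(xi * xi + vi * vi for xi, vi in zip(x, v))


n_points, n_steps = 1000, 1000
start = time.perf_counter()
energy = simulate(n_points, n_steps)
elapsed = time.perf_counter() - start

mflops = 4 * n_points * n_steps / elapsed / 1e6
print("total energy:", energy)
print("MFlop/s:", round(mflops, 1))
```

With this update rule the quantity x^2 + v^2 is multiplied by exactly (1 + dt^2) in every step, so the printed "total energy" grows slowly with nstep instead of staying constant.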

42 Fortran 90 Example
    program myprog
      ! simulate harmonic oscillator
      integer, parameter :: np=1000, nstep=1000
      real :: x(np), v(np), dx(np), dv(np), dt=0.01
      integer :: i,j
      forall(i=1:np) x(i)=i
      forall(i=1:np) v(i)=i
      do j=1,nstep
    !$acc region
        dx=v*dt; dv=-x*dt
        x=x+dx; v=v+dv
    !$acc end region
      end do
      print*, " total energy: ",sum(x**2+v**2)
    end program

43 PGI Compiler accelerator
    module load fortran/pgi
    pgf90 -ta=nvidia -o myprog.exe myprog.f90
    time ./myprog.exe
exercise for you:
  compute MFlop/s (floating point operations: 4 * np * nstep)
  optimize (hint: change acc region)

44 Use R as scripting language
R can dynamically load shared objects:
    dyn.load("lib.so")
these functions can then be called via
    .C("fname", args)
    .Fortran("fname", args)

45 R subroutine
    subroutine mysub_cuda(x,v,nstep)
      ! simulate harmonic oscillator
      integer, parameter :: np=[...]
      real*8 :: x(np), v(np), dx(np), dv(np), dt=0.001
      integer :: i,j, nstep
      forall(i=1:np) x(i)=real(i)/np
      forall(i=1:np) v(i)=real(i)/np
      do j=1,nstep
        dx=v*dt; dv=-x*dt
        x=x+dx; v=v+dv
      end do
      return
    end subroutine

46 Compile two versions
don't forget to load the modules!
    module unload ccomp fortran
    module load ccomp/pgi/11.8
    module load fortran/pgi/11.8
    module load R/serial/2.13
    pgf90 -shared -fpic -o mysub_host.so mysub_host.f90
    pgf90 -ta=nvidia -shared -fpic -o mysub_cuda.so mysub_cuda.f90

47 Load and run
Load dynamic libraries
    > dyn.load("mysub_host.so"); dyn.load("mysub_cuda.so"); np=[...]
Benchmark
    > system.time(str(.Fortran("mysub_host",x=numeric(np),v=numeric(np),nstep=as.integer(1000))))
    total energy:
    total energy:
    List of 3
     $ x    : num [1: ] -3.01e e e e e
     $ v    : num [1: ] 1.38e e e e e
     $ nstep: int 1000
    user system elapsed
    > system.time(str(.Fortran("mysub_cuda",x=numeric(np),v=numeric(np),nstep=as.integer(1000))))
    total energy:
    total energy:
    List of 3
     $ x    : num [1: ] -3.01e e e e e
     $ v    : num [1: ] 1.38e e e e e
     $ nstep: int 1000
    user system elapsed
Acceleration factor:
    > 26.9/0.83
    [1] 32.40964

48 Matrix Multipl. in FORTRAN
    subroutine mmult(a,b,c,np)
      integer np
      real*8 a(np,np), b(np,np), c(np,np)
      integer i,j, k
      do k=1, np
        forall(i=1:np, j=1:np) a(i,j) = a(i,j) + b(i,k)*c(k,j)
      end do
      return
    end subroutine
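Note that this routine accumulates into its first argument, a = a + b*c, with k as the outer loop, so for a fixed k every (i, j) update is independent of the others. A plain-Python sketch of the same loop order on lists of lists; the function name is reused for readability only and the test matrices are illustrative.

```python
def mmult(a, b, c):
    """In-place a += b * c (matrix product) for square lists of lists, k outermost."""
    n = len(a)
    for k in range(n):
        # For a fixed k, all (i, j) updates are independent: this is the forall body.
        for i in range(n):
            for j in range(n):
                a[i][j] += b[i][k] * c[k][j]
    return a


n = 3
a = [[0.0] * n for _ in range(n)]  # result accumulator, starts at zero
b = [[1.0] * n for _ in range(n)]
c = [[1.0] * n for _ in range(n)]
print(mmult(a, b, c))  # every entry is 3.0
```

Since the result is accumulated, a must be initialised (for example to zero) before the call.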

49 Call FORTRAN from R
    # compile f90 to shared object library
    system("pgf90 -shared -fpic -o mmult.so mmult.f90");
    # dynamically load library
    dyn.load("mmult.so")
    # define multiplication function
    mmult.f <- function(a,b,c)
      .Fortran("mmult", a=a, b=b, c=c, np=as.integer(dim(a)[1]))

50 Call FORTRAN binary
    np=100
    system.time(
      mmult.f(
        a = matrix(numeric(np*np),np,np),
        b = matrix(numeric(np*np)+1.,np,np),
        c = matrix(numeric(np*np)+1.,np,np)
      )
    )
Exercise: make a plot of system time vs matrix dimension

51 PGI accelerator directives
    subroutine mmult(a,b,c,np)
      integer np
      real*8 a(np,np), b(np,np), c(np,np)
      integer i,j, k
      do k=1, np
    !$acc region
        forall(i=1:np, j=1:np) a(i,j) = a(i,j) + b(i,k)*c(k,j)
    !$acc end region
      end do
      return
    end subroutine

52 Call FORTRAN from R
    # compile f90 to shared object library
    system("pgf90 -ta=nvidia -shared -fpic -o mmult.so mmult.f90");
    # dynamically load library
    dyn.load("mmult.so")
    # define multiplication function
    mmult.f <- function(a,b,c)
      .Fortran("mmult", a=a, b=b, c=c, np=as.integer(dim(a)[1]))

53 Compute MFlop/s
    print(paste(2.*2.*np**3/[...]/system.time( str(mmult.f(...)) )[[3]]," MFlop/s"))
Exercise: Compare MFlop/s vs dimension for serial and accelerated code

54 Scripting Parallel Execution
R
  implicit: rgpu, jit, pnmath, MKL
  explicit: doMC, doMPI, doSNOW, doRedis
hierarchical parallelisation:
  - accelerator: rgpu, pnmath, MKL
  - intra-node: jit, doMC, MKL
  - intra-cluster: SNOW, MPI, pbdMPI
  - inter-cluster: Redis, SNOW

55 foreach package
# old R code
    alist=list()
    for(i in 1:N) alist[i] <- call(i)
  for is a language keyword
# new R foreach
    library(foreach)
    alist <- foreach(i=1:N) %do% call(i)
  foreach is a function

56 multithreading with R
# serial execution
    library(foreach)
    foreach(i=1:N) %do% { mmult.f() }
# thread execution
    library(foreach)
    library(doMC)
    registerDoMC()
    foreach(i=1:N) %dopar% { mmult.f() }
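The same switch from serial to parallel iteration exists in Python's `concurrent.futures`, which is the "futures" entry for python in the tools overview. A minimal sketch, assuming a hypothetical `task` function in place of mmult.f(); it uses a thread pool to stay self-contained, although for CPU-bound pure-Python work a process pool is usually the closer analogue of doMC.

```python
from concurrent.futures import ThreadPoolExecutor


def task(i):
    """Stand-in for the loop body; any function of the loop index works."""
    return sum(k * k for k in range(i * 1000))


inputs = range(1, 11)

# Serial execution, comparable to foreach(...) %do%.
serial = [task(i) for i in inputs]

# Parallel execution on 4 workers, comparable to foreach(...) %dopar%.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(task, inputs))

print(serial == parallel)
```

As with foreach, the results come back as a list in input order, so the loop body does not need to change between the serial and the parallel version.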

57 MPI with R
# serial execution
    library(foreach)
    foreach(i=1:N) %do% { mmult.f() }
# MPI execution
    library(foreach)
    library(doSNOW)
    registerDoSNOW()
    foreach(i=1:N) %dopar% { mmult.f() }

58 doSNOW
    # R
    > library(doSNOW)
    > cl <- makeSOCKcluster(4)
    > registerDoSNOW(cl)
    > system.time(foreach(i=1:10) %do% sum(runif( )))
    user system elapsed
    > system.time(foreach(i=1:10) %dopar% sum(runif( )))
    user system elapsed

59 doMC
    # R
    > library(doMC)
    > registerDoMC(cores=4)
    > system.time(foreach(i=1:10) %do% sum(runif( )))
    user system elapsed
    > system.time(foreach(i=1:10) %dopar% sum(runif( )))
    user system elapsed

60 nosql databases
Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.
Clients are available for C, C++, C#, Objective-C, Clojure, Common Lisp, Erlang, Go, Haskell, Io, Lua, Perl, Python, PHP, R, Ruby, Scala, Smalltalk, Tcl

61 doRedis / workers
start redis worker:
    > echo "require('doRedis');redisWorker('jobs')" | R --no-save
The workers can be distributed over the internet
    > startredisworkers(100)

62 doRedis
    # R
    > library(doRedis)
    > registerDoRedis("jobs")
    > system.time(foreach(i=1:10) %do% sum(runif( )))
    user system elapsed
    > system.time(foreach(i=1:10) %dopar% sum(runif( )))
    user system elapsed

63 MPI-CUDA with R
Using doSNOW and dyn.load with PGI Fortran:
    library(doSNOW)
    cl=makeCluster(c("gvs1","gvs2"),type="SOCK")
    registerDoSNOW(cl)
    foreach(i=1:2) %dopar% setwd("~/kurse/r_cuda")
    foreach(i=1:2) %dopar% dyn.load("mysub_cuda.so")
    system.time(
      foreach(i=1:4) %dopar%
        str(.Fortran("mysub_cuda",x=numeric(np),v=numeric(np),nstep=as.integer(1000))))

64 Big Memory
[Figure: logical setup of a node in four configurations: without shared memory, with shared memory, with file-backed memory, and with network-attached file-backed memory. R processes (R) attach to memory (MEM), disk and network.]

65 library(bigmemory)
shared memory regions for several processes in SMP
file-backed arrays for several nodes over network file systems
    library(bigmemory)
    x <- as.big.matrix(matrix(runif( ), 1000, 1000))
    sum(x[1,1:1000])
