USE OF GPU FOR NUMERICAL COMPUTATIONS

Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka
Faculty of Nuclear Sciences and Physical Engineering
Czech Technical University in Prague

Mini workshop on advanced numerical methods for parallel computers
OVERVIEW
NEED FOR MORE COMPUTATIONAL POWER

- we want to perform more and more complex computations
- they require more powerful CPUs
- we cannot increase the power of a CPU by raising the clock frequency anymore
- we may search for more efficient architectures...
- ... or turn to parallel computing using multicore CPUs
IS MULTICORE ENOUGH?

- however, adding more cores is not so simple because of the shared-memory architecture
- it is very difficult to build a shared-memory parallel system with more than 100 cores
- beyond roughly 100 cores we usually have to switch to distributed systems
- the latest mainframe, z10, supports up to 64 cores
- September 26, 2006 - Intel: 80 cores by 2011
  http://techfreep.com/intel-80-cores-by-2011.htm
DISADVANTAGES OF CPU

- the CPU is designed to process general code
- the main parts of the CPU design are the pipeline and the cache
- the pipeline allows more efficient processing of instructions
  - it needs to predict conditions in the code - speculative execution
  - on average there is one conditional instruction per six instructions
- the cache hides the large latency of common RAM
- both require complicated algorithms
- the majority of transistors is spent on the cache and on speculative execution, not on the actual computation

The CPU is not well designed for numerical computing.
ADVANTAGES OF GPU

GPU = graphics processing unit

- the GPU is designed to run up to 240 threads simultaneously¹ - virtually up to 30 000 threads
- threads must be independent - it is not known in what order they are going to be scheduled
- intensive computing with only a few conditions is assumed
- there is no speculative execution
- there is no cache
- the GPU is optimised for sequential memory access - 112 GB/s

¹ nvidia Tesla
ADVANTAGES OF GPU

FIGURE: Source: nvidia Programming Guide
COMPARISON CPU VS. GPU

For approx. 1000 EUR one can buy:

                 nvidia TESLA S1060   Intel Core i7-975 Extreme Quad-Core
  Transistors    1 400 millions       731 millions
  Clock          1.3 GHz              3.3 GHz
  Threads Num.   240                  8
  Peak Perf.     936 GFlops           50 GFlops
  Bandwidth      102 GB/s             25.6 GB/s
  RAM            4 GB                 48 GB

nvidia predicts 570 times faster GPUs by 2015.
GPU = graphics processing unit

- accelerators for algorithms in 3D graphics and visualisation
- originally aimed at computer games - a psychological disadvantage of the GPU even today
- typical run:
  - transformation of thousands of triangles
  - applying textures
  - projection to the frame buffer
- no data dependency
- assume we have a rectangle and apply a gray-scale texture with 800x600 pixels
- project it one-to-one to the framebuffer/screen with resolution 800x600 pixels
- we can apply two textures and mix them

    T(i, j) = α1 T1(i, j) + α2 T2(i, j),   for all pixels (i, j)

- this equals a weighted sum of two matrices from R^{800x600}, the result of which is stored in the framebuffer/screen

GPGPU = general-purpose computing on graphics processing units (2003)
ESSENCE OF GPGPU

- at the beginning we had to use OpenGL for GPGPU
- problems were reformulated in terms of textures and operations on pixels
- game developers needed more flexible hardware
- pixel shaders
  - simple programmable processors for operations on pixels
  - support for single-precision arithmetic
  - limited number of instructions
CUDA = Compute Unified Device Architecture - nvidia, 15 February 2007

- significantly simplifies GPGPU programming
- completely avoids the use of OpenGL and texture-like formulations of problems
- based on a simple extension of the C language
- supports only nvidia graphics cards (or TESLA cards)

It is very easy to write code for CUDA, but one must have good knowledge of the hardware to get efficient code.
CUDA ARCHITECTURE I.

- CUDA device = device for simultaneous processing of thousands of independent threads
- a CUDA thread is a lightweight structure - easy and efficient to create
- communication between processing units is the main difficulty in parallel computing
- we cannot hope to synchronise 240, resp. 30 000, threads efficiently
- the CUDA architecture therefore introduces small groups of threads with shared memory which can be synchronised
CUDA ARCHITECTURE II.

10-Series architecture (GeForce 2xx, TESLA):
- consists of 30 multiprocessors
- each has 8 thread processors

FIGURE: Source: nvidia Programming Guide

The thread hierarchy follows from the hardware architecture:
THREAD HIERARCHY

- threads are grouped into blocks
- one block is processed on one multiprocessor
- threads in the same block share a very fast, low-latency memory of 16 kB
- threads in the same block can be synchronised
- there can be up to 512 threads in one block - the multiprocessor must switch between them
- blocks of threads are grouped into grids
EXECUTION MODEL

FIGURE: Source: nvidia: Getting Started with CUDA

MEMORY LAYOUT

FIGURE: Source: nvidia: Getting Started with CUDA

MEMORY HIERARCHY

FIGURE: Source: nvidia: Getting Started with CUDA
COALESCED ACCESS

- the majority of GPU global memory accesses consists of texture accesses
- the GPU is strongly optimised for sequential global memory access
- one should avoid random access to the global memory
- coalesced memory access can significantly reduce (up to 16x) the number of memory transactions
COALESCED ACCESS

FIGURE: Source: nvidia CUDA Programming Guide
PROGRAMMING IN CUDA I.

- programming for CUDA consists of writing kernels = code processed by one thread
- kernels do not support recursion
- they support branching - but it can reduce efficiency

The following code in C

int main()
{
   float A[ N ], B[ N ], C[ N ];
   ...
   for( int i = 0; i < N; i ++ )
      C[ i ] = A[ i ] + B[ i ];
}
PROGRAMMING IN CUDA II.

can be replaced by

__global__ void vecAdd( float* A, float* B, float* C )
{
   int i = threadIdx.x;
   C[ i ] = A[ i ] + B[ i ];
}

int main()
{
   // allocate A, B, C on the CUDA device
   ...
   vecAdd<<< 1, N >>>( A, B, C );  // 1 block of N threads
}
ALLOCATING MEMORY ON THE CUDA DEVICE

// Allocate input vectors h_a and h_b in host memory
float* h_a = malloc( size );
float* h_b = malloc( size );
float* h_c = malloc( size );

// Allocate vectors in device memory
float* d_a; cudaMalloc( ( void** ) &d_a, size );
float* d_b; cudaMalloc( ( void** ) &d_b, size );
float* d_c; cudaMalloc( ( void** ) &d_c, size );

// Copy vectors from host memory to device memory
cudaMemcpy( d_a, h_a, size, cudaMemcpyHostToDevice );
cudaMemcpy( d_b, h_b, size, cudaMemcpyHostToDevice );

// Invoke kernel
vecAdd<<< 1, N >>>( d_a, d_b, d_c );

// Copy result from device memory to host memory
// h_c contains the result in host memory
cudaMemcpy( h_c, d_c, size, cudaMemcpyDeviceToHost );

// Free device memory
cudaFree( d_a ); cudaFree( d_b ); cudaFree( d_c );

Compile with nvcc.
PDES IN CUDA I.

Consider the following parabolic PDE

   ∂u/∂t (x, t) + F( x, u, ∇u, ∇²u, t ) = 0   on (0, T] x Ω,
   u(x, 0) = u_ini(x)   on Ω,
   u(x, t) = g(x)   on ∂Ω,

where Ω is a domain in R².
PDES IN CUDA II.

Assume that Ω ≡ [0, 1] x [0, 1] and define a numerical grid

   ω_h  = { (ih, jh) | i = 1, ..., N-1, j = 1, ..., N-1 },
   ω̄_h  = { (ih, jh) | i = 0, ..., N, j = 0, ..., N },
   ∂ω_h = ω̄_h \ ω_h,

for N ∈ N⁺ and h := 1/N.
PDES IN CUDA III.

After discretisation in space (using e.g. the finite difference method) we obtain the following system of ODEs

   d/dt u_ij(t) + F_ij( u_h, ∇u_h, ∇²u_h, t ) = 0   on (0, T] x ω_h,
   u_ij(0) = u_ini(ih, jh)   on ω_h,
   u_ij(t) = g(ih, jh)   on ∂ω_h.
PDES IN CUDA IV.

This system of ODEs can also be written as

   d/dt u_ij(t) = f( u_h, t )_ij,   for i, j = 0, ..., N,

with initial values

   u_ij(0) = u_ini(ih, jh),   for i, j = 0, ..., N.

We solve it by the following Runge-Kutta-Merson method with adaptive time stepping:
PDES IN CUDA V.

1. Set τ := τ0 for arbitrary τ0 > 0.
2. Compute the grid functions k1_ij, k2_ij, k3_ij, k4_ij, k5_ij for i = 0, ..., N1 and j = 0, ..., N2 as:

   k1_ij := τ f( t, u_h )_ij
   k2_ij := τ f( t + τ/3, u_h + k1/3 )_ij
   k3_ij := τ f( t + τ/3, u_h + k1/6 + k2/6 )_ij
   k4_ij := τ f( t + τ/2, u_h + k1/8 + 3 k3/8 )_ij
   k5_ij := τ f( t + τ, u_h + k1/2 − 3 k3/2 + 2 k4 )_ij

3. Evaluate the error of the approximation with the current time step τ as

   e := max_{i = 0, ..., N1; j = 0, ..., N2} (1/3) | (1/5) k1_ij − (9/10) k3_ij + (4/5) k4_ij − (1/10) k5_ij |.

4. If this error is smaller than a given tolerance ɛ, update u_h as u_ij := u_ij + ( k1_ij + 4 k4_ij + k5_ij )/6 for i = 0, ..., N1, j = 0, ..., N2 and set t := t + τ.
5. Independently of the previous condition, update τ as: τ := min{ (4/5) τ ( ɛ/e )^{1/5}, T − t }.
6. Repeat the whole process with the new τ, i.e. go to step 2.
PDES IN CUDA VI.

The evaluation of each of k1, ..., k5, as well as of e and of the arguments of f, is implemented in a separate kernel.
APPLICATION TO MEDICAL IMAGE SEGMENTATION BY THE MODIFIED ALLEN-CAHN EQUATION

   ξ ∂u/∂t = ξ ∇·( g( |∇I_σ| ) ∇u ) + g( |∇I_σ| ) ( (1/ξ) f0(u) + ξ F |∇u| )   on (0, T] x Ω,
   u(x, 0) = u_ini(x)   on Ω,
   u(x, t) = g(x)   on ∂Ω,

where

- I_σ = G_σ * I is the input image smoothed by a Gaussian filter
- g(s) = 1/( 1 + λs ) is the Perona-Malik function
- f0(u) = u( 1 − u )( u − 1/2 )
- F = F(x) is a forcing term

V. Žabka, 2008
MRI SEGMENTATION

FIGURE: Segmentation of MRI data by the Allen-Cahn equation
SPEEDUP OF THE METHOD OF LINES IN CUDA

Comparison of CPU time vs. GPU time:
- CPU: Intel Core 2 Duo E6550 - 2 cores at 2.33 GHz, 4 MB L2 cache, 12.8 GB/s
- GPU: nvidia GeForce 8800 GT - 112 cores at 1.62 GHz, 512 MB RAM, 60.8 GB/s

  Resolution     CPU (s)   GPU (s)   Speedup
  256 x 256      16.2      1.056     15.34
  512 x 512      341       11.92     28.61
  1024 x 1024    6054      183.52    32.99
GMRES METHOD IN CUDA

- we implemented the GMRES method for solving the linear system Ax = b - J. Vacata, 2008
- according to Google, in March 2009 we were the only ones having GMRES for sparse matrices in CUDA
- implementing GMRES in CUDA is straightforward
- we need a format for storing sparse matrices that allows coalesced memory access when computing the matrix-vector product
CSR FORMAT FOR SPARSE MATRICES

Figure 4: CSR format - the nonzero values of the matrix are stored row by row in values[], their column indices in columns[], and row pointers[] marks where each row begins.
PCSR FORMAT FOR SPARSE MATRICES

Figure 5: Parallel CSR format - the rows of the matrix are grouped into blocks, each block storing its portion of values[] and columns[] together with non zero els[] and block pointers[], so that the threads of one block can read the data in a coalesced way.
We tested the CUDA GMRES solver on the following matrices: helm2d03, language and cage14.
CUDA GMRES SPEEDUP

Results obtained in single-precision arithmetic on:
- CPU: Intel Core 2 Duo E6550 - 2 cores at 2.33 GHz, 4 MB L2 cache, 12.8 GB/s
- GPU: nvidia GeForce 8800 GT - 112 cores at 1.62 GHz, 512 MB RAM, 60.8 GB/s

  Matrix     Non-zero els.   CPU (s)   GPU (s)   Speedup
  helm2d03   2,741,935       40.5      4         10.1
  language   1,216,334       66.5      10.6      6.27
  cage14     27,130,349      96.5      4.4       21.9
FUTURE OF GPGPU

- the GPU is much better designed for numerical computations
- however, it is still perceived as a computer-games device
- even with CUDA, code development takes a lot of time
- libraries only by nvidia
- weak support of double precision
- limited memory - 4 GB
- almost no experience with GPU clusters
- the GPU is still developing quickly, therefore it is changing a lot
- possible fusion with CPU
  - it would avoid the necessity of CPU-GPU data transfers
  - but common RAM is not optimised for sequential access!
FUTURE OF CUDA?

- nvidia is now the leader in GPGPU thanks to CUDA
- CUDA does not support GPUs by AMD - hence the new standard OpenCL
- CUDA still does not have good support for computations on multiple cards
THANK YOU

To start with CUDA visit
http://www.nvidia.com/object/cuda_home.html#
or just type "CUDA" into Google.