
USE OF GPU FOR DIFFERENTIAL EQUATIONS
Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka
Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University in Prague
Mini workshop on advanced numerical methods for parallel computers

OVERVIEW

NEED FOR MORE COMPUTATIONAL POWER
We want to do more and more complex computations, and they require more powerful CPUs. We cannot increase the power of a CPU by raising the clock frequency anymore, so we may search for more efficient architectures... or for parallel computing using multicore CPUs.

IS MULTICORE ENOUGH?
Adding more cores is not so simple because of the shared memory architecture: it is very difficult to build a shared-memory parallel system with more than 100 cores. For more than 100 cores we usually have to switch to distributed systems. The latest mainframe z10 supports up to 64 cores. September 26, 2006 - Intel: 80 cores by 2011, http://techfreep.com/intel-80-cores-by-2011.htm

DISADVANTAGES OF CPU
The CPU is designed to process general code; the main parts of its design are the pipeline and the cache. The pipeline allows more efficient processing of instructions, but it needs to predict conditions in the code - speculative execution (on average there is one conditional instruction per six instructions). The cache hides the large latency of common RAM. Both require complicated algorithms, and the majority of transistors is spent on cache and speculative execution rather than on the actual computation. The CPU is not well designed for numerical computing.

ADVANTAGES OF GPU
GPU - graphics processing unit. The GPU is designed to run up to 240 threads simultaneously - virtually up to 30 000 threads.¹ Threads must be independent - it is not known in what order they are going to be scheduled. Intensive computing with only a few conditions is assumed: there is no speculative execution and there is no cache. The GPU is optimised for sequential memory access - 112 GB/s.
¹ nvidia Tesla

ADVANTAGES OF GPU
FIGURE: Source nvidia Programming Guide

COMPARISON CPU VS. GPU
For approx. 1000 EUR one can buy:

               nvidia TESLA S1060   Intel Core i7-975 Extreme Quad-Core
Transistors    1 400 millions       731 millions
Clock          1.3 GHz              3.3 GHz
Threads Num.   240                  8
Peak Perf.     936 GFlops           50 GFlops
Bandwidth      102 GB/s             25.6 GB/s
RAM            4 GB                 48 GB

nvidia predicts 570 times faster GPUs until 2015.

GPU = graphics processing unit: an accelerator of algorithms in 3D graphics and visualisation, originally aimed at computer games (a psychological disadvantage of the GPU even today). A typical run: transformation of thousands of triangles, applying textures, projection to the frame buffer - with no data dependency.

Assume we have a rectangle and apply a gray-scale texture with 800x600 pixels, projected one-to-one to the framebuffer/screen with resolution 800x600 pixels. We can apply two textures and mix them: $T(i,j) = \alpha_1 T_1(i,j) + \alpha_2 T_2(i,j)$ for all pixels $(i,j)$. This equals a weighted sum of two matrices from $\mathbb{R}^{800 \times 600}$, the result of which is stored in the framebuffer/screen. GPGPU = General-purpose computing on graphics processing units (2003).
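
In today's CUDA notation (introduced on the following slides), this texture mixing is just an element-wise weighted sum of two arrays. A minimal sketch, assuming the two textures are stored as plain float arrays of width x height pixels (names and launch sizes are illustrative):

__global__ void mixTextures( const float* T1, const float* T2, float* T,
                             float alpha1, float alpha2, int width, int height )
{
   // one thread per pixel (i, j)
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   int j = blockIdx.y * blockDim.y + threadIdx.y;
   if( i < width && j < height )
   {
      int pixel = j * width + i;
      T[ pixel ] = alpha1 * T1[ pixel ] + alpha2 * T2[ pixel ];
   }
}

// e.g. for an 800x600 texture:
// dim3 block( 16, 16 );
// dim3 grid( ( 800 + 15 ) / 16, ( 600 + 15 ) / 16 );
// mixTextures<<< grid, block >>>( d_T1, d_T2, d_T, 0.5f, 0.5f, 800, 600 );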

ESSENCE OF GPGPU
In the beginning we had to use OpenGL for GPGPU: problems were reformulated in terms of textures and operations on pixels. Game developers needed more flexible hardware, which led to pixel shaders - simple programmable processors for operations on pixels, with support for single-precision arithmetic and a limited number of instructions.

CUDA = Compute Unified Device Architecture - nvidia, 15 February 2007. It significantly simplifies GPGPU programming: it completely avoids the use of OpenGL and texture-like formulations of problems and is based on a simple extension of the C language. It supports only nvidia graphics cards (or TESLA cards). It is very easy to write code for CUDA, but one must have good knowledge of the hardware to get efficient code.

CUDA ARCHITECTURE I.
A CUDA device is a device for simultaneous processing of thousands of independent threads. A CUDA thread is a lightweight structure - easy and efficient to create. Communication between processing units is the main difficulty in parallel computing: we cannot hope to synchronise 240, resp. 30 000, threads efficiently. The CUDA architecture therefore introduces small groups of threads with shared memory which can be synchronised.

CUDA ARCHITECTURE II.
The 10-Series architecture (GeForce 2xx, TESLA) consists of 30 multiprocessors, each with 8 thread processors.
FIGURE: Source nvidia Programming Guide
From the hardware architecture the thread hierarchy follows:

THREAD HIERARCHY
Threads are grouped into blocks; one block is processed on one multiprocessor. Threads in the same block share very fast, low-latency memory (16 kB) and can be synchronised. There can be up to 512 threads in one block, and the multiprocessor must switch between them. Blocks of threads are grouped into grids. (See the indexing sketch below.)
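
A minimal sketch of how the hierarchy looks in code: every thread derives its global index from blockIdx, blockDim and threadIdx, and the threads of one block may cooperate through shared memory and __syncthreads(). The kernel and the block size of 256 are illustrative only:

__global__ void scale( const float* in, float* out, float factor, int n )
{
   __shared__ float buffer[ 256 ];                  // fast memory shared by one block
   int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
   if( i < n )
      buffer[ threadIdx.x ] = in[ i ];
   __syncthreads();                                 // synchronisation within the block only
   if( i < n )
      out[ i ] = factor * buffer[ threadIdx.x ];
}

// launched as a grid of blocks with 256 threads each:
// scale<<< ( n + 255 ) / 256, 256 >>>( d_in, d_out, 2.0f, n );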

EXECUTION MODEL
FIGURE: Source nvidia: Getting Started with CUDA

MEMORY LAYOUT
FIGURE: Source nvidia: Getting Started with CUDA

MEMORY HIERARCHY
FIGURE: Source nvidia: Getting Started with CUDA

COALESCED ACCESS
The majority of GPU global memory accesses consists of texture accesses, and the GPU is strongly optimised for sequential global memory access. One should avoid random access to global memory: coalesced memory access can significantly reduce (up to 16x) the number of memory transactions. (A small illustration follows below.)
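
As an illustration (not taken from the slides): in a coalesced pattern, consecutive threads of a half-warp read consecutive addresses, which the hardware can merge into a few wide transactions; a strided pattern defeats this and each access may cost its own transaction.

// Coalesced: thread i touches element i - neighbouring threads read neighbouring addresses.
__global__ void copyCoalesced( const float* in, float* out, int n )
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if( i < n )
      out[ i ] = in[ i ];
}

// Strided: thread i touches element i * stride - the accesses of a half-warp are scattered.
__global__ void copyStrided( const float* in, float* out, int n, int stride )
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if( i * stride < n )
      out[ i ] = in[ i * stride ];
}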

COALESCED ACCESS
FIGURE: Source nvidia CUDA programming guide

PROGRAMMING IN CUDA I.
Programming for CUDA consists of writing kernels = code processed by one thread. Kernels do not support recursion; they do support branching, but it can reduce efficiency. The following code in C

int main()
{
   float A[ N ], B[ N ], C[ N ];
   ...
   for( int i = 0; i < N; i++ )
      C[ i ] = A[ i ] + B[ i ];
}

PROGRAMMING IN CUDA II.
can be replaced by

__global__ void vecAdd( float* A, float* B, float* C )
{
   int i = threadIdx.x;
   C[ i ] = A[ i ] + B[ i ];
}

int main()
{
   // allocate A, B, C on the CUDA device
   ...
   vecAdd<<< 1, N >>>( A, B, C );   // one block of N threads
}

ALLOCATING MEMORY ON THE CUDA DEVICE

// Allocate input vectors h_a and h_b and output vector h_c in host memory
float* h_a = ( float* ) malloc( size );
float* h_b = ( float* ) malloc( size );
float* h_c = ( float* ) malloc( size );
// Allocate vectors in device memory
float* d_a; cudaMalloc( ( void** ) &d_a, size );
float* d_b; cudaMalloc( ( void** ) &d_b, size );
float* d_c; cudaMalloc( ( void** ) &d_c, size );
// Copy vectors from host memory to device memory
cudaMemcpy( d_a, h_a, size, cudaMemcpyHostToDevice );
cudaMemcpy( d_b, h_b, size, cudaMemcpyHostToDevice );
// Invoke kernel
vecAdd<<< 1, N >>>( d_a, d_b, d_c );
// Copy result from device memory to host memory
// h_c contains the result in host memory
cudaMemcpy( h_c, d_c, size, cudaMemcpyDeviceToHost );
// Free device memory
cudaFree( d_a ); cudaFree( d_b ); cudaFree( d_c );

Compile with nvcc.

PDES IN CUDA I.
Consider the following parabolic PDE
$$\partial_t u(x,t) + F(x, u, \nabla u, \nabla^2 u, t) = 0 \quad \text{on } (0,T] \times \Omega,$$
$$u(x,0) = u_{\rm ini}(x) \quad \text{on } \Omega,$$
$$u(x,t) = g(x) \quad \text{on } \partial\Omega,$$
where $\Omega$ is a domain in $\mathbb{R}^2$.

PDES IN CUDA II.
Assume that $\Omega \equiv [0,1] \times [0,1]$ and define a numerical grid
$$\omega_h = \{(ih, jh) \mid i = 1, \dots, N-1,\ j = 1, \dots, N-1\},$$
$$\bar\omega_h = \{(ih, jh) \mid i = 0, \dots, N,\ j = 0, \dots, N\},$$
$$\partial\omega_h = \bar\omega_h \setminus \omega_h,$$
for $N \in \mathbb{N}^+$ and $h := 1/N$.

PDES IN CUDA III.
After discretisation in space (using e.g. the finite difference method) we obtain the following system of ODEs
$$\frac{d}{dt} u_{ij}(t) + F_{ij}(u^h, \nabla u^h, \nabla^2 u^h, t) = 0 \quad \text{on } (0,T] \times \omega_h,$$
$$u_{ij}(0) = u_{\rm ini}(ih, jh) \quad \text{on } \bar\omega_h,$$
$$u_{ij}(t) = g(ih, jh) \quad \text{on } \partial\omega_h.$$

PDES IN CUDA IV.
This system of ODEs can also be written as
$$\frac{d}{dt} u_{ij}(t) = f(u^h, t)_{ij}, \quad \text{for } i, j = 0, \dots, N,$$
with initial values
$$u_{ij}(0) = u_{\rm ini}(ih, jh), \quad \text{for } i, j = 0, \dots, N.$$
We solve it by the following Runge-Kutta-Merson method with adaptivity in time:

PDES IN CUDA V.
1. Set $\tau := \tau_0$ for arbitrary $\tau_0 > 0$.
2. Compute the grid functions $k^1_{ij}, k^2_{ij}, k^3_{ij}, k^4_{ij}, k^5_{ij}$ as
$$k^1_{ij} := \tau f\left(t,\ u^h\right)_{ij},$$
$$k^2_{ij} := \tau f\left(t + \tau/3,\ u^h + k^1/3\right)_{ij},$$
$$k^3_{ij} := \tau f\left(t + \tau/3,\ u^h + k^1/6 + k^2/6\right)_{ij},$$
$$k^4_{ij} := \tau f\left(t + \tau/2,\ u^h + k^1/8 + 3k^3/8\right)_{ij},$$
$$k^5_{ij} := \tau f\left(t + \tau,\ u^h + k^1/2 - 3k^3/2 + 2k^4\right)_{ij},$$
for $i = 0, \dots, N_1$ and $j = 0, \dots, N_2$.
3. Evaluate the error of the approximation with the current time step $\tau$ as
$$e := \max_{\substack{i=0,\dots,N_1 \\ j=0,\dots,N_2}} \frac{1}{3} \left| \frac{1}{5} k^1_{ij} - \frac{9}{10} k^3_{ij} + \frac{4}{5} k^4_{ij} - \frac{1}{10} k^5_{ij} \right|.$$
4. If this error is smaller than the given tolerance $\epsilon$, update $u^h$ as $u^h_{ij} := u^h_{ij} + \left(k^1_{ij} + 4 k^4_{ij} + k^5_{ij}\right)/6$ for $i = 0, \dots, N_1$, $j = 0, \dots, N_2$, and set $t := t + \tau$.
5. Independently of the previous condition, update $\tau$ as $\tau := \min\left\{ \frac{4}{5}\, \tau \left(\epsilon/e\right)^{1/5},\ T - t \right\}$.
6. Repeat the whole process with the new $\tau$, i.e. go to step 2.

PDES IN CUDA VI.
The evaluation of each $k^1, \dots, k^5$, as well as of $e$ and of the arguments of $f$, is implemented in separate kernels.
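
As an illustration of such a kernel (a sketch only, not the talk's implementation), assume the special case of the heat equation, $F = -\Delta u$, discretised by central differences; the grid function is stored row by row in a plain array of $(N+1) \times (N+1)$ values:

// Evaluates f(u^h, t)_ij = ( u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} - 4 u_ij ) / h^2
// at interior nodes; Dirichlet boundary nodes do not evolve, so f is set to zero there.
__global__ void heatEquationRhs( const float* u, float* f, int N, float h )
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   int j = blockIdx.y * blockDim.y + threadIdx.y;
   if( i > N || j > N )
      return;                                        // outside the grid
   int idx = j * ( N + 1 ) + i;
   if( i == 0 || i == N || j == 0 || j == N )
   {
      f[ idx ] = 0.0f;                               // boundary node
      return;
   }
   f[ idx ] = ( u[ idx - 1 ] + u[ idx + 1 ]
              + u[ idx - ( N + 1 ) ] + u[ idx + ( N + 1 ) ]
              - 4.0f * u[ idx ] ) / ( h * h );
}

The stages $k^1, \dots, k^5$ and the arguments $u^h + \dots$ are then obtained by similar element-wise kernels (scaling by $\tau$, linear combinations, a maximum reduction for $e$).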

APPLICATION TO MEDICAL IMAGE SEGMENTATION BY MODIFIED ALLEN-CAHN EQUATION
$$\xi \partial_t u = \xi \nabla \cdot \left( g(|\nabla I_\sigma|) \nabla u \right) + g(|\nabla I_\sigma|) \left( \frac{1}{\xi} f_0(u) + \xi F |\nabla u| \right) \quad \text{on } (0,T] \times \Omega,$$
$$u(x,0) = u_{\rm ini}(x) \quad \text{on } \Omega,$$
$$u(x,t) = g(x) \quad \text{on } \partial\Omega,$$
where
$I_\sigma = G_\sigma * I$,
$g(s) = 1/(1 + \lambda s)$ is the Perona-Malik function,
$f_0(u) = u(1-u)(u - 1/2)$,
$F = F(x)$ is a forcing term.
V. Žabka, 2008

MRI SEGMENTATION
FIGURE: Segmentation of MRI data by the Allen-Cahn equation

SPEEDUP OF THE METHOD OF LINES IN CUDA
Comparison of CPU time (Intel Core 2 Duo E6550 - 2 cores at 2.33 GHz, 4 MB L2 cache, 12.8 GB/s) vs. GPU time (nvidia GeForce 8800 GT - 112 cores at 1.62 GHz, 512 MB RAM, 60.8 GB/s):

Resolution    CPU (s)   GPU (s)   Speedup
256 x 256     16.2      1.056     15.34
512 x 512     341       11.92     28.61
1024 x 1024   6054      183.52    32.99

GMRES METHOD IN CUDA
We implemented the GMRES method for solving the linear system $Ax = b$ - J. Vacata, 2008. According to Google, in March 2009 we were the only ones having GMRES for sparse matrices in CUDA. Implementing GMRES in CUDA is straightforward; we need a format for storing sparse matrices that allows coalesced memory access when computing the matrix-vector product.

CSR FORMAT FOR SPARSE MATRICES
FIGURE: CSR format - an example sparse matrix stored as three arrays: values[] (the nonzero elements, row by row), columns[] (their column indices) and row pointers[] (the offset of the first nonzero of each row).
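
For reference, a straightforward CSR matrix-vector product assigns one thread to each row (an illustrative sketch, not the talk's implementation). With plain CSR, neighbouring threads start reading values[] and columns[] at unrelated offsets, so the accesses are not coalesced - which is what motivates the modified format on the next slide.

// y = A x for a sparse matrix A in CSR format, one thread per matrix row
__global__ void spmvCsr( int numRows,
                         const int* rowPointers,   // numRows + 1 entries
                         const int* columns,       // column index of each nonzero
                         const float* values,      // nonzero values stored row by row
                         const float* x,
                         float* y )
{
   int row = blockIdx.x * blockDim.x + threadIdx.x;
   if( row < numRows )
   {
      float sum = 0.0f;
      for( int k = rowPointers[ row ]; k < rowPointers[ row + 1 ]; k++ )
         sum += values[ k ] * x[ columns[ k ] ];
      y[ row ] = sum;
   }
}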

PCSR FORMAT FOR SPARSE MATRICES
FIGURE: Parallel CSR format - an example matrix stored as values[], columns[], non zero els[] and block pointers[] arrays.

We tested the CUDA GMRES solver on the following matrices: helm2d03, language and cage14.

CUDA GMRES SPEEDUP
Results obtained in single-precision arithmetic on Intel Core 2 Duo E6550 (2 cores at 2.33 GHz, 4 MB L2 cache, 12.8 GB/s) and nvidia GeForce 8800 GT (112 cores at 1.62 GHz, 512 MB RAM, 60.8 GB/s):

Matrix     Non-zero els.   CPU (s)   GPU (s)   Speedup
helm2d03   2,741,935       40.5      4         10.1
language   1,216,334       66.5      10.6      6.27
cage14     27,130,349      96.5      4.4       21.9

FUTURE OF GPGPU
The GPU is much better designed for numerical computations. However, it is still perceived as a computer games device. Even with CUDA, code development takes a lot of time; libraries come only from nvidia; support for double precision is weak; memory is limited to 4 GB; and there is almost no experience with GPU clusters. The GPU is still developing quickly and therefore changing a lot. A possible fusion with the CPU would avoid the necessity of CPU-GPU data transfers, but common RAM is not optimised for sequential access!

FUTURE OF CUDA?
nvidia is now the leader in GPGPU thanks to CUDA. CUDA does not support GPUs by AMD - hence the new standard OpenCL. CUDA still does not have good support for computation on multiple cards.

THANK YOU
To start with CUDA visit http://www.nvidia.com/object/cuda_home.html# or just type "CUDA" into Google.