GPGPU: HIGH-PERFORMANCE COMPUTING


Joined Advanced Student School (JASS) 2009
March 29 - April 7, 2009, St. Petersburg, Russia

Dmitry Puzyrev
St. Petersburg State University, Faculty of Physics, Department of Computational Physics

In recent years, the application of graphics processing units to general-purpose computing has become widespread. GPGPU, which stands for General-Purpose computing on Graphics Processing Units, is making its way into fields of computation traditionally associated with and handled by CPUs or clusters of CPUs. The breakthrough in GPU computing was caused by the introduction of programmable stages and higher-precision arithmetic in the rendering pipeline, which made it possible to perform stream processing of non-graphics data.

Let us consider what makes GPUs effective in high-performance computing. GPUs are built for parallel processing of data and are highly effective in data-parallel tasks. The large number of computing units (GPUs have on the order of 128-800 ALUs, compared to 4 ALUs on a typical quad-core CPU) allows the computational power of a high-end GPU to exceed that of a CPU by up to 10 times, while high-end GPUs cost much less than CPUs (see Fig. 1). Memory bandwidth, essential to many applications, is 100+ GB/s, compared to 10-20 GB/s for CPUs.

[Figure omitted: chart of peak floating-point performance over time for NVIDIA GPUs (NV30 = GeForce FX 5800, NV35 = GeForce FX 5950 Ultra, NV40 = GeForce 6800 Ultra, G70 = GeForce 7800 GTX, G71 = GeForce 7900 GTX, G80 = GeForce 8800 GTX, G92 = GeForce 9800 GTX, GT200 = GeForce GTX 280) versus Intel CPUs.]
Fig. 1. Comparison of computational power: floating-point operations per second for the GPU and CPU (by NVIDIA)

Such computational power can now be applied in many areas, including scientific computing, signal and image processing, and, of course, computer graphics itself (including non-traditional rendering algorithms such as ray tracing). Scientific applications of high-performance GPGPU include molecular dynamics, astrophysics, geophysics, quantum chemistry, and neural networks. Fig. 2 illustrates the effectiveness (speedup) of GPGPU in several fields.

[Figure omitted: reported speedups are not 2x or 3x, but 20x to 150x: 146x medical imaging (U of Utah), 36x molecular dynamics (U of Illinois, Urbana), 18x video transcoding (Elemental Tech), 50x MATLAB computing (AccelerEyes), 100x astrophysics (RIKEN), 149x financial simulation (Oxford), 47x linear algebra (Universidad Jaime), 20x 3D ultrasound (Techniscan), 130x quantum chemistry (U of Illinois, Urbana), 30x gene sequencing (U of Maryland).]
Fig. 2. Speedup by use of GPGPU in scientific applications (by NVIDIA)

GPGPU computing can be performed on various hardware, including virtually all modern GPUs. NVIDIA desktop GPUs (the 9-series or GTX series) and modern AMD GPUs support general-purpose computing. Both NVIDIA and AMD also have dedicated high-performance GPGPU products: NVIDIA Tesla and AMD FireStream.

Of course, GPU architecture is highly specific. In essence, the GPU devotes many more transistors to data processing than to data caching and flow control. Fig. 3 roughly illustrates this difference between CPU and GPU architecture. The hardware specifics shape the programming model. The basics of the GPGPU programming model are:

- a small program (called a kernel) works on many data elements;
- each data element is processed concurrently;
- communication is effective only inside one execution unit.

Two slightly different models are used on GPUs: SIMD (Single Instruction, Multiple Data) and SPMD (Single Program, Multiple Data). A minimal sketch of this style of programming follows.
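As an illustration (added to this transcription, not from the original text; the function names scaleCPU and scaleGPU are hypothetical), the following sketch contrasts a sequential CPU loop with its data-parallel equivalent: the loop body becomes a kernel, and each data element is handled by its own thread.

    // Sequential version: one CPU thread visits every element in turn.
    void scaleCPU(float* data, float factor, int n)
    {
        for (int i = 0; i < n; ++i)
            data[i] = data[i] * factor;
    }

    // Data-parallel version: the loop disappears; each GPU thread
    // processes the single element selected by its own index.
    __global__ void scaleGPU(float* data, float factor)
    {
        int i = threadIdx.x;
        data[i] = data[i] * factor;
    }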

[Figure omitted: block diagrams of a CPU (large control logic and cache, few ALUs) and a GPU (many ALUs, small control and cache), each attached to its own DRAM.]
Fig. 3. Schematic comparison of CPU and GPU architecture

The most popular instrument of general-purpose GPU computing is NVIDIA's GPGPU implementation, called NVIDIA CUDA. CUDA is positioned as a general-purpose parallel computing architecture and works with all modern NVIDIA GPUs. It uses C as its high-level programming language, though other languages are to be supported in the future, as shown in Fig. 4.

Fig. 4. NVIDIA CUDA architecture

One of the basic concepts of CUDA is the kernel. Kernels are C functions that are executed N times in parallel; they are invoked from the main function with the special <<<...>>> syntax. Fig. 5 contains a basic example of CUDA code with a kernel and a main function.

    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

    int main()
    {
        // Kernel invocation
        vecAdd<<<1, N>>>(A, B, C);
    }

Fig. 5. Kernel: this code adds two vectors A and B of size N

The next basic concept of CUDA is the thread hierarchy. Each thread executes one instance of the kernel, and threads are combined into thread blocks. The next figure shows an example of matrix addition using thread blocks.

    __global__ void matAdd(float A[N][N], float B[N][N], float C[N][N])
    {
        int i = threadIdx.x;
        int j = threadIdx.y;
        C[i][j] = A[i][j] + B[i][j];
    }

    int main()
    {
        // Kernel invocation
        dim3 dimBlock(N, N);
        matAdd<<<1, dimBlock>>>(A, B, C);
    }

Fig. 6. Thread blocks: this code adds two matrices A and B of size N*N

Threads within a block have access to shared memory and can synchronize with each other; a minimal sketch of this cooperation is given below. On current GPUs, a thread block may contain up to 512 threads.
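The following sketch (an illustration added here, not from the original text; the kernel blockSum and the buffer buf are hypothetical names) shows threads of one block cooperating through shared memory: each block sums its portion of an array by tree reduction, with __syncthreads() as the barrier between steps.

    #define BLOCK_SIZE 256   // must equal blockDim.x and be a power of two

    __global__ void blockSum(const float* in, float* out)
    {
        __shared__ float buf[BLOCK_SIZE];              // per-block shared memory
        int tid = threadIdx.x;
        buf[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();                               // wait until all loads finish

        // Tree reduction: halve the number of active threads at each step.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                buf[tid] += buf[tid + s];
            __syncthreads();
        }
        if (tid == 0)
            out[blockIdx.x] = buf[0];                  // one partial sum per block
    }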

A kernel can be executed by multiple equally shaped thread blocks arranged in a grid. Thread blocks within a grid are required to execute independently. Fig. 7 shows an example of matrix addition on a grid; Fig. 8 shows the full hierarchy of threads.

    __global__ void matAdd(float A[N][N], float B[N][N], float C[N][N])
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i < N && j < N)
            C[i][j] = A[i][j] + B[i][j];
    }

    int main()
    {
        // Kernel invocation
        dim3 dimBlock(16, 16);
        dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
                     (N + dimBlock.y - 1) / dimBlock.y);
        matAdd<<<dimGrid, dimBlock>>>(A, B, C);
    }

Fig. 7. Grid: this code adds two matrices A and B of size N*N

Note that dimGrid is rounded up so that the grid covers the whole matrix even when N is not a multiple of 16; the boundary check in the kernel then discards the excess threads.

[Figure omitted: a grid of thread blocks, each block containing a two-dimensional array of threads indexed (x, y).]
Fig. 8. Full thread hierarchy

The memory hierarchy is another complicated part of CUDA. As shown in Fig. 9, multiple memory spaces are present, including the additional specialized constant and texture memory spaces. Different memory usage strategies suit different applications, and memory usage is usually the bottleneck of GPGPU applications.

[Figure omitted: each thread has per-thread local memory, each thread block has per-block shared memory, and all grids share the global memory.]
Fig. 9. Memory hierarchy

The host and device model controls the execution of a CUDA program. CUDA threads execute on a physically separate device that operates as a coprocessor to the host running the C program. The host and the device each maintain their own DRAM, referred to as host memory and device memory, and the CUDA runtime manages data transfers between them.
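To make the host/device split concrete, here is a minimal sketch (added for illustration, not part of the original text; the h_/d_ variable names are conventional, not prescribed) of the typical pattern around the vecAdd kernel of Fig. 5: allocate device memory, copy the inputs in, launch the kernel, and copy the result back.

    #include <cuda_runtime.h>

    #define N 256

    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

    int main()
    {
        float h_A[N], h_B[N], h_C[N];          // host memory
        for (int i = 0; i < N; ++i) { h_A[i] = i; h_B[i] = 2 * i; }

        float *d_A, *d_B, *d_C;                // device memory
        cudaMalloc((void**)&d_A, N * sizeof(float));
        cudaMalloc((void**)&d_B, N * sizeof(float));
        cudaMalloc((void**)&d_C, N * sizeof(float));

        // Host-to-device transfers, managed explicitly through the runtime.
        cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

        vecAdd<<<1, N>>>(d_A, d_B, d_C);       // kernel runs on the device

        // Device-to-host transfer; cudaMemcpy waits for the kernel to finish.
        cudaMemcpy(h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
        return 0;
    }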

[Figure omitted: a C program in sequential execution alternates between serial code, which runs on the host, and parallel kernels (Kernel0<<<>>>(), Kernel1<<<>>>()), each of which runs on the device as a grid of thread blocks.]
Serial code executes on the host while parallel code executes on the device.
Fig. 10. Host and device: CUDA program execution
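A sketch of the alternation shown in Fig. 10 (illustrative only; kernel0 and kernel1 are hypothetical stand-ins for Kernel0 and Kernel1): serial host code runs between kernel launches, and the host can wait for the device explicitly.

    // Hypothetical kernels standing in for Kernel0 and Kernel1 of Fig. 10.
    __global__ void kernel0(float* data) { data[threadIdx.x] *= 2.0f; }
    __global__ void kernel1(float* data) { data[threadIdx.x] += 1.0f; }

    void run(float* d_data, int n)
    {
        // ... serial host code ...
        kernel0<<<1, n>>>(d_data);      // parallel phase on the device (Grid 0)
        cudaThreadSynchronize();        // host waits for the device (CUDA 2.x API)
        // ... more serial host code ...
        kernel1<<<1, n>>>(d_data);      // next parallel phase (Grid 1)
        cudaThreadSynchronize();
    }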

CUDA makes it possible to implement various libraries of functions that are essential for scientific high-performance computing. There is an efficient implementation of the FFT for CUDA, as well as an implementation of BLAS called CUBLAS (a short usage sketch is given at the end of this section). Unfortunately, CUBLAS does not yet include all BLAS functions. CUDA-based plugins for MATLAB exist, and LAPACK libraries for CUDA are under development. CUDA allows efficient operations on dense matrices, while sparse solvers are still under development.

CUDA still has many limitations. Some of them are fundamental and follow from the specific architecture and the data-parallel programming model, but others should definitely be fixed in the future. For example, double precision has no deviations from the IEEE 754 standard, but single precision is not fully standard-compliant, and double precision is supported only by the latest generation of NVIDIA GPUs. CUDA still lacks an advanced profiler. Emulation on the CPU is slow and often gives results that differ from those on the GPU. Another drawback is that CUDA works only on NVIDIA GPUs.

An alternative implementation of a GPGPU architecture is Stream Computing by AMD, which is based on the Brook programming language. Brook works on both AMD and NVIDIA GPUs; Brook+ is a version optimized for AMD hardware. Brook is faster in some applications and has better support for double precision.

There are two main future directions in GPGPU computing. The first focuses on the creation of new hardware that is more suitable for general-purpose computing. The best example is Larrabee, Intel's upcoming discrete GPU, which is intended to compete both in graphics and in high-performance computing. Larrabee has a hybrid architecture: 16-32 simple x86 cores with SIMD vector units and per-core L1/L2 caches, without fixed-function hardware except for texturing units. The programming model for Larrabee will be task-parallel at the core level and data-parallel on the vector units inside each core. One significant feature is that Larrabee cores can submit work to themselves without involving the host. Larrabee will use the Intel C/C++ compiler and benefit from all its features. Larrabee gathers much praise from Intel and strong criticism from GPU manufacturers.

The other direction is to create a standardized API for programming on different architectures. OpenCL, which stands for Open Computing Language, is a framework for programming on CPUs, GPUs, and other processors. It was proposed by Apple and developed by the Khronos Group, and has full support from both AMD and NVIDIA. OpenCL is likely to be introduced with Mac OS X 10.6.

In conclusion: GPGPU is a well-developed branch of computing. NVIDIA CUDA already allows us to utilize the power of GPUs in a convenient way. The data-parallel programming model is effective in many applications, but GPGPU could become more flexible by supporting more programming languages and programming concepts. A fusion between many-core CPUs and GPUs is a promising direction, and it is being explored by Intel. A standardized API for GPGPU is highly anticipated, and the upcoming OpenCL is the main candidate for this role.
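As promised above, here is a minimal sketch of CUBLAS usage (added to this transcription, not from the original text), using the legacy CUBLAS interface of the CUDA 2.x era: the library wraps device allocation and transfers, so a BLAS call such as SAXPY runs on the GPU with a few lines of host code.

    #include <stdio.h>
    #include "cublas.h"   // legacy CUBLAS API (CUDA 2.x era)

    int main()
    {
        const int n = 1024;
        float h_x[1024], h_y[1024];
        for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

        cublasInit();

        float *d_x, *d_y;
        cublasAlloc(n, sizeof(float), (void**)&d_x);
        cublasAlloc(n, sizeof(float), (void**)&d_y);
        cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);
        cublasSetVector(n, sizeof(float), h_y, 1, d_y, 1);

        cublasSaxpy(n, 3.0f, d_x, 1, d_y, 1);   // y = 3*x + y on the GPU

        cublasGetVector(n, sizeof(float), d_y, 1, h_y, 1);
        printf("y[0] = %f\n", h_y[0]);          // expect 5.0

        cublasFree(d_x);
        cublasFree(d_y);
        cublasShutdown();
        return 0;
    }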

REFERENCES

NVIDIA CUDA Programming Guide, Version 2.1
Various NVIDIA CUDA presentations