GPU CUDA Programming


GPU CUDA Programming. Jeong-Gun Lee (이정근), Department of Computer Engineering, Hallym University, Embedded SoC Lab (www.onchip.net). Email: Jeonggun.Lee@hallym.ac.kr. ALTERA JOINT LAB. Introduction, Contents: Multicore/Manycore and GPU; GPU on Medical Applications; Parallel Programming on GPUs: Basics (Conceptual Introduction, GPU Architecture Review); Parallel Programming on GPUs: Practice (Real programming); Conclusion.

Lab introduction: Embedded SoC Lab (www.onchip.net). Main research topics: asynchronous (clockless) circuit design; high-reliability, low-electromagnetic-emission circuit design methods; parallel computing and energy-efficiency analysis. Researchers: Nguyen Van Toan & Minh Tung Dam (Ph.D. students). Real product development.

Introduction, Contents: Multicore/Manycore and GPU; GPU on Medical Applications; Parallel Programming on GPUs: Basics (Conceptual Introduction, GPU Architecture Review); Parallel Programming on GPUs: Practice (Real programming); Conclusion. Introduction: Multicore/Manycore and GPU.

Alpha-Go? Introduction. The foundation that allowed Google's DeepMind to build such a powerful AI is its abundant computing resources: the distributed AlphaGo that played against Fan Hui used 1,202 CPUs and 176 GPUs. Deep learning is HOT! Why now?

Introduction: GPU? Manycore? Tesla P100: 3584 CUDA cores. GPU? Graphics-only Processing Unit?

Introduction: GPU? Manycore? GPU? Graphics-only Processing Unit? GPGPU: General-Purpose GPU. http://www.nvidia.co.kr/object/tesla-case-studies-kr.html CPU vs. GPU: Latency Processor + Throughput Processor = Heterogeneous Computing. CPU architecture attempts to minimize latency within each thread with the help of CACHING; GPU architecture hides latency with computation from other thread warps with the help of THREADING.

Introduction: Performance Gap, if you find a GOOD APPLICATION for GPU acceleration!!! Ref: Tesla GPU Computing Brochure. Introduction: GPU on Medical Imaging Applications, one of the good GPU applications: real-time & data-level parallelism.

Introduction, Contents: Multicore/Manycore and GPU; GPU on Medical Applications; Parallel Programming on GPUs: Basics (Conceptual Introduction, GPU Architecture Review); Parallel Programming on GPUs: Practice (Real programming); Conclusion. Parallel Programming on a GPU. We have one piece of code: for(i=0; i < 100; i++) C[i] = A[i] + B[i]; Yes! It is simple! But what about on a 10-core on-chip processor?

Parallel Programming on a GPU. But what about on a 10-core on-chip processor? for(i=0; i < 100; i++) C[i] = A[i] + B[i]; Core index: 0 1 2 3 4 5 6 7 8 9. For each core, perhaps something like: for(i=0; i < 10; i++) C[CoreIndex*10+i] = A[CoreIndex*10+i] + B[CoreIndex*10+i]; ?

Parallel Programming on a GPU. But what about on a 10-core on-chip processor? for(i=0; i < 100; i++) C[i] = A[i] + B[i]; Core index: 0 1 2 3 4 5 6 7 8 9. For each core: for(i=0; i < 10; i++) C[CoreIndex*10+i] = A[CoreIndex*10+i] + B[CoreIndex*10+i]; Parallel Programming on a GPU. What about threading? for(i=0; i < 100; i++) C[i] = A[i] + B[i]; Thread 0: C[0] = A[0] + B[0]; Thread 1: C[1] = A[1] + B[1]; ... Thread 98: C[98] = A[98] + B[98]; Thread 99: C[99] = A[99] + B[99];

Parallel Programming on a GPU. What about threading? for(i=0; i < 100; i++) C[i] = A[i] + B[i]; Thread 0: C[0] = A[0] + B[0]; Thread 1: C[1] = A[1] + B[1]; ... Thread 98: C[98] = A[98] + B[98]; Thread 99: C[99] = A[99] + B[99]; These threads are scheduled onto Stream Processors (SPs). [Figure: two Stream Multiprocessors, SM 1 and SM 2, each with SPs 0-4 and its own Shared Memory.] How do the 100 threads map onto the SMs?

Parallel Programming on a GPU. [Figure: two Stream Multiprocessors, SM 1 and SM 2, each with SPs 0-4 and its own Shared Memory.] The threads are grouped into blocks: Block 0 holds Thread 0: C[0] = A[0] + B[0]; Thread 1: C[1] = A[1] + B[1]; ... and Block 9 holds ... Thread 98: C[98] = A[98] + B[98]; Thread 99: C[99] = A[99] + B[99]; Parallel Programming on a GPU. Possible configurations: 1 thread per block & 100 blocks, or 5 threads per block & 20 blocks, or 10 threads per block & 10 blocks, or 20 threads per block & 5 blocks (see the launch-configuration sketch below).
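As a minimal sketch (not part of the original slides; the kernel name vecAdd and the fixed size of 100 are illustrative assumptions), all of these block/thread splits launch the same 100 threads in CUDA:

    __global__ void vecAdd(const float *A, const float *B, float *C)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < 100)                                     // guard in case of surplus threads
            C[i] = A[i] + B[i];
    }

    // Any of the configurations listed above covers elements 0..99:
    // vecAdd<<<100,  1>>>(d_A, d_B, d_C);   // 100 blocks x  1 thread
    // vecAdd<<< 20,  5>>>(d_A, d_B, d_C);   //  20 blocks x  5 threads
    // vecAdd<<< 10, 10>>>(d_A, d_B, d_C);   //  10 blocks x 10 threads
    // vecAdd<<<  5, 20>>>(d_A, d_B, d_C);   //   5 blocks x 20 threads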

Parallel Programming on a GPU. We have one piece of code: for(i=0; i < 100; i++) C[i] = A[i] + B[i]; Now, 10 blocks and 10 threads per block! [Figure: BLOCK 0 through BLOCK 9 distributed across two Stream Multiprocessors, SM 1 and SM 2, each with SPs 0-4 and its own Shared Memory.]

Parallel Programming on a GPU. Now, 10 blocks and 10 threads per block! BLOCK 0: Thread 0: C[0] = A[0] + B[0]; Thread 1: C[1] = A[1] + B[1]; ... With 5 SPs per SM, the block runs 5 threads first and then the other 5 threads. OK, let's call such a group of threads a WARP! On SM 1: WARP 0 = thread 0 .. thread 4, WARP 1 = thread 5 .. thread 9. Parallel Programming on a GPU. Now, 10 blocks and 10 threads per block → we have two indexes: block index & thread index. BLOCK 0: Thread 0: C[0] = A[0] + B[0]; Thread 1: C[1] = A[1] + B[1]; Block index ← 0 (blockIdx), thread index ← 1 (threadIdx): C[0*10+1] = A[0*10+1] + B[0*10+1]; __global__ VectorAdd (A, B, C) { C[blockIdx*10+threadIdx] = A[blockIdx*10+threadIdx] + B[blockIdx*10+threadIdx]; }

Parallel Programming on a GPU. Hmm... where are the A, B, C arrays in a system? Parallel Programming on a Manycore. Hmm... where are the A, B, C arrays in a system? What does "a system" look like?

Parallel Programming on a Manycore. Hmm... where are the A, B, C arrays in a system? The system: a GPU with its Global Memory, connected over PCI-E to the CPU (microprocessor) and its Main memory.

Parallel Programming on a Manycore. Hmm... where are the A, B, C arrays in a system? A system! The GPU has its Global Memory, connected over PCI-E to the CPU (microprocessor) and its Main memory; A and B start out here, in host main memory. So the typical flow is: 1. cudaMalloc (allocate device memory), 2. cudaMemcpy (copy inputs host → device), 3. kernel launch, 4. cudaMemcpy (copy results device → host). A minimal sketch of these steps is shown below.
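A minimal host-side sketch of these four steps, assuming host arrays A, B, C of 100 floats and the VectorAdd kernel from the previous slide (names are illustrative):

    int n = 100;
    size_t bytes = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void**)&d_A, bytes);                      // 1. allocate GPU global memory
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_C, bytes);

    cudaMemcpy(d_A, A, bytes, cudaMemcpyHostToDevice);    // 2. copy inputs host -> device
    cudaMemcpy(d_B, B, bytes, cudaMemcpyHostToDevice);

    VectorAdd<<<10, 10>>>(d_A, d_B, d_C);                 // 3. kernel launch: 10 blocks x 10 threads

    cudaMemcpy(C, d_C, bytes, cudaMemcpyDeviceToHost);    // 4. copy result device -> host

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);          // release device memory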

Introduction, Contents: Multicore/Manycore and GPU; GPU on Medical Applications; Parallel Programming on GPUs: Basics (Conceptual Introduction, GPU Architecture Review); Parallel Programming on GPUs: Practice (Real programming); Conclusion. GPU Architecture. The architecture of GPUs: Single Instruction Multiple Thread (SIMT), manycore, vector processing, throughput-oriented accelerating architecture. First, see a CPU case: AMD 12-core CPU, with many on-chip caches: L1/L2/L3. * HT: HyperTransport - a low-latency point-to-point link - https://en.wikipedia.org/wiki/hypertransport

Cache in CPU & GPU. Die shots of NVIDIA's GK110 GPU (left) and Intel's Nehalem (Beckton) 8-core CPU (right), with block diagrams for the GPU streaming multiprocessor and the CPU core. One more thing in the GPU: special memory. Graphics memory has much higher bandwidth than standard CPU memory (QDR). Stacked memory in Pascal: HBM2, High Bandwidth Memory.

Simple Comparison, GPU vs CPU: Theoretical Peak Capabilities. For this particular example, the GPU's theoretical advantage is ~5x for both compute and main-memory bandwidth. Performance very much depends on the application: it depends on how well a target application is suited to, and tuned for, the architecture. Nvidia GPU: Fermi (2009). ref: http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf

Nvidia GPU: Kepler (2012). ref: http://www.nvidia.com/object/nvidia-kepler.html

Nvidia GPU: Maxwell (2014). 128 CUDA cores per SMM, organized as four partitions of 32 CUDA cores. ref: http://international.download.nvidia.com/geforce-com/international/pdfs/geforce_gtx_980_whitepaper_final.pdf

Nvidia GPU: Pascal (2016). Looking at an individual SM, there are 64 CUDA cores, and each SM has a 256 KB register file, four times the size of its shared memory. In total, the GP100 has 14,336 KB of register file space (56 SMs × 256 KB). Compared to Maxwell, each core in Pascal has twice as many registers and 1.33 times more shared memory.

Nvidia GPU: Pascal (2016). Introduction, Contents: Multicore/Manycore and GPU; GPU on Medical Applications; Parallel Programming on GPUs: Basics (Conceptual Introduction, GPU Architecture Review); Parallel Programming on GPUs: Practice (Real programming); Conclusion.

CUDA Kernels. Parallel portion of the application: executes as a kernel. The entire GPU executes the kernel. Kernel launches create thousands of CUDA threads efficiently. CUDA threads: lightweight, fast switching (in hardware). Kernel thread: C[i] = A[i] + B[i]; Kernel launches create hierarchical groups of threads: threads are grouped into Blocks, and Blocks into Grids. CUDA C: C with a few keywords. Kernel: a function that executes on the device (GPU) and can be called from the host (CPU); it can only access GPU memory. Functions must be declared with a qualifier: __global__ : GPU kernel function launched by the CPU, must return void; __device__ : can be called from GPU functions. Kernel thread: C[i] = A[i] + B[i]; __global__ void add( ) { C[i] = A[i] + B[i]; }
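A minimal sketch of the two qualifiers (the function names addOne and incrementAll are illustrative, not from the slides): a __device__ helper is called from GPU code, while the __global__ kernel is launched from the CPU and returns void.

    __device__ float addOne(float v)
    {
        return v + 1.0f;                      // runs on the GPU, called from GPU code
    }

    __global__ void incrementAll(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = addOne(data[i]);        // kernel calls the __device__ helper
    }

    // launched from the host, e.g.: incrementAll<<<(n + 255) / 256, 256>>>(d_data, n);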

CUDA Kernels: Parallel Threads. A kernel is a function executed on a GPU as an array of parallel threads. All threads execute the same kernel code, but can take different paths. Each thread has an ID, used to select input/output data and to make control decisions (a small sketch follows below). __global__ void add( ) { i = threadIdx.x; C[i] = A[i] + B[i]; } CUDA Thread Organization. GPUs can handle thousands of concurrent threads; the CUDA programming model supports even more: it allows a kernel launch to specify more threads than the GPU can execute concurrently.
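As a small illustrative sketch (the kernel name split is assumed), the thread ID can drive both data selection and control decisions, so threads of the same kernel may follow different paths:

    __global__ void split(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;    // thread ID selects the element
        if (i >= n) return;                               // surplus threads simply exit
        if (i % 2 == 0)
            out[i] = in[i] * 2.0f;                        // even-indexed threads take this path
        else
            out[i] = in[i] + 1.0f;                        // odd-indexed threads take this path
    }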

Blocks of threads: threads are grouped into blocks. Grids of blocks: threads are grouped into blocks, and blocks are grouped into a grid. A kernel is executed as a grid of blocks of threads.

Blocks execute on an SM. [Figure: a thread runs on a Streaming Processor using its registers; a thread block runs on a Streaming Multiprocessor using per-block shared memory (SMEM); both read and write global device memory.] Grids of blocks execute across the GPU. [Figure: a grid of blocks spans the whole GPU, backed by global device memory.] A minimal shared-memory sketch follows below.
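A minimal sketch of per-block shared memory (the kernel name and the fixed block size of 256 are assumptions): each block stages its slice of the input in on-chip shared memory, synchronizes, and then reuses it.

    __global__ void reverseInBlock(const float *in, float *out)
    {
        __shared__ float tile[256];                      // per-block shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = in[i];                       // each thread loads one element
        __syncthreads();                                 // wait until the whole block has loaded

        // write the block's slice back in reverse order within the block
        out[blockIdx.x * blockDim.x + (blockDim.x - 1 - threadIdx.x)] = tile[threadIdx.x];
    }

    // launch with 256 threads per block, e.g.: reverseInBlock<<<n / 256, 256>>>(d_in, d_out);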

Thread and Block ID and Dimensions. Threads: 3D IDs, unique within a block. Thread Blocks: 3D IDs, unique within a grid. Built-in variables: threadIdx, blockIdx, blockDim, gridDim. Examples of Indexes and Indexing (the outputs below correspond to 4 blocks of 4 threads; a launch sketch follows). __global__ void kernel( int *a ) { int idx = blockIdx.x*blockDim.x + threadIdx.x; a[idx] = 7; // output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 } __global__ void kernel( int *a ) { int idx = blockIdx.x*blockDim.x + threadIdx.x; a[idx] = blockIdx.x; // output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3 } __global__ void kernel( int *a ) { int idx = blockIdx.x*blockDim.x + threadIdx.x; a[idx] = threadIdx.x; // output: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 }
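A host-side fragment that would produce the 16-element outputs shown above, assuming one of the kernels from this slide; the launch <<<4, 4>>> gives gridDim.x = 4 and blockDim.x = 4, so idx runs from 0 to 15:

    int  h_a[16];
    int *d_a = NULL;
    cudaMalloc((void**)&d_a, sizeof(h_a));                       // device array of 16 ints
    kernel<<<4, 4>>>(d_a);                                       // 4 blocks x 4 threads
    cudaMemcpy(h_a, d_a, sizeof(h_a), cudaMemcpyDeviceToHost);   // copy results back to the host
    cudaFree(d_a);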

Let's Start Again from C: convert into CUDA. C version: int A[2][4]; for(i=0;i<2;i++) for(j=0;j<4;j++) A[i][j]++; CUDA version: int A[2][4]; kernelF<<<(2,1),(4,1)>>>(A); // define 2x4=8 threads __device__ kernelF(A){ // all threads run the same kernel i = blockIdx.x; // each thread block has its id j = threadIdx.x; // each thread has its id A[i][j]++; // each thread has different i and j } Reference: http://www.es.ele.tue.nl/~heco/courses/embeddedcomputerarchitecture/ Thread Hierarchy. Example: thread 3 of block 1 operates on element A[1][3]. int A[2][4]; kernelF<<<(2,1),(4,1)>>>(A); // define 2x4=8 threads __device__ kernelF(A){ // all threads run the same kernel i = blockIdx.x; j = threadIdx.x; A[i][j]++; } Reference: http://www.es.ele.tue.nl/~heco/courses/embeddedcomputerarchitecture/ A compilable version of this sketch is given below.
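A compilable version of this sketch, under the assumption that the kernel is declared __global__ (so the host can launch it) and that d_A is a device copy of the 2x4 array:

    __global__ void kernelF(int A[2][4])
    {
        int i = blockIdx.x;     // block index selects the row
        int j = threadIdx.x;    // thread index selects the column
        A[i][j]++;              // thread j of block i updates A[i][j]
    }

    // host side: 2 blocks of 4 threads = 8 threads, one per element
    // dim3 grid(2, 1);
    // dim3 block(4, 1);
    // kernelF<<<grid, block>>>(d_A);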

Memory Architecture. [Figure: the host has a CPU, chipset, and DRAM; the device has its own DRAM (local, global, constant, and texture memory) and a GPU with multiple multiprocessors, each with registers, shared memory, and constant/texture caches.] https://en.wikipedia.org/wiki/pci_express Managing Device (GPU) Memory. The host (CPU) manages device (GPU) memory: cudaMalloc(void** pointer, size_t num_bytes), cudaMemset(void* pointer, int value, size_t count), cudaFree(void* pointer). Ex.: allocate and initialize an array of 1024 ints on the device: // allocate and initialize int x[1024] on device int n = 1024; int num_bytes = 1024*sizeof(int); int* d_x = 0; // holds device pointer cudaMalloc((void**)&d_x, num_bytes); cudaMemset(d_x, 0, num_bytes); cudaFree(d_x); An error-checked sketch of the same steps follows below.
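The same example with basic error checking added, as a self-contained sketch (the CHECK macro is not part of CUDA; it is introduced here only for illustration):

    #include <cstdio>
    #include <cuda_runtime.h>

    #define CHECK(call)                                                          \
        do {                                                                     \
            cudaError_t err = (call);                                            \
            if (err != cudaSuccess) {                                            \
                fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));    \
                return 1;                                                        \
            }                                                                    \
        } while (0)

    int main()
    {
        int n = 1024;
        int num_bytes = n * sizeof(int);
        int *d_x = 0;                                   // holds device pointer
        CHECK(cudaMalloc((void**)&d_x, num_bytes));     // allocate on the device
        CHECK(cudaMemset(d_x, 0, num_bytes));           // zero-initialize on the device
        CHECK(cudaFree(d_x));                           // release the allocation
        return 0;
    }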

Transferring Data. cudaMemcpy(void* dst, void* src, size_t num_bytes, enum cudaMemcpyKind direction); Returns to the host thread after the copy completes; blocks the CPU thread until all bytes have been copied; doesn't start copying until previous CUDA calls complete. Direction is controlled by the enum cudaMemcpyKind: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice. CUDA also provides non-blocking copies, which allow a program to overlap data transfer with concurrent computation on the host and device.
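A minimal blocking-copy sketch (array size and names are illustrative): data is copied host → device and back device → host, and each cudaMemcpy returns only after its copy has completed.

    float  h_in[256], h_out[256];
    float *d_buf = NULL;

    cudaMalloc((void**)&d_buf, sizeof(h_in));
    cudaMemcpy(d_buf, h_in,  sizeof(h_in),  cudaMemcpyHostToDevice);   // host -> device
    // ... kernel launches operating on d_buf could go here ...
    cudaMemcpy(h_out, d_buf, sizeof(h_out), cudaMemcpyDeviceToHost);   // device -> host (waits for prior work)
    cudaFree(d_buf);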

Example: SAXPY Kernel. // [compute] for(i=0; i<n; i++) y[i] = a*x[i] + y[i]; // Each thread processes one element __global__ void saxpy(int n, float a, float* x, float* y) { int i = threadIdx.x + blockDim.x * blockIdx.x; if( i<n ) y[i] = a*x[i] + y[i]; } int main() { // invoke parallel SAXPY kernel with 256 threads / block int nblocks = (n + 255)/256; saxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y); } Device Code: the __global__ saxpy function runs on the GPU.

Host Code: main() runs on the CPU. Example: SAXPY Kernel. int main() { // allocate and initialize host (CPU) memory float* x = ...; float* y = ...; // allocate device (GPU) memory float *d_x, *d_y; cudaMalloc((void**) &d_x, n*sizeof(float)); cudaMalloc((void**) &d_y, n*sizeof(float)); // copy x and y from host memory to device memory cudaMemcpy(d_x, x, n*sizeof(float), cudaMemcpyHostToDevice); cudaMemcpy(d_y, y, n*sizeof(float), cudaMemcpyHostToDevice); // invoke parallel SAXPY kernel with 256 threads / block int nblocks = (n + 255)/256; saxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y);

Example: SAXPY Kernel (continued). // invoke parallel SAXPY kernel with 256 threads / block int nblocks = (n + 255)/256; saxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y); // copy y from device (GPU) memory to host (CPU) memory cudaMemcpy(y, d_y, n*sizeof(float), cudaMemcpyDeviceToHost); // do something with the result // free device (GPU) memory cudaFree(d_x); cudaFree(d_y); return 0; }

Example: Check the Differences. void saxpy_serial(int n, float a, float* x, float* y) { for(int i = 0; i < n; i++) y[i] = a*x[i] + y[i]; } // invoke host SAXPY function saxpy_serial(n, 2.0, x, y); __global__ void saxpy(int n, float a, float* x, float* y) { int i = threadIdx.x + blockDim.x * blockIdx.x; if( i<n ) y[i] = a*x[i] + y[i]; } // invoke parallel SAXPY kernel with 256 threads / block int nblocks = (n + 255)/256; saxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y); O(n) standard C code vs. O(1) CUDA C code (in terms of sequential steps, assuming enough parallel threads). A sketch for checking the two results against each other follows below.
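One way to check the two versions against each other, as a sketch: run saxpy_serial on a host copy of y, copy the GPU result back into another array (here called y_gpu, an assumed name), and compare element by element.

    // assumes: y holds the CPU result from saxpy_serial, and y_gpu holds the
    // result copied back from the device with cudaMemcpy(..., cudaMemcpyDeviceToHost)
    int errors = 0;
    for (int i = 0; i < n; i++)
        if (fabsf(y[i] - y_gpu[i]) > 1e-5f)     // tolerate small floating-point differences
            errors++;
    printf("%s (%d mismatches)\n", errors == 0 ? "PASSED" : "FAILED", errors);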

Introduction, Contents: Multicore/Manycore and GPU; GPU on Medical Applications; Parallel Programming on GPUs: Basics (Conceptual Introduction, GPU Architecture Review); Parallel Programming on GPUs: Practice (Real programming); Conclusion. Conclusion. GPUs are still evolving toward even higher performance: stacked DRAM, NVLINK. Free lunch? GPUs continue to improve performance and power with more advanced hardware/software features. Mobile / portable medical systems & car systems: Tegra K1/X1.

Conclusion. GPUs are still evolving toward even higher performance (e.g., single-precision floating-point General Matrix Multiply, SGEMM): stacked DRAM, NVLINK. Free lunch? GPUs continue to improve performance and power with more advanced hardware/software features. Mobile / portable medical systems: Tegra K1/X1. Thanks!