High Performance Linear Algebra on Data Parallel Co-Processors I


1 High Performance Linear Algebra on Data Parallel Co-Processors I

2 A bit of history [Chart: GFLOPS over time, 2003-2006, for NVIDIA GPUs NV30, NV35, NV40, G70, G70-512, G71, G80 (approaching 300 GFLOPS) versus an Intel Core2 Duo 3.0 GHz. Original data: Mark Harris (NVIDIA), 2007.]

3-8 A bit of history [Diagram, built up across these slides: the classic graphics pipeline. The CPU feeds the GPU, where data flows through the Vertex Processor, Raster Unit, and Fragment Processor to the Render Target, all backed by Video Memory.]

9 Modern graphics card [Block diagram: Host, Input Assembler, Setup / Raster / ZCull, and Thread Scheduler feeding an array of multiprocessors, each with shared memory (SMEM) and an L1 cache, connected to L2 caches and memory partitions (MEM).]

10 Data parallel co-processors [Same block diagram with the graphics-specific stages removed: Host, Input Assembler, Thread Scheduler, multiprocessors with SMEM and L1, L2 caches, memory partitions.]

11 Array of multiprocessors The analogue of a multi-core chip, with each multiprocessor corresponding to one core, but the cores are vectorized. Task parallel execution with independent program counters. Between 2 and 16 multiprocessors on one chip.

12-23 Multiprocessor [Diagram, built up across these slides: a multiprocessor as a data parallel processing unit, with thread blocks and warps mapped onto it. The accompanying definitions are collected on slide 24.]

24 Multiprocessor Data parallel processing unit. Thread block: set of threads that is logically processed together on a multiprocessor. Thread warp: set of threads that is physically processed together on a multiprocessor. Width of the processor, typically 16 or 32. Thread: processes one data element. Extremely lightweight.
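To make the block/warp/thread hierarchy concrete, here is a minimal sketch (not from the slides; kernel and array names are chosen for illustration) of how a kernel derives its warp and lane indices from the built-in thread identifiers, assuming a warp width of 32:

__global__ void warp_info(int* lane_out, int* warp_out)
{
    int tid  = threadIdx.x + blockIdx.x * blockDim.x;  // global thread index (1D launch)
    int warp = threadIdx.x / 32;                       // warp index within the thread block
    int lane = threadIdx.x % 32;                       // lane (position) within the warp
    lane_out[tid] = lane;
    warp_out[tid] = warp;
}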

25 Data parallel co-processors [Block diagram repeated: Host, Input Assembler, Thread Scheduler, multiprocessors with SMEM and L1, L2 caches, memory partitions.]

26 Data parallel co-processors [Same diagram with the memory partitions merged into a single logical Global Memory.]

27-31 Memory vs. Execution [Built up across these slides into the following correspondence between execution units and memory levels:]
Task parallel array of multiprocessors  ->  Global memory: visible to host and device, very high latency
Thread block per multiprocessor         ->  Local memory (shared memory / L1): thread block only, low latency
Warp of threads / single thread         ->  Registers: thread only, very low latency
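As a hedged illustration (not from the slides; names are hypothetical), the three levels appear directly in CUDA source: global memory is reached through pointers passed in from the host, shared memory is declared per thread block, and ordinary local variables live in registers:

__global__ void memory_levels(const float* g_in, float* g_out)
{
    // Global memory: g_in and g_out, visible to host and device.
    __shared__ float tile[256];                        // shared memory: one copy per thread block
                                                       // (assumes blockDim.x <= 256)
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float r = g_in[tid];                               // 'r' lives in a register
    tile[threadIdx.x] = r;
    __syncthreads();
    g_out[tid] = 2.0f * tile[threadIdx.x];
}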

32 Thread level parallelism Data

33 Thread level parallelism Data

34 Thread level parallelism Data

35 Thread level parallelism Data

36 Thread level parallelism Data

37 Thread level parallelism Data

38 Thread level parallelism Data

39 Thread level parallelism Data

40 Thread level parallelism Data

41 Thread level parallelism http://www.cs.berkeley.edu/~volkov/volkov10-gtc.pdf

42 Instruction Level Parallelism... c = c * b + a;

43 Instruction Level Parallelism... c = c * b + a; d = d * b + a;

44 Instruction Level Parallelism... c = c * b + a; d = d * b + a; e = e * b + a;...

45 Instruction Level Parallelism... c = c * b + a; d = d * b + a; e = e * b + a;... Instructions are non-blocking; only dependent operations stall the multiprocessor.
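A minimal sketch (hypothetical kernel, not from the slides) of how independent update chains expose instruction level parallelism, while a single chain serializes on its own results:

__global__ void ilp_demo(float a, float b, float* out)
{
    float c = out[threadIdx.x];
    float d = c + 1.0f;
    float e = c + 2.0f;
    // Independent chains: the three multiply-adds per iteration can be issued
    // back to back, since none of them waits on the others' results.
    for (int i = 0; i < 64; ++i) {
        c = c * b + a;
        d = d * b + a;
        e = e * b + a;
    }
    // A single chain (c alone) would force each iteration to wait for the previous one.
    out[threadIdx.x] = c + d + e;
}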

46 Instruction Level Parallelism http://www.cs.berkeley.edu/~volkov/volkov10-gtc.pdf

47-48 Memory latency [Latency measurement figures.] Volkov, V., and J. W. Demmel. Benchmarking GPUs to Tune Dense Linear Algebra. 2008.

49 Ideal program flow 1. Load data from global to shared memory. 2. Process data in shared memory. 3. Write data from shared to global memory.

50 CUDA Compute Unified Device Architecture. Proprietary solution by NVIDIA. But very similar to OpenCL. Low-level software interface based on ANSI C. Explicit parallelization, memory handling,... Latest hardware has full C++ support.

51-54 CUDA execution model A program is executed as a grid of thread blocks. Each thread block and each thread has a unique {1,2,3}D identifier. [Figure, built up across these slides: a grid of thread blocks, each block an array of threads.]
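As a hedged sketch (identifiers chosen for illustration), the built-in variables give each thread its position in the grid; for a 2D launch a global coordinate can be computed like this:

__global__ void index_demo(float* img, int width, int height)
{
    // 2D block and thread identifiers combine into a global pixel coordinate.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)          // guard against partial blocks at the edges
        img[y * width + x] = 0.0f;
}

A matching launch could look like: dim3 block(16, 16); dim3 grid((width + 15) / 16, (height + 15) / 16); index_demo<<<grid, block>>>(d_img, width, height);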

55 CUDA program structure Kernel: function executed on the device that can be called from the host. Host program: Device and memory initialization and execution management. C/C++/Fortran program.

56 CUDA Host Extensions Kernel invocation mykernel<<< grid_dim, block_dim >>>(in, out) Memory management cudaMalloc(), cudaMemcpy(), cudaFree(),... Device management cudaGetDeviceCount(), cudaSetDevice(),...
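A small hedged host-side sketch of the device management calls named above (helper name is hypothetical, error handling omitted):

#include <cuda_runtime.h>

// Pick the first available CUDA device; a minimal sketch, no error checking.
void select_first_device()
{
    int device_count = 0;
    cudaGetDeviceCount(&device_count);  // number of visible CUDA devices
    if (device_count > 0)
        cudaSetDevice(0);               // bind this host thread to device 0
}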

57 CUDA Device Extensions Memory declaration __shared__ float smem[1024] Synchronisation (barrier) __syncthreads() Atomic operations atomicAdd(), atomicExch(), atomicXor(),... Thread identifiers threadIdx, blockIdx, blockDim,...
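A hedged sketch (hypothetical kernel, not from the slides) combining these device-side extensions: each block reduces its values in shared memory and one thread per block publishes the partial sum with an atomic add. It assumes blockDim.x is 256 and a power of two, and that the GPU supports atomicAdd on float.

__global__ void block_sum(const float* in, float* total, int n)
{
    __shared__ float smem[256];                         // assumes blockDim.x == 256
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    smem[threadIdx.x] = (tid < n) ? in[tid] : 0.0f;
    __syncthreads();                                    // barrier: shared memory is filled

    // Tree reduction within the thread block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            smem[threadIdx.x] += smem[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicAdd(total, smem[0]);                      // one atomic update per block
}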

58 CUDA Program Flow 1. Upload data to process into device memory. 2. Define execution environment. 3. Launch kernel. Read input data into shared memory. Process data. Write result from shared to global memory. 4. Read result back to host memory.

59-60 CUDA Example Simple 1D convolution: a[x] = b[x-1] + 3.0*b[x] + b[x+1]. Assume infinite signals and ignore the necessary block overlap.

61-66 CUDA Example [Figure, built up across these slides: the input signal b is split into blocks; each output element a[x] is computed from its neighbours b[x-1], b[x], b[x+1].]

67-70 CUDA Example [Kernel code built up line by line across these slides; the complete kernel is on slide 71.]

71 CUDA Example
__global__ void conv(float* in, float* out)
{
    __shared__ float smem[1024];

    unsigned int tid = threadIdx.x + blockDim.x * blockIdx.x;
    smem[threadIdx.x] = in[tid];
    __syncthreads();

    float res = smem[threadIdx.x - 1];
    res += 3.0f * smem[threadIdx.x];
    res += smem[threadIdx.x + 1];

    out[tid] = res;
}
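The kernel above reads smem[threadIdx.x - 1] and smem[threadIdx.x + 1], which fall outside the shared array at the block borders; the slides deliberately ignore this overlap. As a hedged sketch (not part of the original example, with zero padding at the signal ends instead of an infinite signal), a halo-aware variant would load one extra element on each side:

__global__ void conv_halo(const float* in, float* out, int n)
{
    // Shared buffer with a one-element halo on each side (assumes blockDim.x <= 1022).
    __shared__ float smem[1024];
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    int s   = threadIdx.x + 1;                    // shifted index into smem

    smem[s] = (tid < n) ? in[tid] : 0.0f;
    if (threadIdx.x == 0)                         // left halo element
        smem[0] = (tid > 0) ? in[tid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)            // right halo element
        smem[s + 1] = (tid + 1 < n) ? in[tid + 1] : 0.0f;
    __syncthreads();

    if (tid < n)
        out[tid] = smem[s - 1] + 3.0f * smem[s] + smem[s + 1];
}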

72-75 CUDA Example [Host code built up across these slides; the complete host function is on slide 76.]

76 CUDA Example
__host__ void run()
{
    // read input data ...
    uint mem_size = signal_size * sizeof(float);

    float *d_in, *d_out;
    cudaMalloc((void**) &d_in, mem_size);
    cudaMalloc((void**) &d_out, mem_size);
    cudaMemcpy(d_in, signal, mem_size, cudaMemcpyHostToDevice);

    // assume signal_size > 512
    dim3 threads, blocks;
    threads = 512;
    blocks  = (signal_size / 512) + 1;

    // run kernel
    conv<<<blocks, threads>>>(d_in, d_out);
    cudaThreadSynchronize();

    // copy result back to the host array 'result'
    cudaMemcpy(result, d_out, mem_size, cudaMemcpyDeviceToHost);
}
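Two hedged remarks that go slightly beyond the slides: every runtime call returns a cudaError_t that is worth checking, and the device buffers should eventually be released with cudaFree. A minimal pattern (helper name is hypothetical):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Print and abort on any CUDA runtime error.
static void check(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Usage inside run():
//   check(cudaMalloc((void**) &d_in, mem_size), "cudaMalloc d_in");
//   ...
//   check(cudaMemcpy(result, d_out, mem_size, cudaMemcpyDeviceToHost), "copy back");
//   cudaFree(d_in); cudaFree(d_out);   // release device buffers when done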

77 Final Remarks Data parallel co-processor: a task parallel array of data parallel multiprocessors. Efficient programs require keeping the hardware architecture in mind: - Execution model and organization of threads. - Memory hierarchy and latencies.