Tesla Architecture, CUDA and Optimization Strategies


Tesla Architecture, CUDA and Optimization Strategies
Lan Shi, Li Yi & Liyuan Zhang
Advanced seminar (Hauptseminar): Multicore Architectures and Programming

Outline
- Tesla Architecture & CUDA
- CUDA Programming
- Optimization Strategies
- Summary


Revolutionary NVIDIA Tesla
- Massively multithreaded architecture with a 128-processor computing core
- C-language development environment for the GPU
- C870 GPU Computing Processor: one GPU (128 thread processors), 1.5 GB dedicated memory; a full-length, dual-slot card requiring one open PCI Express x16 slot
- D870 Deskside GPU Computing System: desktop enclosure (2 x C870)
- S870 GPU Computing System: 1U rack-mount chassis (4 x C870)

GPU Architecture
- Massively multithreaded parallel computing platform
- 8 Texture Processor Clusters (TPCs); 1 TPC = 2 Streaming Multiprocessors (SMs) + texture unit
- 1 SM = 8 Streaming Processors (SPs); 128 thread processors in total
- 1.35 GHz processor clock, 518 GFLOPS peak
- Parallel Data Cache accelerates processing
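A quick check on the peak figure (the dual-issue assumption of one multiply-add plus one multiply, i.e. 3 flops per SP per cycle, is mine, not stated on the slide):

    128 SPs x 1.35 GHz x 3 flops/cycle ~= 518 GFLOPS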

SM Multithreaded Multiprocessor
- 8 SP thread processors; 32 GFLOPS peak at 1.35 GHz
- IEEE 754 32-bit floating point; 32-bit integer
- 2 SFUs (Special Function Units)
- Multithreaded instruction unit
- 768 threads, hardware multithreaded: 24 SIMD warps of 32 threads
- Independent MIMD thread execution; hardware thread scheduling
- 16 KB shared memory: concurrent threads share data with low-latency load/store

SIMT Multithreaded Execution
- Warp: a set of 32 parallel threads that execute a SIMD instruction
- SM hardware implements zero-overhead warp and thread scheduling
- 768 concurrent threads = 24 warps x 32 threads
- Threads can execute independently
- Best efficiency and performance when the threads of a warp execute together
- Single-Instruction, Multiple-Thread (SIMT) execution across threads (not just SIMD across data) gives easy single-thread scalar programming with SIMD efficiency

NVIDIA CUDA
- CUDA (Compute Unified Device Architecture) enables efficient use of the massive parallelism of NVIDIA GPUs
- Direct execution of data-parallel programs, without the overhead of a graphics API
- Using CUDA on Tesla GPUs can provide large speedups on data-parallel computations straight out of the box
- Heterogeneous mixed serial-parallel programming
- Scalable hierarchical thread execution model
- Accessible: minimal but expressive changes to C

CUDA Programming Model: Grids, Blocks, and Threads
- Execute a sequence of kernels on the GPU computing device
- A kernel executes as a grid of thread blocks
- A thread block, or CTA (Cooperative Thread Array), is an array of threads that can cooperate
  - size: 1 to 512 concurrent threads
  - shape: 1D, 2D, or 3D
- Threads within the same block synchronize and share data in shared memory

CUDA: Integrated CPU + GPU Application C Program
- Serial C code executes on the CPU
- Parallel kernel C code executes on the GPU in thread blocks
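A minimal sketch of this structure (the kernel name step and the sizes are illustrative, not from the slides):

    #include <stdio.h>

    __global__ void step(float *x)                /* parallel kernel: runs on the GPU */
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        x[i] *= 2.0f;
    }

    int main()                                    /* serial C code: runs on the CPU */
    {
        float h[256];
        for (int i = 0; i < 256; i++) h[i] = (float)i;

        float *d;
        cudaMalloc((void**)&d, sizeof(h));
        cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

        step<<<2, 128>>>(d);                      /* launch: 2 blocks x 128 threads */

        cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
        cudaFree(d);
        printf("h[1] = %f\n", h[1]);              /* expect 2.0 */
        return 0;
    }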

Memory Spaces
Each thread can:
- Read/write per-thread 32-bit registers
- Read/write per-thread local memory
- Read/write per-block shared memory (on chip)
- Read/write per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory
The host can read/write global memory, constant memory, and texture memory (stored in DRAM).


CUDA Programming
- A minimal set of extensions to the C language
- A runtime library

Learning by example: matrix addition

CPU version:

    void addmatrix( float *a, float *b, float *c, int N )
    {
        int i, j, index;
        for( i = 0; i < N; i++ ) {
            for( j = 0; j < N; j++ ) {
                index = i + j * N;
                c[index] = a[index] + b[index];
            }
        }
    }

    void main()
    {
        addmatrix( a, b, c, N );
    }

CUDA version:

    __global__ void addmatrix( float *a, float *b, float *c, int N )
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        int index = i + j * N;
        if( i < N && j < N )
            c[index] = a[index] + b[index];
    }

    void main()
    {
        dim3 dimBlk( blocksize, blocksize );
        dim3 dimGrd( N / dimBlk.x, N / dimBlk.y );
        addmatrix<<<dimGrd, dimBlk>>>( a, b, c, N );
    }

Language Extensions: Function Type Qualifiers
- __global__ : kernel, callable from host
- __device__ : function callable on device
- __host__ : function callable on host (default)
Example: __device__ void trigger();

Variable Type Qualifiers
- __device__ : variable in device memory
- __constant__ : variable in constant memory
- __shared__ : variable in shared memory
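A small sketch putting the variable qualifiers together (the names are illustrative, not from the slides):

    __constant__ float coeff[16];        /* constant memory: read-only from kernels  */
    __device__   int   flag;             /* device (global) memory variable          */

    __global__ void smooth(float *out, const float *in)
    {
        __shared__ float tile[128];      /* shared memory: one copy per thread block */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];
        __syncthreads();                 /* wait until the whole tile is loaded      */
        out[i] = tile[threadIdx.x] * coeff[0];
    }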

Language Extensions: Execution Configuration
Defines the grid and blocks executed on the device. A call to a __global__ function specifies: <<< Dg, Db, Ns, S >>>
Example:

    __global__ void Function( float* parameter );
    dim3 dimGrid( 100, 50 );
    dim3 dimBlock( 4, 8, 8 );
    Function<<< dimGrid, dimBlock >>>( parameter );

Built-in Variables
- dim3 gridDim
- uint3 blockIdx
- dim3 blockDim
- uint3 threadIdx

Compilation with NVCC
nvcc compiles CUDA source files: host code follows C++ syntax rules and is handed to the host compiler, while device code is translated to PTX, the CUDA intermediate assembly.
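A typical invocation (the file name is illustrative):

    nvcc -o addmatrix addmatrix.cu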

Software Stack
- Device driver
- Application programming interface (API)
- Mathematical libraries: CUFFT and CUBLAS

Runtime Library
- Common component
- Device component
- Host component

Common Component
Built-in vector types:
- char1, uchar1, int3, long2, etc.
- Structures accessed with x, y, z, w fields:
    uint4 para;
    int y = para.y;
- dim3 is based on uint3
Texture type:
- Texture references: texture<Type, Dim, ReadMode> texRef;
Mathematical functions: sinf, powf, log, min, etc.
Time function: clock_t clock();

Device Component
Mathematical functions: pow, sin, cos, tan, etc.
Synchronization functions: void __syncthreads();
Texture functions:
- Type tex1Dfetch( texture<Type, 1, cudaReadModeElementType> texRef, int x );
Atomic functions: atomicAdd(), etc.
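A small sketch of an atomic function in use (a hypothetical counting kernel, not from the slides; note that global-memory atomics require compute capability 1.1, so they are unavailable on the original compute-1.0 G80 parts):

    __global__ void count(int *counter, const int *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && data[i] > 0)
            atomicAdd(counter, 1);   /* serialized read-modify-write on global memory */
    }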

Host Component
- Device management
- Context management
- Memory management
- Code module management
- Execution control
- Texture reference management
- Interoperability with OpenGL and Direct3D

Host Component: Memory Management

    float data[ 256 ];
    int size = sizeof( data );
    float* devPtr;
    cudaMalloc( (void**)&devPtr, size );
    cudaMemcpy( devPtr, data, size, cudaMemcpyHostToDevice );
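The reverse direction and cleanup are symmetric (a natural completion of the slide's snippet, not shown on it):

    cudaMemcpy( data, devPtr, size, cudaMemcpyDeviceToHost );
    cudaFree( devPtr );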


Optimization Strategies
- Maximize parallel execution
- Optimize memory usage
- Optimize instruction usage

Maximize Parallel Execution
More blocks per SM, more threads per block. Limiting factors:
- Number of registers used per kernel (8192 per SM)
- Amount of shared memory (16 KB per SM)
- At most 8 blocks per SM
Choose the number of threads per block as a multiple of the warp size, at least 64; 192 or 256 threads per block usually works best (768 threads per SM maximum). A worked example follows.
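Working through these limits (my arithmetic, from the figures above): with 256 threads per block, full occupancy means 768 / 256 = 3 blocks per SM; the register file then allows at most 8192 / 768 ~= 10 registers per thread, and shared memory allows about 16 KB / 3 ~= 5.3 KB per block.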

Optimize Memory Usage: Data Transfers
Host-to-device transfers have the lowest bandwidth:
- Minimize data transfers between the host and the device
- Group many small transfers into one large transfer
Shared memory is hundreds of times faster than global memory (400-600 cycles per access):
- Minimize traffic to global memory by using on-chip shared memory
- The typical programming pattern is to stage data from global memory into shared memory
- Best of all: avoid the transfer entirely by recomputing the data where possible
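A minimal sketch of the staging pattern (an illustrative 1D smoothing kernel of my own; assumes blockDim.x == 256 and interior threads only):

    __global__ void blur(float *out, const float *in)
    {
        __shared__ float s[256];                       /* on-chip staging buffer       */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        s[threadIdx.x] = in[i];                        /* one global read per thread,  */
        __syncthreads();                               /* then each value is reused    */
        if (threadIdx.x > 0 && threadIdx.x < 255)      /* three times from shared mem  */
            out[i] = 0.5f * s[threadIdx.x]
                   + 0.25f * (s[threadIdx.x - 1] + s[threadIdx.x + 1]);
    }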

Optimize Memory Usage: Memory Accesses
Global memory:
- A single instruction can read 4-byte, 8-byte, or 16-byte words
- Most efficient: memory accesses by the threads of a half-warp can be coalesced into a single memory transaction (32 bytes, 64 bytes, or 128 bytes)
[Figure: coalesced vs. non-coalesced access patterns]
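A sketch of the two patterns (hypothetical kernels; on this hardware the k-th thread of a half-warp must access the k-th word of an aligned segment for the accesses to coalesce):

    __global__ void coalesced(float *out, const float *in)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];           /* consecutive threads, consecutive addresses:
                                     one transaction per half-warp               */
    }

    __global__ void strided(float *out, const float *in, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i * stride];  /* stride > 1 scatters the half-warp:
                                     separate transactions, much lower bandwidth */
    }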

Optimize Memory Usage: Matrix Transpose
[Figure: naive transpose vs. transpose staged through shared memory]
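The slide itself is a figure; below is a minimal sketch of the pattern it illustrates, in which a tile is staged through shared memory so that both the global read and the global write are coalesced (the extra column of padding avoids the bank conflicts discussed on the next slide; launch with 16 x 16 thread blocks covering the matrix):

    #define TILE 16

    __global__ void transpose(float *out, const float *in, int width, int height)
    {
        __shared__ float tile[TILE][TILE + 1];   /* +1 pad: column accesses hit distinct banks */

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];    /* coalesced read  */
        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;     /* transposed block coordinates */
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.x][threadIdx.y];  /* coalesced write */
    }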

Optimize Memory Usage: Shared Memory
- Shared memory is organized into 16 banks
- Bank conflicts can only occur within a half-warp; there is no conflict between the two halves of a warp

Optimize Instruction Usage
Memory instructions:
- An instruction issues in 4 clock cycles
- Global memory latency (400-600 cycles) can be hidden by the thread scheduler
Arithmetic instructions:
- Minimize the use of arithmetic instructions with low throughput
- Trade precision for speed where acceptable
Control flow instructions can lead to divergent branching. Key: make the branch condition depend only on threadIdx / WSIZE (the warp index); then there is no divergence within a warp, as the sketch below shows.
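A sketch contrasting the two cases (WSIZE = 32, the warp size; kernel names are mine):

    __global__ void uniform(float *x)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((threadIdx.x / 32) % 2 == 0)   /* same result for all 32 threads of a
                                              warp: no divergence                  */
            x[i] *= 2.0f;
        else
            x[i] *= 0.5f;
    }

    __global__ void divergent(float *x)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0)          /* alternates inside each warp: both
                                              paths execute serially              */
            x[i] *= 2.0f;
        else
            x[i] *= 0.5f;
    }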

Summary

References
1. NVIDIA CUDA Programming Guide, Version 2.0
2. NVIDIA CUDA Optimization Strategies
3. Wikipedia
4. http://www.gpgpu.org/asplos2008/
5. NVIDIA Tesla: A Unified Graphics and Computing Architecture, IEEE Micro, March-April 2008

Thank you for your attention!