Lecture 1: Introduction and Computational Thinking


PASI Summer School: Advanced Algorithmic Techniques for GPUs. Lecture 1: Introduction and Computational Thinking

Course Objective

To master the most commonly used algorithm techniques and computational thinking skills needed for many-core GPU programming (especially the simple ones!). In particular, to understand:
- Many-core hardware limitations and constraints
- Desirable and undesirable computation patterns
- Commonly used algorithm techniques to convert undesirable computation patterns into desirable ones

Performance Advantage of GPUs

An enlarging peak performance advantage (GPU vs. CPU):
- Calculation: 1 TFLOPS vs. 100 GFLOPS
- Memory bandwidth: 100-150 GB/s vs. 32-64 GB/s
(Courtesy: John Owens)

A GPU in every PC and workstation means massive volume and potential impact.

CPUs and GPUs have fundamentally different design philosophies.

[Figure: block diagrams contrasting a CPU (large control logic and cache, a few ALUs) with a GPU (many small ALUs), each attached to its own DRAM.]

UIUC/NCSA AC Cluster (http://iacat.uiuc.edu/resources/cluster/)

- 32 nodes, each a 4-GPU (GTX280, Tesla) node; GPUs donated by NVIDIA, host boxes funded by NSF
- Coulomb summation: 1.78 TFLOPS/node, a 271x speedup vs. one Intel QX6700 CPU core with SSE
- A partnership between NCSA and UIUC academic departments

EcoG - One of the Most Energy-Efficient Supercomputers in the World

- #3 on the November 2010 Green500 list
- 128 nodes, one Fermi GPU per node
- 934 MFLOPS/Watt; 33.6 TFLOPS double-precision Linpack
- Built by Illinois students and NVIDIA researchers

GPU computing is catching on.

Application areas include: financial analysis, scientific simulation, engineering simulation, data-intensive analytics, medical imaging, digital audio processing, digital video processing, computer vision, biomedical informatics, electronic design automation, statistical modeling, ray tracing, rendering, interactive physics, and numerical methods.

280 submissions to GPU Computing Gems; 110 articles included in two volumes.

A Common GPU Usage Pattern

- Start from a desirable approach once considered impractical due to excessive computational requirements, but demonstrated to achieve domain benefit: convolution filtering (e.g., bilateral Gaussian filters), de novo gene assembly, etc.
- Use GPUs to accelerate the most time-consuming aspects of the approach: kernels in CUDA or OpenCL; refactor host code to better support the kernels
- Rethink the domain problem

CUDA/OpenCL Execution Model

An integrated host + device application written in C:
- Serial or modestly parallel parts in host C code
- Highly parallel parts in device SPMD kernel C code

Execution alternates between host and device:

Serial Code (host)
Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args); ...
Serial Code (host)
Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args); ...
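A minimal sketch of this pattern in CUDA C (the kernel names, arguments, and launch configuration below are illustrative, not from the slides):

    __global__ void KernelA(float *data, int n) { /* highly parallel phase A */ }
    __global__ void KernelB(float *data, int n) { /* highly parallel phase B */ }

    void run(float *d_data, int n)
    {
        int nTid = 256;                      // threads per block
        int nBlk = (n + nTid - 1) / nTid;    // enough blocks to cover n elements
        // ... serial host code ...
        KernelA<<<nBlk, nTid>>>(d_data, n);  // SPMD kernel on the device
        cudaDeviceSynchronize();             // kernel launches are asynchronous
        // ... more serial host code ...
        KernelB<<<nBlk, nTid>>>(d_data, n);
        cudaDeviceSynchronize();
    }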

CUDA Devices and Threads

A compute device:
- Is a coprocessor to the CPU (host)
- Has its own DRAM (device memory)
- Runs many threads (called work-items in OpenCL) in parallel
- Is typically a GPU but can also be another type of parallel processing device

Data-parallel portions of an application are expressed as device kernels that run on many threads.

Differences between GPU and CPU threads:
- GPU threads are extremely lightweight, with very little creation overhead
- A GPU needs thousands of threads for full efficiency, while a multi-core CPU needs only a few

Arrays of Parallel Threads

A CUDA kernel is executed by an array of threads:
- All threads run the same code (SPMD)
- Each thread has an index that it uses to compute memory addresses and make control decisions

Threads 0 1 2 3 4 5 6 7, each executing:

    float a = input[threadIdx.x];
    float b = func(a);
    output[threadIdx.x] = b;

Thread Blocks: Scalable Cooperation

Divide the monolithic thread array into multiple blocks:
- Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization (see the sketch after this slide)
- Threads in different blocks cannot cooperate

Thread Block 0, Thread Block 1, ..., Thread Block N-1: each block contains threads 0..7, and each thread executes:

    float a = input[threadIdx.x];
    float b = func(a);
    output[threadIdx.x] = b;
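To make the cooperation concrete, here is a minimal sketch (not from the slides) in which the threads of a block stage data in shared memory and synchronize at a barrier before reading each other's values; the kernel name and the block-reversal operation are illustrative:

    #define BLOCK_SIZE 8

    // Each block reverses its own BLOCK_SIZE-element segment of the input.
    __global__ void reverseWithinBlock(float *input, float *output)
    {
        __shared__ float tile[BLOCK_SIZE];      // visible only within one block
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = input[i];           // each thread loads one element
        __syncthreads();                        // barrier: the whole tile is loaded

        // Cooperation: read a value written by another thread of the same block.
        output[i] = tile[blockDim.x - 1 - threadIdx.x];
    }

    // Launch with blockDim.x == BLOCK_SIZE, e.g.:
    //   reverseWithinBlock<<<numBlocks, BLOCK_SIZE>>>(d_in, d_out);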

blockIdx and threadIdx

Each thread uses indices to decide what data to work on:
- blockIdx: 1D or 2D
- threadIdx: 1D, 2D, or 3D

This simplifies memory addressing when processing multidimensional data, e.g., image processing or solving PDEs on volumes.

[Figure: the host launches Kernel 1 on Grid 1, a 2x2 arrangement of blocks (0,0) through (1,1), and Kernel 2 on Grid 2; Block (1,1) of Grid 2 is shown expanded into a 4x2x2 arrangement of threads (0,0,0) through (3,1,1). Courtesy: NVIDIA]
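As an illustration of why multidimensional indices help (the kernel, image layout, and brighten operation below are assumptions, not from the slides):

    // 2D block/thread indices map naturally onto image pixels.
    __global__ void brighten(float *img, int width, int height, float delta)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
        if (x < width && y < height)
            img[y * width + x] += delta;                 // row-major access
    }

    // Launched with a 2D grid of 2D blocks, e.g.:
    //   dim3 block(16, 16);
    //   dim3 grid((width + 15) / 16, (height + 15) / 16);
    //   brighten<<<grid, block>>>(d_img, width, height, 0.1f);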

Example: Vector Addition Kernel (Device Code)

    // Compute vector sum C = A + B
    // Each thread performs one pair-wise addition
    __global__ void vecAdd(float* A, float* B, float* C, int n)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n) C[i] = A[i] + B[i];
    }

    int main()
    {
        // Run ceil(n/256) blocks of 256 threads each
        vecAdd<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
    }

Example: Vector Addition Kernel (Host Code)

    // Compute vector sum C = A + B
    // Each thread performs one pair-wise addition
    __global__ void vecAdd(float* A, float* B, float* C, int n)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n) C[i] = A[i] + B[i];
    }

    int main()
    {
        // Run ceil(n/256) blocks of 256 threads each
        vecAdd<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
    }
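The slides assume d_A, d_B, and d_C already hold device copies of the vectors. A minimal sketch of that omitted setup, using standard CUDA runtime calls (error checking elided):

    int main()
    {
        int n = 1 << 20;                       // one million elements (assumed)
        size_t size = n * sizeof(float);
        float *h_A = (float*)malloc(size);
        float *h_B = (float*)malloc(size);
        float *h_C = (float*)malloc(size);
        // ... initialize h_A and h_B ...

        float *d_A, *d_B, *d_C;
        cudaMalloc(&d_A, size); cudaMalloc(&d_B, size); cudaMalloc(&d_C, size);
        cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

        vecAdd<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);

        cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
        free(h_A); free(h_B); free(h_C);
        return 0;
    }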

Kernel execution in a nutshell

    __host__ void example(int n, float a, float *x, float *y)
    {
        int B = 128, P = ceil(n/(float)B);
        saxpy<<<P, B>>>(n, a, x, y);
    }

    __global__ void saxpy(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

[Figure: the kernel's thread blocks Blk 0 ... Blk p-1 are scheduled onto the GPU's multiprocessors M0 ... Mk, each with its own RAM.]

Harvesting the Performance Benefit of Many-core GPUs Requires

- Massive parallelism in application algorithms: data parallelism
- Regular computation and data accesses: similar work for parallel threads
- Avoidance of conflicts over critical resources: off-chip DRAM (global memory) bandwidth; conflicting parallel updates to memory locations

Massive Parallelism - Regularity

Main Hurdles to Overcome

- Serialization due to conflicting use of critical resources
- Oversubscription of global memory bandwidth
- Load imbalance among parallel threads

Computational Thinking Skills

The ability to translate/formulate domain problems into computational models that can be solved efficiently by available computing resources:
- Understanding the relationship between the domain problem and the computational models
- Understanding the strengths and limitations of the computing devices
- Defining problems and models to enable efficient computational solutions

DATA ACCESS CONFLICTS

Conflicting Data Accesses Cause Serialization and Delays

- Massively parallel execution cannot afford serialization
- Contention in accessing critical data causes serialization

A Simple Example

A naïve inner-product algorithm for two vectors of one million elements each:
- All multiplications can be done in one time unit (in parallel)
- The additions into a single accumulator take one million time units (serial)

[Figure: one million parallel multiplications feeding a serial chain of additions into a single accumulator over time.]
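In CUDA terms, the conflict looks like the sketch below (illustrative, not from the slides): every thread adds its product into one shared accumulator, and although atomicAdd keeps the result correct, the one million updates to *sum are serialized by the hardware. (Float atomicAdd requires compute capability 2.0 or later, and *sum must be zeroed before launch.)

    __global__ void dotNaive(const float *a, const float *b, float *sum, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(sum, a[i] * b[i]);   // all threads contend for one address
    }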

How much can conflicts hurt? Amdahl's Law

If fraction X of a computation is serialized, the speedup cannot be more than 1/X. In the previous example, X = 50%: half the calculations are serialized, so no more than a 2x speedup is possible, no matter how many computing cores are used.
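In symbols, with serialized fraction X and N cores (a standard statement of Amdahl's Law, added here for reference):

    \text{speedup}(N) = \frac{1}{X + (1 - X)/N} \le \frac{1}{X},
    \qquad X = 0.5 \;\Rightarrow\; \text{speedup} \le 2.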

GLOBAL MEMORY BANDWIDTH

Global Memory Bandwidth

[Figure: "Ideal" vs. "Reality" diagrams of global memory access.]

Global Memory Bandwidth

Many-core processors have limited off-chip memory access bandwidth compared to peak compute throughput. For Fermi:
- 1 TFLOPS peak single-precision (SPFP) throughput
- 0.5 TFLOPS peak double-precision (DPFP) throughput
- 144 GB/s peak off-chip memory access bandwidth: 36 G SPFP operands per second, or 18 G DPFP operands per second

To achieve peak throughput, a program must perform 1000/36 ≈ 28 SPFP (14 DPFP) arithmetic operations for each operand value fetched from off-chip memory.
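The arithmetic behind those operand rates, assuming 4-byte SPFP and 8-byte DPFP operands:

    \frac{144\ \text{GB/s}}{4\ \text{B/operand}} = 36\ \text{G SPFP operands/s},
    \qquad \frac{1000\ \text{GFLOPS}}{36\ \text{G operands/s}} \approx 28\ \text{SPFP ops per operand fetched}.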

LOAD BALANCE

Load Balance

The total amount of time to complete a parallel job is limited by the thread that takes the longest to finish.

[Figure: "good" (evenly sized per-thread workloads) vs. "bad" (uneven workloads with one long straggler).]

How bad can it be?

Assume a job takes 100 units of time for one person to finish.
- If we break the job into 10 parts of 10 units each and have 10 people do it in parallel, we get a 10x speedup.
- If we break the job into parts of 50, 10, 5, 5, 5, 5, 5, 5, 5, 5 units, the same 10 people will take 50 units to finish, with 9 of them idle most of the time. We get no more than a 2x speedup.

How does imbalance come about?

Non-uniform data distributions:
- Highly concentrated spatial data areas, e.g., in astronomy, medical imaging, computer vision, rendering, ...
- If each thread processes the input data of a given spatial volume unit, some threads will do a lot more work than others.

Eight Algorithmic Techniques (so far)

http://courses.engr.illinois.edu/ece598/hk/
GPU Computing Gems, Vol. 1 and 2

You can do it.

- Computational thinking is not as hard as you may think it is.
- Most techniques have been explained, when explained at all, at the level of computer experts.
- The purpose of this course is to make them accessible to domain scientists and engineers.

ANY MORE QUESTIONS?