CUDA Programming Model
Chaithanya Gadiyam, Swapnil S Jadhav

CMPE655 - Multiple Processor Systems, Fall 2015, Rochester Institute of Technology

Contents
- What is GPGPU? What's the need?
- CUDA-Capable GPU Architecture
- CUDA Programming Model
- Advanced Features in CUDA 6.0 Onwards: Unified Memory, Dynamic Parallelism
- Example: K-Means Clustering
- Benefits of CUDA
- Restrictions of CUDA
- Conclusion
- References

What is GPGPU? What's the need?
- GPGPU: General-Purpose computing on Graphics Processing Units
- Accelerates the compute-intensive path of applications
- Best leveraged by data-parallel algorithms: fine-grained SIMD parallelism, fast floating-point computation
- Exciting supercomputing applications: molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products
- Good candidates exhibit various granularities of parallelism, have no hindrance to a parallel implementation, and allow careful, efficient data delivery

Computation Complexity Support
(figure slide)

More Transistors!
- CPUs and GPUs follow different design philosophies
- The GPU devotes more of its transistors to processing data in parallel, and fewer to caching and control logic

GPU as Co-processor
(figure slide)

CUDA-Capable GPU Architecture
(figure slide)

CUDA Programming Model
- Supports massive multithreading with an easily scalable model
- Code runs on the GPU, a co-processor to the CPU, with many threads executing in parallel
- Threads are extremely lightweight, with very low creation time
- Kernels: the data-parallel portions of an application
- Key concepts: thread hierarchy, memory hierarchy, compute capability
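These ideas can be sketched in a minimal kernel (a hedged example; the name `addOne` and the launch parameters are illustrative, not from the slides):

```cuda
#include <cstdio>

// Kernel: the data-parallel portion of the application.
// Each lightweight thread processes one element.
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1024;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    // Launch many threads at once; the runtime schedules the blocks.
    addOne<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```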

Threads, Blocks, and Grids
- Work is logically partitioned into threads (with thread IDs), blocks of threads (with block IDs and block dimensions), and a grid of blocks (with grid dimensions)
- Threads within a block, and blocks within a grid, can each be arranged in a 1D, 2D, or 3D logical layout
- Each level is limited by the physical resources available
- All threads follow the SPMD model: the same code runs on every thread, operating on different data
- Threads in the same block can share data; threads in different blocks cannot
- Threads are scheduled in warps (groups of 32 threads)
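A common idiom built from these IDs is deriving a unique global index per thread. A sketch using a 2D grid of 2D blocks (the layout and names here are illustrative):

```cuda
// Each thread computes its own (x, y) from block and thread IDs.
__global__ void fill(float *out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)          // guard against partial blocks
        out[y * width + x] = (float)(y * width + x);
}

// Host-side launch with dim3 block and grid dimensions:
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// fill<<<grid, block>>>(d_out, width, height);
```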

Memory Model Hierarchy
- Registers: allocated per thread; each thread can read/write
- Local memory: allocated per thread; each thread can read/write
- Shared memory: allocated per block; each thread in the block can read/write
- Global memory: common to all threads in a grid; each thread can read/write
- Constant memory: common to all threads in a grid; read-only
- Texture memory: read-only, cached on chip
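A sketch of the hierarchy in use: per-block shared memory staging data from global memory, with per-thread indices held in registers (a simple block-wise array reversal; names are illustrative):

```cuda
#define TILE 256

__global__ void reverseBlock(const float *in, float *out, int n) {
    __shared__ float tile[TILE];              // shared: one copy per block
    int i = blockIdx.x * TILE + threadIdx.x;  // registers: per thread
    if (i < n) tile[threadIdx.x] = in[i];     // global -> shared
    __syncthreads();                          // block-wide barrier
    int j = blockIdx.x * TILE + (TILE - 1 - threadIdx.x);
    if (i < n && j < n) out[j] = tile[TILE - 1 - threadIdx.x]; // shared -> global
}
```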

Streaming Multiprocessors
- The GPU is built as a scalable array of Streaming Multiprocessors (SMs)
- Each SM is independent and is bounded by limits on resident thread and block counts
- Each SM contains multiple streaming processors
- An SM executes threads one warp at a time

Automatic Scalability
- The hardware is free to assign blocks to any processor at any time
- As a result, a kernel scales transparently across any number of parallel processors

CUDA Compilation and PTX
- Any source file containing CUDA language extensions must be compiled with NVCC
- NVCC splits the source: C++ host code goes to the host C/C++ compiler, while GPU device functions go to proprietary NVIDIA compilers/assemblers
- Device code sent from the CPU to the GPU is in Parallel Thread Execution (PTX) form
- The graphics driver converts PTX into an executable binary for the target GPU
- NVCC embeds the compiled GPU functions as load images in the host object file
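A typical invocation of this pipeline, assuming a single source file `kernel.cu` (the file name and optimization flag are illustrative):

```shell
# NVCC splits kernel.cu: host code goes to the host C++ compiler,
# device code is compiled to PTX and embedded in the executable.
nvcc -O2 -o app kernel.cu

# Emit the PTX intermediate form of the device functions for inspection.
nvcc -ptx kernel.cu -o kernel.ptx
```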

Processing Flow on CUDA
(figure slide: input data is copied from host to device memory, the kernel is launched and executed on the GPU, and results are copied back to the host)

Advanced Features in CUDA 6.0 Onwards
- Unified Memory
- Dynamic Parallelism
- Hyper-Q
- GPUDirect

Unified Memory
- Earlier: separate memories for CPU and GPU, meaning a lot of communication overhead and more program complexity
- With Unified Memory: a single memory space shared between CPU and GPU, less communication overhead, no need for deep copies of structured data, and simpler programming
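A sketch of the difference using `cudaMallocManaged` (available from CUDA 6.0; the kernel and sizes are illustrative):

```cuda
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *x;
    // One allocation visible to both CPU and GPU:
    // no explicit cudaMemcpy, no deep copies of structured data.
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;  // CPU writes directly
    scale<<<(n + 255) / 256, 256>>>(x, n);     // GPU reads/writes same pointer
    cudaDeviceSynchronize();                   // then the CPU can read the results
    cudaFree(x);
    return 0;
}
```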

Dynamic Parallelism
- Parallel work can generate more parallel work on the device itself
- A parent kernel creates child kernels, dividing the work further without returning to the CPU
- Enables better load balancing
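A minimal sketch of a parent kernel launching a child kernel from the device (requires a GPU of compute capability 3.5 or higher and compiling with relocatable device code, e.g. `nvcc -rdc=true`; names are illustrative):

```cuda
__global__ void child(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

__global__ void parent(float *data, int n) {
    // One thread decides how much further work is needed
    // and launches it directly from the device.
    if (threadIdx.x == 0 && blockIdx.x == 0)
        child<<<(n + 255) / 256, 256>>>(data, n);
}
```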

Example: K-Means Clustering
- Classifies millions of data points into a given number of classes
- Each point is assigned to the class with the nearest mean, i.e. the smallest distance from a centroid, computed for every point
- Performance comparison: Fermi-architecture GPU (GeForce GTX 480), ~35 million data points, 768 threads with 1D blocks; speedups of 5x
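The nearest-centroid assignment step maps naturally to one thread per point. A sketch, not the authors' implementation (1D data for brevity; the assumption that the centroids fit in constant memory is mine):

```cuda
__constant__ float centroids[64];  // assumed: k <= 64 centroids

__global__ void assign(const float *points, int *labels, int n, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float best = 1e30f;
    int bestC = 0;
    for (int c = 0; c < k; ++c) {            // nearest-mean search per point
        float d = points[i] - centroids[c];
        d *= d;
        if (d < best) { best = d; bestC = c; }
    }
    labels[i] = bestC;
}
```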

Advantages of CUDA
- Coarse-grained thread blocks map naturally to separate processor cores
- Fine-grained threads map to multiple thread contexts
- Easy to scale with increasing parallel resources in the system
- Easy to transform serial programs into parallel CUDA programs
- Fast shared memory, used as a software-managed cache, provides substantial performance improvements
- Supports graphical applications through texture memory hardware

Restrictions of CUDA
- Blocks cannot communicate with each other
- Recursive function calls are not allowed in CUDA kernels, due to limited per-thread resources
- Individual thread control is not supported
- CUDA does not support the full C standard
- Unlike OpenCL, CUDA-enabled GPUs are available only from NVIDIA
- The SIMD execution model of CUDA becomes a significant limitation for any inherently divergent task: higher divergence means lower performance

Conclusion
- CUDA provides an easy-to-program model for parallel applications, but it is specific to NVIDIA's GPU architecture and does not extend to other parallel systems.
- Other parallel programming libraries, such as Open MPI and OpenCL, provide similar features for multicore CPUs.

References
- NVIDIA CUDA Home: http://www.nvidia.com/object/cuda_home_new.html
- CUDA Programming Guide: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
- Maxwell Architecture: http://devblogs.nvidia.com/parallelforall/maxwell-most-advanced-cuda-gpu-ever-made/
- Unified Memory: http://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/
- http://sbel.wisc.edu/documents/TR-2014-09.pdf
- Dynamic Parallelism: http://developer.download.nvidia.com/assets/cuda/files/CUDADownloads/TechBrief_Dynamic_Parallelism_in_CUDA.pdf