COMP 605: Introduction to Parallel Computing, Lecture: GPU Architecture

Mary Thomas
Department of Computer Science, Computational Science Research Center (CSRC), San Diego State University (SDSU)
Posted: 04/10/17. Last Update: 04/10/17.

Table of Contents
1 GPU Features
  GPU Memory Model

GPU Architecture

GPU Computing: Simplified Hardware View

CPU = Central Processing Unit
- single or multiple processing units (cores)
- standalone, or integrated into clusters
- designed to run processes; supports threads

GPU = Graphics Processing Unit
- usually attached to a host CPU
- developed for games (think Sony, PS3) and visualization (OpenGL, think Pixar)
- designed to run lightweight threads; may have multiple PEs (processing elements)
- accessible via specialized libraries, compiler directives (OpenACC), and extensions to languages (C, C++, and Fortran)

CUDA (Compute Unified Device Architecture)
- a parallel computing platform and programming model created by NVIDIA
- an extension of the C programming language (a minimal sketch follows below)
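
As a rough sketch of what the C extension looks like (an illustrative example, not taken from the slides): a kernel is declared with __global__, launched from the host with an execution configuration in <<<...>>>, and device memory is managed explicitly.

    #include <stdio.h>

    // Kernel: runs on the GPU, one lightweight thread per array element.
    __global__ void add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global index
        if (i < n) c[i] = a[i] + b[i];                  // guard the tail
    }

    int main(void) {
        const int n = 1024;
        size_t bytes = n * sizeof(float);
        float h_a[1024], h_b[1024], h_c[1024];          // host (CPU) arrays
        for (int i = 0; i < n; i++) { h_a[i] = i; h_b[i] = 2 * i; }

        float *d_a, *d_b, *d_c;                         // device (GPU) pointers
        cudaMalloc(&d_a, bytes);
        cudaMalloc(&d_b, bytes);
        cudaMalloc(&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n); // 4 blocks x 256 threads

        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        printf("c[10] = %f\n", h_c[10]);                // expect 30.0
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }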

GPU Architecture
- The GPU is a highly threaded coprocessor to the host CPU, with its own associated memory.
- Kernels are the sections of an application that run on the GPU; each kernel is executed by many threads.
- A thread block is a batch of threads that can cooperate with each other by:
  - sharing data through shared memory
  - synchronizing their execution
- Each block is organized as a 3D array of threads: (blockDim.x, blockDim.y, blockDim.z).
- Threads within a thread block execute the same kernel and share data, so they must be issued to the same processor.
- A grid is a collection of blocks, organized as a 2D array: (gridDim.x, gridDim.y). (A launch-configuration sketch follows below.)
Source: http://hothardware.com/articles/nvidia-gf100-architecture-and-feature-preview
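
A short sketch of how the 3D block shape and 2D grid shape are expressed with CUDA's dim3 type (hypothetical kernel and sizes, chosen for illustration):

    // Kernel indexed by a 2D grid of 2D blocks (z dimensions left at 1).
    __global__ void fill(float *out, int w, int h) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
        if (x < w && y < h) out[y * w + x] = 1.0f;
    }

    // Host code:
    // dim3 block(16, 16, 1);                     // blockDim.x, blockDim.y, blockDim.z
    // dim3 grid((w + 15) / 16, (h + 15) / 16);   // gridDim.x, gridDim.y
    // fill<<<grid, block>>>(d_out, w, h);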

GPU: Kernel
- Kernels are not full applications: they are the parallel sections or critical blocks.
- They are executed by a grid of unordered thread blocks, with up to 512 threads per block (on this hardware generation).
- Thread blocks start at the same instruction address and execute in parallel.
- Blocks can reach different endpoints (divergence), but this is limited.
- Threads in a block communicate through shared memory and synchronization barriers, and so must be assigned to the same processor. (A barrier sketch follows below.)
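
A minimal sketch of shared-memory cooperation with a barrier (hypothetical kernel; assumes it is launched with exactly BLOCK threads in one block):

    #define BLOCK 256

    // Reverse an array within one thread block using shared memory.
    __global__ void reverseBlock(float *data) {
        __shared__ float tile[BLOCK];       // on-chip, visible to the whole block
        int t = threadIdx.x;
        tile[t] = data[t];                  // each thread stages one element
        __syncthreads();                    // barrier: all writes finish before any reads
        data[t] = tile[BLOCK - 1 - t];      // read an element staged by another thread
    }

    // Launch: reverseBlock<<<1, BLOCK>>>(d_data);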

GPU: Threads
- A thread of execution is the smallest sequence of programmed instructions that can be managed independently by an operating system scheduler.
- A thread is a lightweight process; in most cases, a thread is contained inside a process.
- On a single processor, multithreading generally occurs by time-division multiplexing (as in multitasking).
- On a multiprocessor (including multi-core systems), threads or tasks run at the same time: each processor or core runs a particular thread or task.
Source: http://www.compsci.hunter.cuny.edu/~sweiss/course_materials/csci360/lecture_notes/gpus.pdf

NVIDIA Thread Calculations
- A thread ID is unique within a block.
- A block ID is unique within a grid.
- Thread and block IDs can be combined into a unique global ID for each thread per kernel. (See the sketch below.)
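
The usual combination, written as it would appear inside a kernel (standard built-in variables; the 2D variant shown is one common convention among several):

    // 1D launch: unique global thread ID
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // 2D launch: flatten (block, thread) coordinates into one ID
    int bid  = blockIdx.y * gridDim.x + blockIdx.x;        // unique block ID in the grid
    int tid  = threadIdx.y * blockDim.x + threadIdx.x;     // unique thread ID in the block
    int gid2 = bid * (blockDim.x * blockDim.y) + tid;      // unique global thread ID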

GPU: Grid
- A grid is a collection of blocks that can (but are not required to) execute in parallel.
- There is no synchronization at all between the blocks.
- Number of concurrent grids on a GPU, by compute capability (CC):
  - 1 for CC < 2.0
  - 16 for 2.0 <= CC <= 3.0
  - 32 for CC = 3.5
  - ...
- You need to use the right API (e.g., CUDA streams) to avoid serialization.
- There is an API for querying the GPU system. (A query sketch follows below.)
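
A small device-query sketch using the standard runtime calls (the printed fields are an illustrative subset of cudaDeviceProp):

    #include <stdio.h>

    int main(void) {
        int count = 0;
        cudaGetDeviceCount(&count);                 // number of CUDA-capable GPUs
        for (int d = 0; d < count; d++) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);      // fill in the capability record
            printf("Device %d: %s, CC %d.%d, %d SMs, %zu bytes global memory\n",
                   d, prop.name, prop.major, prop.minor,
                   prop.multiProcessorCount, prop.totalGlobalMem);
        }
        return 0;
    }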

The GPU Memory Model
- In the diagram, red is fast on-chip memory; orange is DRAM.
- The register file and local memory are private to each thread.
- Shared memory is used for communication between threads (approximately the same latency as registers).
- DRAM, read-only:
  - constant memory (64 KB), used for random accesses (such as instructions)
  - texture memory (large), which has two-dimensional locality
- Global memory: visible to an entire grid; can be arbitrarily written to and read from by the GPU or the CPU. (A sketch of the memory-space qualifiers follows below.)
Source: NVIDIA
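
A sketch of how these spaces surface in CUDA C (hypothetical kernel; the qualifiers themselves are the standard ones):

    __constant__ float coeff[64];            // constant memory: read-only on the device

    __global__ void scale(const float *x, float *y, int n) {
        __shared__ float s[256];             // shared memory: on-chip, per-block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) s[threadIdx.x] = x[i];    // global -> shared (guarded load)
        __syncthreads();                     // every thread reaches the barrier
        if (i < n) {
            float t = coeff[0] * s[threadIdx.x];  // constant broadcast + shared read; t lives in a register
            y[i] = t;                             // register -> global
        }
    }

    // Host side: cudaMemcpyToSymbol(coeff, h_coeff, 64 * sizeof(float)) initializes constant memory.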

NVIDIA Hardware: GeForce 8800
- Streaming Multiprocessors (SMs, also called nodes), each with:
  - 8 Stream Processors (SPs, or cores): the primary thread processors
  - 1000s of registers that can be partitioned among threads of execution
  - multiple caches:
    - shared memory, for fast data interchange between threads
    - constant cache, for fast broadcast of reads from constant memory
    - texture cache, to aggregate bandwidth from texture memory
    - L1 cache, to reduce latency to memory
  - warp schedulers, which switch contexts between threads and issue instructions to warps
  - execution cores for integer and floating-point ops
  - Special Function Units (SFUs)
Source: http://www.compsci.hunter.cuny.edu/~sweiss/course_materials/csci360/lecture_notes/gpus.pdf

NVIDIA GPU GF100 High-Level Block Diagram (2010)
- The CPU is the host; the GPU is the device.
- The GF100 has 4 Graphics Processing Clusters (GPCs), laid out in a 2x2 arrangement (each GPC also contains a Raster Engine).
- Each GPC has 4 Streaming Multiprocessors (SMs), NVIDIA's term for multiprocessor (each SM also contains a PolyMorph Engine), arranged in a 1x4 layout. Total number of SMs = 4 x 4 = 16.
- Each SM has a block of Stream Processors (SPs), or cores, also called execution units, arranged in an 8x4 layout. Total number of SPs per SM = 8 x 4 = 32.
- Total number of cores on the GPU:
  #Cores = #SPs/SM x #SMs/GPC x #GPCs = 32 x 4 x 4 = 512
  (A query sketch for checking this on real hardware follows below.)
Source: http://hothardware.com/articles/nvidia-gf100-architecture-and-feature-preview
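
The SM count can be checked on whatever GPU is installed; the cores-per-SM figure is architecture-specific, so in this sketch it is an assumption (32, as on GF100) rather than a queried value:

    #include <stdio.h>

    int main(void) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);     // query device 0
        int coresPerSM = 32;                   // assumption: Fermi GF100 (32 SPs per SM)
        printf("%s: %d SMs x %d cores/SM = %d cores (if Fermi-class)\n",
               prop.name, prop.multiProcessorCount, coresPerSM,
               prop.multiProcessorCount * coresPerSM);
        return 0;
    }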

NVIDIA GF100 SM Block Diagram
- Each SM block in each GPC is comprised of 32 cores.
- 48 KB or 16 KB of shared memory (3x that of the GT200).
- 16 KB or 48 KB of L1 cache (there is no L1 cache on the GT200).
- The shared-memory/L1 split is configurable per kernel. (See the sketch below.)
Source: http://hothardware.com/articles/nvidia-gf100-architecture-and-feature-preview
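
On Fermi-class parts this 64 KB of on-chip storage is partitioned with the runtime's cache-configuration call (a standard CUDA API; the kernel name here is hypothetical, and the setting is a preference rather than a guarantee):

    __global__ void myKernel(float *x) { /* hypothetical kernel body */ }

    // Host code: favor shared memory (48 KB shared / 16 KB L1) for this kernel...
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

    // ...or favor L1 (16 KB shared / 48 KB L1) for kernels that use little shared memory.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);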

NVIDIA GT200 SM Architecture (2008)
- A highly threaded, single-issue processor with SIMD/SIMT execution.
- 8 functional units.
- Each SM can execute up to 8 thread blocks concurrently, and up to 1024 threads concurrently in total.
- Warp: a group of threads managed by the SM thread scheduler.
- Single Instruction, Multiple Thread (SIMT) programming model. (A divergence sketch follows below.)
Source: http://www.realworldtech.com/gt200
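
Under SIMT, a warp issues one instruction for all of its threads, so a branch that splits a warp is executed serially, one path at a time. A small illustration (hypothetical kernels; a warp size of 32 is assumed, as on NVIDIA GPUs):

    // Divergent: even and odd lanes of the same warp take different paths,
    // so the warp executes both sides of the branch serially.
    __global__ void divergent(float *x) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i % 2 == 0) x[i] *= 2.0f;          // even lanes
        else            x[i] += 1.0f;          // odd lanes, same warp
    }

    // Warp-uniform: all 32 lanes of any given warp agree on the branch,
    // so no serialization occurs.
    __global__ void uniform(float *x) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((i / 32) % 2 == 0) x[i] *= 2.0f;   // whole warps take one path
        else                   x[i] += 1.0f;
    }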

GPU Global Scheduler (work distribution unit)
- Manages coarse-grained parallelism at the thread-block level.
- At kernel startup, information for the grid is sent from the CPU (host) to the GPU (device).
- The scheduler reads this information and issues thread blocks to the streaming multiprocessors (SMs).
- Thread blocks are issued in round-robin fashion, distributing threads uniformly across the SMs.
- Key distribution factors:
  - the kernel's demand for threads per block
  - shared memory per block
  - registers per thread
  - thread and block state requirements
  - resources currently available in each SM
- (An occupancy-query sketch follows below.)
Source: http://www.realworldtech.com/gt200/6/
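
The same per-block resource demands determine how many blocks can be resident on one SM. Later CUDA releases (6.5 and up, well after the GT200 era described above) expose this calculation directly; a sketch with a hypothetical kernel:

    #include <stdio.h>

    __global__ void kernel(float *x) { /* hypothetical kernel body */ }

    int main(void) {
        int maxBlocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &maxBlocksPerSM,   // out: resident blocks per SM for this configuration
            kernel,            // the __global__ function to be launched
            256,               // threads per block
            0);                // dynamic shared memory per block, in bytes
        printf("Blocks resident per SM: %d\n", maxBlocksPerSM);
        return 0;
    }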