COMP 322: Fundamentals of Parallel Programming


COMP 322: Fundamentals of Parallel Programming
Lecture 38: General-Purpose GPU (GPGPU) Computing
Guest Lecturer: Max Grossman
Instructors: Vivek Sarkar, Mack Joyner
Department of Computer Science, Rice University
{jmg3, vsarkar, mjoyner}@rice.edu
http://comp322.rice.edu/
19 April 2017

Worksheet #37: Creating a Circuit for Parallel Prefix Sums
Name: Netid:
Assume that you have a full adder cell that can be used as a building block for circuits (no need to worry about carries). Create a circuit that generates the prefix sums for 1, ..., 6 by adding at most 5 more cells to the sketch shown below, while ensuring that the CPL is at most 3 cells long. Assume that you can duplicate any value (fan-out) to whatever degree you like without any penalty.

  1  2  3  4  5  6   (Inputs)
  1  3  6 10 15 21   (Outputs)
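As background (not part of the original worksheet): the log-depth structure the worksheet asks for is the same one a parallel scan uses in software. At step k, each element adds in the value 2^k positions to its left, so n inputs need only ceil(log2 n) rounds of dependent additions, which is 3 for n = 6. A minimal single-block CUDA sketch of this Hillis-Steele scan, with all names hypothetical:

  __global__ void prefixSums(int *data, int n) {
      // One thread per element; assumes a single block with blockDim.x >= n.
      extern __shared__ int tmp[];
      int t = threadIdx.x;
      if (t < n) tmp[t] = data[t];
      __syncthreads();
      // Each round doubles the reach of the partial sums: depth = ceil(log2 n).
      for (int stride = 1; stride < n; stride *= 2) {
          int v = (t >= stride && t < n) ? tmp[t - stride] : 0;
          __syncthreads();                       // all reads finish before any writes
          if (t >= stride && t < n) tmp[t] += v;
          __syncthreads();                       // all writes finish before next round
      }
      if (t < n) data[t] = tmp[t];
  }
  // Hypothetical launch: prefixSums<<<1, n, n * sizeof(int)>>>(d_data, n);

For inputs 1..6 this produces 1, 3, 6, 10, 15, 21 in three rounds, matching the worksheet's CPL bound.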

Demo: the performance gap between GPUs and multicore CPUs continues to widen.

So I Can Move Dots Around, So What?
- Google: uses GPUs internally to train deep learning models (e.g., for NLP)
- U.S. Departments of Energy & Defense: the 3rd-fastest supercomputer in the world is GPU-based, and two of the next three supercomputers to be deployed by the U.S. Department of Energy will be GPU-based
- Mayo Clinic: using GPUs to improve tumor identification
- Audi: using GPUs for self-driving cars
- SpaceX: uses GPUs internally for combustion modeling of the Merlin methane-based rocket engine
- Facebook: uses GPUs through its open-source Caffe2 framework

Flynn's Taxonomy for Parallel Computers

                         Single Data    Multiple Data
  Single Instruction     SISD           SIMD
  Multiple Instructions  MISD           MIMD

Single Instruction, Single Data stream (SISD): a sequential computer which exploits no parallelism in either the instruction or data streams, e.g., an old single-processor PC.
Single Instruction, Multiple Data streams (SIMD): a computer which exploits multiple data streams against a single instruction stream to perform operations which may be naturally parallelized, e.g., a graphics processing unit.
Multiple Instruction, Single Data stream (MISD): multiple instructions operate on a single data stream. An uncommon architecture, generally used for fault tolerance: heterogeneous systems operate on the same data stream and must agree on the result, e.g., the Space Shuttle flight control computer.
Multiple Instruction, Multiple Data streams (MIMD): multiple autonomous processors simultaneously executing different instructions on different data, e.g., a PC cluster (distributed memory space).

Multicore Processors are examples of MIMD systems
Memory hierarchy for a single Intel Xeon quad-core E5530 processor chip, as shown on the slide:
- Each of the four cores (A-D) has its own registers, L1 d-cache, and L1 i-cache
- Each pair of cores shares an L2 unified cache
- All four cores share an L3 unified cache, which connects to main memory

SIMD computers
Definition: a single instruction stream is applied to multiple data elements.
- One program text
- One instruction counter
- Distinct data streams per Processing Element (PE)
Examples: vector processors, GPUs
(Source: Mattson and Keutzer, UCB)

CPU-Style Cores
The CPU-style core is designed to make individual threads speedy. Its modules include:
- Fetch/Decode
- Out-of-order control logic
- Execution units
- Execution contexts
- Branch predictor logic
- Memory prefetch unit
- Large data cache
Execution context == the memory and hardware state associated with a specific stream of instructions (e.g., a thread).
Multiple cores lead to MIMD computers.

GPU Design Idea #1: more slow cores
The first big idea that differentiates GPU and CPU core design: slim down the footprint of each core, keeping only Fetch/Decode, the execution units, and the execution contexts.
Idea #1: remove the modules that help a single instruction stream execute fast (out-of-order control logic, branch prediction, prefetching, large caches).
Slides and graphics based on presentations from Andreas Klöckner and Kayvon Fatahalian.

GPU Design Idea #1: more slow cores
[Figure: the slimmed-down core replicated many times across the chip. See: Andreas Klöckner and Kayvon Fatahalian]

GPU Design Idea #2: lock stepping
In the GPU rendering context, the instruction streams are typically very similar. Design for a single instruction, multiple data (SIMD) model: share the cost of the instruction stream across many execution units (i.e., a single program counter for multiple "cores").
[Figure: one Fetch/Decode unit feeding eight execution units, each with its own context (Ct), over Shared Ctx Data: the shared-memory SIMD model. See: Andreas Klöckner and Kayvon Fatahalian]

GPU Design Idea #2: branching?
Question: what happens when the instruction streams include branching? How can they execute in lock step?
See: Andreas Klöckner and Kayvon Fatahalian

GPU Design Idea #2: lock stepping w/ branching
Time flows downward; T/F marks the lanes whose flag test was true/false, and X marks a lane idling through a branch it did not take:

  Non-branching code;
  if (flag > 0) {   /* branch */
      x = exp(y);
      y = 2.3*x;
  }
  else {
      x = sin(y);
      y = 2.1*x;
  }
  Non-branching code;

The cheap branching approach means that some execution units are idle as all lanes traverse all branches [executing NOPs if necessary]. In the worst possible case we could see 1/8 of maximum performance: with 8 lanes and 8 disjoint branches, only one lane does useful work at a time.

GPU Design Idea #3: latency hiding
A lane's timeline looks like: work on registers; work on registers; work on registers; load registers from main memory; ... stall ...
It takes O(1000) cycles to load data from off-chip memory into the SM register file. The execution units are idled (stalled) after a load.
See: Andreas Klöckner and Kayvon Fatahalian

GPU Design Idea #3: latency hiding
Idea #3: enable fast context switching so the execution units can efficiently alternate between different tasks.
[Figure: the same SIMD core now holding four sets of execution contexts (1-4), with Shared Ctx Data. See: Andreas Klöckner and Kayvon Fatahalian]

GPU Design Idea #3: context switching Ti me Body Level Five Ctx1: work on registers; Ctx1: work on registers; Ctx1: work on registers; Ctx1: load request, switch context; Ctx3: work on registers; Ctx3: work on registers; Ctx3: work on registers; Ctx3: load request, switch context; Ctx2: work on registers; Ctx2: work on registers; Ctx2: work on registers; Ctx2: load request, switch context; See: Andreas Klöckner and Kayvon Fatahalian Ctx1: load done so continue 16 COMP 322, Spring 2017 (V. Sarkar, M. Joyner)

Summary: CPUs and GPUs have fundamentally different designs
GPU = Graphics Processing Unit
[Figure: a single CPU core (large control logic and cache, few ALUs, DRAM) beside a GPU composed of many streaming multiprocessors, each with many small ALUs and a small cache, attached to its own DRAM]
GPUs are provided to accelerate graphics, but they can also be used for non-graphics applications that exhibit large amounts of data parallelism and require large amounts of streaming throughput: SIMD parallelism within an SM (streaming multiprocessor), and SPMD parallelism across SMs.

Host vs. Device
CPU (aka host), GPU (aka device).

Host vs. Device
The GPU has its own independent memory space; it is a separate compute "sidecar". We refer to the GPU as the DEVICE and the CPU as the HOST. An array that is in HOST-attached memory is not directly visible to the DEVICE, and vice versa.
To load data onto the DEVICE from the HOST:
- We allocate memory on the DEVICE for the array
- We then copy data from the HOST array to the DEVICE array
To retrieve results from the DEVICE, they have to be copied from the DEVICE array back to the HOST array.

Execution of a CUDA program
Execution alternates between the host and the device, with explicit host-device communication at each transition:

  Host code (small number of threads)
    (explicit host-device communication)
  Device kernel (large number of threads) ...
    (explicit host-device communication)
  Host code (small number of threads)
    (explicit host-device communication)
  Device kernel (large number of threads) ...
    (explicit host-device communication)
  Host code (small number of threads)

Outline of a CUDA main program
pseudo_cuda_code.cu:

  __global__ void kernel(arguments) {
      instructions for a single GPU thread;
  }
  ...
  main() {
      set up GPU arrays;
      copy CPU data to GPU;
      kernel <<< # thread blocks, # threads per block >>> (arguments);
      copy GPU data to CPU;
  }
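To make the outline concrete, here is a minimal complete program following this recipe. It is not from the original slides; the kernel, array names, and sizes are illustrative. It increments every element of an array on the GPU:

  #include <stdio.h>
  #include <cuda_runtime.h>

  __global__ void increment(float *a, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
      if (i < n) a[i] += 1.0f;                        // one element per thread
  }

  int main() {
      const int n = 1024;
      float h_a[n];
      for (int i = 0; i < n; i++) h_a[i] = (float)i;

      float *d_a;
      cudaMalloc((void**)&d_a, n * sizeof(float));                      // set up GPU array
      cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);  // copy CPU data to GPU

      increment<<<n / 256, 256>>>(d_a, n);   // 4 thread blocks, 256 threads per block

      cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);  // copy GPU data to CPU
      cudaFree(d_a);
      printf("h_a[0] = %f\n", h_a[0]);  // expect 1.0
      return 0;
  }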

CUDA Storage Classes + Thread Hierarchy
- Local memory: per-thread. Private to each thread; holds auto variables and register spills.
- Shared memory: per-block. Shared by the threads of the same block; used for inter-thread communication within a block.
- Global memory: per-application. Shared by all threads; used for inter-grid communication (sequential grids in time share data through it).
[Figure: thread blocks in Grid 0 and Grid 1, each thread with its own local memory, each block with its own shared memory, and all grids sharing global memory]
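A small kernel sketch (illustrative, not from the slides) showing where each storage class appears in CUDA C; it assumes blocks of 256 threads:

  __global__ void kernelWithStorageClasses(float *gdata) {  // gdata: global memory
      float x;                      // local: private per thread (register/auto)
      __shared__ float buf[256];    // shared: one copy per thread block
      int t = threadIdx.x;
      buf[t] = gdata[blockIdx.x * blockDim.x + t];   // global -> shared
      __syncthreads();              // intra-block barrier before reading neighbors
      x = buf[(t + 1) % blockDim.x];                 // inter-thread communication
      gdata[blockIdx.x * blockDim.x + t] = x;        // shared -> global
  }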

CUDA Host-Device Data Transfer

  cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind)

Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst, where kind is one of:
- cudaMemcpyHostToHost
- cudaMemcpyHostToDevice
- cudaMemcpyDeviceToHost
- cudaMemcpyDeviceToDevice
The memory areas may not overlap. Calling cudaMemcpy() with dst and src pointers that do not match the direction of the copy results in undefined behavior.
[Figure: the device grid with blocks of threads, per-thread registers and local memory, per-block shared memory, and the global, constant, and texture memories reachable from the host]
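For example (illustrative; the variable names are hypothetical), a device-to-host copy with its return value checked:

  cudaError_t err = cudaMemcpy(h_result, d_result, n * sizeof(float),
                               cudaMemcpyDeviceToHost);
  if (err != cudaSuccess)
      fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));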

Matrix multiplication kernel code in CUDA --- SPMD model with 2D index
[The kernel source shown on this slide did not survive transcription; see the sketch below.]
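Below is a minimal sketch of a matrix-multiplication kernel consistent with the host code on the next slide: one thread per output element of the Width x Width matrix, addressed by a 2D thread index, with no shared-memory tiling. This is a reconstruction, not the slide's original code:

  __global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int Width) {
      // Host launches a single block of Width x Width threads,
      // so the 2D thread index identifies one output element
      // (this works for Width up to 32, the per-block thread limit).
      int row = threadIdx.y;
      int col = threadIdx.x;
      float sum = 0.0f;
      for (int k = 0; k < Width; k++)
          sum += Md[row * Width + k] * Nd[k * Width + col];  // dot product of row and column
      Pd[row * Width + col] = sum;   // each thread writes one element of P
  }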

Host Code in C for Matrix Multiplication

  void MatrixMultiplication(float* M, float* N, float* P, int Width) {
      int size = Width*Width*sizeof(float);             // matrix size in bytes
      float *Md, *Nd, *Pd;                              // pointers to device arrays
      cudaMalloc((void**)&Md, size);                    // allocate Md on device
      cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);  // copy M to Md
      cudaMalloc((void**)&Nd, size);                    // allocate Nd on device
      cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);  // copy N to Nd
      cudaMalloc((void**)&Pd, size);                    // allocate Pd on device
      dim3 dimBlock(Width, Width);
      dim3 dimGrid(1, 1);
      // launch kernel (equivalent to async at(gpu), forall, forall)
      MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
      cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);  // copy Pd to P
      // free device matrices
      cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
  }

Summary of key features in CUDA

  CUDA construct                               | Related HJ/Java constructs
  ---------------------------------------------|---------------------------------------------------
  Kernel invocation, <<<...>>>                 | async at(gpu-place)
  1D/2D grid with 1D/2D/3D blocks of threads   | Outer 1D/2D forall with inner 1D/2D/3D forall
  Intra-block barrier, __syncthreads()         | HJ forall-next on implicit phaser for inner forall
  cudaMemcpy()                                 | No direct equivalent (can use System.arraycopy() if needed)
  Storage classes: local, shared, global       | No direct equivalent (method-local variables are scalars)

Worksheet #35: Branching in SIMD code
Name: Netid:
Consider SIMD execution of the following pseudocode with 8 threads. Assume that each call to doWork(x) takes x units of time, and ignore all other costs. How long will this program take when executed on 8 GPU cores, taking into consideration the branching issues discussed in Slide 9?

  int tx = threadIdx.x;   // ranges from 0 to 7
  if (tx % 2 == 0) {
      S1: doWork(1);      // computation S1 takes 1 unit of time
  }
  else {
      S2: doWork(2);      // computation S2 takes 2 units of time
  }

BACKUP SLIDES START HERE

HJ abstraction of a CUDA kernel invocation: async at + forall + forall

  async at(gpu)
      forall (blockIdx)
          forall (threadIdx)
              // body of kernel