CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming. Lecturer: Alan Christopher


Overview: GP-GPU, what and why; OpenCL, CUDA, and programming GPUs; GPU performance demo.

A Quick Review: Classes of Parallelism. ILP: run multiple instructions from one stream in parallel (e.g. pipelining). TLP: run multiple instruction streams simultaneously (e.g. OpenMP). DLP: run the same operation on multiple data at the same time (e.g. SSE intrinsics). GPUs live here, in the DLP category.

GPUs: Hardware specialized for graphics calculations, originally developed to facilitate the use of CAD programs. Graphics calculations are extremely data parallel, e.g. translate every vertex in a 3D model to the right. Programmers found that they could rephrase some of their problems as graphics manipulations and run them on the GPU, though doing so was incredibly burdensome for the programmer. It is more usable these days, thanks to OpenCL and CUDA.

CPU vs. GPU. CPU: latency optimized; a couple of threads of execution; each thread executes quickly; serial code; lots of caching. GPU: throughput optimized; many, many threads of execution; each thread executes slowly; parallel code; lots of memory bandwidth.

OpenCL and CUDA: extensions to C which allow for relatively easy GPU programming. CUDA is NVIDIA proprietary, so it runs on NVIDIA cards only. OpenCL is an open standard and can be used with NVIDIA or ATI cards. It is intended for general heterogeneous computing, which means you can use it with things like FPGAs, but also means it's relatively clunky. Similar tools, but different jargon.

Kernels: A kernel defines the computation for one array index. The GPU runs the kernel on each index of a specified range. Similar functionality to map, but you get to know the array index as well as the array value. The work at a given index is called a work-item, a CUDA thread, or a µthread. The entire range is called an index-space or grid.

OpenCL vvadd

/* C version. */
void vvadd(float *dst, float *a, float *b, unsigned n) {
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* OpenCL kernel. */
__kernel void vvadd(__global float *dst, __global float *a,
                    __global float *b, unsigned n) {
    unsigned tid = get_global_id(0);
    if (tid < n)
        dst[tid] = a[tid] + b[tid];
}
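
The slides do not show the host-side code that builds and launches this kernel. Below is a minimal, illustrative OpenCL host sketch (my own, with error handling mostly omitted); the string vvadd_src and the function name vvadd_gpu are made up for the example.

#include <CL/cl.h>

/* OpenCL kernel source as a C string (same vvadd kernel as above). */
static const char *vvadd_src =
    "__kernel void vvadd(__global float *dst, __global float *a,\n"
    "                    __global float *b, unsigned n) {\n"
    "    unsigned tid = get_global_id(0);\n"
    "    if (tid < n) dst[tid] = a[tid] + b[tid];\n"
    "}\n";

void vvadd_gpu(float *dst, float *a, float *b, unsigned n) {
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    /* Compile the kernel for this device. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &vvadd_src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vvadd", &err);

    /* Copy inputs into GPU memory (a separate address space from the CPU's). */
    cl_mem d_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float), a, &err);
    cl_mem d_b = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float), b, &err);
    cl_mem d_dst = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                  n * sizeof(float), NULL, &err);

    cl_uint n_arg = n;
    clSetKernelArg(k, 0, sizeof(cl_mem), &d_dst);
    clSetKernelArg(k, 1, sizeof(cl_mem), &d_a);
    clSetKernelArg(k, 2, sizeof(cl_mem), &d_b);
    clSetKernelArg(k, 3, sizeof(cl_uint), &n_arg);

    /* Round the index-space up to a multiple of the work-group size;
       the (tid < n) guard in the kernel turns off the extra work-items. */
    size_t local = 256;
    size_t global = ((n + local - 1) / local) * local;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);

    /* Copy the result back into CPU memory. */
    clEnqueueReadBuffer(q, d_dst, CL_TRUE, 0, n * sizeof(float), dst, 0, NULL, NULL);

    clReleaseMemObject(d_a); clReleaseMemObject(d_b); clReleaseMemObject(d_dst);
    clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
}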

Programmer's View of Execution: Create enough work-groups to cover the input vector. (OpenCL calls this ensemble of work-groups an index space; it can be 3-dimensional in OpenCL, 2-dimensional in CUDA.) [Figure: thread IDs 0 through 255 form one work-group, 256 through 511 the next, and so on up through thread ID n+e in the last group.] The local work size is something the programmer can choose. The conditional (i < n) turns off the unused threads in the last block.

Hardware Execution Model. [Figure: a CPU with its own CPU memory attached to a GPU with its own GPU memory; the GPU contains Core 0 through Core 15, each with SIMD lanes Lane 0 through Lane 15.] The GPU is built from multiple parallel cores; each core contains a multithreaded SIMD processor. The CPU sends the whole index-space over to the GPU, which distributes work-groups among cores (each work-group executes on one core). The programmer is unaware of the number of cores. Notice that the GPU and CPU have different memory spaces. That will be important when we start considering which jobs are a good fit for GPUs, and which jobs are a poor fit.

Single Instruction, Multiple Thread: GPUs use a SIMT model, where the individual scalar instruction streams for each work-item are grouped together for SIMD execution on hardware. (NVIDIA groups 32 CUDA threads into a warp; OpenCL calls the groups wavefronts.) [Figure: a scalar instruction stream (ld x, mul a, ld y, add, st y) executed in SIMD fashion across a wavefront of µthreads µt0 through µt7.]

Terminology Summary. Kernel: the function that is mapped across the input. Work-item: the basic unit of execution; takes care of one index; also called a microthread or CUDA thread. Work-group/Block: a group of work-items; each work-group is sent to one core in the GPU. Index-space/Grid: the range of indices over which the kernel is applied. Wavefront/Warp: a group of microthreads (work-items) scheduled to be SIMD-executed with each other.

Administrivia: OH 10am-5pm in 611 Soda tomorrow. Final on Friday.

Conditionals in the SIMT Model: Simple if-then-else constructs are compiled into predicated execution, equivalent to vector masking. More complex control flow is compiled into branches. How do you execute a vector of branches? [Figure: a scalar instruction stream (tid = threadid; if (tid >= n) goto skip; call func1; add; st y; skip:) executed in SIMD fashion across a warp of µthreads µt0 through µt7.]

Branch Divergence: Hardware tracks which µthreads take or don't take the branch. If all go the same way, keep going in SIMD fashion. If not, create a mask vector indicating taken/not-taken. Keep executing the not-taken path under the mask, and push the taken-branch PC plus its mask onto a hardware stack to execute later. When can execution of the µthreads in a warp reconverge?
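
To make divergence concrete, here is a small OpenCL kernel of my own (not from the slides). The first branch depends on the work-item index, so µthreads within the same wavefront disagree and the hardware must run both paths under masks; the second depends only on the work-group ID, so every µthread in a wavefront goes the same way and no masking is needed.

/* Illustrative only: divergent vs. non-divergent branching. */
__kernel void divergence_demo(__global float *data, unsigned n) {
    unsigned i = get_global_id(0);
    if (i >= n)
        return;

    /* Divergent: neighboring work-items in the same wavefront take
       different paths, so both paths are executed under masks. */
    if (i % 2 == 0)
        data[i] = data[i] * 2.0f;
    else
        data[i] = data[i] + 1.0f;

    /* Not divergent: the condition is uniform within each work-group
       (and therefore within each wavefront), so the whole wavefront
       takes a single path. */
    if (get_group_id(0) % 2 == 0)
        data[i] = data[i] - 3.0f;
}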

Warps (wavefronts) are multithreaded on a single core. One warp of 32 µthreads is a single thread in the hardware. Multiple warp threads are interleaved in execution on a single core to hide latencies (memory and functional unit). A single thread block can contain multiple warps (up to 512 µthreads max in CUDA), all mapped to a single core. There can be multiple blocks executing on one core. [Nvidia, 2010]

OpenCL Memory Model. Global: read and write by all work-items and work-groups. Constant: read-only by work-items; read and write by the host. Local: used for data sharing; read/write by work-items in the same work-group. Private: only accessible to one work-item.
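
As an illustration of the local address space (my example, not from the slides), the following per-work-group sum keeps a scratch array in local memory that all work-items in the group share, synchronizing with a barrier. It assumes the local work size is a power of two and that the host allocates the local buffer by calling clSetKernelArg with a size and a NULL argument pointer.

/* Per-work-group reduction using local (shared) memory. */
__kernel void group_sum(__global const float *in,
                        __global float *partial_sums,
                        __local float *scratch,
                        unsigned n) {
    unsigned gid = get_global_id(0);
    unsigned lid = get_local_id(0);

    /* Each work-item loads one element into the shared scratch array. */
    scratch[lid] = (gid < n) ? in[gid] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Tree reduction within the work-group (local size must be a power of two). */
    for (unsigned s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    /* One work-item per group writes its group's partial sum to global memory. */
    if (lid == 0)
        partial_sums[get_group_id(0)] = scratch[0];
}

The host would then add up the per-group partial sums, or launch a second reduction pass over them.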

SIMT gives the illusion of many independent threads, but for efficiency the programmer must try to keep µthreads aligned in a SIMD fashion: do unit-stride loads and stores so memory coalescing kicks in, and avoid branch divergence so that most instruction slots execute useful work rather than being masked off.
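
For example (an illustrative kernel pair of my own, not from the slides), the two kernels below read the same amount of data but with different access patterns. In the first, consecutive work-items read consecutive floats on every iteration, so each wavefront's loads coalesce into a few wide memory transactions; in the second, neighboring work-items are far apart in memory, so the same loads scatter into many transactions.

/* Each work-item sums `chunk` floats; only the access pattern differs. */
__kernel void coalesced_sum(__global const float *in, __global float *out,
                            unsigned chunk) {
    unsigned i = get_global_id(0);
    unsigned total = get_global_size(0);
    float acc = 0.0f;
    /* Unit stride across the wavefront: work-item i reads elements
       i, i + total, i + 2*total, ...  Neighboring µthreads touch
       neighboring addresses, so the loads coalesce. */
    for (unsigned j = 0; j < chunk; j++)
        acc += in[j * total + i];
    out[i] = acc;
}

__kernel void strided_sum(__global const float *in, __global float *out,
                          unsigned chunk) {
    unsigned i = get_global_id(0);
    float acc = 0.0f;
    /* Work-item i reads its own contiguous block [i*chunk, (i+1)*chunk);
       neighboring µthreads are `chunk` floats apart, so loads do not coalesce. */
    for (unsigned j = 0; j < chunk; j++)
        acc += in[i * chunk + j];
    out[i] = acc;
}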

VVADD

/* C version. */
void vvadd(float *dst, float *a, float *b, unsigned n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* OpenCL kernel. */
__kernel void vvadd(__global float *dst, __global float *a,
                    __global float *b, unsigned n) {
    unsigned tid = get_global_id(0);
    if (tid < n)
        dst[tid] = a[tid] + b[tid];
}

Which is faster? A: CPU faster. B: GPU faster.

VVADD

/* C version. */
void vvadd(float *dst, float *a, float *b, unsigned n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

Only 1 flop per three memory accesses, so this is a memory-bound calculation.

"A many-core processor: a device for turning a compute-bound problem into a memory-bound problem." (Kathy Yelick)
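
As a rough worked example (the bandwidth figure is an illustrative assumption, not from the slides): each vvadd element moves three 4-byte floats, i.e. 12 bytes, for a single add, an arithmetic intensity of about 1/12 flop per byte. At, say, 150 GB/s of graphics-memory bandwidth, that caps throughput near 12.5 GFLOP/s regardless of how many FLOPS the GPU can theoretically issue, and the cost of copying the vectors between CPU and GPU memory only makes things worse.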

VECTOR_COP

/* C version. */
void vector_cop(float *dst, float *a, float *b, unsigned n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        dst[i] = 0;
        for (int j = 0; j < A_LARGE_NUMBER; j++)
            dst[i] += a[i]*2*b[i] - a[i]*a[i] - b[i]*b[i];
    }
}

/* OpenCL kernel. */
__kernel void vector_cop(__global float *dst, __global float *a,
                         __global float *b, unsigned n) {
    unsigned i = get_global_id(0);
    if (i < n) {
        dst[i] = 0;
        for (int j = 0; j < A_LARGE_NUMBER; j++)
            dst[i] += a[i]*2*b[i] - a[i]*a[i] - b[i]*b[i];
    }
}

Which is faster? A: CPU faster. B: GPU faster.

GP-GPU in the future: High-end desktops have a separate GPU chip, but the trend is toward integrating the GPU on the same die as the CPU (already done in laptops, tablets, and smartphones). The advantage is memory shared with the CPU, so there is no need to transfer data. The disadvantage is reduced memory bandwidth compared to a dedicated, smaller-capacity, specialized memory system (graphics DRAM, GDDR, versus regular DRAM, DDR3). Will GP-GPU survive, or will improvements in CPU DLP make GP-GPU redundant? On the same die, the CPU and GPU should have the same memory bandwidth, though the GPU might still have more FLOPS, since it needs them for graphics anyway.

Acknowledgements: These slides contain materials developed and copyright by Krste Asanovic (UCB), AMD, and codeproject.com.

And in conclusion: GPUs thrive when the calculation is data parallel, the calculation is CPU-bound (compute-bound, not memory-bound), and the calculation is large. CPUs thrive when the calculation is largely serial, the calculation is small, or the programmer is lazy.

Bonus: OpenCL source code for the vvadd and vector_cop demos is available at http://www-inst.eecs.berkeley.edu/~cs61c/sp13/lec/39/demo.tar.gz