Lecture 7: CUDA (I). Today's content: Introduction, Trends in HPC, GPGPU, CUDA Programming.


Today's Content. Lecture 7: CUDA (I)
- Introduction
- Trends in HPC
- GPGPU
- CUDA Programming

Trends in High-Performance Computing
HPC was never a commodity until 1994. In the 1990s, PC performance kept getting faster and faster, while proprietary computer chips offered lower and lower performance/price compared to standard PC chips. In 1994 at NASA, Thomas Sterling and Donald Becker built the first Beowulf cluster: 16 DX4 (486) processors connected by channel-bonded Ethernet.

Factors that contributed to the growth of Beowulf-class computers (http://www.beowulf.org/overview/history.html):
- The prevalence of computers for office automation, home computing, games and entertainment now provides system designers with new types of cost-effective components.
- The COTS industry now provides fully assembled subsystems (microprocessors, motherboards, disks and network interface cards).
- Mass-market competition has driven the prices down and reliability up for these subsystems.
- The availability of open source software, particularly the Linux operating system, GNU compilers and programming tools, and the MPI and PVM message-passing libraries.
- Programs like the HPCC program have produced many years of experience working with parallel algorithms.
- The recognition that obtaining high performance, even from vendor-provided parallel platforms, is hard work and requires researchers to adopt a do-it-yourself attitude.
- An increased reliance on computational science, which demands high-performance computing.

MPP: Massively Parallel Processing
MPP vs. cluster: they are very similar. In both, multiple processors, each with its own private memory, are interconnected through a network. MPP tends to have a more sophisticated interconnect than a cluster.

Next 10 years of development?
Multi-core CPUs are already commodity/standard hardware; that is why OpenMP is becoming more and more important. Intel demonstrated an 80-core prototype processor in its labs around 2006 (http://news.cnet.com/intel-shows-off-80-core-processor/100-100_-18181.html7). The cluster/MPP approach is showing its age, with diminishing increases in performance. Increases in clock speed for general-purpose processors (GPPs) are slowing down due to power consumption; Moore's law suggests the number of transistors can still double every 18 months, and those transistors can be put to better use in special-purpose processing.

[Figure: power density of processors over time. Source: Patrick Gelsinger, Intel Developer Forum, Intel Corporation.]

Next 10 years of development? Hybrid computing systems
Systems that employ more than one type of computing engine to perform application tasks (http://hpc.pnl.gov/projects/hybrid-computing/). General-purpose processors (e.g. x86) do not necessarily perform well on all tasks. Specialized processors can be built for special-purpose tasks:
- GPU: much better at 3D computer graphics.
- FPGA (Field-Programmable Gate Array): the processor's hardware can be reconfigured or programmed to invest its transistors in performing certain tasks.
- DSP (Digital Signal Processor).

Introduction to GPGPU
GPGPU: General-Purpose computing using Graphics Processing Units. [Figure: GPU performance growth vs. CPU over time.] GPGPU has been a hot topic in the research community since the early 2000s:
- High performance (>10x current-generation general-purpose CPUs)
- Highly efficient: cost-efficient, space-efficient, green/energy-efficient
(David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign)

Why is the GPU faster than the CPU?
The GPU is specialized for compute-intensive, highly data-parallel computation (graphics rendering), so more transistors can be devoted to data processing rather than data caching and flow control. [Figure: transistor budget of a CPU (large control logic and cache, with its DRAM) vs. a GPU (mostly ALUs, with its DRAM).] (David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign)

The GPU pipeline
Program/API -> Driver -> GPU front end, then on the GPU:
- Vertex processing: transformations.
- Primitive assembly: compiles vertices into points, lines and/or polygons; links elements; sets up the rasterizer.
- Rasterization & interpolation: turns primitives into fragments/pixels.
- Fragment processing: texturing; determines the color & depth of pixels.
- Raster operations: depth checking, blending.
- Framebuffer.
Vertices & fragments are processed in parallel: data-parallel, stream processing. Also note that data rarely travel backwards through the pipeline. GPGPU exploits the vertex & fragment processing stages.

Shading languages (http://en.wikipedia.org/wiki/shading_language):
- Cg: NVIDIA
- HLSL: Microsoft High Level Shader Language; DirectX 8+, Xbox, Xbox 360
- GLSL: OpenGL Shading Language

Barriers to entry for GPGPU: you had to know OpenGL or DirectX and the graphics pipeline, and map computation onto drawing (http://www.mathematik.unidortmund.de/~goeddeke/gpgpu/tutorial.html#feedback):
- open a drawing canvas
- array <-> texture
- compute kernel <-> shader
- compute <-> drawing

Efforts for GPGPU: the programming-language approach (http://www.gpgpu.org/cgi-bin/blosxom.cgi/high-level%0languages/index.html):
- BrookGPU: around 2003 at Stanford University, by Ian Buck (now with NVIDIA); layered over DirectX 9, OpenGL, CTM, Cg; incorporated into AMD's Stream SDK to replace CTM.
- Sh: conceived at the University of Waterloo, Canada; general release in the early 2000s; evolved and was commercialized at RapidMind, Inc.
- CUDA (Compute Unified Device Architecture): released in 2006 by NVIDIA.
- OpenCL: 2008 at Apple's WWDC; now supported by NVIDIA and AMD/ATI.
- DirectCompute: Microsoft; DirectX 10+ on Windows Vista & Windows 7.

Outline
- CUDA architecture
- CUDA programming model
- CUDA-C

CUDA: a general-purpose parallel computing architecture
CUDA is designed to support various languages and APIs.

CUDA device and threads
- Host: usually the general-purpose CPU running the computer system; host memory is the DRAM on the host. The host initiates the execution of kernels.
- Device: a coprocessor to the CPU/host, with its own device memory. The device runs many threads in parallel.
[Figure: CPU (host) with its DRAM and other host components, next to the GPU (device) with its memory; see the NVIDIA CUDA Programming Guide.]

Programming concepts: CUDA programming model
CUDA targets many-core processors: you need lots and lots of threads to use them efficiently!
- Kernel: kernels are executed by devices through threads. One kernel may be executed by many threads.
- Thread: executes a kernel; identified by threadIdx.
[Figure: a "+1" kernel applied to an input array; each thread increments one element to produce the output array.]
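As a minimal sketch of this model (hypothetical names, not the lecture's code; one thread per element, using the memory APIs covered later in this lecture), the "+1" kernel above might look like:

/* plusOne.cu: each of the N threads increments the one element it owns. */
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void plusOne(int *a) {
    a[threadIdx.x] += 1;    /* threadIdx identifies this thread within its block */
}

int main(void) {
    const int N = 8;
    int h_a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    int *d_a;
    cudaMalloc((void **)&d_a, N * sizeof(int));
    cudaMemcpy(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice);
    plusOne<<<1, N>>>(d_a);                     /* one block of N threads */
    cudaMemcpy(h_a, d_a, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) printf("%d ", h_a[i]);   /* prints 2 ... 9 */
    printf("\n");
    cudaFree(d_a);
    return 0;
}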

Single thread vs. multiple threads
A single thread with a loop:

int a[100] = {0};
for (int i = 0; i < 100; i++) {
    a[i]++;
}

With 100 threads, each thread has an ID (starting from 0) and simply executes:

a[id]++;

Thread block
- Each thread has a unique ID within its block.
- One thread block contains many threads arranged in 1D, 2D, or 3D, to process data in linear arrays, in matrices, or in volumes. Threads read their ID from threadIdx.x, threadIdx.y, threadIdx.z.
- A block holds up to 512 threads on current GPUs. The limit exists because all threads of a block are expected to execute on the same processor core.
- Threads in the same block can cooperate via shared memory, atomic operations, and barrier synchronization.

Grid
Thread blocks are arranged into a grid (1D or 2D). So a CUDA program has a grid, each grid point holds a thread block, and each thread block contains threads arranged in 1D, 2D, or 3D. Dimensional limits on current hardware: grid 65535 x 65535; thread block 512 x 512 x 64 (the total number of threads per block must be at most 512). There is NO cooperation between thread blocks: beyond barrier synchronization issued by the host, blocks cannot cooperate, and thread blocks are executed in any order the scheduler sees fit. This grid / blocks / threads hierarchy is what allows CUDA programs to scale across different hardware. Programmers use blockIdx (blockIdx.x, blockIdx.y), blockDim (blockDim.x, blockDim.y, blockDim.z), and threadIdx (threadIdx.x, threadIdx.y, threadIdx.z) to compute their global location/ranking/ID and deduce their portion of the work (similar to a rank in MPI), as sketched after the memory hierarchy below.

Programming concepts: memory hierarchy
- Local memory
- Shared memory
- Global memory (R/W)
- Constant memory (read-only)
- Texture memory (read-only)
Global, constant, and texture memory can be accessed by both device & host, but they are optimized for different purposes, so you need to choose the appropriate memory type for each use. For now, just keep this hierarchy in mind. (NVIDIA CUDA Programming Guide)
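Returning to the grid discussion above: a minimal sketch (hypothetical names) of the standard idiom for combining blockIdx, blockDim, and threadIdx into a global thread ID, so that a grid of fixed-size blocks can cover an array of arbitrary length:

/* Each thread computes its global ID; threads past the end do nothing. */
__global__ void incrementAll(int *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global rank, like an MPI rank */
    if (i < n)                                      /* the last block may overhang the array */
        a[i]++;
}

/* Host-side launch: enough 256-thread blocks to cover n elements.
   int blocks = (n + 255) / 256;
   incrementAll<<<blocks, 256>>>(d_a, n);            */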

Refresh
- Kernel
- Thread
- Thread block
- Grid
- Memory: local, shared, global / constant / texture

CUDA-C
CUDA-C extends C/C++ with:
- Declspecs (declaration specifications). Function declarations: __global__, __device__, __host__. Variable declarations: __shared__, __constant__, local.
- Built-in variables: threadIdx, blockIdx, blockDim.
- Intrinsics (intrinsic functions): __syncthreads(), cudaThreadSynchronize(), ... An intrinsic function is handled by the compiler instead of by a library function (http://en.wikipedia.org/wiki/intrinsic_function).
- A runtime API: memory management, execution management, etc.
- Syntax for launching kernels.

Examples: 0. memory allocation & initialization; 1. data movement.

Memory allocation & initialization
Before we can use the GPU for any computation, we need the data on the GPU first, unless the data are generated on the GPU. Global memory is the main means of communicating R/W data between host and device: its contents are visible to all threads, access latency is long (i.e., it is slow), and contents persist between kernel launches. [Figure: host and global memory; a grid with blocks (0,0) and (1,0), each with its own shared memory and threads (0,0) and (1,0), all reading and writing global memory.]

On the host, the familiar C API applies:

void *malloc(size_t size);               /* allocate dynamic memory */
void *calloc(size_t nmemb, size_t size);
void *realloc(void *ptr, size_t size);
void *memset(void *s, int c, size_t n);  /* fill memory with a constant byte */
void free(void *ptr);                    /* free the memory space pointed to by ptr */

The host-side benchmark allocates, clears, and frees a block repeatedly until about one second of total time has accumulated (timer is the lecture's timing helper):

double host(const unsigned long size) {
    double t1 = 0;
    int count = 0;
    do {
        char *data = (char *)malloc(size);
        memset(data, 0, size);
        t1 += timer.elapsedTime();
        free(data);
        count++;
    } while (t1 < 1.0);
    return t1 / count;
}
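To make those extensions concrete, here is a small sketch (hypothetical names, not from the lecture) that touches each category above: a __device__ helper, a __global__ kernel that uses a __shared__ array, the built-in variables, the __syncthreads() intrinsic, and the <<<...>>> launch syntax:

/* Device function: compiled for the GPU, callable from kernels. */
__device__ int square(int x) { return x * x; }

/* Kernel: launched from the host with <<<blocks, threads>>>. */
__global__ void squareAll(int *a) {
    __shared__ int tile[256];                       /* shared by all threads of a block */
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* built-in variables */
    tile[threadIdx.x] = a[i];
    __syncthreads();                                /* barrier: whole block has loaded */
    a[i] = square(tile[threadIdx.x]);
}

/* Host side, assuming d_a holds 4 * 256 ints in device memory:
   squareAll<<<4, 256>>>(d_a);                       */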

Memory allocation & initialization (01.cu)
The device-side version of the benchmark mirrors host() above, using the CUDA runtime:

double device(const unsigned long size) {
    double t = 0;
    int count = 0;
    do {
        char *data;
        cudaMalloc((void **)&data, size);
        cudaMemset(data, 0, size);
        cudaThreadSynchronize();   /* wait until the device has finished */
        t += timer.elapsedTime();
        cudaFree(data);
        count++;
    } while (t < 1.0);
    return t / count;
}

[Figure: bandwidth on erasing data, host vs. device, on four machines: Config #0: Core i7 + GeForce GT; Config #1: Xeon + GeForce 8800 Ultra; Config #2: Core 2 Quad @ 2.8 GHz + GeForce GTX 280; Config #3: Core i7 + NVIDIA Tesla.]

Memory copy between two memory blocks
The host version times memcpy(); the device version replaces the copy with cudaMemcpy() plus a synchronization:

double memop(void *dst, void *src, const unsigned long size, cudaMemcpyKind dir) {
    double t = 0;
    int count = 0;
    do {
        memcpy(dst, src, size);        /* host version */
        /* device version:
           cudaMemcpy(dst, src, size, dir);
           cudaThreadSynchronize();    */
        t += timer.elapsedTime();
        count++;
    } while (t < T_MIN);
    return t / count;
}

enum cudaMemcpyKind {
    cudaMemcpyHostToDevice,
    cudaMemcpyDeviceToHost,
    cudaMemcpyDeviceToDevice,
    cudaMemcpyHostToHost
};

Example 1: memory performance
[Table: measured copy times and bandwidths for Host-to-Device, Device-to-Host, Device-to-Device, and Host-to-Host transfers on Configs #0 through #3.]

CUDA-C APIs for memory management
- cudaError_t cudaMalloc(void **devPtr, size_t size)
- cudaError_t cudaMallocHost(void **ptr, size_t size): allocates size bytes of host memory that is page-locked/pinned and accessible to the device.
- cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind), where kind is cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, or cudaMemcpyHostToHost.
- cudaError_t cudaFree(void *devPtr)
- cudaError_t cudaMemset(void *devPtr, int value, size_t count)

Readings
No programming assignment this week! The serial implementation of the term project is due on May 10, 2011.
- NVIDIA CUDA Programming Guide: Chapters 1 and 2; most of Section 3.2; Chapter 4 (optional); Chapter 5; quickly browse Appendices B and C. (http://10.118..:8000/doc/cuda/.0rc/cuda_c_programming_guide.pdf)
- CUDA Reference Manual: the section on memory management.
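Putting these APIs together end to end, a minimal sketch (hypothetical names, not the lecture's 01.cu) of a full host-device round trip with cudaMalloc, cudaMemcpy, cudaMemset, and cudaFree:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

int main(void) {
    const size_t size = 1 << 20;        /* 1 MiB */
    char *h_data = (char *)malloc(size);
    memset(h_data, 0xAB, size);         /* initialize on the host */

    char *d_data;
    cudaMalloc((void **)&d_data, size);                        /* device allocation */
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);  /* host -> device */
    cudaMemset(d_data, 0, size);                               /* erase on the device */
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);  /* device -> host */

    printf("first byte after round trip: %d (expect 0)\n", h_data[0]);

    cudaFree(d_data);
    free(h_data);
    return 0;
}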