Martin Kruliš

Outline
- GPGPU History
- Current GPU Architecture
- OpenCL Framework
- Example
- Optimizing the Previous Example
- Alternative Architectures

GPGPU History
- 1996: 3Dfx Voodoo 1 — first graphical (3D) accelerator for desktop PCs
- 1999: NVIDIA GeForce 256 — first Transform & Lighting unit
- 2000: NVIDIA GeForce2, ATI Radeon
- 2001: GPUs get programmable parts — DirectX vertex and fragment shaders (v1.0)
- 2006: OpenGL, DirectX 10, Windows Vista — unified shader architecture in hardware, geometry shader added

- 2007: NVIDIA CUDA — first GPGPU solution, restricted to NVIDIA GPUs
- 2007: AMD Stream SDK (previously CTM)
- 2009: OpenCL, DirectCompute — universal GPGPU solutions; Mac OS X (Snow Leopard) first to ship an OpenCL implementation
- 2010: OpenCL implementations from AMD and NVIDIA; OpenCL revision 1.1

CPU
- Few cores per chip
- General-purpose cores
- Processing different threads
- Huge caches to reduce memory latency
- Locality-of-reference problem

GPU
- Many cores per chip
- Cores specialized for numeric computations
- SIMT thread processing
- Huge number of threads and fast context switching
- Results in more complex memory transfers

NVIDIA Fermi (state of the art)
- 16 SMP units
- 512 CUDA cores
- 768 kB L2 cache
- Note that one CUDA core is the equivalent of one 5D AMD stream processor; by that metric, the Radeon 5870 has 320 cores.

Streaming Multiprocessor
- 32 CUDA cores
- 64 kB shared memory (configurable as L1 cache)
- 1024 registers per core
- 16 load/store units
- 4 special function units
- 16 double-precision ops per clock
- 1 instruction decoder
- All cores run in SIMT mode

Single Instruction, Multiple Threads
- Every core executes the same instruction
- Each core has its own set of registers
(Figure: one instruction decoder feeding many cores, each with private registers.)

CPU
- Expensive context switch
- Large caches required

GPU
- Fast context switch — another thread may run while the current one is stalled
- Small caches

Universal Framework for Parallel Computations
- Specification created by the Khronos Group
- Multiple implementations exist (AMD, NVIDIA, Apple, ...)

API for Different Parallel Architectures
- Multi-core CPUs, many-core GPUs, IBM Cell cards, ...
- Handles device detection, data transfers, and code execution

Extended Version of C99 for Programming Kernels
- A kernel is compiled at runtime for the selected device
- In theory, the best device for an application can be chosen dynamically; in practice, device-specific optimizations must still be considered

Parallel hardware is mapped to the following model:
- Device (CPU die or GPU card)
- Compute unit (CPU core or GPU SMP)
- Processing element (slot in SSE registers or GPU core)

Logical Layers
- Platform — an implementation of OpenCL
- Context — groups devices of a selected kind; buffers, programs, and other objects live in a context
- Device
- Command queue — created for a device; one device may have multiple command queues
(Example devices: Intel Core i7 — 4 cores with HT; ATI Radeon 5870 — 320 cores.)

// List all platforms
std::vector<cl::Platform> platforms;
cl_int err = cl::Platform::get(&platforms);
if (err != CL_SUCCESS) return 1;

// Context of all GPUs on the 1st platform
cl_context_properties cps[3] = {CL_CONTEXT_PLATFORM,
    (cl_context_properties)(platforms[0]()), 0};
cl::Context context(CL_DEVICE_TYPE_GPU, cps, NULL, NULL, &err);

// List all GPUs
std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

cl::Buffer buf(context, CL_MEM_READ_ONLY, sizeof(cl_float)*n);

// Create and compile the program from string source
cl::Program program(context,
    cl::Program::Sources(1, std::make_pair(source.c_str(), source.length())));
err = program.build(devices);

// Mark function as a kernel
cl::Kernel kernel(program, "function_name", &err);
err = kernel.setArg(0, buf);

// GPU command queue
cl::CommandQueue commandQueue(context, devices[0], 0, &err);

// Send commands and wait for them to finish
commandQueue.enqueueWriteBuffer(buf, CL_TRUE, 0, sizeof(cl_float)*n, data);
commandQueue.enqueueNDRangeKernel(kernel, cl::NullRange,
    cl::NDRange(n), cl::NDRange(grp), NULL, NULL);
commandQueue.finish();

A Kernel
- A function written in an extended version of C99
- Compiled at runtime for the target platform, with a high degree of optimization

Kernel Execution
- Task parallelism — multiple kernels are enqueued in a command queue and executed concurrently
- Data parallelism — multiple instances (threads) are created from a single kernel, each operating on distinct data

Data Parallelism
- Each kernel instance has its own ID — a 1-3 dimensional vector of numbers from 0 to N-1
- The ID identifies the portion of data to be processed
- Threads form groups
  - Threads within one group can cooperate in some ways
  - Groups have IDs as well
- T_ID = G_ID * G_SIZE + T_LOCAL_ID
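The ID arithmetic above can be sketched on the CPU side. This is an illustrative helper (the names `global_id`, `group_of`, and `local_of` are ours, not part of the OpenCL API); inside a kernel, the same values are returned by the built-ins get_global_id(), get_group_id(), and get_local_id().

```cpp
#include <cstddef>

// Global thread ID reconstructed from group ID and local ID,
// mirroring the slide's formula: T_ID = G_ID * G_SIZE + T_LOCAL_ID.
inline std::size_t global_id(std::size_t group_id,
                             std::size_t group_size,
                             std::size_t local_id) {
    return group_id * group_size + local_id;
}

// Inverse mapping: which group does a global ID belong to...
inline std::size_t group_of(std::size_t gid, std::size_t group_size) {
    return gid / group_size;
}

// ...and which local slot inside that group?
inline std::size_t local_of(std::size_t gid, std::size_t group_size) {
    return gid % group_size;
}
```

For example, with a group size of 64, local thread 5 of group 3 has global ID 3*64 + 5 = 197.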

Types of Memory
- private — belongs to one thread
- local — shared by a workgroup
- global — memory of the device
- constant — read-only version of global memory

Threads and Workgroups — Mapping to Hardware
- A workgroup is assigned non-preemptively to an SMP
- Threads are mapped to cores; multiple threads may be mapped to one core
- Remaining workgroups wait until running workgroups terminate
(Figure: workgroups of threads mapped onto the cores of two SMPs.)

Data Types
- Almost all standard C99 types (int, float, ...)
- Some predefined types: size_t, ptrdiff_t
- Special type half — a 16-bit counterpart of float

Vector Data Types
- typeN, where type is a standard type and N is 2, 4, 8, or 16 — e.g., float4, int16
- Items are accessed as structure members:
  int4 vec = (int4)(1, 2, 3, 4);
  int x = vec.x;
- Special swizzling operations are allowed:
  int4 vec2 = vec.xxyy;
  vec = vec.wzyx;

Functions
- Other functions (besides the kernel) can be defined in the program; the kernel is just an entry point
- It is possible to call standard functions like printf(); however, this works only on the CPU
- There are many built-in functions:
  - thread-related functions (e.g., get_global_id())
  - mathematical and geometric functions — originally designed for graphics; some are translated into a single instruction
  - functions for asynchronous memory transfers

Restrictions and Optimization Issues
- Branching problem (if-else) — a workgroup runs in SIMT mode, so all branches are followed; use conditional assignment rather than branches
- For loops — the compiler attempts to unroll them automatically
- While loops — same problem as branching
- Vector operations — translated into a single instruction where possible (e.g., SSE); the compiler attempts to generate them automatically
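The "conditional assignment rather than branches" advice can be illustrated with a hypothetical clamp function (the function names are ours; OpenCL C additionally offers a built-in select() for exactly this pattern). Both forms compute the same result, but the second exposes it as value selection, which maps onto a single predicated/select instruction instead of divergent control flow:

```cpp
// Branch-divergent form: work-items taking different sides of the
// 'if' serialize within a SIMT group.
float clamp_branchy(float x, float lo, float hi) {
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

// Conditional-assignment form: both candidates exist and one is
// selected, so every work-item runs the same instruction stream.
float clamp_branchless(float x, float lo, float hi) {
    float t = (x < lo) ? lo : x;   // select lower bound
    return (t > hi) ? hi : t;      // select upper bound
}
```

Whether a ternary compiles to a select instruction is up to the compiler, but it gives it the chance; an explicit if-else chain often does not.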

Synchronization

Global
- Explicit barriers added to the command queue
- Command queue operations with event dependencies

Within a Workgroup
- Local barriers
- Memory fences

Atomic Operations (on integers)
- For both local and global memory
- Base and extended versions
  - base — common operations (add, sub, xchg, cmpxchg, ...)
  - extended — min, max, and, or, xor
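Why atomics are needed can be shown with a CPU analogy (this sketch uses std::thread/std::atomic, not OpenCL; the function name `parallel_count` is ours). Many threads increment one shared counter; fetch_add makes each read-modify-write indivisible, just as atom_add() does on local or global memory inside a kernel:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// CPU analogy of OpenCL's atom_add on a shared counter: without the
// atomic, concurrent increments would race and updates would be lost.
int parallel_count(int num_threads, int increments_per_thread) {
    std::atomic<int> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < num_threads; ++t)
        pool.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i)
                counter.fetch_add(1);   // indivisible read-modify-write
        });
    for (auto &th : pool) th.join();
    return counter.load();
}
```

With a plain int instead of std::atomic<int>, the result would be unpredictable and usually smaller than expected.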

kernel void mul_matrix(global const float *m1,
                       global const float *m2,
                       global float *mres)
{
    int n = get_global_size(0);
    int r = get_global_id(0);
    int c = get_global_id(1);
    float sum = 0;
    for (int i = 0; i < n; ++i)
        sum += m1[r*n + i] * m2[c*n + i];  // the second matrix is already transposed
    mres[r*n + c] = sum;
}
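The per-work-item logic of the kernel can be checked with a sequential CPU reference (this helper, `mul_matrix_ref`, is ours and not from the slides). Each (r, c) pair plays the role of one work-item; since m2 is stored transposed, both operands are traversed row-wise, which is the access pattern the kernel relies on:

```cpp
#include <vector>

// Sequential reference of the naive kernel: result[r][c] is the dot
// product of row r of m1 with row c of m2t (= column c of the original
// second matrix, because m2t is stored transposed).
std::vector<float> mul_matrix_ref(const std::vector<float> &m1,
                                  const std::vector<float> &m2t,
                                  int n) {
    std::vector<float> res(n * n);
    for (int r = 0; r < n; ++r)          // plays get_global_id(0)
        for (int c = 0; c < n; ++c) {    // plays get_global_id(1)
            float sum = 0;
            for (int i = 0; i < n; ++i)
                sum += m1[r*n + i] * m2t[c*n + i];
            res[r*n + c] = sum;
        }
    return res;
}
```

A reference like this is handy for validating the GPU output buffer on small inputs before scaling up.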

Multiplication of two square matrices — AMD Radeon 5870 (320 cores) vs. Core i7 (4 HT cores), naïve N^3 algorithm, second matrix transposed. Times in ms:

                 1024x1024   2048x2048
CPU - serial        3630       29370
CPU - TBB            591        4090
CPU - OpenCL         319        2338
GPU - OpenCL         392        3400

The naïve GPU version is barely faster than the parallel CPU versions (?!)

Profiler Results
- How many times each thread read global memory
- What percentage of the time memory-read operations took
- The ratio of ALU operations to read operations
- What percentage of the time the fetch units were stalled (waiting)

Coalesced Load
- Threads running in SIMT mode have to cooperate
- Each thread loads a different 4-byte word from an aligned contiguous block
- Hardware performs such a load as one memory transaction
- Details and rules for coalesced loads differ between GPU generations
(Figure: threads reading consecutive words of one aligned 64 B block of memory.)
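The coalescing rule can be modeled roughly in code. This sketch (functions `is_coalesced` and `consecutive_words` are ours, and the single-segment model is a simplification; real GPUs have generation-specific rules) just checks whether all addresses touched by one SIMT group fall into a single aligned segment:

```cpp
#include <cstdint>
#include <vector>

// Rough model: a load is coalesced when every address fits in one
// aligned segment (64 B in the slide's figure), so the hardware can
// serve the whole group with a single memory transaction.
bool is_coalesced(const std::vector<std::uint64_t> &addrs,
                  std::uint64_t segment = 64) {
    if (addrs.empty()) return true;
    std::uint64_t seg = addrs[0] / segment;      // aligned segment index
    for (std::uint64_t a : addrs)
        if (a / segment != seg) return false;    // spills into another segment
    return true;
}

// Addresses generated by 'threads' consecutive work-items, each loading
// a different 4-byte word starting at 'base'.
std::vector<std::uint64_t> consecutive_words(std::uint64_t base, int threads) {
    std::vector<std::uint64_t> a;
    for (int t = 0; t < threads; ++t)
        a.push_back(base + 4u * t);
    return a;
}
```

Sixteen threads reading consecutive floats from an aligned base cover exactly one 64 B segment; shifting the base by one word already splits the access across two segments (two transactions).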

Banking
- Local memory is divided into banks
- A bank manages 4-byte words; addresses are assigned modulo the number of banks (16 or 32)
(Figures: a conflict-free pattern where each bank is accessed by a single thread, and a broadcast pattern where all threads read the same word.)
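The cost of a bank conflict can be estimated by counting collisions. In this sketch (the function `serialization_factor` is ours; the hardware's broadcast special case for identical addresses is deliberately not modeled), the worst per-bank collision count is the factor by which the access serializes — 1 means conflict-free:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Bank of a 4-byte word = word index modulo the number of banks
// (16 or 32). The maximum number of threads hitting the same bank is
// how many times the hardware must replay the access.
int serialization_factor(const std::vector<std::uint32_t> &word_indices,
                         int num_banks = 16) {
    std::vector<int> hits(num_banks, 0);
    for (std::uint32_t w : word_indices)
        ++hits[w % num_banks];
    return *std::max_element(hits.begin(), hits.end());
}
```

Consecutive word indices touch each bank once (factor 1); a stride of 2 folds 16 threads onto 8 banks (factor 2), halving local-memory throughput.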

Optimized Solution
- A workgroup computes a 16x16 block of results
- In each step, appropriate 16x16 blocks of the input matrices are loaded into local memory and the intermediate results are updated

kernel void mul_matrix_opt(global const float *m1,
                           global const float *m2,
                           global float *mres,
                           local float *tmp1,
                           local float *tmp2)
{
    int size = get_global_size(0);
    int lsize_x = get_local_size(0);
    int lsize_y = get_local_size(1);
    int block_size = lsize_x * lsize_y;
    int gid_x = get_global_id(0);
    int gid_y = get_global_id(1);
    int lid_x = get_local_id(0);
    int lid_y = get_local_id(1);
    int offset = lid_y*lsize_x + lid_x;
    float sum = 0;
    for (int i = 0; i < size; i += lsize_x) {
        // Copy corresponding blocks to local memory
        tmp1[offset] = m1[gid_y*size + i + lid_x];
        for (int j = 0; j < lsize_x / lsize_y; ++j)
            tmp2[offset + j*block_size] = m2[(gid_x + lsize_y*j)*size + i + lid_x];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Compute intermediate results from current blocks
        for (int k = 0; k < lsize_x; ++k)
            sum += tmp1[lid_y*lsize_x + k] * tmp2[lid_x*lsize_x + k];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    mres[gid_y*size + gid_x] = sum;
}
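The blocking strategy itself is independent of OpenCL and can be sketched as a sequential CPU routine (the function `mul_matrix_tiled` is ours; it assumes square tiles and that n is a multiple of the tile size). On the GPU the tiles live in local memory; here they simply stay hot in cache, but the arithmetic and the reuse pattern are the same:

```cpp
#include <vector>

// Tiled multiplication with the second matrix stored transposed: the
// result is computed in tile x tile blocks, and each input tile is
// reused 'tile' times once loaded -- the reuse the kernel gets from
// staging tiles in local memory.
std::vector<float> mul_matrix_tiled(const std::vector<float> &m1,
                                    const std::vector<float> &m2t,
                                    int n, int tile = 16) {
    std::vector<float> res(n * n, 0.0f);
    for (int bi = 0; bi < n; bi += tile)          // block row of the result
        for (int bj = 0; bj < n; bj += tile)      // block column of the result
            for (int bk = 0; bk < n; bk += tile)  // step along the shared dimension
                for (int r = bi; r < bi + tile; ++r)
                    for (int c = bj; c < bj + tile; ++c) {
                        float sum = 0;
                        for (int k = bk; k < bk + tile; ++k)
                            sum += m1[r*n + k] * m2t[c*n + k];
                        res[r*n + c] += sum;      // accumulate partial products
                    }
    return res;
}
```

Per output block, each input element is now fetched from global storage once per tile step instead of once per multiply-add, which is exactly the 16x fetch reduction the profiler reports below.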

Optimized version for GPU — 45x faster than the serial CPU version, 3.6x faster than the parallel CPU version. Times in ms:

                 1024x1024   2048x2048
CPU - serial        3630       29370
CPU - TBB            591        4090
CPU - OpenCL         319        2338
GPU - OpenCL         392        3400
GPU - opt. OCL       100         654

Profiler Results
- Fetches were reduced from 2048 to 128 per thread (16x)
- ALU operations took 45% of total time
- Data loads took only 9% of the time, and the fetch units were stalled for only 5% of the time
- However, there were many bank conflicts

Data Transfers to Computations Ratio (times in ms)

1024x1024 matrices (8 MB to GPU, 4 MB from GPU):
- naïve:     372 computation + 20 data transfers
- optimized:  80 computation + 20 data transfers

2048x2048 matrices (32 MB to GPU, 16 MB from GPU):
- naïve:     3330 computation + 70 data transfers
- optimized:  587 computation + 67 data transfers

Two vectors of 16M floats
- multiplication: z_i = x_i * y_i
- equation: z_i = x_i * y_i * sqrt(x_i) + cos(y_i) * x_i

Times in ms:

                 multiplication   equation
CPU - serial          137            505
CPU - TBB             148            148
GPU - OpenCL           38             23

The GPU computation itself took only 2.1 ms and 5.5 ms, respectively; the rest is data transfer.

Special Version of the Knapsack Problem
- A set of (30) numbers; each is added into the sum either as positive or as negative
- Trying to find a given sum
- The data are modified so that no solution exists: the set contains only odd numbers and we are looking for an even one

Times in ms:
- CPU - serial:   6193
- CPU - TBB:      1875
- GPU - OpenCL:    493
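The search itself can be sketched as a brute force over all sign assignments (the function `signed_sum_exists` is ours and runs sequentially, so it is only practical for small sets; in the benchmark, each sign mask would be one work-item, which is why the problem parallelizes so well):

```cpp
#include <cstdint>
#include <vector>

// Bit i of 'mask' decides whether nums[i] is added or subtracted.
// Exhaustively tries all 2^n sign assignments.
bool signed_sum_exists(const std::vector<long long> &nums, long long target) {
    std::uint64_t combos = 1ull << nums.size();
    for (std::uint64_t mask = 0; mask < combos; ++mask) {
        long long sum = 0;
        for (std::size_t i = 0; i < nums.size(); ++i)
            sum += (mask >> i & 1) ? nums[i] : -nums[i];
        if (sum == target) return true;
    }
    return false;
}
```

Note that flipping a sign never changes parity, so the parity of every reachable sum is fixed by the set itself; a target of the wrong parity forces the search to visit all 2^30 combinations before failing, which is what makes this a clean worst-case benchmark.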

Drivers
- Sometimes unstable; may even cause an OS crash

Data
- Needs to be transferred from host memory to the GPU and back

Task Parallelism on GPU
- Currently supported only on NVIDIA Fermi

Kernel Compilation
- Takes up to a few seconds

OpenCL and OpenGL Relations
- OpenCL is a younger sibling of OpenGL
- OpenCL has data types for representing images, and special types for representing colors
- Many object conversions are defined:
  - CL buffer to GL buffer
  - CL image object to GL texture
  - CL buffer to GL renderbuffer
- OpenCL and OpenGL may share a context, so OpenCL objects can be created from OpenGL objects

NVIDIA CUDA
- The first GPGPU technology
- Simpler API, designed for GPUs only
- No platform or device detection required
- Kernels are written directly in the main program

Microsoft DirectCompute
- Part of DirectX 11 (since November 2009)
- Designed primarily for game developers
- API similar to vertex or fragment shaders
