
Graph Partitioning
Standard problem in parallelization: partitioning a sparse matrix into nearly independent blocks, or partitioning discretization grids in FEM.
Partition a given graph G=(V,E) into k subgraphs of nearly equal size, such that the weight of the broken (cut) edges is minimal.
The general problem is NP-hard: not solvable in reasonable time!
Approaches:
- Multilevel methods
- Spectral methods

Multilevel Methods
- Collapse vertices and edges to produce a smaller graph
- Partition the smaller graph
- Apply recursively
- Example: METIS

Spectral Partitioning
For each graph we can define the Laplacian:
- Diagonal matrix D containing the degree of each node
- Adjacency matrix A with A_{i,j} = 1 if there is an edge between i and j
- Laplacian L := D - A, with L1 = 0; L is symmetric positive semidefinite
Compute the second smallest eigenvalue and its eigenvector v, called the Fiedler vector.
Partition the indices (vertices) according to the negative/positive components of v. This gives a partitioning into two subgraphs. Repeat.
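
As a small worked illustration (not from the slides), take the path graph on four vertices 1-2-3-4:

\[
L = D - A =
\begin{pmatrix}
 1 & -1 &  0 &  0\\
-1 &  2 & -1 &  0\\
 0 & -1 &  2 & -1\\
 0 &  0 & -1 &  1
\end{pmatrix},
\qquad
\lambda_2 = 2 - \sqrt{2},
\qquad
v \approx (0.92,\ 0.38,\ -0.38,\ -0.92)^T .
\]

The signs of the Fiedler vector v split the vertices into {1,2} and {3,4}, cutting only the single edge 2-3.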

GPU CUDA Programming (NVIDIA)
Compute Unified Device Architecture:
- C programming on GPUs
- Requires no knowledge of graphics APIs or GPU programming
- Access to native instructions and memory
- Easy to get started
- Stable, free, documented, supported
- Windows and Linux

GPU
The GPU is viewed as a compute device operating as a coprocessor to the main CPU (host):
- For data-parallel, cost-intensive functions, or functions that are executed many times but on different data (loops)
- A function compiled for the device is called a kernel
- The kernel is executed on the device as many different threads
- Host (CPU) and device (GPU) manage their own memory
- Data can be copied between them

3rd Level of Parallelism
- MPI
- OpenMP

Computational Grid
[Figure: the computational grid as a 2D arrangement of thread blocks, each block a 2D arrangement of threads]
- The computational grid consists of a grid of thread blocks
- Each thread executes the kernel
- The application specifies the grid and the block dimensions
- Grid layouts can be 1D, 2D, or 3D
- Maximal sizes are determined by GPU memory and kernel complexity
- Each block has a unique block ID
- Each thread has a unique thread ID (within its block)
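
A minimal sketch (not from the slides) of how the block and thread IDs are typically combined into a global index inside a kernel; the kernel name and sizes are illustrative:

__global__ void scale( float* data, float s, int n )
{
    // global index = block offset + thread offset within the block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if ( idx < n )                 // guard: the grid may cover more threads than n
        data[idx] = s * data[idx];
}

// Host side: pick a block size and round the grid size up.
// dim3 can also describe 2D/3D layouts, e.g. dim3 grid( gx, gy ).
int threads = 256;
int blocks  = (n + threads - 1) / threads;
scale<<<blocks, threads>>>( d_data, 2.0f, n );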

Example: Matrix Addition

CPU program:

void add_matrix( float* a, float* b, float* c, int N )
{
    int index;
    for ( int i = 0; i < N; ++i )
        for ( int j = 0; j < N; ++j ) {
            index = i + j*N;
            c[index] = a[index] + b[index];
        }
}

int main()
{
    add_matrix( a, b, c, N );
}

CUDA program:

__global__ void add_matrix( float* a, float* b, float* c, int N )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j*N;
    if ( i < N && j < N )
        c[index] = a[index] + b[index];
}

int main()
{
    dim3 dimBlock( blocksize, blocksize );
    dim3 dimGrid( N/dimBlock.x, N/dimBlock.y );
    add_matrix<<<dimGrid, dimBlock>>>( a, b, c, N );
}

The nested for loops are replaced with an implicit grid.

Memory Model
CUDA exposes all the different types of memory on the GPU:
[Figure: each thread has registers and local memory; each block has shared memory; all blocks access the device-wide global, constant, and texture memory]

Memory Model II Overview
- Registers: per thread, read-write
- Local memory: per thread, read-write
- Shared memory: per block, read-write, for sharing data within a block
- Global memory: per grid, read-write, not cached
- Constant memory: per grid, read-only, cached
- Texture memory: per grid, read-only, spatially cached
Explicit GPU memory allocation and deallocation via cudaMalloc() and cudaFree().
Access to GPU memory via pointers.
Copying between CPU and GPU memory is a slow operation: minimize this!
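
A short sketch (not from the slides) of how these memory spaces appear in code; the names and sizes are illustrative only:

__constant__ float coeff[16];            // constant memory, written from the host
__device__   float bias = 0.0f;          // global memory, device-wide lifetime

__global__ void combine( const float* in, float* out )   // in/out point into global memory
{
    __shared__ float tile[256];          // shared memory, one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = in[i];                     // x lives in a register (local memory if spilled)
    tile[threadIdx.x] = x * coeff[0];
    __syncthreads();
    out[i] = tile[threadIdx.x] + bias;   // sketch assumes blockDim.x == 256
}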

Multiple Levels of Parallelism
- Thread
- Thread block: up to 512 threads per block
  - Communicate via shared memory
  - Threads guaranteed to be resident
  - threadIdx, blockIdx
  - __syncthreads()
- Grid of thread blocks: f<<<N, T>>>( a, b, c )
  - Communication via global memory
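
A sketch (not from the slides) of block-level cooperation through shared memory and __syncthreads(); it reverses each block-sized chunk of an array:

__global__ void reverse_block( float* d )
{
    __shared__ float s[256];             // sketch assumes blockDim.x == 256
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t; // assumes the array length is a multiple of blockDim.x
    s[t] = d[i];
    __syncthreads();                     // wait until the whole block has filled s
    d[i] = s[blockDim.x - 1 - t];        // write back the mirrored element
}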

CUDA Programming Language
- Minimal C extensions
- A runtime library:
  - A host (CPU) component to control and access GPU(s)
  - A device component
  - A common component: built-in vector types, subset of the C standard library
- Function type qualifiers: specify where to call and execute a function
  __device__, __global__, __host__
- Variable type qualifiers
  __device__, __constant__, __shared__
- Kernel execution directive
  foo<<<GridDim, BlockDim>>>( ... )
- Built-in variables for grid/block size and block/thread indices
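
A small sketch (not from the slides) showing the function type qualifiers and the launch syntax together; names are illustrative:

__device__ float square( float x )            // callable from device code only
{
    return x * x;
}

__global__ void squares( float* out, int n )  // launched from the host, runs on the device
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < n ) out[i] = square( (float)i );
}

__host__ void launch_squares( float* d_out, int n )   // ordinary CPU function (__host__ is the default)
{
    squares<<< (n + 255) / 256, 256 >>>( d_out, n );
}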

CUDA Compiler
- Source files must be compiled with the CUDA compiler nvcc
- CUDA kernels are stored in files ending with .cu
- nvcc uses the host compiler to compile CPU code
- nvcc automatically handles #includes and linking
Workflow:
- Write kernels + CPU caller in .cu files
- Compile with nvcc
- Store the signature of the CPU caller in a header file
- #include the header file in C/C++ sources
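
One possible layout for this workflow (file names and flags are illustrative, not from the slides):

// MatrixAdd.h  -- signature of the CPU caller, included from C/C++ sources
void run_add_matrix( const float* a, const float* b, float* c, int N );

// MatrixAdd.cu -- kernel plus the CPU caller, compiled with nvcc
// main.cpp     -- ordinary C++ code that #includes "MatrixAdd.h"
//
// Build (one possibility):
//   nvcc -c MatrixAdd.cu -o MatrixAdd.o
//   g++ main.cpp MatrixAdd.o -lcudart -o app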

GPU Runtime Component
Only available on the GPU:
- Less accurate, faster math functions, e.g. __sinf(x)
- __syncthreads(): wait until all threads in the block have reached this point
- Type conversion functions with rounding mode
- Type casting functions
- Texture functions
- Atomic functions: guarantee that an operation (like add) is performed on a variable without interference from other threads
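
A brief sketch (not from the slides) of an atomic function: building a histogram, where concurrent increments of the same bin would otherwise race:

__global__ void histogram( const int* values, int* bins, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < n )
        atomicAdd( &bins[ values[i] ], 1 );   // add 1 without interference from other threads
}
// bins must be zero-initialized (e.g. with cudaMemset) before the launch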

CPU Runtime Component
Only available on the CPU:
- Device management: get device properties, multi-GPU control, ...
- Memory management: cudaMalloc(), cudaMemcpy(), cudaFree(), ...
- Texture management
- Asynchronous concurrent execution
- Built-in vector types, e.g. float1, float2, int3, ushort4, ...
- Mathematical functions: the standard math.h on the CPU
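
For example, device management on the host might look like this (a sketch, not from the slides):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount( &count );              // number of CUDA-capable devices
    for ( int d = 0; d < count; ++d ) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties( &prop, d );   // query the properties of device d
        printf( "Device %d: %s, %d multiprocessors\n",
                d, prop.name, prop.multiProcessorCount );
    }
    return 0;
}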

const int N = 1024;
const int blocksize = 16;

// COMPUTE KERNEL
__global__ void add_matrix( float* a, float* b, float* c, int N )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j*N;
    if ( i < N && j < N )
        c[index] = a[index] + b[index];
}

int main()
{
    // CPU MEMORY ALLOCATION
    float *a = new float[N*N];
    float *b = new float[N*N];
    float *c = new float[N*N];
    for ( int i = 0; i < N*N; ++i ) {
        a[i] = 1.0f;
        b[i] = 3.5f;
    }

    // GPU MEMORY ALLOCATION
    float *ad, *bd, *cd;
    const int size = N*N*sizeof(float);
    cudaMalloc( (void**)&ad, size );
    cudaMalloc( (void**)&bd, size );
    cudaMalloc( (void**)&cd, size );

    // COPY DATA TO GPU
    cudaMemcpy( ad, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( bd, b, size, cudaMemcpyHostToDevice );

    // EXECUTE KERNEL
    dim3 dimBlock( blocksize, blocksize );
    dim3 dimGrid( N/dimBlock.x, N/dimBlock.y );
    add_matrix<<<dimGrid, dimBlock>>>( ad, bd, cd, N );

    // COPY RESULTS
    cudaMemcpy( c, cd, size, cudaMemcpyDeviceToHost );

    // CPU CLEAN UP, RETURN
    cudaFree( ad ); cudaFree( bd ); cudaFree( cd );
    delete[] a; delete[] b; delete[] c;
    return EXIT_SUCCESS;
}

Store in a source file (e.g. MatrixAdd.cu), compile with nvcc MatrixAdd.cu, run.

Execution Model
[Figure: a GPU as N multiprocessors, each containing M scalar processors (SP 1 .. SP M)]
- A GPU consists of N multiprocessors (MP)
- Each MP has M scalar processors (SP)
- Each MP processes batches of blocks; a block is processed by only one MP
- Each block is split into SIMD groups of threads called warps
- A warp is executed physically in parallel; a scheduler switches between warps
- A warp contains threads of consecutive, increasing thread ID
- The warp size is 32 threads today
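
A tiny sketch (not from the slides) of how thread IDs map onto warps, assuming the warp size of 32; device-side printf needs a reasonably recent GPU:

#include <cstdio>

__global__ void warp_info()
{
    int warp = threadIdx.x / 32;   // which warp of the block this thread belongs to
    int lane = threadIdx.x % 32;   // position of the thread within its warp
    if ( lane == 0 )
        printf( "block %d: warp %d starts at thread %d\n", blockIdx.x, warp, threadIdx.x );
}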

Costs of Operations
- 4 clock cycles: floating-point add, multiply, fused multiply-add; integer add, bitwise operations, compare, min, max
- 16 clock cycles: reciprocal, square root, log, 32-bit integer multiply
- 32 clock cycles: sin(x), cos(x), exp(x)
- 36 clock cycles: floating-point division
- Particularly expensive: integer division, modulo
- Double precision will perform at half the speed
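
One practical consequence (an illustrative sketch, not from the slides): when the divisor is a power of two, the expensive integer division and modulo can be replaced by cheap bit operations:

__global__ void index_math( const int* in, int* quot, int* rem, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < n ) {
        // in[i] / 32 and in[i] % 32, valid for non-negative values
        quot[i] = in[i] >> 5;
        rem[i]  = in[i] & 31;
    }
}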

Right Kind of Memory
Constant memory:
- Quite small (about 20K)
- As fast as register access if all threads in a warp access the same location
Texture memory:
- Spatially cached
- Optimized for 2D locality
- Neighbouring threads should read neighbouring addresses
- No need to think about coalescing
Constraint: these memories can only be updated from the CPU.
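
A sketch (not from the slides) of declaring constant memory and updating it from the CPU via cudaMemcpyToSymbol; the filter size is illustrative:

__constant__ float filter[9];                    // small, read-only from the device

__global__ void apply_filter( const float* in, float* out, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < n )
        out[i] = filter[0] * in[i];              // all threads read the same location: fast
}

// Host side: constant memory can only be written from the CPU.
void upload_filter( const float* host_filter )
{
    cudaMemcpyToSymbol( filter, host_filter, 9 * sizeof(float) );
}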

Accessing Global Memory
- 4 cycles to issue a memory fetch, but 400-600 cycles of latency!
- Likely to be a performance bottleneck
- Order-of-magnitude speedups possible:
  - Coalesce memory accesses
  - Use shared memory to reorder non-coalesced addressing

Coalescing (Assembling)
For best performance, global memory accesses should be coalesced:
- A memory access coordinated within a warp
- A contiguous, aligned region of global memory:
  - 128 bytes: each thread reads a float or int
  - 256 bytes: each thread reads a float2 or int2
  - 512 bytes: each thread reads a float4 or int4
  - float3s are not aligned
- The warp base address (WBA) must be a multiple of 16*sizeof(type)
- The k-th thread should access the element at WBA + k
- These restrictions apply to both reading and writing
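
Two contrasting access patterns (a sketch, not from the slides); the second breaks the "thread k accesses WBA + k" rule and will not coalesce on the hardware described here:

__global__ void copy_coalesced( const float* in, float* out )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];                    // thread k reads WBA + k: coalesced
}

__global__ void copy_strided( const float* in, float* out, int stride )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];           // non-sequential addresses within a warp: not coalesced
}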

Coalesced memory access: thread k accesses WBA + k.

Coalesced memory access: thread k accesses WBA + k; not all threads have to participate.

Non-coalesced memory access: misaligned starting address.

Non-coalesced memory access: non-sequential access.

Non-coalesced memory access: wrong size of type.

Software for GPUs
- CUDA
- OpenCL: vendor-independent industry standard
- DirectCompute: GPU computing from Microsoft

OpenCL
- C-based language for GPU/device kernels, plus a low-level device API (application programming interface)
- Same flavor as CUDA
- JIT (just-in-time) compilation of kernel programs
- Portable - but optimization is inevitably required for every platform
- Managed by the Khronos Group (non-profit organization)
  - All major vendors participate
  - This is the cross-vendor industry standard

Work Items and N-D Range
OpenCL execution model:
- Define an N-dimensional computation domain
- Execute the kernel for each point in the domain
Example: global dimensions 1024 x 1024 (1048576 individual work items); local dimensions 128 x 128 (a work group that executes together).
Synchronization between work items is possible only within work groups, using barriers and memory fences; it is not possible to synchronize outside of a work group.
Choose the dimensions that are best for your algorithm.

Parallelizing a For Loop

Scalar version:

void scalar_mul( const float *a, const float *b, float *c, int n )
{
    int i;
    for ( i = 0; i < n; i++ )
        c[i] = a[i] * b[i];
}

Data-parallel version (the kernel is executed n times, once for each work item):

__kernel void dp_mul( __global const float *a,
                      __global const float *b,
                      __global       float *c )
{
    int id = get_global_id(0);    // get the index of the work item
    c[id] = a[id] * b[id];
}
// execute over n work items

OpenCL Objects
Setup:
- Devices (CPU, GPU, Cell, ...)
- Contexts (collection of devices)
- Queues (submit work to the device)
Memory:
- Buffers (blocks of memory; kernels can access them however they like)
- Images (2D or 3D formatted images)
Execution:
- Programs (collections of kernels)
- Kernels
Synchronisation:
- Events

DirectCompute
GPGPU under Windows:
- Microsoft API / Windows standard for all GPU vendors
- General-purpose GPU computing under Windows
- Released with DirectX 11 / Windows 7
- Supports all CUDA-enabled devices and ATI GPUs
- Low-level API for device management and launching of kernels
- Defines an HLSL-based language for compute shaders (High Level Shading Language)

Application-level Integration
- Matlab: Parallel Computing Toolbox, Jacket / AccelerEyes
- Mathematica
- LabView
- PETSc
- Trilinos
- OpenFOAM, Ansys, ...

Application-specific Libraries
- CUSPARSE: NVIDIA library for sparse matrix-vector operations
- CUBLAS: NVIDIA library for dense linear algebra
- CUFFT: NVIDIA library for Fast Fourier Transforms
- CUSP: NVIDIA library for sparse linear algebra
- Thrust: a template library for CUDA applications (sort, reduce, ...)
- MAGMA: LAPACK for HPC on heterogeneous architectures
- OpenCurrent: PDEs, Gauss-Seidel, multigrid
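
As a flavor of how such libraries are called, a minimal CUBLAS sketch (not from the slides) computing y = alpha*x + y on device arrays; error checking is omitted:

#include <cublas_v2.h>

void saxpy_on_gpu( int n, float alpha, const float* d_x, float* d_y )
{
    cublasHandle_t handle;
    cublasCreate( &handle );                           // initialize the library
    cublasSaxpy( handle, n, &alpha, d_x, 1, d_y, 1 );  // y = alpha*x + y on the GPU
    cublasDestroy( handle );
}
// Link with: nvcc file.cu -lcublas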

Application-specific Libraries
- ViennaCL: scientific computing library, C++, OpenCL; BLAS, iterative methods, preconditioners
- LAMA: Library for Accelerated Math Applications; BLAS, various sparse matrix storage formats
- FEAST: Finite Element Analysis and Solution Tools
- PARDISO: fast direct solver for sparse problems, using the BLAS functionality of GPUs