
HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes

Ian Glendinning

Outline
- NVIDIA GPU cards
- CUDA & OpenCL
- Parallel implementation of LDPC codes

NVIDIA GPU Card Families
- GeForce - the gaming graphics processing products Nvidia is best known for
- Quadro - computer-aided design workstation graphics processing products
- Tesla - general-purpose GPUs for high-end image generation applications

GeForce
GeForce 400 Series
- 11th generation of GeForce, introducing the Fermi architecture, with GF-codenamed chips
- The first cards were the GeForce GTX 470 and GTX 480, released April 2010, based on the GF100 chip, with 448 and 480 cores respectively
- Michael Rauter (AIT) uses a GTX 460, released July 2010, based on the GF104 chip, with 336 cores: lower power, better performance
GeForce 500 Series
- The first card was the GTX 580, Nov. 2010, with 512 cores, on the GF110 chip
- Madrid have a GeForce GTX 570, which has 480 cores

GeForce
GeForce 600 Series
- Introduced the Kepler architecture, with GK-codenamed chips
- The series also contains products with the older Fermi architecture
- The first Kepler card was the GeForce GTX 680, released March 2012, with the GK104 chip and 1536 cores
- Madrid have a GTX 670, released May 2012, also based on the GK104 chip, with 1344 cores

Quadro
- Essentially the same hardware as GeForce, at a premium price, for professional markets
- Driver software and firmware selectively enable features vital to segments of the workstation market, which prevents high-end customers from using the less expensive products
- A card used for gaming can shut down textures, shading or rendering after only approximating a final output, but algorithms on a CAD-oriented card tend to complete all rendering operations, prioritising accuracy and rendering quality over speed
- Christoph Pacher has a Quadro FX 580 (32 cores, G96 chip, based on the GeForce 9500 from the 9 series); Ian Glendinning has a Quadro 2000M (192 cores, GF106GLM chip, like the GeForce GTS 450)

Tesla
- Based on high-end GPUs from the GeForce 8 series upwards, as well as on the Quadro family
- NVIDIA's first dedicated general-purpose GPU line
- The main difference between Tesla and GeForce/Quadro was the inability to output images to a display, although the latest C-class products include a Dual-Link DVI port
- The main difference between Fermi-based Tesla cards and the GeForce 500 series is the unlocked double-precision performance: 1/2 of peak single-precision performance, compared with 1/8 of peak for GeForce cards
- Tesla cards also have ECC-protected memory and up to 6 GB of on-board memory

NVIDIA GPU Architectures
G80
- Introduced in the GeForce 8800 (8 series), Nov. 2006
- First GPU to support C
- First GPU to replace the separate vertex and pixel pipelines with a single unified processor
- First GPU to use a scalar thread processor, eliminating the need for programmers to manually manage vector registers
- Introduced the single-instruction multiple-thread (SIMT) execution model
- Introduced shared memory and barrier synchronization for inter-thread communication (see the sketch below)
GT200
- Introduced in the GeForce GTX 280, June 2008
- More cores and more threads, and added hardware memory-access coalescing and double-precision floating-point support
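
To make the shared-memory and barrier-synchronization mechanism concrete, here is a minimal sketch (not from the slides; the kernel name and block size are illustrative): the threads of one block stage data in on-chip shared memory and synchronize before reading each other's values.

    // Sketch: reverse each 256-element segment of an array in place, using
    // shared memory for intra-block communication (assumes blockDim.x == 256).
    __global__ void reverseInBlock(float *data)
    {
        __shared__ float tile[256];          // on-chip, visible to the whole block
        int base = blockIdx.x * blockDim.x;
        tile[threadIdx.x] = data[base + threadIdx.x];
        __syncthreads();                     // barrier: wait until the tile is full
        data[base + threadIdx.x] = tile[blockDim.x - 1 - threadIdx.x];
    }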

NVIDIA GPU Architectures
Fermi
- Up to 512 CUDA cores, each executing one floating-point or integer instruction per clock
- Cores are organized into streaming multiprocessors (SMs) of 32 cores each
- Six 64-bit memory partitions form a 384-bit memory interface supporting up to 6 GB of GDDR5 DRAM
- A host interface connects the GPU to the CPU via PCI Express
- The GigaThread global scheduler distributes thread blocks to the SM thread schedulers
- The GF100 architecture is built from a number of hardware blocks called Graphics Processing Clusters (GPCs), each containing a raster engine and up to four SMs
- A unified address space enables full C++ support

[Diagram: Fermi GF100 architecture]

[Diagram: Fermi GF100 streaming multiprocessor]

[Diagram: Fermi memory hierarchy]

NVIDIA GPU Architectures
Kepler
- Like Fermi, Kepler GPUs are composed of different configurations of Graphics Processing Clusters (GPCs), streaming multiprocessors (SMs) and memory controllers
- The GeForce GTX 680 GPU consists of four GPCs, eight next-generation streaming multiprocessors (SMX) and four memory controllers
- Key features of the architecture are:
  - the new SMX processor architecture
  - an enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a redesigned, faster DRAM I/O implementation
  - hardware support to enable new programming model capabilities

[Diagram: Kepler architecture, GeForce GTX 680]

[Diagram: Kepler architecture, GeForce GTX 680 (continued)]

[Diagram: Kepler memory hierarchy, GK110]

CUDA & OpenCL
- CUDA (Compute Unified Device Architecture) is the hardware and software architecture that enables NVIDIA GPUs to execute programs written in C, C++, Fortran, OpenCL, DirectCompute and other languages
- A CUDA program calls parallel kernels; a kernel executes in parallel across a set of parallel threads
- The programmer or compiler organizes threads into thread blocks and grids of thread blocks
- Each thread within a thread block executes an instance of the kernel and has a thread ID
- A thread block is a set of threads that can cooperate through barrier synchronization and shared memory, and has a block ID within its grid
- A grid is an array of thread blocks that execute the same kernel, read inputs from global memory, write outputs to global memory, and synchronize between dependent kernel calls (see the sketch below)
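
As a minimal illustration of the kernel/thread/block/grid terminology above (a sketch with hypothetical names, not code from the talk):

    // Each thread computes one element of c = a + b. The global index
    // combines the block ID and the thread ID within the block.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique index in the grid
        if (i < n)                                      // guard the final partial block
            c[i] = a[i] + b[i];
    }

    // Host side: launch a grid of thread blocks, 256 threads per block.
    // vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);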

[Diagram: CUDA thread hierarchy and memory spaces]

Hardware Execution
- CUDA's hierarchy of threads maps to a hierarchy of processors on the GPU:
  - a GPU executes one or more kernel grids
  - a streaming multiprocessor (SM) executes one or more thread blocks
  - CUDA cores and other execution units in the SM execute threads
- The SM executes threads in groups of 32 called warps
- While programmers can generally ignore warp execution for functional correctness, they can greatly improve performance by having the threads in a warp execute the same code path and access memory at nearby addresses (see the sketch below)
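
The following sketch (hypothetical kernels) contrasts the two access patterns: indexing by the global thread ID lets a warp read 32 adjacent words in few transactions, while a strided index scatters the warp's accesses.

    // Coalesced: consecutive threads of a warp touch consecutive addresses.
    __global__ void copyCoalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];        // one contiguous segment per warp
    }

    // Strided: each thread of a warp touches a different memory segment,
    // turning one memory transaction into many.
    __global__ void copyStrided(const float *in, float *out, int n, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }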

Automatic Scalability
- Blocks of threads execute independently of each other, so a GPU with more multiprocessors will automatically execute the program in less time than a GPU with fewer multiprocessors
- Only the runtime system needs to know the physical processor count (see the sketch below)
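
A common idiom that exploits this property is the grid-stride loop (a sketch, not from the slides): the kernel is correct for any grid size, so the runtime is free to spread the blocks over however many multiprocessors exist.

    // Each thread handles elements i, i + stride, i + 2*stride, ..., so the
    // result is independent of how many blocks the runtime actually runs.
    __global__ void scaleArray(float *x, float alpha, int n)
    {
        int stride = gridDim.x * blockDim.x;        // total threads in the grid
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            x[i] *= alpha;
    }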

OpenCL
- The CUDA architecture is a close match to the OpenCL architecture:
  - a CUDA streaming multiprocessor corresponds to an OpenCL compute unit
  - a multiprocessor executes a thread for each OpenCL work item, and a thread block for each OpenCL work group
  - a kernel is executed over an OpenCL NDRange by a grid of thread blocks
- Each thread block that executes a kernel is uniquely identified by its work group ID, and each thread by its global ID or by the combination of its local ID and its work group ID (see the sketch below)
- There is no OpenCL support in CUDA 5; for now we must rely on CUDA 4.2
- NVIDIA is developing OpenACC with Cray, PGI and others
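
The index correspondence can be written out as a small CUDA sketch, with the matching OpenCL calls in comments (names hypothetical):

    __global__ void writeGlobalIds(int *out)
    {
        int local  = threadIdx.x;                 // OpenCL: get_local_id(0)
        int group  = blockIdx.x;                  // OpenCL: get_group_id(0)
        int size   = blockDim.x;                  // OpenCL: get_local_size(0)
        int global = group * size + local;        // OpenCL: get_global_id(0)
        out[global] = global;                     // one work item per element
    }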

Parallel Implementation of LDPC Codes
- Implementation of Decoders for LDPC Block Codes and LDPC Convolutional Codes Based on GPUs, Yue Zhao and Francis C.M. Lau, July 2012
  - Up to 100-200 times speedup on an NVIDIA GTX 460 with 336 cores
  - http://arxiv.org/abs/1204.0334
- High-Throughput GPU-Based LDPC Decoding, Yang-Lang Chang, Cheng-Chun Chang, Min-Yu Huang and Bormin Huang, Proc. of SPIE Vol. 7810, 781008 (2010)
  - Regular LDPC codes
  - Achieved 271 times speedup on an NVIDIA Tesla C1060 with 240 cores

Parallel Implementation of LDPC Codes
- A Massively Parallel Implementation of QC-LDPC Decoder on GPU, Guohui Wang, Michael Wu, Yang Sun and Joseph R. Cavallaro, 2011 IEEE 9th Symposium on Application Specific Processors (SASP)
  - http://gpuscience.com/cs/a-massively-parallel-implementation-of-lowdensity-parity-check-decoder-on-gpu/
  - Quasi-cyclic LDPC (QC-LDPC) decoder, using the IEEE 802.11n WiFi and 802.16e WiMAX LDPC codes (irregular codes) as examples
  - Achieves throughput of up to 100.3 Mbps on an NVIDIA GTX 470 with 448 cores

Details of the IEEE 9th SASP Paper
Mapping the LDPC Decoding Algorithm to GPU Kernels
- Decoding can be split into two stages: horizontal processing and APP (a posteriori probability) update
- One computational kernel runs on the GPU for each stage; host code performs initialization and the memory copies between host and device
CUDA Kernel 1: Horizontal Processing
- Since the check-node-to-variable-node (CTV) messages are calculated independently, many parallel threads can be used to process them
- For an M x N parity-check matrix H, M threads are spawned, and each thread processes one row
- H consists of M_sub x N_sub sub-matrices, each of size Z x Z, generated by the expansion of a base matrix
- M_sub thread blocks are used, each with Z threads
- E.g. in 802.11n: 12 thread blocks each with 81 threads, 972 threads in total (see the sketch below)
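
The paper's message computation is not reproduced here, but the launch geometry it describes can be sketched as follows (kernel and variable names are hypothetical):

    // Kernel 1 sketch: one thread per row of H; rows are independent.
    __global__ void horizontalKernel(float *ctv, int M)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;  // global row index
        if (row >= M) return;
        // ... compute the check-to-variable (CTV) messages for this row ...
    }

    // Host launch for the 802.11n example above: M_sub = 12 blocks of
    // Z = 81 threads, i.e. 972 threads in total.
    // horizontalKernel<<<12, 81>>>(d_ctv, 12 * 81);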

Details of the IEEE 9th SASP Paper
Mapping the LDPC Decoding Algorithm to GPU Kernels
CUDA Kernel 2: APP Value Update
- There are N values to be updated, and the APP update is independent among the variable nodes
- N_sub thread blocks are used, each with Z threads
- Kernel 2 finally makes a hard decision for each bit, quantizing the APP value into 1 or 0 to get the decoded bit (see the sketch below)
Multi-codeword Parallel Decoding
- Since the numbers of threads and thread blocks are limited by the dimensions of the H matrix, multi-codeword decoding is needed to further increase the parallelism of the workload
- A two-level multi-codeword scheme is used: N_cw codewords are first packed into one macro-codeword (MCW); each MCW is decoded by a thread block, and N_mcw MCWs are decoded by a group of thread blocks
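
A matching sketch of Kernel 2's shape (hypothetical names; the message accumulation itself is elided):

    // Kernel 2 sketch: one thread per variable node; N_sub blocks of Z threads.
    __global__ void appUpdateKernel(const float *app, unsigned char *bits, int N)
    {
        int v = blockIdx.x * blockDim.x + threadIdx.x;  // variable-node index
        if (v >= N) return;
        // ... update the APP value of variable node v from its messages ...
        bits[v] = (app[v] < 0.0f) ? 1 : 0;  // hard decision (sign convention assumed)
    }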

Conclusions and Future Work
- LDPC codes can be efficiently implemented on NVIDIA GPUs
- Next steps:
  - analyse the code from Madrid
  - compare it with published work
  - implement new code

AIT Austrian Institute of Technology - your ingenious partner
Ian Glendinning
ian.glendinning.fl@ait.ac.at