
CSC573: TSHA Introduction to Accelerators
Sreepathi Pai
September 5, 2017
URCS

Outline
Introduction to Accelerators
GPU Architectures
GPU Programming Models

Accelerators
Single-core processors
Multi-core processors
What if these aren't enough?
Accelerators, specifically GPUs:
what they are
when you should use them

Timeline
1980s: Geometry Engines
1990s: Consumer GPUs; out-of-order superscalars
2000s: General-purpose GPUs; multicore CPUs; Cell BE (PlayStation 3); lots of specialized accelerators in phones

The Graphics Processing Unit (1980s)
SGI Geometry Engine
Implemented the Geometry Pipeline
Hardwired logic
Embarrassingly parallel: O(pixels)
Large number of logic elements
High memory bandwidth
[Figure from Kaufman et al. (2009)]

GPU 2.0 (circa 2004)
Like CPUs, GPUs benefited from Moore's Law
Evolved from fixed-function hardwired logic to flexible, programmable ALUs
Around 2004, GPUs were programmable enough to do some non-graphics computations
Severely limited by the graphics programming model (shader programming)
In 2006, GPUs became fully programmable
GPGPU: General-Purpose GPU
NVIDIA releases the CUDA language to write non-graphics programs that will run on GPUs

[Figure: peak FLOP/s of CPUs vs GPUs over time, from the NVIDIA CUDA C Programming Guide]

[Figure: memory bandwidth of CPUs vs GPUs over time, from the NVIDIA CUDA C Programming Guide]

GPGPU Today
GPUs are widely deployed as accelerators
Intel paper: the "10x vs 100x" myth
GPUs so successful that other accelerators are dead:
Sony/IBM Cell BE
Clearspeed RSX
Kepler K40 GPUs from NVIDIA have a peak performance of 4 TFlops
CM-5, the #1 system in 1993, was 60 GFlops (Linpack)
ASCI White (#1 in 2001) was 4.9 TFlops (Linpack)
[Pictures of Titan and Tianhe-1A from the Top500 website]

Accelerator Programming Models
CPUs have always depended on co-processors:
I/O co-processors to handle slow I/O
Math co-processors to speed up computation
H.264 co-processors to play video (phones)
DSPs to handle audio (phones)
Many have been transparent:
drop in the co-processor and everything sped up
Or used a function-based model:
call a function and it is sped up (e.g. "decode video")
The GPU is not a transparent accelerator for general-purpose computations:
only code using a graphics API (e.g. OpenGL) is sped up transparently
Code must be rewritten to target GPUs

Using a GPU
You must retarget code for the GPU:
rewrite, recompile, translate, etc.

Outline
Introduction to Accelerators
GPU Architectures
GPU Programming Models

The Two (Three?) Kinds of GPUs
Type 1: Discrete GPUs
More computational power
More memory bandwidth
Separate memory
[Image: NVIDIA]

The Two (Three?) Kinds of GPUs #2
Type 2: Integrated GPUs
Share memory with the processor
Share bandwidth with the processor
Consume less power
Can participate in cache coherence
[Image: Intel]

The NVIDIA Kepler
[Block diagram from the NVIDIA Kepler GK110 Whitepaper]

Using a Discrete GPU
You must retarget code for the GPU:
rewrite, recompile, translate, etc.
Working set must fit in GPU RAM
You must copy data to/from GPU RAM
"You": programmer, compiler, runtime, OS, etc.
Some recent hardware can do this for you (it's slow)
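
For concreteness, a minimal sketch of what retargeting plus the copy-in/copy-out discipline looks like in CUDA (a hypothetical vector add; error checking omitted for brevity):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void vadd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;                                /* working set must fit in GPU RAM */
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);  /* copy data to GPU RAM */
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    vadd<<<(n + 255) / 256, 256>>>(da, db, dc, n);      /* run the retargeted code */

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);  /* copy results back */
    printf("hc[0] = %f\n", hc[0]);                      /* expect 3.0 */

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}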

NVIDIA Kepler SMX (i.e. CPU core equivalent)

NVIDIA Kepler SMX Details
2-wide in-order
4-wide SMT
2048 threads per core (64 warps)
15 cores
Each thread runs the same code (hence SIMT; see the sketch below)
65536 32-bit registers (256 KB)
A thread can use up to 255 of these
Partitioned among threads (not shared!)
192 ALUs
64 double-precision units
32 load/store units
32 special function units
64 KB L1/shared cache
Shared cache is a software-managed cache
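
A minimal sketch of the SIMT model (hypothetical kernel): every thread executes the same function body, distinguished only by the built-in block/thread coordinates.

__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* unique global index */
    if (i < n)              /* warp members that fail this test sit idle */
        data[i] *= factor;  /* while their warp-mates execute (divergence) */
}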

CPU vs GPU

Parameter                CPU                       GPU
Clock speed              > 1 GHz                   700 MHz
RAM                      GB to TB                  12 GB (max)
Memory B/W               60 GB/s                   > 300 GB/s
Peak FP                  < 1 TFlop                 > 1 TFlop
Concurrent threads       O(10)                     O(1000) [O(10000)]
LLC cache size           > 100 MB (L3) [eDRAM]     < 2 MB (L2)
                         O(10 MB) [traditional]
Cache size per thread    O(1 MB)                   O(10 bytes)
Software-managed cache   None                      48 KB/SMX
Type                     OOO superscalar           2-way in-order superscalar

Using a GPU
You must retarget code for the GPU:
rewrite, recompile, translate, etc.
Working set must fit in GPU RAM
You must copy data to/from GPU RAM
"You": programmer, compiler, runtime, OS, etc.
Some recent hardware can do this for you
Data accesses should be streaming
Or use the scratchpad as a user-managed cache (sketch below)
Lots of parallelism preferred (throughput, not latency)
SIMD-style parallelism best suited
High arithmetic intensity (FLOPs/byte) preferred
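
A sketch of using shared memory as a user-managed scratchpad cache: each block stages a tile of the input on-chip once, then every thread reads its neighbors from fast shared memory instead of issuing three DRAM loads. A hypothetical 3-point average; assumes n is a multiple of TILE and the block size equals TILE.

#define TILE 256   /* must equal the block size used at launch */

__global__ void avg3(const float *in, float *out, int n) {
    __shared__ float tile[TILE + 2];                 /* tile plus halo cells */
    int g = blockIdx.x * blockDim.x + threadIdx.x;   /* global index */
    int l = threadIdx.x + 1;                         /* local index, skipping halo */

    tile[l] = in[g];                                 /* one streaming load per thread */
    if (threadIdx.x == 0)
        tile[0] = (g > 0) ? in[g - 1] : 0.0f;        /* left halo */
    if (threadIdx.x == TILE - 1)
        tile[TILE + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;  /* right halo */
    __syncthreads();                                 /* wait until the tile is full */

    out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
}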

Showcase GPU Applications
Image processing
Graphics rendering
Matrix multiply (sketch below)
FFT
See "Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU" by V.W. Lee et al. for more examples and a comparison of CPU and GPU.
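
A naive matrix-multiply kernel (a sketch, not the tuned versions the paper evaluates) shows why matrix multiply suits GPUs: O(n^3) arithmetic over O(n^2) data gives high arithmetic intensity.

/* C = A * B for n x n row-major matrices; one thread per output element */
__global__ void matmul(const float *A, const float *B, float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; k++)
            acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }
}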

Outline
Introduction to Accelerators
GPU Architectures
GPU Programming Models

Hierarchy of GPU Programming Models

Model                    GPU                       CPU Equivalent
Vectorizing Compiler     PGI CUDA Fortran          gcc, icc, etc.
Drop-in Libraries        CUBLAS                    ATLAS
Directive-driven         OpenACC, OpenMP-to-CUDA   OpenMP
High-level languages     pycuda                    python
Mid-level languages      OpenCL, CUDA              pthreads + C/C++
Low-level languages      PTX, Shader               -
Bare-metal               SASS                      Assembly/Machine code

Drop-in Libraries
Drop-in replacements for popular CPU libraries; examples from NVIDIA:
CUBLAS/NVBLAS for BLAS (e.g. ATLAS)
CUFFT for FFTW
MAGMA for LAPACK and BLAS
These libraries may still expect you to manage data transfers manually
Libraries may support multiple accelerators (GPU + CPU + Xeon Phi)
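
A minimal sketch of calling a drop-in GPU BLAS routine (cublasSaxpy from CUBLAS). Managed memory is used here to sidestep the manual transfers these libraries may otherwise expect you to handle; compile with nvcc and link -lcublas.

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    const int n = 1024;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));   /* runtime migrates this between CPU and GPU */
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 3.0f;
    cublasSaxpy(handle, n, &alpha, x, 1, y, 1); /* y = alpha*x + y, on the GPU */
    cudaDeviceSynchronize();                    /* wait before touching y on the host */

    printf("y[0] = %f\n", y[0]);                /* expect 5.0 */
    cublasDestroy(handle);
    cudaFree(x); cudaFree(y);
    return 0;
}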

GPU Libraries
NVIDIA Thrust
Like the C++ STL, but executes on the GPU
Modern GPU
At first glance: high-performance library routines for sorting, searching, reductions, etc.
A deeper look: specific hard problems tackled in a different style
NVIDIA CUB
Low-level primitives for use in CUDA kernels
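
A small sketch of the Thrust style: STL-like calls that run on the GPU.

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdio>

int main() {
    thrust::device_vector<int> d(4);   // lives in GPU memory
    d[0] = 3; d[1] = 1; d[2] = 4; d[3] = 1;

    thrust::sort(d.begin(), d.end());              // parallel sort on the GPU
    int sum = thrust::reduce(d.begin(), d.end());  // parallel reduction on the GPU

    std::printf("sum = %d\n", sum);    // expect 9
    return 0;
}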

Directive-Driven Programming
OpenACC is a new standard for offloading parallel work to an accelerator
Currently supported only by the PGI Accelerator compiler
gcc 5.0 support is ongoing
OpenMPC, a research compiler, can compile OpenMP code + extra directives to CUDA
OpenMP 4.0 also supports offload to accelerators
Not for GPUs yet

#include <stdio.h>

int main(void) {
    double pi = 0.0;
    long i;
    const long n = 1000000;  /* assumed problem size; n was not defined in the original listing */

    #pragma acc parallel loop reduction(+:pi)
    for (i = 0; i < n; i++) {
        double t = (double)((i + 0.5) / n);
        pi += 4.0 / (1.0 + t * t);
    }
    printf("pi=%16.15f\n", pi / n);
    return 0;
}

Python-based Tools (pycuda)

import pycuda.autoinit
import pycuda.driver as drv
import numpy
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)

multiply_them(
    drv.Out(dest), drv.In(a), drv.In(b),
    block=(400, 1, 1), grid=(1, 1))

print(dest - a * b)

OpenCL
C99-based dialect for programming heterogeneous systems
Originally based on CUDA
nomenclature is different
Supported by more than just GPUs:
Xeon Phi, FPGAs, CPUs, etc.
Source code is portable (somewhat)
Performance may not be!
Poorly supported by NVIDIA

CUDA
Compute Unified Device Architecture
First language to allow general-purpose programming for GPUs
preceded by shader languages
Promoted by NVIDIA for their GPUs
Not supported by any other accelerator
though commercial CUDA-to-x86/64 compilers exist
We will focus on CUDA programs

CUDA Architecture
From 10,000 feet, CUDA is like pthreads
CUDA language: a C++ dialect
Host code (CPU) and GPU code live in the same file
Special language extensions mark GPU code
CUDA Runtime API
Manages the runtime GPU environment
Allocation of memory, data transfers, synchronization with the GPU, etc.
Usually invoked by host code
CUDA Driver API
Lower-level API that the CUDA Runtime API is built upon

CUDA Limitations
No standard library for GPU functions
No parallel data structures
No synchronization primitives (mutexes, semaphores, queues, etc.)
you can roll your own
only atomic*() functions are provided
Toolchain not as mature as the CPU toolchain
Felt intensely in performance debugging
It's only been a decade :)
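
A sketch of rolling your own with atomic*(): a hypothetical histogram kernel where many threads may increment the same bin concurrently, so the read-modify-write must be atomic (bins assumed to point at 256 zero-initialized counters).

__global__ void histogram(const unsigned char *data, int n, unsigned int *bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);  /* safe concurrent increment */
}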

Conclusions
GPUs are very interesting parallel machines
They're not going away
Xeon Phi might pose a formidable challenge
They're here and now
Your laptop probably already contains one
Your phone definitely has one