
Technology for a better society. hetcomp.com

GPU Computing
J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik, E. Bjønnes
USIT Course Week, 16th November 2011

9:30 - 10:15   Introduction to GPU Computing
10:15 - 10:30  Break
10:30 - 11:00  CUDA Intermediate Example
11:00 - 11:30  Design, Test and Lifecycle

GPUs are everywhere
The highest-performing chip in all classes of computers

What is GPU Computing?
GPU = Graphics Processing Unit = video card
Delivers extreme floating-point performance: FLOPS/$, FLOPS/Watt, FLOPS/volume
Massively parallel, fine-grained parallelism
TOP500 supercomputer list: numbers 2, 4 and 5 use GPUs

Hardware characteristics (Q2 2011 workstation hardware)

                              i7-2600K        GeForce 580   Xeon X7560   Quadro 6000
                              (Sandy Bridge)  (Fermi GPU)                (Fermi GPU)
Number of cores               4               16            8            14
# of float arithmetic units   32              512           32           448
Clock frequency (GHz)         3.4             1.2           2.66         1.1
Single precision gigaflops    217             1581          144          985
Double:single performance     1:2             1:8           1:2          1:2
Gigaflops per watt            2.3             6.5           1.1          6.5
Gigaflops per $               0.68            3.2           0.048        0.20
Memory bandwidth (GiB/s)      21.3            192.4         26           160

What happened 2000-2010?
Increasing frequency hits several walls:
  Memory: expensive to build fast memory; remedy: caches
  Instruction-level parallelism: complex to identify
  Power density: proportional to the frequency cubed

How Parallelism Can Help
The power density of microprocessors is proportional to the cube of the clock frequency.

              Frequency   Power   Performance
Single core   100%        100%    100%
Multi core    85%         100%    170%
GPU           ~30%        100%    ~10x

GPU Programming Models
GPU hardware is parallel, complex, changing, and proprietary.
The device driver acts as an operating system: memory management, task scheduling, just-in-time compilation.
Various abstractions hide these details, and have been remarkably successful!

GPU Programming Models: Graphics (timeline 2000-2011)
OpenGL, DirectX, WebGL
  Native APIs, custom shader programs
  Usage: games, visualization, CAD, etc.
  Drives GPU design
  Actively maintained and developed

GPU Programming Models: Graphics and Specialized (timeline 2000-2011)
Graphics: OpenGL, DirectX, WebGL, DirectCompute
Specialized, first wave (examples: PeakStream, Brook, RapidMind):
  Various abstractions, automatic generation of shaders
  SIMD programming model
  Mostly died out
Specialized, current (CUDA, OpenCL, WebCL):
  Compute kernels written in C
  SIMD/SPMD programming model
  Explicit memory management
  Expose low-level features
  Very high performance

GPU Programming Models: Graphics, Specialized, and General (timeline 2000-2011)
Graphics: OpenGL, DirectX, WebGL, DirectCompute
Specialized: CUDA, OpenCL, WebCL
General:
  Language constructs (examples: C++ AMP, Java APARAPI): SPMD programming model, automatic memory management, generate code for various backends, not yet production ready
  Domain-specific libraries: matrix algebra, FFTs, RNG, image processing
  Generic libraries: reductions, sorting; some memory management
  Compiler pragmas (examples: HMPP, PGI): instrument existing code with pragmas, generate code for various backends, expensive

OpenCL vs CUDA
Two APIs for directly programming GPUs; both expose the same programming model (SPMD). OpenCL is a public standard.

                      OpenCL                          Nvidia CUDA
Owner                 Khronos Group                   Nvidia
Target platform       GPUs, CPUs, cell phones         Nvidia GPUs
Programming model     SPMD                            SPMD
Language              C                               C/C++ (templates, virtual functions)
Low-level HW access   Proprietary extensions          Full
HW fragmentation      Much                            Some
Tools                 Some                            Mature
Vendor support        Apple, AMD, Nvidia, Intel, ...  Nvidia
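
The two kernel languages are close enough that the same kernel translates almost line for line. As an illustration, here is a minimal CUDA kernel (a hypothetical sketch, not taken from the course material); the comments note the corresponding OpenCL constructs.

    // CUDA kernel scaling an array by a constant.
    // In OpenCL the same kernel would be declared __kernel, its pointer
    // arguments qualified with __global, and the index obtained with
    // get_global_id(0) instead of blockIdx/blockDim/threadIdx.
    __global__ void scale(float *data, float alpha, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // the grid may be larger than n
            data[i] = alpha * data[i];
    }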

Approaches to GPU Programming

Graphics API (OpenGL/WebGL, DirectX)
  Language: C(++), Fortran, .NET, Java, Python, Perl, Ruby, JavaScript
  Description: what GPUs were designed for
  Domain: graphics (games, visualization, CAD, etc.)

Matlab / Mathematica
  Language: Matlab / Mathematica
  Description: semi-automatic
  Domain: scientific

OpenMP-like pragmas (PGI Accelerator, HMPP, Cray)
  Language: C/C++/Fortran
  Description: easy porting of legacy applications
  Domain: scientific applications

GPU libraries
  Language: C/C++ (call from anything)
  Description: easy to integrate into existing apps, if the algorithm exists
  Domain: scientific, encoding/decoding

Dedicated languages (CUDA, OpenCL)
  Language: C(++) dialects (call from anything)
  Description: expose GPU features; hand-tuned algorithms, manual memory allocation
  Domain: scientific

Some available libraries
CUFFT                           Fast Fourier Transform
CUBLAS                          Dense linear algebra
CULA                            LAPACK interface
CUSPARSE                        Sparse linear algebra
CUSP                            Linear algebra, graph computations
CURAND                          Random number generation
NPP (Nvidia Perf. Primitives)   Image and signal processing
CUDA Video Decoder/Encoder      H.264/MPEG-2 video coding
THRUST                          STL-like algorithms
These libraries have various licenses.
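
To give a flavour of how such a library is used, here is a small hypothetical Thrust sketch (not from the course material): it fills a vector on the host, moves it to the GPU, sorts it there, and copies the result back.

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>
    #include <cstdlib>

    int main()
    {
        // Fill a host vector with random integers.
        thrust::host_vector<int> h(1 << 20);
        for (size_t i = 0; i < h.size(); ++i)
            h[i] = std::rand();

        // Assigning to a device_vector transfers the data to GPU memory.
        thrust::device_vector<int> d = h;

        // Sort on the GPU with an STL-like call.
        thrust::sort(d.begin(), d.end());

        // Copy the sorted data back to the host.
        thrust::copy(d.begin(), d.end(), h.begin());
        return 0;
    }

The kernel launches and memory transfers are hidden behind the STL-like interface, which is what makes libraries the easiest entry point.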

GPU Clusters
Each node has 1-4 GPUs and 1-4 multi-core CPUs
MPI-style parallelism between nodes
MPI-style parallelism between GPUs
MPI or thread parallelism between CPUs
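
A common pattern is one MPI rank per GPU. The hypothetical sketch below (assuming an equal number of ranks and GPUs per node) simply maps each rank to one of the node's devices; kernels are then launched per rank as usual and results exchanged with MPI.

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // Map this MPI rank to one of the GPUs on the node.
        // (Assumes ranks are distributed evenly across the nodes.)
        int deviceCount = 0;
        cudaGetDeviceCount(&deviceCount);
        cudaSetDevice(rank % deviceCount);

        printf("Rank %d uses GPU %d of %d\n", rank, rank % deviceCount, deviceCount);

        // ... launch kernels on the selected GPU and exchange results with MPI ...

        MPI_Finalize();
        return 0;
    }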

SPMD Programming Model
Host code runs on the CPU: memory allocation, memory transfers, scheduling of tasks, dependencies
Device code runs on the GPU: kernel functions, invoked over compute grids; a compute grid can be much larger than the number of cores
Written in C/C++-like languages (CUDA/OpenCL), with a separate compiler

GPU compute grids
[Figure: a compute grid of 3x2 blocks; each block contains a 4x3 arrangement of threads]
Execution is invoked by the CPU over a compute grid
The compute grid is subdivided into a set of blocks
Blocks contain a set of threads, which can access block-level shared memory
All threads in the compute grid run the same program, but with individual data and individual code flow
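
To make the host/device split and the compute grid concrete, here is a minimal hypothetical CUDA example (not from the slides): the host allocates memory, transfers data over PCIe, and launches a kernel over a one-dimensional grid in which each thread handles one element.

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    // Device code: one thread per element of the result.
    __global__ void add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index in the compute grid
        if (i < n)                                      // the grid may be larger than n
            c[i] = a[i] + b[i];
    }

    int main()
    {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        // Host code: allocate and initialize data on the CPU.
        float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
        for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        // Allocate GPU memory and transfer the input over the PCIe bus.
        float *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        // Launch the kernel over a compute grid of blocks of 256 threads each.
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        add<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);

        // Copy the result back and check one element.
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %f\n", h_c[0]);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h_a); free(h_b); free(h_c);
        return 0;
    }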

GPU Architecture

System overview
[Figure: system diagram showing the CPU and its RAM connected over the PCIe bus to one or more GPUs, each with its own RAM, alongside HDD and USB]
GPUs are on the PCIe bus
GPUs have their own memory
Some recent chips have embedded GPUs
Multi-GPU systems are common

GPU Architecture: NVIDIA Fermi
[Figure: Fermi multiprocessor block diagram showing cores (execution units), scheduler and dispatch units, the register file, and L1 cache]

Fermi Architecture
Streaming Multiprocessor (SM)
32 cores per SM
64 KB shared memory and L1 cache
Special function units
Double precision at half speed
Concurrent kernel execution
ECC support
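
The block-level shared memory mentioned above is exposed directly in CUDA. The kernel below is a hypothetical sketch (assuming a block size of 256 threads) that stages a tile of the input in shared memory so neighbouring threads can reuse each other's loads; it would be launched like the vector-add example above, e.g. smooth<<<blocks, 256>>>(d_in, d_out, n).

    // Hypothetical sketch: a three-point average where each block first loads
    // its tile (plus one halo element per side) into on-chip shared memory.
    __global__ void smooth(const float *in, float *out, int n)
    {
        __shared__ float tile[256 + 2];              // block size + halo, assumes blockDim.x == 256

        int i   = blockIdx.x * blockDim.x + threadIdx.x;
        int lid = threadIdx.x + 1;                   // local index, shifted past the left halo

        if (i < n)
            tile[lid] = in[i];
        if (threadIdx.x == 0 && i > 0)
            tile[0] = in[i - 1];                     // left halo element
        if (threadIdx.x == blockDim.x - 1 && i + 1 < n)
            tile[lid + 1] = in[i + 1];               // right halo element

        __syncthreads();                             // wait until the whole tile is loaded

        if (i > 0 && i + 1 < n)
            out[i] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
    }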

Recap: Compute Grids
[Figure: the compute grid of blocks and threads from before]
Execution is invoked by the CPU over a compute grid
The compute grid is subdivided into a set of blocks
Blocks contain a set of threads, which can access block-level shared memory
All threads in the compute grid run the same program, but with individual data and individual code flow

Challenges in CUDA/OpenCL Programming
Hard to learn: in our experience, half a year to master if motivated (and do you really care that much about performance after all?)
Hardware fragmentation: makes the build process more complex; driver/compiler version issues; low level of code reuse
Many different optimization strategies possible, memory access in particular
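
Memory access is the classic example: on these GPUs, neighbouring threads should touch neighbouring addresses (coalesced access). The hypothetical kernels below contrast a coalesced and a strided access pattern; both copy data, but the strided one spreads each warp's loads over many memory segments and wastes bandwidth.

    // Coalesced: consecutive threads read consecutive elements, so each
    // warp's loads combine into a few wide memory transactions.
    __global__ void copy_coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Strided: consecutive threads read elements 'stride' floats apart
    // (the input must hold at least n * stride elements). Same amount of
    // useful data, far more memory transactions.
    __global__ void copy_strided(const float *in, float *out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i * stride];
    }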

Conclusion

Conclusion
GPU computing is here now (and so is multi-core computing)
Widely deployed on supercomputers (2nd, 4th and 5th on the TOP500)
Easy to get started: libraries can be called from existing applications
Difficult to reach peak performance: requires intimate hardware knowledge
Easy to get some speedup, hard to reach optimum performance

Overview of resources
nvidia.com/cuda: programming guide, tutorials, forums
khronos.org/opencl/
gpgpu.org: links to papers/libraries

Questions?