Tesla GPU Computing: A Revolution in High Performance Computing. Mark Harris, NVIDIA

Agenda: What is GPU Computing?; Introduction to Tesla; CUDA Architecture; Programming & Memory Models; Programming Environment; Next Generation Architecture (Fermi); Getting Started & Resources

Tesla GPU Computing WHAT IS GPU COMPUTING?

What is GPU Computing? Computing with CPU + GPU: heterogeneous computing, pairing a multi-core CPU (4 cores) with a many-core GPU.

Parallel Scaling. [Charts: peak performance (GFLOP/s) and peak memory bandwidth (GB/s), NVIDIA GPU vs. x86 CPU, 2003-2009. GPU data points: Tesla G80, Tesla T10; CPU data points: Harpertown 3 GHz, Nehalem 3 GHz.]

Low Latency or High Throughput? CPU: optimized for low-latency access to cached data sets; control logic for out-of-order and speculative execution. GPU: optimized for data-parallel, throughput computation; architecture tolerant of memory latency; more transistors dedicated to computation.

Processing Flow (across the PCI bus): 1. Copy input data from CPU memory to GPU memory. 2. Load the GPU program and execute, caching data on chip for performance. 3. Copy results from GPU memory back to CPU memory.
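A minimal sketch of this three-step flow using the CUDA C runtime API (the kernel, sizes, and names are illustrative, not from the slides):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Illustrative kernel: double every element
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void) {
    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    float h_data[1024];
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc(&d_data, bytes);

    // 1. Copy input data from CPU memory to GPU memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // 2. Load the GPU program and execute
    scale<<<(n + 255) / 256, 256>>>(d_data, n);

    // 3. Copy results from GPU memory back to CPU memory
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    printf("h_data[2] = %f\n", h_data[2]);  // prints 4.0
    return 0;
}
```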

Tesla GPU Computing INTRODUCTION TO TESLA

Parallel Computing on All GPUs: 100+ million CUDA GPUs deployed. GeForce: entertainment. Tesla: high-performance computing. Quadro: design & creation.

Many-Core High Performance Computing. Each core has: floating point & integer unit, logic unit, move/compare unit, branch unit. The 10-series GPU has 240 cores, 1.4 billion transistors, and 1 teraflop of processing power. Cores are managed by a thread manager that can spawn and manage over 30,000 threads, with zero-overhead thread switching.

Tesla GPU Computing Products:
- SuperMicro 1U GPU SuperServer: 2 Tesla GPUs; 1.87 Teraflops single precision; 156 Gigaflops double precision; 8 GB memory (4 GB / GPU)
- Tesla S1070 1U System: 4 Tesla GPUs; 4.14 Teraflops single precision; 346 Gigaflops double precision; 16 GB memory (4 GB / GPU)
- Tesla C1060 Computing Board: 1 Tesla GPU; 933 Gigaflops single precision; 78 Gigaflops double precision; 4 GB memory
- Tesla Personal Supercomputer: 4 Tesla GPUs; 3.7 Teraflops single precision; 312 Gigaflops double precision; 16 GB memory (4 GB / GPU)

Reported Speedups: 146X Medical Imaging (U of Utah); 36X Molecular Dynamics (U of Illinois); 18X Video Transcoding (Elemental Tech); 50X Matlab Computing (AccelerEyes); 100X Astrophysics (RIKEN); 149X Financial Simulation (Oxford); 47X Linear Algebra (Universidad Jaime); 20X 3D Ultrasound (Techniscan); 130X Quantum Chemistry (U of Illinois); 30X Gene Sequencing (U of Maryland)

CUDA ARCHITECTURE

CUDA Parallel Computing Architecture: a parallel computing architecture and programming model. Includes a CUDA C compiler plus support for OpenCL and DirectCompute. Architected to natively support multiple computational interfaces (standard languages and APIs).

CUDA Parallel Computing Architecture. CUDA defines: a programming model, a memory model, and an execution model. CUDA uses the GPU for general-purpose computing and facilitates heterogeneous computing: CPU + GPU. CUDA is scalable: code scales to run on 100s of cores and 1000s of parallel threads.

CUDA PROGRAMMING MODEL

CUDA Kernels. The parallel portion of an application executes as a kernel, run by the entire GPU across many threads. CUDA threads are lightweight, switch fast, and execute 1000s at a time. CPU (host): executes functions. GPU (device): executes kernels.

CUDA Kernels: Parallel Threads. A kernel is a function executed on the GPU by an array of threads, in parallel. All threads execute the same code but can take different paths: float x = input[threadID]; float y = func(x); output[threadID] = y; Each thread has an ID, used to select input/output data and make control decisions.
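Fleshing the slide's fragment out into a complete kernel; computing the global thread ID from block and thread indices is standard CUDA, while input, output, and func are the slide's placeholders (func is given an illustrative body here):

```cuda
__device__ float func(float x) { return x * x; }  // stand-in for the slide's func()

__global__ void apply(const float *input, float *output, int n) {
    // Each thread computes its own global ID and processes one element
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadID < n) {                  // threads may take different paths
        float x = input[threadID];
        float y = func(x);
        output[threadID] = y;
    }
}
```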

CUDA Kernels: Subdivide into Blocks. Threads are grouped into blocks, and blocks are grouped into a grid; a kernel is executed on the GPU as a grid of blocks of threads.

Communication Within a Block. Threads may need to cooperate on memory accesses and to share results. They cooperate through shared memory, which is accessible by all threads within a block. Restricting cooperation to within a block is what permits scalability: fast communication between N threads is not feasible when N is large.
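As an illustration of block-level cooperation, a standard block-sum reduction (not from the slides; assumes blockDim.x is a power of two no larger than 256):

```cuda
__global__ void block_sum(const float *in, float *out) {
    __shared__ float s[256];             // shared memory: visible to all threads in the block
    int t = threadIdx.x;
    s[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                     // wait until every thread has written its element

    // Tree reduction within the block, halving the active threads each step
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride) s[t] += s[t + stride];
        __syncthreads();
    }
    if (t == 0) out[blockIdx.x] = s[0];  // one partial sum per block
}
```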

Transparent Scalability. [Diagram: the same 12-block grid runs unchanged across GPUs of different sizes. On G84 the blocks execute two at a time; on G80, four at a time; on GT200, all 12 run concurrently, with spare multiprocessors left idle.]

CUDA Programming Model - Summary. A kernel executes as a grid of thread blocks; grids can be 1D or 2D. A block is a batch of threads that communicate through shared memory. Each block has a block ID, and each thread has a thread ID. [Diagram: Kernel 1 launched on a 1D grid of blocks 0-3; Kernel 2 on a 2D grid with blocks indexed (0,0)...(1,3).]
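Host-side launch configurations matching the diagram (a sketch; kernel names and bodies are illustrative):

```cuda
__global__ void kernel1(int *out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] = blockIdx.x;  // 1D block ID
}

__global__ void kernel2(int *out) {
    // In a 2D grid, each block has a 2D ID (blockIdx.x, blockIdx.y)
    int block = blockIdx.y * gridDim.x + blockIdx.x;
    out[block * blockDim.x + threadIdx.x] = block;
}

void launch(int *d_out) {
    kernel1<<<dim3(4), dim3(64)>>>(d_out);      // Kernel 1: 1D grid of 4 blocks
    kernel2<<<dim3(4, 2), dim3(64)>>>(d_out);   // Kernel 2: 2D grid, blocks (0,0)...(3,1)
}
```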

CUDA MEMORY MODEL

Memory Hierarchy. Each thread has private registers and per-thread local memory; each block of threads has shared memory; all blocks share global memory.
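In CUDA C the levels map onto declarations like these (a sketch; all names are illustrative):

```cuda
__device__ float g_table[128];           // global memory: visible to all blocks and kernels

__global__ void hierarchy_demo(float *out) {
    float r = 0.5f * threadIdx.x;        // scalar locals normally live in per-thread registers
    float spill[64];                     // large per-thread arrays may be placed in local memory
    __shared__ float tile[128];          // shared memory: one copy per block

    spill[threadIdx.x % 64] = r;
    tile[threadIdx.x % 128] = g_table[threadIdx.x % 128];
    __syncthreads();
    out[threadIdx.x] = tile[threadIdx.x % 128] + spill[threadIdx.x % 64];
}
```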

CUDA PROGRAMMING ENVIRONMENT

CUDA APIs. The API lets the host manage the devices: allocate memory, transfer data, and launch kernels. CUDA C Runtime API: high level of abstraction, start here! CUDA C Driver API: more control, more verbose. OpenCL: similar to the CUDA C Driver API.
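A runtime-API sketch of those host-side tasks, with the error checking the slides omit (the CHECK macro and kernel are illustrative, not official CUDA helpers):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Minimal error-checking macro (illustrative)
#define CHECK(call) do {                                  \
        cudaError_t err = (call);                         \
        if (err != cudaSuccess) {                         \
            fprintf(stderr, "CUDA error: %s\n",           \
                    cudaGetErrorString(err));             \
            return 1;                                     \
        }                                                 \
    } while (0)

__global__ void my_kernel(float *d) { d[threadIdx.x] = 0.0f; }

int main(void) {
    float *d_buf;
    CHECK(cudaMalloc(&d_buf, 256 * sizeof(float)));   // allocate device memory
    my_kernel<<<1, 256>>>(d_buf);                     // launch a kernel
    CHECK(cudaGetLastError());                        // catch launch failures
    CHECK(cudaFree(d_buf));
    return 0;
}
```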

CUDA C and OpenCL. OpenCL is the entry point for developers who want a low-level API; CUDA C is the entry point for developers who prefer high-level C. Both share the same back-end compiler and optimization technology.

Visual Studio. Separate file types: .c/.cpp for host code, .cu for device/mixed code. Compilation rules: cuda.rules. Syntax highlighting and IntelliSense. Integrated debugger and profiler: Nexus.

NVIDIA Nexus IDE. The industry's first IDE for massively parallel applications. Accelerates co-processing (CPU + GPU) application development. A complete Visual Studio-integrated development environment.

NVIDIA Nexus IDE - Debugging

NVIDIA Nexus IDE - Profiling

Linux. Separate file types: .c/.cpp for host code, .cu for device/mixed code. Builds are typically makefile driven. cuda-gdb for debugging; CUDA Visual Profiler for profiling.

Fermi NEXT GENERATION ARCHITECTURE

Introducing the Fermi Architecture: 3 billion transistors; 512 cores; double-precision performance at 50% of single precision; ECC; L1 and L2 caches; GDDR5 memory; up to 1 terabyte of GPU memory; concurrent kernels; C++.

Fermi SM Architecture: 32 CUDA cores per SM (512 total). Double precision at 50% of single-precision performance, an 8x improvement over GT200. Dual thread scheduler. 64 KB of on-chip RAM per SM, configurable between shared memory and L1 cache.
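The shared memory/L1 split is selectable per kernel through the runtime API; cudaFuncSetCacheConfig is a real Fermi-era call, while the kernel name here is illustrative:

```cuda
__global__ void my_kernel(float *d);   // illustrative kernel, assumed defined elsewhere

void configure(void) {
    // Prefer the larger shared-memory split for shared-memory-heavy kernels...
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);
    // ...or prefer a larger L1 for kernels dominated by irregular global loads:
    // cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);
}
```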

CUDA Core Architecture: implements the new IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs. Fused multiply-add (FMA) instruction for both single and double precision. Newly designed integer ALU optimized for 64-bit and extended-precision operations.

Cached Memory Hierarchy: the first GPU architecture to support a true cache hierarchy in combination with on-chip shared memory. An L1 cache per SM (per 32 cores) improves bandwidth and reduces latency; a unified 768 KB L2 cache provides fast, coherent data sharing across all cores in the GPU (the Parallel DataCache memory hierarchy).

Larger, Faster Memory Interface: a GDDR5 memory interface at 2x the speed of GDDR3, with up to 1 terabyte of memory attached to the GPU for operating on large data sets.

ECC: ECC protection for DRAM (supported for GDDR5 memory) and for all major internal memories: register file, shared memory, L1 cache, and L2 cache. Detects 2-bit errors and corrects 1-bit errors (per word).

GigaThread Hardware Thread Scheduler: hierarchically manages thousands of simultaneously active threads; 10x faster application context switching; concurrent kernel execution.

GigaThread Hardware Thread Scheduler: concurrent kernel execution plus faster context switching. [Diagram: serial kernel execution vs. parallel kernel execution.]
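On Fermi, kernels launched into different CUDA streams may execute concurrently; a minimal sketch (kernel names and bodies are illustrative):

```cuda
__global__ void kernelA(float *d) { d[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f; }
__global__ void kernelB(float *d) { d[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f; }

void run_concurrent(float *d_a, float *d_b) {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    kernelA<<<16, 256, 0, s1>>>(d_a);   // stream 1
    kernelB<<<16, 256, 0, s2>>>(d_b);   // stream 2: may overlap with kernelA on Fermi

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```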

GigaThread Streaming Data Transfer Engine: dual DMA engines enable simultaneous CPU-to-GPU and GPU-to-CPU data transfer, fully overlapped with CPU and GPU processing time. [Activity snapshot diagram.]
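Exploiting the overlap requires pinned host memory and asynchronous copies in separate streams; a two-chunk pipeline sketch (not from the slides; names and sizes are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void process(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main(void) {
    const int CHUNK = 1 << 20;
    float *h, *d;
    cudaMallocHost(&h, 2 * CHUNK * sizeof(float));   // pinned host memory, required for async copies
    cudaMalloc(&d, 2 * CHUNK * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int c = 0; c < 2; ++c) {
        // Copies in one stream overlap with the kernel in the other; dual DMA
        // engines also let H2D and D2H transfers proceed simultaneously.
        cudaMemcpyAsync(d + c * CHUNK, h + c * CHUNK, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, s[c]);
        process<<<(CHUNK + 255) / 256, 256, 0, s[c]>>>(d + c * CHUNK, CHUNK);
        cudaMemcpyAsync(h + c * CHUNK, d + c * CHUNK, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, s[c]);
    }
    cudaDeviceSynchronize();
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```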

Enhanced Software Support: full C++ support, including virtual functions and hardware support for try/catch. System call support: pipes, semaphores, printf, etc. Unified 64-bit memory addressing.
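Device-side printf, one of the system-call-style features listed, requires a Fermi-class GPU (compute capability 2.0); a minimal sketch:

```cuda
#include <cstdio>

__global__ void hello() {
    // printf executes on the device on Fermi-class GPUs
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void) {
    hello<<<2, 4>>>();
    cudaDeviceSynchronize();   // flush device-side printf output
    return 0;
}
```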

Tesla GPU Computing Products: Fermi
- Tesla S2050 1U System: 4 Tesla GPUs; 2.1-2.5 Teraflops double precision; 12 GB memory (3 GB / GPU)
- Tesla S2070 1U System: 4 Tesla GPUs; 2.1-2.5 Teraflops double precision; 24 GB memory (6 GB / GPU)
- Tesla C2050 Computing Board: 1 Tesla GPU; 520-630 Gigaflops double precision; 3 GB memory
- Tesla C2070 Computing Board: 1 Tesla GPU; 520-630 Gigaflops double precision; 6 GB memory

Getting Started RESOURCES

Getting Started: the CUDA Zone, www.nvidia.com/cuda. Introductory tutorials/webinars; forums; documentation (Programming Guide, Best Practices Guide); examples; the CUDA SDK.

Libraries. NVIDIA: CUBLAS (dense linear algebra, a subset of the full BLAS suite); CUFFT (1D/2D/3D real and complex FFTs). Third party: NAG numeric libraries (e.g. RNGs); CULAPACK/MAGMA. Open source: Thrust (an STL/Boost-style template library); CUDPP (data-parallel primitives, e.g. scan, sort, and reduction); CUSP (sparse linear algebra and graph computation). Many more...
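For instance, a few lines of Thrust replace the kernels, launches, and memory management seen earlier (an illustrative sketch):

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdio>

int main(void) {
    thrust::host_vector<int> h(1000);
    for (int i = 0; i < 1000; ++i) h[i] = 999 - i;

    thrust::device_vector<int> d = h;                 // copy to the GPU
    thrust::sort(d.begin(), d.end());                 // parallel sort on the device
    int sum = thrust::reduce(d.begin(), d.end(), 0);  // parallel reduction

    printf("sum = %d\n", sum);                        // prints 499500
    return 0;
}
```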

Tesla GPU Computing Questions?