Tesla GPU Computing A Revolution in High Performance Computing

1 Tesla GPU Computing: A Revolution in High Performance Computing. Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley)

2 Agenda
Tesla GPU Computing: What is GPU Computing?; Introduction to Tesla Product Line
CUDA: Review of CUDA Architecture; Programming & Memory Models; Programming Environment (incl. profiling and debugging tools)
Fermi: Next Generation Architecture
Getting Started: Resources

3 Tesla GPU Computing INTRODUCTION TO TESLA

4 Parallel Computing on All GPUs: 100+ million CUDA GPUs deployed. GeForce: Entertainment. Tesla: High-Performance Computing. Quadro: Design & Creation.

5 Tesla GPU Computing Products
SuperMicro 1U GPU SuperServer: 2 Tesla GPUs; single precision 1.87 Teraflops; double precision 156 Gigaflops; memory 8 GB (4 GB / GPU)
Tesla S1070 1U System: 4 Tesla GPUs; single precision 4.14 Teraflops; double precision 346 Gigaflops; memory 16 GB (4 GB / GPU)
Tesla C1060 Computing Board: 1 Tesla GPU; single precision 933 Gigaflops; double precision 78 Gigaflops; memory 4 GB
Tesla Personal Supercomputer: 4 Tesla GPUs; single precision 3.7 Teraflops; double precision 312 Gigaflops; memory 16 GB (4 GB / GPU)

6 Tesla GPU Computing Products: Fermi
Tesla S2050 1U System: 4 Tesla GPUs; memory 12 GB (3 GB / GPU)
Tesla S2070 1U System: 4 Tesla GPUs; memory 24 GB (6 GB / GPU)
Tesla C2050 Computing Board: 1 Tesla GPU; memory 3 GB
Tesla C2070 Computing Board: 1 Tesla GPU; memory 6 GB
Double precision performance: given in Teraflops for the 1U systems and in Gigaflops for the computing boards

7 CUDA REVIEW OF CUDA ARCHITECTURE

8 CUDA Parallel Computing Architecture: A parallel computing architecture and programming model. Includes a CUDA C compiler and support for OpenCL and DirectCompute. Architected to natively support multiple computational interfaces (standard languages and APIs).

9 CUDA Parallel Computing Architecture: CUDA defines a programming model, a memory model and an execution model. CUDA uses the GPU, but is for general-purpose computing. It facilitates heterogeneous computing: CPU + GPU. CUDA is scalable: it scales to run on 100s of cores and 1000s of parallel threads.

10 CUDA PROGRAMMING ENVIRONMENT

11 CUDA APIs: The API allows the host to manage the devices: allocate memory and transfer data, launch kernels. CUDA C Runtime API: high level of abstraction, start here! CUDA C Driver API: more control, more verbose. OpenCL: similar to the CUDA C Driver API.

12 CUDA C and OpenCL: OpenCL is the entry point for developers who want a low-level API. CUDA C is the entry point for developers who prefer high-level C. Both share the back-end compiler and optimization technology.

13 Windows: Visual Studio. Separate file types: .c/.cpp for host code, .cu for device/mixed code. Compilation rules: cuda.rules. Syntax highlighting. Intellisense. Forthcoming integrated debugger and profiler: Nexus.

14 Linux. Separate file types: .c/.cpp for host code, .cu for device/mixed code. Typically makefile driven. cuda-gdb for debugging. CUDA Visual Profiler.

15 Introduction to CUDA Profiling Tools November 2009

16 CUDA Toolchain Stack (top to bottom): CUDA Tools; Compiled Apps / SDK Samples / Math libs; CUDA C Runtime/Driver APIs; CUDA driver; NVIDIA GPU.

17 CUDA Visual Profiler - Overview: Performance analysis tool for fine-tuning CUDA applications. Supported on Linux/Windows/Mac platforms (included with the CUDA Toolkit). Simple GUI: launch a CUDA application and select the profiling data. Collects profile data for all kernels and memory transfers. Provides tools to help analyze the profiling data.

18 CUDA Visual Profiler Kernel Profiler Data

19 CUDA Visual Profiler - Memory Transfers: Memory transfer type (synchronous/asynchronous); direction (host to device etc.); size; stream ID.

20 CUDA Visual Profiler - Data Analysis: Compare data from multiple sessions. Selection of plots to visualize the counters and timelines. The Visual Profiler is a front-end for the low-level profiler, which can be command-line driven. Data is stored as CSV.

21 CUDA Debugger: cuda-gdb. Builds on GDB, adding extensions to support CUDA. Supported on Linux (32-bit or 64-bit) platforms. Seamless debugging of both host and device code. Breakpoint on any symbol. Single-step a warp. Access all variables: local, global, shared or constant. NEW (CUDA 3.0 beta): memory boundary checks!

22 CUDA Debugger EMACS

23 CUDA Debugger DDD

24 Further Information: The CUDA Visual Profiler is installed with the CUDA Toolkit; see $CUDA_INSTALL_PATH/cudaprof/doc. cuda-gdb is installed with the CUDA Toolkit on Linux. Documentation available online: Select Downloads, Documentation

25 NVIDIA Nexus IDE: The industry's first IDE for massively parallel applications. Accelerates co-processing (CPU + GPU) application development. Complete Visual Studio-integrated development environment.

26 Nexus Debugger Beta: Supports debugging of CUDA C and HLSL source code transparently inside Visual Studio. Source breakpoints: break anywhere, and use hardware-evaluated conditionals. Memory inspection: directly view GPU memory using the Visual Studio Memory Window. Data breakpoints: break on writes to an arbitrary memory location. Memory Checker: find out-of-bounds memory accesses.

27 Nexus Analyzer Beta: Supports the trace and profiling of your GPU Computing application. Trace: see activities and events across your CPU and GPU on a single, correlated timeline, including CUDA C, DX10, OpenGL and Cg API calls; GPU <-> Host memory transfers; GPU workload executions; CPU core, thread and process events; and custom user events (mark custom events or time ranges using a C API). Profile: gather and analyze kernel-level performance information, including hardware performance counters.

28 NVIDIA Nexus IDE - Debugging

29 NVIDIA Nexus IDE - Profiling

30 New features of CUDA 3.0

31 New features of CUDA 2.2/2.3: Zero-copy: map CPU memory into the GPU address space. 2D texturing from pitch-linear memory. cuda-gdb (Linux): gdb-like debugging of kernels. fp16 <-> fp32 conversion intrinsics: reduce memory bandwidth. PTX just-in-time compilation. Use SLI-paired GPUs for Compute.

32 New features of CUDA 3.0: Next generation of the CUDA API, for Tesla and Fermi. CUDA Driver / Runtime interoperability: allows applications using the CUDA C Driver API to use libraries implemented using the CUDA C Runtime. More capable cuda-gdb (memory bounds check, driver API). C++ class inheritance and template inheritance support. OpenGL texture interoperation: share textures as CUDA arrays between OpenGL and CUDA. CUDA Toolkit libraries are now versioned. CUDA 3.0 beta: become a registered developer now!

33 Fermi NEXT GENERATION ARCHITECTURE

34 Introducing the Fermi Architecture: 3 billion transistors; 512 cores; DP performance 50% of SP; ECC; L1 and L2 caches; GDDR5 memory; up to 1 Terabyte of GPU memory; concurrent kernels; C++.

35 Fermi SM Architecture: 32 CUDA cores per SM (512 total). Double precision at 50% of single precision (8x over GT200). Dual thread scheduler. 64 KB of RAM for shared memory and L1 cache (configurable).

36 CUDA Core Architecture: New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs. Fused multiply-add (FMA) instruction for both single and double precision. Newly designed integer ALU optimized for 64-bit and extended precision operations.
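What a fused multiply-add changes numerically can be shown on the host. The following is a minimal sketch in plain C++ (not CUDA device code), using std::fma from <cmath>. The function names and the input values are illustrative assumptions: the inputs are chosen so that the exact product lies halfway between two doubles and is lost when the product is rounded before the add.

```cpp
#include <cmath>

// Multiply, round the product to double, then add: two roundings.
// The volatile store keeps a compiler from contracting this into an FMA.
double mul_then_add(double a, double b, double c) {
    volatile double product = a * b;
    return product + c;
}

// Fused multiply-add: a*b + c is formed exactly and rounded once.
double fused(double a, double b, double c) {
    return std::fma(a, b, c);
}

// Illustrative inputs: (1 + 2^-27) * (1 - 2^-27) = 1 - 2^-54 exactly,
// which is not representable as a double and rounds to 1.0.
const double kA = 1.0 + std::ldexp(1.0, -27);
const double kB = 1.0 - std::ldexp(1.0, -27);
const double kC = -1.0;
```

With these inputs the separate multiply and add return 0.0, while the fused version keeps the low-order term and returns -2^-54.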

37 Cached Memory Hierarchy (Parallel DataCache): First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory. L1 cache per SM (per 32 cores): improves bandwidth and reduces latency. Unified L2 cache (768 KB): fast, coherent data sharing across all cores in the GPU.

38 Larger, Faster Memory Interface: GDDR5 memory interface, 2x the speed of GDDR3. Up to 1 Terabyte of memory attached to the GPU. Operate on large data sets.

39 ECC: ECC protection for DRAM (ECC supported for GDDR5 memory) and for all major internal memories (register file, shared memory, L1 cache, L2 cache). Detects 2-bit errors and corrects 1-bit errors (per word).
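"Correct 1-bit errors, detect 2-bit errors" is the behaviour of a SECDED code. The slides do not say which code or word size the hardware uses, so the sketch below is only a textbook illustration in host C++: an extended Hamming (8,4) code protecting 4 data bits. All names (ecc_encode, ecc_decode, EccStatus) are hypothetical.

```cpp
#include <cstdint>

enum class EccStatus { Ok, Corrected, Uncorrectable };

struct Decoded {
    std::uint8_t data;   // recovered 4-bit payload (0 if Uncorrectable)
    EccStatus status;
};

// XOR of all eight bits of w.
inline int parity8(std::uint8_t w) {
    w ^= w >> 4; w ^= w >> 2; w ^= w >> 1;
    return w & 1;
}

// Bits 1..7 are Hamming(7,4) positions 1..7 (check bits at 1, 2, 4);
// bit 0 is an overall parity bit that makes the total parity even.
std::uint8_t ecc_encode(std::uint8_t data) {
    int d0 = data & 1, d1 = (data >> 1) & 1, d2 = (data >> 2) & 1, d3 = (data >> 3) & 1;
    int p1 = d0 ^ d1 ^ d3;   // covers positions 3, 5, 7
    int p2 = d0 ^ d2 ^ d3;   // covers positions 3, 6, 7
    int p4 = d1 ^ d2 ^ d3;   // covers positions 5, 6, 7
    std::uint8_t w = static_cast<std::uint8_t>(
        (p1 << 1) | (p2 << 2) | (d0 << 3) | (p4 << 4) | (d1 << 5) | (d2 << 6) | (d3 << 7));
    return static_cast<std::uint8_t>(w | parity8(w));
}

// Pull the data bits back out of positions 3, 5, 6, 7.
inline std::uint8_t ecc_extract(std::uint8_t w) {
    return static_cast<std::uint8_t>(
        ((w >> 3) & 1) | (((w >> 5) & 1) << 1) | (((w >> 6) & 1) << 2) | (((w >> 7) & 1) << 3));
}

Decoded ecc_decode(std::uint8_t w) {
    int syndrome = 0;                      // XOR of the positions of all set bits
    for (int pos = 1; pos <= 7; ++pos)
        if ((w >> pos) & 1) syndrome ^= pos;
    int overall = parity8(w);              // 0 while total parity is still even

    if (syndrome == 0 && overall == 0) return {ecc_extract(w), EccStatus::Ok};
    if (overall == 1) {
        // Odd parity: treat as a single flip. The syndrome names the bit;
        // syndrome 0 means the overall parity bit itself was hit.
        w = static_cast<std::uint8_t>(w ^ (1u << syndrome));
        return {ecc_extract(w), EccStatus::Corrected};
    }
    // Even parity but non-zero syndrome: two bits flipped, cannot be repaired.
    return {0, EccStatus::Uncorrectable};
}
```

For example, flipping any one bit of ecc_encode(0xB) still decodes to 0xB with status Corrected, while flipping two bits is reported as Uncorrectable instead of silently returning wrong data.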

40 GigaThread Hardware Thread Scheduler: Hierarchically manages thousands of simultaneously active threads. 10x faster application context switching. Concurrent kernel execution.

41 GigaThread Hardware Thread Scheduler Concurrent Kernel Execution + Faster Context Switch Serial Kernel Execution Parallel Kernel Execution

42 GigaThread Streaming Data Transfer Engine: Dual DMA engines. Simultaneous CPU -> GPU and GPU -> CPU data transfer, fully overlapped with CPU and GPU processing time. (Figure: activity snapshot.)

43 Enhanced Software Support: Full C++ support: virtual functions, try/catch hardware support. System call support: support for pipes, semaphores, printf, etc. Unified 64-bit memory addressing.
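The benefit of overlapping both transfer directions with compute can be estimated with a simple pipeline model. This is a back-of-the-envelope sketch in host C++, not a measurement: it assumes the work is split into equal chunks and that upload, kernel and download each run on their own engine without contention. The struct, the function names and any timings passed in are hypothetical.

```cpp
#include <algorithm>

// Per-chunk stage times in milliseconds (values supplied by the caller).
struct ChunkTimes {
    double h2d;     // host -> device copy
    double kernel;  // GPU compute
    double d2h;     // device -> host copy
};

// No overlap: every chunk runs upload, compute and download back to back.
double serial_time(const ChunkTimes& t, int chunks) {
    return chunks * (t.h2d + t.kernel + t.d2h);
}

// Ideal overlap: the three stages form a pipeline, so after the first
// chunk has passed through, one chunk completes per slowest-stage time.
double pipelined_time(const ChunkTimes& t, int chunks) {
    if (chunks <= 0) return 0.0;
    double slowest = std::max({t.h2d, t.kernel, t.d2h});
    return (t.h2d + t.kernel + t.d2h) + (chunks - 1) * slowest;
}
```

With four chunks of 1 ms upload, 2 ms kernel and 1 ms download, the model gives 16 ms without overlap and 10 ms with full overlap; real applications will land somewhere in between.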

44 Performance tips

45 General Tips
SP is faster than DP: use the float postfix for constants (0.0f instead of 0.0).
For unstructured grids, instead of atomics or coloring, consider gather instead of scatter (one thread per element that gathers "contributions").
Page-locked memory: use cudaHostAlloc() as often as possible.
Consider zero-copy (GPU-mapped host memory): it might be faster than async memcpy() in streams, despite not using the DMA engines (due to memcpy stalls at internal command queues).
More advice is in the Best Practices Guide. If the forums don't help, email me.

46 Tips for Fermi: Make the algorithm's shared memory size flexible (16 KB / 48 KB). Be careful about register usage, which is still limited (but local memory spilling gets faster due to caching). Think about which global memory accesses are read-only in given kernels (not necessarily constant).
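The gather-versus-scatter tip can be illustrated with a small host-side C++ sketch. The mesh representation is a hypothetical one (edges that each contribute a value to their two end nodes), and the loops are serial; on the GPU, the outer loop of each function would become the thread index.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical connectivity: each edge adds its value to both end nodes.
struct Edge { int a; int b; float value; };

// Scatter: iterate over inputs (edges) and write into shared outputs.
// Parallelized over edges, the += would need atomics or a coloring scheme.
std::vector<float> scatter_sum(const std::vector<Edge>& edges, int num_nodes) {
    std::vector<float> node(num_nodes, 0.0f);
    for (const Edge& e : edges) {
        node[e.a] += e.value;
        node[e.b] += e.value;
    }
    return node;
}

// Precompute, per node, the indices of its incident edges.
std::vector<std::vector<int>> build_incidence(const std::vector<Edge>& edges,
                                              int num_nodes) {
    std::vector<std::vector<int>> inc(num_nodes);
    for (std::size_t i = 0; i < edges.size(); ++i) {
        inc[edges[i].a].push_back(static_cast<int>(i));
        inc[edges[i].b].push_back(static_cast<int>(i));
    }
    return inc;
}

// Gather: each output is produced by one independent iteration
// (one thread in a kernel) that only reads its contributions.
std::vector<float> gather_sum(const std::vector<Edge>& edges,
                              const std::vector<std::vector<int>>& inc) {
    std::vector<float> node(inc.size(), 0.0f);
    for (std::size_t n = 0; n < inc.size(); ++n) {
        float sum = 0.0f;   // 0.0f, not 0.0: stay in single precision
        for (int i : inc[n]) sum += edges[i].value;
        node[n] = sum;
    }
    return node;
}
```

The gather form trades an extra incidence table for write-conflict-free output: every node is written exactly once, by exactly one iteration.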
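One way to keep an algorithm's shared memory footprint flexible is to derive the tile size from the available budget instead of hard-coding it. The helper below is a hypothetical host-side C++ sketch: it assumes a square tile and a caller-chosen granularity. In a real CUDA program the budget would be queried at run time (for example from the sharedMemPerBlock field filled in by cudaGetDeviceProperties) rather than passed as a constant.

```cpp
#include <cstddef>

// Largest square tile edge (a multiple of `granularity`) whose elements
// fit into `shared_bytes` of shared memory. Returns 0 if nothing fits.
std::size_t tile_edge(std::size_t shared_bytes, std::size_t elem_bytes,
                      std::size_t granularity = 16) {
    if (granularity == 0 || elem_bytes == 0) return 0;
    std::size_t edge = 0;
    // Grow the edge one granule at a time while the next size still fits.
    while ((edge + granularity) * (edge + granularity) * elem_bytes <= shared_bytes)
        edge += granularity;
    return edge;
}
```

For float tiles this yields an edge of 64 with a 16 KB budget and 96 with 48 KB. Filling shared memory completely with one block may reduce the number of resident blocks per SM, so the budget passed in is a tuning parameter rather than a target.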

47 Getting Started RESOURCES

48 Getting Started: CUDA Zone: introductory tutorials/webinars, forums. Documentation: Programming Guide, Best Practices Guide. Webinars. Examples: CUDA SDK.

49 Libraries
NVIDIA: CUBLAS: dense linear algebra (subset of the full BLAS suite); CUFFT: 1D/2D/3D, real and complex
Third party: NAG: numeric libraries, e.g. RNGs; CULAPACK/MAGMA
Open Source: Thrust: STL/Boost-style template library; CUDPP: data-parallel primitives (e.g. scan, sort and reduction); CUSP: sparse linear algebra and graph computation
Many more...

50 Tesla GPU Computing: Questions?
