GPU Lund Observatory


1 GPU Lund Observatory Gernot Ziegler, NVIDIA UK

2 HISTORY / INTRODUCTION

3 Parallel vs Sequential Architecture Evolution (diagram) High-performance computing architectures - ILLIAC IV, MasPar, Cray-1, Thinking Machines, Blue Gene - lead to today's many-core GPUs. Sequential architectures for database and operating-system workloads - DEC PDP-1, IBM System 360, Intel 4004, VAX, IBM POWER4 - lead to multi-core x86.

4 Recent History Specialised machines faded out (e.g. Cray) due to cost and economies of scale. Intel and AMD chips, designed for home/office use, gained performance from increasing clock frequencies, leading to commodity clusters. Meanwhile, computer gaming drives the Graphics Processing Unit (GPU): NVIDIA and ATI.

5 Present Clock frequency is no longer increasing: power consumption grows roughly as f², so multi-core now dominates.

6 GPU Computing CPU + GPU co-processing (diagram: a 4-core CPU alongside a many-core GPU).

7 Graphics Pipelines for Last 20 Years Processor per function: Vertex (T&L, which evolved into vertex shading), Triangle (triangle, point, line setup), Pixel (flat shading, texturing, eventually pixel shading), ROP (blending, Z-buffering, anti-aliasing), Memory (wider and faster over the years).

8 Previous Pipelined Architectures (diagrams) With separate vertex and pixel shader stages, part of the hardware is always idle under unbalanced workloads: a heavy geometry workload leaves the pixel shader idle (perf = 4), while a heavy pixel workload leaves the vertex shader idle (perf = 8).

9 Thread Processor: Unified Architecture Replaces the Pipeline Model The future of GPUs is programmable processing, so build the architecture around the processor. (diagram: the host feeds a data assembler; vertex, geometry and pixel thread issue units plus setup/raster/ZCull dispatch work onto a unified array of streaming processors (SP) with texture fetch (TF) units, backed by L1 and L2 caches and framebuffer (FB) partitions)

10 Low Latency or High Throughput? CPU: optimised for low-latency access to cached data sets; control logic for out-of-order and speculative execution (diagram: large control logic and cache, few ALUs, DRAM). GPU: optimised for data-parallel, throughput computation; an architecture tolerant of memory latency; more transistors dedicated to computation (diagram: many ALUs, small control and cache, DRAM).

11 Heterogeneous Computing Domains (diagram) A spectrum from instruction-level parallelism on the CPU (sequential computing, data fits in cache) to massive data parallelism and graphics on the GPU (parallel computing, larger data sets), spanning domains such as oil & gas, finance, medical, biophysics, numerics, audio, video and imaging.

12 GPU speedups across domains (figure, typical range 50x-150x): 146x medical imaging (U of Utah), 36x molecular dynamics (U of Illinois, Urbana), 18x video transcoding (Elemental Tech), 50x Matlab computing (AccelerEyes), 100x astrophysics (RIKEN), 149x financial simulation (Oxford), 47x linear algebra (Universidad Jaime), 20x 3D ultrasound (Techniscan), 130x quantum chemistry (U of Illinois, Urbana), 30x gene sequencing (U of Maryland).

13 Tesla, CUDA & PSC definitions CUDA Architecture: our enabling technology for GPU computing - the architecture of the GPU to support compute, plus C language extensions and a retargeter; usable with any 8-series or later GPU. Tesla: dedicated compute hardware - C1060 and S1070; Fermi: C2050 and S2070. PSC: Personal Super Computer - a desktop machine with at least 3 C1060s.

14 NVIDIA Tesla 20-Series (Fermi) Products

                               Data Center                                Workstation
                               M2050 / M2070 module   S2050 / S2070 1U    C2050 / C2070 board
GPUs                           1 Tesla GPU            4 Tesla GPUs        1 Tesla GPU
Single precision performance   1030 Gigaflops         4.12 Teraflops      1030 Gigaflops
Double precision performance   515 Gigaflops          2.06 Teraflops      515 Gigaflops
Memory (x2050)                 3 GB                   12 GB (3 GB/GPU)    3 GB
Memory (x2070)                 6 GB                   24 GB (6 GB/GPU)    6 GB

15 The Performance Gap Widens Further (chart: peak single-precision performance in GFlops/s, NVIDIA GPU vs. x86 CPU) Tesla 8-series, then Tesla 10-series (1 TF single precision, 4 GB memory), then Tesla 20-series (8x the double precision, ECC, L1 and L2 caches), against a 3 GHz Nehalem.

16 CUDA Parallel Computing Architecture GPU computing applications written in C, C++, Fortran, OpenCL, DirectCompute, Java and Python, all running on the NVIDIA GPU with the CUDA Parallel Computing Architecture. OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.

17 GPU Parallel Computing Developer Eco-System Numerical packages: MATLAB, Mathematica, NI LabVIEW, PyCUDA. Debuggers & profilers: cuda-gdb, NV Visual Profiler, Parallel Nsight for Visual Studio, Allinea, TotalView. GPU compilers: C, C++, Fortran, OpenCL, DirectCompute, Java, Python. Parallelizing compilers: PGI Accelerator, CAPS HMPP, MCUDA, OpenMP. Libraries: BLAS, FFT, LAPACK, NPP, video and imaging. CUDA consultants & training, solution providers: ANEO, GPU Tech.

18 CUDA OVERVIEW

19-21 Processing Flow (over the PCI bus) 1. Copy input data from CPU memory to GPU memory. 2. Load GPU program and execute, caching data on chip for performance. 3. Copy results from GPU memory to CPU memory.
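As a sketch of this three-step flow in CUDA C (illustrative only; the kernel name scale, the data size, and the launch configuration are assumptions, not from the slides):

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    // Step 2 runs on the GPU: each thread scales one element.
    __global__ void scale(float *d, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= factor;
    }

    int main(void)
    {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *h = (float*)malloc(bytes);
        for (int i = 0; i < n; ++i) h[i] = 1.0f;

        float *d;
        cudaMalloc(&d, bytes);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // 1. CPU memory -> GPU memory (PCI bus)
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);      // 2. load and execute the GPU program
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  // 3. GPU memory -> CPU memory
        printf("h[0] = %f\n", h[0]);                      // prints 2.0
        cudaFree(d);
        free(h);
        return 0;
    }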

22 CUDA Parallel Computing Architecture Parallel computing architecture and programming model Includes a CUDA C compiler, support for OpenCL and DirectCompute Architected to natively support multiple computational interfaces (standard languages and APIs)

23 C for CUDA: C with a few keywords

Standard C code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

Parallel CUDA C code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }

    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

24 CUDA Parallel Computing Architecture CUDA defines: Programming model Memory model Execution model CUDA uses the GPU, but is for general-purpose computing Facilitate heterogeneous computing: CPU + GPU CUDA is scalable Scale to run on 100s of cores/1000s of parallel threads

25 Compiling CUDA C Applications (Runtime API) The serial C application (e.g. main(), other_function(), saxpy_serial() with its for loop) is modified into parallel CUDA C code. The key kernels go through NVCC (Open64) into CUDA object files; the rest of the C application goes through the host CPU compiler into CPU object files; the linker combines both into a single CPU-GPU executable.

26 CUDA Review PROGRAMMING MODEL

27 CUDA Kernels Parallel portion of the application executes as a kernel: the entire GPU executes the kernel, with many threads. CUDA threads are lightweight, with fast switching; 1000s execute simultaneously. CPU (host) executes functions; GPU (device) executes kernels.

28 CUDA Kernels: Parallel Threads A kernel is a function executed on the GPU as an array of threads, in parallel. All threads execute the same code but can take different paths: float x = input[threadID]; float y = func(x); output[threadID] = y; Each thread has an ID, used to select input/output data and make control decisions.

29-33 CUDA Kernels: Subdivide into Blocks Threads are grouped into blocks, and blocks are grouped into a grid: a kernel is executed on the GPU as a grid of blocks of threads.

34 Communication Within a Block Threads may need to cooperate: on memory accesses, and to share results. They cooperate using shared memory, accessible by all threads within a block. Restricting cooperation to within a block permits scalability: fast communication between N threads is not feasible when N is large.
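A minimal sketch of such cooperation - a per-block sum in shared memory (the name block_sum and the fixed block size of 256 are assumptions for illustration):

    // Each block reduces 256 of its inputs to one partial sum.
    __global__ void block_sum(const float *in, float *block_results, int n)
    {
        __shared__ float sdata[256];           // visible to all threads of this block only
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        sdata[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                       // make all loads visible before reducing

        // Tree reduction (assumes blockDim.x == 256, a power of two).
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) sdata[tid] += sdata[tid + s];
            __syncthreads();
        }
        if (tid == 0) block_results[blockIdx.x] = sdata[0];
    }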

35-37 Transparent Scalability (diagrams) The same grid of thread blocks runs unchanged on GPUs of different sizes: blocks are distributed over however many multiprocessors the device has, so a larger GPU runs more blocks at once while a smaller one takes more passes; launching too few blocks leaves multiprocessors idle on larger devices.

38 CUDA Programming Model - Summary A kernel executes as a grid of thread blocks (the host launches, the device executes). A block is a batch of threads that communicate through shared memory. Each block has a block ID and each thread has a thread ID; both can be indexed in 1D or 2D (diagram: a kernel launched as a 2D grid of blocks (0,0) through (1,3)).
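An illustrative sketch of these IDs (the kernel name and layout are assumptions): each thread combines its block ID and thread ID into a unique global coordinate.

    __global__ void fill2d(float *out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;  // column from block ID + thread ID
        int y = blockIdx.y * blockDim.y + threadIdx.y;  // row from block ID + thread ID
        if (x < width && y < height)
            out[y * width + x] = (float)x + (float)y;   // one element per thread
    }

    // Possible launch: dim3 block(16, 16);
    //                  dim3 grid((width + 15) / 16, (height + 15) / 16);
    //                  fill2d<<<grid, block>>>(d_out, width, height);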

39 CUDA Review MEMORY MODEL

40-45 Memory hierarchy Per thread: registers and local memory. Per block of threads: shared memory. All blocks: global memory.

46 Additional Memories The host can also allocate textures and arrays of constants. Textures and constants have dedicated caches.
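A small sketch of host-set constants (the symbol coeffs and the kernel are illustrative assumptions):

    __constant__ float coeffs[2];   // served by the dedicated constant cache

    __global__ void affine(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = coeffs[0] * in[i] + coeffs[1];  // every thread reads the same cached values
    }

    // Host side: float h_coeffs[2] = { 2.0f, 1.0f };
    //            cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));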

47 CUDA Review PROGRAMMING ENVIRONMENT

48 CUDA APIs The API allows the host to manage the devices: allocate memory & transfer data, launch kernels. CUDA C Runtime API: high level of abstraction - start here! CUDA C Driver API: more control, more verbose. (OpenCL is similar to the CUDA C Driver API.)

49 CUDA C and OpenCL OpenCL is the entry point for developers who want a low-level API; CUDA C is the entry point for developers who prefer high-level C. Both share the same back-end compiler and optimization technology.

50 Visual Studio Separate file types: .c/.cpp for host code, .cu for device/mixed code. Compilation rules: cuda.rules. Syntax highlighting and IntelliSense. Integrated debugger and profiler: Nexus.

51 Linux Separate file types: .c/.cpp for host code, .cu for device/mixed code. Typically makefile-driven. cuda-gdb for debugging; CUDA Visual Profiler for profiling.

52 Performance CUDA OPTIMIZATION GUIDELINES

53 Optimize Algorithms for GPU Algorithm selection: understand the problem, consider alternate algorithms. Maximize independent parallelism. Maximize arithmetic intensity (math per unit of bandwidth). Recompute? The GPU allocates transistors to arithmetic, not memory, so it is sometimes better to recompute than to cache. Serial computation on the GPU? Even low-parallelism computation may be faster on the GPU than copying data to/from the host.

54 Optimize Memory Access Coalesce global memory accesses: maximise DRAM efficiency - an order-of-magnitude impact on performance. Avoid serialization: minimize shared memory bank conflicts; understand constant cache semantics. Understand spatial locality: optimize the use of textures to ensure spatial locality.

55 Exploit Shared Memory Hundreds of times faster than global memory Inter-thread cooperation via shared memory and synchronization Cache data that is reused by multiple threads Stage loads/stores to allow reordering Avoid non-coalesced global memory accesses

56 Use Resources Efficiently Partition the computation to keep multiprocessors busy Many threads, many thread blocks Multiple GPUs Monitor per-multiprocessor resource utilization Registers and shared memory Low utilization per thread block permits multiple active blocks per multiprocessor Overlap computation with I/O Use asynchronous memory transfers

57 cuda-gdb and Visual Profiler DEBUGGING AND PROFILING

58 CUDA-GDB Extended version of GDB with support for C for CUDA. Supported on 32-bit/64-bit Linux systems. Seamlessly debug both host CPU code and device GPU code. Set breakpoints on any source line or symbol name. Single-stepping executes one warp at a time (except across __syncthreads()). Access and print all CUDA memory allocations: local, global, constant and shared variables.

59 Linux GDB Integration with EMACS

60 Linux GDB Integration with DDD

61 CUDA Driver Low-level Profiling support

1. Set up environment variables:

    export CUDA_PROFILE=1
    export CUDA_PROFILE_CSV=1
    export CUDA_PROFILE_CONFIG=config.txt
    export CUDA_PROFILE_LOG=profile.csv

2. Set up the configuration file, config.txt:

    gpustarttimestamp
    instructions

3. Run the application: matrixmul

4. View the profiler output, profile.csv:

    # CUDA_PROFILE_LOG_VERSION 1.5
    # CUDA_DEVICE 0 GeForce 8800 GT
    # CUDA_PROFILE_CSV 1
    # TIMESTAMPFACTOR fa292bb1ea2c12c
    gpustarttimestamp,method,gputime,cputime,occupancy,instructions
    115f4eaa10e3b220,memcpyHtoD,7.328,...
    115f4eaa10e5dac0,memcpyHtoD,5.664,...
    115f4eaa10e95ce0,memcpyHtoD,7.328,...
    115f4eaa10f2ea60,_Z10dmatrixmulPfiiS_iiS_,19.296,40.000,0.333,...
    115f4eaa10f443a0,memcpyDtoH,7.776,36.000

62 CUDA Visual Profiler - Overview Performance analysis tool to fine tune CUDA applications Supported on Linux/Windows/Mac platforms Functionality: Execute a CUDA application and collect profiling data Multiple application runs to collect data for all hardware performance counters Profiling data for all kernels and memory transfers Analyze profiling data

63 CUDA Visual Profiler data for kernels

64 CUDA Visual Profiler computed data for kernels Instruction throughput: Ratio of achieved instruction rate to peak single issue instruction rate Global memory read throughput (Gigabytes/second) Global memory write throughput (Gigabytes/second) Overall global memory access throughput (Gigabytes/second) Global memory load efficiency Global memory store efficiency

65 CUDA Visual Profiler data for memory transfers Memory transfer type and direction (D=Device, H=Host, A=cuArray) e.g. H to D: Host to Device Synchronous / Asynchronous Memory transfer size, in bytes Stream ID

66 CUDA Visual Profiler data analysis views Views: Summary table Kernel table Memcopy table Summary plot GPU Time Height plot GPU Time Width plot Profiler counter plot Profiler table column plot Multi-device plot Multi-stream plot Analyze profiler counters Analyze kernel occupancy

67 CUDA Visual Profiler Misc. Multiple sessions Compare views for different sessions Comparison Summary plot Profiler projects save & load Import/Export profiler data (.CSV format)

68 NVIDIA Parallel Nsight The industry's first development environment for massively parallel applications. Accelerates GPU + CPU application development. Complete Visual Studio-integrated development environment.

69 Parallel Nsight 1.0 Nsight Parallel Debugger: GPU source code debugging, variable & memory inspection. Nsight Analyzer: platform-level analysis for the CPU and the GPU. Nsight Graphics Inspector: visualize and debug graphics content.

70 Source Debugging Supporting CUDA C and HLSL code. Hardware breakpoints GPU memory and variable views Nsight menu and toolbars

71 Analysis View a correlated trace timeline with both CPU and GPU events.

72 Analysis Detailed tooltips are available for every event on the timeline.

73 1.0 System Requirements Hardware: GeForce 9 series or higher, Tesla C1060/S1070 or higher, Quadro (G9x or higher). Operating system: Windows Server 2008 R2, Windows 7 / Vista, 32- or 64-bit. Visual Studio: Visual Studio 2008 SP1.

74 Supported System Configurations #1: Single machine, single GPU (Analyzer, Graphics Inspector). #2: Two machines connected over the network (Debugger over TCP/IP, plus Analyzer and Graphics Inspector). #3: Single SLI Multi-OS machine with two Quadro GPUs (Debugger, Analyzer, Graphics Inspector).

75 Parallel Nsight 1.0 Versions Standard (free): GPU source debugger, Graphics Inspector. Professional ($349): Analyzer, data breakpoints, premium ticket-based support. Volume and site licensing available.

76 NVIDIA Nexus IDE The industry's first IDE for massively parallel applications. Accelerates co-processing (CPU + GPU) application development. Complete Visual Studio-integrated development environment.

77 NVIDIA Nexus IDE - Debugging

78 NVIDIA Nexus IDE - Profiling

79 Productivity RESOURCES

80 Getting Started CUDA Zone: introductory tutorials, GPU computing online seminars (aka webinars), forums. Documentation: Programming Guide, Best Practices Guide. Examples: CUDA SDK.

81 Libraries NVIDIA: cuBLAS (dense linear algebra, a subset of the full BLAS suite); cuFFT (1D/2D/3D, real and complex). Third party: NAG numeric libraries (e.g. RNGs); CULAPACK/MAGMA. Open source: Thrust (STL/Boost-style template library); CUDPP (data-parallel primitives, e.g. scan, sort and reduction); CUSP (sparse linear algebra and graph computation). Many more...
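As a taste of the library style, a minimal in-place 1D complex FFT with cuFFT (a sketch: error checking is omitted, and d_signal is assumed to already hold n cufftComplex values in GPU memory):

    #include <cufft.h>

    void forward_fft(cufftComplex *d_signal, int n)
    {
        cufftHandle plan;
        cufftPlan1d(&plan, n, CUFFT_C2C, 1);                    // 1D complex-to-complex, batch of 1
        cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  // transform executes on the GPU
        cufftDestroy(plan);
    }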

82 Additional material

83 Targeting Multiple Platforms with CUDA NVCC (NVIDIA CUDA Toolkit) compiles CUDA C/C++ to PTX for NVIDIA GPUs. Ocelot translates PTX to multi-core CPUs; MCUDA translates CUDA source to multi-core CPU code; Swan translates CUDA to OpenCL for other GPUs.

84 OPTIMIZATION 1: MEMORY TRANSFERS & COALESCING

85 Execution Model Software maps to hardware: threads are executed by scalar processors; thread blocks are executed on multiprocessors and do not migrate; several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file). A kernel is launched as a grid of thread blocks, and only one kernel can execute on a device at one time.

86 Warps and Half Warps A thread block consists of 32-thread warps; a warp is executed physically in parallel (SIMD) on a multiprocessor. A half-warp of 16 threads can coordinate global memory accesses into a single transaction against device memory (global and local, in DRAM).

87 Memory Architecture (diagram) The host (CPU and chipset) connects to device DRAM, which holds local, global, constant and texture memory. The GPU's multiprocessors each have registers and shared memory, plus constant and texture caches.

88 Host-Device Data Transfers Device-to-host memory bandwidth is much lower than device-to-device bandwidth: 8 GB/s peak (PCIe x16 Gen 2) vs. 141 GB/s peak (GTX 280). Minimize transfers: intermediate data can be allocated, operated on, and deallocated without ever copying it to host memory. Group transfers: one large transfer is much better than many small ones.

89 Page-Locked Data Transfers cudaMallocHost() allows allocation of page-locked ("pinned") host memory, enabling the highest cudaMemcpy performance: 3.2 GB/s on PCIe x16 Gen1, 5.2 GB/s on PCIe x16 Gen2. See the bandwidthTest CUDA SDK sample. Use with caution: allocating too much page-locked memory can reduce overall system performance. Test your systems and apps to learn their limits.
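A sketch of the pinned-allocation pattern (sizes are illustrative; error checks omitted):

    float *h_pinned, *d_data;
    size_t bytes = 64 << 20;                    // 64 MB, an arbitrary example size
    cudaMallocHost((void**)&h_pinned, bytes);   // page-locked ("pinned") host allocation
    cudaMalloc((void**)&d_data, bytes);
    // ... fill h_pinned ...
    cudaMemcpy(d_data, h_pinned, bytes, cudaMemcpyHostToDevice);  // full PCIe-rate copy
    // ... compute ...
    cudaFree(d_data);
    cudaFreeHost(h_pinned);                     // release pinned memory promptly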

90 Overlapping Data Transfers and Computation The async and stream APIs allow overlap of H2D or D2H data transfers with computation. CPU computation can overlap data transfers on all CUDA-capable devices; kernel computation can overlap data transfers on devices with concurrent copy and execution (roughly compute capability >= 1.1). A stream is a sequence of operations that execute in order on the GPU; operations from different streams can be interleaved. The stream ID is used as an argument to async calls and kernel launches.
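A sketch of two-stream overlap, under the assumptions that h_in is pinned, n is a multiple of 512, and process is some element-wise kernel (all names are illustrative):

    __global__ void process(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] += 1.0f;
    }

    void run_overlapped(float *h_in, float *d_in, int n)
    {
        cudaStream_t s[2];
        for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

        int half = n / 2;
        for (int i = 0; i < 2; ++i) {
            float *dst = d_in + i * half;
            cudaMemcpyAsync(dst, h_in + i * half, half * sizeof(float),
                            cudaMemcpyHostToDevice, s[i]);    // copy queued in stream i
            process<<<half / 256, 256, 0, s[i]>>>(dst, half); // kernel waits only for its own stream's copy
        }
        cudaDeviceSynchronize();   // (cudaThreadSynchronize() in CUDA 3.x-era code)
        for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
    }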

91 Coalescing Global memory accesses of 32-, 64-, or 128-bit words by a half-warp of threads (on Fermi: by a full warp) can result in as few as one (or two) transaction(s) if certain access requirements are met. Float (32-bit) data example (diagram): a half-warp of threads mapping onto 32-, 64- and 128-byte segments of global memory.

92 Coalescing Compute capability 1.2 and higher issues transactions for segments of 32B, 64B, and 128B, using smaller transactions to avoid wasted bandwidth: e.g. 1 transaction for a 64B segment, 2 transactions for a 64B plus a 32B segment, 1 transaction for a 128B segment.
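Two kernels sketching the difference (illustrative; both assume the arrays are large enough for every computed index):

    __global__ void copy_coalesced(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];     // thread k touches word k: a half-warp falls in one segment
    }

    __global__ void copy_strided(const float *in, float *out, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        out[i] = in[i];     // thread k touches word k*stride: many segments, many transactions
    }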

93 OPTIMIZATION 2: EXECUTION CONFIG

94 Occupancy Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy. Occupancy = the number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently. It is limited by resource usage: registers and shared memory.
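A worked example under assumed GT200-class limits (16384 registers and 16 KB of shared memory per multiprocessor, 32-warp capacity; the kernel's numbers are illustrative):

    // Kernel: 256 threads/block (8 warps), 16 registers/thread, 4 KB shared memory/block.
    // Registers:  256 * 16 = 4096 per block  -> 16384 / 4096 = 4 blocks fit
    // Shared mem: 16 KB / 4 KB               -> 4 blocks fit
    // Warps:      4 blocks * 8 warps = 32    -> 32 / 32 = 100% occupancy
    // At 32 registers/thread only 2 blocks fit (16 warps), halving occupancy to 50%.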

95 Blocks per Grid Heuristics # of blocks > # of multiprocessors, so all multiprocessors have at least one block to execute. # of blocks / # of multiprocessors > 2, so multiple blocks can run concurrently on a multiprocessor: blocks that aren't waiting at a __syncthreads() keep the hardware busy, subject to resource availability (registers, shared memory). # of blocks > 100 to scale to future devices: blocks execute in pipeline fashion, and 1000 blocks per grid will scale across multiple generations.

96 Register Pressure Hide latency by using more threads per multiprocessor. Limiting factors: the number of registers per kernel (8K/16K registers per multiprocessor, partitioned among concurrent threads) and the amount of shared memory (16 KB per multiprocessor, partitioned among concurrent thread blocks). Compile with the --ptxas-options=-v flag to see usage. Use the --maxrregcount=N flag to NVCC, where N is the desired maximum registers per kernel; at some point spilling into local memory may occur, which reduces performance since local memory is slow.

97 Occupancy Calculator

98 Optimizing threads per block Choose threads per block as a multiple of the warp size: this avoids wasting computation on under-populated warps and facilitates coalescing. You want to run as many warps as possible per multiprocessor to hide latency; a multiprocessor can run up to 8 blocks at a time. Heuristics: 64 threads per block as a minimum, and only if there are multiple concurrent blocks; 192 or 256 threads is usually a better choice (usually still enough registers to compile and invoke successfully). This all depends on your computation, so experiment!

99 Occupancy != Performance Increasing occupancy does not necessarily increase performance, BUT low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels. (It all comes down to arithmetic intensity and available parallelism.)

100 OPTIMIZATION 3: MATH FUNCS & BRANCHING

101 Runtime Math Library There are two types of runtime math operations in single precision. __funcf(): direct mapping to the hardware ISA; fast but lower accuracy (see the programming guide for details); examples: __sinf(x), __expf(x), __powf(x,y). funcf(): compiles to multiple instructions; slower but higher accuracy (5 ulp or less); examples: sinf(x), expf(x), powf(x,y). The -use_fast_math compiler option forces every funcf() to compile to __funcf().
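A sketch contrasting the two (the kernel name is assumed):

    __global__ void sines(const float *x, float *fast, float *accurate, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            fast[i]     = __sinf(x[i]);  // maps to the hardware ISA: fast, lower accuracy
            accurate[i] = sinf(x[i]);    // multiple instructions: slower, higher accuracy
        }
    }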

102 Control Flow Instructions The main performance concern with branching is divergence: threads within a single warp take different paths, and the different execution paths must be serialized. Avoid divergence when the branch condition is a function of the thread ID. Example with divergence: if (threadIdx.x > 2) { } - branch granularity < warp size. Example without divergence: if (threadIdx.x / WARP_SIZE > 2) { } - branch granularity is a whole multiple of the warp size.

103 OPTIMIZATION 4: SHARED MEMORY

104 Shared Memory ~A hundred times faster than global memory. Cache data in it to reduce global memory accesses. Threads can cooperate via shared memory. Use it to avoid non-coalesced access: stage loads and stores in shared memory to re-order non-coalesceable addressing.

105 Shared Memory Architecture Many threads access memory at once, so the memory is divided into banks (bank 0 through bank 15), with successive 32-bit words assigned to successive banks. Each bank can service one address per cycle, so a memory can service as many simultaneous accesses as it has banks. Multiple simultaneous accesses to one bank result in a bank conflict, and conflicting accesses are serialized.

106 Bank Addressing Examples (diagrams) No bank conflicts: linear addressing with stride == 1 (thread 0 -> bank 0, thread 1 -> bank 1, ..., thread 15 -> bank 15), or any random 1:1 permutation of threads to banks.

107 Bank Addressing Examples (diagrams) 2-way bank conflicts: linear addressing with stride == 2 (pairs of threads map to the same bank). 8-way bank conflicts: linear addressing with stride == 8 (eight threads map to each of banks 0 and 8).

108 Shared memory bank conflicts Shared memory is about as fast as registers if there are no bank conflicts; the warp_serialize profiler signal reflects conflicts. The fast case: if all threads of a half-warp access different banks, there is no bank conflict; if all threads of a half-warp read the identical address, there is also no bank conflict (broadcast). The slow case: multiple threads in the same half-warp access the same bank, so the accesses must be serialized; the cost is the maximum number of simultaneous accesses to a single bank.

109 Shared Memory Example: Transpose Each thread block works on a tile of the matrix. The naïve implementation exhibits strided access to global memory (diagram: elements of idata transposed into odata by a half-warp of threads).

110 Naïve Transpose Loads are coalesced, stores are not (strided by height):

    __global__ void transposeNaive(float *odata, float *idata, int width, int height)
    {
        int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
        int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
        int index_in  = xIndex + width * yIndex;
        int index_out = yIndex + height * xIndex;
        odata[index_out] = idata[index_in];
    }

111 Coalescing through shared memory Access columns of a tile in shared memory to write contiguous data to global memory. This requires __syncthreads(), since threads access data in shared memory that was stored by other threads (diagram: idata -> tile -> odata, elements transposed by a half-warp of threads).

112 Coalescing through shared memory

    __global__ void transposeCoalesced(float *odata, float *idata, int width, int height)
    {
        __shared__ float tile[TILE_DIM][TILE_DIM];

        int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
        int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
        int index_in = xIndex + yIndex * width;

        xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
        yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
        int index_out = xIndex + yIndex * height;

        tile[threadIdx.y][threadIdx.x] = idata[index_in];
        __syncthreads();
        odata[index_out] = tile[threadIdx.x][threadIdx.y];
    }

113 Bank Conflicts in Transpose In a 16x16 shared memory tile of floats, the data in each column lie in the same bank, so reading a column of the tile causes a 16-way bank conflict. Solution: pad the shared memory array so that data along anti-diagonals share a bank instead:

    __shared__ float tile[TILE_DIM][TILE_DIM+1];

114 FERMI: NEW ARCHITECTURE

115 Fermi: The Computational GPU (die diagram: GigaThread scheduler, host interface, shared L2 cache, six DRAM interfaces) Performance: 13x the double-precision performance of CPUs; IEEE SP & DP floating point. Flexibility: shared memory increased from 16 KB to 64 KB; added L1 and L2 caches; ECC on all internal and external memories; enables up to 1 terabyte of GPU memory; high-speed GDDR5 memory interface. Usability: multiple simultaneous tasks on the GPU; 10x faster atomic operations; C++ support; system calls and printf support. Availability: Q. Disclaimer: specifications subject to change.

116 Fermi Memory operations are done per warp (32 threads) instead of per half-warp, for both global and shared memory. Shared memory: 16 or 48 KB, now with 32 banks, each 32 bits wide; no bank conflicts when accessing 8-byte words. An L1 cache per multiprocessor should help with misaligned accesses, strided accesses, and some register spilling. Much improved dual-issue: can dual-issue fp32 pairs, fp32-mem, fp64-mem, etc. IEEE-conformant rounding. 64-bit, uniform address space.

117 L1 cache Used for all memory operations (global memory, shared memory); it shares 64 KB with shared memory, and the split can be switched between 16 and 48 KB with a CUDA API call. For global memory it caches reads only; it helps, e.g., when the compiler detects that all threads load the same value. The per-multiprocessor L1 is NOT coherent: use volatile for global memory accesses if another SM's threads may change the location (but consider why that is needed - not all blocks run at once, so inter-block signalling risks deadlock). Local memory reads and writes are cached, improving spilling behaviour (coherence is no problem there, since local memory is SM-private).

118 Fermi has a 64-bit address space But only 32-bit registers, so in unfortunate cases 64-bit addressing causes unnecessary register-allocation overhead on a Fermi C2050 (3 GB). Driver API: compile kernels in 32-bit mode; they can still be loaded by a 64-bit app. Runtime API (CUDART): compile the application in 32-bit mode (nvcc -m32), which also produces 32-bit GPU code, and use the new __launch_bounds__() qualifier to help the compiler optimize register usage.
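A sketch of the qualifier (the bounds and the kernel are illustrative):

    __global__ void
    __launch_bounds__(256, 4)   // at most 256 threads/block; aim for 4 resident blocks per SM
    scale_bounded(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;   // compiler caps register use to honor the bounds
    }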

119 __umul24 not optimal on Fermi On the Tesla C1060 / GT200 architecture, bounded integer multiplications could be accelerated with __umul24(a, b) instead of a * b, e.g. unsigned int tid = __umul24(blockIdx.x, blockDim.x) + threadIdx.x; On Fermi, __umul24() is emulated, and is thus slower than a * b.

120 HPC and IEEE conformance Default settings for computation on the GPU are now more conservative (for HPC): denormal support and IEEE-conformant division and square root - accuracy over speed. If your app runs faster on Fermi with -arch=sm_13 than with -arch=sm_20, then the PTX JIT has used the "old" Tesla C1060 settings, which favor speed: flush-to-zero instead of denormals, no IEEE-precise division, no IEEE-precise square root. For similar results under -arch=sm_20, use: -ftz=true -prec-div=false -prec-sqrt=false. See the NVIDIA CUDA Programming Guide, sections 5.4.1 and G.2, and The CUDA Compiler Driver NVCC. (These sections also contain information on instruction timings.)

121 CONCLUSION, QUESTIONS & GTC INVITE

122 GPU Technology Conference 2010 Monday, Sept. 20 - Thursday, Sept. 23, 2010, San Jose Convention Center, San Jose, California. The most important event in the GPU ecosystem: learn about seismic shifts in GPU computing, preview disruptive technologies and emerging applications, get tools and techniques to impact mission-critical projects, and network with experts, colleagues, and peers across industries. "I consider the GPU Technology Conference to be the single best place to see the amazing work enabled by the GPU. It's a great venue for meeting researchers, developers, scientists, and entrepreneurs from around the world." -- Professor Hanspeter Pfister, Harvard University and GTC 2009 keynote speaker. Opportunities: Call for Submissions (sessions & posters), Sponsors / Exhibitors (reach decision makers), CEO on Stage / Showcase for Startups (tell your story to VCs and analysts).

123 Thank You Questions?


More information

CS427 Multicore Architecture and Parallel Computing

CS427 Multicore Architecture and Parallel Computing CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:

More information

Introduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator

Introduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator Introduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator What is CUDA? Programming language? Compiler? Classic car? Beer? Coffee? CUDA Parallel Computing Platform www.nvidia.com/getcuda Programming

More information

Lecture 11: GPU programming

Lecture 11: GPU programming Lecture 11: GPU programming David Bindel 4 Oct 2011 Logistics Matrix multiply results are ready Summary on assignments page My version (and writeup) on CMS HW 2 due Thursday Still working on project 2!

More information

ECE 574 Cluster Computing Lecture 17

ECE 574 Cluster Computing Lecture 17 ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 March 2019 HW#8 (CUDA) posted. Project topics due. Announcements 1 CUDA installing On Linux

More information

GPU Programming Using NVIDIA CUDA

GPU Programming Using NVIDIA CUDA GPU Programming Using NVIDIA CUDA Siddhante Nangla 1, Professor Chetna Achar 2 1, 2 MET s Institute of Computer Science, Bandra Mumbai University Abstract: GPGPU or General-Purpose Computing on Graphics

More information

Parallel Programming and Debugging with CUDA C. Geoff Gerfin Sr. System Software Engineer

Parallel Programming and Debugging with CUDA C. Geoff Gerfin Sr. System Software Engineer Parallel Programming and Debugging with CUDA C Geoff Gerfin Sr. System Software Engineer CUDA - NVIDIA s Architecture for GPU Computing Broad Adoption Over 250M installed CUDA-enabled GPUs GPU Computing

More information

Lecture 1: an introduction to CUDA

Lecture 1: an introduction to CUDA Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Overview hardware view software view CUDA programming

More information

HPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming

HPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming KFUPM HPC Workshop April 29-30 2015 Mohamed Mekias HPC Solutions Consultant Introduction to CUDA programming 1 Agenda GPU Architecture Overview Tools of the Trade Introduction to CUDA C Patterns of Parallel

More information

Real-time Graphics 9. GPGPU

Real-time Graphics 9. GPGPU Real-time Graphics 9. GPGPU GPGPU GPU (Graphics Processing Unit) Flexible and powerful processor Programmability, precision, power Parallel processing CPU Increasing number of cores Parallel processing

More information

NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010

NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Analysis-Driven Optimization Paulius Micikevicius NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Performance Optimization Process Use appropriate performance metric for each kernel For example,

More information

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes. HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes Ian Glendinning Outline NVIDIA GPU cards CUDA & OpenCL Parallel Implementation

More information

Introduction to CUDA Programming

Introduction to CUDA Programming Introduction to CUDA Programming Steve Lantz Cornell University Center for Advanced Computing October 30, 2013 Based on materials developed by CAC and TACC Outline Motivation for GPUs and CUDA Overview

More information

Graph Partitioning. Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM.

Graph Partitioning. Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM. Graph Partitioning Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM. Partition given graph G=(V,E) in k subgraphs of nearly equal

More information

Performance Optimization Process

Performance Optimization Process Analysis-Driven Optimization (GTC 2010) Paulius Micikevicius NVIDIA Performance Optimization Process Use appropriate performance metric for each kernel For example, Gflops/s don t make sense for a bandwidth-bound

More information

CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN

CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction Francesco Rossi University of Bologna and INFN * Using this terminology since you ve already heard of SIMD and SPMD at this school

More information