The Visual Computing Company

1 The Visual Computing Company
Update: NVIDIA GPU Ecosystem
Axel Koehler, Senior Solutions Architect HPC, NVIDIA

2 Outline
Tesla K40 and GPU Boost
Jetson TK-1 Development Board for Embedded HPC
Pascal GPU: 3D Memory, NVLINK
CUDA 6.0: Unified memory, Extended Library Interfaces
GPU Direct RDMA with OpenMPI
and beyond

3 Tesla K40
FASTER: 1.4 TF, 2880 Cores, 288 GB/s
LARGER: 2x Memory (12GB versus 6GB), Enables More Apps (Fluid Dynamics, Seismic Analysis, Rendering)
SMARTER: Unlock Extra Performance Using Power Headroom
[Chart: AMBER Benchmark, ns/day, for CPU, K20X and K40]
AMBER Benchmark: SPFP-Nucleosome. CPU: Dual 3.10GHz, 64GB System Memory, CentOS 6.2. GPU systems: Single Tesla K20X or Single Tesla K40.

4 Avg GPU Power in Watts for Real Applications on K20X
[Chart: Board Power (Watts), average GPU power per application: AMBER, ANSYS, Black Scholes, Chroma, GROMACS, GTC, LAMMPS, LSMS, NAMD, Nbody, QMCPACK, RTM, SPECFEM3D]

5 GPU Boost on Tesla K40
Convert Power Headroom to Higher Performance
Base Clock, 745 MHz: Workload #1, worst-case reference app (235W)
Boost Clock #1, 810 MHz: Workload #2, e.g. AMBER (235W)
Boost Clock #2, 875 MHz: Workload #3, e.g. ANSYS Fluent (235W)
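To put the three clock levels on this slide in proportion, the following minimal sketch computes how much faster each boost clock runs than the base clock. The clock values are the ones quoted on the slide; the function name is made up for illustration. The percentage is only an upper bound on the gain for a purely clock-bound kernel, since real applications also depend on memory bandwidth and other factors.

```python
# Clock values (MHz) as quoted on the slide for Tesla K40.
BASE_MHZ = 745
BOOST_MHZ = {"Boost Clock #1": 810, "Boost Clock #2": 875}


def clock_uplift_percent(base_mhz: float, boost_mhz: float) -> float:
    """Relative clock increase over the base clock, in percent."""
    return (boost_mhz - base_mhz) / base_mhz * 100.0


for name, mhz in BOOST_MHZ.items():
    uplift = clock_uplift_percent(BASE_MHZ, mhz)
    print(f"{name}: {mhz} MHz, +{uplift:.1f}% over base")
```

Under these numbers the two boost levels correspond to roughly 9% and 17% higher clocks at the same 235W board power.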

6 Compute Workload Behavior with GPU Boost
GPU clock: Non-Tesla uses automatic clock switching; Tesla K40 uses deterministic clocks
Default: Non-Tesla: Boost; Tesla K40: Base
Preset options: Non-Tesla: lock to base clock; Tesla K40: 3 levels (Base, Boost1 or Boost2)
Boost interface: Non-Tesla: Control Panel; Tesla K40: NV-SMI, NVML
Target duration for boost clocks: Non-Tesla: ~50% of run-time; Tesla K40: 100% of workload run time (must-have for HPC workloads)
Tesla K40 commands:
nvidia-smi -q -d CLOCK,SUPPORTED_CLOCKS
nvidia-smi -ac <MEM clock, Graphics clock>

7 Pascal GPU
Optimized for double precision FP
Very high bandwidth, large capacity 3D memory on package
NVLINK for high bandwidth CPU-GPU and GPU-GPU interconnect
Unified Memory (UM) HW support
New packaging allows much denser solutions (one-third the size of current PCIe boards)

8 Stacked Memory
3D chip-on-wafer integration
Multiple layers of DRAM components will be integrated vertically on the package along with the GPU
Compared to GDDR5 memory: 4x Higher Bandwidth, 3x Larger Capacity, 4x More Energy Efficient per bit

9 NVLINK
CPU-GPU communication is limited by the low bandwidth connection via PCI-e
NVLINK is a high speed interconnect between CPU-GPU and GPU-GPU
Basic building block is an 8-lane, differential, dual simplex bidirectional link
Multiple links can be aggregated to increase the BW of a connection
NVLink will provide between 80 and 200 GB/s of bandwidth
Cache coherency provided with NVLINK 2.0
Preserves the PCIe programming model: CPU-initiated transactions such as control and configuration go over a PCIe connection; GPU-initiated transactions use NVLink, allowing the GPU full-bandwidth access to the CPU's memory system
NVLink is more than twice as energy efficient as a PCIe 3.0 connection
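As a rough illustration of what the quoted 80 to 200 GB/s range means in practice, this sketch compares ideal transfer times against a PCIe 3.0 x16 link. The PCIe figure is derived from the standard 8 GT/s per lane with 128b/130b encoding (per direction); the 12 GB payload is chosen only because it matches the K40 memory size. Latency and protocol overhead are ignored, and the slide does not say whether its NVLink range is per direction, so treat the ratios as an approximation.

```python
def pcie3_bandwidth_gbs(lanes: int = 16) -> float:
    """Theoretical per-direction PCIe 3.0 bandwidth in GB/s.

    8 GT/s per lane, 128b/130b line encoding, 8 bits per byte.
    """
    return 8.0 * lanes * (128.0 / 130.0) / 8.0


def transfer_seconds(size_gb: float, bandwidth_gbs: float) -> float:
    """Ideal transfer time: payload size divided by peak bandwidth."""
    return size_gb / bandwidth_gbs


PAYLOAD_GB = 12.0  # illustrative payload: one full K40 memory image
links = {
    "PCIe 3.0 x16": pcie3_bandwidth_gbs(16),
    "NVLink (low end, 80 GB/s)": 80.0,    # figure quoted on the slide
    "NVLink (high end, 200 GB/s)": 200.0,  # figure quoted on the slide
}

for name, bw in links.items():
    t = transfer_seconds(PAYLOAD_GB, bw)
    ratio = bw / links["PCIe 3.0 x16"]
    print(f"{name}: {bw:.1f} GB/s, {t:.3f} s for {PAYLOAD_GB:.0f} GB, {ratio:.1f}x PCIe")
```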

10 NVLINK

11 JETSON TK1: THE WORLD'S 1st EMBEDDED SUPERCOMPUTER
Development Platform for Embedded Computer Vision, Robotics, Medical, ...
Tegra K1 SOC
Kepler GPU with 192 Cores (Compute Capability 3.2)
4-Plus-1 quad-core ARM Cortex A15 CPU
2 GB Memory, 16 GB eMMC memory
IO options: mini-PCIe slot, GigE, HDMI, SD/MMC connector, USB 3.0, SATA data port, ...
CUDA Toolkit 6.0, OpenGL 4.4, OpenGL ES 3.0
Runs 32-bit Ubuntu Linux for Tegra (L4T)
326 GFLOPS, 5 Watts

12 IBM Partners with NVIDIA to Build Next-Generation Supercomputers
POWER 8 CPU + Tesla GPU
GPU-Accelerated POWER-Based Systems Available in

13 Three ISAs, One Programming Model
CUDA Ecosystem (Libraries, Directives, Languages)
x86, ARM, POWER

14 Integration of Compute and Visualisation
A new GPU Operation Mode was introduced with the Kepler GK110-based K20/K20X/K40 (not on C-Class)
The "All On" mode enables graphics capabilities: nvidia-smi --gom=0

15 Unified Memory: Dramatically Lower Developer Effort
Developer View Today: System Memory and GPU Memory
Developer View With Unified Memory: Unified Memory

16 Super Simplified Memory Management Code

CPU Code:
void sortfile(FILE *fp, int N) {
    char *data;
    data = (char *)malloc(N);
    fread(data, 1, N, fp);
    qsort(data, N, 1, compare);
    use_data(data);
    free(data);
}

CUDA 6 Code with Unified Memory:
void sortfile(FILE *fp, int N) {
    char *data;
    cudaMallocManaged(&data, N);
    fread(data, 1, N, fp);
    qsort<<<...>>>(data, N, 1, compare);
    cudaDeviceSynchronize();
    use_data(data);
    cudaFree(data);
}

17 Unified Memory Delivers
1. Simpler Programming & Memory Model
Single pointer to data, accessible anywhere
Tight language integration
Greatly simplifies code porting
2. Performance Through Data Locality
Migrate data to accessing processor
Guarantee global coherency
Still allows cudaMemcpyAsync() hand tuning

18 Unified Memory Roadmap
CUDA 6, Ease of Use: Single Pointer to Data; No Memcopy Required; launch & sync; Shared C/C++ Data Structures
Next, Optimizations: Prefetching; Migration Hints; Additional OS Support
Future GPUs: Finer Grain Migration; Not Limited to GPU Memory Size
Learn More:

19 Kepler Enables Full NVIDIA GPUDirect RDMA
[Diagram: Server 1 and Server 2, each with a CPU and System Memory plus GPU1 and GPU2 with GDDR5 Memory, connected via PCI-e to a Network Card and across the Network]

20 Mellanox Infiniband with GPUDirect RDMA
Mellanox GPUDirect (GDR) MLNX_OFED driver is available
Beta release works with CUDA 5.5
Final release will be based on CUDA 6.0
Supported on any ConnectX adapter that uses the MLX4 driver
The MVAPICH2-GDR release can be used with this IB driver release

21 GPU Direct RDMA with OpenMPI
Starting with CUDA 6, OpenMPI also supports GPU Direct RDMA
Kepler class GPUs (K10, K20, K20X, K40)
Mellanox ConnectX-3, ConnectX-3 Pro, Connect-IB
CUDA 6.0 (EA, RC, Final), Open MPI and Mellanox OFED 2.1 drivers
GPU Direct RDMA enabling software

22 GPU Direct RDMA with OpenMPI
OpenMPI compilation: configure --with-cuda
Support is configured in if the CUDA 6.0 cuda.h header file is detected. To check:
> ompi_info --all | grep btl_openib_have_cuda_gdr
MCA btl: informational "btl_openib_have_cuda_gdr" (current value: "true", data source: default, level: 4 tuner/basic, type: bool)
> ompi_info --all | grep btl_openib_have_driver_gdr
MCA btl: informational "btl_openib_have_driver_gdr" (current value: "true", data source: default, level: 4 tuner/basic, type: bool)
Enable GPU Direct RDMA usage (off by default):
--mca btl_openib_want_cuda_gdr 1
Adjust the message size at which we switch to pipelined transfers through host memory (current default is 30,000 bytes):
--mca btl_openib_cuda_rdma_limit 60000
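The effect of the two MCA parameters on this slide can be pictured with a small decision function. This is a simplified model written for illustration, not Open MPI's actual implementation: the function name and return labels are hypothetical, and whether a message exactly at the limit still uses RDMA is an assumption.

```python
# Default threshold quoted on the slide for btl_openib_cuda_rdma_limit.
DEFAULT_RDMA_LIMIT_BYTES = 30_000


def choose_transfer_path(message_bytes: int,
                         want_cuda_gdr: bool = False,
                         rdma_limit: int = DEFAULT_RDMA_LIMIT_BYTES) -> str:
    """Pick a transfer path for a GPU buffer (simplified, hypothetical model).

    want_cuda_gdr mirrors btl_openib_want_cuda_gdr (off by default).
    rdma_limit mirrors btl_openib_cuda_rdma_limit.
    """
    if not want_cuda_gdr:
        # GPU Direct RDMA disabled: data is staged through host memory.
        return "host-staged"
    if message_bytes <= rdma_limit:  # boundary handling is an assumption
        return "gpudirect-rdma"
    # Large messages fall back to pipelined copies through host memory.
    return "host-staged"


for size in (4_096, 30_000, 40_000, 1_000_000):
    default_path = choose_transfer_path(size, want_cuda_gdr=True)
    raised_path = choose_transfer_path(size, want_cuda_gdr=True, rdma_limit=60_000)
    print(f"{size:>9} bytes: limit 30000 -> {default_path}, limit 60000 -> {raised_path}")
```

Raising the limit to 60000, as in the slide's example, moves messages between 30,000 and 60,000 bytes from the host-staged pipeline onto the direct path.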

23 GPU Direct RDMA with OpenMPI
Chipset implementation limits bandwidth at larger message sizes
Still use pipelining with host memory staging for large messages (hybrid version utilizes asynchronous copies)

24 GPU Direct RDMA with OpenMPI
HOOMD-blue (git master 28Jan14), Lennard-Jones Liquid dataset (16K, 512K Particles)
[Charts: two panels, higher is better, annotated with 102% and 20%]
Configuration 1: Dual-Socket Intel E GHz CPUs, 64GB memory, RHEL 6.2, MLNX_OFED , Mellanox FDR, 1 x Tesla K40 per node, Driver , Open MPI 1.7.4rc1, GPUDirect RDMA (nvidia_peer_memory-1.0-0.tar.gz)
Configuration 2: Dual-Socket Intel E5-2630 v2 @ 2.60 GHz CPUs, 64GB memory, Scientific Linux 6.4, MLNX_OFED 2.1-1.0.0, Mellanox FDR, 2 x Tesla K20 per node, Driver 331., Open MPI 1.7.4rc1, GPUDirect RDMA (nvidia_peer_memory tar.gz)

25 Extended (XT) Library Interfaces
Automatic scaling to multiple GPUs per node
cuFFT 2D/3D & cuBLAS level 3
Operate directly on large datasets that reside in CPU memory
developer.nvidia.com/cublasxt
16K x 16K SGEMM on Tesla K10: 1 x K10: 2.2 TFLOPS; 2 x K10: 4.2 TFLOPS; 3 x K10: 6.0 TFLOPS; 4 x K10: 7.9 TFLOPS
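The SGEMM figures on this slide can be turned into a scaling efficiency with a few lines of arithmetic. The sketch below assumes the four TFLOPS labels map to 1, 2, 3 and 4 K10 boards in increasing order (the flattened chart does not state this explicitly) and reads "16K" as 16384. It uses the standard 2·n³ operation count for a dense n x n matrix multiply.

```python
# Assumed mapping of the slide's TFLOPS labels to the number of K10 boards.
SGEMM_TFLOPS = {1: 2.2, 2: 4.2, 3: 6.0, 4: 7.9}
N = 16 * 1024  # "16K x 16K", read here as 16384


def gemm_flop(n: int) -> int:
    """Floating-point operations in a dense n x n matrix multiply (2 * n^3)."""
    return 2 * n ** 3


def scaling_efficiency(tflops_by_boards: dict) -> dict:
    """Measured throughput divided by ideal linear scaling from one board."""
    base = tflops_by_boards[1]
    return {boards: t / (boards * base) for boards, t in tflops_by_boards.items()}


efficiency = scaling_efficiency(SGEMM_TFLOPS)
for boards, tflops in SGEMM_TFLOPS.items():
    seconds = gemm_flop(N) / (tflops * 1e12)
    print(f"{boards} x K10: {tflops} TFLOPS, "
          f"{efficiency[boards]:.0%} of linear, {seconds:.2f} s per SGEMM")
```

Under that assumed mapping, four boards retain about 90% of ideal linear scaling.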

26 New Drop-in NVBLAS Library
Drop-in replacement for CPU-only BLAS
Automatically routes BLAS3 calls to cuBLAS
Example: Drop-in Speedup for R
> LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so R
> A <- matrix(rnorm(4096*4096), nrow=4096, ncol=4096)
> B <- matrix(rnorm(4096*4096), nrow=4096, ncol=4096)
> system.time(C <- A %*% B)
user system elapsed
Use in any app that uses standard BLAS3: Octave, Scilab, etc.
[Chart: Matrix-Matrix Multiplication in R, fp64 GFlops/s against matrix dimension; nvblas on 4x K20X GPUs versus MKL on a 6-core Xeon E CPU]

27 Remote Development with Nsight Eclipse Edition
Local IDE, remote application
Edit locally, build & run remotely
Automatic sync via ssh
Cross-compilation to ARM
Full debugging & profiling via remote connection
Workflow: Edit, sync, Build, Run, Debug, Profile

28 Goals for the CUDA Platform
Simplicity: Learn, adopt, & use parallelism with ease
Productivity: Quickly achieve feature & performance goals
Portability: Write code that can execute on all targets
Performance: High absolute performance and scalability

29 Simpler Heterogeneous Applications
We want: homogeneous programs, heterogeneous execution
Unified programming model includes parallelism in language
Abstract heterogeneous execution via Runtime or Virtual Machine
Current: Hybrid Program (separate parallel and serial parts for GPU and CPU)
Ideal: Single Program, Homogeneous Programming Model (parallel + serial, on GPU and CPU)

30 Parallelism in Mainstream Languages
Enable more programmers to write parallel software
Give programmers the choice of language to use
GPU support in key languages C

31 C++ Parallel Algorithms Library Progress
std::vector<int> vec = ...
// previous standard sequential loop
std::for_each(vec.begin(), vec.end(), f);
// explicitly sequential loop
std::for_each(std::seq, vec.begin(), vec.end(), f);
// permitting parallel execution
std::for_each(std::par, vec.begin(), vec.end(), f);
Complete set of parallel primitives: for_each, sort, reduce, scan, etc.
ISO C++ committee voted unanimously to accept as official tech. specification working draft
N3960 Technical Specification Working Draft:
Prototype:

32 Numba Python Compiler
Free and open source compiler for array-oriented Python
NEW numba.cuda module integrates CUDA directly into Python

from numba import cuda

@cuda.jit("void(float32[:], float32, float32[:], float32[:])")
def saxpy(out, a, x, y):
    i = cuda.grid(1)
    out[i] = a * x[i] + y[i]

# Launch saxpy kernel
saxpy[griddim, blockdim](out, a, x, y)
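To make the launch configuration on this slide concrete, here is a pure-Python emulation that needs neither Numba nor a GPU. It runs the "threads" one after another and computes the same global index that cuda.grid(1) yields for a 1D launch (block index times block size plus thread index). The helper names launch_1d and saxpy_thread are hypothetical, and the bounds guard is an addition that the slide's kernel omits; it matters whenever the array length is not a multiple of the block size.

```python
import math


def launch_1d(kernel, griddim, blockdim, *args):
    """Sequentially emulate a 1D kernel launch: one call per (block, thread)."""
    for block_idx in range(griddim):
        for thread_idx in range(blockdim):
            # Global thread index, as cuda.grid(1) would return it.
            i = block_idx * blockdim + thread_idx
            kernel(i, *args)


def saxpy_thread(i, out, a, x, y):
    """Work done by a single emulated thread."""
    if i < len(out):  # guard: surplus threads in the last block do nothing
        out[i] = a * x[i] + y[i]


n = 10
blockdim = 4
griddim = math.ceil(n / blockdim)  # enough blocks to cover all n elements

x = [float(k) for k in range(n)]
y = [1.0] * n
out = [0.0] * n

launch_1d(saxpy_thread, griddim, blockdim, out, 2.0, x, y)
print(griddim, out)
```

With n = 10 and a block size of 4, three blocks are launched and the last two emulated threads fall outside the array, which is the case the guard handles.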

34 GPU-Accelerated Hadoop
Extract insights from customer data
Data Analytics using clustering algorithms
Developed using CUDA-accelerated IBM Java

35 Compile Java for GPUs
Approach: apply a closure to a set of arrays

// vector addition
float[] X = {1.0f, 2.0f, 3.0f, 4.0f, };
float[] Y = {9.0f, 8.1f, 7.2f, 6.3f, };
float[] Z = {0.0f, 0.0f, 0.0f, 0.0f, };
jog.foreach(X, Y, Z, new jogContext(),
    new jogClosureRet<jogContext>() {
        public float execute(float x, float y) {
            return x + y;
        }
    }
);

foreach iterations are parallelized over GPU threads
Threads run the closure's execute() method
[Chart: Java Black-Scholes Options Pricing Speedup, speedup vs. sequential Java against millions of options]
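The closure-over-arrays pattern on this slide can be mimicked on the CPU to show its shape. In this sketch a thread pool stands in for the GPU threads, and the foreach function is a hypothetical stand-in written for illustration; it is not the Java API from the slide.

```python
from concurrent.futures import ThreadPoolExecutor


def foreach(x, y, z, closure, workers=4):
    """Apply closure(x[i], y[i]) for every index i and store the result in z[i].

    Each index is an independent task, mirroring one GPU thread per element.
    """
    def run(i):
        z[i] = closure(x[i], y[i])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consume the iterator so every task finishes before returning.
        list(pool.map(run, range(len(z))))


# Vector addition, using the values from the slide.
X = [1.0, 2.0, 3.0, 4.0]
Y = [9.0, 8.1, 7.2, 6.3]
Z = [0.0, 0.0, 0.0, 0.0]

foreach(X, Y, Z, lambda a, b: a + b)
print(Z)
```

Because each iteration writes only its own element of Z, the iterations can run in any order, which is what lets a runtime spread them across threads.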

36 GPUs are Going Beyond Scientific & Technical Computing
GPUs Accelerate Machine Learning & Data Analytics
Auto Tagging in Creative Cloud
Speech/Image Recognition
Analyzing Twitter
Hadoop-based Clustering
Recommendation Engine
Visual Shopping
Searching Audio
Real Time Video Delivery
Database Queries
Search Ranking

37 The Massively Parallel Programming Blog
Technical posts on GPUs, CUDA, OpenACC, Libraries, C/C++/Python and more
In-depth articles and regular series:
CUDACasts: instructive videos
CUDA Pro Tips: useful techniques
CUDA Spotlight Interviews
Join the conversation by subscribing to or RSS updates today!

38 The Visual Computing Company
Axel Koehler
NVIDIA, the NVIDIA logo, GeForce, Quadro, Tegra, Tesla, GeForce Experience, GRID, GTX, Kepler, ShadowPlay, GameStream, SHIELD, and The Way It's Meant To Be Played are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated. NVIDIA Corporation. All rights reserved.
