Building NVLink for Developers

Building NVLink for Developers: Unleashing programmatic, architectural, and performance capabilities for accelerated computing

Why NVLink™? Simpler, Better, and Faster

Simplified Programming
- No specialized skills required vs. PCIe
- Tap into unified memory and a unified namespace
- Transparent data movement with the Page Migration Engine
- Data scheduling handled by silicon and software rather than by the programmer

Superior Architecture
- Hardware fast enough to be forgiving to programmers
- Simplifies assigning the best processor for the job

Improved Performance
- Faster access to your accelerator
- 2.5x the bandwidth between CPU and GPU
- 5x the data flow in and out of your GPU
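
The "transparent data movement" point refers to CUDA Unified Memory backed by the Pascal Page Migration Engine: you allocate once, touch the data from either the CPU or the GPU, and pages migrate on demand over the link. A minimal sketch of that programming style (the kernel, sizes, and launch configuration are illustrative assumptions, not taken from the slides):

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: scale a vector in place.
__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 24;
    float *x = nullptr;

    // One allocation visible to both CPU and GPU; the Page Migration
    // Engine moves pages between them as each processor touches the data.
    cudaMallocManaged(&x, n * sizeof(float));

    for (int i = 0; i < n; ++i) x[i] = 1.0f;        // touched on the CPU

    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);    // touched on the GPU
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);                    // read back on the CPU
    cudaFree(x);
    return 0;
}

Note that there is no explicit cudaMemcpy anywhere; that is the "don't waste time writing for data movement" promise of the next slide.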

Collaborative Innovation between IBM and NVIDIA: POWER8 with NVLink

Casting NVLink into Silicon
- IBM: transistors and I/O to bring NVLink onto the CPU
- NVIDIA: deep interface into the GPU (NVLink)
- 2+ years in the making
- 2.5x the CPU:GPU bandwidth, with NVLink embedded in the chip

Built for Developer Goals
- Think less about architecture in code
- Break apart my problem less
- Spend less time optimizing

Write Simpler Code with NVLink
- Don't overthink your hardware
- Don't waste time writing for data movement
- Easily unleash the parallelism of your GPU

Fat and Flat Systems for Data: S822LC for HPC
- Infused with the OpenPOWER ecosystem; designed for programmability
- [System diagram: InfiniBand fabric; two POWER8 CPUs, each with 115 GB/s of DDR4 memory bandwidth; four Tesla P100 GPUs connected to the CPUs and to each other over 80 GB/s NVLink]
- 2.5x the CPU:GPU interface bandwidth
- Tight coupling: strong CPU and strong GPU performance
- Equalized access to memory, for all kinds of programming
- Programming closer to the CPU paradigm
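
On a topology like this, a pair of GPUs joined by NVLink can also address each other's memory directly. A minimal sketch of checking and enabling peer access between devices 0 and 1 with the CUDA runtime (the device numbering is an assumption about how the GPUs enumerate on a given system):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);   // can device 0 map device 1's memory?
    cudaDeviceCanAccessPeer(&can10, 1, 0);   // and the reverse

    if (can01 && can10) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);    // second argument is reserved; must be 0
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
        printf("Peer access enabled between GPU 0 and GPU 1\n");
        // cudaMemcpyPeer and direct loads/stores between the two GPUs can now
        // use the GPU-to-GPU link instead of staging through host memory.
    } else {
        printf("GPUs 0 and 1 cannot access each other's memory directly\n");
    }
    return 0;
}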

Why It Matters: Raw Application Performance
- 2.5x the performance of x86 accelerated solutions

CUDA host-to-device bandwidth, to device 0 (GB/s): 2.91x the bandwidth
- x86: Xeon E5-2640 v4 with Tesla K80, PCIe: 11.72
- IBM Power Systems S822LC for HPC: Tesla P100, NVLink: 34.16

Query throughput (queries/hour): 2.5x more throughput
- x86 / PCIe x16 3.0 system: 2x Xeon E5-2640 v4 (20c) with 4x Tesla K80: 73,320 queries per hour
- POWER8 with NVLink: IBM Power S822LC (20c) with 4x Tesla P100: 188,852 queries per hour

But how much of this speedup was due to NVLink versus a faster GPU?
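
The host-to-device numbers above come from a simple copy microbenchmark. A sketch of how such a measurement is typically taken with CUDA events, assuming pinned host memory and an arbitrary 256 MiB transfer size (both are choices of this sketch, not details given on the slide):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256UL << 20;   // 256 MiB per copy
    const int iters = 20;

    void *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);      // pinned host memory for full-speed DMA
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host-to-device bandwidth: %.2f GB/s\n",
           (double)bytes * iters / (ms / 1000.0) / 1e9);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}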

Why It Matters: Stop Waiting for Data!

Improve code performance for developers
- 65% reduction in data-transfer time for the Kinetica GPU-accelerated database
- Less data-induced latency in all applications
- Unique to POWER8 with NVLink
- Less coding to compensate for slow data movement
- 1.95x of the 2.5x overall performance improvement is attributable to NVLink

Query-time breakdown (ticks)
- Competing system, PCIe x16 3.0: 100-tick query time = 73 ticks data transfer + 27 ticks calculation*
- S822LC for HPC, NVLink: 40-tick query time = 26 ticks data transfer + 14 ticks calculation* (a 65% reduction in data-transfer time)

* Includes non-overlapping CPU, GPU, and idle times. All results are based on running Kinetica "filter by geographic area" queries on a data set of 280 million simulated tweets, with 5 to 80 simultaneous query streams, each with 0 think time. Power System S822LC for HPC: 20 cores (2 x 10c chips) / 160 threads, POWER8 with NVLink, 2.86 GHz, 1024 GB memory, 2x 6Gb SSDs, 2-port 10 GbE, 4x Tesla P100 GPUs, Ubuntu 16.04. Competitive stack: 2x Xeon E5-2640 v4, 20 cores (2 x 10c chips) / 40 threads, 2.4 GHz, 512 GB memory, 2x 6Gb SSDs, 2-port 10 GbE, 4x Tesla K80 GPUs, Ubuntu 16.04.

Why It Matters: New Applications

Attempting GPU acceleration of CPMD on PCIe systems
- Data movement overwhelms execution
- Early efforts: no net speedup, or reduced performance
- Developer: lots of thinking about data movement in the coding

Why It Matters: New Applications

CPMD data transfer per kernel (seconds): 3.5x faster data movement
- POWER8: IBM Power S822LC (20c / 2x Tesla P100): 8.8 s
- x86: 2x Xeon E5-2640 v4 (20c / 2x Tesla K80): 31.3 s

- POWER8 with NVLink: a 3.5x improvement in data-transfer time
- Now a feasible GPU implementation
- Balanced profile avoids complex data management
- Net: ~3x speedup versus CPU-only CPMD

All results are based on running CPMD, a parallelized plane-wave / pseudopotential implementation of Density Functional Theory. A hybrid version of CPMD (MPI + OpenMP + GPU + streams) was implemented, with runs made for a 128-water box with RANDOM initialization. Results are reported in execution time (seconds). IBM Power System S822LC for HPC: 20 cores (2 x 10c chips) / 160 threads, POWER8 with NVLink, 2.86 GHz, 256 GB memory, 2x 1TB SATA 7.2K rpm HDD, 2-port 10 GbE, 2x Tesla P100 GPUs, Ubuntu 16.04. Competitive stack: 2x Xeon E5-2640 v4, 20 cores (2 x 10c chips) / 40 threads, 2.4 GHz, 256 GB memory, 1x 2TB SATA 7.2K rpm HDD, 2-port 10 GbE, 2x Tesla K80 GPUs, Ubuntu 16.04.

Why It Matters: Application Profiles Where NVLink Will Have the Most Impact
- Stream data at the same rate as computation
- Burst data at startup and teardown
- Constant data transfers between adjacent GPUs
- Mask bus transfers between host and device

Why It Matters: Use Cases Where NVLink Will Have the Most Impact
- Stream data at the same rate as computation: genomics, cryptography, video processing, etc.
- Burst data at startup and teardown: CFD/CAE, machine learning, deep learning, etc.
- Constant data transfers between adjacent GPUs: molecular dynamics (e.g. Amber), deep learning, etc.
- Mask bus transfers between host and device: accelerated databases, analytics, etc.
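
The last profile, masking host-device bus transfers, is usually expressed with CUDA streams: while one chunk of data is in flight over the link, the previous chunk is already being processed on the GPU. A minimal sketch of that pattern (the kernel, chunk size, and stream count are illustrative assumptions):

#include <cuda_runtime.h>

__global__ void process(float *d, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;    // stand-in for real work
}

int main() {
    const size_t chunk = 1 << 22;            // elements per chunk
    const int nChunks = 8, nStreams = 4;

    float *h, *d;
    cudaMallocHost(&h, nChunks * chunk * sizeof(float));  // pinned for async copies
    cudaMalloc(&d, nChunks * chunk * sizeof(float));

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int c = 0; c < nChunks; ++c) {
        cudaStream_t s = streams[c % nStreams];
        float *hp = h + c * chunk, *dp = d + c * chunk;
        // The copy of chunk c overlaps with kernels still running on earlier chunks.
        cudaMemcpyAsync(dp, hp, chunk * sizeof(float), cudaMemcpyHostToDevice, s);
        process<<<(chunk + 255) / 256, 256, 0, s>>>(dp, chunk);
        cudaMemcpyAsync(hp, dp, chunk * sizeof(float), cudaMemcpyDeviceToHost, s);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}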

What Kinds of Domains and New Kernels? New Application Potential
- EDA solvers
- Physics
- Molecular dynamics
- Weather
- Analytics
- CFD solvers
- Enterprise databases
- Graph databases

Where to Get Access
1. Remotely: IBM-NVIDIA Acceleration Lab
2. In house: IBM, partner ecosystem

Access to POWER8 with NVLink
- Run on the only platforms with CPU-GPU NVLink
- Immediate performance gains from the wider bus and Tesla P100

Team Up with IBM and NVIDIA on Advanced Acceleration
- Deep technical resources
- A custom plan to help migrate and performance-tune code together

Unlock What Was Previously Impossible
- Bring new applications with unified memory and easier data movement

Learn more at Ibm.biz/accellab (online engagement, partner locator)