Building NVLink for Developers

Camilla Reed
6 years ago

1 Building NVLink for Developers

Unleashing programmatic, architectural and performance capabilities for accelerated computing

2 Why NVLink™? Simpler, Better and Faster

Simplified Programming: no specialized skills required vs. PCIe
- Tap into Unified Memory and Namespace
- Transparent data movement with the Page Migration Engine
- Data scheduling by silicon and software rather than by the programmer

Superior Architecture
- Hardware fast enough to be forgiving to programmers
- Simplifies assigning the best processor for the job

Improved Performance: faster access to your accelerator
- 2.5X the bandwidth between CPU and GPU
- 5X the data flow in and out of your GPU

3 Collaborative Innovation between IBM and NVIDIA: POWER8 with NVLink

Casting NVLink into Silicon
- IBM: transistors and I/O to NVLink on the CPU
- NVIDIA: deep interface into the GPU (NVLink)
- 2+ years in the making
- 2.5X the bandwidth from CPU:GPU, built into the chip (embedded NVLink)

Built for Developer Goals
- Think less about architecture in code
- Break apart my problem less
- Spend less time optimizing

Write simpler code with NVLink
- Don't overthink your hardware
- Don't waste time writing for data movement
- Easily unleash the parallelism of your GPU

4 Fat and Flat Systems for Data - S822LC for HPC

Infused with the OpenPOWER ecosystem, designed for programmability

[Diagram: InfiniBand fabric; two CPUs, each with DDR4 at 115 GB/s; NVLink at 80 GB/s between the CPUs and the Tesla P100 GPUs]

- 2.5X the CPU:GPU interface bandwidth
- Tight coupling: strong CPU, strong GPU performance
- Equalizing access to memory, for all kinds of programming
- Closer programming to the CPU paradigm
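As a rough check on what the 80 GB/s NVLink figure means for moving data, here is a minimal sketch. It assumes the slide's 80 GB/s is a bidirectional figure and compares it with roughly 32 GB/s for a PCIe 3.0 x16 link (about 16 GB/s in each direction). The PCIe number is not on the slide and is an assumption, as are the constant and function names.

```python
# Back-of-envelope comparison of CPU:GPU link bandwidth.
NVLINK_GBPS = 80.0      # CPU:GPU NVLink figure quoted on the slide
PCIE3_X16_GBPS = 32.0   # assumption: ~16 GB/s per direction, both directions


def transfer_seconds(gigabytes, gigabytes_per_second):
    """Ideal time to move a payload, ignoring latency and protocol overhead."""
    return gigabytes / gigabytes_per_second


bandwidth_ratio = NVLINK_GBPS / PCIE3_X16_GBPS

# Ideal time to move a hypothetical 16 GB working set over each link.
nvlink_time = transfer_seconds(16, NVLINK_GBPS)
pcie_time = transfer_seconds(16, PCIE3_X16_GBPS)

print(f"bandwidth ratio: {bandwidth_ratio:.1f}X")
print(f"16 GB over NVLink: {nvlink_time:.2f} s, over PCIe: {pcie_time:.2f} s")
```

Under these assumptions the ratio works out to 2.5, which lines up with the bandwidth claim made earlier in the deck. Real transfers will be slower than these ideal times.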

5 Why it matters? Raw Application Performance

2.5X the performance of x86 accelerated solutions

[Chart: CUDA H2D bandwidth. x86 Xeon E v4 competitor with Tesla K80 (solely to device 0, PCI-E) vs. IBM Power Systems S822LC for HPC with Tesla P100 (NVLink)]

[Chart: Throughput (queries/hour). POWER8 IBM Power S822LC (20c / 4x Tesla P100) vs. x86 2x Xeon E5-2640v4 (20c / 4x Tesla K80): 2.5X more throughput]

- PCIe x16 3.0 / x86 system, Xeon E v4 with 4 Tesla K80s: 73,320 queries per hour
- POWER8 with NVLink system, Power Systems S822LC with 4 Tesla P100s: 188,852 queries per hour

But how much of this speedup was due to NVLink vs. a faster GPU?
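The 2.5X throughput claim can be checked directly from the two queries-per-hour figures quoted on this slide. A minimal sketch, with variable names chosen here for illustration:

```python
# Throughput figures quoted on the slide (queries per hour).
X86_K80_QPH = 73_320        # Xeon E v4 system with 4 Tesla K80s, PCIe
POWER8_P100_QPH = 188_852   # S822LC with 4 Tesla P100s, NVLink

throughput_ratio = POWER8_P100_QPH / X86_K80_QPH

print(f"throughput ratio: {throughput_ratio:.2f}X")
```

The ratio comes out slightly above 2.5, so the slide's "2.5X" is a rounded-down figure.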

6 Why it matters: Stop waiting for Data!

Improve Code Performance for Developers
- 65% reduction in data transfer time for the Kinetica GPU-accelerated DB
- Less data-induced latency in all applications
- Unique to POWER8 with NVLink
- Less coding to compensate for slow data movement!
- 1.95X of the 2.5X overall performance improvement attributable to NVLink

Query time (ticks):

System                          Data Transfer   Calculation*   Total
Competing system, PCI-E x16 3.0            73             27     100
S822LC for HPC, NVLink                     26             14      40

* Includes non-overlapping: CPU, GPU, and idle times.

All results are based on running Kinetica "Filter by geographic area" queries on a data set of 280 million simulated Tweets with 5 up to 80 simultaneous query streams, each with 0 think time.
Power System S822LC for HPC: 20 cores (2 x 10c chips) / 160 threads, POWER8 with NVLink; 2.86 GHz, 1024 GB memory, 2x 6Gb SSDs, 2-port 10 GbEth, 4x Tesla P100 GPU; Ubuntu.
Competitive stack: 2x Xeon E v4; 20 cores (2 x 10c chips) / 40 threads; Intel Xeon E v4; 2.4 GHz; 512 GB memory, 2x 6Gb SSDs, 2-port 10 GbEth, 4x Tesla K80 GPU; Ubuntu.
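The tick counts on this slide are enough to re-derive the headline numbers. The sketch below does so. The slide does not say how the 1.95X attribution was computed, so the last step is only one plausible reading: it scales the overall speedup by the share of saved ticks that came from data transfer. Treat that step, and all names, as assumptions.

```python
# Tick counts quoted on the slide.
PCIE_TRANSFER, PCIE_CALC = 73, 27       # competing system, 100 ticks total
NVLINK_TRANSFER, NVLINK_CALC = 26, 14   # S822LC for HPC, 40 ticks total

pcie_total = PCIE_TRANSFER + PCIE_CALC
nvlink_total = NVLINK_TRANSFER + NVLINK_CALC

# Fractional reduction in data transfer time.
transfer_reduction = (PCIE_TRANSFER - NVLINK_TRANSFER) / PCIE_TRANSFER

# Overall query-time speedup.
overall_speedup = pcie_total / nvlink_total

# Assumed attribution: share of the saved ticks that came from data transfer,
# applied to the overall speedup.
ticks_saved = pcie_total - nvlink_total
transfer_share = (PCIE_TRANSFER - NVLINK_TRANSFER) / ticks_saved
attributed_speedup = transfer_share * overall_speedup

print(f"transfer reduction: {transfer_reduction:.1%}")
print(f"overall speedup:    {overall_speedup:.2f}X")
print(f"attributed to data transfer (assumed method): {attributed_speedup:.2f}X")
```

With these inputs the transfer reduction is a little over 64%, which the slide rounds to 65%, and the assumed attribution lands close to the quoted 1.95X.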

7 Why it matters: New Applications

Attempting GPU acceleration: CPMD on PCI-E systems
- Data movement overwhelms execution
- Early efforts: no net speedup or reduced performance
- Developer: lots of thinking about it in the coding

8 Why it matters: New Applications

[Chart: CPMD data transfer time per kernel (sec). POWER8 IBM Power S822LC (20c / 2x Tesla P100): 8.8; x86 2x Xeon E5-2640v4 (20c / 2x Tesla K80): 31.3; 3.5X faster data movement]

POWER8 with NVLink: a 3.5X improvement in data-transfer time
- Now a feasible GPU implementation
- Balanced profile: avoids complex data management
- Net: ~3X speedup factor vs. CPU-only CPMD

All results are based on running CPMD, a parallelized plane wave / pseudopotential implementation of Density Functional Theory. A hybrid version of CPMD (e.g. MPI + OpenMP + GPU + streams) was implemented, with runs made for a 128-Water Box, RANDOM initialization. Results are reported in execution time (seconds).
IBM Power System S822LC for HPC: 20 cores (2 x 10c chips) / 160 threads, POWER8 with NVLink; 2.86 GHz, 256 GB memory, 2 x 1TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 2x Tesla P100 GPU; Ubuntu.
Competitive stack: 2x Xeon E v4; 20 cores (2 x 10c chips) / 40 threads; Intel Xeon E v4; 2.4 GHz; 256 GB memory, 1 x 2TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 2x Tesla K80 GPUs; Ubuntu.
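The data-movement improvement for CPMD follows from the two transfer times charted on this slide. A small sketch, with names picked for illustration:

```python
# CPMD data transfer time per kernel, in seconds, as charted on the slide.
X86_K80_TRANSFER_S = 31.3
POWER8_P100_TRANSFER_S = 8.8

transfer_speedup = X86_K80_TRANSFER_S / POWER8_P100_TRANSFER_S
seconds_saved = X86_K80_TRANSFER_S - POWER8_P100_TRANSFER_S

print(f"data transfer speedup: {transfer_speedup:.2f}X")
print(f"seconds saved per kernel: {seconds_saved:.1f}")
```

The ratio is a little above 3.5, consistent with the rounded figure on the slide.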

9 Why it Matters: Application Profiles Where NVLink will have the most Impact
- Stream data at the same rate as computation
- Burst data at startup and teardown
- Constant data transfers between adjacent GPUs
- Mask host-device bus transfers

10 Why it Matters: Use Cases where NVLink will have the most Impact
- Stream data at the same rate as computation: genomics, cryptography, video processing, etc.
- Burst data at startup and teardown: CFD/CAE, machine learning, deep learning, etc.
- Constant data transfers between adjacent GPUs: molecular dynamics (e.g. Amber), deep learning, etc.
- Mask host-device bus transfers: accelerated databases, analytics, etc.
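To illustrate why streaming data at the same rate as computation matters, here is a simple timing model. It is a sketch under stated assumptions, not a measurement: the workload is split into equal chunks, each chunk has a fixed transfer time and a fixed compute time, and in the overlapped case the next chunk is copied while the current one is processed. The function names and the example numbers are hypothetical.

```python
def serial_time(n_chunks, transfer, compute):
    """Each chunk is copied, then processed, with no overlap."""
    return n_chunks * (transfer + compute)


def overlapped_time(n_chunks, transfer, compute):
    """Two-stage pipeline: chunk i+1 is copied while chunk i is processed."""
    if n_chunks == 0:
        return 0
    # First copy cannot be hidden, last compute cannot be hidden; in between,
    # the slower of the two stages sets the pace.
    return transfer + (n_chunks - 1) * max(transfer, compute) + compute


# Hypothetical workload: 10 chunks, 2 time units of compute per chunk.
balanced = overlapped_time(10, transfer=2, compute=2)   # bus keeps up
bus_bound = overlapped_time(10, transfer=5, compute=2)  # bus is the bottleneck

print("serial, balanced:     ", serial_time(10, 2, 2))
print("overlapped, balanced: ", balanced)
print("serial, bus-bound:    ", serial_time(10, 5, 2))
print("overlapped, bus-bound:", bus_bound)
```

In this model, overlap hides the transfer almost entirely only when the link delivers a chunk at least as fast as the GPU consumes one. Once the transfer is the slower stage, total time tracks the bus, which is the case a faster link addresses.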

11 What Kinds of Domains and New Kernels

EDA Solvers; Physics; Molecular Dynamics; Weather; Analytics; CFD Solvers; Enterprise Databases

New Application Potential: Graph Databases

12 Where to Get Access

1. Remotely: IBM-NVIDIA Acceleration Lab
2. In House: IBM, Partner Ecosystem

Access to POWER8 with NVLink
- Run on the only platforms with CPU-GPU NVLink
- Immediate performance gains from the wider bus and Tesla P100

Team up with IBM, NVIDIA on Advanced Acceleration
- Deep technical resources
- Custom plan to help migrate and performance-tune code together

Unlock What was Previously Impossible
- Bring new applications with unified memory and easier data movement

Learn more at Ibm.biz/accellab
Online Engagement
Partner Locator
