Building NVLink for Developers

Building NVLink for Developers: Unleashing programmatic, architectural, and performance capabilities for accelerated computing

Why NVLink™? Simpler, Better, and Faster

Simplified Programming
- No specialized skills required vs. PCIe
- Tap into unified memory and a unified namespace
- Transparent data movement with the Page Migration Engine
- Data scheduling handled by silicon and software rather than by the programmer

Superior Architecture
- Hardware fast enough to be forgiving to programmers
- Simplifies assigning the best processor for the job

Improved Performance
- Faster access to your accelerator
- 2.5x the bandwidth between CPU and GPU
- 5x the data flow in and out of your GPU
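
The "transparent data movement" point refers to CUDA Unified Memory backed by the Pascal Page Migration Engine: you allocate once, touch the data from either the CPU or the GPU, and pages migrate on demand over the link. A minimal sketch of that programming style (the kernel, sizes, and launch configuration are illustrative assumptions, not taken from the slides):

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: scale a vector in place.
__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 24;
    float *x = nullptr;

    // One allocation visible to both CPU and GPU; the Page Migration
    // Engine moves pages between them as each processor touches the data.
    cudaMallocManaged(&x, n * sizeof(float));

    for (int i = 0; i < n; ++i) x[i] = 1.0f;        // touched on the CPU

    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);    // touched on the GPU
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);                    // read back on the CPU
    cudaFree(x);
    return 0;
}

Note that there is no explicit cudaMemcpy anywhere; that is the "don't waste time writing for data movement" promise of the next slide.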

Collaborative Innovation between IBM and NVIDIA: POWER8 with NVLink

Casting NVLink into Silicon
- IBM: transistors and I/O to bring NVLink onto the CPU
- NVIDIA: deep interface into the GPU (NVLink)
- 2+ years in the making
- 2.5x the CPU:GPU bandwidth, with NVLink embedded in the chip

Built for Developer Goals
- Think less about architecture in code
- Break apart my problem less
- Spend less time optimizing

Write Simpler Code with NVLink
- Don't overthink your hardware
- Don't waste time writing for data movement
- Easily unleash the parallelism of your GPU

Fat and Flat Systems for Data: S822LC for HPC
- Infused with the OpenPOWER ecosystem; designed for programmability
- [System diagram: InfiniBand fabric; two POWER8 CPUs, each with 115 GB/s of DDR4 memory bandwidth; four Tesla P100 GPUs connected to the CPUs and to each other over 80 GB/s NVLink]
- 2.5x the CPU:GPU interface bandwidth
- Tight coupling: strong CPU and strong GPU performance
- Equalized access to memory, for all kinds of programming
- Programming closer to the CPU paradigm
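
On a topology like this, a pair of GPUs joined by NVLink can also address each other's memory directly. A minimal sketch of checking and enabling peer access between devices 0 and 1 with the CUDA runtime (the device numbering is an assumption about how the GPUs enumerate on a given system):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);   // can device 0 map device 1's memory?
    cudaDeviceCanAccessPeer(&can10, 1, 0);   // and the reverse

    if (can01 && can10) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);    // second argument is reserved; must be 0
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
        printf("Peer access enabled between GPU 0 and GPU 1\n");
        // cudaMemcpyPeer and direct loads/stores between the two GPUs can now
        // use the GPU-to-GPU link instead of staging through host memory.
    } else {
        printf("GPUs 0 and 1 cannot access each other's memory directly\n");
    }
    return 0;
}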

Why It Matters: Raw Application Performance
- 2.5x the performance of x86 accelerated solutions

CUDA host-to-device bandwidth, to device 0 (GB/s): 2.91x the bandwidth
- x86: Xeon E5-2640 v4 with Tesla K80, PCIe: 11.72
- IBM Power Systems S822LC for HPC: Tesla P100, NVLink: 34.16

Query throughput (queries/hour): 2.5x more throughput
- x86 / PCIe x16 3.0 system: 2x Xeon E5-2640 v4 (20c) with 4x Tesla K80: 73,320 queries per hour
- POWER8 with NVLink: IBM Power S822LC (20c) with 4x Tesla P100: 188,852 queries per hour

But how much of this speedup was due to NVLink versus a faster GPU?
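
The host-to-device numbers above come from a simple copy microbenchmark. A sketch of how such a measurement is typically taken with CUDA events, assuming pinned host memory and an arbitrary 256 MiB transfer size (both are choices of this sketch, not details given on the slide):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256UL << 20;   // 256 MiB per copy
    const int iters = 20;

    void *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);      // pinned host memory for full-speed DMA
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host-to-device bandwidth: %.2f GB/s\n",
           (double)bytes * iters / (ms / 1000.0) / 1e9);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}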

Why It Matters: Stop Waiting for Data!

Improve code performance for developers
- 65% reduction in data-transfer time for the Kinetica GPU-accelerated database
- Less data-induced latency in all applications
- Unique to POWER8 with NVLink
- Less coding to compensate for slow data movement
- 1.95x of the 2.5x overall performance improvement is attributable to NVLink

Query-time breakdown (ticks)
- Competing system, PCIe x16 3.0: 100-tick query time = 73 ticks data transfer + 27 ticks calculation*
- S822LC for HPC, NVLink: 40-tick query time = 26 ticks data transfer + 14 ticks calculation* (a 65% reduction in data-transfer time)

* Includes non-overlapping CPU, GPU, and idle times. All results are based on running Kinetica "filter by geographic area" queries on a data set of 280 million simulated tweets, with 5 to 80 simultaneous query streams, each with 0 think time. Power System S822LC for HPC: 20 cores (2 x 10c chips) / 160 threads, POWER8 with NVLink, 2.86 GHz, 1024 GB memory, 2x 6Gb SSDs, 2-port 10 GbE, 4x Tesla P100 GPUs, Ubuntu 16.04. Competitive stack: 2x Xeon E5-2640 v4, 20 cores (2 x 10c chips) / 40 threads, 2.4 GHz, 512 GB memory, 2x 6Gb SSDs, 2-port 10 GbE, 4x Tesla K80 GPUs, Ubuntu 16.04.

Why It Matters: New Applications

Attempting GPU acceleration of CPMD on PCIe systems
- Data movement overwhelms execution
- Early efforts: no net speedup, or reduced performance
- Developer: lots of thinking about data movement in the coding

Why It Matters: New Applications

CPMD data transfer per kernel (seconds): 3.5x faster data movement
- POWER8: IBM Power S822LC (20c / 2x Tesla P100): 8.8 s
- x86: 2x Xeon E5-2640 v4 (20c / 2x Tesla K80): 31.3 s

- POWER8 with NVLink: a 3.5x improvement in data-transfer time
- Now a feasible GPU implementation
- Balanced profile avoids complex data management
- Net: ~3x speedup versus CPU-only CPMD

All results are based on running CPMD, a parallelized plane-wave / pseudopotential implementation of Density Functional Theory. A hybrid version of CPMD (MPI + OpenMP + GPU + streams) was implemented, with runs made for a 128-water box with RANDOM initialization. Results are reported in execution time (seconds). IBM Power System S822LC for HPC: 20 cores (2 x 10c chips) / 160 threads, POWER8 with NVLink, 2.86 GHz, 256 GB memory, 2x 1TB SATA 7.2K rpm HDD, 2-port 10 GbE, 2x Tesla P100 GPUs, Ubuntu 16.04. Competitive stack: 2x Xeon E5-2640 v4, 20 cores (2 x 10c chips) / 40 threads, 2.4 GHz, 256 GB memory, 1x 2TB SATA 7.2K rpm HDD, 2-port 10 GbE, 2x Tesla K80 GPUs, Ubuntu 16.04.

Why It Matters: Application Profiles Where NVLink Will Have the Most Impact
- Stream data at the same rate as computation
- Burst data at startup and teardown
- Constant data transfers between adjacent GPUs
- Mask bus transfers between host and device

Why It Matters: Use Cases Where NVLink Will Have the Most Impact
- Stream data at the same rate as computation: genomics, cryptography, video processing, etc.
- Burst data at startup and teardown: CFD/CAE, machine learning, deep learning, etc.
- Constant data transfers between adjacent GPUs: molecular dynamics (e.g. Amber), deep learning, etc.
- Mask bus transfers between host and device: accelerated databases, analytics, etc.
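
The last profile, masking host-device bus transfers, is usually expressed with CUDA streams: while one chunk of data is in flight over the link, the previous chunk is already being processed on the GPU. A minimal sketch of that pattern (the kernel, chunk size, and stream count are illustrative assumptions):

#include <cuda_runtime.h>

__global__ void process(float *d, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;    // stand-in for real work
}

int main() {
    const size_t chunk = 1 << 22;            // elements per chunk
    const int nChunks = 8, nStreams = 4;

    float *h, *d;
    cudaMallocHost(&h, nChunks * chunk * sizeof(float));  // pinned for async copies
    cudaMalloc(&d, nChunks * chunk * sizeof(float));

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int c = 0; c < nChunks; ++c) {
        cudaStream_t s = streams[c % nStreams];
        float *hp = h + c * chunk, *dp = d + c * chunk;
        // The copy of chunk c overlaps with kernels still running on earlier chunks.
        cudaMemcpyAsync(dp, hp, chunk * sizeof(float), cudaMemcpyHostToDevice, s);
        process<<<(chunk + 255) / 256, 256, 0, s>>>(dp, chunk);
        cudaMemcpyAsync(hp, dp, chunk * sizeof(float), cudaMemcpyDeviceToHost, s);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}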

What Kinds of Domains and New Kernels? New Application Potential
- EDA solvers
- Physics
- Molecular dynamics
- Weather
- Analytics
- CFD solvers
- Enterprise databases
- Graph databases

Where to Get Access
1. Remotely: IBM-NVIDIA Acceleration Lab
2. In house: IBM, partner ecosystem

Access to POWER8 with NVLink
- Run on the only platforms with CPU-GPU NVLink
- Immediate performance gains from the wider bus and Tesla P100

Team Up with IBM and NVIDIA on Advanced Acceleration
- Deep technical resources
- A custom plan to help migrate and performance-tune code together

Unlock What Was Previously Impossible
- Bring new applications with unified memory and easier data movement

Learn more at Ibm.biz/accellab (online engagement, partner locator)