Building the Most Efficient Machine Learning System


Ambrose Marsh
5 years ago

1 Building the Most Efficient Machine Learning System
Mellanox: The Artificial Intelligence Interconnect Company
June 2017

2 Mellanox Overview
- Company headquarters: Yokneam, Israel and Sunnyvale, California
- Worldwide offices
- ~2,900 employees worldwide
- Ticker: MLNX

3 Exponential Data Growth Everywhere
- Higher Data Speeds, Faster Data Processing, Better Data Security
- SmartNIC, System on a Chip, Adapters, Switches, Cables & Transceivers

4 Enabling the Future of Machine Learning Applications
- IoT, Storage, Self-Driving Vehicles, Database, Embedded Appliances, High Performance Computing, Machine Learning, Healthcare, Financial, Hyperscale, Retail, Manufacturing
- HPC and Machine Learning Share the Same Interconnect Needs

5 Highest Performance 100 and 200Gb/s Interconnect Solutions
- Adapters: 200Gb/s, 0.6us latency, 200 million messages per second (10 / 25 / 40 / 50 / 56 / 100 / 200Gb/s)
- Switch: 40 HDR (200Gb/s) ports, 80 HDR100 (100Gb/s) ports, 16Tb/s throughput, 15.6 billion msg/sec
- Switch: GbE ports, 64 25/50GbE ports (10 / 25 / 40 / 50 / 100GbE), throughput of 6.4Tb/s
- Interconnect: transceivers, active optical and copper cables (10 / 25 / 40 / 50 / 56 / 100 / 200Gb/s)
- Today's Datacenters Need the Most Intelligent Interconnect
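The HDR switch figures on this slide can be sanity-checked with simple arithmetic. The sketch below assumes the quoted 16Tb/s counts both directions of every port, which the slide does not state; the function name is illustrative only.

```python
def aggregate_throughput_tbps(ports: int, gbps_per_port: float,
                              bidirectional: bool = True) -> float:
    """Aggregate switch throughput in Tb/s for a given port configuration."""
    # Assumption: "throughput" sums traffic in both directions of each port.
    directions = 2 if bidirectional else 1
    return ports * gbps_per_port * directions / 1000.0


# Port counts and speeds as quoted on the slide.
hdr_tbps = aggregate_throughput_tbps(ports=40, gbps_per_port=200)
hdr100_tbps = aggregate_throughput_tbps(ports=80, gbps_per_port=100)

print(f"40 x HDR (200Gb/s):    {hdr_tbps} Tb/s")
print(f"80 x HDR100 (100Gb/s): {hdr100_tbps} Tb/s")
```

Under that assumption both port configurations come out at the same aggregate figure, consistent with the 16Tb/s on the slide.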


6 Mellanox Delivers Best Return on Investment
- 60% higher return on investment
- Up to 50% savings on capital and operation expenses
- World's highest performance, scalability and productivity for deep learning
- Cognitive Toolkit, Chainer
- Mellanox Unlocks the Power of AI


7 Mellanox is Leading Artificial Intelligence (AI)
- Advancing technology to affect science, business, and society by enabling critical and timely decision making
- Health care, business integrity, business intelligence, knowledge discovery, security, customer support and more
- More data, better models, faster interconnect
- GPUs, CPUs, FPGAs, storage
- More data, faster interconnect, better insight, competitive advantage

8 Enabling Most Efficient Machine Learning Platforms (Examples)
- Highest performance, scalability and productivity for deep learning

9 Mellanox Accelerates Machine Learning and Big Data
- World's first PCIe Gen 4 public cloud server for cognitive computing
- Sets TeraSort 2016 benchmark record: 5x faster and 3x more energy efficient than the 2015 record (http://sortbenchmark.org/tencentsort2016.)
- Smart network for Azure cloud server designed for big data analytics & AI
- Enabling analytics in cloud

10 Mellanox Accelerates Machine Learning and Big Data
- Big Sur & Big Basin, Facebook open source AI hardware platform: only ONE network of choice, Mellanox
- Caffe2
- Powering self-driving car
- 2X faster training with PaddlePaddle
- "We rely on fast interconnect technologies and RDMA." Andrew Ng, Chief Scientist, Baidu
- Caffe
- Real-time fraud detection: 14 million transactions per day, 4 billion database inserts
- Image recognition: ~90% prediction accuracy
- RDMA in TensorFlow and Caffe

11 AI is Changing the Way We Interact with Computers
- Automotive and Transportation: autonomous driving, pedestrian detection, accident avoidance
- Security and Public Safety: surveillance, image analysis, facial recognition and detection
- Consumer Web, Mobile, Retail: image tagging, speech recognition, natural language processing, recommendation and sentiment analysis
- Medicine and Biology: drug discovery, diagnostic assistance, cancer cell detection
- Broadcast, Media and Entertainment: captioning, search, recommendations, real time translation
- Finance, Fraud and Insurance: real time trade, credit / risk analysis, fraud detection and prevention
- Efficient Deep Learning Depends on Mellanox

12 Deep Learning Demands Highest Performance
TRAINING
- Scalability requires ultra-fast networking
- Same hardware needs as HPC
- Training dataset: images, video, text, speech, tabular, time series
- Billions of TFLOPS
- Faster access to storage
- RDMA, SHARP, PeerDirect, GPUDirect, ROCm, others
INFERENCING
- Highly transactional / supports many users
- Mellanox ultra-low latency
- Instant network response
- New data
- Billions of FLOPS
- RDMA, PeerDirect, GPUDirect, ROCm, others


13 Exponential Data Growth: The Need for Intelligent and Faster Interconnect
- CPU-Centric (Onload): must wait for the data, creates performance bottlenecks
- Data-Centric (Offload): analyze data as it moves!
- Faster Data Speeds and In-Network Computing Enable Higher Performance and Scale


14 Data Centric Architecture to Overcome Latency Bottlenecks
- CPU-Centric (Onload): HPC / machine learning communications latencies of 30-40us
- Data-Centric (Offload), In-Network Computing: HPC / machine learning communications latencies of 3-4us
- Intelligent Interconnect Paves the Road to Exascale Performance

15 Mellanox Technology Accelerations for Machine Learning
- RDMA, GPUDirect, SHARP, NVMe over Fabrics, Security
- [Diagram labels: CPU, GPU]
- In-Network Computing Key for Highest Return on Investment

16 In-Network Computing Enables Deep Learning Frameworks
- Middleware (MPI, gRPC), optional
- CUDA, SHARP, rCUDA, GPUDirect RDMA, NVMe over Fabrics
- Mellanox Interconnect Solutions
- Mellanox Accelerations for Machine Learning and Big Data


17 Mellanox SHARP for Gradient Computation
- The CPU in a parameter server quickly becomes the bottleneck (at roughly 4 nodes)
- TCP adds a lot of overhead and the traffic pattern is bursty
- SHARP performs the gradient averaging
- Removes the need for a physical parameter server
- Removes all parameter server overhead
- SHARP Provides Better Scalability and Reduced Network Traffic
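To make the contrast on this slide concrete, here is a minimal sketch in plain Python. It is a toy model, not the SHARP API: `average_gradients` stands in for what a central parameter server computes, and `tree_average` sums the same gradients level by level, the way an aggregation tree inside the network could. All function names and the `fanout` parameter are hypothetical.

```python
from typing import List


def average_gradients(worker_grads: List[List[float]]) -> List[float]:
    """Central averaging: one node receives every worker's full gradient."""
    n = len(worker_grads)
    return [sum(vals) / n for vals in zip(*worker_grads)]


def parameter_server_inbound_bytes(num_workers: int, grad_bytes: int) -> int:
    """Bytes arriving at a single parameter server in one training step."""
    return num_workers * grad_bytes


def tree_average(worker_grads: List[List[float]], fanout: int = 2) -> List[float]:
    """Sum gradients level by level (toy aggregation tree), then divide once."""
    level = [list(g) for g in worker_grads]
    while len(level) > 1:
        # Each aggregation point combines up to `fanout` inputs into one vector,
        # so no single link ever carries more than one gradient-sized message.
        level = [
            [sum(vals) for vals in zip(*level[i:i + fanout])]
            for i in range(0, len(level), fanout)
        ]
    n = len(worker_grads)
    return [v / n for v in level[0]]


grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(average_gradients(grads))
print(tree_average(grads))
print(parameter_server_inbound_bytes(num_workers=4, grad_bytes=100_000_000))
```

Both functions return the same average. The difference is where the work lands: the central version concentrates inbound traffic that grows with the number of workers on one node, while the tree version spreads the reduction across aggregation points.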

18 PeerDirect, GPUDirect RDMA and ASYNC
- Purpose-built for acceleration of deep learning


19 What is GPUDirect
- Provides significant decrease in communication latency for acceleration devices
- Natively supported by Mellanox OFED
- Supports peer-to-peer communications between Mellanox adapters and third-party devices
- No unnecessary system memory copies & CPU overhead
- Enables GPUDirect RDMA, GPUDirect ASYNC, ROCm and others
- InfiniBand and Ethernet
- [Diagram labels: CPU, chipset, vendor device]
- Designed for Deep Learning Acceleration

20 GPUDirect RDMA and GPUDirect ASYNC
- Direct connectivity: GPU - Interconnect

21 GPUDirect RDMA Performance
- GPU-GPU internode latency (lower is better): 2.18 usec, 9.3X better latency
- GPU-GPU internode bandwidth (higher is better): 10X better throughput
- Source: Prof. DK Panda

22 NVIDIA NCCL 2.0
- Near-linear scalability
- Optimized collective communication library
- Allreduce, Reduce, Broadcast, Reduce-scatter, Allgather
- Inter-node communication using InfiniBand verbs and GPUDirect RDMA
- Multi-rail support, topology detection
- 50% performance improvement with NVIDIA DGX-1 across 32 NVIDIA Tesla V100 GPUs
- NVIDIA Accelerates Scalable Deep Learning with Mellanox
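Three of the collectives named on this slide are closely related: an allreduce can be built from a reduce-scatter followed by an allgather. The sketch below simulates that with a ring schedule in plain Python. It is not NCCL code and does not use the NCCL API; the slides do not say which algorithm NCCL selects, and the function name `ring_allreduce` is illustrative.

```python
from typing import List


def ring_allreduce(buffers: List[List[float]]) -> List[List[float]]:
    """Simulate a sum-allreduce over len(buffers) ranks arranged in a ring."""
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "buffer length must be divisible by the rank count"
    chunk = size // n
    data = [list(b) for b in buffers]  # work on copies

    def span(c: int) -> slice:
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: in step s, rank r sends chunk (r - s) % n to its
    # right neighbour, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [(r, (r - s) % n, data[r][span((r - s) % n)]) for r in range(n)]
        for r, c, payload in sends:
            dst = (r + 1) % n
            data[dst][span(c)] = [a + b for a, b in zip(data[dst][span(c)], payload)]

    # Now rank r holds the fully reduced chunk (r + 1) % n.
    # Phase 2, allgather: circulate the reduced chunks around the ring.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, data[r][span((r + 1 - s) % n)]) for r in range(n)]
        for r, c, payload in sends:
            data[(r + 1) % n][span(c)] = payload

    return data


ranks = [
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400],
    [1000, 2000, 3000, 4000],
]
result = ring_allreduce(ranks)
print(result[0])
```

In this schedule each rank sends and receives only one chunk per step, so per-rank traffic stays roughly constant as ranks are added. That property is why ring-style collectives are sensitive to link bandwidth and latency between nodes.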

23 Performance and Scalability Examples

24 TensorFlow with Mellanox RDMA
- Reference Deployment Guide
- Unmatched linear scalability, no additional cost
- Up to 76% efficiency and 50% better performance versus TCP
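The slide does not define "efficiency". A common reading is measured multi-node throughput divided by ideal linear throughput, and the sketch below assumes that definition. The throughput numbers are hypothetical, chosen only so that the result lands on 76%; they are not measurements from the deck.

```python
def scaling_efficiency(single_node_rate: float, multi_node_rate: float,
                       num_nodes: int) -> float:
    """Measured throughput as a fraction of ideal linear scaling."""
    ideal_rate = single_node_rate * num_nodes
    return multi_node_rate / ideal_rate


# Hypothetical numbers for illustration only (images/sec).
one_node = 100.0
eight_nodes = 608.0

eff = scaling_efficiency(one_node, eight_nodes, num_nodes=8)
print(f"scaling efficiency: {eff:.0%}")
```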


25 Accelerating TensorFlow with gRPC over RDMA
- Open source machine learning from Google
- Distributed training with the gRPC framework, Google's optimized RPC for distributed networks
- RDMA acceleration over UCX, Unified Communication X
- Integration with upstream TensorFlow
- Chart (lower is better): >2x faster, 2X higher performance with RDMA
- ~2X Acceleration for TensorFlow with RDMA


26 TensorFlow over RDMA in Apache Spark Environment
- Yahoo enhanced the TensorFlow C++ layer to enable RDMA over InfiniBand
- InfiniBand provides faster connectivity and supports accelerated offload capability
- InfiniBand Provides Near Linear Scalability for Inception Model Training
Source:


27 2X Acceleration for Baidu
- Machine learning software from Baidu
- Usage: word prediction, translation, image processing
- RDMA (GPUDirect) speeds training: lowers latency, increases throughput, more cores for training
- Even better results with optimized RDMA
- ~2X Acceleration for Paddle Training with RDMA


28 ChainerMN Depends on InfiniBand
- ChainerMN depends on MPI for inter-node communication
- The NVIDIA NCCL library is then used for intra-node communication between GPUs
- Leveraging InfiniBand results in near linear performance
- Mellanox InfiniBand allows ChainerMN to achieve ~72% accuracy.
Source:

29 Machine Learning Performance Comparison
- DeepBench measures the performance of basic operations involved in training deep neural networks.
- Chart (lower is better): 8, 16 and 32 accelerators; 60.3%
- InfiniBand Delivers 60% Better Performance with 2X Less Infrastructure

30 A Few Solution Examples
- Scalable Deep Learning Depends on Mellanox

31 NVIDIA DGX-1
- World's first purpose-built system for deep learning
- SaturnV is #28 on the Top500, 3.3Pf with 124 nodes
- SaturnV is also #1 on the Green500
- Fully integrated hardware: 8x Tesla P100 (Pascal) with 16GB per GPU, 28672 CUDA cores, 4x ConnectX-4 EDR 100Gb/s HCAs
- Fully integrated software stack: major deep learning frameworks; drivers, NVIDIA CUDA, NVIDIA Deep Learning SDK; GPUDirect RDMA
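The hardware totals on this slide follow from per-device figures. A short sketch using only the numbers quoted above; the variable names are illustrative, and the network total is per direction, assuming all four HCAs run at the full EDR rate.

```python
# Figures quoted on the slide.
GPUS = 8
TOTAL_CUDA_CORES = 28672
GPU_MEMORY_GB = 16
HCAS = 4
HCA_GBPS = 100

cores_per_gpu = TOTAL_CUDA_CORES // GPUS      # CUDA cores on each GPU
total_gpu_memory_gb = GPUS * GPU_MEMORY_GB    # GPU memory across the system
total_network_gbps = HCAS * HCA_GBPS          # assumed: every HCA at line rate
gpus_per_hca = GPUS // HCAS                   # GPUs sharing each adapter

print(f"CUDA cores per GPU: {cores_per_gpu}")
print(f"Total GPU memory:   {total_gpu_memory_gb} GB")
print(f"Network bandwidth:  {total_network_gbps} Gb/s per direction")
print(f"GPUs per HCA:       {gpus_per_hca}")
```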

32 NVIDIA DGX-1 Deep Learning Server
- Deep Learning Supercomputer in a Box
- NVIDIA SaturnV, NVIDIA machine learning supercomputer: #28 on the Top500, 3.3Pf with 124 DGX-1 nodes, #1 on the Green500
- 8 x NVIDIA Tesla P100 GPUs
- 4 x ConnectX-4 EDR 100G InfiniBand adapters
- 5.3TFlops, 16nm FinFET, NVLINK

33 End-to-End Interconnect Solutions for All Platforms
- Highest performance and scalability for x86, Power, GPU, ARM and FPGA-based compute and storage platforms
- x86, OpenPOWER, GPU, ARM, FPGA
- Smart Interconnect to Unleash the Power of All Compute Architectures


34 Proven Advantages
- RDMA delivers 2X performance advantage over traditional TCP
- Machine learning and HPC platforms share the same interconnect needs
- Scalable, flexible, high performance, high bandwidth, end-to-end connectivity
- Standards-based and supported by the largest ecosystem
- Supports all compute architectures: x86, Power, ARM, GPU, FPGA etc.
- Native offloading architecture
- RDMA, GPUDirect, SHARP and other core accelerations
- Backward and future compatible
- Scalable Machine Learning Depends on Mellanox

35 Thank You
