Parallel Stochastic Gradient Descent: The case for native GPU-side GPI

J. Keuper, Competence Center High Performance Computing, Fraunhofer ITWM, Kaiserslautern, Germany
Mark Silberstein, Accelerated Computer Systems Lab, Technion, Israel Institute of Technology

Accelerated Systems Lab: operating system support for accelerators
- GPU file system layer
- GPU networking API
- GPU virtual memory and huge data sets
- OS support for optimized GPU-SSD transfers
- GPU RDMA
- FPGA-CPU SoCs, near-data I/O accelerators
- SGX, accelerator security
https://sites.google.com/site/silbersteinmark/

Outline
- Stochastic Gradient Descent in a nutshell
- Data-parallel distributed algorithm
- GPI implementation
- Communication bottlenecks and sparsity
- GPU-native GPI-2 communications

SGD in a nutshell
- Common technique for training ML models
- Optimizes a loss function by iteratively modifying the model's parameters
[Figure: computations of one SGD iteration]
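
In symbols (notation mine, not from the slide): one SGD step with learning rate \eta on a mini-batch B_t moves the parameters w against the averaged gradient of the loss L,

    w_{t+1} = w_t - \eta \, \frac{1}{|B_t|} \sum_{x \in B_t} \nabla_w L(w_t; x).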

Parallelization
- Across iterations, SGD is intrinsically sequential
- Within an iteration, the computations are data parallel
[Figure: sequential chain of iterations vs. data-parallel computations within an iteration]

Parallelization via Master/Worker
[Figure: a parameter server sends the current parameters w to two workers; each worker computes an update on its sub-batch, (w, s_0) and (w, s_1), and sends it back to the server]
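
The aggregation in the figure can be written as follows (my notation, assuming K workers and a synchronous step): each worker k computes the gradient of the loss on its sub-batch s_k against the same parameters w it received from the server, and the server averages and applies these gradients,

    w \leftarrow w - \frac{\eta}{K} \sum_{k=0}^{K-1} \nabla_w L(w; s_k).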

SGD with Deep Neural Networks
[Figure: an n-layer network (layer 1 with weights w_1 through layer n with weights w_n); the forward pass runs up through the layers to the network output, the backward pass runs back down]

GPUs are used for the computations
[Figure: the same forward/backward pass diagram, with the per-layer computations running on the GPU]

Data-parallel SGD with DNNs
[Figure: three model replicas, each with layers 1..n and weights w_1..w_n, train in parallel and exchange parameters with a central parameter server]

Problem: scalability limit
[Plot: time until convergence vs. number of machines]

Communication bottleneck
Communication grows linearly with the number of nodes!
[Plot: communication time per node vs. number of nodes]
Keuper, Pfreundt, "Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability", to appear at MLHPC'16
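
A rough, illustrative model of why the curve bends (my assumptions, not a result from the cited paper): with N nodes, a model of |w| bytes, a per-iteration compute time T_comp that shrinks as the batch is split, and an aggregation point of effective bandwidth B that must receive and redistribute the full model every iteration,

    T_{iter} \approx \frac{T_{comp}}{N} + \frac{2 N |w|}{B},

so beyond some N the linearly growing communication term dominates and adding machines no longer reduces the time until convergence.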

Steps toward improved scalability
- Asynchronous, zero-copy I/O
- Direct transfer from/to GPU memory (GPI-2)
- Sparsity-aware compressed communications
- GPU-side networking (ongoing research)

Background: GPI

Special requirement 1: one-sided communications
[Figure: while the current computations run on layer 2 (w_2), the update (w_3, s_0) from the previous computations on layer 3 (w_3) is already on its way to the parameter server]

Special requirement 1: one-sided communications (continued)
[Figure: the same layer-wise overlap of updates with ongoing computations]
The GPI-2 PGAS model makes this easy to implement.
Cons: high memory requirements due to zero-copy.
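
As a concrete, simplified illustration of the one-sided pattern with GPI-2 (an assumption-laden sketch, not the implementation from the talk; push_update and wait_for_update are made-up names, and segment ids and offsets are placeholders):

    #include <GASPI.h>

    /* Sketch of the one-sided pattern with GPI-2. Assumes gaspi_proc_init()
       has been called and segment 0 (the zero-copy parameter/update buffer)
       has been created on all ranks with gaspi_segment_create(). */

    /* Worker side: push a gradient block into the parameter server's segment
       with a notified write. No receive is posted on the target, which is
       what lets communication overlap the next layer's computation. */
    static void push_update(gaspi_rank_t server, gaspi_offset_t local_off,
                            gaspi_offset_t remote_off, gaspi_size_t bytes,
                            gaspi_notification_id_t notif_id)
    {
        gaspi_write_notify(0 /* local segment  */, local_off, server,
                           0 /* remote segment */, remote_off, bytes,
                           notif_id, 1 /* notification value */,
                           0 /* queue */, GASPI_BLOCK);
        gaspi_wait(0, GASPI_BLOCK);           /* local completion only */
    }

    /* Parameter server side: discover incoming updates via notifications
       instead of matching receives. */
    static void wait_for_update(gaspi_notification_id_t first, gaspi_number_t num)
    {
        gaspi_notification_id_t got;
        gaspi_notification_t    val;
        gaspi_notify_waitsome(0, first, num, &got, GASPI_BLOCK);
        gaspi_notify_reset(0, got, &val);
        /* ...apply the update written at the offset associated with 'got'... */
    }

The point of the pattern is that only the initiator is involved in each transfer, so a worker can keep computing on the next layer while earlier layers' updates are still in flight.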

Special requirement 2: direct data transfer from GPU
[Figure: data path GPU -> CPU -> NIC -> NIC -> CPU -> GPU]
The extra hop through CPU memory adds latency, lowers bandwidth, and requires more complex pipelining.

Special requirement 2: direct data transfer from GPU (continued)
GPI-2 leverages GPUDirect RDMA, which allows the CPU to move data directly from GPU memory to the NIC.
[Figure: data path GPU -> NIC -> NIC -> GPU, bypassing CPU memory]

Reducing network traffic via smart compression
Observation 1: many zero or near-zero values are sent in the updates (about 40% of the values).
[Histogram: fully connected layer of AlexNet, iteration #10]

Reducing network traffic via smart compression (continued)
Observation 2: updates become sparser toward convergence (about 95% of the values).
[Histogram: fully connected layer of AlexNet, iteration #100]

Special requirement 3: predicated send

    send(vector, len, F()):
        foreach i < len:
            if F(vector[i]):
                send(vector[i])

[Example: from the update vector (0.5, 0.01, 0.0, 1e-6, 0.1, 0.9) only the entries selected by the predicate F are sent]
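
The GPU-side half of this operation, selecting which entries to send, is easy to express as a CUDA kernel. The sketch below (sparsify_update and its arguments are illustrative names, not from the talk, and the predicate is a simple magnitude threshold) compacts the significant entries into (index, value) pairs that a send path could ship instead of the dense vector:

    #include <cuda_runtime.h>
    #include <math.h>
    #include <stdio.h>

    /* Predicate + compaction: keep only entries whose magnitude exceeds a
       threshold, writing them out as (index, value) pairs. Output order is
       not preserved, which is acceptable for a sparse parameter update. */
    __global__ void sparsify_update(const float *grad, int len, float threshold,
                                    int *out_idx, float *out_val, int *out_count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < len && fabsf(grad[i]) > threshold) {
            int slot = atomicAdd(out_count, 1);   /* reserve an output slot */
            out_idx[slot] = i;
            out_val[slot] = grad[i];
        }
    }

    int main(void)
    {
        const int len = 1 << 20;
        float *grad, *val; int *idx, *count;
        cudaMallocManaged(&grad,  len * sizeof(float));
        cudaMallocManaged(&val,   len * sizeof(float));
        cudaMallocManaged(&idx,   len * sizeof(int));
        cudaMallocManaged(&count, sizeof(int));
        for (int i = 0; i < len; ++i) grad[i] = (i % 10) ? 0.0f : 0.5f;  /* toy data */
        *count = 0;

        sparsify_update<<<(len + 255) / 256, 256>>>(grad, len, 1e-3f, idx, val, count);
        cudaDeviceSynchronize();

        /* Only *count (index, value) pairs would need to be transmitted. */
        printf("kept %d of %d values\n", *count, len);
        return 0;
    }

With today's stack the compacted buffer would still have to be handed back to the CPU for the actual send, which is exactly the problem the next slide raises.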

Problem with predicated send on GPUs
The predicate F must be evaluated on the GPU, close to the data, but the send must be issued by the CPU because there is no GPU-side send.

Problem with predicated send on GPUs (continued)
Goal: enable networking directly from the GPU.

GPUrdma and GPU-side networking
Daoud, Wated, Silberstein, "GPUrdma: GPU-side library for high performance networking from GPU kernels", ROSS'16

GPUrdma enables networking from the GPU without involving the CPU.

GPUrdma is faster for small messages
[Plot: GPU-side vs. CPU-side networking performance; the GPU path is roughly 3x faster for small messages]

GPU-side GPI-2: preliminary results
- 52 Gbit/s maximum GPU-to-GPU throughput vs. 38 Gbit/s with standard GPI-2
- 4.5 usec one-way latency
- 4x performance on toy applications
- Ongoing collaboration with NVIDIA and Mellanox

Conclusions
Improved scalability of distributed SGD requires:
- Careful communication-computation overlap via one-sided communications
- An optimized GPU-NIC data path
- Smart, sparsity-aware data compression
- GPU-side networking

Thank you!
https://sites.google.com/site/silbersteinmark/
mark@ee.technion.ac.il