Device Placement Optimization with Reinforcement Learning

Device Placement Optimization with Reinforcement Learning Azalia Mirhoseini, Anna Goldie, Hieu Pham, Benoit Steiner, Mohammad Norouzi, Naveen Kumar, Rasmus Larsen, Yuefeng Zhou, Quoc Le, Samy Bengio, and Jeff Dean Google Brain

Why device placement? Trend towards many-device training, bigger models, larger batch sizes:
- Google neural machine translation '16 (trained on 128 GPUs)
- Sparsely gated mixture of experts '17 (batch size = 8192, >130 billion parameters, trained on 128 GPUs)
- Training ImageNet in 1 hour '17 (batch size = 8192, trained on 256 GPUs)

Standard practice for device placement
- Often based on greedy heuristics
- Requires deep understanding of devices: nonlinear FLOPs, bandwidth, and latency behavior
- Requires modeling parallelism and pipelining
- Does not generalize well

ML for device placement
- ML is repeatedly replacing rule-based heuristics
- We show how RL can be applied to device placement:
  - Effective search across large state and action spaces
  - Automated learning from the environment, based only on a reward function (e.g., the runtime of a program)
- We are inspired by the success of ML for learning to search

Posing device placement as an RL problem
- Input: the neural model and the set of available devices (CPU, GPU)
- RL model: a policy that produces an assignment of the ops in the neural model to devices
- Output: the placement, whose runtime is evaluated and fed back to the policy as the reward

Problem formulation
Objective: minimize the expected runtime J(θ) = E_{P ~ π(P | g; θ)}[R(P)], where
- J(θ): expected runtime
- θ: trainable parameters of the policy
- R: runtime
- π(P | g; θ): policy
- g: input neural graph
- P: output placement, drawn from {1, 2, ..., num_devices}^num_ops
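To make the formulation concrete, here is a minimal Monte Carlo sketch of the objective (the helper names and the `measure_runtime` callable are hypothetical, not the authors' code); the constants match the 5-device, 280-op NMT search space cited later in the deck:

```python
import random

NUM_OPS = 280      # ops in the grouped NMT graph from the later slides
NUM_DEVICES = 5    # e.g., 1 CPU + 4 GPUs, giving a 5**280 search space

def sample_placement(policy_probs):
    """Draw one placement P ~ pi(P | g; theta).

    policy_probs[i] is a length-NUM_DEVICES distribution for op i.
    """
    return [random.choices(range(NUM_DEVICES), weights=p)[0]
            for p in policy_probs]

def expected_runtime(policy_probs, measure_runtime, num_samples=100):
    """Monte Carlo estimate of J(theta) = E[R(P)].

    measure_runtime(placement) is assumed to run the graph under the
    given op->device assignment and return wall-clock seconds.
    """
    samples = [measure_runtime(sample_placement(policy_probs))
               for _ in range(num_samples)]
    return sum(samples) / len(samples)
```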

Policy architecture
The policy is a sequence-to-sequence model that reads the operations of the input graph and emits a device for each one. Its per-op input features, embedded as sketched below, are:
- Op types, e.g., Sum, Conv2d, MatMul
- Op output shapes
- Op adjacency information

Training with REINFORCE
The policy is trained with the REINFORCE policy-gradient estimator: sampled placements are executed, and their measured runtimes serve as the reward signal for updating the policy parameters.
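A minimal single-step REINFORCE sketch, assuming the policy emits one categorical device distribution per op; the moving-average baseline is a common variance-reduction choice, and the helper names are hypothetical:

```python
import torch

def reinforce_step(policy, graph, measure_runtime, optimizer, baseline, decay=0.9):
    """Sample a placement, measure its runtime, and update the policy."""
    dists = policy(graph)                    # list of torch.distributions.Categorical
    placement = [d.sample() for d in dists]  # P ~ pi(P | g; theta)
    runtime = measure_runtime([p.item() for p in placement])

    # Moving-average baseline reduces the variance of the gradient estimate.
    baseline = decay * baseline + (1 - decay) * runtime
    advantage = runtime - baseline

    # Minimizing E[R]: the gradient is E[(R - b) * grad log pi(P | g; theta)].
    log_prob = torch.stack([d.log_prob(a)
                            for d, a in zip(dists, placement)]).sum()
    loss = advantage * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return runtime, baseline
```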

Distributed training
A shared parameter server serves m controllers (Controller 1 through Controller m). Each controller proposes placements to its workers (Worker 1, Worker 2), collects the measured runtimes, and uses them to update the policy parameters stored on the parameter server.
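A schematic sketch of one controller's loop under this architecture (names such as measure_on_worker are hypothetical; this is a simplification of the deck's parameter-server setup, not the authors' infrastructure):

```python
from concurrent.futures import ThreadPoolExecutor

def controller_loop(policy, sample_placement, measure_on_worker, workers, steps):
    """Sample one placement per worker, measure runtimes in parallel,
    and hand the (placement, runtime) pairs to the REINFORCE update."""
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        for _ in range(steps):
            placements = [sample_placement(policy) for _ in workers]
            # Each worker executes the graph under its assigned placement.
            runtimes = list(pool.map(measure_on_worker, workers, placements))
            yield placements, runtimes
```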

Learned placement on Neural Machine Translation (NMT)
The figure shows the encoder (embedding, Layer-1, Layer-2) and the decoder (embedding, Layer-1, Layer-2, attention, softmax). White represents the CPU (Intel Haswell 2300); each other color represents a separate GPU (Nvidia Tesla K80). The policy searches over a space of 5^280 possible assignments.

NMT end-to-end training
Mirhoseini, Pham, et al., "Device Placement Optimization with Reinforcement Learning," ICML 2017

Learned placement on Inception-V3
White represents the CPU (Intel Haswell 2300); each other color represents a separate GPU (Nvidia Tesla K80). The policy searches over a space of 5^83 possible assignments.

Inception-V3 end-to-end training
- Model parallelism with batch size 128
- Data parallelism, 4 towers each with batch size 32
Mirhoseini, Pham, et al., "Device Placement Optimization with Reinforcement Learning," ICML 2017

Profiling placement on NMT

Profiling placement on Inception-V3

Shortcomings of the initially proposed model
- Seq2seq models cannot be unrolled for more than a few hundred steps
- Most TensorFlow graphs contain tens of thousands of operations
- Manual grouping of operations hampers scalability

An end-to-end hierarchical placement model
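A minimal sketch of the two-level model this section describes: a feed-forward Grouper assigning each op to a group, and a recurrent Placer assigning each group to a device (the layer sizes and architecture details here are illustrative assumptions):

```python
import torch
import torch.nn as nn

class Grouper(nn.Module):
    """Feed-forward network: per-op features -> softmax over groups."""
    def __init__(self, feat_dim, num_groups):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, num_groups))

    def forward(self, op_features):  # (num_ops, feat_dim)
        return torch.distributions.Categorical(logits=self.net(op_features))

class Placer(nn.Module):
    """Recurrent policy: one device decision per group."""
    def __init__(self, feat_dim, num_devices, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_devices)

    def forward(self, group_features):  # (1, num_groups, feat_dim)
        out, _ = self.rnn(group_features)
        return torch.distributions.Categorical(logits=self.head(out.squeeze(0)))
```

In this sketch the Placer would consume aggregated features of the ops assigned to each group, so the sampled grouping conditions the placement, matching the factorized formulation on the following slides.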

Problem formulation for hierarchical placement
Objective: minimize the expected runtime J(θg, θd) = E[Rd] for the predicted placement d, where
- J(θg, θd): expected runtime
- θg: trainable parameters of the Grouper
- θd: trainable parameters of the Placer
- Rd: runtime for placement d

Problem formulation for hierarchical placement
The joint policy factorizes into two terms: the probability of the predicted group assignment of operations (produced by the Grouper), and the probability of the predicted device placement conditioned on the grouping results (produced by the Placer).
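A plausible LaTeX reconstruction of the factorized objective implied by these two slides (the exact notation of the original slides is assumed):

```latex
% Expected runtime, factorized over groupings g (Grouper)
% and device placements d (Placer)
J(\theta_g, \theta_d)
  = \sum_{g} p(g;\, \theta_g) \sum_{d} p(d \mid g;\, \theta_d)\, R_d
```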

Gradient update for the Grouper: the derivative of the objective with respect to the parameters of the Grouper, θg.

Gradient update for the Placer: the derivative of the objective with respect to the parameters of the Placer, θd.
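A REINFORCE-style reconstruction of the two gradients, derived from the factorized objective above (my derivation, not a verbatim copy of the slides):

```latex
% Grouper update: score-function estimator w.r.t. theta_g
\nabla_{\theta_g} J(\theta_g, \theta_d)
  = \mathbb{E}_{g,\, d}\left[ R_d \,\nabla_{\theta_g} \log p(g;\, \theta_g) \right]

% Placer update: score-function estimator w.r.t. theta_d
\nabla_{\theta_d} J(\theta_g, \theta_d)
  = \mathbb{E}_{g,\, d}\left[ R_d \,\nabla_{\theta_d} \log p(d \mid g;\, \theta_d) \right]
```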

Learned placements on NMT (hierarchical model)
White represents the CPU (Intel Haswell 2300); each other color represents a separate GPU (Nvidia Tesla K80). The policy searches over a space of ~5^40000 possible assignments.

Results (runtime in seconds)

Summary
- Pose device placement as an RL problem
- Use policy gradients to learn placements
- The policy finds non-trivial assignments of operations to devices that outperform heuristic approaches
- Profiling of the results shows that the policy learns implicit trade-offs between the computational and communication capabilities of the underlying devices