Deep learning at Microsoft


Deep learning at Microsoft: Services, Skype Translator, Cortana, Bing, HoloLens, Research

Services

ImageNet 2015: ResNet. ImageNet classification top-5 error (%) by year: ILSVRC 2010 NEC America 28.2, ILSVRC 2011 Xerox 25.8, ILSVRC 2012 AlexNet 16.4, ILSVRC 2013 Clarifai 11.7, ILSVRC 2014 VGG 7.3, ILSVRC 2014 GoogLeNet 6.7, ILSVRC 2015 ResNet 3.5. ResNet took 1st place in all five tracks this year: ImageNet classification, ImageNet localization, ImageNet detection, COCO detection, and COCO segmentation.

Image Similarity. Goal: given a query image, find similar images. Customer: anonymous ISV (Azure Partner). Task: given a retail image, find the same product on competitor websites (to compare prices). Existing solution: based solely on mining text information from the websites of Target, Macy's, etc. The customer asked for individual similarity measures (e.g., texture, neck style, etc.).

Bing / Bing Ads

Translator (http://translate.it): PowerPoint plug-in for translating speech to subtitles.

Microsoft's historic speech breakthrough: 2016 research system for conversational speech recognition with a 5.9% word-error rate, enabled by CNTK's multi-server scalability. [W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, G. Zweig: Achieving Human Parity in Conversational Speech Recognition, https://arxiv.org/abs/1610.05256]

Customer Support Agent

Microsoft Cognitive Toolkit (CNTK): Microsoft's open-source deep-learning toolkit. https://github.com/Microsoft/CNTK. Created by Microsoft Speech researchers (Dong Yu et al.) in 2012 as the Computational Network Toolkit. On GitHub since Jan 2016 under the MIT license. Python support since Oct 2016 (beta), rebranded as the Microsoft Cognitive Toolkit. External contributions e.g. from MIT, Stanford and NVIDIA.

Microsoft Cognitive Toolkit (CNTK): over 80% of Microsoft's internal DL workload runs on CNTK. First-class on Linux and Windows, Docker support. APIs: Python, C++, C#, Java. Internal == External.

CNTK: The Fastest. http://dlbench.comp.hkbu.edu.hk/ Benchmarking by HKBU, version 8. Single Tesla K80 GPU, CUDA 8.0, cuDNN v5.1. Versions: Caffe 1.0rc5 (39f28e4), CNTK 2.0 Beta10 (1ae666d), MXNet 0.93 (32dc3a2), TensorFlow 1.0 (4ac9c09), Torch 7 (748f5e3). Time per minibatch (ms):

                            Caffe        CNTK                  MXNet                   TensorFlow      Torch
FCN5 (1024)                 55.329ms     51.038ms              60.448ms                62.044ms        52.154ms
AlexNet (256)               36.815ms     27.215ms              28.994ms                103.960ms       37.462ms
ResNet (32)                 143.987ms    81.470ms              84.545ms                181.404ms       90.935ms
LSTM (256) (v7 benchmark)   -            43.581ms (44.917ms)   288.142ms (284.898ms)   - (223.547ms)   1130.606ms (906.958ms)

CNTK is production-ready: state-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server. [Chart: speed comparison (samples/second, higher = better), December 2015, for CNTK, Theano, TensorFlow, Torch 7, and Caffe on 1 GPU, 1 x 4 GPUs, and 2 x 4 GPUs (8 GPUs). CNTK's multi-GPU speed-up is achieved with the 1-bit gradient quantization algorithm; Theano only supports 1 GPU.]

Superior performance

Scalability

What is new in CNTK 2.0? "Microsoft has now released a major upgrade of the software and rebranded it as part of the Microsoft Cognitive Toolkit. This release is a major improvement over the initial release. There are two major changes from the first release that you will see when you begin to look at the new release. First, CNTK now has a very nice Python API and, second, the documentation and examples are excellent. Installing the software from the binary builds is very easy on both Ubuntu Linux and Windows." https://esciencegroup.com/2016/11/10/cntk-revisited-a-new-deep-learning-toolkit-release-from-microsoft/

CNTK: other advantages. Python and C++ API: mostly implemented in C++; low-level plus high-level Python API. Extensibility: user functions and learners in pure Python. Readers: distributed, highly efficient built-in data readers. Internal == External.

The Microsoft Cognitive Toolkit (CNTK) expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications. CNTK is production-ready: state-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server.

MNIST Handwritten Digits (OCR): a dataset of handwritten digits with 60,000 training images and 10,000 test images. Each image is 28 x 28 pixels. [Figure: sample digit images with their corresponding labels.]

Multi-layer perceptron (deep model). https://github.com/Microsoft/CNTK/tree/master/tutorials
Input: 784 pixels (x), i.e. a 28 x 28 image.
Three dense layers: i=784, O=400, a=relu; i=400, O=200, a=relu; i=200, O=10 (10 output nodes), a=None.
Weights: 784 x 400 + 400 biases; 400 x 200 + 200 biases; 200 x 10 + 10 biases.
The outputs z0..z9 go through a softmax, softmax(z_i) = e^(z_i) / sum_{j=0..9} e^(z_j), giving predicted probabilities such as (0.08, 0.08, 0.10, 0.17, 0.11, 0.09, 0.08, 0.08, 0.13, 0.01).
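
The same 784-400-200-10 network can be written compactly with CNTK's layers library. A minimal sketch, assuming the CNTK 2.x Python API (cntk.layers); create_model_function is the name used later in the "as easy as 1-2-3" skeleton:

import cntk as C
from cntk.layers import Dense, Sequential

def create_model_function():
    return Sequential([
        Dense(400, activation=C.relu),
        Dense(200, activation=C.relu),
        Dense(10, activation=None)])   # logits; the softmax is applied in the loss

x = C.input_variable(784)              # 28 x 28 image, flattened
z = create_model_function()(x)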

Loss function.
Label: one-hot encoded (Y), e.g. the digit 3 becomes (0 0 0 1 0 0 0 0 0 0).
Model (w, b): maps the 28 x 28 pixel image to predicted probabilities (p), e.g. (0.08, 0.08, 0.10, 0.17, 0.11, 0.09, 0.08, 0.08, 0.13, 0.01).
Loss function: cross-entropy error, ce = - sum_{j=0..9} y_j log p_j.
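
A quick numeric check of the cross-entropy above, using the one-hot label for digit 3 and the predicted probabilities from the slide (standard convention with the leading minus sign):

import numpy as np

y = np.zeros(10); y[3] = 1.0                      # one-hot label "3"
p = np.array([0.08, 0.08, 0.10, 0.17, 0.11,
              0.09, 0.08, 0.08, 0.13, 0.01])      # predicted probabilities
ce = -np.sum(y * np.log(p))                       # = -log(0.17), about 1.77
print(ce)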

CNTK model example: 2-hidden-layer feed-forward NN.
h1 = s(W1 x + b1)                   h1 = sigmoid (x @ W1 + b1)
h2 = s(W2 h1 + b2)                  h2 = sigmoid (h1 @ W2 + b2)
P  = softmax(Wout h2 + bout)        P  = softmax (h2 @ Wout + bout)
with input x ∈ R^M and one-hot label y ∈ R^J, and cross-entropy training criterion
ce = y^T log P                      ce = cross_entropy (P, y)
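
In the CNTK Python API the softmax and the cross-entropy are usually fused into a single op, cross_entropy_with_softmax, applied to the pre-softmax output. A hedged sketch of the criterion, assuming the 2.x API; num_classes and the variable names are illustrative:

import cntk as C

def create_criterion_function(z, num_classes=10):
    y = C.input_variable(num_classes)           # one-hot label
    ce = C.cross_entropy_with_softmax(z, y)     # training loss (softmax + cross-entropy)
    errs = C.classification_error(z, y)         # additional metric
    return ce, errs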

CNTK model. [Figure: the computational graph for this network: inputs x and y, parameters W1/b1, W2/b2, Wout/bout, and operation nodes +, s (sigmoid), softmax, and cross_entropy producing ce.]
h1 = sigmoid (x @ W1 + b1)
h2 = sigmoid (h1 @ W2 + b2)
P = softmax (h2 @ Wout + bout)
ce = cross_entropy (P, y)

CNTK model. [Figure: the same computational graph.]
Nodes: functions (primitives); can be composed into reusable composites.
Edges: values, incl. tensors, sparse.
Automatic differentiation: ∂F/∂in = ∂F/∂out · ∂out/∂in.
Deferred computation: execution engine.
Editable, clonable.
LEGO-like composability allows CNTK to support a wide range of networks & applications.

Authoring networks as functions.
Model function: features -> predictions; defines the model structure & parameter initialization; holds the parameters that will be learned by training.
Criterion function: (features, labels) -> (training loss, additional metrics); defines training and evaluation criteria on top of the model function; provides gradients w.r.t. the training criteria.
[Figure: the same computational graph.]

Authoring networks as functions. CNTK model: neural networks are functions, pure functions "with special powers": they can compute a gradient w.r.t. any of their nodes, and an external deity can update the model parameters. The user specifies the network as function objects: a formula as a Python function (low level, e.g. LSTM), function composition of smaller sub-networks (layering), and higher-order functions (the equivalent of scan, fold, unfold). Model parameters are held by the function objects and compiled into the static execution graph under the hood. See the sketch below.
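
To make the "formula as a Python function" versus "function composition" distinction concrete, here is a hedged sketch of the same sigmoid layer written both ways, assuming the CNTK 2.x API; input_dim and hidden_dim are illustrative:

import cntk as C
from cntk.layers import Dense

input_dim, hidden_dim = 784, 400
x = C.input_variable(input_dim)

# (a) low level: write the formula h = sigmoid(x @ W + b) directly
W = C.parameter(shape=(input_dim, hidden_dim), init=C.glorot_uniform())
b = C.parameter(shape=hidden_dim, init=0)
h_formula = C.sigmoid(C.times(x, W) + b)

# (b) layering: the same layer as a reusable, composable function object
h_layered = Dense(hidden_dim, activation=C.sigmoid)(x)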

Layers lib: full list of layers/blocks.
layers/blocks.py: LSTM(), GRU(), RNNUnit(); Stabilizer(), identity; ForwardDeclaration(), Tensor[], SparseTensor[], Sequence[], SequenceOver[].
layers/layers.py: Dense(), Embedding(); Convolution(), Convolution1D(), Convolution2D(), Convolution3D(), Deconvolution(); MaxPooling(), AveragePooling(), GlobalMaxPooling(), GlobalAveragePooling(), MaxUnpooling(); BatchNormalization(), LayerNormalization(); Dropout(), Activation(); Label().
layers/higher_order_layers.py: Sequential(), For(), operator >>, (function tuples); ResNetBlock(), SequentialClique().
layers/sequence.py: Delay(), PastValueWindow(); Recurrence(), RecurrenceFrom(), Fold(), UnfoldFrom().
models/models.py: AttentionModel().
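
As an illustration of the higher-order layers in the list above (Sequential(), For()), a small sketch, assuming the CNTK 2.x layers API; the sizes are arbitrary:

import cntk as C
from cntk.layers import Dense, For, Sequential

# three 200-unit ReLU layers followed by a 10-way linear output layer
mlp = Sequential([For(range(3), lambda: Dense(200, activation=C.relu)),
                  Dense(10, activation=None)])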

CNTK workflow. A script configures and executes everything through the CNTK Python APIs: corpus -> reader (minibatch source, task-specific deserializers, automatic randomization, distributed reading) -> network (model function, criterion function; CPU/GPU execution engine; packing, padding) -> trainer (SGD with momentum, Adam, ...; minibatching) -> model.

As easy as 1-2-3:

from cntk import *

# reader
def create_reader(path, is_training): ...

# network
def create_model_function(): ...
def create_criterion_function(model): ...

# trainer (and evaluator)
def train(reader, model): ...
def evaluate(reader, model): ...

# main function
model = create_model_function()

reader = create_reader(..., is_training=True)
train(reader, model)

reader = create_reader(..., is_training=False)
evaluate(reader, model)
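
For concreteness, a hedged sketch of what train() might contain for the MNIST model above, assuming the CNTK 2.x API; the input shapes, stream names (features/labels), minibatch size, and learning rate are illustrative assumptions, not part of the original skeleton:

import cntk as C

def train(reader, model):
    x = C.input_variable(784)
    y = C.input_variable(10)
    z = model(x)
    ce = C.cross_entropy_with_softmax(z, y)
    errs = C.classification_error(z, y)
    learner = C.sgd(z.parameters, lr=C.learning_rate_schedule(0.2, C.UnitType.minibatch))
    trainer = C.Trainer(z, (ce, errs), [learner], [C.logging.ProgressPrinter(freq=500)])
    input_map = {x: reader.streams.features, y: reader.streams.labels}
    for _ in range(60000 * 10 // 64):                 # roughly 10 sweeps at minibatch size 64
        data = reader.next_minibatch(64, input_map=input_map)
        trainer.train_minibatch(data)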

Workflow: prepare data; configure reader, network, learner (Python); train:
mpiexec --np 16 --hosts server1,server2,server3,server4 \
    python my_cntk_script.py

Prepare data: reader

def create_reader(map_file, mean_file, is_training):
    # image preprocessing pipeline
    transforms = [
        ImageDeserializer.crop(crop_type='Random', ratio=0.8, jitter_type='uniratio'),
        ImageDeserializer.scale(width=image_width, height=image_height,
                                channels=num_channels, interpolations='linear'),
        ImageDeserializer.mean(mean_file)
    ]
    # deserializer
    return MinibatchSource(ImageDeserializer(map_file, StreamDefs(
        features = StreamDef(field='image', transforms=transforms),
        labels   = StreamDef(field='label', shape=num_classes)
    )), randomize=is_training,
        epoch_size=INFINITELY_REPEAT if is_training else FULL_DATA_SWEEP)

Automatic on-the-fly randomization is important for large data sets. Readers compose, e.g. image + text caption.

Distributed training: prepare data; configure reader, network, learner (Python); train (now distributed!):
mpiexec --np 16 --hosts server1,server2,server3,server4 \
    python my_cntk_script.py

Workflow: prepare data; configure reader, network, learner (Python); train:
mpiexec --np 16 --hosts server1,server2,server3,server4 \
    python my_cntk_script.py
Deploy: offline (Python): apply the model file-to-file; your code: embed the model through the C++ API; online: web-service wrapper through the C#/Java API. A deployment sketch follows below.
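
A hedged sketch of the offline (Python) deployment path, assuming the CNTK 2.x API; the file name and the random test input are illustrative, and z is the trained network from the training sketch above:

import cntk as C
import numpy as np

z.save('mnist.cntk')                     # after training
model = C.load_model('mnist.cntk')       # later, possibly in another process
sample = np.random.rand(1, 784).astype(np.float32)
probs = C.softmax(model).eval(sample)    # apply the model to one input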

CNTK unique features: symbolic loops over sequences with dynamic scheduling; turning the graph into a parallel program through minibatching; unique parallel training algorithms (1-bit SGD, Block Momentum).

Symbolic loops over sequential data: extend our example to a recurrent network (RNN).
h1(t) = s(W1 x(t) + H1 h1(t-1) + b1)        h1 = sigmoid(x @ W1 + past_value(h1) @ H1 + b1)
h2(t) = s(W2 h1(t) + H2 h2(t-1) + b2)       h2 = sigmoid(h1 @ W2 + past_value(h2) @ H2 + b2)
P(t)  = softmax(Wout h2(t) + bout)          P  = softmax(h2 @ Wout + bout)
ce(t) = L(t)^T log P(t)                     ce = cross_entropy(P, L)
Training criterion: sum over the corpus of ce(t) -> max. Note there is no explicit notion of time in the CNTK description.

Symbolic loops over sequential data. [Figure: the same computational graph, now with delay nodes z^-1 and recurrence weights H1, H2 feeding h1 and h2 back into themselves.]
h1 = sigmoid(x @ W1 + past_value(h1) @ H1 + b1)
h2 = sigmoid(h1 @ W2 + past_value(h2) @ H2 + b2)
P = softmax(h2 @ Wout + bout)
ce = cross_entropy(P, L)
CNTK automatically unrolls cycles via deferred computation; efficient and composable. A recurrence sketch using the layers API follows below.
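
In practice one rarely writes past_value by hand; the layers library wraps the symbolic loop in Recurrence(). A hedged sketch, assuming a recent CNTK 2.x where sequence inputs are declared with C.sequence.input_variable; the dimensions are illustrative:

import cntk as C
from cntk.layers import Dense, LSTM, Recurrence, Sequential

x = C.sequence.input_variable(300)            # a variable-length input sequence
model = Sequential([Recurrence(LSTM(128)),    # h(t) depends on h(t-1): a symbolic loop
                    C.sequence.last,          # keep only the final time step
                    Dense(10)])
z = model(x)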

Batch-scheduling of variable-length sequences: minibatches containing sequences of different lengths are automatically packed and padded, and time steps are computed in parallel across sequences. [Figure: several parallel sequences (sequence 1 through sequence 7) packed into one minibatch, with padding filling the gaps.] CNTK handles the special cases: the past_value operation correctly resets state and gradient at sequence boundaries; non-recurrent operations just pretend there is no padding ("garbage-in/garbage-out"); sequence reductions.

The speed-up from this packing is automatic. [Chart: speed comparison on RNNs; naïve processing of a single sequence at a time normalized to 1, versus optimized multi-sequence batching at more than 20x.]

Data-parallel training. Data-parallelism: distribute the minibatch over workers, then all-reduce the partial gradients. [Figure: three nodes each computing partial gradients, combined by an all-reduce (Σ).]

Data-parallel training. The all-reduce uses a ring algorithm: with K workers and model size M, each worker communicates O(2 (K-1)/K · M) values, which is O(1) with respect to K. [Figure: three nodes exchanging gradient slices in a ring.]
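
A tiny numeric illustration of the ring all-reduce cost quoted above: each of the K workers communicates about 2 (K-1)/K · M values, so the per-worker volume stays essentially constant as K grows (M is an assumed model size):

M = 100 * 1000 * 1000                  # model parameters (illustrative)
for K in (2, 4, 8, 16, 64):
    per_worker = 2 * (K - 1) / K * M
    print(K, per_worker)               # approaches 2*M, independent of K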

Data-parallel training. How to reduce communication cost: communicate less each time. 1-bit SGD [F. Seide, H. Fu, J. Droppo, G. Li, D. Yu: 1-Bit Stochastic Gradient Descent ... Distributed Training of Speech DNNs, Interspeech 2014]: quantize gradients to 1 bit per value; the trick is to carry over the quantization error to the next minibatch. [Figure: a minibatch split across GPU 1-3; gradients are exchanged 1-bit quantized with residual.]

Data-parallel training. How to reduce communication cost: (a) communicate less each time: 1-bit SGD [F. Seide, H. Fu, J. Droppo, G. Li, D. Yu: 1-Bit Stochastic Gradient Descent ... Distributed Training of Speech DNNs, Interspeech 2014]: quantize gradients to 1 bit per value; carry over the quantization error to the next minibatch. (b) communicate less often: automatic minibatch sizing [F. Seide, H. Fu, J. Droppo, G. Li, D. Yu: On Parallelizability of Stochastic Gradient Descent ..., ICASSP 2014]; Block Momentum [K. Chen, Q. Huo: Scalable training of deep learning machines by incremental block training, ICASSP 2016], a very recent, very effective parallelization method that combines model averaging with the error-residual idea. Both are sketched below.
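
A hedged sketch of how these two options surface in the CNTK 2.x Python API (cntk.train.distributed); the local learner, its hyperparameters, and block_size are assumptions, and the script would be launched with mpiexec as shown earlier (1-bit quantization availability depends on the CNTK build):

import cntk as C

local_learner = C.momentum_sgd(z.parameters,
                               lr=C.learning_rate_schedule(0.01, C.UnitType.minibatch),
                               momentum=C.momentum_schedule(0.9))

# (a) data-parallel SGD with gradients quantized to 1 bit (error residual carried over)
learner_1bit = C.train.distributed.data_parallel_distributed_learner(
    local_learner, num_quantization_bits=1)

# (b) Block Momentum: model averaging combined with the error-residual idea
learner_bm = C.train.distributed.block_momentum_distributed_learner(
    local_learner, block_size=100000)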

Benchmark result of parallel training on CNTK. Training data: 2,670 hours of speech from real traffic of VS, SMD, and Cortana. About 16 and 20 days to train the DNN and LSTM respectively on 1 GPU. [Chart: 1-bit SGD / BMUF speed-up factors (average and peak) in LSTM training on 4, 8, 16, 32, and 64 GPUs.] Credit: Yongqiang Wang, Kai Chen, Qiang Huo.

Results/achievements: almost linear speed-up without degradation of model quality; verified for training DNNs, CNNs, and LSTMs on up to 64 GPUs for speech recognition, image classification, OCR, and click-prediction tasks; released in CNTK as a critical differentiator; used for enterprise-scale production data loads; also in production tools at other companies such as iFLYTEK and Alibaba.

Where to begin?
On GitHub: https://github.com/Microsoft/CNTK/wiki
Tutorials: https://www.cntk.ai/pythondocs/tutorials.html (latest release); https://github.com/Microsoft/CNTK/tree/master/tutorials (latest)
Azure Notebooks (try for free, pre-hosted): https://notebooks.azure.com/cntk/libraries/tutorials
Seek help on Stack Overflow: http://stackoverflow.com/search?q=cntk (please add the cntk tag)