Sayan Pathak Principal ML Scientist. Chris Basoglu Partner Dev Manager

2 Sayan Pathak, Principal ML Scientist; Chris Basoglu, Partner Dev Manager. With many contributors: A. Agarwal, E. Akchurin, E. Barsoum, C. Basoglu, G. Chen, S. Cyphers, W. Darling, J. Droppo, K. Deng, A. Eversole, B. Guenter, P. He, M. Hillebrand, X. Huang, Z. Huang, R. Hoens, V. Ivanov, A. Kamenev, N. Karampatziakis, P. Kranen, O. Kuchaiev, W. Manousek, C. Marschner, A. May, B. Mitra, O. Nano, G. Navarro, A. Orlov, M. Radmilac, P. Parthasarathi, S. Pathak, B. Peng, A. Reznichenko, W. Richert, F. Seide, M. Seltzer, M. Slaney, A. Stolcke, T. Will, H. Wang, Z. Wang, W. Xiong, K. Yao, D. Yu, C. Zhang, Y. Zhang, G. Zweig

3 Outline: CNTK overview, key features, symbolic loops, batch scheduling, data-parallel training, conclusions

5 Deep learning at Microsoft: Skype Translator, Cortana, Bing, HoloLens, Microsoft Research

6 Microsoft Services

8 ImageNet 2015: ResNet. ImageNet classification top-5 error (%): ILSVRC 2010 NEC America 28.2; ILSVRC 2011 Xerox 25.8; ILSVRC 2012 AlexNet 16.4; ILSVRC 2013 Clarifai 11.7; ILSVRC 2014 VGG 7.3; ILSVRC 2014 GoogLeNet 6.7; ILSVRC 2015 ResNet 3.5. ResNet took first place in all five tracks that year: ImageNet classification, ImageNet localization, ImageNet detection, COCO detection, and COCO segmentation.

9 YouTube video link

10 Microsoft Translator

11 You can follow along with this presentation on your own device, in the language of your choice. Download the Microsoft Translator app for Android, iOS, or Windows, or visit translate.it/<enter CODE HERE>. Type in the unique conversation code below to join this conversation: <ENTER CODE HERE>. Translator is powered by machine learning; any voice or text information you provide will be used to improve Microsoft products and services.

12 Bing / Bing Ads

13 Microsoft's historic speech breakthrough: Microsoft's 2016 research system for conversational speech recognition reached a 5.9% word-error rate, enabled by CNTK's multi-server scalability. [W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, G. Zweig: Achieving Human Parity in Conversational Speech Recognition]

14 Customer Support Agent

20 Microsoft Cognitive Toolkit (CNTK): Microsoft's open-source deep-learning toolkit. Created by Microsoft Speech researchers in 2012. On GitHub since Jan 2016 under the MIT license. Python support since Oct 2016 (beta), rebranded as Microsoft Cognitive Toolkit. External contributions, e.g. from MIT, Stanford, and NVIDIA.

21 Microsoft Cognitive Toolkit (CNTK): over 80% of Microsoft's internal deep-learning workload runs on CNTK. First-class citizen on Linux and Windows, with Docker support. APIs for Python, C++, C#, and Java. Internal == external: the same toolkit is used inside and outside Microsoft.

22 CNTK, the fastest. Benchmarking by HKBU (version 8), single Tesla K80 GPU, CUDA 8.0, cuDNN v5.1. Framework versions: Caffe 1.0rc5 (39f28e4), CNTK 2.0 Beta10 (1ae666d), MXNet 0.93 (32dc3a2), TensorFlow 1.0 (4ac9c09), Torch 7 (748f5e3). Benchmarks: FCN5 (1024), AlexNet (256), ResNet (32), and LSTM (256; v7 benchmark), each timed in ms per minibatch across Caffe, CNTK, MXNet, TensorFlow, and Torch.

23 CNTK is production-ready: state-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server. Speed comparison (samples/second, higher = better; note: December 2015) of CNTK, Theano, TensorFlow, Torch 7, and Caffe on 1 GPU, 1 x 4 GPUs, and 2 x 4 GPUs (8 GPUs). CNTK's scaling is achieved with the 1-bit gradient quantization algorithm; Theano only supports 1 GPU.

24 Superior performance

25 Scalability

26 What is new in CNTK 2.0? Microsoft has now released a major upgrade of the software and rebranded it as part of the Microsoft Cognitive Toolkit. This release is a major improvement over the initial one. There are two major changes you will see when you begin to look at the new release: first, CNTK now has a very nice Python API; second, the documentation and examples are excellent. Installing the software from the binary builds is very easy on both Ubuntu Linux and Windows.

27 CNTK: other advantages. APIs: Python and C++, with the toolkit mostly implemented in C++; the Python API covers both low and high levels. Extensibility: user functions and learners in pure Python. Readers: distributed, highly efficient built-in data readers. Internal == external.

29 The Microsoft Cognitive Toolkit (CNTK). CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications. CNTK is production-ready: state-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server.

30 MNIST handwritten digits (OCR): handwritten digits and their corresponding labels. A data set of handwritten digits with 60,000 training images and 10,000 test images; each image is 28 x 28 pixels.

31 Multi-layer perceptron. The 784 pixels (x) of a 28 x 28 image feed a deep model of three dense layers: 784 inputs to 400 outputs (ReLU), 400 to 200 (ReLU), and 200 to 10 (no activation), each with its own weights and bias. The 10 output nodes z0..z9 are normalized to probabilities with softmax: softmax(z_i) = e^{z_i} / sum_{j=0..9} e^{z_j}.
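
As a sketch of the same 784-400-200-10 network in CNTK's Python layers API (assuming CNTK 2.x; only the layer sizes come from the slide, the rest is illustrative):

    import cntk as C

    x = C.input_variable(784)                    # flattened 28 x 28 image
    model = C.layers.Sequential([
        C.layers.Dense(400, activation=C.relu),  # 784 -> 400
        C.layers.Dense(200, activation=C.relu),  # 400 -> 200
        C.layers.Dense(10, activation=None)])    # 200 -> 10 logits z0..z9
    z = model(x)                                 # softmax is folded into the loss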

32 Loss function. The label is one-hot encoded (y), and the model (w, b) produces predicted probabilities (p). The cross-entropy error is ce = - sum_{j=0..9} y_j log p_j. For example, if the true digit is 3 and the model assigns p_3 = 0.8, then ce = -log 0.8, about 0.22.

34 CNTK model example: 2-hidden-layer feed-forward NN. h_1 = s(W_1 x + b_1), in code h1 = sigmoid(x @ W1 + b1); h_2 = s(W_2 h_1 + b_2), in code h2 = sigmoid(h1 @ W2 + b2); P = softmax(W_out h_2 + b_out), in code P = softmax(h2 @ Wout + bout); with input x in R^M, one-hot label y in R^J, and cross-entropy training criterion ce = -y^T log P, in code ce = cross_entropy(P, y).
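
The same four lines as a minimal runnable sketch with low-level CNTK ops (the hidden sizes and initialization are illustrative assumptions):

    import cntk as C

    M, H1, H2, J = 784, 400, 200, 10
    x = C.input_variable(M)
    y = C.input_variable(J)
    W1, b1 = C.parameter((M, H1), init=C.glorot_uniform()), C.parameter(H1)
    W2, b2 = C.parameter((H1, H2), init=C.glorot_uniform()), C.parameter(H2)
    Wout, bout = C.parameter((H2, J), init=C.glorot_uniform()), C.parameter(J)
    h1 = C.sigmoid(C.times(x, W1) + b1)        # h1 = sigmoid(x @ W1 + b1)
    h2 = C.sigmoid(C.times(h1, W2) + b2)       # h2 = sigmoid(h1 @ W2 + b2)
    z = C.times(h2, Wout) + bout               # logits
    ce = C.cross_entropy_with_softmax(z, y)    # softmax + cross-entropy in one op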

35 CNTK model as a computational graph: x with W_1, b_1 feeds a + node and a sigmoid (s) node producing h_1; h_1 with W_2, b_2 feeds the next + and s producing h_2; h_2 with W_out, b_out feeds + and softmax producing P; P and the label y feed cross_entropy, producing ce. In code: h1 = sigmoid(x @ W1 + b1); h2 = sigmoid(h1 @ W2 + b2); P = softmax(h2 @ Wout + bout); ce = cross_entropy(P, y).

36 CNTK model (the same graph, viewed as a program). Nodes are functions (primitives) and can be composed into reusable composites; edges are values, including tensors and sparse tensors. Automatic differentiation follows the chain rule: ∂F/∂in = ∂F/∂out · ∂out/∂in. Computation is deferred to an execution engine, and graphs are editable and clonable. This LEGO-like composability allows CNTK to support a wide range of networks and applications.

37 Authoring networks as functions. Model function: features to predictions; defines the model structure and parameter initialization, and holds the parameters that will be learned by training. Criterion function: (features, labels) to (training loss, additional metrics); defines training and evaluation criteria on top of the model function, and provides gradients w.r.t. the training criteria. See the sketch below.
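
A hedged sketch of this split, with a model function and a criterion function on top of it (all names and sizes are illustrative):

    import cntk as C

    def create_model():
        # model function: defines structure and holds the learnable parameters
        return C.layers.Sequential([
            C.layers.Dense(200, activation=C.sigmoid),
            C.layers.Dense(10)])

    def create_criterion(model, features, labels):
        # criterion function: training loss plus an additional metric
        z = model(features)
        loss = C.cross_entropy_with_softmax(z, labels)
        metric = C.classification_error(z, labels)
        return loss, metric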

38 Authoring networks as functions. CNTK model: neural networks are functions; pure functions "with special powers": they can compute a gradient w.r.t. any of their nodes, and an external "deity" can update the model parameters. The user specifies the network as function objects: a formula as a Python function (low level, e.g. LSTM); function composition of smaller sub-networks (layering); higher-order functions (the equivalent of scan, fold, unfold). Model parameters are held by the function objects, which are compiled into the static execution graph under the hood.

39 Layers lib: full list of layers/blocks.
layers/blocks.py: LSTM(), GRU(), RNNUnit(); Stabilizer(), identity; ForwardDeclaration(), Tensor[], SparseTensor[], Sequence[], SequenceOver[].
layers/layers.py: Dense(), Embedding(); Convolution(), Convolution1D(), Convolution2D(), Convolution3D(), Deconvolution(); MaxPooling(), AveragePooling(), GlobalMaxPooling(), GlobalAveragePooling(), MaxUnpooling(); BatchNormalization(), LayerNormalization(); Dropout(), Activation(); Label().
layers/higher_order_layers.py: Sequential(), For(), operator >>, (function tuples); ResNetBlock(), SequentialClique().
layers/sequence.py: Delay(), PastValueWindow(); Recurrence(), RecurrenceFrom(), Fold(), UnfoldFrom().
models/models.py: AttentionModel().
A composition example follows.
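
For instance (a sketch; the depth and widths are illustrative), Sequential(), For(), and the >> operator compose layers without hand-written plumbing:

    import cntk as C
    from cntk.layers import Dense, For, Sequential

    # three 256-wide ReLU layers followed by a 10-way output layer
    model = Sequential([
        For(range(3), lambda: Dense(256, activation=C.relu)),
        Dense(10)])

    # equivalent style of composition with the >> operator
    model2 = Dense(256, activation=C.relu) >> Dense(10)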

41 CNTK unique features Symbolic loops over sequences with dynamic scheduling Turn graph into parallel program through minibatching Unique parallel training algorithms (1-bit SGD, Block Momentum)

42 Symbolic loops over sequential data: extend our example to a recurrent network (RNN). h_1(t) = s(W_1 x(t) + H_1 h_1(t-1) + b_1), in code h1 = sigmoid(x @ W1 + past_value(h1) @ H1 + b1); h_2(t) = s(W_2 h_1(t) + H_2 h_2(t-1) + b_2), in code h2 = sigmoid(h1 @ W2 + past_value(h2) @ H2 + b2); P(t) = softmax(W_out h_2(t) + b_out), in code P = softmax(h2 @ Wout + bout); ce(t) = -y^T(t) log P(t), in code ce = cross_entropy(P, y), with the training objective summing ce(t) over the corpus. Note there is no explicit notion of time in the code: past_value() expresses the recurrence symbolically.

46 Symbolic loops over sequential data (as a graph). Delay nodes (z^-1) feed h_1 back through H_1 and h_2 back through H_2, alongside the feed-forward path through W_1, W_2, W_out and the biases, ending in softmax and cross_entropy as before. CNTK automatically unrolls these cycles with deferred computation; the result is efficient and composable. A sketch with the layers API follows.
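
A sketch of such a symbolic loop with the layers API (an LSTM step is substituted for the plain sigmoid recurrence; sizes are illustrative):

    import cntk as C

    x = C.sequence.input_variable(784)               # a variable-length sequence
    h = C.layers.Recurrence(C.layers.LSTM(200))(x)   # symbolic loop over time steps
    P = C.layers.Dense(10, activation=C.softmax)(C.sequence.last(h))

Recurrence() builds the cycle through past_value() internally; no explicit time index appears in user code.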

47 Parallel sequences: batch-scheduling of variable-length sequences. Minibatches containing sequences of different lengths are automatically packed and padded, so time steps are computed in parallel across sequences (diagram: sequences 1-7 packed into parallel rows, shorter ones followed by padding). CNTK handles the special cases: the past_value operation correctly resets state and gradient at sequence boundaries; non-recurrent operations just pretend there is no padding (garbage-in/garbage-out); sequence reductions exclude the padding.

55 Parallel sequences: batch-scheduling of variable-length sequences (continued). The speed-up is automatic: in a speed comparison on RNNs, optimized multi-sequence batching clearly outperforms naïve single-sequence execution (chart: naïve vs. optimized).
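
The packing is transparent in user code: a minibatch passed to a sequence model may simply be a list of sequences of different lengths (a sketch with made-up shapes and zeros as data):

    import cntk as C
    import numpy as np

    x = C.sequence.input_variable(3)            # sequences of 3-dimensional vectors
    h = C.layers.Recurrence(C.layers.GRU(4))(x)

    # three sequences of lengths 5, 2, and 7; CNTK packs and pads them internally
    batch = [np.zeros((5, 3), dtype=np.float32),
             np.zeros((2, 3), dtype=np.float32),
             np.zeros((7, 3), dtype=np.float32)]
    out = h.eval({x: batch})                    # returns 3 outputs of lengths 5, 2, 7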

56 Data-parallel training. Data-parallelism: distribute each minibatch over workers (nodes 1-3) and all-reduce the partial gradients into their sum.

58 Data-parallel training. With the ring algorithm, the all-reduce costs O(2 (K-1)/K * M) communication per node for K nodes and gradient size M, i.e. O(1) w.r.t. K; for example, with K = 8 each node transfers about 1.75 M values, regardless of cluster size.

59 Data-parallel training. How to reduce communication cost: communicate less each time; communicate less often.

60 Data-parallel training. How to reduce communication cost: communicate less each time. 1-bit SGD [F. Seide, H. Fu, J. Droppo, G. Li, D. Yu: 1-Bit Stochastic Gradient Descent... Distributed Training of Speech DNNs, Interspeech 2014]: quantize gradients to 1 bit per value; the trick is to carry the quantization error over to the next minibatch. Each node (1-3) exchanges its gradients 1-bit quantized, keeping the residual locally.

62 Data-parallel training. How to reduce communication cost: communicate less each time, with 1-bit SGD [F. Seide, H. Fu, J. Droppo, G. Li, D. Yu: 1-Bit Stochastic Gradient Descent... Distributed Training of Speech DNNs, Interspeech 2014], quantizing gradients to 1 bit per value and carrying the quantization error over to the next minibatch; and communicate less often, via automatic minibatch sizing [F. Seide, H. Fu, J. Droppo, G. Li, D. Yu: On Parallelizability of Stochastic Gradient Descent..., ICASSP 2014] and block momentum [K. Chen, Q. Huo: Scalable Training of Deep Learning Machines by Incremental Block Training..., ICASSP 2016], a very recent and very effective parallelization method that combines model averaging with the error-residual idea. See the sketch below.
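
Both schemes are exposed as distributed learners that wrap a local learner (a sketch; the learning rate and block size are illustrative, the script must be launched via MPI, and 1-bit SGD requires a CNTK build with it enabled):

    import cntk as C

    def make_distributed_learner(z, use_1bit=True):
        lr = C.learning_rate_schedule(0.01, C.UnitType.minibatch)
        local = C.learners.sgd(z.parameters, lr)
        if use_1bit:
            # communicate less each time: 1-bit quantization with error residual
            return C.train.distributed.data_parallel_distributed_learner(
                local, num_quantization_bits=1)
        # communicate less often: block-momentum model averaging
        return C.train.distributed.block_momentum_distributed_learner(
            local, block_size=10000)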

63 Benchmark result of parallel training with CNTK. Training data: 2,670 hours of speech from real traffic of VS, SMD, and Cortana; about 16 and 20 days to train the DNN and LSTM on 1 GPU, respectively. The chart shows 1-bit and BMUF speedup factors in LSTM training (1bit-average, 1bit-peak, BMUF-average, BMUF-peak) on 8, 16, 32, and 64 GPUs. Credit: Yongqiang Wang, Kai Chen, Qiang Huo.

64 Results. Achievement: almost linear speedup without degradation of model quality, verified for training DNNs, CNNs, and LSTMs on up to 64 GPUs for speech recognition, image classification, OCR, and click-prediction tasks. Released in CNTK as a critical differentiator, used for enterprise-scale production data loads, and adopted as a production tool at other companies such as iFLYTEK and Alibaba.

65 Where to begin? On GitHub; tutorials (latest release); Azure Notebooks (try for free, pre-hosted); and seek help on Stack Overflow (please add the cntk tag).
