Parallel Implementation of Deep Learning Using MPI


CSE633 Parallel Algorithms (Spring 2014)
Instructor: Prof. Russ Miller
Team #13: Tianle Ma (tianlema@buffalo.edu)
May 7, 2014

Contents: Introduction to Deep Belief Networks; Parallel Implementation Using MPI; Experiment Results and Analysis

Why Deep Learning? Deep learning is a set of machine learning algorithms that attempt to model high-level abstractions in data using architectures composed of multiple non-linear transformations. Over the last two years it has been one of the hottest topics in speech recognition, computer vision, natural language processing, and applied mathematics. Deep learning is about representing high-dimensional data; a model is "deep" if it has more than one stage of non-linear feature transformation.

Today's Focus

Stacked Restricted Boltzmann Machines

p(v, h) = (1/Z) exp(h^T W v + b^T v + a^T h)
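As a reading aid (not on the original slides), the conditional distributions used by the Gibbs sampling steps of Algorithm 1 below follow from this joint distribution. With the bias symbols of the formula above (a for hidden units, b for visible units; note that Algorithm 1 instead writes the hidden bias as b and the visible bias as c):

  P(h_i = 1 | v) = sigm(a_i + Σ_j W_ij v_j)
  P(v_j = 1 | h) = sigm(b_j + Σ_i W_ij h_i)

where sigm(x) = 1 / (1 + e^(−x)) is the logistic sigmoid.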

Algorithm 1: RBMupdate(v0, ε, W, b, c)
Go up: for all hidden units i do
  Compute Q(h0i = 1 | v0) = sigm(b_i + Σ_j W_ij v0j) (for binomial units)
  Sample h0i from Q(h0i | v0)
end for
Go down: for all visible units j do
  Compute P(v1j = 1 | h0) = sigm(c_j + Σ_i W_ij h0i) (for binomial units)
  Sample v1j from P(v1j | h0)
end for
Go up: for all hidden units i do
  Compute Q(h1i = 1 | v1) = sigm(b_i + Σ_j W_ij v1j) (for binomial units)
end for

Algorithm 1: RBMupdate(v0, ε, W, b, c), parameter updates (contrastive divergence):
  W ← W + ε (h0 v0^T − Q(h1 = 1 | v1) v1^T)
  b ← b + ε (h0 − Q(h1 = 1 | v1))
  c ← c + ε (v0 − v1)
These updates adjust the model parameters; the hidden-unit activations give the learned feature representation.
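For concreteness, a minimal C sketch of one CD-1 update for a binary RBM (this is not the author's implementation; the row-major weight layout, the rand()-based sampling, and the C99 variable-length arrays are illustrative choices):

#include <math.h>
#include <stdlib.h>

/* Logistic sigmoid. */
static double sigm(double x) { return 1.0 / (1.0 + exp(-x)); }

/* Uniform sample in [0,1); illustrative only, a real implementation
   would use a better RNG than rand(). */
static double uniform(void) { return (double)rand() / ((double)RAND_MAX + 1.0); }

/* One CD-1 update for a binary RBM.
   v0:  visible input (nv values in {0,1})
   W:   nv*nh weights, row-major (W[j*nh + i] connects visible j to hidden i)
   b:   nh hidden biases, c: nv visible biases, eps: learning rate. */
void rbm_update(const double *v0, double *W, double *b, double *c,
                int nv, int nh, double eps)
{
    double h0[nh], q1[nh], v1[nv];

    /* Go up: sample h0 ~ Q(h | v0). */
    for (int i = 0; i < nh; i++) {
        double a = b[i];
        for (int j = 0; j < nv; j++) a += W[j*nh + i] * v0[j];
        h0[i] = (uniform() < sigm(a)) ? 1.0 : 0.0;
    }
    /* Go down: sample v1 ~ P(v | h0). */
    for (int j = 0; j < nv; j++) {
        double a = c[j];
        for (int i = 0; i < nh; i++) a += W[j*nh + i] * h0[i];
        v1[j] = (uniform() < sigm(a)) ? 1.0 : 0.0;
    }
    /* Go up again: only the probabilities Q(h1 = 1 | v1) are needed. */
    for (int i = 0; i < nh; i++) {
        double a = b[i];
        for (int j = 0; j < nv; j++) a += W[j*nh + i] * v1[j];
        q1[i] = sigm(a);
    }
    /* Contrastive-divergence parameter updates. */
    for (int j = 0; j < nv; j++)
        for (int i = 0; i < nh; i++)
            W[j*nh + i] += eps * (h0[i]*v0[j] - q1[i]*v1[j]);
    for (int i = 0; i < nh; i++) b[i] += eps * (h0[i] - q1[i]);
    for (int j = 0; j < nv; j++) c[j] += eps * (v0[j] - v1[j]);
}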

Algorithm 2: PreTrainDBN(x, ε, L, n, W, b)
Initialize b^0 = 0
for l = 1 to L do
  Initialize W^l = 0, b^l = 0
  while not stopping criterion do
    g^0 ← x
    for i = 1 to l − 1 do
      Sample g^i from Q(g^i | g^(i−1))
    end for
    RBMupdate(g^(l−1), ε, W^l, b^l, b^(l−1))
  end while
end for
Stacked RBMs: unsupervised learning of latent variables (higher-level feature representations).
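A sketch of the greedy layer-wise loop itself, reusing sigm(), uniform(), and rbm_update() from the previous sketch (assumed to be in the same file); the DBN struct, the sample_up() helper, and the fixed epoch count standing in for the stopping criterion are hypothetical simplifications, not the author's code:

/* Hypothetical DBN container for the greedy layer-wise loop. */
typedef struct {
    int L;        /* number of RBM layers                                    */
    int *n;       /* layer sizes: n[0] = input size, n[l] = size of layer l  */
    double **W;   /* W[l]: n[l-1] x n[l] weights of RBM l (layout as above)  */
    double **b;   /* b[l]: biases of layer l; b[0] = visible biases of RBM 1 */
} DBN;

/* Go up one layer: sample h ~ Q(h | v). */
static void sample_up(const DBN *d, int l, const double *v, double *h)
{
    for (int i = 0; i < d->n[l]; i++) {
        double a = d->b[l][i];
        for (int j = 0; j < d->n[l-1]; j++)
            a += d->W[l][j * d->n[l] + i] * v[j];
        h[i] = (uniform() < sigm(a)) ? 1.0 : 0.0;
    }
}

/* Algorithm 2 sketch: train layer l as an RBM on samples propagated
   through the already-trained layers 1..l-1. */
void pretrain_dbn(DBN *d, const double *data, int n_samples,
                  int n_epochs, double eps)
{
    int nmax = d->n[0];
    for (int l = 1; l <= d->L; l++)
        if (d->n[l] > nmax) nmax = d->n[l];

    for (int l = 1; l <= d->L; l++) {
        for (int epoch = 0; epoch < n_epochs; epoch++) {   /* "stopping criterion" */
            for (int s = 0; s < n_samples; s++) {
                double buf[2][nmax];                       /* C99 VLA scratch      */
                const double *g = data + (size_t)s * d->n[0];  /* g^0 = x          */
                for (int i = 1; i <= l - 1; i++) {         /* g^i ~ Q(g^i | g^(i-1)) */
                    sample_up(d, i, g, buf[i & 1]);
                    g = buf[i & 1];
                }
                /* RBMupdate(g^(l-1), eps, W^l, b^l, b^(l-1)) */
                rbm_update(g, d->W[l], d->b[l], d->b[l-1],
                           d->n[l-1], d->n[l], eps);
            }
        }
    }
}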

Algorithm 3: FineTuneDBN(x, y, ε, L, n, W, b)
μ^0(x) = x
for l = 1 to L do
  μ^l_j(x) = E[g^l_j | g^(l−1) = μ^(l−1)(x)] = sigm(b^l_j + Σ_k W^l_jk μ^(l−1)_k(x)) (for binomial units)
end for
Network output function: f(x) = V · (μ^L(x), 1)
Use stochastic gradient descent to iteratively minimize the cost function C(f(x), y) (back-propagation).
Supervised learning: fine-tune the model parameters.

Learning a Deep Belief Network
Step 1: Unsupervised generative pre-training of stacked RBMs (greedy layer-wise training)
Step 2: Supervised fine-tuning (back-propagation)
How many parameters to learn? With layer sizes n_0, ..., n_(L+1):
  Σ_{i=0}^{L} n_i n_{i+1} + Σ_{i=0}^{L+1} n_i
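For a sense of scale, a worked example with hypothetical layer sizes chosen to match the 784-pixel MNIST input used later (the two 500-unit hidden layers are an assumption, not from the slides):

  Layer sizes n_0..n_3 = 784, 500, 500, 10 (L = 2):
  weights: 784·500 + 500·500 + 500·10 = 647,000
  biases:  784 + 500 + 500 + 10 = 1,794
  total:   648,794 parameters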

Can We Scale Up? Deep learning methods have higher capacity and the potential to model data better. More features improve performance unless data is scarce. Given lots of data and lots of machines, can we scale up deep learning methods? MapReduce? No. MPI? Yes!

Model Parallelism: For large models, partition the model across several machines. Models with local connectivity structures tend to be more amenable to extensive distribution than fully-connected structures, given their lower communication costs. Communication, synchronization, and data transfer between machines must be managed. Reference implementation: Google DistBelief.

Data Parallelism (Asynchronous SGD): Divide the data into a number of subsets and run a copy of the model (a model replica) on each subset. Before processing each mini-batch, a model replica asks the parameter server for an updated copy of its model parameters (fetch parameters), computes a parameter gradient on its data (train DBN), and sends the gradient to the server (push gradients). The parameter server applies the gradient to the current value of the model parameters.

[Diagram: the parameter server holds the model parameters w; model replicas fetch parameters, process their training data (train the DBN), and push gradients back, and the server updates the parameters.]

Algorithm 4: AsynchronousSGD(α)
Procedure StartFetchingParameters(parameters)
  parameters ← GetParametersFromParameterServer()   // implemented with MPI_Get(parameters, ..., win)
Procedure StartPushingGradients(gradients)
  SendGradientsToParameterServer(gradients)          // implemented with MPI_Put(parameters, ..., win)
Main
  Global parameters, gradients
  while not stopping criterion do
    StartFetchingParameters(parameters)
    data ← GetNextMiniBatch()
    gradients ← ComputeGradient(parameters, data)    // train the DBN: takes most of the time
    parameters ← parameters + α · gradients          // update the parameters
    StartPushingGradients(gradients)
  end while
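A hedged C sketch of how the fetch and push steps could map onto MPI one-sided communication, following the MPI_Get / MPI_Put annotations on the slide (this is not the author's code: using rank 0 as the parameter server, the passive-target locking, the dummy compute_gradient(), and the parameter-vector length are all assumptions):

#include <mpi.h>

#define NPARAMS 1000          /* illustrative parameter-vector length */

/* Stand-in for the expensive part: in the real program this trains the DBN
   replica on one mini-batch; here it returns a dummy gradient so the sketch
   compiles and runs. */
static void compute_gradient(const double *params, double *grad, int n)
{
    for (int i = 0; i < n; i++) grad[i] = -params[i] * 0.001;  /* dummy */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Rank 0 acts as the parameter server: it exposes the parameter vector
       through an RMA window; the other ranks expose an empty window. */
    double *win_buf = NULL;
    MPI_Win win;
    if (rank == 0) {
        MPI_Alloc_mem(NPARAMS * sizeof(double), MPI_INFO_NULL, &win_buf);
        for (int i = 0; i < NPARAMS; i++) win_buf[i] = 0.0;
    }
    MPI_Win_create(win_buf, rank == 0 ? NPARAMS * sizeof(double) : 0,
                   sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    double params[NPARAMS], grad[NPARAMS];
    double alpha = 0.01;                       /* learning rate (illustrative) */

    for (int iter = 0; iter < 100; iter++) {   /* "while not stopping criterion" */
        /* Fetch parameters: MPI_Get from the server's window. */
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        MPI_Get(params, NPARAMS, MPI_DOUBLE, 0, 0, NPARAMS, MPI_DOUBLE, win);
        MPI_Win_unlock(0, win);

        /* Train the replica on the next mini-batch (dominates run time). */
        compute_gradient(params, grad, NPARAMS);

        /* Push gradients: accumulate the scaled gradient into the server's
           parameter vector. */
        for (int i = 0; i < NPARAMS; i++) grad[i] *= alpha;
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
        MPI_Accumulate(grad, NPARAMS, MPI_DOUBLE, 0, 0, NPARAMS, MPI_DOUBLE,
                       MPI_SUM, win);
        MPI_Win_unlock(0, win);
    }

    MPI_Win_free(&win);
    if (rank == 0) MPI_Free_mem(win_buf);
    MPI_Finalize();
    return 0;
}

Note that the slide's annotation shows MPI_Put for the push; this sketch substitutes MPI_Accumulate so that concurrent pushes from many replicas are well defined, which is a design choice here and may not match the author's implementation.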

Experiment: MNIST handwritten digit dataset. Each image is 28 × 28 = 784 pixels; 60,000 training images and 10,000 test images. The training data is partitioned into data shards, one per model replica, for parallelism.
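A minimal C sketch of the balanced partitioning described on the next slides, splitting the training images evenly across MPI ranks (the function and variable names and the remainder handling are assumptions, not from the slides):

/* Balanced partition of n_train images across `size` ranks (rank and size
   would come from MPI_Comm_rank / MPI_Comm_size): each rank gets
   n_train/size images, and the first n_train % size ranks get one extra,
   so shard sizes differ by at most one. */
void my_shard(int n_train, int rank, int size, int *start, int *count)
{
    int base = n_train / size;
    int rem  = n_train % size;
    *count = base + (rank < rem ? 1 : 0);
    *start = rank * base + (rank < rem ? rank : rem);
}

For example, with 60,000 training images and 16 total cores, each replica would train on 3,750 images.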

Cores vs. Time (#Iterations = 100): The training data is divided equally into #Total Cores partitions (balanced partitions). The smaller the training data per core, the less the training time, which dominates the total time. After about 10 partitions the per-core training data is small enough that training no longer dominates the total time, so the speed-up stops increasing linearly.

Cores vs. Speed-up (#Iterations = 100): Same setup and observations as the time plot: the training data is divided equally into #Total Cores partitions, and after about 10 partitions the per-core data is small enough that training no longer dominates the total time, so the speed-up no longer increases linearly.

Fixing #Nodes = 2, accuracy > 90%: Time begins increasing after about 20 data partitions. Reason: each data partition becomes too small to learn the model parameters well, so more iterations are needed to reach the same accuracy.

Fixing #Nodes = 2, accuracy > 90%: Speed-up begins decreasing after about 20 data partitions, for the same reason: each partition becomes too small to learn the model parameters well, so more iterations are needed to reach the same accuracy.

Fixing #Tasks Per Node (TPN) = 2: Time begins increasing after about 20 data partitions, again because each partition becomes too small to learn the model parameters well and more iterations are needed to reach the same accuracy. The results are similar to those with fixed #Nodes.

Fixing #Tasks Per Node (TPN) = 2: Speed-up begins decreasing after about 20 data partitions, for the same reason. The results are similar to those with fixed #Nodes, with slight differences.

#Total Cores vs. Time (accuracy > 90%): Time begins increasing after about 20 data partitions, because each partition becomes too small to learn the model parameters well and more iterations are needed to reach the same accuracy. In addition, inter-node communication costs more than intra-node communication.

#Total Cores vs. Speed-up (accuracy > 90%): Speed-up begins decreasing after about 20 data partitions, for the same reasons: each partition becomes too small to learn the model parameters well (requiring more iterations), and inter-node communication costs more than intra-node communication.

Results

Table 1: Fixing #Nodes = 2
#Node  #TPN  #Total Cores  Time/s  Speed-up
2      2     4             1413    3.1552
2      4     8             674     6.6152
2      6     12            405     11.0152
2      8     16            314     14.2152
2      10    20            318     14.0152
2      12    24            342     13.0152
2      14    28            428     10.4152
2      16    32            741     6.0152

Table 2: Fixing #Tasks Per Node (TPN) = 2
#Node  #TPN  #Total Cores  Time/s  Speed-up
2      2     4             1411    3.1552
4      2     8             728     6.1254
6      2     12            438     10.1854
8      2     16            337     13.2054
10     2     20            337     13.2054
12     2     24            365     12.2054
14     2     28            474     9.4054
16     2     32            856     5.2054

Communication between nodes (inter-node) costs more than communication within a node (intra-node).

Conclusion: There is a tradeoff between communication costs and computation costs, and inter-node communication costs more than intra-node communication. When each data partition is big, the DBN training time dominates and the speed-up on CCR using MPI is approximately linear. When the partition becomes small, it is insufficient to train a sophisticated DBN model, so more iterations are needed to reach a given accuracy, and performance can become significantly worse when the partition is too small. The effect depends on the dataset: the bigger the dataset, the more amenable it is to extensive distribution and the more pronounced the speed-up. In general, using the MPI framework to distribute large deep neural networks is a good choice; its efficiency and scalability have been demonstrated in industrial practice.

References
Russ Miller, Laurence Boxer. Algorithms Sequential & Parallel: A Unified Approach, 3rd edition, 2012.
Marc Snir, Steve Otto, et al. MPI: The Complete Reference, 2nd edition, 1998.
Jeffrey Dean, Greg S. Corrado, et al. Large Scale Distributed Deep Networks. NIPS, 2012.
Yoshua Bengio. Learning Deep Architectures for AI. Foundations and Trends in Machine Learning, 2009.
http://deeplearning.net/tutorial/