Distributed Optimization of Deeply Nested Systems

Distributed Optimization of Deeply Nested Systems
Miguel Á. Carreira-Perpiñán and Weiran Wang
Electrical Engineering and Computer Science, University of California, Merced
http://eecs.ucmerced.edu

Nested (hierarchical) systems: examples
Common in computer vision, speech processing, machine learning...
Object recognition pipeline: pixels → SIFT/HoG → k-means sparse coding → pooling → classifier → object category.
Phone classification pipeline: waveform → MFCC/PLP → classifier → phoneme label.
Preprocessing for regression/classification: image pixels → PCA/LDA → classifier → output/label.
Deep net: $x \to \{\sigma(w_i^T x + a_i)\} \to \{\sigma(w_j^T \{\sigma(w_i^T x + a_i)\} + b_j)\} \to \dots \to y$.
[Figure: deep net diagram $x \to \sigma \to \sigma \to \sigma \to y$ with weight layers $W_1, W_2, W_3, W_4$.]

Nested systems
Mathematically, they construct a (deeply) nested, parametric mapping from inputs to outputs: $f(x; W) = f_{K+1}(\dots f_2(f_1(x; W_1); W_2) \dots; W_{K+1})$.
Each layer (processing stage) has its own trainable parameters (weights) $W_k$.
Each layer performs some (nonlinear, nondifferentiable) processing on its input, extracting ever more sophisticated features from it (ex.: pixels → edges → parts).
Nearly always nonconvex and difficult to train: the composition of functions is nonconvex in general.
The ideal performance is obtained when the parameters at all layers are jointly optimised towards the overall goal (e.g. classification error). This work is about how to do this easily and efficiently.
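As a concrete illustration of the notation above, here is a minimal sketch of such a composition of layers (my own example, not from the talk); the layer sizes, the logistic nonlinearity and the random weights are assumptions made only for the illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def make_layer(W, b, nonlinearity=sigmoid):
    """Return f_k(.; W_k): an affine map followed by an elementwise nonlinearity."""
    return lambda x: nonlinearity(W @ x + b)

rng = np.random.default_rng(0)
dims = [256, 300, 100, 20]                       # layer sizes (assumed for the example)
layers = [make_layer(rng.standard_normal((dims[k + 1], dims[k])) * 0.1,
                     np.zeros(dims[k + 1]))
          for k in range(len(dims) - 1)]

def f_nested(x, layers):
    """f(x; W) = f_{K+1}(... f_2(f_1(x; W_1); W_2) ...; W_{K+1})."""
    for f_k in layers:
        x = f_k(x)
    return x

y = f_nested(rng.standard_normal(256), layers)   # forward pass through the nested system
```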

Training nested systems: backpropagated gradient
Apply the chain rule, layer by layer, to obtain a gradient wrt all the parameters. Ex.: for $g(F(\cdot))$, $\nabla_g\, g(F(\cdot)) = g'(F(\cdot))$ and $\nabla_F\, g(F(\cdot)) = g'(F(\cdot))\, F'(\cdot)$.
Then feed this gradient to a nonlinear optimiser: gradient descent, CG, L-BFGS, Levenberg-Marquardt, Newton, etc.
Disadvantages: requires differentiable layers; the gradient is cumbersome to compute, code and debug; requires nonlinear optimisation; vanishing gradients; ill-conditioning; slow progress even with second-order methods; difficult to parallelise.
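To make the chain-rule bookkeeping concrete, here is a small sketch (my own, not from the slides) that computes the backpropagated gradients of a two-layer linear least-squares objective and checks one entry against finite differences; the shapes and data are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.standard_normal(5), rng.standard_normal(3)
A, B = rng.standard_normal((4, 5)), rng.standard_normal((3, 4))   # two linear layers

def E(A, B):
    """Nested least-squares objective 1/2 ||y - B(Ax)||^2."""
    return 0.5 * np.sum((y - B @ (A @ x)) ** 2)

# Backpropagated (chain-rule) gradients.
z = A @ x                       # hidden representation
r = B @ z - y                   # residual
grad_B = np.outer(r, z)         # dE/dB
grad_A = np.outer(B.T @ r, x)   # dE/dA: the error is propagated back through B

# Finite-difference check of one entry of grad_A.
eps = 1e-6
A_pert = A.copy()
A_pert[0, 0] += eps
print(grad_A[0, 0], (E(A_pert, B) - E(A, B)) / eps)   # the two numbers should agree
```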

Training nested systems: layerwise, filter
Fix each layer sequentially from the input to the output. Fast and easy, but suboptimal. Sometimes used to initialise the parameters, then refine the model with backpropagation ("fine tuning"). Examples: pretraining approaches for deep nets, RBF networks, etc. (Hinton & Salakhutdinov 2006, Bengio et al. 2007).
Filter (vs wrapper) approach: consider a nested mapping $g(F(x))$ (e.g. $F$ reduces dimension, $g$ classifies): 1. First train $F$ (the "filter"), e.g. with PCA on the input data $\{x_n\}$. 2. Then fit a classifier $g$ with inputs $\{F(x_n)\}$ and labels $\{y_n\}$. $F$ can be seen as a fixed preprocessing or feature extraction stage, but it may not be the best possible preprocessing for classification.
Wrapper approach: optimise $F$ and $g$ jointly to minimise the classification error. More difficult, less frequently done.

Training nested systems: model selection
Finally, we also have to select the best architecture: the number of units or basis functions in each layer of a deep net, the number of filterbanks in a speech front-end processing, etc.
This requires a combinatorial search, training models for each hyperparameter choice and picking the best according to a model selection criterion, cross-validation, etc.
In practice, this is approximated using expert know-how: train only a few models and pick the best from there; fix the parameters of some layers irrespective of the rest of the pipeline.
Very costly in runtime, in effort and expertise required, and it leads to suboptimal solutions.

The method of auxiliary coordinates (MAC)
A general strategy to train all parameters of a nested system. Not an algorithm but a meta-algorithm (like EM).
Enjoys the benefits of layerwise training (fast, easy steps that reuse existing algorithms for shallow systems) but with optimality guarantees. Embarrassingly parallel iterations.
Basic idea: 1. Turn the nested problem into a constrained optimisation problem by introducing new parameters to be optimised over (the auxiliary coordinates). 2. Optimise the constrained problem with a penalty method. 3. Optimise the penalised objective function with alternating optimisation.
Result: alternate layerwise training steps with coordination steps.

The nested objective function
Consider for simplicity a single hidden layer, $x \to F(x) \to g(F(x))$, and a least-squares regression for inputs $\{x_n\}_{n=1}^N$ and outputs $\{y_n\}_{n=1}^N$:
$\min E_{\text{nested}}(F, g) = \frac{1}{2} \sum_{n=1}^N \| y_n - g(F(x_n)) \|^2$
$F$ and $g$ have their own parameters (weights). We want to find a local minimum of $E_{\text{nested}}$.
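A minimal sketch of this nested objective for the case where $F$ and $g$ are both linear maps (an assumption made only to keep the example short; the sizes and data below are arbitrary):

```python
import numpy as np

def E_nested(F, g, X, Y):
    """E_nested(F, g) = 1/2 * sum_n ||y_n - g(F(x_n))||^2 for linear F and g."""
    residuals = Y - (X @ F.T) @ g.T      # rows: g(F(x_n))
    return 0.5 * np.sum(residuals ** 2)

rng = np.random.default_rng(0)
N, dx, dz, dy = 100, 10, 3, 2            # sizes assumed for the example
X, Y = rng.standard_normal((N, dx)), rng.standard_normal((N, dy))
F, g = rng.standard_normal((dz, dx)), rng.standard_normal((dy, dz))
print(E_nested(F, g, X, Y))
```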

The MAC-constrained problem
Transform the problem into a constrained one in an augmented space:
$\min E(F, g, Z) = \frac{1}{2} \sum_{n=1}^N \| y_n - g(z_n) \|^2$ s.t. $z_n = F(x_n)$, $n = 1, \dots, N$.
For each data point, we turn the subexpression $F(x_n)$ into an equality constraint associated with a new parameter $z_n$ (the auxiliary coordinates).
Thus, a constrained problem with $N$ equality constraints and new parameters $Z = (z_1, \dots, z_N)$. We optimise over $(F, g)$ and $Z$ jointly. Equivalent to the nested problem.

The MAC quadratic-penalty function
We solve the constrained problem with the quadratic-penalty method: we minimise the following while driving the penalty parameter $\mu \to \infty$:
$\min E_Q(F, g, Z; \mu) = \frac{1}{2} \sum_{n=1}^N \| y_n - g(z_n) \|^2 + \frac{\mu}{2} \sum_{n=1}^N \| z_n - F(x_n) \|^2$ (constraints as quadratic penalties).
We can also use the augmented Lagrangian method (ADMM) instead:
$\min E_L(F, g, Z, \Lambda; \mu) = \frac{1}{2} \sum_{n=1}^N \| y_n - g(z_n) \|^2 + \sum_{n=1}^N \lambda_n^T (z_n - F(x_n)) + \frac{\mu}{2} \sum_{n=1}^N \| z_n - F(x_n) \|^2$
For simplicity, we focus on the quadratic-penalty method.
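Continuing the linear-layer sketch above, the quadratic-penalty objective can be written as follows (again, linear $F$ and $g$ are an assumption for illustration only):

```python
import numpy as np

def E_Q(F, g, Z, mu, X, Y):
    """E_Q = 1/2 sum_n ||y_n - g(z_n)||^2 + mu/2 sum_n ||z_n - F(x_n)||^2."""
    fit_term = 0.5 * np.sum((Y - Z @ g.T) ** 2)        # output layer vs targets
    penalty  = 0.5 * mu * np.sum((Z - X @ F.T) ** 2)   # constraints z_n = F(x_n)
    return fit_term + penalty
```

Note that if $Z$ is set to $F(x_n)$ for every $n$, the penalty vanishes and $E_Q$ equals $E_{\text{nested}}$ for any $\mu$.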

What have we achieved?
Net effect: unfold the nested objective into shallow additive terms connected by the auxiliary coordinates:
$E_{\text{nested}}(F, g) = \frac{1}{2} \sum_{n=1}^N \| y_n - g(F(x_n)) \|^2$ becomes $E_Q(F, g, Z; \mu) = \frac{1}{2} \sum_{n=1}^N \| y_n - g(z_n) \|^2 + \frac{\mu}{2} \sum_{n=1}^N \| z_n - F(x_n) \|^2$.
All terms are equally scaled, but uncoupled; vanishing gradients are less problematic.
The derivatives required are simpler: no backpropagated gradients, sometimes no gradients at all.
Optimising $E_{\text{nested}}$ follows a convoluted trajectory in $(F, g)$ space; optimising $E_Q$ can take shortcuts by jumping across $Z$ space. This corresponds to letting the layers mismatch during the optimisation.

Alternating optimisation of the MAC/QP objective
$(F, g)$ step, for $Z$ fixed: $\min_F \frac{1}{2} \sum_{n=1}^N \| z_n - F(x_n) \|^2$ and $\min_g \frac{1}{2} \sum_{n=1}^N \| y_n - g(z_n) \|^2$.
Layerwise training: each layer is trained independently (not sequentially): fit $F$ to $\{(x_n, z_n)\}_{n=1}^N$ (gradient needed: $F'(\cdot)$); fit $g$ to $\{(z_n, y_n)\}_{n=1}^N$ (gradient needed: $g'(\cdot)$).
This looks like a filter approach with intermediate features $\{z_n\}_{n=1}^N$.
Usually a simple fit, even convex. Can be done by reusing existing algorithms for shallow models: linear, logistic regression, SVM, RBF network, k-means, decision tree, etc. Does not require backpropagated gradients.
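A sketch of the $(F, g)$ step for the linear toy case, where both fits reduce to independent least-squares problems (using `np.linalg.lstsq` is my choice for the example, not something prescribed by the talk):

```python
import numpy as np

def W_step(X, Y, Z):
    """For fixed Z, fit each layer independently by least squares.
    F minimises sum_n ||z_n - F x_n||^2; g minimises sum_n ||y_n - g z_n||^2."""
    F = np.linalg.lstsq(X, Z, rcond=None)[0].T   # F: (dz, dx)
    g = np.linalg.lstsq(Z, Y, rcond=None)[0].T   # g: (dy, dz)
    return F, g
```

The two fits do not depend on each other, so they could also run in parallel.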

Alternating optimisation of the MAC/QP objective (cont.)
$Z$ step, for $(F, g)$ fixed: $\min_{z_n} \frac{1}{2} \| y_n - g(z_n) \|^2 + \frac{\mu}{2} \| z_n - F(x_n) \|^2$, $n = 1, \dots, N$.
The auxiliary coordinates are trained independently for each point: $N$ small problems (of size $|z|$) instead of one large problem (of size $N|z|$). They coordinate the layers.
The $Z$ step has the form of a proximal operator: $\min_z f(z) + \frac{\mu}{2} \| z - u \|^2$. The solution has a geometric flavour ("projection"). Often closed-form (depending on the model).
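For the linear-$g$ toy case used in the sketches above, each per-point problem has a ridge-regression-like closed form, derived here for the example (not quoted from the talk): $(g^T g + \mu I)\, z_n = g^T y_n + \mu F(x_n)$.

```python
import numpy as np

def Z_step(F, g, mu, X, Y):
    """For fixed (F, g), solve each per-point problem
       min_z 1/2 ||y_n - g z||^2 + mu/2 ||z - F x_n||^2
    in closed form: (g^T g + mu I) z_n = g^T y_n + mu F x_n."""
    dz = F.shape[0]
    A = g.T @ g + mu * np.eye(dz)       # same system matrix for every point
    B = Y @ g + mu * X @ F.T            # rows: g^T y_n + mu F x_n
    return np.linalg.solve(A, B.T).T    # rows: z_n
```

Each row of `B` corresponds to one data point, so the $N$ subproblems are independent and trivially parallelisable.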

Alternating optimisation of the MAC/QP objective (cont.)
MAC/QP is a coordination-minimisation (CM) algorithm: M step: minimise (train) the layers; C step: coordinate the layers.
The coordination step is crucial: it ensures we converge to a minimum of the nested function (which layerwise training by itself does not do).

MAC in general
MAC applies very generally: $K$ layers, $f(x; W) = f_{K+1}(\dots f_2(f_1(x; W_1); W_2) \dots; W_{K+1})$; various loss functions, full/sparse layer connectivity, constraints...
The resulting optimisation algorithm depends on: the type of layers $f_k$ used (linear, logistic regression, SVM, RBF network, decision tree...); how/where we introduce the auxiliary coordinates (at no/some/all layers, before/after the nonlinearity; we can redefine $Z$ during the optimisation); how we optimise each step (choice of method; inexact steps, warm starts, caching factorisations, stochastic updates...).
Convergence guarantee: under mild assumptions, MAC/QP defines a continuous path $(W^*(\mu), Z^*(\mu))$ that converges to a local minimum of the constrained problem and thus to a local minimum of the nested problem. In practice, we follow this path loosely.

MAC in general: the design pattern
How to train your system using auxiliary coordinates: 1. Write your nested objective function $E_{\text{nested}}(W)$. 2. Identify subexpressions and turn them into auxiliary coordinates with equality constraints. 3. Apply the quadratic-penalty or augmented Lagrangian method. 4. Do alternating optimisation: the W step typically reuses a single-layer training algorithm; the Z step needs to be solved specially for your problem (a proximal operator), and in many important cases is closed-form or simple to optimise.
This is similar to deriving an EM algorithm: define your probability model, write the log-likelihood objective function, identify the hidden variables, write the complete-data log-likelihood, obtain the E and M steps, and solve them.
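Putting the pieces together, here is a sketch of the whole design pattern for the linear toy example developed above. It assumes the `E_nested`, `W_step` and `Z_step` functions sketched earlier; the penalty schedule and iteration counts are arbitrary choices for illustration, not values from the talk.

```python
import numpy as np

def mac_qp(X, Y, dz, mu_schedule=(1.0, 4.0, 16.0, 64.0), inner_iters=20, seed=0):
    """MAC/QP on the toy two-layer linear least-squares model:
    alternate (F, g) and Z steps while driving the penalty parameter mu upwards."""
    rng = np.random.default_rng(seed)
    F = rng.standard_normal((dz, X.shape[1])) * 0.1
    g = rng.standard_normal((Y.shape[1], dz)) * 0.1
    Z = X @ F.T                               # start with the constraints satisfied
    for mu in mu_schedule:                    # quadratic-penalty path
        for _ in range(inner_iters):          # alternating optimisation of E_Q
            F, g = W_step(X, Y, Z)            # M step: train the layers independently
            Z = Z_step(F, g, mu, X, Y)        # C step: coordinate the layers
    return F, g

# Usage on the made-up data from the earlier sketches:
rng = np.random.default_rng(0)
X, Y = rng.standard_normal((100, 10)), rng.standard_normal((100, 2))
F, g = mac_qp(X, Y, dz=3)
print(E_nested(F, g, X, Y))
```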

Experiment: deep sigmoidal autoencoder
USPS handwritten digits, 256-300-100-20-100-300-256 autoencoder ($K = 5$ logistic layers), auxiliary coordinates at each hidden layer, random initial weights. The W and Z steps use Gauss-Newton.
[Figure: network diagram $x \to z_1 \to z_2 \to z_3 \to y = x$ with sigmoidal layers and weights $W_1, \dots, W_4$; plot of the objective function vs runtime (hours) for $\mu = 1$, comparing SGD, CG, MAC, Parallel MAC, MAC (minibatches) and Parallel MAC (minibatches).]

Another example: low-dimensional SVM
W. Wang and M. Á. Carreira-Perpiñán: The role of dimensionality reduction in classification, AAAI 2014.
Wrapper approach to learn a nonlinear classifier $y = g(F(x))$ where $F$ is a nonlinear mapping that reduces dimension and $g$ is a linear SVM:
$\min_{F, g, \xi} \lambda R(F) + \frac{1}{2} \| w \|^2 + C \sum_{n=1}^N \xi_n$ s.t. $y_n (w^T F(x_n) + b) \ge 1 - \xi_n$, $\xi_n \ge 0$, $n = 1, \dots, N$.
MAC/QP results in the following iteration:
W step: train, in parallel and reusing existing algorithms, a nonlinear regression $F$ on $\{(x_n, z_n)\}_{n=1}^N$ and a linear SVM classifier $g$ on $\{(z_n, y_n)\}_{n=1}^N$.
Z step: comes out in closed form; for each data point $n = 1, \dots, N$, set $z_n = F(x_n) + \gamma_n y_n w$ for a certain $\gamma_n \in \mathbb{R}$.
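One way to obtain such a $\gamma_n$ is sketched below for the quadratic-penalty objective with a hinge loss; this is my own derivation for illustration and not necessarily the exact expression used in the paper.

```python
import numpy as np

def z_step_hinge(u, y, w, b, C, mu):
    """Solve min_z C*max(0, 1 - y*(w@z + b)) + mu/2 * ||z - u||^2 for one point,
    where u = F(x_n), y in {-1, +1} and w is the (nonzero) SVM weight vector.
    The optimum has the form z = u + gamma*y*w; the hinge-plus-quadratic 1-D
    problem in gamma is minimised by a clipped value."""
    margin = y * (w @ u + b)
    gamma = np.clip((1.0 - margin) / (w @ w), 0.0, C / mu)
    return u + gamma * y * w
```

If the point is already classified with margin at least 1, `gamma` is 0 and $z_n = F(x_n)$; otherwise the coordinate is pulled along $y_n w$ towards the correct side, by at most $C/\mu$.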

Model selection on the fly
Model selection criteria (AIC, BIC, MDL, etc.) separate over layers: $E(W) = E_{\text{nested}}(W) + C(W)$ = nested error + model cost, with $C(W)$ proportional to the total number of parameters, $|W_1| + \dots + |W_K|$.
Traditionally, a grid search (with $M$ values per layer) means testing an exponential number $M^K$ of nested models. In MAC, the cost $C(W)$ separates over layers in the W step, so each layer can do model selection independently of the others, testing a polynomial number $MK$ of shallow models. This still provably minimises the overall objective $E(W)$. Instead of a criterion, we can do cross-validation in each layer.
In practice, there is no need to do model selection at each W step: the algorithm usually settles in a region of good architectures early during the optimisation, with small and infrequent changes thereafter.
MAC searches over the parameter space of the architecture and over the space of architectures itself, in polynomial time, iteratively.
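A sketch of the idea that model selection separates over layers in the W step: a single layer scores a few candidate sizes using only its own fit error plus a parameter-count cost. The RBF layer, the candidate grid and the cost weight below are assumptions for the example, not the talk's settings.

```python
import numpy as np

def rbf_features(X, centers, sigma=1.0):
    """Gaussian RBF features phi(x) for a fixed set of centers."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def select_layer_size(X, T, candidates, cost_weight=1.0, seed=0):
    """Pick the number of basis functions M for one layer by minimising
    (least-squares error on its own inputs/targets) + cost_weight * (#parameters).
    In MAC, every layer can run this independently inside the W step."""
    rng = np.random.default_rng(seed)
    best = None
    for M in candidates:
        centers = X[rng.choice(len(X), size=M, replace=False)]
        Phi = rbf_features(X, centers)
        W = np.linalg.lstsq(Phi, T, rcond=None)[0]
        err = 0.5 * np.sum((T - Phi @ W) ** 2)
        score = err + cost_weight * (W.size + centers.size)
        if best is None or score < best[0]:
            best = (score, M, centers, W)
    return best[1:]            # chosen M, its centers and weights

# Usage on made-up data: each layer would call this with its own (inputs, targets).
rng = np.random.default_rng(0)
X, T = rng.standard_normal((200, 5)), rng.standard_normal((200, 3))
M, centers, W = select_layer_size(X, T, candidates=[10, 20, 40])
```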

Experiment: RBF autoencoder (model selection)
COIL object images, 1024-$M_1$-2-$M_2$-1024 autoencoder ($K = 3$ hidden layers), AIC model selection over $(M_1, M_2)$ in $\{150, \dots, 1368\}$ (50 values, hence $50^2$ possible models).
[Figure: network diagram with $M_1$ and $M_2$ basis functions and weights $C_1, W_1, C_2, W_2$; plot of the objective function vs iteration as $\mu$ goes from 1 to 5, annotated with the architectures selected along the way, e.g. (1368, 1368), (1368, 150), (1050, 150), (700, 150).]

Distributed optimisation with MAC
MAC/QP is embarrassingly parallel.
W step: all layers separate ($K + 1$ independent subproblems); often, all units within each layer separate (one independent subproblem for each unit's input weight vector); the model selection steps also separate (test each model independently).
Z step: all points separate ($N$ independent subproblems).
Enormous potential for parallel implementation, unlike other machine learning or optimisation algorithms where subproblems are not independent (e.g. SGD). Suitable for large-scale data.
Shared-memory multiprocessor model, Matlab Parallel Processing Toolbox: [Figure: speedup vs number of processors (up to 12) for the sigmoidal deep net, the RBF autoencoder, and the RBF autoencoder with learned architecture.]
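A sketch of how the independence of the Z-step subproblems can be exploited, here with Python's process pool over chunks of data points; the chunking, worker count and linear-$g$ toy model are my own choices for illustration (the talk's experiments used Matlab, not this code).

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def z_step_chunk(args):
    """Solve the per-point Z subproblems for one chunk of data (linear-g toy case)."""
    F, g, mu, X_chunk, Y_chunk = args
    A = g.T @ g + mu * np.eye(F.shape[0])
    B = Y_chunk @ g + mu * X_chunk @ F.T
    return np.linalg.solve(A, B.T).T

def parallel_Z_step(F, g, mu, X, Y, n_workers=4):
    """Distribute the independent per-point subproblems across processes."""
    X_chunks = np.array_split(X, n_workers)
    Y_chunks = np.array_split(Y, n_workers)
    jobs = [(F, g, mu, Xc, Yc) for Xc, Yc in zip(X_chunks, Y_chunks)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return np.vstack(list(pool.map(z_step_chunk, jobs)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, Y = rng.standard_normal((1000, 10)), rng.standard_normal((1000, 2))
    F, g = rng.standard_normal((3, 10)), rng.standard_normal((2, 3))
    Z = parallel_Z_step(F, g, mu=1.0, X=X, Y=Y)
```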

Conclusion: the method of auxiliary coordinates (MAC)
Jointly optimises a nested function over all its parameters. Restructures the nested problem into a sequence of iterations with independent subproblems; a coordination-minimisation algorithm: M step: minimise (train) the layers; C step: coordinate the layers.
Advantages: easy to develop, reuses existing algorithms for shallow models; convergent; embarrassingly parallel; can work with nondifferentiable or discrete layers; can do model selection on the fly.
Widely applicable in machine learning, computer vision, speech, NLP, etc.
Work partially supported by NSF CAREER award IIS 0754089 and a Google Faculty Research Award.