Facing Non-Convex Optimization to Scale Machine Learning to AI


1 Facing Non-Convex Optimization to Scale Machine Learning to AI October 10th 2006 Thanks to: Yann Le Cun, Geoffrey Hinton, Pascal Lamblin, Olivier Delalleau, Nicolas Le Roux, Hugo Larochelle

2 Machine Learning for Artificial Intelligence Knowledge-based AI has stalled: extracting knowledge from humans and formalizing it into a coherent AI is too labor-intensive, and it does not work because much human knowledge is not explicit. The promise of Machine Learning to solve AI: use data and learn the AI tasks from examples. Remains elusive! Here: examine limitations, relevant to AI, of a large class of non-parametric approaches (e.g. SVMs) that enjoy easy (convex) optimization, and discuss approaches that do not suffer from these limitations but are non-convex.

3 ML for AI : Desiderata Need a large number of examples n because we are learning complex functions. Most examples will be unlabeled: need semi-supervised learning. Need to be able to efficiently represent such complex functions: statistical scaling. Computational scaling should be O(n): online learning. Human labor required should NOT increase linearly with the number of sub-tasks. Many interrelated tasks in the world of humans: need multi-task learning.

4 No Free Lunch but Broad Priors for AI No Free Lunch theorems for ML: no completely general learning algorithm exists. Restrict to AI tasks, those that animals perform effortlessly: perception, control; for higher animals and humans: long-term prediction, reasoning, planning, language. Should we hand-craft priors for each particular task (e.g. recognizing one class of objects in images)? Overwhelming human labor would be required. Or can we hope to find a few broad priors (i.e. learning principles) that cover most AI tasks?

5 Kernel Machines f(x) = b + Σ_i α_i K(x, x_i). In some cases K depends mildly on the data (e.g. normalization). Used in Support Vector Machines (SVMs) and Gaussian Processes, but also in unsupervised manifold learning (LLE, Isomap, kernel PCA, etc.) and non-parametric semi-supervised learning (based on a neighborhood graph). Easy optimization (analytic or convex optimization problems), but scaling may still be unacceptable w.r.t. the number of training examples (e.g. quadratic). Kernel machines usually embody a smoothness prior (x ≈ y ⇒ f(x) ≈ f(y)) through a local kernel function K(x, y) that is large for x near y. Unsurprisingly, relying only on this prior is inadequate for learning functions with many variations, such as in AI.
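To make the form of such a predictor concrete, here is a minimal sketch (not from the talk) of a Gaussian-kernel machine in numpy; the coefficients alpha and bias b are made-up stand-ins for what a convex solver (an SVM or Gaussian-process fit) would produce.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Local kernel: large when x is near y, essentially 0 when x is far from y."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def kernel_machine(x, train_X, alpha, b, sigma=1.0):
    """f(x) = b + sum_i alpha_i K(x, x_i); each alpha_i only matters for x near x_i."""
    return b + sum(a * gaussian_kernel(x, xi, sigma) for a, xi in zip(alpha, train_X))

# Toy usage with hypothetical coefficients (in practice alpha and b come from convex training).
train_X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
alpha, b = np.array([0.5, -0.3, 0.8]), 0.1
print(kernel_machine(np.array([0.9, 1.1]), train_X, alpha, b))
```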

6 Shallow vs Deep Architectures 1. Linear predictors f(x) = w · φ(x), with φ(x) low-dimensional. 2. Kernel machines f(x) = Σ_i α_i K(x, x_i) (the kernel trick allows high-dimensional φ(x)). 3. 2-layer machines: truly adaptive kernel; RBF networks, standard 1-hidden-layer neural nets, boosting. 4. Deep architectures: many levels (4 to 12 reported to date). All of the above (except 1, if φ is low-dimensional) can theoretically approximate any function. Good thing about types 1 and 2: convex optimization programs. But what is the price?

7 The Depth-Breadth Tradeoff The disadvantage of shallow architectures is inefficient representation; the worst case can be exponentially bad. Examples from boolean circuit theory: Multiplier circuits (N bits × N bits) can be shallow, needing O(2^N) elements, or deep, needing O(N log N) elements. The DNF (shallow) representation of an O(N) formula may require O(2^N) terms. N-bit parity: shallow needs O(2^N) elements, deep needs O(N log N) elements with log N levels. FFT: the shallow (matrix) representation needs O(N^2) operations, the Fast Fourier Transform needs O(N log N) with log N levels. (See also Utgoff 2002.)
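The parity case can be made concrete with a toy sketch (mine, not the slides'), assuming Python: a balanced XOR tree is a depth-log2(N) circuit using N-1 two-input gates, while a flat DNF needs one AND term per odd-parity input, i.e. 2^(N-1) terms.

```python
from itertools import product

def parity_tree(bits):
    """Deep representation: balanced XOR tree, N-1 gates, depth about log2(N)."""
    if len(bits) == 1:
        return bits[0]
    mid = len(bits) // 2
    return parity_tree(bits[:mid]) ^ parity_tree(bits[mid:])

def parity_dnf_terms(n):
    """Shallow (DNF) representation: one AND term per odd-parity assignment."""
    return [bits for bits in product([0, 1], repeat=n) if sum(bits) % 2 == 1]

n = 8
print(len(parity_dnf_terms(n)))               # 2**(n-1) = 128 terms in the shallow form
print(parity_tree([1, 0, 1, 1, 0, 0, 1, 0]))  # deep form uses only n-1 = 7 XOR gates
```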

8 Myopic vs Far-Reaching Learning Algorithms Current algorithms are myopic because they must rely on highly local data to characterize the data distribution. We should develop algorithms that allow one to generalize far from the training set, for example by sharing information about global parameters that describe the structure of the manifold. DEEP ARCHITECTURES ALLOW NON-LOCAL GENERALIZATION. But they come at a price: non-convex optimization.

11 Local Learning Algorithms A learned parameter of the model influences the value of the learned function in a local area of the input domain. With a local kernel machine f(x) = Σ_i α_i K(x, x_i), α_i only influences f(x) for x near x_i. Examples: nearest-neighbor algorithms, local kernel machines, and most non-parametric models (but not multi-layer neural networks).

13 Mathematical Problem with Local Learning Theorem: with K the Gaussian kernel and f(·) changing sign at least 2k times along some straight line (i.e. that line crosses the decision surface at least 2k times), at least k examples are required. (Figure: a straight line crossing a wiggly decision surface many times, alternating between the two classes.) With local kernels, learning a function that has many bumps requires as many examples as bumps.

16 The Curse of Dimensionality Mathematical problem with classical non-parametric models: we may need examples for each probable combination of the variables of interest. OK for 2 or 3 variables, NOT OK for abstract concepts...

19 Mathematical Problem with Local Kernels Theorem: with K the Gaussian kernel, and the goal of learning a maximally varying binary function (f(x) ≠ f(x′) whenever ‖x − x′‖ = 1) over d inputs, at least 2^(d−1) examples are required. We need to cover the space of possibilities with examples, which may require a number of examples exponential in the number of inputs ⇒ strongly negative mathematical results for local kernel machines. Other similar results appear in (Bengio, Delalleau, Le Roux, NIPS 2005).
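The flavour of this result can be checked numerically; below is a small experiment sketch (not from the talk), assuming numpy and scikit-learn: an RBF-kernel SVM trained on half of the d-bit hypercube is expected to sit near chance on the held-out half of the parity problem.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
d = 10
X = np.array(np.meshgrid(*[[0, 1]] * d)).reshape(d, -1).T   # all 2^d binary configurations
y = X.sum(axis=1) % 2                                        # parity target

perm = rng.permutation(len(X))
train, test = perm[: len(X) // 2], perm[len(X) // 2 :]       # see only half the hypercube

clf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X[train], y[train])
print("train accuracy:", clf.score(X[train], y[train]))
print("test accuracy :", clf.score(X[test], y[test]))        # expected to be near chance (~0.5)
```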

20 Local Manifold Learning : Local Linear Patches Current manifold learning algorithms cannot handle highly curved manifolds because they are based on locally linear patches estimated locally (possibly aligned globally). (Figure: a high-contrast image and its shifted version, with the corresponding tangent images and tangent directions.)

21 Local Manifold Learning Algorithms We have shown that LLE, Isomap, kernel PCA, Laplacian Eigenmaps, etc. are kernel machines with a local kernel (Bengio et al 2005). Local manifold learning algorithms derive information about the manifold structure near x using mostly the neighbors of x. For LLE, kernel PCA with Gaussian kernel, spectral clustering, and Laplacian Eigenmaps, K_D(x, y) ≈ 0 for x far from y, so e_k(x) only depends on the neighbors of x. Therefore the tangent plane ∂e_k(x)/∂x = (1/λ_k) Σ_{i=1}^{n} v_ki ∂K_D(x, x_i)/∂x also only depends on the neighbors of x. We can't say anything about the manifold structure near a new example x that is far from the training examples!
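As a concrete illustration of that formula, here is a minimal numpy sketch (my own, with hypothetical data): the eigenvectors v_k and eigenvalues λ_k of the Gram matrix give the out-of-sample embedding e_k(x) = (1/λ_k) Σ_i v_ki K(x, x_i), and far from the training set every K(x, x_i) vanishes, so the embedding carries no information there.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                        # hypothetical training points

# Gram matrix and its spectrum, as in kernel PCA / spectral embedding methods.
G = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])
lam, V = np.linalg.eigh(G)                          # columns of V are the eigenvectors v_k

def embed(x, k):
    """Out-of-sample e_k(x) = (1/lambda_k) * sum_i v_ki K(x, x_i)."""
    return (1.0 / lam[k]) * np.sum(V[:, k] * gaussian_kernel(x, X))

k = G.shape[0] - 1                                  # leading eigenpair (eigh sorts ascending)
print(embed(X[0], k))                               # near the data: informative coordinate
print(embed(X[0] + 100.0, k))                       # far from the data: ~0, no structure left
```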

22 The Curse of Dimensionality on a Manifold Similar to ordinary curse of dimensionality for classical non-parametric statistics, but where d = dimension of the manifold. Hurts all local manifold learning methods!

23 Fundamental Problems with Local Manifold Learning High noise: constraints not perfectly satisfied, data not strictly on the manifold; more noise ⇒ more data needed per local patch. High curvature: need more, smaller patches, O((1/r)^d) of them, with the patch radius r decreasing as curvature increases. High manifold dimension: O((1/r)^d) patches are needed (curse of dimensionality), with at least O(d) examples per patch (more with noise); e.g. r = 0.1 and d = 10 already give 10^10 patches. Many manifolds: e.g. images of transformed object instances give one manifold per instance or per object class; local manifold learning can't take advantage of shared structure across multiple manifolds.

24 Fat but Shallow Neural Networks are Equivalent to SVMs In Convex Neural Networks (Bengio, Le Roux, Vincent, Delalleau, Marcotte, 2006) we show an equivalence between shallow (1-hidden-layer) neural networks and SVMs or Gaussian Processes when the number of hidden units becomes large, using L2 regularization of the output weights. The only difference is in the type of kernel, but it still depends on the Euclidean distance between its arguments. So ordinary MLPs may have the disadvantages of SVMs (shallow architecture) without the advantages (convex optimization).

25 Non-Smooth Functions are Learnable Wrong common belief: that without strong prior knowledge, highly variable (non-smooth) functions are not learnable. Simple counter-examples: a prior can generally be encoded using Kolmogorov complexity, via an MDL strategy. Some highly variable functions are nonetheless simple functions in the C language: sine, parity. Such functions could be learned from only a few examples under a C-language MDL prior.
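A tiny illustration of the point (not from the slides): both functions below change value between very close inputs, yet each has a one-line description, so an MDL-style prior over short programs still assigns them high prior probability.

```python
import math

# Highly variable functions with tiny description length.
parity = lambda bits: sum(bits) % 2        # output flips whenever any single bit flips
wiggly = lambda x: math.sin(1000.0 * x)    # oscillates roughly 159 times on [0, 1]

print(parity([1, 0, 1, 1]), parity([1, 0, 1, 0]))   # neighbouring inputs, different outputs
print(wiggly(0.0), wiggly(0.0016))                  # nearby inputs, very different outputs
```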

26 Convex Optimization can be Too Slow Convex optimization for SVMs: between O(n^2) and O(n^3) CPU time for n examples, and O(n^2) memory. Convex optimization for manifold learning based on a neighborhood graph: between O(n^2) (LLE), O(n^3 log n) (Isomap) and O(n^7) (semi-definite embedding). Does not scale well as n grows. On the other hand, sub-optimal stochastic gradient descent for multi-layer neural networks (which only approaches a local minimum) can be applied online, i.e. it needs only constant computation per example, O(n) in total. Recent work on approximate and online optimization of SVMs (Bottou et al 2006) has better scaling properties, but the number of support vectors may still grow too fast.
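To make the O(n) claim concrete, here is a bare-bones numpy sketch of online stochastic gradient descent for a 1-hidden-layer net (illustrative only, with hypothetical data): each update touches a single example and costs a constant amount of work, and nothing of size n x n is ever stored or factored.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 10000, 20, 50                        # examples, input size, hidden units
X = rng.normal(size=(n, d))
y = (X[:, 0] * X[:, 1] > 0).astype(float)      # hypothetical binary target

W1, b1 = 0.1 * rng.normal(size=(d, h)), np.zeros(h)
w2, b2 = 0.1 * rng.normal(size=h), 0.0
lr = 0.05

for i in rng.permutation(n):                   # one online pass: O(1) work per example
    a = np.tanh(X[i] @ W1 + b1)                # hidden layer
    p = 1.0 / (1.0 + np.exp(-(a @ w2 + b2)))   # output probability
    g = p - y[i]                               # gradient of cross-entropy w.r.t. the logit
    w2 -= lr * g * a
    b2 -= lr * g
    ga = g * w2 * (1.0 - a ** 2)               # backpropagate through tanh
    W1 -= lr * np.outer(X[i], ga)
    b1 -= lr * ga
```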

27 Non-Convex Optimization of Deep Architectures What deep architectures are known? Various kinds of multi-layer neural networks with many layers. Except for a very special kind of architecture for machine vision (convolutional networks), deep architectures have been neglected in machine learning. Why? Training gets stuck in mediocre solutions (Tesauro 92). No hope?

28 Convex Optimization can be a Good Initialization Trading Convexity for Scalability, Collobert, Weston and Bottou (2005): shows that SVM classification can be significantly improved by continuing training (local optimization) with a non-convex, more discriminant criterion. In addition, this non-convex criterion can be applied to transductive SVMs, allowing them, for the first time, to be optimized in a reasonable time.

30 Greedy Learning of Abstractions Greedily learning simple things first, and higher-level abstractions on top of lower-level ones, seems like a good strategy and is psychologically plausible. Consistent with the psychological literature starting with Piaget: we learn baby math before arithmetic, before algebra, before differential equations... Also evidence from neurobiology: (Guillery 2005) Is postnatal neocortical maturation hierarchical?

33 Deep Networks Some functions can be represented very efficiently with a deep network, but require many more computational elements with a 1-layer or 2-layer network. E.g. d-bit parity: 1 adaptive layer (SVM): O(2^d) units and parameters required; 2 adaptive layers (neural net): d units, d^2 parameters; d-layer net: 2d units, 5d parameters; recurrent net: 2 units, 5 parameters.

37 Deep Belief Networks Geoff Hinton just introduced a deep network model (Hinton, Osindero and Teh, 2006) that provides more evidence that this direction is worthwhile: unsupervised learning of each layer, each trying to model the distribution of its inputs; unsupervised greedy layer-wise training serves as INITIALIZATION, replacing the traditional random initialization of multi-layer networks; beats state-of-the-art statistical learning in experiments on a large machine learning benchmark task (MNIST).

38 Greedy Layer-wise Initialization The principle of greedy layer-wise initialization proposed by Hinton can be generalized to other algorithms. We replaced the probabilistic model (Restricted Boltzmann Machine) used by Hinton for the unsupervised training of each layer by a simple auto-associator: find W which minimizes the cross-entropy loss in predicting x from sigmoid(W tanh(W x)). In this context W could be initialized using a convex + analytic heuristic.
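A rough numpy sketch of this auto-associator idea (my illustration under stated assumptions, not the authors' code, with two untied weight matrices): each greedy stage minimizes the cross-entropy between the input and its reconstruction through a tanh code layer and a sigmoid output layer, and the learned input-to-code weights then initialize that layer of the deep network before supervised fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 64)).astype(float)    # hypothetical inputs in [0, 1]

def train_autoassociator(X, n_hidden, lr=0.1, epochs=10):
    """One greedy layer: reconstruct x as sigmoid(W2 @ tanh(W1 @ x)), cross-entropy loss."""
    d = X.shape[1]
    W1 = 0.1 * rng.normal(size=(d, n_hidden))
    W2 = 0.1 * rng.normal(size=(n_hidden, d))
    for _ in range(epochs):
        for x in X:
            h = np.tanh(x @ W1)                          # code
            r = 1.0 / (1.0 + np.exp(-(h @ W2)))          # reconstruction
            g = r - x                                    # d(cross-entropy)/d(pre-sigmoid)
            gh = (g @ W2.T) * (1.0 - h ** 2)             # backpropagate through tanh
            W2 -= lr * np.outer(h, g)
            W1 -= lr * np.outer(x, gh)
    return W1

# Greedy stacking: each layer models the codes of the layer below; the W1 matrices
# then serve as the initialization of the deep network for supervised fine-tuning.
W_layer1 = train_autoassociator(X, 30)
H1 = 0.5 * (1.0 + np.tanh(X @ W_layer1))                 # rescale codes to [0, 1] for the next stage
W_layer2 = train_autoassociator(H1, 15)
```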

39 Experiments on Greedy Layer-wise Initialization Deep nets with 3 to 4 hidden layers. Compare SUPERVISED and UNSUPERVISED (auto-associator or DBN) greedy strategies. Classification error on MNIST training, validation, and test sets, with the best hyper-parameters according to validation error, with and without pre-training, using purely supervised or purely unsupervised pre-training:

                                         train.   valid.  test
DBN, unsupervised pre-training           0%       1.3%    1.4%
Deep net, auto-associator pre-training   0%       1.4%    1.4%
Deep net, supervised pre-training        0%       1.75%   2.0%
Deep net, no pre-training                0.004%   2.1%    2.4%
Shallow net, no pre-training             0.004%   1.8%    1.9%

Model selection settles on around 500 hidden units per layer. SUPERVISED GREEDY is TOO GREEDY.

40 It is Really an Optimization Problem Why 0% train error even with the deep net and no pre-training? Because the last, fat hidden layer did all the work. Classification error on MNIST with 20 hidden units on the top layer:

                                         train.   valid.  test
Deep net, auto-associator pre-training   0%       1.4%    1.6%
Deep net, supervised pre-training        0%       1.8%    1.9%
Deep net, no pre-training                0.59%    2.1%    2.2%
Shallow net, no pre-training             3.6%     4.7%    5.0%

YES, IT IS REALLY AN OPTIMIZATION PROBLEM, AND GREEDY UNSUPERVISED PRE-TRAINING HELPS A LOT.

41 Learning Visual Invariances A number of experiments from three different labs (Hinton, LeCun, Bengio) point to the inability of kernel machines to efficiently (in both the computational and statistical senses) learn from data involving many complex invariances, e.g. from the geometry of images: manifolds due to translation, rotation, scaling, shear, thickness, etc. Example: consider N objects in an image, each with K different geometric degrees of freedom, with enough curvature that M different values of each dimension must be considered: O(M^(KN)) templates are needed. We are investigating the ability of various non-local, non-shallow architectures to capture such invariances. But research on deep architectures is still very young!

42 Conclusions The needs of AI require ML for highly varying functions. Shallow architectures / local kernel machines with convex optimization do not deliver: the number of required examples grows linearly with the number of desired variations (curse-of-dimensionality arguments). We can trade convexity for scalability (computational and statistical)! Deep architectures were thought not to be trainable, but new methods appear to break through this obstacle.

43-46 References
Belkin, M., Matveeva, I., and Niyogi, P. (2004). Regularization and semi-supervised learning on large graphs. In Shawe-Taylor, J. and Singer, Y., editors, COLT. Springer.
Belkin, M. and Niyogi, P. (2003). Using manifold structure for partially labeled classification. In Becker, S., Thrun, S., and Obermayer, K., editors, Advances in Neural Information Processing Systems 15. MIT Press, Cambridge, MA.
Bengio, Y., Delalleau, O., and Le Roux, N. (2006). The curse of highly variable functions for local kernel machines. In Advances in Neural Information Processing Systems 18. MIT Press.
Bengio, Y., Delalleau, O., Le Roux, N., Paiement, J.-F., Vincent, P., and Ouimet, M. (2004a). Learning eigenfunctions links spectral embedding and kernel PCA. Neural Computation, 16(10).
Bengio, Y. and Larochelle, H. (2006). Non-local manifold Parzen windows. In Weiss, Y., Schölkopf, B., and Platt, J., editors, Advances in Neural Information Processing Systems 18. MIT Press.
Bengio, Y., Paiement, J., Vincent, P., Delalleau, O., Le Roux, N., and Ouimet, M. (2004b). Out-of-sample extensions for LLE, Isomap, MDS, Eigenmaps, and spectral clustering. In Thrun, S., Saul, L., and Schölkopf, B., editors, Advances in Neural Information Processing Systems 16. MIT Press.
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2).
Bourlard, H. and Kamp, Y. (1988). Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59.
Brand, M. (2003). Charting a manifold. In Becker, S., Thrun, S., and Obermayer, K., editors, Advances in Neural Information Processing Systems 15. MIT Press.
Chapelle, O., Weston, J., and Schölkopf, B. (2003). Cluster kernels for semi-supervised learning. In Becker, S., Thrun, S., and Obermayer, K., editors, Advances in Neural Information Processing Systems 15. MIT Press, Cambridge, MA.
Cox, T. and Cox, M. (1994). Multidimensional Scaling. Chapman & Hall, London.
Delalleau, O., Bengio, Y., and Le Roux, N. (2005). Efficient non-parametric function induction in semi-supervised learning. In Cowell, R. and Ghahramani, Z., editors, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS 2005), Barbados. Society for Artificial Intelligence and Statistics.
Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference.
Ghahramani, Z. and Hinton, G. (1996). The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, Dept. of Computer Science, University of Toronto.
Hinton, G. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8).
Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation.
Ng, A. Y., Jordan, M. I., and Weiss, Y. (2002). On spectral clustering: analysis and an algorithm. In Dietterich, T., Becker, S., and Ghahramani, Z., editors, Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, MA.
Rao, R. and Ruderman, D. (1999). Learning Lie groups for invariant visual perception. In Kearns, M., Solla, S., and Cohn, D., editors, Advances in Neural Information Processing Systems 11. MIT Press, Cambridge, MA.
Roweis, S. and Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500).
Rumelhart, D., Hinton, G., and Williams, R. (1986). Learning internal representations by error propagation. In Rumelhart, D. and McClelland, J., editors, Parallel Distributed Processing, volume 1, chapter 8. MIT Press, Cambridge.
Saul, L. and Roweis, S. (2002). Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4.
Saund, E. (1989). Dimensionality-reduction using connectionist networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(3).
Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10.
Szummer, M. and Jaakkola, T. (2002). Partially labeled classification with Markov random walks. In Dietterich, T., Becker, S., and Ghahramani, Z., editors, Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, MA.
Teh, Y. W. and Roweis, S. (2003). Automatic alignment of local representations. In Becker, S., Thrun, S., and Obermayer, K., editors, Advances in Neural Information Processing Systems 15. MIT Press.
Tenenbaum, J., de Silva, V., and Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500).
Tipping, M. and Bishop, C. (1999). Mixtures of probabilistic principal component analysers. Neural Computation, 11(2).
Torgerson, W. (1952). Multidimensional scaling, 1: Theory and method. Psychometrika, 17.
Vincent, P. and Bengio, Y. (2003). Manifold Parzen windows. In Becker, S., Thrun, S., and Obermayer, K., editors, Advances in Neural Information Processing Systems 15. MIT Press, Cambridge, MA.
Weiss, Y. (1999). Segmentation using eigenvectors: a unifying view. In Proceedings of the IEEE International Conference on Computer Vision.
Zhou, D., Bousquet, O., Navin Lal, T., Weston, J., and Schölkopf, B. (2004). Learning with local and global consistency. In Thrun, S., Saul, L., and Schölkopf, B., editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA.
Zhu, X., Ghahramani, Z., and Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. In ICML 2003.
