LSML19: Introduction to Large Scale Machine Learning
MINES ParisTech, March 2019
Chloé-Agathe Azencott
Centre for Computational Biology, Mines ParisTech
chloe-agathe.azencott@mines-paristech.fr
Practical information
- Course website: http://cazencott.info > Teaching > LSML 19
- Schedule: morning sessions 09:30-12:30; afternoon sessions 14:00-17:00.
- Lectures are here in L.118; practicals are in L.117, L.119 and L.120.
- Grade: 60% exam (Friday, 14:00), 40% RAMP.
Today's goals
- Review what machine learning is.
- Understand scalability issues.
- Discover how to accelerate gradient descent with stochastic gradient descent.
Acknowledgements
Slides inspired by Ala Al-Fuqaha, Ethem Alpaydın, Matthew Blaschko, Léon Bottou, Sanjiv Kumar, Trevor Hastie, Rob Tibshirani and Jean-Philippe Vert.
Why machine learning?
Business Insider: 2017 is the year of machine learning
- Improving doctors
- Assisting lawyers
- Making cars drive themselves
Perception
Communication
- Chatbot example from www.supportbots.ai. Text courtesy of an Eddie Izzard show from 1999 called Dressed to Kill.
Reasoning
Diagnosis
Scientific discovery
- LHC image: Anna Pantelia/CERN
A common thread: ML
- Using algorithms to build models from example (training) data.
- Statistics + optimization + computer science.
Empirical risk minimization
Ingredients:
- Data.
- Hypothesis class: shape of the decision function f.
- Loss function: cost/error of f on one data point.
Recipe: find, among all functions of the hypothesis class, one that minimizes the loss on the training data (the empirical risk).
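To make the recipe concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption chosen for illustration: the six data points, the hypothesis class (threshold classifiers on one feature), the 0/1 loss and the grid of candidate thresholds.

```python
# Toy empirical risk minimization (all choices below are illustrative).
# Data: 1-D points with binary labels.
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.5, 1), (3.0, 1), (3.5, 1)]


def make_classifier(t):
    """Hypothesis class: threshold classifiers f_t(x) = 1 if x > t else 0."""
    return lambda x: 1 if x > t else 0


def zero_one_loss(y_true, y_pred):
    """Loss function: cost of one prediction on one data point."""
    return 0 if y_true == y_pred else 1


def empirical_risk(f, data):
    """Average loss of f over the training data."""
    return sum(zero_one_loss(y, f(x)) for x, y in data) / len(data)


# Recipe: pick the hypothesis with the smallest empirical risk.
candidate_thresholds = [0.0, 1.0, 2.0, 3.0, 4.0]
best_t = min(candidate_thresholds,
             key=lambda t: empirical_risk(make_classifier(t), data))
best_risk = empirical_risk(make_classifier(best_t), data)
print(best_t, best_risk)
```

Swapping the hypothesis class or the loss changes the learning algorithm but not the recipe.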
(Un)supervised learning setting
- Data matrix (design matrix) X: n rows (observations, samples, data points) by p columns (features, descriptors, variables, attributes).
- Supervision y: t outcomes (targets, labels), for binary classification, multi-class classification or regression.
Typical sizes:
- Iris dataset: n = 150, p = 4, t = 1.
- Cancer drug sensitivity: n = 10^3, p = 10^6, t = 100.
- ImageNet: n = 14·10^6, p = 6·10^3, t = 22·10^3.
- Shopping, e-marketing...: n = O(10^6), p = O(10^9), t = O(10^8).
- Astronomy, GAFAs, web...: n = O(10^9), p = O(10^9), t = O(10^9).
Scaling ML algorithms
What is large scale?
- Data does not fit in RAM.
- Data streams.
- Algorithms do not run in a reasonable time on a single machine.
Important considerations:
- Performance increases with the number of samples.
- Likelihood to overfit increases with the number of features.
A brief zoo of ML problems
Unsupervised learning: learn a new representation of the data
- [Diagram: data matrix X (n × p; images, text, measurements, omics data...) → ML algo → new representation of the data]
Dimensionality reduction: find a lower-dimensional representation
- [Diagram: data matrix X (n × p) → ML algo → new data matrix (n × m)]
Clustering: group similar data points together
- [Diagram: data → ML algo]
Unsupervised learning
- Dimensionality reduction: PCA
- Clustering: k-means
- Density estimation
- Feature learning
Supervised learning: make predictions
- [Diagram: data matrix X (n × p) and labels y (n) → ML algo → predictor (decision function)]
Classification: make discrete predictions
- Binary classification
- Multi-class classification
Regression: make continuous predictions
Supervised learning
- Regression: ordinary least squares, ridge regression
- Classification: logistic regression, SVM
- Structured output prediction
Main ML paradigms
- Unsupervised learning: dimensionality reduction; clustering; density estimation; feature learning.
- Supervised learning: regression; classification; structured output prediction.
- Semi-supervised learning.
- Reinforcement learning.
Dimensionality reduction: Principal Components Analysis
Principal Components Analysis
Objective:
- Reduce the dimension without losing the variability in the data.
- Find a low-dimensional space that maximizes the variance of the data projected onto it.
The k-th principal component $w_k$:
- is orthogonal to all previous components: $w_k^\top w_j = 0$ for all $j < k$;
- captures the largest amount of remaining variance: $w_k$ maximizes $\|Xw\|^2$ over unit vectors $w$ satisfying this constraint (X centred).
Solution: $w_k$ is the k-th eigenvector of $X^\top X$.
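As an illustration of the eigenvector characterization, the sketch below finds the first principal component of a tiny 2-D data set. It uses plain power iteration, which is a simpler stand-in chosen here for readability and not necessarily the solver one would use in practice. The data and the helper names (`center`, `gram`, `first_principal_component`) are made up for this example.

```python
import math

# Small 2-D data set (made up): points spread mostly along (1, 1).
X = [(-2.0, -1.9), (-1.0, -1.1), (0.0, 0.1), (1.0, 0.9), (2.0, 2.0)]


def center(X):
    """Subtract the mean of each column."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]


def gram(X):
    """p x p matrix X^T X, computed in O(n p^2) operations."""
    p = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(p)]
            for i in range(p)]


def first_principal_component(C, n_iter=100):
    """Leading eigenvector of the symmetric matrix C by power iteration."""
    p = len(C)
    w = [1.0] + [0.0] * (p - 1)
    for _ in range(n_iter):
        w = [sum(C[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w


Xc = center(X)
C = gram(Xc)
w1 = first_principal_component(C)

# Variance of the data projected onto w1.
projected_variance = sum(
    sum(r[j] * w1[j] for j in range(len(w1))) ** 2 for r in Xc) / len(Xc)
print(w1, projected_variance)
```

The direction found is close to (1, 1)/√2, and the variance along it exceeds the variance along either original axis.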
PCA example: population genetics
Genetic data of 1387 Europeans (Novembre et al., 2008).
Algorithmic complexity of PCA
Memory: store the data X ($n \times p$), $O(np)$, and the covariance matrix $X^\top X$ ($p \times p$), $O(p^2)$.
Runtime:
- Computing $X^\top X$: $O(np^2)$.
- Computing K eigenvectors by Lanczos iterations: $O(Kp^2)$.
Computing the covariance matrix is more expensive than computing the first K principal components!
Example with $n = 10^9$, $p = 10^8$:
- Computing the covariance matrix: $10^{25}$ FLOPs. Fastest computer in the world (Nov 2018): 75 petaFLOPS, hence 2+ years.
- Storing the covariance matrix: $10^{16}$ B $= (10^{16}/2^{50})$ PB $\approx 8.9$ PB.
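The orders of magnitude in this example can be checked with a few lines of arithmetic. The sketch assumes one byte per matrix entry and 1 PB = 2^50 bytes, as the slide's own computation does, and a 365-day year. At 75 petaFLOPS the run time comes out at roughly four years, which is consistent with the "2+ years" lower bound.

```python
n, p = 10**9, 10**8

# Cost of forming X^T X, using the O(n p^2) count with constant 1.
flops_covariance = n * p**2

# 75 petaFLOPS, the figure quoted on the slide.
machine_flops_per_second = 75 * 10**15
seconds = flops_covariance / machine_flops_per_second
years = seconds / (365 * 24 * 3600)

# Storage: p^2 entries, assuming 1 byte per entry and 1 PB = 2**50 B.
entries = p**2
petabytes = entries / 2**50

print(f"{flops_covariance:.0e} FLOPs, {years:.1f} years, {petabytes:.1f} PB")
```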
Clustering: k-means
K-means clustering
Goal: find a cluster assignment that minimizes the intra-cluster variance:
$\min_{C_1, \dots, C_K} \sum_{k=1}^{K} \sum_{x \in C_k} \|x - \mu_k\|^2$, with centroids $\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x$.
- The clusters form a Voronoi tessellation of the space around the centroids.
- Finding the optimal assignment is NP-hard.
Iterative algorithm:
- Assignment step: fix the centroids, update the assignments.
- Update step: fix the assignments, update the centroids.
K-means clustering, step by step:
1. Pick 3 centroids at random.
2. Assign each observation to the nearest centroid.
3. Recompute centroids.
4. Re-assign each observation to the nearest centroid.
5. Recompute centroids, and iterate the process until convergence.
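The alternation between the two steps can be sketched in a few lines of plain Python. The data set, the choice of K = 2 and the fixed (non-random) initial centroids are assumptions made so that the example is reproducible. The function names are illustrative.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def assign(points, centroids):
    """Assignment step: index of the nearest centroid for each point."""
    return [min(range(len(centroids)),
                key=lambda k: squared_distance(x, centroids[k]))
            for x in points]


def update(points, labels, centroids):
    """Update step: each centroid becomes the mean of its cluster."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [x for x, label in zip(points, labels) if label == k]
        if not members:  # empty cluster: keep the old centroid
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(
            sum(x[j] for x in members) / len(members)
            for j in range(len(old))))
    return new_centroids


def k_means(points, centroids, max_iter=100):
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:  # converged: assignments are stable
            break
        labels = new_labels
    return labels, centroids


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
initial = [points[0], points[1]]  # fixed start instead of a random pick
labels, centroids = k_means(points, initial)
print(labels, centroids)
```

Even with both initial centroids in the same group, the iterations separate the two groups here. In general the result depends on the initialization, since the procedure only finds a local optimum.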
k-means complexity
- Assignment step (fix the centroids, update the assignments): compute n × K distances in p dimensions, $O(nKp)$.
- Update step (update the centroids): sum n values in p dimensions, $O(np)$.
- T iterations: $O(TnKp)$ in total.
- Storage: n cluster assignments + K centroids, $O(n + Kp)$; storing X, $O(np)$.
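Ridge regression
Linear regression
Least-squares fit (equivalent to MLE under the assumption of Gaussian noise):
$\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|^2$
Solution: $\hat{\beta} = (X^\top X)^{-1} X^\top y$, uniquely defined when $X^\top X$ is invertible.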
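Ridge regression (Hoerl & Kennard, 1970)
Goodness-of-fit (empirical risk) + ridge regularization:
$\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \|y - X\beta\|^2 + \lambda \|\beta\|^2$
Solution: $\hat{\beta}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$, unique and always exists (for $\lambda > 0$).
Limit cases:
- $\lambda \to 0$: OLS (non-regularized) solution (low bias, high variance);
- $\lambda \to \infty$: $\beta = 0$ (high bias, low variance).
Correlated features get similar weights.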
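The two limit cases are easy to see with a single feature and no intercept, where the closed-form ridge solution reduces to a ratio of two sums. The data and the helper name `ridge_1d` below are made up for illustration.

```python
def ridge_1d(x, y, lam):
    """Closed-form ridge solution for one feature, without intercept:
    beta = sum(x_i * y_i) / (sum(x_i ** 2) + lam)."""
    numerator = sum(xi * yi for xi, yi in zip(x, y))
    denominator = sum(xi * xi for xi in x) + lam
    return numerator / denominator


x = [1.0, 2.0, 3.0]
y = [2.1, 3.9, 6.2]

beta_ols = ridge_1d(x, y, 0.0)       # lambda = 0: ordinary least squares
beta_mild = ridge_1d(x, y, 1.0)      # moderate shrinkage
beta_strong = ridge_1d(x, y, 1e6)    # very large lambda: beta close to 0
print(beta_ols, beta_mild, beta_strong)
```

The weight shrinks monotonically towards 0 as λ grows.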
Complexity of ridge regression
- Computing $X^\top X$: $O(np^2)$.
- Inverting $X^\top X + \lambda I$: $O(p^3)$.
When $n \gg p$, computing $X^\top X$ is more expensive than inverting it!
Regularization path
- [Figure: regression weights and error (MSE) as functions of the regularization]
Ridge regression: hyperparameter setting
Setting λ
- [Figure: prediction error against model complexity, on training data and on new data, from underfitting to overfitting]
Setting λ
Data splitting strategy: cross-validation.
- Cut the training set into K equally-sized chunks.
- K folds: one chunk to test, the (K-1) others for training.
- Cross-validation score: performance averaged over the K folds.
- Compute this score for a grid of M values of λ: λ1, λ2, ..., λM.
- Choose the λ with the best cross-validation score.
This multiplies the time complexity by KM.
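A minimal sketch of this procedure, for ridge regression with one feature and no intercept. The toy data are noiseless (y = 2x), an assumption made so that the outcome is easy to predict: no regularization is needed, so the smallest λ should win. The helper names are illustrative.

```python
def ridge_1d(x, y, lam):
    """Closed-form ridge solution for one feature, without intercept."""
    return (sum(xi * yi for xi, yi in zip(x, y))
            / (sum(xi * xi for xi in x) + lam))


def k_fold_indices(n, K):
    """Split range(n) into K contiguous folds of (almost) equal size."""
    sizes = [n // K + (1 if i < n % K else 0) for i in range(K)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cv_score(x, y, lam, K):
    """Mean squared validation error, averaged over the K folds."""
    scores = []
    for fold in k_fold_indices(len(x), K):
        held_out = set(fold)
        x_train = [x[i] for i in range(len(x)) if i not in held_out]
        y_train = [y[i] for i in range(len(x)) if i not in held_out]
        beta = ridge_1d(x_train, y_train, lam)
        mse = sum((y[i] - beta * x[i]) ** 2 for i in fold) / len(fold)
        scores.append(mse)
    return sum(scores) / K


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]  # exactly y = 2x
grid = [0.0, 1.0, 10.0, 100.0]        # M = 4 candidate values of lambda

scores = {lam: cv_score(x, y, lam, K=3) for lam in grid}
best_lam = min(scores, key=scores.get)  # lower error is better
print(best_lam, scores)
```

With K = 3 folds and M = 4 grid values, the model is fitted K × M = 12 times.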
Ridge regression: ℓ2-regularized learning
ℓ2-regularized learning
Empirical risk + ℓ2 regularization:
$\min_{\beta} \sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda \|\beta\|^2$
Generalization of ridge regression to any loss L. If the loss is convex, then the problem is strictly convex and has a unique global solution, which can be found numerically.
- Absolute loss: $L(y, f(x)) = |y - f(x)|$
- Quadratic loss: $L(y, f(x)) = (y - f(x))^2$
- ε-insensitive loss: $L(y, f(x)) = \max(0, |y - f(x)| - \varepsilon)$
- Huber loss: mix of quadratic and linear.
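The four losses can be written as small functions of the target y and the prediction f. The parameterization of the Huber loss below (threshold `delta`, factor 1/2 on the quadratic part) is the usual textbook form and is an assumption here, as are the default values of `eps` and `delta`.

```python
def absolute_loss(y, f):
    return abs(y - f)


def quadratic_loss(y, f):
    return (y - f) ** 2


def epsilon_insensitive_loss(y, f, eps=0.5):
    """Errors smaller than eps cost nothing; larger ones grow linearly."""
    return max(0.0, abs(y - f) - eps)


def huber_loss(y, f, delta=1.0):
    """Quadratic for small residuals, linear for large ones."""
    r = abs(y - f)
    if r <= delta:
        return 0.5 * r ** 2
    return delta * (r - 0.5 * delta)


for residual in (0.2, 0.5, 2.0):
    print(residual,
          absolute_loss(residual, 0.0),
          quadratic_loss(residual, 0.0),
          epsilon_insensitive_loss(residual, 0.0),
          huber_loss(residual, 0.0))
```

All four are convex in the prediction, so each yields a strictly convex ℓ2-regularized problem.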
Gradient descent
If the loss is convex, then the problem is strictly convex and has a unique global solution, which can be found numerically.
Suppose the function J to minimize is differentiable. First-order Taylor expansion of J at u:
$J(v) \approx J(u) + J'(u)\,(v - u)$
- [Figure: curve of J with the points $(u, J(u))$ and $(v, J(v))$, and the point $(v, J(u) + J'(u)(v - u))$ on the tangent at u]
Minimize a differentiable, strictly convex function J by finding where its gradient is 0:
- Set $a_0$ randomly.
- Update: $a_1 = a_0 - \alpha \nabla J(a_0)$.
- Repeat; stop when $\|\nabla J(a_k)\| < \varepsilon$.
- [Figure: curve of J(a); $J'(a_0) < 0$, so the update moves from $a_0$ to $a_1 > a_0$]
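A minimal sketch of these iterations for a function of one variable. The objective J(a) = (a - 3)² + 1, the step size, the tolerance and the fixed starting point (instead of a random one) are all assumptions chosen for the illustration.

```python
def gradient_descent(grad, a0, alpha=0.1, eps=1e-8, max_iter=10_000):
    """Iterate a <- a - alpha * grad(a) until |grad(a)| < eps.

    Returns the final iterate and the number of updates performed."""
    a = a0
    for k in range(max_iter):
        g = grad(a)
        if abs(g) < eps:
            return a, k
        a = a - alpha * g
    return a, max_iter


def grad_J(a):
    """Gradient of J(a) = (a - 3)^2 + 1, which is minimal at a = 3."""
    return 2.0 * (a - 3.0)


a_star, n_iter = gradient_descent(grad_J, a0=0.0)
print(a_star, n_iter)
```

With this step size each update multiplies the distance to the minimizer by 0.8, so the number of iterations grows like log(1/ε), as expected for a strongly convex, smooth function.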
Classification
Logistic regression
Model y as a linear function of x? Model the log-odds ratio as a linear function of x instead:
$\log \frac{p}{1 - p} = \beta^\top x$, where $p = P(y = 1 \mid x)$.
Ridge logistic regression (Le Cessie and van Houwelingen, 1992)
Goodness-of-fit (empirical risk, with the logistic loss) + ridge regularization:
$\min_{\beta} \sum_{i=1}^{n} \log\left(1 + e^{-y_i \beta^\top x_i}\right) + \lambda \|\beta\|^2$, with labels $y_i \in \{-1, +1\}$.
- Logistic loss: negative conditional log-likelihood.
- No explicit solution.
- Smooth convex optimization problem that can be solved numerically.
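Newton-Raphson iterations
Minimize J convex, differentiable.
- Gradient descent: $u \leftarrow u - \alpha \nabla J(u)$.
- Suppose J is twice differentiable. Second-order Taylor expansion at u: $g(v) = J(u) + \nabla J(u)^\top (v - u) + \frac{1}{2} (v - u)^\top \nabla^2 J(u)\,(v - u)$.
- Minimize g in v instead of J in u. Hence use $u \leftarrow u - \left[\nabla^2 J(u)\right]^{-1} \nabla J(u)$.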
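In one dimension the Newton-Raphson update divides the first derivative by the second. The sketch below applies it to J(u) = exp(u) - 2u, a strictly convex function chosen for illustration whose minimizer is ln 2. The starting point and tolerance are also assumptions.

```python
import math


def newton_raphson(grad, hess, u0, eps=1e-12, max_iter=50):
    """Iterate u <- u - grad(u) / hess(u) until |grad(u)| < eps.

    Returns the final iterate and the number of updates performed."""
    u = u0
    for k in range(max_iter):
        g = grad(u)
        if abs(g) < eps:
            return u, k
        u = u - g / hess(u)
    return u, max_iter


def grad_J(u):
    """First derivative of J(u) = exp(u) - 2u."""
    return math.exp(u) - 2.0


def hess_J(u):
    """Second derivative of J(u) = exp(u) - 2u (always positive)."""
    return math.exp(u)


u_star, n_iter = newton_raphson(grad_J, hess_J, u0=0.0)
print(u_star, n_iter)
```

Only a handful of updates are needed, because close to the minimizer the error is roughly squared at each step.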
Solving the logistic regression
- Can be solved with Newton-Raphson iterations.
- Each step is equivalent to solving a weighted ridge regression: IRLS (Iteratively Reweighted Least Squares).
- Complexity: $O(np^2 + p^3)$ per iteration, times the number of iterations.
Large margin classifiers
- Margin: classify x as positive if $f(x) > 0$ and negative otherwise, with labels $y \in \{-1, +1\}$; ideally, $y f(x) > 0$.
- Large margin classifier: make $y f(x)$ large by minimizing $\sum_{i=1}^{n} \varphi(y_i f(x_i))$ for a convex, non-increasing function $\varphi$.
- Logistic regression: $\varphi(u) = \log(1 + e^{-u})$.
- Linear Support Vector Machine (SVM): hinge loss $\varphi(u) = \max(0, 1 - u)$.
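The two margin-based losses are easy to compare numerically. The sketch below assumes the standard forms, log(1 + exp(-u)) for the logistic loss and max(0, 1 - u) for the hinge loss, where u = y f(x) is the margin of one sample with y in {-1, +1}.

```python
import math


def logistic_loss(margin):
    """Smooth, convex, non-increasing in the margin y * f(x)."""
    return math.log(1.0 + math.exp(-margin))


def hinge_loss(margin):
    """Convex, non-increasing, exactly zero once the margin reaches 1."""
    return max(0.0, 1.0 - margin)


for margin in (-2.0, 0.0, 0.5, 1.0, 3.0):
    print(margin, round(logistic_loss(margin), 4), hinge_loss(margin))
```

The hinge loss has a kink at margin 1, which is why the linear SVM is a non-smooth problem, whereas the logistic loss is smooth everywhere.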
Linear SVM
- Non-smooth, convex optimization problem; a Quadratic Program.
- Equivalent to the dual problem, which has one variable per training sample.
- Complexity (training): storage, and optimization in $O(n^3)$.
- Complexity (prediction): primal, $O(p)$; dual, $O(np)$.
Kernel methods
Motivation
Non-linear mapping to a feature space
- [Figure: data mapped from $\mathbb{R}$ to $\mathbb{R}^2$]
SVM in the feature space
- Train on the mapped samples $\phi(x_i)$.
- Predict with the decision function, a weighted sum of the inner products $\langle \phi(x_i), \phi(x) \rangle$.
Kernel SVM
- Train and predict in the same way, with every inner product $\langle \phi(x_i), \phi(x) \rangle$ replaced by $k(x_i, x)$.
Kernel trick
- k may be quite efficient to compute, even if H is a very high-dimensional or even infinite-dimensional space.
- For any positive semi-definite function k, there exists a feature space H and a feature map φ such that $k(x, x') = \langle \phi(x), \phi(x') \rangle_H$. Hence you can define mappings implicitly.
- Kernel trick: algorithms that only involve the samples through their dot products can be rewritten using kernels, so that they can be applied in the initial space without ever computing the mapping.
Polynomial kernels
More generally, $k(x, x') = (\langle x, x' \rangle + 1)^d$ is an inner product in a feature space of all monomials of degree up to d.
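The claim can be checked by hand for d = 2 in two dimensions. The sketch assumes the common form (⟨x, x'⟩ + 1)^d for the polynomial kernel. The explicit feature map `phi` below lists all monomials of degree up to 2, with the scaling factors that make the two computations agree. The test points are arbitrary.

```python
import math


def poly_kernel(x, z, d=2):
    """Polynomial kernel (<x, z> + 1)^d, computed in the input space."""
    return (sum(xi * zi for xi, zi in zip(x, z)) + 1.0) ** d


def phi(x):
    """Explicit feature map for d = 2 and two input dimensions:
    all monomials of degree 0, 1 and 2, suitably scaled."""
    x1, x2 = x
    s = math.sqrt(2.0)
    return (1.0, s * x1, s * x2, x1 * x1, x2 * x2, s * x1 * x2)


x, z = (1.0, 2.0), (3.0, -1.0)
in_input_space = poly_kernel(x, z)
in_feature_space = sum(a * b for a, b in zip(phi(x), phi(z)))
print(in_input_space, in_feature_space)
```

The kernel needs one dot product in 2 dimensions, while the explicit map needs a dot product in 6 dimensions. The gap widens quickly as p and d grow.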
Gaussian kernel
$k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$
The feature space has infinite dimension.
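Kernel ridge regression
- Ridge regression in input space: $\hat{\beta} = (X^\top X + \lambda I_p)^{-1} X^\top y$, which inverts a $p \times p$ matrix.
- In a feature space of dimension d: the matrix to invert is $d \times d$.
- Ridge regression in sample space: $\hat{\beta} = X^\top (X X^\top + \lambda I_n)^{-1} y$, which inverts an $n \times n$ matrix.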
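The input-space and sample-space solutions of ridge regression coincide, which can be verified on a tiny example. The sketch assumes a linear kernel, n = 2 samples and p = 1 feature, so that the only matrix to invert is 2 × 2. The data, λ and the helper `inverse_2x2` are made up for the illustration.

```python
def inverse_2x2(M):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Toy data: n = 2 samples, p = 1 feature.
X = [1.0, 2.0]
y = [1.0, 3.0]
lam = 0.5

# Input space: beta = (X^T X + lam)^(-1) X^T y, a 1 x 1 "inversion".
beta_input = (sum(xi * yi for xi, yi in zip(X, y))
              / (sum(xi * xi for xi in X) + lam))

# Sample space: alpha = (K + lam I)^(-1) y with K = X X^T (n x n),
# then beta = X^T alpha.
K = [[xi * xj for xj in X] for xi in X]
A = [[K[i][j] + (lam if i == j else 0.0) for j in range(2)]
     for i in range(2)]
A_inv = inverse_2x2(A)
alpha = [sum(A_inv[i][j] * y[j] for j in range(2)) for i in range(2)]
beta_sample = sum(xi * ai for xi, ai in zip(X, alpha))

# A prediction only needs the kernel values k(x_i, x_new) = x_i * x_new.
x_new = 3.0
pred_input = beta_input * x_new
pred_kernel = sum(ai * xi * x_new for ai, xi in zip(alpha, X))
print(beta_input, beta_sample, pred_input, pred_kernel)
```

Replacing the products `xi * xj` and `xi * x_new` by another kernel gives kernel ridge regression, at the price of an n × n system.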
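Complexity of KRR
- Computing K: $O(n^2 p)$.
- Storing K: $O(n^2)$.
- Inverting $K + \lambda I$: $O(n^3)$.
- Computing a prediction for one sample: computing κ (the kernel values between the sample and the n training points), $O(np)$; computing the products, $O(n)$.
Algorithmic complexity recap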
Summary
Method                 Memory    Train time   Predict time
PCA                    O(p^2)    O(np^2)      O(p)
k-means                O(np)     O(Knp)       O(Kp)
Ridge regression       O(p^2)    O(np^2)      O(p)
Logistic regression    O(np)     O(np^2)      O(p)
SVM, kernel methods    O(np)     O(n^3)       O(np)
- Training can take place offline, unless data is streaming.
- Prediction should be fast!
Techniques for large-scale ML
- Use the deep learning tricks: Deep learning (March 26, F. Moutarde); Natural Language Processing (March 27, E. Grave).
- Distribute data & computation on modern architectures: Systems for large-scale ML (March 28, C.-A. Azencott).
- Trade optimization accuracy for speed.
Gradient descent convergence rates
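Flavours of convexity
- J differentiable is convex iff for all u, v: $J(v) \geq J(u) + \nabla J(u)^\top (v - u)$.
- J twice differentiable is L-smooth (has an L-Lipschitz gradient) iff for all u: $\nabla^2 J(u) \preceq L I$. J doesn't vary sharply.
- J is m-strongly convex iff for all u, v: $J(v) \geq J(u) + \nabla J(u)^\top (v - u) + \frac{m}{2} \|v - u\|^2$. J is not too flat: it is lower-bounded by a quadratic.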
Gradient descent convergence rates
Gradient descent with fixed step size:
- J convex, L-smooth: converges in $O(1/\varepsilon)$ iterations, i.e. $O(np/\varepsilon)$ operations.
- J m-strongly convex, L-smooth: converges in $O(\kappa \log(1/\varepsilon))$ iterations, i.e. $O(np\,\kappa \log(1/\varepsilon))$ operations, where $\kappa = L/m$ is the condition number.
Newton-Raphson iterations:
- J convex, L-smooth: converges in $O(\log\log(1/\varepsilon))$ iterations, i.e. $O((np^2 + p^3) \log\log(1/\varepsilon))$ operations.
The constants L and m (hence κ) are usually unknown...
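Stochastic gradient descent
With $J(a) = \frac{1}{n} \sum_{i=1}^{n} J_i(a)$, where $J_i$ is the loss on sample i:
- Gradient descent: $a_{k+1} = a_k - \alpha \nabla J(a_k)$.
- Stochastic gradient descent: $a_{k+1} = a_k - \alpha \nabla J_i(a_k)$.
- Sampling with replacement: choose i uniformly at random in {1, 2, ..., n}.
Convergence:
- J convex, L-smooth: converges in $O(\kappa/\varepsilon^2)$ iterations, i.e. $O(\kappa p/\varepsilon^2)$ operations.
- J m-strongly convex, L-smooth: converges in $O(\kappa/\varepsilon)$ iterations, i.e. $O(\kappa p/\varepsilon)$ operations.
We got rid of n!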
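A minimal sketch of stochastic gradient descent for least squares with one feature. Each update looks at a single sample drawn with replacement, so its cost does not depend on n. The data are noiseless (y = 2x), an assumption that lets a constant step size converge. With noisy data a decreasing step size is the usual choice. The function name, step size, number of steps and seed are illustrative.

```python
import random


def sgd_least_squares(x, y, alpha=0.1, n_steps=200, seed=0):
    """Minimize the average of (y_i - beta * x_i)^2 with one sample
    per update, drawn uniformly at random with replacement."""
    rng = random.Random(seed)
    beta = 0.0
    for _ in range(n_steps):
        i = rng.randrange(len(x))
        # Gradient of the loss on sample i only: O(1) here, O(p) in general.
        grad_i = 2.0 * x[i] * (beta * x[i] - y[i])
        beta -= alpha * grad_i
    return beta


x = [0.5, 1.0, 1.5, 2.0]
y = [2.0 * xi for xi in x]  # exactly y = 2x
beta = sgd_least_squares(x, y)
print(beta)
```

Whatever sample is drawn, each update here shrinks the distance to the solution β = 2 by a factor between 0.2 and 0.95, so 200 cheap updates are enough for a close approximation.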
Convex optimization
- A number of ML algorithms can be formulated as convex optimization problems.
- They can be solved numerically thanks to variants of gradient descent.
Trading optimization accuracy for speed:
- Necessary when reaching high accuracy is too resource-consuming.
- Does not necessarily have a large impact on performance: test error is more important than training error.