LSML19: Introduction to Large Scale Machine Learning


LSML19: Introduction to Large Scale Machine Learning MINES ParisTech March 2019 Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr

Practical information Course website: http://cazencott.info > Teaching > LSML 19. Schedule: morning sessions 09:30–12:30; afternoon sessions 14:00–17:00. Lectures are here in L.118; practicals are in L.117, L.119, L.120. Grade: 60% exam (Friday, 14:00), 40% RAMP.

Today's goals Review what machine learning is Understand scalability issues Discover how to accelerate gradient descent with stochastic gradient descent

Acknowledgements Slides inspired by Ala Al-Fuqaha, Ethem Alpaydi, Matthew Blaschko, Léon Bottou, Sanjiv Kumar, Trevor Hastie, Rob Tibshirani and Jean-Philippe Vert

Why machine learning? 5

Business Insider: 2017 is the year of machine learning Improving doctors Assisting lawyers Making cars drive themselves 6

Perception 7

Communication Chatbot example from www.supportbots.ai Text courtesy of an Eddie Izzard show from 1999 called Dressed to Kill. 8

Reasoning 9

Diagnosis 10

Scientific discovery 11 LHC image: Anna Pantelia/CERN

A common thread: ML Using algorithms to build models from example (training) data. Statistics + optimization + computer science 12

Empirical risk minimization Ingredients: Data Hypothesis class: Shape of the decision function f Loss function: Cost/error of f on one data point Recipe: Find, among all functions of the hypothesis class, one that minimizes the loss on the training data (empirical risk). 13
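Written out (a sketch of the recipe above, with assumed notation: training points (x_i, y_i), i = 1..n, hypothesis class F, loss ℓ):

```latex
% Empirical risk minimization: pick, within the hypothesis class F,
% a function minimizing the average loss over the n training points.
\hat{f} \;\in\; \operatorname*{arg\,min}_{f \in \mathcal{F}} \;
    \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i),\, y_i\big)
```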

(Un)supervised learning setting Vocabulary: n observations (samples, data points); p variables (features, descriptors, attributes); t outputs (supervision, outcome, target, label). The data is collected in an n×p data matrix (design matrix) X, the supervision in y. Settings: binary classification, multi-class classification, regression. Orders of magnitude: Iris dataset: n=150, p=4, t=1. Cancer drug sensitivity: n=10^3, p=10^6, t=100. ImageNet: n=14·10^6, p=6·10^3, t=22·10^3. Shopping, e-marketing...: n=O(10^6), p=O(10^9), t=O(10^8). Astronomy, GAFAs, web...: n, p, t = O(10^9). 14

Scaling ML algorithms What is large scale? Important considerations: Data does not fit in RAM; Data streams; Algorithms do not run in a reasonable time on a single machine. Performance increases with the number of samples. Likelihood to overfit increases with the number of features. Iris dataset: n=150, p=4, t=1. Cancer drug sensitivity: n=10^3, p=10^6, t=100. ImageNet: n=14·10^6, p=6·10^3, t=22·10^3. Shopping, e-marketing...: n=O(10^6), p=O(10^9), t=O(10^8). Astronomy, GAFAs, web...: n, p, t = O(10^9). 15

A brief zoo of ML problems 16

Unsupervised learning Learn a new representation of the data: the ML algorithm takes as input only the n×p data matrix X (images, text, measurements, omics data...) and outputs a new representation of that data. 17

Dimensionality reduction Find a lower-dimensional representation: the ML algorithm maps the n×p data matrix X (images, text, measurements, omics data...) to an n×m representation with m < p. 18

Clustering Group similar data points together. 19

Unsupervised learning: Dimensionality reduction (PCA); Clustering (k-means); Density estimation; Feature learning. 20

Supervised learning Make predictions: the ML algorithm takes as input the n×p data matrix X and the labels y, and outputs a predictor (decision function). 21

Classification Make discrete predictions from data and labels: binary classification, multi-class classification. 22

Regression Make continuous predictions from data and labels. 23

Supervised learning: Regression (ordinary least squares, ridge regression); Classification (logistic regression, SVM); Structured output prediction. 24

Main ML paradigms Unsupervised learning: dimensionality reduction; clustering; density estimation; feature learning. Supervised learning: regression; classification; structured output prediction. Semi-supervised learning. Reinforcement learning. 25

Dimensionality reduction: Principal Components Analysis 26

Principal Components Analysis Objective: Reduce the dimension without losing the variability in the data; Find a low-dimensional space so as to maximize the variance of the data projected onto that space. The k-th principal component: Is orthogonal to all previous components; Captures the largest amount of remaining variance. Solution: w is the k-th eigenvector of the covariance matrix XᵀX. 27
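A minimal numpy sketch of this eigendecomposition view (variable names are illustrative; X is centered column-wise inside the function):

```python
import numpy as np

def pca(X, n_components):
    """Project X (n samples x p features) onto its top principal components."""
    X = X - X.mean(axis=0)                 # center each feature
    cov = X.T @ X / X.shape[0]             # p x p covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1]       # sort by decreasing variance
    W = eigvec[:, order[:n_components]]    # p x k projection matrix
    return X @ W                           # n x k projected data

# Example: project 150 four-dimensional points (Iris-sized) onto 2 components.
X = np.random.randn(150, 4)
Z = pca(X, 2)
print(Z.shape)  # (150, 2)
```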

PCA example: Population genetics Genetic data of 1387 Europeans Novembre et al, 2008 28

Algorithmic complexity of PCA Memory: store the data X (n×p) and the covariance matrix XᵀX (p×p). Runtime: computing XᵀX: O(np²); computing K eigenvectors by Lanczos iterations: roughly O(Kp²). Computing the covariance matrix is more expensive than computing the K first principal components! Example n=10^9, p=10^8: Computing the covariance matrix: 10^25 FLOPS. Fastest world computer (Nov 2018): 75 petaFLOPS → 2+ years. Storing the covariance matrix: 10^16 B ≈ 8 PB. 29
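A quick back-of-the-envelope check of these numbers (a sketch; the 75 petaFLOPS figure is the one quoted on the slide):

```python
n, p = 1e9, 1e8
flops = n * p**2                    # cost of forming the covariance matrix X^T X
peak = 75e15                        # ~75 petaFLOPS peak (figure quoted on the slide)
years = flops / peak / (3600 * 24 * 365)
print(f"{flops:.0e} FLOPS -> about {years:.1f} years at peak speed")  # 2+ years, as claimed

entries = p**2                      # 1e16 entries in the p x p covariance matrix
# The slide counts ~1e16 B (~8 PB); with 8-byte floats the footprint would be 8x larger.
print(f"{entries:.0e} entries, i.e. ~{entries / 2**50:.1f} PB at one byte per entry")
```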

Clustering: k-means 30

K-means clustering Goal: Find a cluster assignment that minimizes the intra-cluster variance: minimize Σ_k Σ_{xᵢ ∈ C_k} ||xᵢ − μ_k||², where the centroids μ_k are the means of the points assigned to each cluster C_k. 31

K-means clustering Goal: Find a cluster assignment that minimizes the intra-cluster variance: minimize Σ_k Σ_{xᵢ ∈ C_k} ||xᵢ − μ_k||², where the centroids μ_k are the means of the points assigned to each cluster C_k. The resulting assignment defines a Voronoi tessellation of the space. 32

K-means clustering Goal: Find a cluster assignment that minimizes the intra-cluster variance (as above). This problem is NP-hard. Iterative algorithm: Assignment step: fix the centroids, update the assignments. Update step: update the centroids. 33

K-means clustering 34

K-means clustering Pick 3 centroids at random. 35

K-means clustering Assign each observation to the nearest centroid. 36

K-means clustering Recompute centroids. 37

K-means clustering Re-assign each observation to the nearest centroid. 38

K-means clustering Recompute centroids, and iterate the process until convergence. 39
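A minimal numpy sketch of these two alternating steps (illustrative names; random initialization, a fixed number of iterations instead of a convergence test, and empty clusters are not handled):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # pick K points at random
    for _ in range(n_iter):
        # Assignment step: n x K distances, each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    return labels, centroids

X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
labels, centroids = kmeans(X, K=3)
print(centroids.round(1))
```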

k-means complexity Assignment step (fix the centroids, update assignments): compute n×K distances in p dimensions: O(nKp). Update step (update the centroids): sum n values in p dimensions: O(np). T iterations in total. Storage: store n cluster assignments + K centroids, and store X: O(np). 40

Ridge regression 41

Linear regression Least-squares fit (equivalent to MLE under the assumption of Gaussian noise): minimize ||y − Xβ||². Solution β = (XᵀX)⁻¹Xᵀy, uniquely defined when XᵀX is invertible. 42

Ridge regression Hoerl & Kennard 1970 Goodness-of-fit + ridge regularization: minimize the empirical risk ||y − Xβ||² plus a ridge penalty λ||β||². Solution β = (XᵀX + λI)⁻¹Xᵀy: unique and always exists. Limit cases: λ → 0: OLS (non-regularized) solution (low bias, high variance); λ → ∞: β = 0 (high bias, low variance). Correlated features get similar weights. 43
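A minimal numpy sketch of the closed-form solution above (illustrative data, no intercept term, λ chosen arbitrarily):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: beta = (X^T X + lam * I)^-1 X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
beta_true = rng.standard_normal(10)
y = X @ beta_true + 0.1 * rng.standard_normal(200)
beta_hat = ridge_fit(X, y, lam=1.0)
print(np.round(beta_hat - beta_true, 2))  # small estimation error
```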

Complexity of ridge regression Computing XᵀX: O(np²). Inverting XᵀX + λI: O(p³). When n >> p, computing XᵀX is more expensive than inverting it! 44

Regularization path: plot of the regression weights and of the error (MSE) as a function of the regularization parameter. 45

Ridge regression Hyperparameter setting 46

Setting λ Prediction error as a function of model complexity: the error on the training data keeps decreasing, while the error on new data first decreases (underfitting) and then increases (overfitting). 47

Setting λ Data splitting strategy: cross-validation. Cut the training set into K equally-sized chunks. K folds: one chunk for validation, the (K−1) others for training. Cross-validation score: performance averaged over the K folds, for each value of a grid λ1, λ2, ..., λm. 48

Setting λ Data splitting strategy: cross-validation. Cut the training set into K equally-sized chunks. K folds: one chunk for validation, the (K−1) others for training. Cross-validation score: performance averaged over the K folds. Choose the λ with the best cross-validation score. Multiplies the time complexity by KM (K folds × M values of λ). 49
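A minimal scikit-learn sketch of this grid search (note: scikit-learn's Ridge calls the regularization parameter alpha, which plays the role of λ here; the data and grid are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.standard_normal(200)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]          # grid of candidate values
scores = []
for lam in lambdas:
    model = Ridge(alpha=lam)                      # alpha plays the role of lambda
    # 5-fold cross-validation score (negative MSE, averaged over the folds)
    cv = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    scores.append(cv.mean())

best = lambdas[int(np.argmax(scores))]            # best (least negative) CV score
print("selected lambda:", best)
```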

Ridge regression ℓ2-regularized learning 50

ℓ2-regularized learning Minimize the empirical risk (average of a loss over the training points) plus an ℓ2 regularization term λ||β||². Generalization of the ridge regression to any loss. If the loss is convex, then the problem is strictly convex and has a unique global solution, which can be found numerically. 51

ℓ2-regularized learning Minimize the empirical risk (average of a loss over the training points) plus an ℓ2 regularization term λ||β||². Generalization of the ridge regression to any loss. If the loss is convex, then the problem is strictly convex and has a unique global solution, which can be found numerically. Examples of losses: quadratic loss (y − f(x))²; absolute loss |y − f(x)|; ε-insensitive loss max(0, |y − f(x)| − ε); Huber loss: a mix of quadratic (near zero) and linear (far from zero). 52
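These losses are straightforward to write down; a minimal numpy sketch (the epsilon and delta thresholds are illustrative):

```python
import numpy as np

def quadratic(r):            # r = y - f(x), the residual
    return r ** 2

def absolute(r):
    return np.abs(r)

def eps_insensitive(r, eps=0.1):
    return np.maximum(0.0, np.abs(r) - eps)

def huber(r, delta=1.0):     # quadratic near 0, linear beyond delta
    return np.where(np.abs(r) <= delta,
                    0.5 * r ** 2,
                    delta * (np.abs(r) - 0.5 * delta))

r = np.linspace(-3, 3, 7)
print(np.round(huber(r), 2))
```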

Gradient descent 53

Gradient descent If the loss is convex, then the problem is strictly convex and has a unique global solution, which can be found numerically. Suppose the function J to minimize is differentiable. First-order Taylor expansion of J at u: J(v) ≈ J(u) + ∇J(u)·(v − u). 54

Gradient descent Minimize a differentiable, strictly convex function J by finding where its gradient is 0: Set a0 randomly. Update: a1 = a0 − α∇J(a0). Repeat. Stop when ||∇J(ak)|| < ε. 55
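A minimal numpy sketch of this loop on a least-squares objective (the step size α and tolerance ε are illustrative; the gradient supplied here is that of J(a) = ½||y − Xa||²):

```python
import numpy as np

def gradient_descent(grad, a0, alpha=0.01, eps=1e-6, max_iter=10_000):
    """Fixed-step gradient descent: a_{k+1} = a_k - alpha * grad(a_k)."""
    a = a0
    for _ in range(max_iter):
        g = grad(a)
        if np.linalg.norm(g) < eps:   # stop when the gradient is (nearly) zero
            break
        a = a - alpha * g
    return a

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([1., -2., 0., 3., 0.5])
grad = lambda a: X.T @ (X @ a - y)    # gradient of 0.5 * ||y - Xa||^2
a_hat = gradient_descent(grad, np.zeros(5), alpha=0.001)
print(np.round(a_hat, 2))             # close to [1, -2, 0, 3, 0.5]
```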

Classification 56

Logistic regression Model y as a linear function of x? 57

Logistic regression Model y as a linear function of x? Model the log-odds ratio as a linear function of x. 58

Logistic regression Model y as a linear function of x? Model the log-odds ratio as a linear function of x: log(p/(1−p)) = β0 + βᵀx, i.e. p = 1 / (1 + exp(−(β0 + βᵀx))). 59

Ridge logistic regression Le Cessie and van Houwelingen, 1992 Goodness-of-fit + ridge regularization: minimize the empirical risk of the logistic loss plus a ridge penalty λ||β||². Logistic loss: negative conditional likelihood. No explicit solution. Smooth convex optimization problem that can be solved numerically. 60
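A minimal scikit-learn sketch of ridge-penalized (ℓ2) logistic regression (note: scikit-learn parameterizes the penalty through C, the inverse of the regularization strength, so a large λ corresponds to a small C; the data is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5))
y = (X[:, 0] - X[:, 1] + 0.3 * rng.standard_normal(300) > 0).astype(int)

# penalty="l2" is the ridge penalty; C ~ 1/lambda (smaller C = stronger regularization)
clf = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs")
clf.fit(X, y)
print(np.round(clf.coef_, 2))                  # learned weights beta
print("training accuracy:", clf.score(X, y))
```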

Newton-Raphson iterations Minimize J convex, differentiable. Gradient descent uses the first-order Taylor expansion. Suppose J is twice differentiable. Second-order Taylor expansion of J at u: J(v) ≈ J(u) + ∇J(u)ᵀ(v − u) + ½(v − u)ᵀ∇²J(u)(v − u). Minimize this expansion in v instead of in u; hence use the update a_{k+1} = a_k − [∇²J(a_k)]⁻¹ ∇J(a_k).

Solving the logistic regression Can be solved with Newton-Raphson iterations. Each step is equivalent to solving a weighted ridge regression: IRLS, Iteratively Reweighted Least Squares. Complexity: number of iterations × cost of one weighted ridge regression (O(np² + p³) per iteration). 62
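A minimal numpy sketch of these Newton/IRLS iterations for ℓ2-regularized logistic regression (illustrative: labels in {0, 1}, a fixed number of iterations, no intercept):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_ridge_irls(X, y, lam=1.0, n_iter=20):
    """Newton-Raphson for the l2-penalized logistic log-likelihood."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        prob = sigmoid(X @ beta)                   # predicted P(y=1|x)
        grad = X.T @ (y - prob) - lam * beta       # gradient of the penalized log-likelihood
        W = prob * (1.0 - prob)                    # weights of the reweighted least squares
        hess = X.T @ (W[:, None] * X) + lam * np.eye(p)
        beta = beta + np.linalg.solve(hess, grad)  # Newton step = one weighted ridge solve
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 4))
y = (sigmoid(X @ np.array([2., -1., 0., 1.])) > rng.uniform(size=300)).astype(float)
print(np.round(logistic_ridge_irls(X, y), 2))
```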

Large margin classifiers Margin: classify x as positive if f(x) > 0 and negative otherwise; ideally, yf(x) > 0. Large margin classifier: encourage large values of yf(x) by minimizing Σᵢ ℓ(yᵢ f(xᵢ)) for a convex, non-increasing loss ℓ. Logistic regression: ℓ(u) = log(1 + e⁻ᵘ). 63

Large margin classifiers Linear Support Vector Machine (SVM): hinge loss ℓ(u) = max(0, 1 − u). 64

Linear SVM Non-smooth, convex optimization problem; Quadratic Program. Equivalent to a dual quadratic problem over n variables α. Complexity (training): storing the data: O(np); optimization: O(n³). Complexity (prediction): primal: O(p); dual (sum over the support vectors): O(np). 65
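A minimal scikit-learn sketch of a linear SVM; the dual solution only involves a subset of the training points, the support vectors (the data and the C value are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((100, 2)) + 2, rng.standard_normal((100, 2)) - 2])
y = np.array([1] * 100 + [-1] * 100)

clf = SVC(kernel="linear", C=1.0)      # hinge-loss, l2-regularized classifier
clf.fit(X, y)
print("support vectors:", clf.support_vectors_.shape[0], "out of", len(X))
print("training accuracy:", clf.score(X, y))
```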

Kernel methods 66

Motivation 67

Non-linear mapping to a feature space: e.g. φ : ℝ → ℝ². 68

SVM in the feature space Train: solve the SVM problem on the mapped data φ(x₁), ..., φ(xₙ); the data only enter through inner products ⟨φ(xᵢ), φ(xⱼ)⟩. Predict with the decision function f(x) = Σᵢ αᵢ yᵢ ⟨φ(xᵢ), φ(x)⟩ + b. 69

Kernel SVM Train: solve the same problem, with the inner products replaced by kernel evaluations k(xᵢ, xⱼ). Predict with the decision function f(x) = Σᵢ αᵢ yᵢ k(xᵢ, x) + b. 70

Kernel trick k may be quite efficient to compute, even if H is a very high-dimensional or even infinite-dimensional space. For any positive semi-definite function k, there exists a feature space H and a feature map φ such that k(x, x′) = ⟨φ(x), φ(x′)⟩_H. Hence you can define mappings implicitly. Kernel trick: algorithms that only involve the samples through their dot products can be rewritten using kernels in such a way that they can be applied in the initial space without ever computing the mapping. 71

Non-linear mapping to a feature space: φ : ℝ → ℝ². 72

Polynomial kernels More generally, for any degree d, k(x, x′) = (xᵀx′ + 1)^d is an inner product in a feature space of all monomials of degree up to d. 73
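A small numpy check of this identity for d = 2 and two features, using a hand-written explicit monomial map φ(x) = (1, √2·x₁, √2·x₂, x₁², x₂², √2·x₁x₂) (the map and the test points are illustrative):

```python
import numpy as np

def poly_kernel(x, z, d=2):
    return (x @ z + 1.0) ** d

def phi(x):
    """Explicit feature map for d=2, p=2: all monomials of degree <= 2 (with weights)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([0.5, -1.0])
z = np.array([2.0, 3.0])
print(poly_kernel(x, z))   # kernel computed in the 2-dimensional input space
print(phi(x) @ phi(z))     # same value, computed in the 6-dimensional feature space
```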

Gaussian kernel k(x, x′) = exp(−||x − x′||² / (2σ²)). The feature space has infinite dimension. 74

Kernel ridge regression Ridge regression in input space: β = (XᵀX + λI)⁻¹Xᵀy, a p×p system to solve. In a feature space of dimension d: a d×d system. Ridge regression in sample space: α = (K + λI)⁻¹y, an n×n system, with predictions f(x) = Σᵢ αᵢ k(xᵢ, x). 75
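A minimal numpy sketch of kernel ridge regression with a Gaussian kernel (the bandwidth, λ and data are illustrative):

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def krr_fit(X, y, lam=0.1, sigma=1.0):
    K = gaussian_kernel(X, X, sigma)                      # n x n kernel matrix
    return np.linalg.solve(K + lam * np.eye(len(X)), y)   # alpha = (K + lam I)^-1 y

def krr_predict(X_train, alpha, X_new, sigma=1.0):
    kappa = gaussian_kernel(X_new, X_train, sigma)        # kernels to the training points
    return kappa @ alpha                                  # f(x) = sum_i alpha_i k(x_i, x)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(100)
alpha = krr_fit(X, y)
print(np.round(krr_predict(X, alpha, np.array([[0.0], [1.5]])), 2))  # ~ sin(0), sin(1.5)
```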

Complexity of KRR Computing K: O(n²p). Storing K: O(n²). Inverting K + λI: O(n³). Computing a prediction for one sample: computing κ (the kernel values between the new sample and the n training samples): O(np); computing the product κᵀα: O(n). 76

Algorithmic complexity recap 77

Summary
Method: Memory / Train time / Predict time
PCA: O(p²) / O(np²) / O(p)
k-means: O(np) / O(Knp) / O(Kp)
Ridge regression: O(p²) / O(np²) / O(p)
Logistic regression: O(np) / O(np²) / O(p)
SVM, kernel methods: O(np) / O(n³) / O(np)
Training can take place offline unless data is streaming. Prediction should be fast! 78

Techniques for large-scale ML Use deep learning tricks: Deep learning (March 26, F. Moutarde); Natural Language Processing (March 27, E. Grave). Distribute data & computation on modern architectures: Systems for large-scale ML (March 28, C.-A. Azencott). Trade optimization accuracy for speed. 79

Gradient descents convergence rates 80

Flavours of convexity J differentiable is convex iff for all u, v: J(v) ≥ J(u) + ∇J(u)ᵀ(v − u). J is L-smooth, or L-Lipschitz gradient, iff it is twice differentiable and for all u: ∇²J(u) ⪯ L·I (J doesn't vary too sharply). J is m-strongly convex iff for all u, v: J(v) ≥ J(u) + ∇J(u)ᵀ(v − u) + (m/2)||v − u||² (J is not too flat: it is lower-bounded by a quadratic). 81

Gradient descent convergence rates Gradient descent with fixed step size: J convex, L-smooth: converges in O(1/ε) iterations, i.e. O(np/ε) operations. J m-strongly convex, L-smooth: converges in O(κ log(1/ε)) iterations, i.e. O(np·κ·log(1/ε)) operations, where κ = L/m is the condition number. Newton-Raphson iterations: J convex, L-smooth: converges in O(log log(1/ε)) iterations, i.e. O((np² + p³)·log log(1/ε)) operations. 82

Gradient descent convergence rates Gradient descent with fixed step size: J convex, L-smooth: converges in O(1/ε) iterations, i.e. O(np/ε) operations. J m-strongly convex, L-smooth: converges in O(κ log(1/ε)) iterations, i.e. O(np·κ·log(1/ε)) operations. Newton-Raphson iterations: J convex, L-smooth: converges in O(log log(1/ε)) iterations, i.e. O((np² + p³)·log log(1/ε)) operations. These constants are usually unknown... 83

Stochastic gradient descent Gradient descent uses the full gradient (a sum over all n training points) at each step; stochastic gradient descent replaces it by the gradient of the loss on a single point. Sampling with replacement: choose i uniformly at random in {1, 2, ..., n}. J convex, L-smooth: converges in O(κ/ε²) iterations, i.e. O(κp/ε²) operations. J m-strongly convex, L-smooth: converges in O(κ/ε) iterations, i.e. O(κp/ε) operations. We got rid of n!
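A minimal numpy sketch of the stochastic update on a least-squares objective (step size and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10_000, 5
X = rng.standard_normal((n, p))
beta_true = np.array([1., -2., 0., 3., 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(n)

beta = np.zeros(p)
alpha = 0.01                                  # fixed step size (illustrative)
for t in range(50_000):
    i = rng.integers(n)                       # sample one point uniformly, with replacement
    grad_i = (X[i] @ beta - y[i]) * X[i]      # gradient of the squared loss on point i only
    beta -= alpha * grad_i                    # each update costs O(p), vs O(np) for a full gradient
print(np.round(beta, 2))                      # close to beta_true
```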

Convex optimization A number of ML algorithms can be formulated as convex optimization problems. They can be solved numerically thanks to variants of gradient descent. Trading optimization accuracy for speed: Necessary when reaching a high optimization accuracy is too resource-consuming. Does not necessarily have a large impact on performance: test error is more important than training error.