EE613 Machine Learning for Engineers: LINEAR REGRESSION
Sylvain Calinon, Robot Learning & Interaction Group, Idiap Research Institute
Nov. 4, 2015

Outline
- Multivariate ordinary least squares
- Singular value decomposition (SVD)
- Kernels in least squares (nullspace projection)
- Ridge regression (Tikhonov regularization)
- Weighted least squares (WLS)
- Recursive least squares (RLS)

Multivariate ordinary least squares demo_ls01.m demo_ls_polfit01.m 3

Multivariate ordinary least squares

Least squares is everywhere, from simple curve fitting to large-scale problems. It was the earliest form of regression, published by Legendre in 1805 and by Gauss in 1809. Both applied the method to the problem of determining the orbits of bodies around the Sun from astronomical observations. The term regression was only coined later by Galton, to describe the biological observation that the heights of descendants of tall ancestors tend to regress down towards a normal average (a phenomenon also known as regression toward the mean). Pearson later provided the statistical context, showing that the phenomenon is more general than its biological setting.

Multivariate ordinary least squares Moore-Penrose pseudoinverse 5
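As a minimal sketch of the pseudoinverse solution (not the actual demo_ls01.m; it assumes the linear model Y = X*A with inputs X of size N x d and outputs Y of size N x m, and the data below is synthetic and purely illustrative):

    N = 100; d = 3; m = 2;                   % illustrative sizes
    X = randn(N, d);                         % input matrix (placeholder data)
    A_true = randn(d, m);
    Y = X * A_true + 0.1 * randn(N, m);      % noisy outputs (placeholder data)

    A = pinv(X) * Y;                         % Moore-Penrose pseudoinverse solution
    A2 = (X' * X) \ (X' * Y);                % equivalent closed form when X'*X is invertible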

Multivariate ordinary least squares 6

Multivariate ordinary least squares

Multivariate ordinary least squares 8

Least squares with Cholesky decomposition 9
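A sketch of the same estimate obtained via a Cholesky decomposition of the normal equations (assuming X'*X is positive definite; X and Y are placeholder data):

    X = randn(100, 3); Y = randn(100, 2);    % placeholder data (illustration only)
    R = chol(X' * X);                        % upper-triangular R with R'*R = X'*X
    A = R \ (R' \ (X' * Y));                 % two triangular solves, no explicit inverse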

Singular value decomposition (SVD)

X = U S V^T, where S is a matrix with non-negative diagonal entries (the singular values of X) and U, V are unitary matrices (orthonormal bases).

Singular value decomposition (SVD) 11

Singular value decomposition (SVD) 12

Singular value decomposition (SVD)

Applications employing the SVD include pseudoinverse computation, least squares fitting, multivariable control, matrix approximation, as well as the determination of the rank, range and null space of a matrix. The SVD can also be thought of as decomposing a matrix into a weighted, ordered sum of separable matrices (e.g., decomposition of an image processing filter into separable horizontal and vertical filters). It is possible to use the SVD of a square matrix A to determine the orthogonal matrix O closest to A. The closeness of fit is measured by the Frobenius norm of O - A, and the solution is the product U V^T. A similar problem, with interesting applications in shape analysis, is the orthogonal Procrustes problem, which consists of finding the orthogonal matrix O that most closely maps A to B. Extensions to higher-order arrays exist, generalizing the SVD to a multi-way analysis of the data (multilinear algebra).
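As a small sketch of the closest orthogonal matrix mentioned above (A here is an arbitrary square matrix used purely for illustration, as in the statement on this slide):

    A = randn(3, 3);            % arbitrary square matrix (illustration only)
    [U, S, V] = svd(A);
    O = U * V';                 % orthogonal matrix closest to A in Frobenius norm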

Condition number of a matrix with SVD 14
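A minimal sketch of the condition number obtained from the singular values, i.e. the ratio of the largest to the smallest singular value (X is placeholder data):

    X = randn(100, 3);          % placeholder design matrix (illustration only)
    s = svd(X);                 % singular values, in decreasing order
    kappa = s(1) / s(end);      % condition number of X
    % cond(X) returns the same 2-norm condition number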

Least squares with SVD 15

Least squares with SVD 16
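A sketch of least squares solved through the SVD, truncating small singular values (the tolerance below is an arbitrary illustrative choice, not a value from the slides):

    X = randn(100, 3); Y = randn(100, 2);     % placeholder data (illustration only)
    [U, S, V] = svd(X, 'econ');
    s = diag(S);
    tol = 1e-8 * s(1);                        % truncation threshold (illustrative)
    k = sum(s > tol);                         % effective rank
    A = V(:, 1:k) * diag(1 ./ s(1:k)) * U(:, 1:k)' * Y;   % SVD-based pseudoinverse solution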

Different line fitting alternatives 17

Data fitting with linear least squares

In linear least squares, we are not restricted to using a line as the model, as in the previous slide. For instance, we could have chosen a quadratic model Y = X^2 A: this model is still linear in the parameters A. The function does not need to be linear in its argument, only in the parameters that are determined to give the best fit. Also, not every column of X has to contain information about the datapoints: the first or last column can, for example, be populated with ones, so that an offset is learned. This can be used for polynomial fitting by treating x, x^2, ... as distinct independent variables in a multiple regression model.
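As a sketch of the point above (variable names and data are illustrative), a quadratic term and an offset column can both be handled by the same linear least squares machinery, since the model stays linear in A:

    x = linspace(-1, 1, 50)';                % scalar inputs (illustration only)
    y = 2 * x.^2 - 0.5 + 0.05 * randn(50, 1);
    X = [x.^2, ones(50, 1)];                 % quadratic term + column of ones (offset)
    A = X \ y;                               % A(1) close to 2, A(2) close to -0.5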

Data fitting with linear least squares 19

Data fitting with linear least squares

Polynomial regression is an example of regression analysis using basis functions to model a functional relationship between two quantities. Specifically, it replaces x in linear regression with the polynomial basis [1, x, x^2, ..., x^d]. A drawback of polynomial bases is that the basis functions are "non-local". It is for this reason that polynomial basis functions are often used together with other forms of basis functions, such as splines, radial basis functions, and wavelets. (We will learn more about this in the next lecture.)

Polynomial fitting with least squares 21
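A sketch of polynomial fitting with the basis [1, x, x^2, ..., x^d] (in the spirit of demo_ls_polfit01.m but not the actual demo; the degree d and the data are illustrative):

    d = 4;                                   % polynomial degree (illustrative)
    x = linspace(0, 1, 30)';
    y = sin(2 * pi * x) + 0.1 * randn(30, 1);
    X = zeros(numel(x), d + 1);
    for k = 0:d
        X(:, k + 1) = x .^ k;                % basis function x^k
    end
    A = X \ y;                               % polynomial coefficients
    yhat = X * A;                            % fitted values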

Kernels in least squares (nullspace projection) demo_ls_nullspace01.m 22

Kernels in least squares (nullspace) 23

Kernels in least squares (nullspace) 24

Kernels in least squares (nullspace) 25

Kernels in least squares (nullspace): without nullspace vs. with nullspace

Example with polynomial fitting 27

Example with robot inverse kinematics Joint space / configuration space coordinates Task space / operational space coordinates 28

Example with robot inverse kinematics Keeping the tip still as primary constraint Trying to move the first joint as secondary constraint 29

Example with robot inverse kinematics Tracking target with left hand Tracking target with right hand if possible 30
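A sketch of the nullspace projection idea in the inverse kinematics setting, following the primary/secondary task structure of the previous slides (J, dx and dq2 are hypothetical placeholders for the task Jacobian, the primary task velocity and a secondary joint-space velocity; this is not the actual demo_ls_nullspace01.m):

    n = 5;                                   % number of joints (illustrative)
    J = randn(2, n);                         % task Jacobian (placeholder)
    dx = [0; 0];                             % primary task: keep the tip still
    dq2 = [1; zeros(n - 1, 1)];              % secondary task: move the first joint

    Jinv = pinv(J);
    N = eye(n) - Jinv * J;                   % nullspace projector of J
    dq = Jinv * dx + N * dq2;                % secondary task projected into the nullspace
    % J * dq is (numerically) zero: the secondary motion does not perturb the tip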

Ridge regression (Tikhonov regularization, penalized least squares) demo_ls_polfit02.m 31

Ridge regression (Tikhonov regularization) 32

Ridge regression (Tikhonov regularization) 33

Ridge regression (Tikhonov regularization) 34

Ridge regression (Tikhonov regularization) 35

Ridge regression (Tikhonov regularization) 36

Ridge regression (Tikhonov regularization) 37
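A minimal sketch of the ridge regression estimate (in the spirit of demo_ls_polfit02.m but not the actual demo; the data and the value of lambda are illustrative placeholders):

    X = randn(100, 3); Y = randn(100, 2);    % placeholder data (illustration only)
    lambda = 1e-2;                           % regularization weight (illustrative)
    A = (X' * X + lambda * eye(size(X, 2))) \ (X' * Y);   % ridge / Tikhonov solution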

LASSO (L1 regularization)

An alternative regularized version of least squares is Lasso (least absolute shrinkage and selection operator), which uses the constraint that the L1-norm ||A||_1 is no greater than a given value. The L1-regularized formulation is useful due to its tendency to prefer solutions with fewer nonzero parameter values, effectively reducing the number of variables upon which the given solution depends. Increasing the penalty term in ridge regression shrinks all parameters while leaving them non-zero, whereas in Lasso it drives more and more of the parameters to exactly zero. This is an advantage of Lasso over ridge regression, as driving parameters to zero deselects those features from the regression.

LASSO (L1 regularization)

Thus, Lasso automatically selects the more relevant features and discards the others, whereas ridge regression never fully discards any feature. Lasso is equivalent to an unconstrained minimization of the least-squares cost with a penalty proportional to ||A||_1 added. In a Bayesian context, this is equivalent to placing a zero-mean Laplace prior distribution on the parameter vector. The optimization problem may be solved using quadratic programming or more general convex optimization methods, as well as by specific algorithms such as the least angle regression algorithm.
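As a sketch of one such convex optimization approach, here iterative soft-thresholding (proximal gradient), which is not named on the slide but fits the description above; lambda, the step size and the data are illustrative:

    X = randn(100, 10); Y = randn(100, 1);   % placeholder data (illustration only)
    lambda = 0.1;                            % L1 penalty weight (illustrative)
    eta = 1 / norm(X' * X);                  % step size bounded by 1/Lipschitz constant
    A = zeros(size(X, 2), 1);
    for it = 1:500
        G = X' * (X * A - Y);                % gradient of the least squares term
        Z = A - eta * G;                     % gradient step
        A = sign(Z) .* max(abs(Z) - eta * lambda, 0);   % soft-thresholding (L1 proximal step)
    end
    % small coefficients are driven exactly to zero, illustrating the feature selection effect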

Weighted least squares (Generalized least squares) demo_ls_weighted01.m 40

Weighted least squares 41

Weighted least squares 42
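A sketch of the weighted least squares estimate with a diagonal weight matrix W (placeholder data and weights; in the spirit of demo_ls_weighted01.m but not the actual demo):

    X = randn(100, 3); Y = randn(100, 2);    % placeholder data (illustration only)
    w = rand(100, 1);                        % per-datapoint weights (illustrative)
    W = diag(w);
    A = (X' * W * X) \ (X' * W * Y);         % weighted least squares estimate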

Weighted least squares - Example 43

Weighted least squares - Example 44

Iteratively reweighted least squares (IRLS) demo_ls_irls01.m 45

Robust regression

Regression analysis seeks to find the relationship between one or more independent variables and a dependent variable. Methods such as ordinary least squares have favorable properties if their underlying assumptions are true, but can give misleading results if those assumptions do not hold: least squares estimates are highly sensitive to outliers. Robust regression methods are designed not to be overly affected by violations of the assumptions of the underlying data-generating process. They down-weight the influence of outliers, which makes the outliers' residuals larger and easier to identify. A simple approach to robust least squares fitting is to first do an ordinary least squares fit, identify the k data points with the largest residuals, omit these, and refit on the remaining data (see the sketch below).
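A sketch of that simple trimming procedure (k and the data are arbitrary illustrative choices):

    X = randn(100, 3); y = randn(100, 1);    % placeholder data (illustration only)
    k = 5;                                   % number of suspected outliers to discard (illustrative)
    A0 = X \ y;                              % initial ordinary least squares fit
    [~, idx] = sort(abs(y - X * A0), 'descend');
    keep = idx(k + 1:end);                   % drop the k points with the largest residuals
    A = X(keep, :) \ y(keep);                % refit on the remaining data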

Robust regression

Another simple way to obtain parameter estimates that are less sensitive to outliers than the least squares estimates is to use least absolute deviations. An alternative parametric approach is to assume that the residuals follow a mixture of normal distributions: a contaminated normal distribution in which the majority of observations come from a specified normal distribution, but a small proportion come from a normal distribution with much higher variance. Another approach to robust estimation of regression models is to replace the normal distribution with a heavy-tailed distribution (e.g., a t-distribution). Bayesian robust regression often relies on such distributions.

Iteratively reweighted least squares (IRLS)

Iteratively reweighted least squares generalizes trimmed least squares by raising the error to a power less than 2, so it is no longer least squares in the strict sense. IRLS can be used to find the maximum likelihood estimates of a generalized linear model, and to find an M-estimator as a way of mitigating the influence of outliers in an otherwise normally distributed data set, for example by minimizing the least absolute error rather than the least squared error. The strategy is that an error term |e|^p can be rewritten as |e|^p = |e|^(p-2) e^2. Then |e|^(p-2) can be interpreted as a weight, and e^2 is minimized with weighted least squares. p = 1 corresponds to least absolute deviation regression.
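A sketch of IRLS for p = 1 (least absolute deviation), following the reweighting |e|^(p-2) described above (the small epsilon avoids division by zero and is an implementation choice, not part of the slide; this is not the actual demo_ls_irls01.m):

    X = randn(100, 3); y = randn(100, 1);    % placeholder data (illustration only)
    p = 1;                                   % p = 1: least absolute deviation
    eps0 = 1e-6;                             % avoids division by zero (implementation choice)
    A = X \ y;                               % ordinary least squares initialization
    for it = 1:50
        e = y - X * A;                       % residuals
        w = max(abs(e), eps0) .^ (p - 2);    % weights |e|^(p-2)
        W = diag(w);
        A = (X' * W * X) \ (X' * W * y);     % weighted least squares update
    end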

Iteratively reweighted least squares (IRLS) 49

Iteratively reweighted least squares (IRLS) 50

Recursive least squares demo_ls_recursive01.m 51

Recursive least squares 52

Recursive least squares 53

Recursive least squares 54

Recursive least squares 55

Recursive least squares 56

Recursive least squares 57
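A sketch of the recursive least squares update, processing one datapoint at a time (single-output case with placeholder data; in the spirit of demo_ls_recursive01.m but not the actual demo):

    d = 3;                                    % number of input dimensions (illustrative)
    A = zeros(d, 1);                          % running parameter estimate
    P = 1e6 * eye(d);                         % large initial covariance (weak prior)
    for n = 1:200
        x = randn(d, 1);                      % new input sample (placeholder)
        y = x' * [1; -2; 0.5] + 0.05 * randn; % new output sample (placeholder)
        K = P * x / (1 + x' * P * x);         % gain vector
        A = A + K * (y - x' * A);             % update estimate with the prediction error
        P = P - K * x' * P;                   % update covariance
    end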

Main references

Regression:
- F. Stulp and O. Sigaud. Many regression algorithms, one unified model: a review. Neural Networks, 69:60-79, September 2015.
- W. W. Hager. Updating the inverse of a matrix. SIAM Review, 31(2):221-239, 1989.
- G. Strang. Introduction to Applied Mathematics. Wellesley-Cambridge Press, 1986.
- K. B. Petersen and M. S. Pedersen. The Matrix Cookbook.