COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS18. Lecture 2: Linear Regression Gradient Descent Non-linear basis functions

Size: px

Start display at page:

Download "COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS18. Lecture 2: Linear Regression Gradient Descent Non-linear basis functions"

Angela Cummings
5 years ago
Views:

1 COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS18 Lecture 2: Linear Regression Gradient Descent Non-linear basis functions

2 LINEAR REGRESSION MOTIVATION

3 Why Linear Regression? Simplest machine learning algorithm for regression Widely used in biological, behavioural and social sciences to describe and to extract relationships between variables from data Prediction of real-valued outputs Easy to implement, fast to execute Benchmark algorithm for comparison with more complex algorithms Introduction to notation and concepts that we will need again later in the course Data format, vector & matrix notation Learning from data by minimizing a cost function Gradient descent Non-linear features and basis functions Preparation for neural networks

4 Applications of (linear) regression Brain computer interfaces Neuroprosthetic control

5 LINEAR REGRESSION WITH ONE INPUT

A regression problem We want to learn to predict a person s height based on his/her knee height and/or arm span This is useful for patients who are bed bound

6 A regression problem We want to learn to predict a person s height based on his/her knee height and/or arm span This is useful for patients who are bed bound or in a wheelchair and cannot stand to take an accurate measurement of their height Knee Height [cm] Arm span [cm] Height [cm]

7 Linear regression with one input Training set body height Learning algorithm? knee height Hypothesis Test input x Hypothesis h Prediction Parameters?

8 Example Data body height 180 Knee height [cm] Arm span [cm] Height [cm] m=30 data points body height knee height armspan

9 Example Data 190 Knee Height [cm] Arm span [cm] Height [cm] body height armspan knee height 55 60

10 Linear regression with one input Knee Height [cm] Height [cm] Which 190 hypothesis is better? In what sense is it better? body height knee height Hypothesis Parameters?

11 Formalization of problem Knee Height [cm] Height [cm] m=30 data points Given m training examples Goal: learn parameters such that for all training examples i=1 30. body height knee height

12 Least Squares Objective Minimize Error body height knee height

13 Least Squares Objective Minimize Error cost function mean squared error body height knee height

14 Least Squares Objective Minimize Error cost function mean squared error body height knee height

15 Cost function illustrated Properties of cost function: Quadratic function body height Convex function Unique local and global minimum (under regular conditions) knee height body height knee height

16 Minimizing the cost Two ways to find the parameters minimizing Gradient descent Direct analytical solution (setting derivatives = 0)

Recall: Functions of multiple variables Example:

the partial derivatives (fundamental in lecture 2)

lecture 4) Function of multiple variable with high

17 Recall: Functions of multiple variables Example: Partial derivatives Gradient vector is formed with the partial derivatives (fundamental in lecture 2) Chain rule (fundamental for neural networks in lecture 4) Function of multiple variable with high dimensional values Jacobian matrix is formed with the partial derivatives

18 GRADIENT DESCENT

19 Descending in the steepest direction Gradient descent on some arbitrary cost function

20 Gradient descent algorithm Repeat until convergence (simultaneously updating and ) negative gradient = descent learning rate ( eta ) partial derivative of with respect to

21 Gradient is orthogonal to contour lines A contour line is a line along which = const

22 Potential issues with gradient descent May get stuck in local minima Learning rate too small: slow convergence Learning rate too large: oscillations, divergence too small too large

23 LINEAR REGRESSION WITH GRADIENT DESCENT (ONE INPUT)

24 Application of gradient descent Linear regression cost Gradient descent (simultaneous update) learning rate (simultaneous update) error input

25 Predicting height from knee height Optimal fit to training data body height knee height

26 LINEAR REGRESSION MORE GENERAL FORMULATION: MULTIPLE FEATURES

27 Multiple inputs (features) Knee Height x1 Arm span x2 Age x3 Height y = = = 3 Notation: number of training examples number of features input features of i th training example (vector-valued). value of feature j in i th training example

28 Linear hypothesis Hypothesis (one input): Hypothesis (multiple input features): Example: h(x) = *kneeheight + 0.3*armspan + 0.1*age More compact notation: Introduce Why? Notation convenience!

29 Multiple inputs (features) revisited x0 Knee Height x1 Arm span x2 = 3 Notation: number of training examples number of features Age x3 Height y = = 1 = input features of i th training example (vector-valued). value of feature j in i th training example

30 Matrix and vector notation x0 Knee Height x1 Arm span x2 Age x3 Height y features of i th training example design matrix output/target vector (n+1) 1 m (n+1) m 1

31 Matrix and vector notation x0 Knee Height x1 Arm span x2 Age x3 Height y H θ = Xθ

32 LINEAR REGRESSION WITH GRADIENT DESCENT (GENERAL FORMULATION)

33 Linear regression problem statement Hypothesis: Cost function: high-dimensional quadratic ( bowl -shaped) function Goal is to find parameters which minimize the cost

34 Gradient descent (multiple features) with one input feature: learning rate (simultaneous update) error input with n input features: learning rate error input (simultaneous update for j=0 n) For j = 0: define for convenience

35 LINEAR REGRESSION ANALYTICAL SOLUTION

36 Analytical solution Set all partial derivatives of cost function = 0 Solving system of linear equations yields: design matrix output/target vector Moore-Penrose Pseudoinverse of Note: This analytical solution requires that columns of independent ( regular conditions) are linearly

37 Example: analytical solution applied to problem with one input Knee Height [cm] Height [cm] body height knee height

38 Example: analytical solution applied to problem with one input Knee Height [cm] Height [cm]

39 Predicting height from knee height body height knee height

40 Gradient descent Analytical solution Need to choose learning rate Iterative algorithm (needs many iterations to converge) Works well even when number of input features is large No need to choose Direct solution (no iteration) Slow if is too large (inverting n x n matrix)

41 NON-LINEAR FEATURES (NON-LINEAR BASIS FUNCTIONS)

42 Non-linear trends in data How can we learn non-linear hypotheses? x y ? 0???

43 Linear fit to this non-linear data x y standard design matrix Hypothesis: Optimal parameters:

44 Linear fit to this non-linear data

45 Non-linear (quadratic) fit x y design matrix with non-linear features Hypothesis: Optimal parameters:

46 Non-linear (quadratic) fit

47 Non-linear (sinusoid) fit x y design matrix with non-linear features Hypothesis: Optimal parameters:

48 Non-linear (sinusoidal) fit

49 Non-linear input features (in general) feature 2 of all training examples all features of 1st training example Feature 2 for each training example i is computed by applying a non-linear basis function: Allows to learn a variety of non-linear functions with the same technique(s): Analytical or gradient descent

50 Polynomial regression Features are powers of x n = degree of polynome to be learned n=0 n=1 n=3 n=9 What happened here? Next lecture

51 Radial basis functions Gaussian -shaped RBFs (localized representation): Each basis function j has a center in the input space The width of the basis functions is determined by x

52 Radial basis functions Gaussian -shaped RBFs: Each basis function j has a center in the input space The width of the basis functions is determined by x

53 Radial basis functions Gaussian -shaped RBFs: Each basis function j has a center in the input space The width of the basis functions is determined by x

54 Fitting a single RBF to data RBF with

55 Fitting RBFs to data RBFs with

56 Image: JPEG = cosine-basis Each block of 8x8 pixels is represented in a Fourier basis of cosine filters Better representation of edges and corners and compresses the data

57 SUMMARY (QUESTIONS)

58 Some questions Hypothesis for linear regression =? Cost function for linear regression =? How many local minima may the cost function for lin. reg. have (under regular conditions)? Name two ways to minimize the cost function? General gradient descent formula? How is Linear regression with gradient descent solved? What issues can arise during gradient descent? What is the design matrix? What are its dimensions? Analytical solution for linear regression =? What are the components of the solution? Pros and Cons of gradient descent vs. analytical solution? How can one learn non-linear hypotheses with linear regression? What is polynomial regression? What are radial basis functions?

59 What is next? Classification with Logistic Regression Gradient descent tricks & more advanced optimization techniques Underfitting & Overfitting Model selection (Training, Validation and test set)

COMPUTATIONAL INTELLIGENCE (CS) (INTRODUCTION TO MACHINE LEARNING) SS16. Lecture 2: Linear Regression Gradient Descent Non-linear basis functions

COMPUTATIONAL INTELLIGENCE (CS) (INTRODUCTION TO MACHINE LEARNING) SS16 Lecture 2: Linear Regression Gradient Descent Non-linear basis functions LINEAR REGRESSION MOTIVATION Why Linear Regression? Regression