Cost Functions in Machine Learning

Kevin Swingler

Motivation
Given some data that reflects measurements from the environment, we want to build a model that captures certain statistics about that data: something as simple as calculating the mean, or as complex as a multi-valued, non-linear regression model.

Cost Function
One class of approach is to define a cost function that compares the output of the model to the observed data. The task of designing the model then becomes the task of minimising the cost associated with the model. Why "cost"? Why not "error", for example? Because cost is more generic, as we shall see.

Simple Example: Calculate the Mean
Ever wonder how the equation $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$ came about? First, let's define the mean in terms of a cost function. We want to calculate a mean $\mu$ such that $\sum_{i=1}^{N} (x_i - \mu)^2$ is minimised. That is to say, we want to find the value $\mu$ that minimises the squared error between $\mu$ and the data.

Find $\mu$
So we write $\mu = \arg\min_{\mu} \sum_{i=1}^{N} (x_i - \mu)^2$, which means: the value of $\mu$ that minimises the summed squared differences between $\mu$ and each number in our sample.

How To Minimise the Error?
$\sum_{i=1}^{N} (x_i - \mu)^2$ is a quadratic function in $\mu$; its minimum is where the slope is zero.

Solve Analytically
Set the derivative of the cost with respect to $\mu$ to zero and solve for $\mu$, as sketched below.

What if We Can't Solve Analytically?
The minimum of the cost curve, where the slope is zero, is still the point we need to find.
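For reference, a minimal sketch of that analytic step, assuming a sample $x_1, \dots, x_N$: differentiate the cost, set the slope to zero, and solve.

```latex
\frac{d}{d\mu} \sum_{i=1}^{N} (x_i - \mu)^2
  = \sum_{i=1}^{N} 2(\mu - x_i) = 0
\;\;\Rightarrow\;\;
N\mu = \sum_{i=1}^{N} x_i
\;\;\Rightarrow\;\;
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i
```

This recovers the familiar formula for the mean, which is exactly where that equation comes from.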

Gradient Descent
We have seen that the gradient of the squared error cost function for a single point $x$ is $2(\mu - x)$, so we can pick a starting point and follow the gradient down to the bottom. (The true mean is zero in this example.)

A simple version:
1. Pick one data point at a time
2. Move the mean down the error curve a little
3. Repeat

Pick the first data point, let's say $x = 5$, so we start off with $\mu = 5$.

Then pick the next point, let's say $x = 3$. The derivative is $2(\mu - x) = 2(5 - 3) = 4$. Now, we only want to take small steps, so we use a learning rate $\eta = 0.1$. The update rule is $\mu \leftarrow \mu - \eta \cdot 2(\mu - x)$, so $\mu = 5 - 0.1 \times 4 = 4.6$.

Then perhaps $\mu = 4.6 - 0.1 \times 6 = 4$, and so on.

And so on, until the estimate hovers around the true mean. To get really precise when close, we might need to take smaller steps: perhaps let $\eta = 0.05$, then $\eta = 0.01$.
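A minimal Python sketch of this one-point-at-a-time descent, assuming a small made-up sample; the starting point, learning rate, and step-size reduction mirror the walkthrough above, but the exact values are illustrative.

```python
# Stochastic gradient descent for the mean.
# Cost per point: (x - mu)^2, so the per-point gradient is 2 * (mu - x).
data = [5.0, 3.0, 1.6, -2.0, 0.5, -4.0, 2.0, -1.5]   # made-up sample

mu = 5.0       # start at the first data point, as in the example above
eta = 0.1      # learning rate

for epoch in range(100):
    for x in data:
        grad = 2.0 * (mu - x)     # slope of the squared error for this point
        mu = mu - eta * grad      # move the estimate a little way downhill
    eta = max(eta * 0.9, 0.01)    # take smaller steps once we are close

print(mu)   # hovers around the sample mean of the data
```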

Batch or Stochastic Descent
In the last example, we updated the estimate once for every data point, one at a time. This is known as stochastic gradient descent (SGD). The process might need to be repeated several times, using each point more than once. An alternative is a batch approach, where the estimate is updated once per complete pass through the data.

Batch Gradient Descent
1. Calculate the average cost gradient across the whole data sample
2. Make one change to the estimate based on that average cost gradient
3. Repeat until some criterion is met

Batch descent is smoother than SGD, but it can be slower and doesn't work if data is streamed one point at a time.
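A sketch of the batch version under the same assumptions as before: one update per complete pass, driven by the average gradient over the whole sample.

```python
# Batch gradient descent: average the per-point gradients, then update once per pass.
data = [5.0, 3.0, 1.6, -2.0, 0.5, -4.0, 2.0, -1.5]

mu, eta = 0.0, 0.1
for sweep in range(200):
    avg_grad = sum(2.0 * (mu - x) for x in data) / len(data)
    mu = mu - eta * avg_grad      # one smooth step per pass through the data

print(mu)   # converges to the sample mean
```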

Mini-Batches
A good compromise is to use mini-batches. They smooth out some of the variation that SGD produces, yet are not as inefficient as a full batch update; a sketch combining both ideas follows below.

Stopping Criteria
Each data point causes only a small move in the estimate, so when do we stop? We can choose:
- A fixed number of iterations
- A target error
- A fixed number of iterations during which the average improvement is smaller than a threshold
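A sketch that pairs mini-batches with the last stopping rule above; the batch size and improvement threshold are illustrative choices, not values from the slides.

```python
import random

# Mini-batch gradient descent with an improvement-based stopping criterion.
data = [5.0, 3.0, 1.6, -2.0, 0.5, -4.0, 2.0, -1.5]

def cost(mu):
    return sum((x - mu) ** 2 for x in data)

mu, eta, batch_size, threshold = 0.0, 0.05, 4, 1e-8
previous = cost(mu)
while True:
    random.shuffle(data)                       # visit the points in a new order
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        grad = sum(2.0 * (mu - x) for x in batch) / len(batch)
        mu -= eta * grad                       # one step per mini-batch
    current = cost(mu)
    if previous - current < threshold:         # improvement too small: stop
        break
    previous = current

print(mu)
```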

Pros and Cons
Gradient descent is useful when local gradients are available but the global minimum cannot be found analytically. It can, however, suffer from a problem known as local minima.

Local Minima
If we are unlucky and start from the wrong point, following the gradient downhill leads us into a local dip.

We then end up with our estimate at that local minimum rather than at the global one.

Some Solutions
- Several re-starts from different starting points
- Momentum, to jump over small dips (sketched below)
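A sketch of the momentum idea, shown on the mean example again for simplicity; that cost has no local minima, so here the velocity term only smooths and speeds up the path, but the same update is what lets the estimate coast over small dips on bumpier cost functions. The coefficients are illustrative.

```python
# Gradient descent with momentum: keep a running "velocity" that accumulates
# past gradients, so the estimate keeps moving through small dips and flat spots.
data = [5.0, 3.0, 1.6, -2.0, 0.5, -4.0, 2.0, -1.5]

mu, velocity = 5.0, 0.0
eta, beta = 0.05, 0.9      # learning rate and momentum coefficient
for sweep in range(200):
    grad = sum(2.0 * (mu - x) for x in data) / len(data)
    velocity = beta * velocity - eta * grad    # blend old velocity with new gradient
    mu = mu + velocity

print(mu)   # settles at the sample mean
```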

Isn't That All a Bit Pointless?
For calculating a mean, yes it is: there is no need to use gradient descent for that. But there are other examples where you do need to. We will meet neural networks soon, which make use of gradient descent during learning.

Another Cost Function: Likelihood
What if we want to estimate the parameters of a probability distribution? What is a good cost function? The problem is to take a sample and estimate the probability distribution $P(x)$, usually in some parametrised form. Squared error cannot be used, as we never know the true value of any $P(x)$.

Calculating Likelihood
The likelihood associated with a model and a given data set is calculated as the product of the probability estimates made by the model across the examples in the data set: $L = \prod_i \hat{P}(x_i)$. We use $\hat{P}(x_i)$ to mean the estimate made by the model.

Log Likelihood
Probabilities can be small, and multiplying many of them together can produce very small numbers, so the log likelihood is often used instead: $LL = \sum_i \log \hat{P}(x_i)$.
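A small sketch of both quantities in Python, assuming a hypothetical coin sample and a fixed model (this anticipates the coin example on the next slide).

```python
import math

# Likelihood: product of the model's probability estimates over the sample.
# Log likelihood: sum of their logs, which avoids very small numbers.
model = {"H": 0.75, "T": 0.25}        # P-hat(x) for each possible outcome
sample = ["H"] * 75 + ["T"] * 25      # a hypothetical 100-toss sample

likelihood = math.prod(model[x] for x in sample)
log_likelihood = sum(math.log(model[x]) for x in sample)

print(likelihood)       # roughly 4e-25: tiny even for only 100 examples
print(log_likelihood)   # roughly -56.2: much easier to work with
```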

Simple Example
Let's say we toss a coin 100 times and get 75 heads and 25 tails. We now want to model that coin with a discrete function $\hat{P}(x)$ that takes $x = H$ or $x = T$ as input and outputs the associated probability (0.75 or 0.25, in this case). Again, the example is trivial and we know the answer is $\hat{P}(x = H) = 75/100$ and $\hat{P}(x = T) = 25/100$ (Bayesians look away now). But let's say we don't know that, or need a method that can cope in more complex situations where that shortcut can't be used.

Maximise Likelihood
We look for the parameter value that makes the likelihood (or log likelihood) of the sample as large as possible.

Negative Log Likelihood
Equivalently, we can minimise the negative log likelihood, which turns likelihood estimation back into minimising a cost.
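A sketch of that maximisation for the coin, assuming a head probability $p$: write the log likelihood, differentiate, and set the derivative to zero (minimising the negative log likelihood gives the same answer).

```latex
LL(p) = 75 \log p + 25 \log(1 - p),
\qquad
\frac{dLL}{dp} = \frac{75}{p} - \frac{25}{1 - p} = 0
\;\;\Rightarrow\;\;
75(1 - p) = 25p
\;\;\Rightarrow\;\;
p = \frac{75}{100} = 0.75
```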

Or Gradient Descent
Similarly, we could use an iterative approach and try to find the parameter with the largest likelihood by iteratively moving the estimate along the likelihood gradient, as in the sketch after this section.

  p      Gradient of the log likelihood
  0.50   100
  0.60   62.5
  0.70   23.81
  0.72   14.88
  0.75   0
  0.76   -5.48

Other Optimisations
There are many other methods for taking a cost function and trying to find its global minimum. Some follow gradients; others use different algorithms or heuristics. We will see more of them during the course.
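Returning to the coin, a minimal sketch of that iterative approach: gradient ascent on the log likelihood, using the gradient $75/p - 25/(1-p)$ tabulated above. The learning rate and iteration count are illustrative.

```python
# Gradient ascent on the log likelihood of the coin model.
heads, tails = 75, 25

def log_likelihood_gradient(p):
    # d/dp of [heads * log(p) + tails * log(1 - p)]
    return heads / p - tails / (1.0 - p)

p, eta = 0.5, 0.001     # start at a fair coin, take small steps
for step in range(1000):
    p = p + eta * log_likelihood_gradient(p)   # ascend: follow the gradient uphill

print(p)   # approaches the maximum-likelihood estimate, 0.75
```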

Summary
Many machine learning methods involve optimising some form of cost function. Sometimes it is possible to optimise the cost analytically; multiple linear regression, for example, does so. Other times you need an iterative approach such as gradient descent, for example when training a neural network.