Cost Functions in Machine Learning
Kevin Swingler

Motivation
Given some data that reflects measurements from the environment, we want to build a model that reflects certain statistics about that data: something as simple as calculating the mean, or as complex as a multi-valued, non-linear regression model.
Cost Function
One class of approach is to define a cost function that compares the output of the model to the observed data. The task of designing the model then becomes the task of minimising the cost associated with the model. Why cost? Why not error, for example? Cost is more generic, as we shall see.

Simple Example: Calculate the Mean
Ever wonder how the equation $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$ came about? First, let's define the mean in terms of a cost function. We want to calculate a mean $\mu$ such that $\sum_i (x_i - \mu)^2$ is minimised. That is to say, we want to find a value that minimises the squared error between $\mu$ and the data.
Find $\mu$
So, we write $\mu = \arg\min_{\mu} \sum_i (x_i - \mu)^2$, which means: the value of $\mu$ that minimises the summed squared differences between $\mu$ and each number in our sample.

How To Minimise the Error?
$\sum_i (x_i - \mu)^2$ is a quadratic function in $\mu$. Its minimum is where the slope is zero.
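As a minimal sketch (the sample below is made up for illustration), we can evaluate this cost over a grid of candidate values and confirm that its arg-min coincides with the sample mean:

```python
import numpy as np

# Squared-error cost J(mu) = sum_i (x_i - mu)^2, evaluated over a grid
# of candidate values for mu. The sample is made up for illustration.
x = np.array([1.0, 3.0, 5.0, 7.0])

def cost(mu):
    return np.sum((x - mu) ** 2)

candidates = np.linspace(0.0, 8.0, 81)
costs = [cost(mu) for mu in candidates]
best = candidates[np.argmin(costs)]
print(best, x.mean())  # both print 4.0: the arg-min of the cost is the mean
```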
Solve Analytically
Differentiate the cost with respect to $\mu$: $\frac{d}{d\mu} \sum_i (x_i - \mu)^2 = -2 \sum_i (x_i - \mu)$. Setting this to zero gives $\sum_i x_i = n\mu$, and so $\mu = \frac{1}{n}\sum_i x_i$, the familiar equation for the mean.

What if We Can't Solve Analytically?
[Figure: a cost curve whose minimum cannot be found analytically. This is the point we need to find.]
Gradient Descent
We have seen that the gradient of the squared error cost function, for a single point $x$, is $2(\mu - x)$. So we can pick a starting point and follow the gradient down to the bottom. The true mean is zero, in this example.

Gradient Descent
A simple version:
1. Pick one data point at a time
2. Move the mean down the error curve a little
3. Repeat
Pick the first data point, let's say $x = 5$, so we start off with $\mu = 5$.
Gradient Descent
Then pick the next point, let's say $x = 3$. So the gradient is $2(\mu - x) = 2(5 - 3) = 4$. Now, we only want to take small steps, so we use a learning rate, $\eta = 0.1$. The update rule is $\mu \leftarrow \mu - \eta \cdot 2(\mu - x)$.

Gradient Descent
So $\mu = 5 - 0.1 \times 4 = 4.6$.
Gradient Descent
Then perhaps the next point gives a gradient of 6, so $\mu = 4.6 - 0.1 \times 6 = 4.0$. And so on.
[Figures: successive gradient descent steps moving $\mu$ down the error curve.]
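The steps above fit in a short loop. This is a sketch, not the lecture's code; the second data point is not stated in the slides, so $x = 1.6$ (which produces the gradient of 6) is an assumption here:

```python
# One-point-at-a-time update mu <- mu - eta * 2 * (mu - x), reproducing
# the worked example: starting at mu = 5, the point x = 3 gives a
# gradient of 2 * (5 - 3) = 4, so mu moves to 5 - 0.1 * 4 = 4.6.
eta = 0.1          # learning rate
mu = 5.0           # start at the first data point, x = 5
data = [3.0, 1.6]  # x = 1.6 is assumed; it yields the gradient of 6 above

for x in data:
    grad = 2 * (mu - x)    # derivative of (x - mu)^2 with respect to mu
    mu = mu - eta * grad   # small step down the error curve
    print(mu)              # 4.6, then 4.0 (up to floating point)
```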
Gradient Descent
And so on, until it hovers around the true mean. To get really precise when close, we might need to take smaller steps: perhaps let $\eta = 0.05$, then $\eta = 0.01$.
Batch or Stochastic Descent
In the last example, we updated the estimate once for every data point, one at a time. This is known as stochastic gradient descent (SGD). This process might need to be repeated several times, using each point more than once. An alternative is to use a batch approach, where the estimate is updated once per complete pass through the data.

Batch Gradient Descent
Calculate the average cost gradient across the whole data sample. Make one change to the estimate based on that average cost gradient. Repeat until some criterion is met. Batch descent is smoother than SGD, but can be slower and doesn't work if data is streamed one point at a time. A sketch follows below.
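A minimal sketch of the batch version, assuming a made-up sample whose mean is zero to match the example above:

```python
import numpy as np

# Batch gradient descent for the mean: one update per full pass, using
# the average gradient 2 * mean(mu - x) over the whole sample.
x = np.array([5.0, 3.0, 1.6, -2.0, -7.6])  # made-up sample, mean 0.0
eta = 0.1
mu = 5.0

for epoch in range(50):
    grad = 2 * np.mean(mu - x)   # average cost gradient across the data
    mu = mu - eta * grad         # one change per complete pass
print(mu, x.mean())              # mu converges towards the sample mean, 0.0
```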
Mini Batches
A good compromise is to use mini-batches. This smooths out some of the variation that SGD produces, but is not as inefficient as a full batch update.

Stopping Criteria
Each data point will cause a small move in the estimate, so when do we stop? We can choose:
- A fixed number of iterations
- A target error
- A fixed number of iterations where the average improvement is smaller than a threshold
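A sketch combining mini-batches with the third stopping criterion (average improvement below a threshold); the data, batch size, and threshold are all illustrative choices, not from the slides:

```python
import numpy as np

# Mini-batch gradient descent with a simple stopping criterion:
# stop when the improvement in cost falls below a threshold.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=1000)  # made-up data, true mean 2.0

eta, batch_size, tol = 0.05, 32, 1e-8
mu, prev_cost = 0.0, np.inf

for epoch in range(1000):
    rng.shuffle(x)
    for batch in np.array_split(x, len(x) // batch_size):
        mu -= eta * 2 * np.mean(mu - batch)  # one update per mini-batch
    cost = np.mean((x - mu) ** 2)
    if prev_cost - cost < tol:               # improvement threshold reached
        break
    prev_cost = cost
print(epoch, mu)  # stops early, with mu close to the sample mean
```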
Pros and Cons
Gradient descent is useful when local gradients are available but the global minimum cannot be found analytically. It can suffer from a problem known as local minima.

Local Minima
[Figure: a cost curve with a local minimum. If we are unlucky and start here...]
Local Minima
[Figure: ...we will end up with our estimate stuck here, in the local minimum.]

Some Solutions
- Several random re-starts
- Momentum, to jump over small dips (a sketch follows below)
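A sketch of a momentum update, with an illustrative bumpy cost function; the names `grad_fn`, `eta` (learning rate), and `alpha` (momentum coefficient) are assumptions, not from the slides:

```python
import math

# Momentum update: a velocity term accumulates past gradients and can
# carry the estimate over small dips in the cost surface.
def descend_with_momentum(grad_fn, theta, eta=0.05, alpha=0.9, steps=200):
    velocity = 0.0
    for _ in range(steps):
        velocity = alpha * velocity - eta * grad_fn(theta)
        theta = theta + velocity   # keeps moving even where the slope is small
    return theta

# Illustrative bumpy cost: (t - 3)^2 + 0.4 sin(5t) has small local dips;
# its derivative is 2(t - 3) + 2 cos(5t).
def bumpy_grad(t):
    return 2 * (t - 3) + 2 * math.cos(5 * t)

print(descend_with_momentum(bumpy_grad, theta=-2.0))  # ends near t = 3
```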
Isn't That All a Bit Pointless?
For calculating a mean, yes it is. There is no need to use gradient descent for it. But there are other examples where you need to. We will meet neural networks soon, which make use of gradient descent during learning.

Another Cost Function: Likelihood
What if we want to estimate the parameters of a probability distribution? What is a good cost function? The problem is to take a sample and estimate the probability distribution $P(x)$, usually in some parametrised form. Squared error cannot be used, as we never know the true value of any $P(x_i)$.
Calculating Likelihood
The likelihood associated with a model and a given data set is calculated as the product of the probability estimates made by the model across the examples from the data set: $L = \prod_i \hat{P}(x_i)$. We use $\hat{P}(x_i)$ to mean the estimate made by the model.

Log Likelihood
Probabilities can be small, and multiplying many of them together can make very small numbers, so the log likelihood is often used: $\ell = \sum_i \log \hat{P}(x_i)$.
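These two definitions translate directly into code; `p_hat` is an assumed name for the model's estimate $\hat{P}$:

```python
import math

# Likelihood and log likelihood of a data set under a model's
# probability estimate p_hat(x); data is any iterable of examples.
def likelihood(p_hat, data):
    L = 1.0
    for x in data:
        L *= p_hat(x)           # product of per-example probability estimates
    return L

def log_likelihood(p_hat, data):
    # sum of logs: numerically safer than the raw product
    return sum(math.log(p_hat(x)) for x in data)
```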
Simple Example
Let's say we toss a coin 100 times and get 75 heads and 25 tails. We now want to model that coin with a discrete function $\hat{P}(x)$ that takes $x = H$ or $x = T$ as input and outputs the associated probability (0.75 or 0.25, in this case). Again, the example is trivial and we know the answer is $\hat{P}(x = H) = 75/100$ and $\hat{P}(x = T) = 25/100$. (Bayesians look away now.)

Simple Example
But let's say we don't know that, or need a method that can cope in more complex situations where that can't be used.
Maximise Likelihood
[Figure: the likelihood of the sample as a function of the coin parameter $p$, peaking at $p = 0.75$.]

Negative Log Likelihood
[Figure: the negative log likelihood of the sample, minimised at the same $p = 0.75$.]
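A minimal sketch of what those two plots show for the coin: evaluating the negative log likelihood of 75 heads and 25 tails over a grid of candidate values of $p = P(H)$ finds its minimum at $p = 0.75$:

```python
import math

# Negative log likelihood of 75 heads and 25 tails as a function of
# p = P(heads); minimising it maximises the likelihood.
def nll(p):
    return -(75 * math.log(p) + 25 * math.log(1 - p))

grid = [i / 100 for i in range(1, 100)]
best = min(grid, key=nll)
print(best, nll(best))  # 0.75 is the maximum-likelihood estimate
```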
Or Gradient Descent
Similarly, we could use an iterative approach and try to find the parameter with the largest likelihood by iteratively moving the estimate along the likelihood gradient. For the coin, the gradient of the log likelihood is $\frac{d\ell}{dp} = \frac{75}{p} - \frac{25}{1-p}$:

p      Log likelihood gradient
0.50   100
0.60   62.5
0.70   23.81
0.72   14.88
0.75   0
0.76   -5.48

The gradient is zero at $p = 0.75$, the maximum likelihood estimate (see the sketch after the next slide).

Other Optimisations
There are many other methods for taking a cost function and trying to find its global minimum. Some follow gradients, others use different algorithms or heuristics. We will see more of them during the course.
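The sketch referenced above: gradient ascent on the coin's log likelihood, using the gradient from the table; the learning rate is an illustrative choice:

```python
# Gradient *ascent* on the coin's log likelihood
# l(p) = 75 log(p) + 25 log(1 - p), whose gradient 75/p - 25/(1 - p)
# produces the values in the table above. eta = 0.001 is illustrative.
def grad(p):
    return 75 / p - 25 / (1 - p)

p, eta = 0.5, 0.001
for _ in range(100):
    p = p + eta * grad(p)   # move up the gradient to maximise likelihood
print(p)                    # converges to 0.75
```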
Summary
Many machine learning methods involve optimising some form of cost function. Sometimes it is possible to optimise the cost analytically; multiple linear regression, for example, does so. Other times you need to use an iterative approach such as gradient descent, for example when training a neural network.