CS 237: Probability in Computing
Wayne Snyder
Computer Science Department, Boston University

Lecture 26: Logistic Regression 2

Gradient Descent for Linear Regression
Gradient Descent for Logistic Regression
Example: Logistic Regression for Heights and Gender
Putting it all together: Multivariate Logistic Regression
Recall: we were searching for the parameters m and b for this linear regression problem using the technique of gradient descent:
Basic idea: Define a cost or loss function J(...) which gives the cost or penalty for a given choice of parameters, then search for the parameters which minimize this cost. So let's pretend we didn't have the formulae at the upper right, and suppose we needed to find them by gradient descent. In linear regression this would mean finding the values of m and b which minimize the MSE, in other words minimize the cost function:

Cost Function = MSE
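Written out (consistent with the update code below), the MSE cost for a candidate line y = mx + b over N data points (x_i, y_i) is:

$$J(m, b) \;=\; \frac{1}{N}\sum_{i=1}^{N} \bigl(y_i - (m\,x_i + b)\bigr)^2$$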
Partial Derivatives: To find a partial derivative of a function with multiple parameters, we find the rate at which the function varies with respect to each parameter -- its derivative -- in isolation, considering each of the other parameters to effectively be a constant. We will use b and m for the parameters, and will represent points as column vectors:
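For the MSE cost above, the two partial derivatives (the same quantities the update code below computes) are:

$$\frac{\partial J}{\partial m} \;=\; \frac{1}{N}\sum_{i=1}^{N} -2\,x_i\bigl(y_i - (m\,x_i + b)\bigr)
\qquad\qquad
\frac{\partial J}{\partial b} \;=\; \frac{1}{N}\sum_{i=1}^{N} -2\,\bigl(y_i - (m\,x_i + b)\bigr)$$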
To find the minimum value along one axis we will work with only one of the partial derivatives at a time, say the one for the y-intercept:

Step One: Choose an initial point b_0.
Step Two: Choose a step size or learning rate λ and a threshold of accuracy ε.
Step Three: Move a distance λ · J′(b_n) along the axis, in the decreasing direction (the negative of the slope), so b_{n+1} = b_n − λ · J′(b_n); repeat until the distance moved is less than ε.
Step Four: Output b_{n+1} as the minimum.

Let's look at a notebook showing this...
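As a sketch of these four steps in Python (assuming a function deriv_b that computes ∂J/∂b with m held fixed; both names here are illustrative):

def gradient_descent_1d(deriv_b, b0, lam, epsilon):
    # Minimize along the b axis alone: keep stepping against the
    # slope until the distance moved falls below epsilon.
    b = b0
    while True:
        step = lam * deriv_b(b)      # distance to move (Step Three)
        b_next = b - step            # move in the decreasing direction
        if abs(step) < epsilon:      # stop once the movement is tiny
            return b_next            # Step Four: output b_{n+1}
        b = b_next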
Gradient Descent for Linear Regression: To find a minimum point in multiple dimensions, we simply do all dimensions at the same time. Here is the algorithm from the reading:

def update(m, b, X, Y, lam):
    # One gradient step: compute the partial derivatives of the MSE
    # with respect to m and b, then move each parameter a distance
    # lam times its derivative, against the gradient.
    N = len(X)
    m_deriv = 0
    b_deriv = 0
    for i in range(N):
        m_deriv += -2 * X[i] * (Y[i] - (m * X[i] + b))
        b_deriv += -2 * (Y[i] - (m * X[i] + b))
    m_deriv /= float(N)
    b_deriv /= float(N)
    m -= m_deriv * lam
    b -= b_deriv * lam
    return (m, b)

def gradient_descent(m, b, X, Y, lam, epsilon):
    # Repeat update steps until the parameters move by less than epsilon.
    while True:
        (m1, b1) = update(m, b, X, Y, lam)
        if abs(m1 - m) + abs(b1 - b) < epsilon:
            return (m1, b1)
        m = m1
        b = b1
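For example, a run on a small illustrative data set (the numbers, starting values, learning rate, and threshold here are all made-up choices):

X = [1, 2, 3, 4, 5]
Y = [2.1, 3.9, 6.2, 7.8, 10.1]                    # roughly y = 2x
(m, b) = gradient_descent(0, 0, X, Y, 0.01, 1e-6)
print(m, b)                                       # should come out near m = 2, b = 0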
As the parameters are tuned to minimize the cost (which measures how well the parameters fit the model), you get a better and better fit between the model and the data. You can run the gradient descent model as long as you wish to get a better fit. Obviously, defining the cost function and picking the learning rate and threshold are critical decisions, and much research has been devoted to different cost models and different approaches to gradient descent.
There are dozens of different cost functions that have been defined, and even more variants of algorithms that perform minimization of a given cost function are available in standard Python libraries:
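For instance, scipy's general-purpose minimizer can be handed the same MSE cost directly (a sketch, using the same illustrative data as above):

import numpy as np
from scipy.optimize import minimize

X = np.array([1, 2, 3, 4, 5])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def mse(params):                        # cost as a function of (m, b)
    m, b = params
    return np.mean((Y - (m * X + b)) ** 2)

result = minimize(mse, x0=[0.0, 0.0])   # default quasi-Newton (BFGS) method
print(result.x)                         # the fitted (m, b)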
Logistic Regression: The Cost Function

Ok, back to logistic regression! A common cost function used in logistic regression is Cross-Entropy or Log-Loss: If y is the actual value (0 or 1, from the actual data) and ŷ is the predicted probability P(Y = 1) output by logistic regression, then the cost of this prediction is:

$$\text{cost}(y, \hat{y}) \;=\; -\bigl[\, y \log(\hat{y}) + (1 - y)\log(1 - \hat{y}) \,\bigr]$$
Logistic Regression: The Cost Function

Thus, in our simple example (predicting gender from height) we want to find the y-intercept b and slope m such that the regression line, after being transformed by the sigmoid function σ(...), minimizes the following cost:

$$J(m, b) \;=\; -\frac{1}{N}\sum_{i=1}^{N} \bigl[\, y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \,\bigr]$$

where ŷ_i = σ(m·x_i + b) is the predicted probability P(y_i = 1).
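A minimal sketch of the corresponding update step in Python (names are illustrative; X is an array of heights and Y the 0/1 genders), using the fact that for the sigmoid model the partial derivatives of the cross-entropy cost reduce to averages of (prediction − actual) terms:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_update(m, b, X, Y, lam):
    preds = sigmoid(m * X + b)          # predicted P(y_i = 1) for each point
    m_deriv = np.mean((preds - Y) * X)  # dJ/dm
    b_deriv = np.mean(preds - Y)        # dJ/db
    return (m - lam * m_deriv, b - lam * b_deriv)

The same gradient_descent loop as before can then drive this update until the parameters stop moving.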
Logistic Regression: Putting it All Together

In the case of m independent variables X_1, ..., X_m and a binary dependent variable Y, where the prediction for the i-th sample (row) using the parameters β_0, β_1, ..., β_m is denoted

$$\hat{y}_i \;=\; \sigma(\beta_0 + \beta_1 x_{i,1} + \dots + \beta_m x_{i,m}),$$

we use the same gradient descent algorithm to find the values of the parameters such that the following cost function is minimized:

$$J(\beta_0, \dots, \beta_m) \;=\; -\frac{1}{N}\sum_{i=1}^{N} \bigl[\, y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \,\bigr]$$
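A vectorized sketch of this with numpy (a hypothetical implementation, not the course's reference code): here X is an N × (m+1) matrix whose first column is all ones, so that β_0 acts as the intercept, and Y is a length-N vector of 0s and 1s.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_descent(X, Y, lam=0.1, epsilon=1e-8):
    beta = np.zeros(X.shape[1])            # one parameter per column of X
    while True:
        preds = sigmoid(X @ beta)          # predicted P(y_i = 1) for every row
        grad = X.T @ (preds - Y) / len(Y)  # gradient of the cross-entropy cost
        beta1 = beta - lam * grad          # step against the gradient
        if np.sum(np.abs(beta1 - beta)) < epsilon:
            return beta1
        beta = beta1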
Logistic Regression: Supervised Learning

Even if we have successfully minimized the cost function, surely the cost will not be 0, which means that our model will not be perfect. How do we evaluate our model? We can think of this algorithm as trying to learn the categories (0 or 1) that the independent variables belong to, and use our data itself to test the results. The basic idea is to take a random selection (say 80%) of our data set and find the best-fitting parameters (called training), then test the model on the remaining 20%. The percentage of the remaining data that is successfully classified is the accuracy of our model.

[Diagram: Train / Test split of the data set]
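A sketch of this evaluation in Python (the 80/20 split and the helper name are illustrative; it reuses sigmoid and logistic_gradient_descent from the sketch above):

import numpy as np

def train_test_accuracy(X, Y, train_frac=0.8):
    idx = np.random.permutation(len(Y))    # shuffle the rows
    cut = int(train_frac * len(Y))
    train, test = idx[:cut], idx[cut:]
    beta = logistic_gradient_descent(X[train], Y[train])   # train on 80%
    preds = sigmoid(X[test] @ beta) >= 0.5  # classify as 1 when P >= 0.5
    return np.mean(preds == Y[test])        # fraction correctly classified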