CS4618: Artificial Intelligence I
Gradient Descent
Derek Bridge
School of Computer Science and Information Technology
University College Cork

Initialization

In [1]: %load_ext autoreload
        %autoreload 2
        %matplotlib inline

In [2]: import pandas as pd
        import numpy as np
        import matplotlib.pyplot as plt

In [45]: from sklearn.preprocessing import StandardScaler
         from sklearn.preprocessing import add_dummy_feature
         from sklearn.linear_model import SGDRegressor

Gradient Descent for OLS Regression

We saw the basic idea; now, the details.
In fact, there are three variants:
- Batch Gradient Descent
- Stochastic Gradient Descent
- Mini-batch Gradient Descent

Partial Derivatives

We need the gradient of the loss function with respect to each parameter β_j — in other words, how much the loss will change if we change β_j a little. With respect to a particular β_j, this is called the partial derivative.

Without doing the calculus, the partial derivative of J(X, y, β) with respect to β_j is:

$\frac{\partial J(X, y, \beta)}{\partial \beta_j} = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} \beta - y^{(i)}) x_j^{(i)}$

The gradient vector, $\nabla_{\beta} J(X, y, \beta)$, is a vector of each partial derivative:

$\nabla_{\beta} J(X, y, \beta) = \begin{bmatrix} \frac{\partial J(X, y, \beta)}{\partial \beta_0} \\ \frac{\partial J(X, y, \beta)}{\partial \beta_1} \\ \vdots \\ \frac{\partial J(X, y, \beta)}{\partial \beta_{n-1}} \end{bmatrix}$

And there is a vectorized way to compute it:

$\nabla_{\beta} J(X, y, \beta) = \frac{1}{m} X^T (X\beta - y)$

Gradient Descent, Again

Recap:
- It starts with an initial guess for the values of the parameters
- Then, repeatedly, it updates the parameter values, hopefully to reduce the loss

But now we know how to update the parameter values to reduce the loss:
- Compute the gradient vector
- But this points 'uphill' and we want to go 'downhill'
- And we want to make 'baby steps', so we use the learning rate, α, which is between 0 and 1
- So subtract α times the gradient vector from β:

$\beta \gets \beta - \alpha \nabla_{\beta} J(X, y, \beta)$

or

$\beta \gets \beta - \frac{\alpha}{m} X^T (X\beta - y)$

(BTW, this is vectorized. Naive loop implementations are wrong: they lose the simultaneous update of the β_j — see the sketch after the pseudocode below.)

Batch Gradient Descent

Pseudocode:

    initialize β randomly
    repeat until convergence:
        β ← β − (α/m) X^T (Xβ − y)

Why is it called Batch Gradient Descent? The update involves a calculation over the entire training set X on every iteration. This can be slow for large training sets.
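To make the remark about naive loop implementations concrete, here is a small sketch (not from the original notes) contrasting a sequential, element-by-element update of β with the correct simultaneous update; the function names are illustrative only.

import numpy as np

def wrong_update(X, y, beta, alpha):
    # WRONG: updates beta[j] in place, so the partial derivatives for later j
    # are computed with some components of beta already changed
    m = X.shape[0]
    for j in range(len(beta)):
        partial = X[:, j].dot(X.dot(beta) - y) / m   # uses the half-updated beta
        beta[j] -= alpha * partial
    return beta

def right_update(X, y, beta, alpha):
    # RIGHT: compute the whole gradient vector first, then update every beta[j] at once
    m = X.shape[0]
    gradient = X.T.dot(X.dot(beta) - y) / m
    return beta - alpha * gradient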

Batch Gradient Descent in numpy

For the hell of it, let's implement it ourselves
(We'll be naughty: we'll train on the whole dataset)

In [4]: # Loss function for OLS regression (assumes X contains all 1s in its first column)
        def J(X, y, beta):
            return np.mean((X.dot(beta) - y) ** 2) / 2.0

In [49]: def batch_gradient_descent_for_ols_linear_regression(X, y, alpha, num_iterations):
             m, n = X.shape
             beta = np.random.randn(n)
             Jvals = np.zeros(num_iterations)
             for iter in range(num_iterations):
                 beta -= (1.0 * alpha / m) * X.T.dot(X.dot(beta) - y)
                 Jvals[iter] = J(X, y, beta)
             return beta, Jvals

In [50]: # Use pandas to read the CSV file
         df = pd.read_csv("datasets/dataset_corka.csv")

         # Get the feature-values and the target values
         X_without_dummy_unscaled = df[["flarea", "bdrms", "bthrms"]].values
         y = df["price"].values

         # Scale it
         scaler = StandardScaler()
         X_without_dummy = scaler.fit_transform(X_without_dummy_unscaled)

         # Add the extra column to X
         X = add_dummy_feature(X_without_dummy)

         # Run the Batch Gradient Descent
         beta, Jvals = batch_gradient_descent_for_ols_linear_regression(X, y, alpha=0.03, num_iterations=1000)

         # Display beta
         beta

Out[50]: array([ , , , ])

Bear in mind that the coefficients it finds are on the scaled data.

It's a good idea to plot the values of the loss function against the number of iterations. If its value ever increases, then either:
- the code might be incorrect (I think it's OK!), or
- the value of α is too big and is causing divergence
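As an optional sanity check (not in the original notes), we might compare the coefficients found by Batch Gradient Descent with the closed-form least-squares solution, assuming X and y are still the scaled design matrix and target vector from the cell above:

# Closed-form OLS solution for comparison; beta_closed should be close to beta
beta_closed, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)         # coefficients from Batch Gradient Descent
print(beta_closed)  # coefficients from the closed-form solution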

In [51]: fig = plt.figure(figsize=(8,6))
         plt.title("$J$ during learning")
         plt.xlabel("Number of iterations")
         plt.xlim(1, Jvals.size)
         plt.ylabel("$J$")
         plt.ylim(3500, 50000)
         xvals = np.linspace(1, Jvals.size, Jvals.size)
         plt.scatter(xvals, Jvals)
         plt.show()

[Plot of $J$ against the number of iterations appears here.]

The algorithm gives us the problem of choosing the number of iterations.

An alternative is to use a very large number of iterations but exit early when the gradient vector becomes tiny: when its norm becomes smaller than a tolerance, η (a sketch of this appears after the next experiment).

Try it without scaling:

In [52]: # Get the feature-values and the target values
         X_without_dummy = df[["flarea", "bdrms", "bthrms"]].values
         y = df["price"].values

         # Add the extra column to X
         X = add_dummy_feature(X_without_dummy)

         # Run the Batch Gradient Descent
         beta, Jvals = batch_gradient_descent_for_ols_linear_regression(X, y, alpha=0.03, num_iterations=4000)

         # Display beta
         beta

C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:3: RuntimeWarning: overflow encountered in square
  app.launch_new_instance()
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:8: RuntimeWarning: invalid value encountered in subtract

Out[52]: array([ nan, nan, nan, nan])
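Here, as promised, is a sketch (not from the original notes) of the tolerance-based exit: a variant of the batch gradient descent function above that stops as soon as the norm of the gradient vector falls below the tolerance η.

def batch_gradient_descent_with_tolerance(X, y, alpha, max_iterations, tolerance=1e-6):
    m, n = X.shape
    beta = np.random.randn(n)
    for _ in range(max_iterations):
        gradient = X.T.dot(X.dot(beta) - y) / m
        if np.linalg.norm(gradient) < tolerance:
            break                       # the gradient is tiny, so stop early
        beta -= alpha * gradient
    return beta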

How can you get the unscaled version to work?

Some people suggest a variant of Batch Gradient Descent in which the value of α is decreased over time, i.e. its value in later iterations is smaller. Why do they suggest this? And why isn't it necessary? (But, we'll revisit this idea in Stochastic Gradient Descent)

Stochastic Gradient Descent

As we saw, Batch Gradient Descent can be slow on large training sets.

Stochastic Gradient Descent (SGD): on each iteration, it picks just one training example x — one example chosen at random — and computes the gradients on just that example:

$\beta \gets \beta - \alpha x^T (x\beta - y)$

- This gives a huge speed-up.
- It enables us to train on huge training sets, since only one example needs to be in memory in each iteration.
- But, because it is stochastic (the randomness), the loss will not necessarily decrease on each iteration: on average, the loss decreases, but in any one iteration, the loss may go up or down.
- Eventually, it will get close to the minimum, but it will continue to go up and down a bit.
- So, once you stop it, the β will be close to the best, but not necessarily optimal.
- Ironically, if you have a local minimum (which, with OLS regression, we don't), SGD might even escape the local minimum, and might even get to the global minimum.

Simulated Annealing

As we discussed, SGD does not settle at the minimum. One solution is to gradually reduce the learning rate:
- Updates start out 'large', so you make progress and can escape local minima.
- But, over time, updates get smaller, allowing SGD to settle at the global minimum.

The function that determines how to reduce the learning rate is called the learning schedule:
- Reduce it too quickly and you may get stuck in a local minimum, or end up stuck en route to the global minimum.
- Reduce it too slowly and you may bounce around a lot and, if stopped after too few iterations, may end up with a suboptimal solution.

SGD in scikit-learn

The fit method of scikit-learn's SGDRegressor class does what we have described:
- You must scale the features, but it inserts the extra column of 1s for you.
- You can supply a learning_rate and lots of other things (in the code below, we'll just use the defaults).

(In the code below, we'll be naughty: we'll train on the whole dataset)

In [53]: # Use pandas to read the CSV file
         df = pd.read_csv("datasets/dataset_corka.csv")

         # Get the feature-values and the target values
         X_unscaled = df[["flarea", "bdrms", "bthrms"]].values
         y = df["price"].values

         # Scale it
         scaler = StandardScaler()
         X = scaler.fit_transform(X_unscaled)

         # Create the SGDRegressor and fit the model
         sgd = SGDRegressor()
         sgd.fit(X, y)

Out[53]: SGDRegressor(alpha=0.0001, average=False, epsilon=0.1, eta0=0.01,
                      fit_intercept=True, l1_ratio=0.15, learning_rate='invscaling',
                      loss='squared_loss', n_iter=5, penalty='l2', power_t=0.25,
                      random_state=None, shuffle=True, verbose=0, warm_start=False)

SGD in numpy

For the hell of it, let's implement a simple version ourselves
(Again, we'll be naughty: we'll train on the whole dataset)

In [56]: def stochastic_gradient_descent_for_ols_linear_regression(X, y, alpha, num_epochs):
             m, n = X.shape
             beta = np.random.randn(n)
             Jvals = np.zeros(num_epochs * m)
             for epoch in range(num_epochs):
                 for i in range(m):
                     rand_idx = np.random.randint(m)
                     xi = X[rand_idx:rand_idx + 1]
                     yi = y[rand_idx:rand_idx + 1]
                     beta -= alpha * xi.T.dot(xi.dot(beta) - yi)
                     Jvals[epoch * m + i] = J(X, y, beta)
             return beta, Jvals

In [57]: # Use pandas to read the CSV file
         df = pd.read_csv("datasets/dataset_corka.csv")

         # Get the feature-values and the target values
         X_without_dummy_unscaled = df[["flarea", "bdrms", "bthrms"]].values
         y = df["price"].values

         # Scale it
         scaler = StandardScaler()
         X_without_dummy = scaler.fit_transform(X_without_dummy_unscaled)

         # Add the extra column to X
         X = add_dummy_feature(X_without_dummy)

         # Run the Stochastic Gradient Descent
         beta, Jvals = stochastic_gradient_descent_for_ols_linear_regression(X, y, alpha=0.03, num_epochs=50)

         # Display beta
         beta

Out[57]: array([ , , , ])

In [58]: fig = plt.figure(figsize=(8,6))
         plt.title("$J$ during learning")
         plt.xlabel("Number of iterations")
         plt.xlim(1, Jvals.size)
         plt.ylabel("$J$")
         plt.ylim(3500, 50000)
         xvals = np.linspace(1, Jvals.size, Jvals.size)
         plt.scatter(xvals, Jvals)
         plt.show()

[Plot of $J$ against the number of iterations appears here.]

Quite a bumpy ride! So, let's try simulated annealing.

In [59]: def learning_schedule(t):
             return 5 / (t + 50)

         def stochastic_gradient_descent_for_ols_linear_regression(X, y, num_epochs):
             m, n = X.shape
             beta = np.random.randn(n)
             Jvals = np.zeros(num_epochs * m)
             for epoch in range(num_epochs):
                 for i in range(m):
                     rand_idx = np.random.randint(m)
                     xi = X[rand_idx:rand_idx + 1]
                     yi = y[rand_idx:rand_idx + 1]
                     alpha = learning_schedule(epoch * m + i)
                     beta -= alpha * xi.T.dot(xi.dot(beta) - yi)
                     Jvals[epoch * m + i] = J(X, y, beta)
             return beta, Jvals

In [60]: # Use pandas to read the CSV file
         df = pd.read_csv("datasets/dataset_corka.csv")

         # Get the feature-values and the target values
         X_without_dummy_unscaled = df[["flarea", "bdrms", "bthrms"]].values
         y = df["price"].values

         # Scale it
         scaler = StandardScaler()
         X_without_dummy = scaler.fit_transform(X_without_dummy_unscaled)

         # Add the extra column to X
         X = add_dummy_feature(X_without_dummy)

         # Run the Stochastic Gradient Descent
         beta, Jvals = stochastic_gradient_descent_for_ols_linear_regression(X, y, num_epochs=50)

         # Display beta
         beta

Out[60]: array([ e+02, e+02, e+01, e-01])
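As an aside (not in the original notes), we can visualize how this schedule decays the learning rate over the course of the run; a minimal sketch, assuming the learning_schedule function and the Jvals array from the cells above:

ts = np.arange(Jvals.size)                       # one value of t per update
alphas = np.array([learning_schedule(t) for t in ts])

fig = plt.figure(figsize=(8, 6))
plt.title("Learning rate during learning")
plt.xlabel("Number of iterations")
plt.ylabel(r"$\alpha$")
plt.plot(ts, alphas)
plt.show()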

In [61]: fig = plt.figure(figsize=(8,6))
         plt.title("$J$ during learning")
         plt.xlabel("Number of iterations")
         plt.xlim(1, Jvals.size)
         plt.ylabel("$J$")
         plt.ylim(3500, 50000)
         xvals = np.linspace(1, Jvals.size, Jvals.size)
         plt.scatter(xvals, Jvals)
         plt.show()

[Plot of $J$ against the number of iterations appears here.]

Mini-Batch Gradient Descent

Batch Gradient Descent computed gradients from the full training set. Stochastic Gradient Descent computed gradients from just one example. Mini-Batch Gradient Descent lies between the two: it computes gradients from a small randomly-selected subset of the training set, called a mini-batch (see the sketch below).

Since it lies between the two:
- It may bounce less and get closer to the global minimum than SGD, although both of them can reach the global minimum with a good learning schedule.
- But it may be harder to escape local minima, if you have them (which, for OLS, we don't).
- And its time and memory costs lie between the two.
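The original notes do not include an implementation, but here is a minimal sketch of Mini-Batch Gradient Descent in the same style as the functions above, assuming the loss function J and the imports from earlier cells; the function name and the batch_size parameter are illustrative only:

def mini_batch_gradient_descent_for_ols_linear_regression(X, y, alpha, num_epochs, batch_size=32):
    m, n = X.shape
    beta = np.random.randn(n)
    Jvals = []
    for epoch in range(num_epochs):
        shuffled = np.random.permutation(m)            # visit the examples in a random order
        for start in range(0, m, batch_size):
            idx = shuffled[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # gradient computed on the mini-batch only
            beta -= (alpha / len(idx)) * Xb.T.dot(Xb.dot(beta) - yb)
            Jvals.append(J(X, y, beta))
    return beta, np.array(Jvals)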

The Normal Equation versus Gradient Descent

Efficiency/scaling-up:
- Normal Equation: it is linear in m, so it can handle large training sets efficiently if they fit into main memory; but it has to compute the inverse (or pseudo-inverse) of an n × n matrix, which takes time between quadratic and cubic in n, and so is only feasible for smallish n (up to a few thousand).
- Gradient Descent: SGD scales really well to huge m, and all three Gradient Descent methods can handle huge n (even 100s of 1000s).

Finding the global minimum for OLS regression:
- Normal Equation: guaranteed to find the global minimum.
- Gradient Descent: all a bit dependent on the number of iterations, the learning rate and the learning schedule.

Feature scaling:
- Normal Equation: scaling is not needed. (In fact, I find that scikit-learn's LinearRegression class produces weird results if I do any scaling. I don't know why. So don't do it!)
- Gradient Descent: scaling is needed.

Finally, Gradient Descent is a general method, whereas the Normal Equation is only for OLS regression.

Logistic Regression

So what about classification using logistic regression?
- We have a different loss function (cross-entropy)
- Happily, it is convex
- But there is no equivalent to the Normal Equation, so we must use Gradient Descent

Not that it matters, but here is the partial derivative of its loss function with respect to β_j (binary classification), where σ is the sigmoid function:

$\frac{\partial J}{\partial \beta_j} = \frac{1}{m} \sum_{i=1}^{m} (\sigma(x^{(i)} \beta) - y^{(i)}) x_j^{(i)}$

scikit-learn has the class LogisticRegression, but also SGDClassifier if you want more control (a small sketch appears at the end of these notes).

After Christmas

Here endeth CS4618. What will we do in CS4619?
- We will study some more complex models (i.e. non-linear ones)
- We will study underfitting and overfitting, and solutions to these
- This will lead into Neural Networks
- From there, we will study so-called Deep Learning for regression and classification, including for images
- We will generalize to problems such as sequence-to-vector, vector-to-sequence and sequence-to-sequence, such as machine translation, speech recognition, ...
- We will revisit Reinforcement Learning
- We will consider knowledge representation and reasoning

It'll be tough but brilliant.

In [ ]:
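As a closing aside (not part of the original notes), here is a minimal sketch of the logistic-regression option mentioned above, using scikit-learn's SGDClassifier; the toy feature matrix and labels are made up for illustration, and depending on your scikit-learn version the loss may be spelled "log" or "log_loss":

from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
import numpy as np

# Made-up toy data: 6 examples, 2 features, binary labels
X_toy = np.array([[1.0, 2.0], [2.0, 1.0], [2.5, 3.0], [6.0, 5.0], [7.0, 8.0], [8.0, 6.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# As with SGDRegressor, the features should be scaled
X_toy_scaled = StandardScaler().fit_transform(X_toy)

# loss="log" selects logistic regression trained by stochastic gradient descent
clf = SGDClassifier(loss="log")
clf.fit(X_toy_scaled, y_toy)
print(clf.predict(X_toy_scaled))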
