
Chapter 2

Although stochastic gradient descent can be considered as an approximation of gradient descent, it typically reaches convergence much faster because of the more frequent weight updates. Since each gradient is calculated based on a single training example, the error surface is noisier than in gradient descent, which can also have the advantage that stochastic gradient descent can escape shallow local minima more readily. To obtain accurate results via stochastic gradient descent, it is important to present it with data in a random order, which is why we want to shuffle the training set for every epoch to prevent cycles.

In stochastic gradient descent implementations, the fixed learning rate $\eta$ is often replaced by an adaptive learning rate that decreases over time, for example,

$\eta = \frac{c_1}{[\text{number of iterations}] + c_2}$

where $c_1$ and $c_2$ are constants. Note that stochastic gradient descent does not reach the global minimum but an area very close to it. By using an adaptive learning rate, we can achieve further annealing to a better global minimum.

Another advantage of stochastic gradient descent is that we can use it for online learning. In online learning, our model is trained on-the-fly as new training data arrives. This is especially useful if we are accumulating large amounts of data, for example, customer data in typical web applications. Using online learning, the system can immediately adapt to changes, and the training data can be discarded after updating the model if storage space is an issue.

A compromise between batch gradient descent and stochastic gradient descent is so-called mini-batch learning. Mini-batch learning can be understood as applying batch gradient descent to smaller subsets of the training data, for example, 50 samples at a time. The advantage over batch gradient descent is that convergence is reached faster via mini-batches because of the more frequent weight updates. Furthermore, mini-batch learning allows us to replace the for-loop over the training samples in stochastic gradient descent (SGD) with vectorized operations, which can further improve the computational efficiency of our learning algorithm.
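To make the epoch-wise shuffling, the adaptive learning rate, and the vectorized mini-batch update concrete, here is a minimal NumPy sketch of mini-batch SGD for a linear model trained with a squared-error loss. It is an illustrative sketch, not the book's Adaline code; the names mini_batch_sgd, c1, c2, and batch_size, as well as the particular decay schedule, are assumptions for this example.

import numpy as np

def mini_batch_sgd(X, y, n_epochs=20, batch_size=50, c1=1.0, c2=10.0, seed=1):
    # Illustrative mini-batch SGD for a linear model with squared-error loss.
    rgen = np.random.RandomState(seed)
    w = rgen.normal(scale=0.01, size=X.shape[1] + 1)  # weights, w[0] is the bias unit
    t = 0                                             # update counter for the learning-rate decay
    for _ in range(n_epochs):
        # Shuffle the training set every epoch to prevent cycles
        idx = rgen.permutation(len(y))
        X_shuf, y_shuf = X[idx], y[idx]
        for start in range(0, len(y), batch_size):
            xb = X_shuf[start:start + batch_size]
            yb = y_shuf[start:start + batch_size]
            eta = c1 / (t + c2)                        # adaptive learning rate that decreases over time
            errors = yb - (xb.dot(w[1:]) + w[0])       # vectorized over the whole mini-batch
            w[1:] += eta * xb.T.dot(errors) / len(yb)  # gradient step on the mean squared error
            w[0] += eta * errors.mean()
            t += 1
    return w

Setting batch_size=1 recovers plain stochastic gradient descent, while batch_size=len(y) recovers batch gradient descent.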

Chapter 3

We will assign the petal length and petal width of the 150 flower samples to the feature matrix X and the corresponding class labels of the flower species to the vector y:

>>> from sklearn import datasets
>>> import numpy as np
>>> iris = datasets.load_iris()
>>> X = iris.data[:, [2, 3]]
>>> y = iris.target

If we executed np.unique(y) to return the different class labels stored in iris.target, we would see that the Iris flower class names, Iris-Setosa, Iris-Versicolor, and Iris-Virginica, are already stored as integers (0, 1, 2), which is recommended for the optimal performance of many machine learning libraries.

To evaluate how well a trained model performs on unseen data, we will further split the dataset into separate training and test datasets. Later in Chapter 5, Compressing Data via Dimensionality Reduction, we will discuss the best practices around model evaluation in more detail:

>>> from sklearn.cross_validation import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.3, random_state=0)

Using the train_test_split function from scikit-learn's cross_validation module, we randomly split the X and y arrays into 30 percent test data (45 samples) and 70 percent training data (105 samples).

Many machine learning and optimization algorithms also require feature scaling for optimal performance, as we remember from the gradient descent example in Chapter 2, Training Machine Learning Algorithms for Classification. Here, we will standardize the features using the StandardScaler class from scikit-learn's preprocessing module:

>>> from sklearn.preprocessing import StandardScaler
>>> sc = StandardScaler()
>>> sc.fit(X_train)
>>> X_train_std = sc.transform(X_train)
>>> X_test_std = sc.transform(X_test)
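As a quick check on what the two preceding listings did (this snippet is not part of the book's code), we can confirm the 70/30 split sizes and that the scaler, fitted on the training data only, yields standardized training features with zero mean and unit variance:

>>> X_train.shape, X_test.shape
((105, 2), (45, 2))
>>> np.allclose(X_train_std.mean(axis=0), 0.0)
True
>>> np.allclose(X_train_std.std(axis=0), 1.0)
True

The same mean and standard deviation learned from the training set are reused to transform X_test, so the test data is scaled consistently without leaking information from it into the preprocessing step.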

Building Good Training Sets - Data Preprocessing

Applied to the standardized Wine data, the L1 regularized logistic regression would yield the following sparse solution:

>>> lr = LogisticRegression(penalty='l1', C=0.1)
>>> lr.fit(X_train_std, y_train)
>>> print('Training accuracy:', lr.score(X_train_std, y_train))
Training accuracy: 0.983870967742
>>> print('Test accuracy:', lr.score(X_test_std, y_test))
Test accuracy: 0.981481481481

Both training and test accuracies (both 98 percent) do not indicate any overfitting of our model. When we access the intercept terms via the lr.intercept_ attribute, we can see that the array returns three values:

>>> lr.intercept_
array([-0.38379237, -0.1580855 , -0.70047966])

Since we fit the LogisticRegression object on a multiclass dataset, it uses the One-vs-Rest (OvR) approach by default, where the first intercept belongs to the model that fits class 1 versus classes 2 and 3, the second value is the intercept of the model that fits class 2 versus classes 1 and 3, and the third value is the intercept of the model that fits class 3 versus classes 1 and 2, respectively:

>>> lr.coef_
array([[ 0.280,  0.000,  0.000, -0.0282,  0.000,  0.000,  0.710,  0.000,
         0.000,  0.000,  0.000,  0.000,  1.236],
       [-0.644, -0.0688, -0.0572,  0.000,  0.000,  0.000,  0.000,  0.000,
         0.000, -0.927,  0.060,  0.000, -0.371],
       [ 0.000,  0.061,  0.000,  0.000,  0.000,  0.000, -0.637,  0.000,
         0.000,  0.499, -0.358, -0.570,  0.000]])

The weight array that we accessed via the lr.coef_ attribute contains three rows of weight coefficients, one weight vector for each class. Each row consists of 13 weights, where each weight is multiplied by the respective feature in the 13-dimensional Wine dataset to calculate the net input:

$z = w_1 x_1 + \dots + w_m x_m = \sum_{j=0}^{m} x_j w_j = \mathbf{w}^T \mathbf{x}$
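To connect this formula back to the fitted attributes, the net input of each of the three OvR classifiers can be computed directly from lr.coef_ and lr.intercept_ and compared with scikit-learn's decision_function. This is a small supplementary sketch using the standardized Wine test set from above, not part of the book's listing:

>>> import numpy as np
>>> Z = np.dot(X_test_std, lr.coef_.T) + lr.intercept_   # z = w^T x + b for every sample and class
>>> np.allclose(Z, lr.decision_function(X_test_std))     # scikit-learn computes the same quantity
True
>>> # The predicted label is the class whose OvR classifier produces the largest net input:
>>> np.array_equal(lr.classes_[Z.argmax(axis=1)], lr.predict(X_test_std))
True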

Chapter 5

Scikit-learn also implements advanced techniques for nonlinear dimensionality reduction that are beyond the scope of this book. You can find a nice overview of the current implementations in scikit-learn, complemented with illustrative examples, at http://scikit-learn.org/stable/modules/manifold.html.

Summary

In this chapter, you learned about three different, fundamental dimensionality reduction techniques for feature extraction: standard PCA, LDA, and kernel PCA. Using PCA, we projected data onto a lower-dimensional subspace to maximize the variance along the orthogonal feature axes while ignoring the class labels. LDA, in contrast to PCA, is a technique for supervised dimensionality reduction, which means that it considers class information in the training dataset to attempt to maximize the class-separability in a linear feature space. Lastly, you learned about a kernelized version of PCA, which allows you to map nonlinear datasets onto a lower-dimensional feature space where the classes become linearly separable.

Equipped with these essential preprocessing techniques, you are now well prepared to learn about the best practices for efficiently incorporating different preprocessing techniques and evaluating the performance of different models in the next chapter.

Chapter 10

On a side note, it is also worth mentioning that we technically don't have to update the weights of the intercept if we are working with standardized variables, since the y axis intercept is always 0 in those cases. We can quickly confirm this by printing the weights:

>>> print('Slope: %.3f' % lr.w_[1])
Slope: 0.695
>>> print('Intercept: %.3f' % lr.w_[0])
Intercept: -0.000

Estimating the coefficient of a regression model via scikit-learn

In the previous section, we implemented a working model for regression analysis. However, in a real-world application, we may be interested in more efficient implementations, for example, scikit-learn's LinearRegression object that makes use of the LIBLINEAR library and advanced optimization algorithms that work better with unstandardized variables. This is sometimes desirable for certain applications:

>>> from sklearn.linear_model import LinearRegression
>>> slr = LinearRegression()
>>> slr.fit(X, y)
>>> print('Slope: %.3f' % slr.coef_[0])
Slope: 9.102
>>> print('Intercept: %.3f' % slr.intercept_)
Intercept: -34.671

As we can see by executing the preceding code, scikit-learn's LinearRegression model fitted with the unstandardized RM and MEDV variables yielded different model coefficients. Let's compare it to our own GD implementation by plotting MEDV against RM:

>>> lin_regplot(X, y, slr)
>>> plt.xlabel('Average number of rooms [RM]')
>>> plt.ylabel('Price in $1000\'s [MEDV]')
>>> plt.show()
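The lin_regplot helper called above was defined earlier in the chapter and is not reproduced in this excerpt. A minimal sketch of such a helper, assuming it simply scatters the samples and overlays the model's predictions, might look as follows:

import matplotlib.pyplot as plt

def lin_regplot(X, y, model):
    # Scatter the training samples and overlay the fitted regression line
    plt.scatter(X, y, c='steelblue', s=60)
    plt.plot(X, model.predict(X), color='black')
    return None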