scikit-learn (Machine Learning in Python)
(PB13007115)
2016-07-12
Outline

1 Introduction
2 scikit-learn examples
3 Captcha recognition
4 Limitations
5 References
scikit-learn

Simple and efficient tools for data mining and data analysis
Accessible to everybody, and reusable in various contexts
Built on NumPy, SciPy, and matplotlib
Open source, commercially usable - BSD license
Types of machine learning problems and tasks

Supervised learning: The computer is presented with example inputs and their desired outputs, given by a teacher, and the goal is to learn a general rule that maps inputs to outputs.
Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).
Reinforcement learning: A computer program interacts with a dynamic environment in which it must perform a certain goal (such as driving a vehicle), without a teacher explicitly telling it whether it has come close to its goal. Another example is learning to play a game by playing against an opponent.
Types of machine learning problems and tasks

Another categorization of machine learning tasks arises when one considers the desired output of a machine-learned system:

In classification, inputs are divided into two or more classes, and the learner must produce a model that assigns unseen inputs to one or more (multi-label classification) of these classes.
In regression, also a supervised problem, the outputs are continuous rather than discrete.
In clustering, a set of inputs is to be divided into groups. Unlike in classification, the groups are not known beforehand, making this typically an unsupervised task.
Density estimation finds the distribution of inputs in some space.
Dimensionality reduction simplifies inputs by mapping them into a lower-dimensional space.
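As a rough illustration of these task types, the sketch below runs one scikit-learn estimator per category on the bundled iris data. The dataset and the particular estimators are assumptions chosen for brevity; the slides do not prescribe them.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target  # 150 samples, 4 features, 3 classes

# Classification: predict a discrete class label from labeled examples.
clf = LogisticRegression(max_iter=1000).fit(X, y)
class_pred = clf.predict(X[:5])

# Regression: predict a continuous value (here the 4th feature from the first 3).
reg = LinearRegression().fit(X[:, :3], X[:, 3])
value_pred = reg.predict(X[:5, :3])

# Clustering: group the inputs without using any labels.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

# Dimensionality reduction: map the 4-D inputs into a 2-D space.
X_2d = PCA(n_components=2).fit_transform(X)
```

All four estimators share the same fit-based interface, which is what makes it easy to swap one method for another.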
Choosing the right estimator
Generalized Linear Models

The following are a set of methods intended for regression in which the target value is expected to be a linear combination of the input variables. In mathematical notation, if $\hat{y}$ is the predicted value:

$$\hat{y}(w, x) = w_0 + w_1 x_1 + \dots + w_p x_p \tag{1}$$

Across the module, we designate the vector $w = (w_1, \dots, w_p)$ as coef_ and $w_0$ as intercept_.
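A minimal sketch of how coef_ and intercept_ relate to equation (1), assuming synthetic noise-free data generated from a known linear rule (the data and variable names are illustrative, not from the slides):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data from a known rule: y = 1 + 2*x1 + 3*x2 (no noise).
rng = np.random.RandomState(0)
X = rng.rand(50, 2)
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

model = LinearRegression().fit(X, y)

# Equation (1) written out by hand: w_0 + w_1 x_1 + ... + w_p x_p.
manual_pred = model.intercept_ + X @ model.coef_
```

Here model.coef_ should recover the weights (2, 3) and model.intercept_ the offset 1, and manual_pred should match model.predict(X) up to floating-point error.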
Generalized Linear Models

Ordinary Least Squares
Ridge Regression
Lasso
Elastic Net
Multi-task Lasso
Least Angle Regression
LARS Lasso
Orthogonal Matching Pursuit (OMP)
Bayesian Regression
Logistic regression
Stochastic Gradient Descent - SGD
Perceptron
Passive Aggressive Algorithms
Robustness regression: outliers and modeling errors
Polynomial regression: extending linear models with basis functions
Ordinary Least Squares

LinearRegression fits a linear model with coefficients $w = (w_1, \dots, w_p)$ to minimize the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation. Mathematically it solves a problem of the form:

$$\min_w \|Xw - y\|_2^2 \tag{2}$$

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn import datasets, linear_model

    diabetes = datasets.load_diabetes()

    # Assumed split (not shown on the slide): hold out the last 20 samples.
    diabetes_X_train = diabetes.data[:-20]
    diabetes_y_train = diabetes.target[:-20]

    # Fit ordinary least squares on the training portion.
    regr = linear_model.LinearRegression()
    regr.fit(diabetes_X_train, diabetes_y_train)
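To connect the least-squares objective to the fitted estimator, the sketch below solves the same minimization with NumPy's generic least-squares solver and compares the result with LinearRegression. Using the full diabetes dataset and an explicit column of ones for the intercept are choices made here for illustration.

```python
import numpy as np
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

regr = linear_model.LinearRegression().fit(X, y)

# Append a column of ones so the intercept w_0 is estimated with the weights.
A = np.hstack([np.ones((X.shape[0], 1)), X])

# Minimize ||A w - y||_2^2 directly.
w, *_ = np.linalg.lstsq(A, y, rcond=None)

# w[0] plays the role of intercept_, w[1:] the role of coef_.
intercept_direct, coef_direct = w[0], w[1:]
```

Both routes minimize the same residual sum of squares, so the two sets of parameters should agree up to numerical precision.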
[Figure: Ordinary Least Squares]
Lasso

The Lasso is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer non-zero coefficients, effectively reducing the number of variables upon which the given solution is dependent. For this reason, the Lasso and its variants are fundamental to the field of compressed sensing. Under certain conditions, it can recover the exact set of non-zero weights.

Mathematically, it consists of a linear model trained with an $\ell_1$ prior as regularizer. The objective function to minimize is:

$$\min_w \frac{1}{2 n_{\text{samples}}} \|Xw - y\|_2^2 + \alpha \|w\|_1 \tag{3}$$

The lasso estimate thus solves the minimization of the least-squares penalty with $\alpha \|w\|_1$ added, where $\alpha$ is a constant and $\|w\|_1$ is the $\ell_1$ norm of the parameter vector.
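Lasso

    from sklearn.linear_model import ElasticNet, Lasso

    # X_train, y_train and X_test are assumed to be defined elsewhere.
    alpha = 0.1

    # Lasso: pure l1 penalty.
    lasso = Lasso(alpha=alpha)
    y_pred_lasso = lasso.fit(X_train, y_train).predict(X_test)

    # Elastic Net: mix of l1 and l2 penalties, weighted by l1_ratio.
    enet = ElasticNet(alpha=alpha, l1_ratio=0.7)
    y_pred_enet = enet.fit(X_train, y_train).predict(X_test)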
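The sparsity that the Lasso produces can be checked on a small synthetic problem. This is a sketch under stated assumptions: the data, the choice of 5 relevant features out of 50, and the noise level are all made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic problem: 50 features, of which only the first 5 affect y.
rng = np.random.RandomState(42)
n_samples, n_features = 100, 50
X = rng.randn(n_samples, n_features)
true_coef = np.zeros(n_features)
true_coef[:5] = [3.0, -2.0, 1.5, 4.0, -3.5]
y = X @ true_coef + 0.1 * rng.randn(n_samples)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Count coefficients that are not exactly zero in each model.
n_nonzero_ols = int(np.sum(ols.coef_ != 0))
n_nonzero_lasso = int(np.sum(lasso.coef_ != 0))
```

Ordinary least squares typically assigns some weight to every feature, while the l1 penalty drives most of the irrelevant coefficients to exactly zero, so n_nonzero_lasso should be far smaller than n_nonzero_ols.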
[Figure: Lasso]
Support Vector Machines

Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection.

The advantages of support vector machines are:

Effective in high-dimensional spaces.
Still effective in cases where the number of dimensions is greater than the number of samples.
Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
Versatile: different kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
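A small sketch of two of the points above: passing a custom kernel as a Python callable, and inspecting the support vectors the fitted model keeps. The function name my_linear_kernel and the use of the iris data are assumptions for illustration.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import load_iris

def my_linear_kernel(A, B):
    """Custom kernel: Gram matrix of plain dot products, shape (len(A), len(B))."""
    return np.dot(A, B.T)

X, y = load_iris(return_X_y=True)

# A callable kernel and the equivalent built-in kernel.
custom = svm.SVC(kernel=my_linear_kernel).fit(X, y)
builtin = svm.SVC(kernel="linear").fit(X, y)

# Fraction of training points on which both models predict the same class.
agreement = np.mean(custom.predict(X) == builtin.predict(X))

# Only the support vectors are needed by the decision function.
n_support = builtin.support_vectors_.shape[0]
```

Since the callable computes the same dot products as the built-in linear kernel, the two models should agree on nearly all points, and n_support should be well below the number of training samples.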
Support Vector Machines

The disadvantages of support vector machines include:

If the number of features is much greater than the number of samples, the method is likely to give poor performance.
SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
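The probability point can be sketched as follows, assuming the iris data for illustration. By default SVC exposes only decision scores; probability estimates are requested with the probability option of SVC, which adds the extra cross-validated calibration cost at fit time. Whether that cost matters depends on the dataset size.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Default model: signed decision scores, no probabilities.
plain = svm.SVC(kernel="rbf", gamma="scale").fit(X, y)
scores = plain.decision_function(X[:3])

# Opt in to probability estimates (slower to fit).
proba_clf = svm.SVC(kernel="rbf", gamma="scale", probability=True,
                    random_state=0).fit(X, y)
proba = proba_clf.predict_proba(X[:3])
```

Each row of proba holds one estimated probability per class and sums to 1, whereas scores are uncalibrated margins.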
SVM

    import numpy as np
    from sklearn import svm

    # Decimal points restored to match the scikit-learn "SVM-Kernels" example.
    X = np.c_[(0.4, -0.7), (-1.5, -1), (-1.4, -0.9), (-1.3, -1.2),
              (-1.1, -0.2), (-1.2, -0.4), (-0.5, 1.2), (-1.5, 2.1),
              (1, 1), (1.3, 0.8), (1.2, 0.5), (0.2, -2),
              (0.5, -2.4), (0.2, -2.3), (0, -2.7), (1.3, 2.1)].T
    Y = [0] * 8 + [1] * 8

    kernel = "linear"  # the following slides also use "poly" and "rbf"
    clf = svm.SVC(kernel=kernel, gamma=2)
    clf.fit(X, Y)
[Figure: SVM kernel (linear)]
[Figure: SVM kernel (poly)]
[Figure: SVM kernel (rbf)]
Functions

load_training_data()
train_classifier(images, targets)
captcha_get()
captcha_binarize(image, threshold=100)
captcha_split(image)
captcha_get_classifier()
captcha_get_and_predict(classifier)
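The slides list only the function names, not their bodies. Purely as a hypothetical sketch of what the two preprocessing steps might look like, the code below assumes a grayscale image stored as a 2-D NumPy array, dark characters on a light background, and characters of equal width. The n_chars parameter is an assumption added here and is not part of the original signature.

```python
import numpy as np

def captcha_binarize(image, threshold=100):
    """Hypothetical: return a 0/1 array, 1 where a pixel is darker than threshold."""
    image = np.asarray(image)
    return (image < threshold).astype(np.uint8)

def captcha_split(image, n_chars=4):
    """Hypothetical: cut the image into n_chars vertical strips of (nearly) equal width."""
    return np.array_split(np.asarray(image), n_chars, axis=1)

# Tiny made-up "image": 2 rows, 4 columns of grayscale values.
demo = np.array([[255, 30, 200, 90],
                 [10, 250, 99, 100]])
binary = captcha_binarize(demo)
pieces = captcha_split(binary, n_chars=2)
```

Each strip could then be flattened and passed to a classifier such as the one returned by train_classifier, although how the original implementation does this is not shown.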
Tornado web framework
No GPU support

Will you add GPU support?

No, or at least not in the near future. The main reason is that GPU support will introduce many software dependencies and introduce platform-specific issues. scikit-learn is designed to be easy to install on a wide variety of platforms. Outside of neural networks, GPUs don't play a large role in machine learning today, and much larger gains in speed can often be achieved by a careful choice of algorithms. (scikit-learn FAQ)
References

https://en.wikipedia.org/wiki/Machine_learning
http://scikit-learn.org/stable/
http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Second Edition). Springer.
Thanks

Thanks for listening.