Machine Learning with MATLAB --classification

Christiana Dean
5 years ago

Machine Learning with MATLAB --classification Stanley Liang, PhD York University

Classification: the definition

In machine learning and statistics, classification is the problem of identifying to which of a set of categories (subpopulations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.

Steps for classification
1. Data preparation: preprocessing, creating training / test sets
2. Training
3. Cross validation
4. Model deployment
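The four steps above can be sketched end to end in MATLAB. This is only a minimal sketch under stated assumptions, not the course's own code: it assumes the Statistics and Machine Learning Toolbox and its bundled fisheriris sample data (the Iris set), and the choice of kNN, the 30% hold-out fraction, and all variable names are illustrative.

```matlab
% Step 1: data preparation (assumes the toolbox sample data fisheriris)
load fisheriris                  % meas: 150x4 numeric, species: 150x1 cell array of labels
rng(0);                          % fix the random seed so the split is repeatable
part = cvpartition(species,'HoldOut',0.3);   % hold out 30% of the rows for testing
isTrain = training(part);        % logical index of training rows
isTest  = test(part);            % logical index of test rows

% Step 2: training (kNN is chosen only for illustration)
mdl = fitcknn(meas(isTrain,:),species(isTrain),'NumNeighbors',5);

% Step 3: validation on the held-out rows and by 5-fold cross validation
testErr = loss(mdl,meas(isTest,:),species(isTest));
cvErr   = kfoldLoss(crossval(mdl,'KFold',5));

% Step 4: "deployment" here simply means predicting a new observation
newFlower = [5.9 3.0 5.1 1.8];   % hypothetical measurements
predictedSpecies = predict(mdl,newFlower);
```

Both testErr and cvErr are misclassification rates between 0 and 1; lower is better.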

Our data sets

Titanic disaster dataset
891 rows
Binary classification
Features / predictors:
  Class: cabin class
  Sex: gender of the passenger
  Age
  Fare
Label / response:
  Survived: 0 dead, 1 survived

Iris dataset
150 rows
Multiclass (3) classification
Features / predictors:
  Sepal Length
  Sepal Width
  Petal Length
  Petal Width
Label / response:
  Species (string)

Pima Indians Diabetes Data (NIDDK)
768 rows
Binary classification: diabetes or not
Features / predictors (8):
  preg: number of pregnancies
  plas: plasma glucose concentration
  pres: diastolic blood pressure (mm Hg)
  skin: triceps skinfold thickness (mm)
  test: 2-hour serum insulin (mu U/ml)
  mass: body mass index
  pedi: diabetes pedigree function (numeric)
  age
Label / response:
  1 diabetes, 0 no diabetes

Wholesale Customers
440 rows
Binary / multiclass (2 categorical)
Continuous variables (6): the monetary units (m.u.) spent on the products
  Fresh: fresh products
  Milk: dairy products
  Grocery: grocery products
  Frozen: frozen products
  Detergents_Paper: detergents and paper products
  Delicatessen: delicatessen products
Categorical variables (2):
  Channel: 1 Horeca, 2 Retail
  Region: 1 Lisbon, 2 Oporto, 3 Other

The workflow of classification

Optimizing a model
Because of prior knowledge you have about the data, or after looking at the classification results, you may want to customize the classifier. You can update and customize the model by setting different options in the fitting functions. Set an option by providing additional inputs for the option name and the option value.

    model = fitc*(tbl,'response','optionName',optionValue)

  'optionName': Name of the option, e.g., 'Cost'.
  optionValue: Value to assign to the specified option, e.g., [0 10; 2 0] to change the cost matrix.
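As a concrete illustration of passing an option name and value, the sketch below sets the 'Cost' option on a decision tree. It is an assumption-laden example rather than the course's own: it uses the toolbox sample data ionosphere (a binary problem with labels 'b' and 'g'), and fitctree merely stands in for any fitc* function.

```matlab
load ionosphere                  % X: 351x34 predictors, Y: 351x1 labels ('b' or 'g')
classNames = {'b','g'};          % fixes the row/column order of the cost matrix

% Default tree: every misclassification has cost 1
defaultTree = fitctree(X,Y,'ClassNames',classNames);

% Customized tree: Cost(i,j) is the cost of predicting class j when the true class is i
costMatrix = [0 10; 2 0];
costTree = fitctree(X,Y,'ClassNames',classNames,'Cost',costMatrix);

% Resubstitution confusion matrices (rows: true class, columns: predicted class)
confDefault = confusionmat(Y,resubPredict(defaultTree),'Order',classNames);
confCost    = confusionmat(Y,resubPredict(costTree),'Order',classNames);
```

Comparing confDefault with confCost shows how the heavier penalty on one kind of error shifts the predictions.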

k-Nearest Neighbor Overview
Function: fitcknn
Performance
  Fit Time: Fast
  Prediction Time: Fast; proportional to (data size)^2
  Memory Overhead: Small
Common Properties
  'NumNeighbors': Number of neighbors used for classification.
  'Distance': Metric used for calculating distances between neighbors.
  'DistanceWeight': Weighting given to different neighbors.
Special Notes
  To normalize the data, use the 'Standardize' option.
  The cosine distance metric works well for wide data (more predictors than observations) and data with many predictors.

Decision Trees
Function: fitctree
Performance
  Fit Time: proportional to the size of the data
  Prediction Time: Fast
  Memory Overhead: Small
Common Properties
  'SplitCriterion': Formula used to determine optimal splits at each level.
  'MinLeafSize': Minimum number of observations in each leaf node.
  'MaxNumSplits': Maximum number of splits allowed in the decision tree.
Special Notes
  Trees are a good choice when there is a significant amount of missing data.
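The properties listed above for fitcknn and fitctree can be set as name-value pairs. The following is a minimal sketch, assuming the toolbox sample data fisheriris; the particular option values are arbitrary choices for illustration, not recommendations from the slides.

```matlab
load fisheriris                  % meas: 150x4 predictors, species: class labels

% kNN: 5 neighbors, Euclidean distance, closer neighbors weighted more,
% predictors standardized before distances are computed
knnMdl = fitcknn(meas,species, ...
    'NumNeighbors',5, ...
    'Distance','euclidean', ...
    'DistanceWeight','inverse', ...
    'Standardize',true);

% Decision tree: Gini's diversity index as split criterion, at least
% 5 observations per leaf, and at most 10 splits in the whole tree
treeMdl = fitctree(meas,species, ...
    'SplitCriterion','gdi', ...
    'MinLeafSize',5, ...
    'MaxNumSplits',10);

% Resubstitution (training) error of each model
knnErr  = resubLoss(knnMdl);
treeErr = resubLoss(treeMdl);
```

Resubstitution error is optimistic because it is measured on the training data; it is used here only to show that both models were fitted.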

k-NN and decision trees do not make any assumptions about the distribution of the underlying data. If we assume that the data comes from a certain underlying distribution, we can treat the data as a statistical sample. This can reduce the influence of outliers on our model.

A naive Bayes classifier assumes the independence of the predictors within each class. This classifier is a good choice for relatively simple problems.

Naive Bayes
Function: fitcnb
Performance
  Fit Time: Normal dist.: Fast; Kernel dist.: Slow
  Prediction Time: Normal dist.: Fast; Kernel dist.: Slow
  Memory Overhead: Normal dist.: Small; Kernel dist.: Moderate to large
Common Properties
  'Distribution': Distribution used to calculate probabilities (documented in full as 'DistributionNames').
  'Width': Width of the smoothing window (when 'Distribution' is set to 'kernel').
  'Kernel': Type of kernel to use (when 'Distribution' is set to 'kernel').
Special Notes
  Naive Bayes is a good choice when there is a significant amount of missing data.

Discriminant Analysis
Similar to naive Bayes, discriminant analysis works by assuming that the observations in each prediction class can be modeled with a normal probability distribution. Unlike naive Bayes, it does not assume that the predictors are independent: a multivariate normal distribution is fitted to each class.
Performance
  Fit Time: Fast; proportional to the size of the data
  Prediction Time: Fast; proportional to the size of the data
  Memory Overhead: Linear DA: Small; Quadratic DA: Moderate to large; proportional to the number of predictors
Common Properties
  'DiscrimType': Type of boundary used.
  'Delta': Coefficient threshold for including predictors in a linear boundary (default 0).
  'Gamma': Regularization to use when estimating the covariance matrix for linear DA.
Special Notes
  Linear discriminant analysis works well for wide data (more predictors than observations).

Linear Discriminant Analysis
By default, the classifier assumes that the covariance of each response class is the same. This results in linear boundaries between classes.

    daModel = fitcdiscr(dataTrain,'response');

Quadratic Discriminant Analysis
Dropping the equal-covariance assumption produces a quadratic boundary between classes.

    daModel = fitcdiscr(dataTrain,'response','DiscrimType','quadratic');
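To make the naive Bayes and discriminant analysis options above concrete, here is a small sketch that fits all four variants on the same data. It assumes the toolbox sample data fisheriris, and the kernel width of 0.5 is an arbitrary illustrative value. Note that the code uses the full documented option name 'DistributionNames'.

```matlab
load fisheriris                  % meas: 150x4 predictors, species: class labels

% Naive Bayes with the default normal distribution for each predictor
nbNormal = fitcnb(meas,species);

% Naive Bayes with kernel density estimates (slower, more flexible)
nbKernel = fitcnb(meas,species, ...
    'DistributionNames','kernel', ...
    'Kernel','normal', ...
    'Width',0.5);                % illustrative smoothing-window width

% Linear DA (shared covariance) and quadratic DA (one covariance per class)
ldaMdl = fitcdiscr(meas,species);
qdaMdl = fitcdiscr(meas,species,'DiscrimType','quadratic');

% Resubstitution error of each model, in the order fitted above
errs = [resubLoss(nbNormal) resubLoss(nbKernel) ...
        resubLoss(ldaMdl)   resubLoss(qdaMdl)];
```

The vector errs gives a quick side-by-side view; a fair comparison would use held-out data or cross validation rather than training error.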

Support Vector Machines
An SVM calculates the closest boundary that can correctly separate different groups of data.
  Fit Time: Fast; proportional to the square of the size of the data
  Prediction Time: Very fast; proportional to the square of the size of the data
  Memory Overhead: Moderate

  'KernelFunction': Variable transformation to apply.
  'KernelScale': Scaling applied before the kernel transformation.
  'BoxConstraint': Regularization parameter controlling the misclassification penalty.

  SVMs use a distance-based algorithm. If the data is not normalized, use the 'Standardize' option.
  Linear SVMs work well for wide data (more predictors than observations).
  Gaussian SVMs often work better on tall data (more observations than predictors).

Multiclass Support Vector Machines
The underlying calculations for classification with support vector machines are binary by nature. You can perform multiclass SVM classification by creating an error-correcting output codes (ECOC) classifier.
1. Create a template for a binary classifier.
2. Create the multiclass SVM classifier with the function fitcecoc.

Cross Validation
To compare model performance, we can calculate the loss for each method and pick the method with the minimum loss. The loss is calculated on a specific set of test data. It is possible that a learning algorithm performs well on that particular test data but does not generalize well to other data.
The general idea of cross validation is to repeat this process: create different training and test sets, fit the model to each training set, and calculate the loss using the corresponding test set.
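The two multiclass SVM steps, followed by a loss comparison on a single test set, might look like the sketch below. It assumes the toolbox sample data fisheriris; the Gaussian kernel, the kNN competitor, and the 30% hold-out fraction are illustrative choices, not taken from the slides.

```matlab
load fisheriris                  % 3 classes, so a single binary SVM is not enough
rng(1);                          % repeatable split
part   = cvpartition(species,'HoldOut',0.3);
Xtrain = meas(training(part),:);  ytrain = species(training(part));
Xtest  = meas(test(part),:);      ytest  = species(test(part));

% First, create a template for a binary SVM learner
svmTemplate = templateSVM('KernelFunction','gaussian','Standardize',true);

% Second, create the multiclass classifier from binary SVMs (ECOC)
ecocMdl = fitcecoc(Xtrain,ytrain,'Learners',svmTemplate);

% A second method to compare against
knnMdl = fitcknn(Xtrain,ytrain,'NumNeighbors',5,'Standardize',true);

% Loss of each method on the same held-out test data
svmLoss = loss(ecocMdl,Xtest,ytest);
knnLoss = loss(knnMdl,Xtest,ytest);
```

With the default one-versus-one coding, three classes yield three binary learners in ecocMdl.BinaryLearners. Because svmLoss and knnLoss depend on this one particular split, the ranking can change with a different seed, which is the motivation for cross validation.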

Keyword-value pairs for cross validation

    mdl = fitcknn(data,'responseVarName','optionName','optionValue')

  'CrossVal' : 'on'                  10-fold cross validation
  'Holdout'  : scalar from 0 to 1    Holdout with the given fraction reserved for validation
  'KFold'    : k (scalar)            k-fold cross validation
  'Leaveout' : 'on'                  Leave-one-out cross validation

If you already have a partition created using the cvpartition function, you can also provide it to the fitting function.

    >> part = cvpartition(y,'KFold',k);
    >> mdl = fitcknn(data,'responseVarName','CVPartition',part);

To evaluate a cross-validated model, use the kfoldLoss function to compute the loss.

    >> kfoldLoss(mdl)

Strategies to reduce predictors

High-dimensional data
Machine learning problems often involve high-dimensional data with hundreds or thousands of predictors, e.g., facial recognition or weather prediction. Learning algorithms are often computation intensive, and reducing the number of predictors can have significant benefits in calculation time and memory consumption. Reducing the number of predictors also results in simpler models that can be generalized and are easier to interpret.

Two common ways:
  Feature transformation: transform the coordinate space of the observed variables.
  Feature selection: choose a subset of the observed variables.
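The cross-validation options described above can be tried on real data as follows. This is a minimal sketch assuming the toolbox sample data fisheriris, with k = 5 chosen arbitrarily; it passes a matrix and a label vector rather than a table and response name.

```matlab
load fisheriris                  % meas: 150x4 predictors, species: class labels
rng(2);                          % repeatable fold assignment
k = 5;

% Explicit partition, reusable across several models for a fair comparison
part  = cvpartition(species,'KFold',k);
cvMdl = fitcknn(meas,species,'CVPartition',part);

% Average misclassification rate over the k folds
cvErr = kfoldLoss(cvMdl);

% Loss of each fold separately, to see how much it varies
foldErr = kfoldLoss(cvMdl,'Mode','individual');

% Shortcut: let the fitting function create its own k-fold partition
cvMdl2 = fitcknn(meas,species,'KFold',k);
```

Passing the same cvpartition object to different fitting functions ensures that every model is evaluated on identical folds.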

Feature Transformation
Principal Component Analysis (PCA) transforms an n-dimensional feature space into a new n-dimensional space of orthogonal components. The components are ordered by the variation they explain in the data. PCA can therefore be used for dimensionality reduction by discarding the components beyond a chosen threshold of explained variance. For example, an input X may have 11 columns while its first 9 principal components explain more than 95% of the variance.

Feature Selection
The data often contains predictors that have no relationship with the response. These predictors should not be included in a model. For example, the patient ID in the heart health data has no relationship with the risk of heart disease.
For a decision tree model, the predictorImportance method can be used to identify the predictor variables that are important for creating an accurate model.
Sequential feature selection incrementally adds predictors to the model as long as the prediction error keeps decreasing.
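Both approaches can be sketched in a few lines. The example below assumes the toolbox sample data fisheriris (4 predictors, not the 11-column data mentioned above), and the 95% threshold is simply the figure used in the text.

```matlab
load fisheriris                  % meas: 150x4 predictors, species: class labels

% Feature transformation: PCA, keeping enough components for 95% of the variance
[~,score,~,~,explained] = pca(meas);          % explained: percent variance per component
numComponents = find(cumsum(explained) >= 95,1);
reducedData   = score(:,1:numComponents);     % observations in the reduced space

% Feature selection: rank the original predictors by their importance in a tree
treeMdl    = fitctree(meas,species);
importance = predictorImportance(treeMdl);    % one value per predictor
[~,ranking] = sort(importance,'descend');     % predictor indices, most important first
```

The reduced data can be passed to any fitting function in place of meas, while ranking suggests which original columns might be dropped.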

Ensemble Learning
Classification trees are considered weak learners, meaning that they are highly sensitive to the data used to train them. Thus, two slightly different sets of training data can produce two completely different trees and, consequently, different predictions.
However, this weakness can be harnessed as a strength by creating several trees (or, following the analogous naming, a forest). New observations can then be applied to all the trees and the resulting predictions can be compared.
To improve the classifier, we can use ensemble learning methods.
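A forest of trees can be compared with a single tree as in the sketch below. The slides do not name a function, so the use of fitcensemble with bagging (available in R2016b and later) is an assumption, as are the fisheriris sample data and the choice of 50 trees.

```matlab
load fisheriris                  % meas: 150x4 predictors, species: class labels
rng(3);                          % repeatable resampling and fold assignment

% A single classification tree
singleTree = fitctree(meas,species);

% A bagged ensemble: 50 trees, each trained on a bootstrap sample of the data
forest = fitcensemble(meas,species,'Method','Bag','NumLearningCycles',50);

% 5-fold cross-validated error of each model
treeErr   = kfoldLoss(crossval(singleTree,'KFold',5));
forestErr = kfoldLoss(crossval(forest,'KFold',5));

% A new observation is passed to all trees and their votes are combined
label = predict(forest,[5.9 3.0 5.1 1.8]);    % hypothetical measurements
```

On a small, easy data set such as this one the two errors may be close; the benefit of an ensemble tends to show on noisier problems.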

Applying Supervised Learning

When to Consider Supervised Learning
A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains