MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

MIT 801 [Presented by Anna Bosman] 16 February 2018

Machine Learning
- What is machine learning? Artificial intelligence? Yes, as we know it.
- What is intelligence? The ability to acquire and apply knowledge and skills.
- What is learning? The acquisition of knowledge or skills through study, experience, or being taught.

Learning Analyse data, acquire insight not visible through the data alone, apply the knowledge

Machine Learning
- Ability to learn without being explicitly programmed
- Algorithmic approach to learning
- Requires a training data set
- A largely automatic process
- Algorithms usually have parameters
- Parameters usually require optimisation

How do machines learn?

Types of Machine Learning
- Supervised Learning
  - Data is labelled, i.e. outcomes are known
  - Associate e-mails with labels: spam or not spam
  - Associate tumors with labels: malignant or benign
  - Learn to predict the output given the input
- Unsupervised Learning
  - Data is unlabelled
  - Find structure in the data
  - Identify clusters
  - Study what your customers buy, guess what to advertise based on the interests of their neighbours

Types of Machine Learning Supervised versus Unsupervised

Types of Machine Learning
- Reinforcement Learning
  - Algorithm performs actions without knowing which actions are good
  - Good actions are rewarded
  - Learn the actions that maximise the overall reward
  - Example: learn which moves make you win at chess

Supervised Learning
- There are two subcategories of supervised learning: classification and regression
- Classification learns a model to differentiate between multiple classes
- Regression learns a model to predict the real-valued output given a set of inputs

Supervised Learning: Classification Example
Training set for three classes: chair, table, bed

h    w    chair  table  bed
0.5  0.3  1      0      0
0.4  0.4  1      0      0
0.9  1.2  0      1      0
0.8  1.5  0      1      0
0.4  2.0  0      0      1
0.6  1.9  0      0      1

How do you determine the class of the object (chair, table, bed) given the dimensions (h, w)?

Supervised Learning: Regression Example
Training set with two input variables: m², bedrooms
One output variable: house price

m²   bedrooms  price
95   1         950,000
100  2         1,000,000
50   1         860,000
145  3         1,200,000
210  4         2,300,000

How do you predict the price of the house, given its m² and the number of bedrooms?

Supervised Learning
- How does the model learn?
  - The model will attempt to predict the outcome for the training inputs
  - Compare produced output to desired output
  - Update the model to reduce the error
- What exactly are we learning? y = f(x)
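The predict, compare, update loop above can be sketched with a straight-line model. This is only an illustration: the slide does not say how the update is made, so gradient descent on the mean squared error is assumed here, and `fit_line`, the learning rate, and the toy data are all hypothetical.

```python
def fit_line(xs, ts, lr=0.05, epochs=2000):
    """Fit y = w * x + b by repeating: predict, compare, update."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        ys = [w * x + b for x in xs]                # predict
        errors = [y - t for y, t in zip(ys, ts)]    # compare with targets
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w                            # update to reduce the error
        b -= lr * grad_b
    return w, b

# Hypothetical training data generated from t = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ts = [1.0, 3.0, 5.0, 7.0]
w, b = fit_line(xs, ts)
```

Here the learned `w` and `b` play the role of the approximation to the unknown f(x).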

Supervised Learning
- What does the model learn?
- We do not know the real y = f(x); we can only approximate it: $y = \hat{f}(x)$
- Thus, the model will only be as good as the training data
- Data has to be representative!

How good is my model? Underfitting, Overfitting
- Underfitting: $\hat{f}(x)$ is too simplistic compared to the real $f(x)$
- Overfitting: $\hat{f}(x)$ is too complex compared to $f(x)$, and fits irrelevant data such as noise
- We want to arrive at a model between these two extremes

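How good is my model? Regression
- If you are predicting a real value: calculate the distance between the predicted value $y_i$ and the target value $t_i$
- Mean Absolute Error: $\mathrm{MAE} = \frac{\sum_{i=1}^{N} |y_i - t_i|}{N}$
- Mean Squared Error: $\mathrm{MSE} = \frac{\sum_{i=1}^{N} (y_i - t_i)^2}{N}$
- Root Mean Squared Error: $\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - t_i)^2}{N}}$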
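As a minimal sketch, the three error measures can be computed directly from lists of predictions and targets. The function names and the sample values are illustrative assumptions, not part of the lecture.

```python
import math

def mae(y, t):
    """Mean absolute error between predictions y and targets t."""
    return sum(abs(yi - ti) for yi, ti in zip(y, t)) / len(y)

def mse(y, t):
    """Mean squared error."""
    return sum((yi - ti) ** 2 for yi, ti in zip(y, t)) / len(y)

def rmse(y, t):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(y, t))

predicted = [2.0, 4.0, 7.0]   # hypothetical model outputs
target = [1.0, 4.0, 5.0]      # hypothetical desired outputs
```

Because the errors are squared before averaging, MSE and RMSE penalise the single large miss (7 versus 5) more heavily than MAE does.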

How good is my model? Classification
- We are predicting categorical (discrete) values: how many did we get right?
- True positive: say "cat" when there is a cat
- False positive: say "cat" when there is no cat
- True negative: don't say "cat" when there is no cat
- False negative: don't say "cat" when there is a cat
- Accuracy: all correct predictions / all predictions
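The four outcome counts and the accuracy can be sketched as follows. The helper names and the example labels are hypothetical; "cat" is treated as the positive class, as on the slide.

```python
def confusion_counts(predicted, actual, positive="cat"):
    """Return (tp, fp, tn, fn) for a single positive class."""
    tp = fp = tn = fn = 0
    for p, a in zip(predicted, actual):
        if p == positive and a == positive:
            tp += 1          # said cat, there was a cat
        elif p == positive:
            fp += 1          # said cat, there was no cat
        elif a == positive:
            fn += 1          # did not say cat, there was a cat
        else:
            tn += 1          # did not say cat, there was no cat
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    """All correct predictions divided by all predictions."""
    return (tp + tn) / (tp + fp + tn + fn)

actual = ["cat", "cat", "dog", "dog", "cat", "dog"]      # hypothetical labels
predicted = ["cat", "dog", "dog", "cat", "cat", "dog"]   # hypothetical outputs
counts = confusion_counts(predicted, actual)
```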

How good is my model?
- Calculate the model's goodness value over the training set
- Classification: accuracy of 100%. Does it mean that the model is perfect?
- NO: training error/accuracy does not tell us how the model will perform on unseen examples
- Performance on unseen examples: generalisation performance
- How do you measure it? Reserve a subset of data for testing and do not show it to the model during training

How good is my model? Early Stopping
- Prevent overfitting by monitoring training and testing accuracy
- Stop the training when the test set accuracy starts to go down

How good is my model? K-Fold Cross-Validation
- Labelled data is often limited
- Can we estimate the generalisation error on the entire data set?
- Perform cross-validation:
  - Divide the data set into K equal parts
  - Train K times, each time holding out a different part for testing and training on the remaining K - 1 parts
  - Average the error
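The K-fold procedure can be sketched in plain Python. `k_fold_indices` and `cross_validate` are hypothetical helper names, and `train_and_score` stands for whatever caller-supplied routine trains a model on one part and returns its error on the other. The sketch assumes the data has already been shuffled.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of k folds over n examples."""
    # Fold sizes differ by at most one when n is not divisible by k.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def cross_validate(data, k, train_and_score):
    """Average the score returned by train_and_score(train, test) over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(data), k):
        train = [data[i] for i in train_idx]
        test = [data[i] for i in test_idx]
        scores.append(train_and_score(train, test))
    return sum(scores) / k

folds = list(k_fold_indices(10, 3))
```

Every example is used for testing exactly once and for training K - 1 times.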

Nearest Neighbour Classification
- For every unknown pattern x, find the closest known pattern y
- Class of x is likely to be the same as class of y, because y is x's nearest neighbour
- The data set is the model
- No explicit learning process

k-nn: K Nearest Neighbours
- Asking a single neighbour can be dangerous
- Find k neighbours instead, choose the majority class
- Question: how many neighbours is enough? Answer: determine empirically
- 2-class problem: odd k prevents ties
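A minimal k-nn classifier can be sketched on the chair/table/bed training set from the classification example. Euclidean distance is assumed, `knn_classify` is a hypothetical name, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

# (h, w) dimensions and class labels from the classification example
training = [
    ((0.5, 0.3), "chair"), ((0.4, 0.4), "chair"),
    ((0.9, 1.2), "table"), ((0.8, 1.5), "table"),
    ((0.4, 2.0), "bed"), ((0.6, 1.9), "bed"),
]

def knn_classify(x, data, k=3):
    """Return the majority class among the k training patterns closest to x."""
    neighbours = sorted(data, key=lambda item: math.dist(x, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    # On a tie, Counter keeps first-seen order, so the class of the
    # closer neighbour wins.
    return votes.most_common(1)[0][0]

label = knn_classify((0.85, 1.3), training, k=3)
```

There is no training step: the stored data set is the model, and all the work happens at query time.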

k-nn Effect of K on the boundary Larger k leads to smoother boundaries:

k-nn: Distance Metrics
- Given x and y, how do we determine the distance between them?
- Euclidean: $d = \sqrt{\sum_i (x_i - y_i)^2}$
- Manhattan (city block): $d = \sum_i |x_i - y_i|$
- Minkowski: $d = \left(\sum_i |x_i - y_i|^p\right)^{1/p}$ (for $p \to \infty$, $d = \max_i |x_i - y_i|$)
- Binary: Hamming distance (how many bits need to be flipped)
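These metrics can be sketched in a few lines. Euclidean and Manhattan distance are written as the p = 2 and p = 1 cases of Minkowski distance; the function names are illustrative, and `chebyshev` is the usual name for the limiting case of p going to infinity.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between equal-length sequences."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def euclidean(x, y):
    return minkowski(x, y, 2)

def manhattan(x, y):
    return minkowski(x, y, 1)

def chebyshev(x, y):
    """Limit of the Minkowski distance as p grows: the largest difference."""
    return max(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y))
```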

k-nn: Data Normalisation
- Given x = (1, 0.002, 1800) and y = (2, 0.015, 1500), which one of the three components will contribute to the distance the most?
- Normalise the input variables to even out their contribution:
  - Min-max scaling: $x'_{ij} = \frac{x_{ij} - \min(x_i)}{\max(x_i) - \min(x_i)}$
  - Z-score (standardization): $x'_{ij} = \frac{x_{ij} - \mu(x_i)}{\sigma(x_i)}$
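Both scalings act on one input variable (one column) at a time, as sketched below. The first two rows are the x and y from the slide; the third row is a hypothetical extra pattern so that the columns have more than two values. The sketch assumes no column is constant, otherwise the denominators would be zero, and it uses the population standard deviation.

```python
import statistics

def min_max_scale(column):
    """Rescale a column linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def z_score(column):
    """Shift a column to zero mean and scale it to unit standard deviation."""
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)
    return [(v - mu) / sigma for v in column]

rows = [(1, 0.002, 1800), (2, 0.015, 1500), (3, 0.010, 2100)]
columns = list(zip(*rows))                                   # one tuple per variable
scaled_rows = list(zip(*(min_max_scale(c) for c in columns)))
```

After scaling, every component lies in [0, 1], so the third variable no longer dominates the distance simply because its raw values are in the thousands.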

k-nn: Things to consider
- How democratic should the k-vote be? How do you handle ties?
  - Weighted k-nn: contribution of each neighbour is inversely proportional to its distance from x
  - Ties: closer neighbours have a stronger vote
- Can you use k-nn for regression?
  - Yes: find the k nearest neighbours of x, output the average of their outputs as the approximation of f(x)
- Is the entire data set necessary?
  - Remove borderline cases
  - Remove noise and outliers
  - Remove redundant examples
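A sketch of k-nn regression with an optional distance-weighted average follows. The inverse-distance weight 1 / (d + eps) is one common choice and is an assumption here, as are the function name, the `eps` guard against division by zero, and the toy data.

```python
import math

def knn_regress(x, data, k=3, weighted=True, eps=1e-9):
    """Predict a real value for x from its k nearest neighbours."""
    neighbours = sorted(data, key=lambda item: math.dist(x, item[0]))[:k]
    if not weighted:
        # Plain k-nn regression: every neighbour has an equal say.
        return sum(t for _, t in neighbours) / len(neighbours)
    # Weighted k-nn: closer neighbours get a stronger vote.
    weights = [1.0 / (math.dist(x, p) + eps) for p, _ in neighbours]
    return sum(w * t for w, (_, t) in zip(weights, neighbours)) / sum(weights)

# Hypothetical one-dimensional training data: (input, target)
data = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]
```

With `weighted=False` the function returns the plain average described on the slide.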

k-nn: The good and the bad
- k-nn is great because...
  - It is intuitive
  - Only one parameter to tune: k
  - There is no training phase ("lazy" classification/regression)
  - Easily expandable by adding more labelled data
- k-nn is not perfect because...
  - It is slow and expensive: O(nm) (store data in an efficient data structure)
  - It does not derive a model of the data: lack of insight
  - Distances between patterns can lose their meaning in high dimensions

What makes a good model? Diagnosing disease
- There are all kinds of parameters one can measure in a human being
- Does a doctor send you for every kind of medical test to diagnose a minor cold? No, that would be wasteful
- A series of tests is performed in order, narrowing down the search space with every step
- How do we model something like this?

Decision Trees: Rule-Based Classification
Decision tree: a flowchart-like structure in which each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label.

Decision Trees
- To classify a pattern, start at the root
- Every node asks a question
- Every possible answer is associated with a branch
- Leaf nodes represent class labels

Decision Trees
- What's great about decision trees?
  - Intuitive
  - Interpretable as rules
- Given a labelled data set, how would you construct a decision tree out of it?
- We need a machine learning algorithm to automate the process

Decision Tree Learning
Build the tree recursively:
1. Pick an attribute to divide the given data
2. Divide the data into subsets on the basis of this attribute
3. For every subset created above, repeat (1) and (2) until you arrive at leaf nodes
- Leaf nodes are also referred to as pure nodes, i.e. nodes that represent examples of one class only
- How do we choose which attribute to split the data on first?

ID3: Iterative Dichotomiser 3
- Invented in 1986 by J.R. Quinlan
- Based on information entropy
- Main idea: split on the attribute which maximises information gain
- Entropy: a measure of chaos, disorder
  $E(S) = -\sum_{i=1}^{N} p_i \log_N p_i$
  S - set, N - number of classes, $p_i$ - probability of class i
- Only one class in a set: E(S) = 0
- Two classes, each class is 1/2 of the set: E(S) = 1
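Entropy can be computed from class counts as sketched below. The logarithm base is left as a parameter: the default of 2 is an assumption that matches the two-class case, where N = 2, and passing `base=N` gives the normalised form for N classes. The function name is illustrative.

```python
import math
from collections import Counter

def entropy(labels, base=2):
    """Entropy of a list of class labels; 0 for a pure set."""
    n = len(labels)
    total = 0.0
    for count in Counter(labels).values():
        p = count / n                    # proportion of this class in the set
        total -= p * math.log(p, base)
    return total

pure = ["yes", "yes", "yes", "yes"]
mixed = ["yes", "yes", "no", "no"]
```

Changing the base only rescales the entropy by a constant factor, so it does not change which attribute gives the largest information gain.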

ID3: Iterative Dichotomiser 3
- Main idea: split on the attribute which maximises information gain
- Information gain: difference between the entropy of the original set and the weighted sum of entropies of the resulting sets
  $IG(S) = E(S) - \sum_{j} p_j E(S_j)$
  where $p_j$ is the proportion of data patterns in subset j, and $S_j$ are the subsets resulting from the split
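Information gain for a candidate split can be sketched as follows. A split is represented simply as a list of label subsets; base-2 entropy, the function names, and the example labels are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Base-2 entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Entropy of the parent minus the weighted entropy of its subsets."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted

parent = ["yes"] * 3 + ["no"] * 3
perfect_split = [["yes"] * 3, ["no"] * 3]                     # every subset is pure
useless_split = [["yes", "no"], ["yes", "no"], ["yes", "no"]] # no subset is purer
```

A split that separates the classes completely recovers all of the parent's entropy, while a split that leaves every subset as mixed as the parent gains nothing.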

Information Gain Splitting Rule Clearly, outlook attribute offers the most information gain

Information Gain Splitting Rule Now the same algorithm can be re-applied to every non-leaf sub-branch.
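The recursive re-application of the splitting rule can be sketched as a small ID3-style builder. This is a simplified illustration rather than Quinlan's full algorithm: attributes are categorical, rows are dictionaries, base-2 entropy is used, and the six-row weather data set is hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def split(rows, attr):
    """Group rows by their value of attr."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row)
    return groups

def info_gain(rows, attr, target):
    labels = [r[target] for r in rows]
    remainder = sum(len(g) / len(rows) * entropy([r[target] for r in g])
                    for g in split(rows, attr).values())
    return entropy(labels) - remainder

def build_tree(rows, attrs, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    rest = [a for a in attrs if a != best]
    return {best: {value: build_tree(group, rest, target)
                   for value, group in split(rows, best).items()}}

def classify(tree, row):
    """Follow branches from the root until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][row[attr]]   # raises KeyError for unseen values
    return tree

rows = [  # hypothetical data
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
tree = build_tree(rows, ["outlook", "windy"], "play")
```

On this data the root splits on `outlook`; only the `rainy` branch is impure, so it is split again on `windy`.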

Gain Ratio: C4.5
- Problem: what if the names of golf players were added to the data set, each entry having a unique name?
  - Not a great attribute to base decisions on!
  - But if we split based on names, every branch will become a leaf: each player has a definite yes/no outcome
- Information gain is misleading: it is biased towards attributes with many possible values
- Remedies? Take the number of splits into account!
- C4.5 improves on ID3 by using Gain Ratio
  $GR(S) = \frac{IG(S)}{SI}$, $SI = -\sum_{j=1}^{k} p_j \log_2 p_j$
  k is the number of splits, $p_j$ is the proportion of patterns in split j, SI estimates the entropy of the split (split info)
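The effect of the split info penalty can be sketched on a tiny hypothetical example: four patterns split either by a unique name (four one-element subsets) or by a useful binary attribute, here called `windy` purely for illustration. Returning 0 when the split info is 0 is an assumption made to avoid dividing by zero.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, subsets):
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

def split_info(parent, subsets):
    """Entropy of the split itself: grows with the number of subsets."""
    n = len(parent)
    return -sum(len(s) / n * math.log2(len(s) / n) for s in subsets if s)

def gain_ratio(parent, subsets):
    si = split_info(parent, subsets)
    return information_gain(parent, subsets) / si if si > 0 else 0.0

labels = ["yes", "yes", "no", "no"]
by_name = [["yes"], ["yes"], ["no"], ["no"]]    # one subset per unique name
by_windy = [["yes", "yes"], ["no", "no"]]       # a genuinely useful attribute
```

Both splits have the same information gain of 1, but the name split has a split info of 2 and therefore half the gain ratio of the binary split.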

Gain Ratio C4.5

Gini Gain: CART (Classification and Regression Trees)
- A simpler alternative to entropy: Gini impurity
  $Gini(S) = 1 - \sum_{i=1}^{N} p_i^2$
  N - number of classes, $p_i$ - proportion of class i
- Smallest when all patterns belong to one class
- Largest when classes are equally split
- Gini gain: $GG(S) = Gini(S) - \sum_{j} p_j \, Gini(S_j)$
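Gini impurity and Gini gain can be sketched in the same style as entropy and information gain. The function names and example labels are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(parent, subsets):
    """Impurity of the parent minus the weighted impurity of its subsets."""
    n = len(parent)
    return gini(parent) - sum(len(s) / n * gini(s) for s in subsets)

pure = ["a", "a", "a", "a"]
mixed = ["a", "a", "b", "b"]
```

No logarithm is needed, which is what makes this a simpler alternative to entropy.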

Binary Splits One way to solve the problem of splitting over too many attributes is to force a two-way split:

Numeric Attributes: Binning
- What if one of the attributes is continuous? E.g., age, income...
- Solution: discretize the attribute using binning
- Boundaries between the bins are the potential split points

Numeric Attributes: Binning
- Consider the information gain/purity/goodness factor of each split
- Choose the best one
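Choosing a split point for a numeric attribute can be sketched as below. The slide does not say how the bins are formed, so this sketch assumes the candidate boundaries are the midpoints between adjacent distinct values and scores each one by information gain. The `ages` and `buys` data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) of the best two-way split, or None."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    # Candidate split points: midpoints between neighbouring distinct values.
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    base = entropy(labels)
    best = None
    for t in candidates:
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - remainder
        if best is None or gain > best[1]:
            best = (t, gain)
    return best

ages = [22, 25, 30, 41, 47, 52]
buys = ["no", "no", "no", "yes", "yes", "yes"]
threshold, gain = best_threshold(ages, buys)
```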

Regression Trees
- What if not only the inputs, but the outputs are real numbers, too?
- Decision trees that output continuous values are called regression trees
- Instead of minimizing impurity, minimize data variance after the split:
  $Var(S) = \frac{1}{|S|^2} \sum_{i \in S} \sum_{j \in S} \frac{1}{2} (y_i - y_j)^2$
  where $y_i$ is the target output
- Maximise the variance reduction for node N:
  $I_V(N) = Var(S) - (Var(S_t) + Var(S_f))$
  where S is the set before the split, $S_t$ is the subset after the split with test outcome true, and $S_f$ is the subset after the split with test outcome false
- When more than one data point belongs to a leaf, the estimate is the average y per leaf
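The variance criterion can be sketched directly from the pairwise formula. The function names and the target values are hypothetical, and the subset variances are left unweighted, as in the slide's expression for $I_V(N)$.

```python
def pairwise_variance(ys):
    """Var(S) as the average of half the squared pairwise differences."""
    n = len(ys)
    return sum(0.5 * (a - b) ** 2 for a in ys for b in ys) / n ** 2

def variance_reduction(parent, true_subset, false_subset):
    """Variance before the split minus the variances of the two subsets."""
    return pairwise_variance(parent) - (
        pairwise_variance(true_subset) + pairwise_variance(false_subset))

def leaf_estimate(ys):
    """A leaf predicts the average target of the data points it holds."""
    return sum(ys) / len(ys)

targets = [10.0, 12.0, 30.0, 32.0]          # hypothetical outputs at a node
low, high = [10.0, 12.0], [30.0, 32.0]      # subsets after a candidate split
reduction = variance_reduction(targets, low, high)
```

The pairwise form gives the same value as the usual population variance around the mean, so a split that groups similar targets together leaves little variance in each subset.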

When do you stop splitting?
- Split till all leaf nodes are pure?
  - Not feasible in complex real-life data sets
  - Noisy/imperfect data may lead to a tree that generalises poorly
- Stop splitting when:
  - All leaf nodes are pure, or
  - Maximum tree depth has been reached, or
  - Improvement in training error comes with an increase in generalisation error, or
  - Improvement in purity resulting from a split is less than a preset threshold
- Problem with early stopping: how early is too early?

Pruning
- The opposite of splitting
- Grow the tree to its full size on the training set
- Starting at the bottom of the tree, remove leaves one by one, each time checking the error on the generalisation set
- If the generalisation error does not increase, keep pruning!
- Example: prune A5

The End
- Questions?
- Assignment 1 will be published on http://cs.up.ac.za/courses/mit801 on Monday
- Expect to apply the techniques discussed today on a real data set
- Next lecture: Random Forests, Neural Networks, SVM, Unsupervised Learning
