MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

Machine Learning What is machine learning? Artificial Intelligence? Yes as we know it. What is intelligence? The ability to acquire and apply knowledge and skills. What is learning? The acquisition of knowledge or skills through study, experience, or being taught

Learning Analyse data, acquire insight not visible through the data alone, apply the knowledge

Machine Learning Ability to learn without being explicitly programmed Algorithmic approach to learning Requires a training data set A largely automatic process Algorithms usually have parameters Parameters usually require optimisation

How do machines learn?

Types of Machine Learning Supervised Learning Data is labelled, i.e. outcomes are known Associate e-mails with labels: spam or not spam Associate tumors with labels: malignant or benign Learn to predict the output given the input Unsupervised Learning Data is unlabelled Find structure in the data Identify clusters Study what your customers buy, guess what to advertise based on the interests of their neighbours

Types of Machine Learning Supervised versus Unsupervised

Types of Machine Learning Reinforcement Learning Algorithm performs actions without knowing which actions are good Good actions are rewarded Learn the actions that maximise the overall reward Learn what moves make you win at chess

Supervised Learning There are two subcategories of supervised learning: classification and regression Classification learns a model to differentiate between multiple classes Regression learns a model to predict the real-valued output given a set of inputs

Supervised Learning Classification Example Training set for three classes: chair, table, bed
h    w    chair  table  bed
0.5  0.3  1      0      0
0.4  0.4  1      0      0
0.9  1.2  0      1      0
0.8  1.5  0      1      0
0.4  2.0  0      0      1
0.6  1.9  0      0      1
How do you determine the class of the object (chair, table, bed) given the dimensions (h, w)?

Supervised Learning Regression Example Training set with two input variables: m², bedrooms One output variable: house price
m²   bedrooms  price
95   1         950,000
100  2         1,000,000
50   1         860,000
145  3         1,200,000
210  4         2,300,000
How do you predict the price of the house, given its m² and the number of bedrooms?

Supervised Learning How does the model learn? The model will attempt to predict the outcome for the training inputs Compare produced output to desired output Update the model to reduce the error What exactly are we learning? y = f(x)

Supervised Learning What does the model learn? We do not know the real y = f(x), we can only approximate y = \hat{f}(x)

What does the machine learn? We do not know the real y = f(x), we can only approximate y = \hat{f}(x) Thus, the model will only be as good as the training data Data has to be representative!

How good is my model? Underfitting, Overfitting Underfitting: \hat{f}(x) is too simplistic compared to the real f(x) Overfitting: \hat{f}(x) is too complex compared to f(x), and fits irrelevant data such as noise We want to be able to arrive at the model in the middle

How good is my model? Regression If you are predicting a real value: calculate the distance between the predicted value and the target value
Mean Absolute Error: MAE = \frac{1}{N} \sum_{i} |y_i - t_i|
Mean Squared Error: MSE = \frac{1}{N} \sum_{i} (y_i - t_i)^2
Root Mean Squared Error: RMSE = \sqrt{\frac{1}{N} \sum_{i} (y_i - t_i)^2}
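
As a rough illustration (not part of the slides), the three error measures can be computed directly with NumPy; the prediction and target arrays below are made up for the example:

    import numpy as np

    # Hypothetical predictions y and targets t, purely for illustration.
    y = np.array([950_000, 1_050_000, 900_000], dtype=float)
    t = np.array([950_000, 1_000_000, 860_000], dtype=float)

    mae = np.mean(np.abs(y - t))     # Mean Absolute Error
    mse = np.mean((y - t) ** 2)      # Mean Squared Error
    rmse = np.sqrt(mse)              # Root Mean Squared Error

    print(mae, mse, rmse)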

How good is my model? Classification We are predicting categorical (discrete) values: how many did we get right? True positive: say cat when there is a cat False positive: say cat when there is no cat True negative: don't say cat when there is no cat False negative: don't say cat when there is a cat Accuracy: all correct predictions / all predictions

How good is my model? Calculate the model's goodness value over the training set Classification: suppose the accuracy is 100% Does it mean that the model is perfect? NO: training error/accuracy does not tell us how the model will perform on unseen examples Performance on unseen examples: generalisation performance How do you measure it? Reserve a subset of data for testing - do not show it to the model

How good is my model? Early Stopping Prevent overfitting by monitoring training and testing accuracy Stop the training when the test set accuracy goes down

How good is my model? K-Fold Cross-Validation Labelled data is often limited Can we estimate the generalisation error on the entire data set? Perform cross-validation: Divide the data set into K equal parts Train K times, each time choosing a new subset for testing Average the error
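
A minimal sketch of K-fold cross-validation, assuming generic helper functions train_model(X, y) and error(model, X, y); these names are placeholders for illustration, not part of any particular library:

    import numpy as np

    def k_fold_cv(X, y, k, train_model, error):
        """Average the test error over k folds; X and y are NumPy arrays."""
        n = len(X)
        indices = np.random.permutation(n)        # shuffle before splitting
        folds = np.array_split(indices, k)        # k (roughly) equal parts
        errors = []
        for i in range(k):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            model = train_model(X[train_idx], y[train_idx])
            errors.append(error(model, X[test_idx], y[test_idx]))
        return np.mean(errors)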

Supervised Learning Classification Example Training set for three classes: chair, table, bed
h    w    chair  table  bed
0.5  0.3  1      0      0
0.4  0.4  1      0      0
0.9  1.2  0      1      0
0.8  1.5  0      1      0
0.4  2.0  0      0      1
0.6  1.9  0      0      1
How do you determine the class of the object (chair, table, bed) given the dimensions (h, w)?

Nearest Neighbour Classification Nearest Neighbour For every unknown pattern x, find the closest known pattern y Class of x is likely to be the same as class of y, because y is x's nearest neighbour The data set is the model No explicit learning process

k-nn: K Nearest Neighbours k-nearest Neighbours Asking a single neighbour can be dangerous Find k neighbours instead, chose the majority class Question: how many neighbours is enough? Answer: determine empirically 2-class problem: odd k prevents ties

k-NN Effect of k on the boundary Larger k leads to smoother decision boundaries

k-NN Distance Metrics Given x and y, how do we determine the distance between them?
Euclidean: d = \sqrt{\sum_i (x_i - y_i)^2}
Manhattan (City block): d = \sum_i |x_i - y_i|
Minkowski: d = \left(\sum_i |x_i - y_i|^p\right)^{1/p} (for p \to \infty, d = \max_i |x_i - y_i|)
Binary: Hamming distance (how many bits need to be flipped)
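
The same metrics written out as small NumPy functions (a sketch; x and y are assumed to be numeric vectors of equal length):

    import numpy as np

    def euclidean(x, y):
        return np.sqrt(np.sum((x - y) ** 2))

    def manhattan(x, y):
        return np.sum(np.abs(x - y))

    def minkowski(x, y, p):
        return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

    def hamming(x, y):
        # For binary vectors: the number of positions in which they differ.
        return int(np.sum(x != y))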

k-NN Data Normalisation Given x = (1, 0.002, 1800) and y = (2, 0.015, 1500), which one of the three components will contribute to the distance the most? Normalise the input variables to even out their contribution:
Min-max scaling: x_{ij} = \frac{x_{ij} - \min(x_i)}{\max(x_i) - \min(x_i)}
Z-score (standardization): x_{ij} = \frac{x_{ij} - \mu(x_i)}{\sigma(x_i)}
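
Both rescalings as a sketch, applied column-wise to a data matrix X (rows are patterns, columns are input variables; the sample matrix is hypothetical):

    import numpy as np

    def min_max_scale(X):
        # Rescale every column to the range [0, 1].
        mins, maxs = X.min(axis=0), X.max(axis=0)
        return (X - mins) / (maxs - mins)

    def z_score(X):
        # Rescale every column to zero mean and unit standard deviation.
        return (X - X.mean(axis=0)) / X.std(axis=0)

    X = np.array([[1.0, 0.002, 1800.0],
                  [2.0, 0.015, 1500.0],
                  [1.5, 0.010, 1650.0]])
    print(min_max_scale(X))
    print(z_score(X))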

k-NN Things to consider How democratic should the k-vote be? How do you handle ties? Weighted k-NN: the contribution of each neighbour is inversely proportional to its distance from x Ties: closer neighbours have a stronger vote Can you use k-NN for regression? Yes: find the k nearest neighbours of x, output the average of their outputs as the approximation of f(x) Is the entire data set necessary? Remove borderline cases Remove noise and outliers Remove redundant examples
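
A hedged sketch of k-NN regression with inverse-distance weighting (one common weighting scheme, not necessarily the exact one intended in the lecture):

    import numpy as np

    def knn_regress(X_train, y_train, x, k=3, eps=1e-9):
        """Predict a real value for x as the inverse-distance-weighted
        average of the outputs of its k nearest neighbours."""
        distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
        nearest = np.argsort(distances)[:k]
        weights = 1.0 / (distances[nearest] + eps)   # closer neighbours weigh more
        return np.sum(weights * y_train[nearest]) / np.sum(weights)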

k-NN The good and The bad k-NN is great because... It is intuitive Only one parameter to tune: k There is no training phase ('lazy' classification/regression) Easily expandable by adding more labelled data k-NN is not perfect because... It is slow and expensive: O(nm) per query for n patterns with m attributes (store the data in an efficient data structure to speed this up) It does not derive a model of the data: lack of insight Distances between patterns can lose their meaning in high dimensions

What makes a good model? Diagnosing disease There are all kinds of parameters one can measure in a human being Does a doctor send you for every kind of medical test to diagnose a minor cold? No - that would be wasteful A series of tests is performed in order, narrowing down the search space with every step How do we model something like this?

Decision Trees: Rule-Based Classification Decision Tree A flowchart-like structure in which each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label.

Decision Trees To classify a pattern, start at the root Every node asks a question Every possible answer is associated with a branch Leaf nodes represent class labels

Decision Trees What's great about decision trees? Intuitive Interpretable as rules Given a labelled data set, how would you construct a decision tree out of it? We need a machine learning algorithm to automate the process

Decision Tree Learning Build the tree recursively:
1. Pick an attribute to divide the given data
2. Divide the data into subsets on the basis of this attribute
3. For every subset created above, repeat (1) and (2) until you arrive at leaf nodes
Leaf nodes are also referred to as pure nodes: i.e., nodes that represent examples of one class only How do we choose which attribute to split the data on first?

ID3 Iterative Dichotomiser 3 Invented in 1986 by J.R. Quinlan Based on information entropy Main idea: split on the attribute which maximises information gain
Entropy: a measure of chaos, disorder: E(S) = -\sum_{i=1}^{N} p_i \log_2 p_i
S - set, N - number of classes, p_i - probability (proportion) of class i
Only one class in a set: E(S) = 0 Two classes, each class is 1/2 of the set: E(S) = 1

ID3 Iterative Dichotomiser 3 Main idea: split on the attribute which maximises information gain Information gain: difference between the entropy of the original set and the weighted sum of entropies of the resulting sets
IG(S) = E(S) - \sum_{j} p_j E(S_j)
where p_j is the proportion of data patterns in the subset j, and S_j are the subsets resulting from the split
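
Entropy and information gain as a minimal sketch (class labels are assumed to be any hashable values, and a candidate split is represented simply as a list of label subsets):

    import numpy as np
    from collections import Counter

    def entropy(labels):
        # E(S) = -sum_i p_i log2 p_i over the class proportions p_i.
        counts = np.array(list(Counter(labels).values()), dtype=float)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(labels, subsets):
        # IG(S) = E(S) - sum_j p_j E(S_j), with p_j = |S_j| / |S|.
        n = len(labels)
        weighted = sum(len(s) / n * entropy(s) for s in subsets)
        return entropy(labels) - weighted

    # Toy example: splitting 4 'yes' / 2 'no' patterns into two subsets.
    S = ['yes'] * 4 + ['no'] * 2
    split = [['yes', 'yes', 'yes'], ['yes', 'no', 'no']]
    print(information_gain(S, split))   # about 0.46 bits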

Information Gain Splitting Rule In the golf example, the outlook attribute clearly offers the most information gain

Information Gain Splitting Rule Now the same algorithm can be re-applied to every non-leaf sub-branch.

Gain Ratio C4.5 Problem: what if the names of the golf players were added to the data set, each entry having a unique name? Not a great attribute to base decisions on! But if we split based on names, every branch will become a leaf: each player has a definite yes/no outcome Information gain is misleading: it is biased towards attributes with many possible values Remedies? Take the number of splits into account! C4.5 improves on ID3 by using the Gain Ratio:
GR(S) = \frac{IG(S)}{SI}, where SI = -\sum_{j=1}^{k} p_j \log_2 p_j
k is the number of splits, p_j is the proportion of patterns in subset j, and SI estimates the entropy of the split (split info)
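
The gain ratio as a small standalone sketch: the information gain is passed in as a number (e.g. computed with the function above), and the split is described by its subset sizes:

    import numpy as np

    def split_info(subset_sizes):
        # SI = -sum_j p_j log2 p_j, where p_j is the fraction of patterns in subset j.
        sizes = np.array(subset_sizes, dtype=float)
        p = sizes / sizes.sum()
        return -np.sum(p * np.log2(p))

    def gain_ratio(info_gain, subset_sizes):
        return info_gain / split_info(subset_sizes)

    # E.g. an information gain of 0.46 over a 3/3 split: SI = 1, so GR = 0.46.
    print(gain_ratio(0.46, [3, 3]))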

Gain Ratio C4.5

Gini Gain CART: Classification and Regression Trees A simpler alternative to Entropy: Gini impurity
Gini(S) = 1 - \sum_{i=1}^{N} p_i^2
N - number of classes, p_i - proportion of class i
Smallest when all patterns belong to one class Largest when classes are equally split
Gini gain: GG(S) = Gini(S) - \sum_{j} p_j Gini(S_j)
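
Gini impurity and Gini gain as a minimal sketch, mirroring the entropy example above (the labels and the candidate split are made up):

    import numpy as np
    from collections import Counter

    def gini(labels):
        # Gini(S) = 1 - sum_i p_i^2 over the class proportions p_i.
        counts = np.array(list(Counter(labels).values()), dtype=float)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def gini_gain(labels, subsets):
        # GG(S) = Gini(S) - sum_j p_j Gini(S_j), with p_j = |S_j| / |S|.
        n = len(labels)
        weighted = sum(len(s) / n * gini(s) for s in subsets)
        return gini(labels) - weighted

    S = ['yes'] * 4 + ['no'] * 2
    split = [['yes', 'yes', 'yes'], ['yes', 'no', 'no']]
    print(gini_gain(S, split))   # about 0.22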

Binary Splits One way to solve the problem of splitting an attribute into too many branches is to force a two-way (binary) split

Numeric Attributes Binning What if one of the attributes is continuous? E.g. age, income... Solution: discretise the attribute using binning Boundaries between the bins are the potential split points

Numeric Attributes Binning Consider the information gain/purity/goodness factor of each candidate split and choose the best one
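
One simple way to generate candidate split points for a continuous attribute is to take the midpoints between consecutive distinct sorted values (a common convention, assumed here rather than taken from the slides); each candidate is then scored with the chosen splitting criterion:

    import numpy as np

    def candidate_split_points(values):
        """Midpoints between consecutive distinct sorted values; each one
        is a potential threshold for a binary test x <= t."""
        v = np.unique(values)            # sorts and removes duplicates
        return (v[:-1] + v[1:]) / 2.0

    ages = np.array([23, 31, 31, 45, 52])
    print(candidate_split_points(ages))  # [27.  38.  48.5]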

Regression Trees What if not only the inputs, but the outputs are real numbers, too? Decision trees that output continuous values are called regression trees Instead of minimising impurity, minimise the data variance after the split:
Var(S) = \frac{1}{|S|^2} \sum_{i \in S} \sum_{j \in S} \frac{1}{2} (y_i - y_j)^2
where y_i is the target output. For node N, choose the split that maximises the variance reduction
I_V(N) = Var(S) - (Var(S_t) + Var(S_f))
where S is the set before the split, S_t is the subset after the split with test outcome true, and S_f is the subset after the split with test outcome false When more than one data point belongs to a leaf, the estimate is the average y per leaf
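
A hedged sketch of this variance-reduction criterion for a binary split; the pairwise formula above is equal to the ordinary (population) variance of the node's target values, which is what np.var computes, and the example split mask is hypothetical:

    import numpy as np

    def variance(y):
        # Equals (1/|S|^2) * sum_i sum_j (1/2)(y_i - y_j)^2 for the node's targets.
        return np.var(y)

    def variance_reduction(y, mask):
        """I_V = Var(S) - (Var(S_t) + Var(S_f)) for a boolean split mask."""
        y = np.asarray(y, dtype=float)
        return variance(y) - (variance(y[mask]) + variance(y[~mask]))

    y = np.array([950.0, 1000.0, 860.0, 1200.0, 2300.0])   # e.g. prices in thousands
    mask = y > 1100.0                                      # hypothetical test outcome
    print(variance_reduction(y, mask))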

When do you stop splitting? Split till all leaf nodes are pure? Not feasible in complex real-life data sets Noisy/imperfect data may lead to a tree that generalises poorly Stop splitting when: All leaf nodes are pure, or Maximum tree depth has been reached, or Further improvement in training error comes at the cost of worse generalisation error, or Improvement in purity resulting from a split is less than a preset threshold Problem with early stopping: how early is too early?

Pruning The opposite of splitting Grow the tree to its full size on the training set Starting at the bottom of the tree, remove leaves one by one, each time checking the error on the generalisation set If the generalisation error does not increase, keep pruning!

The End Questions? Assignment 1 will be published on http://cs.up.ac.za/courses/mit801 on Monday Expect to apply the techniques discussed today to a real data set Next lecture: Random Forests, Neural Networks, SVM, Unsupervised Learning