Supervised Learning: K-Nearest Neighbors and Decision Trees

Supervised Learning: K-Nearest Neighbors and Decision Trees
Piyush Rai
CS5350/6350: Machine Learning
August 25, 2011

Supervised Learning
Given training data {(x_1, y_1), ..., (x_N, y_N)}: N input/output pairs, where x_i is the input and y_i is the output (label).
x_i is a vector consisting of D features (also called attributes or dimensions). Features can be discrete or continuous. x_{im} denotes the m-th feature of x_i.
Forms of the output:
y_i ∈ {1, ..., C} for classification: a discrete variable.
y_i ∈ R for regression: a continuous (real-valued) variable.
Goal: predict the output y for an unseen test example x.
This lecture covers two intuitive methods: K-Nearest Neighbors and Decision Trees.
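
As a concrete illustration of this setup, here is a minimal sketch in Python/NumPy (the array shapes and values are my own illustrative choices, not from the lecture): the N training inputs form an N x D matrix and the labels a length-N vector.

```python
import numpy as np

# N = 4 examples, D = 3 features each (illustrative values only)
X = np.array([[5.1, 3.5, 1.4],
              [4.9, 3.0, 1.4],
              [6.2, 2.9, 4.3],
              [5.9, 3.0, 5.1]])
y = np.array([0, 0, 1, 1])  # classification labels; here C = 2 classes, coded 0/1
# x_{im}, the m-th feature of x_i, is X[i, m]
```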

K-Nearest Neighbor (K-NN)
Given training data D = {(x_1, y_1), ..., (x_N, y_N)} and a test point.
Prediction rule: look at the K most similar training examples.
For classification: assign the majority class label (majority voting).
For regression: assign the average response.
The algorithm requires:
Parameter K: the number of nearest neighbors to look for.
A distance function: to compute the similarities between examples.
Special case: 1-Nearest Neighbor.

K-Nearest Neighbors Algorithm
Compute the test point's distance from each training point.
Sort the distances in ascending (or descending) order.
Use the sorted distances to select the K nearest neighbors.
Use majority rule (for classification) or averaging (for regression).
Note: K-Nearest Neighbors is called a non-parametric method. Unlike other supervised learning algorithms, it does not learn an explicit mapping f from the training data; it simply uses the training data at test time to make predictions.
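
A minimal sketch of the above procedure in Python/NumPy (the function name and array layout are my own choices, not from the lecture):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, K=3, classification=True):
    """Predict the label (or response) for a single test point x_test.
    X_train: (N, D) array, y_train: length-N array, x_test: (D,) array."""
    # Distance of the test point from each training point (Euclidean)
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # Indices of the K nearest neighbors (smallest distances)
    nn_idx = np.argsort(dists)[:K]
    if classification:
        # Majority vote over the K nearest labels
        return Counter(y_train[nn_idx].tolist()).most_common(1)[0][0]
    # Regression: average the K nearest responses
    return y_train[nn_idx].mean()
```

For example, with X and y as defined earlier, knn_predict(X, y, np.array([6.0, 3.0, 4.5]), K=3) returns the majority label among the 3 closest training points.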

K-NN: Computing the Distances
The K-NN algorithm requires computing the distance of the test example from each of the training examples. There are several ways to compute distances; the choice depends on the type of the features in the data.
Real-valued features ($x_i \in \mathbb{R}^D$): the Euclidean distance is commonly used,
$$d(x_i, x_j) = \sqrt{\sum_{m=1}^{D} (x_{im} - x_{jm})^2} = \sqrt{\|x_i\|^2 + \|x_j\|^2 - 2\, x_i^\top x_j}$$
a generalization of the distance between points in 2 dimensions.
$\|x_i\| = \sqrt{\sum_{m=1}^{D} x_{im}^2}$ is called the norm of $x_i$; the norm of a vector is also its length.
$x_i^\top x_j = \sum_{m=1}^{D} x_{im} x_{jm}$ is called the dot (or inner) product of $x_i$ and $x_j$. The dot product measures the similarity between two vectors (orthogonal vectors have dot product 0; parallel vectors have a high dot product).
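
A quick numerical sanity check of the two equivalent forms of the Euclidean distance (the example vectors are my own):

```python
import numpy as np

x_i = np.array([1.0, 2.0, 3.0])
x_j = np.array([4.0, 0.0, -1.0])

# Direct form: sqrt(sum of squared differences)
direct = np.sqrt(((x_i - x_j) ** 2).sum())
# Expanded form: sqrt(||x_i||^2 + ||x_j||^2 - 2 x_i.x_j)
expanded = np.sqrt(x_i @ x_i + x_j @ x_j - 2 * (x_i @ x_j))
print(direct, expanded)  # both are sqrt(29) = 5.3851...
```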

K-NN: Feature Normalization
Note: features should be on the same scale. For example, if one feature has its values in millimeters and another in centimeters, we would need to normalize.
One way is to replace $x_{im}$ by $z_{im} = \frac{x_{im} - \bar{x}_m}{\sigma_m}$ (make each feature zero mean, unit variance), where
$\bar{x}_m = \frac{1}{N}\sum_{i=1}^{N} x_{im}$ is the empirical mean of the m-th feature, and
$\sigma_m^2 = \frac{1}{N}\sum_{i=1}^{N} (x_{im} - \bar{x}_m)^2$ is its empirical variance.
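
A sketch of this normalization (the slide does not say which data the statistics are computed on; computing them on the training set and reusing them for test points, as below, is the usual convention and is an assumption on my part):

```python
import numpy as np

def standardize(X_train, X_test):
    """Zero-mean, unit-variance scaling of each feature (z-scores)."""
    mean = X_train.mean(axis=0)           # empirical mean of each feature
    std = X_train.std(axis=0)             # empirical standard deviation
    std[std == 0] = 1.0                   # guard against constant features
    return (X_train - mean) / std, (X_test - mean) / std
```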

K-NN: Some Other Distance Measures
Binary-valued features: use the Hamming distance, $d(x_i, x_j) = \sum_{m=1}^{D} \mathbb{I}(x_{im} \neq x_{jm})$, which counts the number of features on which the two examples disagree.
Mixed feature types (some real-valued and some binary-valued)? Use a mixed distance measure, e.g., Euclidean for the real-valued part and Hamming for the binary part.
Features can also be assigned weights: $d(x_i, x_j) = \sum_{m=1}^{D} w_m\, d(x_{im}, x_{jm})$.
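
A sketch of such a mixed distance (the feature index sets and weights below are my own illustrative choices):

```python
import numpy as np

def mixed_distance(a, b, real_idx, binary_idx, w_real=1.0, w_binary=1.0):
    """Euclidean distance over the real-valued features plus (weighted)
    Hamming distance over the binary-valued features."""
    d_real = np.sqrt(((a[real_idx] - b[real_idx]) ** 2).sum())
    d_binary = (a[binary_idx] != b[binary_idx]).sum()
    return w_real * d_real + w_binary * d_binary
```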

Choice of K (Neighborhood Size)
Small K: creates many small regions for each class; may lead to non-smooth decision boundaries and overfitting.
Large K: creates fewer, larger regions; usually leads to smoother decision boundaries (caution: a too-smooth decision boundary can underfit).
Choosing K: often data-dependent and heuristic-based, or done via cross-validation (using some held-out data). In general, a K that is too small or too big is bad!
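
A sketch of choosing K with held-out data, reusing knn_predict from the earlier sketch (the candidate grid and the use of a single validation split, rather than full cross-validation, are my own simplifications):

```python
import numpy as np

def choose_k(X_train, y_train, X_val, y_val, candidates=(1, 3, 5, 7, 9)):
    """Return the K with the highest accuracy on the held-out validation set."""
    best_k, best_acc = None, -1.0
    for k in candidates:
        preds = np.array([knn_predict(X_train, y_train, x, K=k) for x in X_val])
        acc = (preds == y_val).mean()
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k
```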

K-Nearest Neighbor: Properties
What's nice:
Simple and intuitive; easily implementable.
Asymptotically consistent (a theoretical property): with infinite training data and a large enough K, K-NN approaches the best possible classifier (Bayes optimal).
What's not so nice:
It stores all the training data in memory even at test time, which can be memory-intensive for large training datasets. This makes it an example of non-parametric, or memory/instance-based, methods, as opposed to parametric, model-based learning methods.
It is expensive at test time: O(ND) computations for each test point, since we have to search through all N training points (D features each) to find the nearest neighbors.
It is sensitive to noisy features.
It may perform badly in high dimensions (curse of dimensionality): in high dimensions, distance notions can be counter-intuitive!

Not Covered (Further Readings)
Computational speed-ups (we don't want to spend O(ND) time per test point): improved data structures for fast nearest-neighbor search; even approximate nearest neighbors may be good enough.
Efficient storage (we don't want to store all the training data): e.g., subsampling the training data to retain prototypes, which leads to computational speed-ups too.
Metric learning: learning the right distance metric for a given dataset.

Decision Tree
A decision tree is defined by a hierarchy of rules (in the form of a tree). The rules form the internal nodes of the tree (the topmost internal node is the root). Each rule (internal node) tests the value of some property of the data.
Decision Tree Learning: the training data is used to construct the decision tree (DT); the DT is then used to predict the label y for a test input x.

Decision Tree Learning: Example 1
Identifying the region (blue or green) a point lies in: a classification problem (blue vs. green) in which each input has 2 features, its coordinates (x_1, x_2) in the 2D plane. [Figure in the slides: left, the training data; right, a decision tree constructed from this data.]
The DT can be used to predict the region (blue/green) of a new test point by testing the features of the test point in the order defined by the DT (first x_2, then x_1).

Decision Tree Learning: Example 2
Deciding whether or not to play tennis on a Saturday: a classification problem (play vs. no-play) in which each input has 4 features: Outlook, Temperature, Humidity, Wind. [Figure in the slides: left, the training data; right, a decision tree constructed from this data.]
The DT can be used to predict play vs. no-play for a new Saturday by testing the features of that day in the order defined by the DT.

Decision Tree Construction
Now let's look at the tennis-playing example. Question: why does it make more sense to test the feature Outlook first? Answer: of all the 4 features, it is the most informative. We will see shortly how to quantify this informativeness; the information content of a feature decides its position in the DT.
Analogy: playing the game of 20 Questions (ask the most useful questions first).

Decision Tree Construction
Summarizing: the training data is used to construct the DT. Each internal node is a rule (testing the value of some feature), and highly informative features are placed higher up in the tree. We therefore need a way to rank features according to their information content; we will use Entropy and Information Gain as the criteria.
Note: there are several specific versions of the decision tree, e.g., ID3, C4.5, and Classification and Regression Trees (CART). We will be looking at the ID3 algorithm.

Entropy
Entropy measures the randomness/uncertainty in the data. Consider a set S of examples drawn from C classes. The entropy of this set is
$$H(S) = -\sum_{c \in C} p_c \log_2 p_c$$
where $p_c$ is the probability that an element of S belongs to class c, i.e., the fraction of elements of S belonging to class c.
Intuition: entropy is a measure of the degree of surprise. A few dominant classes mean small entropy (less uncertainty); equiprobable classes mean high entropy (more uncertainty). Entropy is also the average number of bits needed to encode (the class of an element of) S.
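
A minimal entropy computation in Python/NumPy (the function name is my own), applied to the class counts used in the tennis example below:

```python
import numpy as np

def entropy(labels):
    """H(S) = -sum_c p_c log2(p_c), with p_c the fraction of labels in class c."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Root node of the tennis example: 9 'play' vs. 5 'no-play' examples
print(entropy(['play'] * 9 + ['no'] * 5))  # ~0.940
```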

Information Gain
Assume each element of S is described by a set of features. The information gain (IG) of a feature F is
$$IG(S, F) = H(S) - \sum_{f \in F} \frac{|S_f|}{|S|}\, H(S_f)$$
where $S_f$ is the subset of elements of S for which feature F takes value f (and $|S_f|$ is its size).
IG(S, F) measures the increase in our certainty about S once we know the value of F. Equivalently, it denotes the number of bits saved when encoding S once we know the value of the feature F.
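
A direct translation of this formula into code, reusing entropy() from the previous sketch (the function name and argument layout are my own):

```python
import numpy as np

def information_gain(feature_values, labels):
    """IG(S, F) = H(S) - sum_f (|S_f| / |S|) * H(S_f).
    feature_values[i] is the value of feature F for example i; labels[i] its class."""
    feature_values = np.asarray(feature_values)
    labels = np.asarray(labels)
    gain = entropy(labels)
    for f in np.unique(feature_values):
        subset = labels[feature_values == f]      # S_f
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain
```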

Computing Information Gain
Let's begin with the root node of the DT and compute the IG of each feature. Consider the feature Wind ∈ {weak, strong} and its IG with respect to the root node.
Root node: S = [9+, 5-] (all the training data: 9 play, 5 no-play).
Entropy: $H(S) = -(9/14)\log_2(9/14) - (5/14)\log_2(5/14) = 0.94$
$S_{weak} = [6+, 2-] \Rightarrow H(S_{weak}) = 0.811$
$S_{strong} = [3+, 3-] \Rightarrow H(S_{strong}) = 1$
$$IG(S, \text{wind}) = H(S) - \frac{|S_{weak}|}{|S|} H(S_{weak}) - \frac{|S_{strong}|}{|S|} H(S_{strong}) = 0.94 - \frac{8}{14}\cdot 0.811 - \frac{6}{14}\cdot 1 = 0.048$$
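
The same computation can be reproduced with the entropy() and information_gain() sketches above, building the label and feature lists directly from the counts on this slide:

```python
# 8 weak-wind days (6 play, 2 no-play) and 6 strong-wind days (3 play, 3 no-play)
wind = ['weak'] * 8 + ['strong'] * 6
play = (['play'] * 6 + ['no'] * 2) + (['play'] * 3 + ['no'] * 3)
print(information_gain(wind, play))  # ~0.048
```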

Choosing the Most Informative Feature
At the root node, the information gains are:
IG(S, wind) = 0.048 (computed above)
IG(S, outlook) = 0.246
IG(S, humidity) = 0.151
IG(S, temperature) = 0.029
Outlook has the maximum IG, so it is chosen as the root node.
Growing the tree: iteratively select the feature with the highest information gain for each child of the previous node (a sketch of this greedy procedure is given below).
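
A sketch of this greedy, ID3-style construction, reusing information_gain() from above (the dictionary-based representation of examples and of the resulting tree is my own choice, not the lecture's):

```python
def build_tree(examples, labels, features):
    """examples: list of dicts mapping feature name -> value; labels: list of classes;
    features: list of feature names still available for splitting."""
    if len(set(labels)) == 1:                  # pure node: return the single class
        return labels[0]
    if not features:                           # no features left: majority class
        return max(set(labels), key=labels.count)
    # Split on the feature with the highest information gain at this node
    best = max(features, key=lambda F: information_gain([e[F] for e in examples], labels))
    tree = {best: {}}
    for value in set(e[best] for e in examples):
        idx = [i for i, e in enumerate(examples) if e[best] == value]
        tree[best][value] = build_tree([examples[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [F for F in features if F != best])
    return tree
```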

Next Lecture
The ID3 decision tree algorithm, stated formally (we have already seen the ingredients by now); decision tree properties, e.g., dealing with missing features, overfitting, etc.; and a maths refresher.