Classification Key Concepts


http://poloclub.gatech.edu/cse6242
CSE6242 / CX4242: Data & Visual Analytics

Classification Key Concepts

Duen Horng (Polo) Chau
Assistant Professor
Associate Director, MS Analytics
Georgia Tech

Parikshit Ram
GT PhD alum; SkyTree

Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Parikshit Ram (GT PhD alum; SkyTree), and Alex Gray

How will I rate "Chopin's 5th Symphony"?

[Table: songs (Some nights, Skyfall, Comfortably numb, We are young, ...) with a "Like?" column; the entry for Chopin's 5th is ???]

Classification

What tools do you need for classification?
1. Data S = {(x_i, y_i)}, i = 1, ..., n
   - x_i: data example with d attributes
   - y_i: label of example (what you care about)
2. Classification model f(a, b, c, ...) with some parameters a, b, c, ...
3. Loss function L(y, f(x))
   - how to penalize mistakes

Terminology

Data S = {(x_i, y_i)}, i = 1, ..., n
- x_i: data example with d attributes
- y_i: label of example

data example = data instance
attribute = feature = dimension
label = target attribute

Song name    | Artist   | Length | ... | Like?
Some nights  | Fun      | 4:23   | ... |
Skyfall      | Adele    | 4:00   | ... |
Comf. numb   | Pink Fl. | 6:13   | ... |
We are young | Fun      | 3:50   | ... |
...          | ...      | ...    | ... |
Chopin's 5th | Chopin   | 5:32   | ... | ??

What is a model? "A simplified representation of reality created to serve a purpose." (Data Science for Business)

Example: maps are abstract models of the physical world.

There can be many models! (Everyone sees the world differently, so each of us has a different model.)

In data science, a model is a formula to estimate what you care about. The formula may be mathematical, a set of rules, a combination, etc.

Training a classifier = building the model

How do you learn appropriate values for the parameters a, b, c, ...?
Analogy: how do you know your map is a good map of the physical world?

Classification loss function

Most common loss: the 0-1 loss function, which charges 1 for every mistake and 0 for every correct prediction.

More general loss functions are defined by an m x m cost matrix C such that L(y, f(x)) = C_ab, where y = a and f(x) = b.

T0 = true class 0, T1 = true class 1; P0 = predicted class 0, P1 = predicted class 1

     | T0   | T1
P0   | 0    | C_10
P1   | C_01 | 0
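To make this concrete, here is a minimal Python sketch of the 0-1 loss and the cost-matrix loss; the function names are illustrative, not from the slides:

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    # 0-1 loss, averaged over examples: each mistake costs 1, each correct prediction 0
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

def cost_matrix_loss(y_true, y_pred, C):
    # general loss via an m x m cost matrix: L(y, f(x)) = C[y, f(x)], averaged over examples
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(C[y_true, y_pred]))

# 0-1 loss is the special case: 0 on the diagonal, 1 everywhere else
C = np.array([[0, 1],
              [1, 0]])
y_true, y_pred = [0, 1, 1, 0], [0, 1, 0, 0]
print(zero_one_loss(y_true, y_pred))        # 0.25 (one mistake out of four)
print(cost_matrix_loss(y_true, y_pred, C))  # 0.25
```

A non-symmetric C lets you penalize, say, false negatives more heavily than false positives, which the 0-1 loss cannot express.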

An ideal model should correctly estimate:
- the labels of known or seen data examples
- the labels of unknown or unseen data examples

Song name    | Artist   | Length | ... | Like?
Some nights  | Fun      | 4:23   | ... |
Skyfall      | Adele    | 4:00   | ... |
Comf. numb   | Pink Fl. | 6:13   | ... |
We are young | Fun      | 3:50   | ... |
...          | ...      | ...    | ... |
Chopin's 5th | Chopin   | 5:32   | ... | ??

Training a classifier = building the model

Q: How do you learn appropriate values for parameters a, b, c, ...?
(Analogy: how do you know your map is a good map?)

- y_i = f_(a,b,c,...)(x_i), i = 1, ..., n
  Low/no error on training data ("seen" or "known")
- y = f_(a,b,c,...)(x), for any new x
  Low/no error on test data ("unseen" or "unknown")

Possible A: minimize the training loss, sum over i of L(y_i, f(x_i)), with respect to a, b, c, ...

It is very easy to achieve perfect classification on training/seen/known data. Why?

If your model works really well for training data, but poorly for test data, your model is overfitting.

How to avoid overfitting?

Example: one run of 5-fold cross-validation. You should do a few runs and compute the average (e.g., of the error rates, if that's your evaluation metric).

Image credit: http://stats.stackexchange.com/questions/1826/cross-validation-in-plain-english

Cross-validation
1. Divide your data into n parts.
2. Hold 1 part as the "test set" or "hold out set".
3. Train the classifier on the remaining n-1 parts (the "training set").
4. Compute the test error on the test set.
5. Repeat the above steps n times, once for each n-th part.
6. Compute the average test error over all n folds (i.e., the cross-validation test error).
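The six steps above can be sketched directly in Python. This is an illustrative implementation (the names are mine, not the lecture's); it assumes the caller supplies a train_and_test function that fits a classifier on the training folds and returns its error rate on the held-out fold:

```python
import numpy as np

def cross_val_error(X, y, train_and_test, n_folds=5, seed=0):
    """Average test error over n_folds folds.
    train_and_test(X_tr, y_tr, X_te, y_te) -> error rate on the test fold."""
    idx = np.random.RandomState(seed).permutation(len(y))  # step 1: shuffle, then split
    folds = np.array_split(idx, n_folds)
    errors = []
    for i in range(n_folds):
        test_idx = folds[i]                                  # step 2: hold out one part
        train_idx = np.concatenate(folds[:i] + folds[i+1:])  # step 3: train on the rest
        errors.append(train_and_test(X[train_idx], y[train_idx],
                                     X[test_idx], y[test_idx]))  # step 4
    return float(np.mean(errors))                            # steps 5-6: average over folds

# Illustrative use: a (deliberately bad) classifier that always predicts class 0.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 5 + [1] * 5)
always_zero = lambda X_tr, y_tr, X_te, y_te: float(np.mean(y_te != 0))
err = cross_val_error(X, y, always_zero, n_folds=5)
print(err)  # 0.5, since half the labels are 1
```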

Cross-validation variations
- Leave-one-out cross-validation (LOO-CV): test sets of size 1
- K-fold cross-validation: test sets of size n / K; K = 10 is most common (i.e., 10-fold CV)

Example: k-nearest-neighbor classifier

[Figure: points labeled "Like whiskey" vs. "Don't like whiskey"]

Image credit: Data Science for Business

k-nearest-neighbor classifier

The classifier: f(x) = majority label of the k nearest neighbors (NN) of x

Model parameters:
- number of neighbors k
- distance/similarity function d(., .)
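A minimal sketch of this classifier in Python (Euclidean distance assumed as the default d(., .); names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, dist=None):
    """f(x) = majority label of the k nearest neighbors of x.
    dist is the distance function d(., .); Euclidean by default."""
    if dist is None:
        dist = lambda a, b: np.linalg.norm(a - b)
    distances = [dist(xi, x) for xi in X_train]   # distance to every training point
    nearest = np.argsort(distances)[:k]           # indices of the k closest
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Toy data: one cluster near 0 (label 0), one near 5 (label 1)
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
y = np.array([0, 0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.15]), k=3))  # 0
print(knn_predict(X, y, np.array([4.9]), k=3))   # 1
```

Note that there is no training step at all; the "model" is the stored data plus the two parameters k and d(., .).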

But k-NN is so simple! It can work really well! Pandora uses it, or has used it: https://goo.gl/folfmp (from the book Data Mining for Business Intelligence).

Image credit: https://www.fool.com/investing/general/2015/03/16/will-the-music-industry-end-pandoras-business-mode.aspx

What are good models?
- Simple (few parameters) and effective: good
- Complex (many parameters) and effective (if significantly more so than simple methods): good
- Not-so-effective: not good

k-nearest-neighbor classifier

If k and d(., .) are fixed:
- Things to learn: ?
- How to learn them: ?

If d(., .) is fixed, but you can change k:
- Things to learn: ?
- How to learn them: ?

k-nearest-neighbor classifier

If k and d(., .) are fixed:
- Things to learn: nothing
- How to learn them: N/A

If d(., .) is fixed, but you can change k:
- Selecting k: how?

How to find the best k in k-NN? Use cross-validation (CV).
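As a sketch of that idea, the following illustrative Python code (function names are mine, not from the slides) picks k by leave-one-out cross-validation on a toy dataset:

```python
import numpy as np
from collections import Counter

def knn_label(X, y, x, k):
    # majority label among the k nearest training points (Euclidean distance)
    order = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return Counter(y[order]).most_common(1)[0][0]

def loo_error(X, y, k):
    # leave-one-out CV: classify each point using all the other points
    n = len(y)
    mistakes = 0
    for i in range(n):
        mask = np.arange(n) != i
        mistakes += knn_label(X[mask], y[mask], X[i], k) != y[i]
    return mistakes / n

# Two well-separated clusters of three points each
X = np.array([[0.0], [0.2], [0.4], [5.0], [5.2], [5.4]])
y = np.array([0, 0, 0, 1, 1, 1])
ks = [1, 3, 5]
errors = {k: loo_error(X, y, k) for k in ks}
best = min(ks, key=lambda k: errors[k])
# k = 5 fails here: with one point held out, the 5 "neighbors" span both clusters
# and the other cluster's 3 points always outvote the 2 remaining same-class points.
```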


k-nearest-neighbor classifier

If k is fixed, but you can change d(., .):

Possible distance functions:
- Euclidean distance: d(x, y) = sqrt(sum over j of (x_j - y_j)^2)
- Manhattan distance: d(x, y) = sum over j of |x_j - y_j|
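Written out in Python for concreteness, the two distance functions are:

```python
import numpy as np

def euclidean(a, b):
    # d(a, b) = sqrt(sum_j (a_j - b_j)^2)
    return float(np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2)))

def manhattan(a, b):
    # d(a, b) = sum_j |a_j - b_j|
    return float(np.sum(np.abs(np.asarray(a) - np.asarray(b))))

print(euclidean([0, 0], [3, 4]))  # 5.0
print(manhattan([0, 0], [3, 4]))  # 7.0
```

Either function can be passed in wherever a d(., .) is expected; the choice changes which neighbors count as "nearest" and therefore changes the classifier.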

Summary on the k-NN classifier

Advantages:
- Little learning (unless you are learning the distance function)
- Quite powerful in practice (and has theoretical guarantees as well)

Caveats:
- Computationally expensive at test time

Reading material:
- ESL book, Chapter 13.3: http://www-stat.stanford.edu/~tibs/elemstatlearn/printings/eslii_print10.pdf
- Le Song's slides on the k-NN classifier: http://www.cc.gatech.edu/~lsong/teaching/cse6740/lecture2.pdf

Decision trees (DT)

Visual introduction to decision trees: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

[Figure: example decision tree rooted at "Weather?"]

The classifier: f_T(x) = majority class in the leaf of the tree T containing x

Model parameters: the tree structure and size

Decision trees
- Things to learn: ?
- How to learn them: ?
- Cross-validation: ?

[Figure: example decision tree rooted at "Weather?"]

Learning the tree structure
- Things to learn: the tree structure
- How to learn it: (greedily) minimize the overall classification loss
- Cross-validation: find the best-sized tree with K-fold cross-validation

Decision trees: the pieces
1. Find the best split on the chosen attribute
2. Find the best attribute to split on
3. Decide when to stop splitting
4. Cross-validation

Highly recommended lecture slides from CMU: http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15381-s06/www/dts.pdf

Choosing the split point

Split types for a selected attribute j:
1. Categorical attribute (e.g., "genre"): x_1j = Rock, x_2j = Classical, x_3j = Pop
   Split on genre: one branch per value (Rock | Classical | Pop -> x_1 | x_2 | x_3)
2. Ordinal attribute (e.g., "achievement"): x_1j = Platinum, x_2j = Gold, x_3j = Silver
   Split on achievement: Plat. | Gold | Silver -> x_1 | x_2 | x_3
3. Continuous attribute (e.g., song duration): x_1j = 235, x_2j = 543, x_3j = 378
   Split on duration at a threshold: {x_1, x_3} | {x_2}

Choosing the split point

At a node T, for a given attribute j, select the split s that minimizes loss(T_L) + loss(T_R), where loss(T) is the loss at node T, and T_L and T_R are the two child nodes created by the split.

Common node loss functions:
- Misclassification rate
- Expected loss
- Normalized negative log-likelihood (= cross-entropy)

More details on loss functions: see Chapter 3.3 of http://www.stat.cmu.edu/~cshalizi/350/lectures/22/lecture-22.pdf
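Two of these node losses, sketched in Python on the list of labels reaching a node (illustrative names; "expected loss" would additionally weight mistakes by a cost matrix):

```python
import numpy as np
from collections import Counter

def misclassification_rate(labels):
    # fraction of points at the node that are not in the majority class
    counts = Counter(labels)
    return 1 - max(counts.values()) / len(labels)

def cross_entropy(labels):
    # normalized negative log-likelihood: -sum_c p_c * log2(p_c),
    # where p_c is the fraction of the node's points with class c
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

node_labels = ["rock"] * 3 + ["pop"] * 1
print(misclassification_rate(node_labels))  # 0.25
print(cross_entropy(node_labels))           # ~0.811 bits
```

Both losses are 0 for a pure node and largest when the classes are evenly mixed, which is exactly what makes them useful for scoring candidate splits via loss(T_L) + loss(T_R).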

Choosing the attribute

Choice of attribute:
1. The attribute providing the maximum improvement in training loss
2. The attribute with the highest information gain (mutual information)

Intuition: the attribute with the highest information gain helps most rapidly describe an instance (i.e., most rapidly reduces "uncertainty").
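Information gain can be computed as the parent node's entropy minus the weighted average entropy of the children a split produces; a small illustrative sketch:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    # H(y) = -sum_c p_c * log2(p_c) over the classes present at the node
    p = np.array(list(Counter(labels).values())) / len(labels)
    return float(-np.sum(p * np.log2(p)))

def information_gain(labels, groups):
    # gain = H(parent) - sum over branches of (branch size / n) * H(branch)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder

y = [0, 0, 1, 1]
# a split that separates the classes perfectly recovers the full H(y) = 1 bit
print(information_gain(y, [[0, 0], [1, 1]]))  # 1.0
# a split whose branches look just like the parent tells us nothing
print(information_gain(y, [[0, 1], [0, 1]]))  # 0.0
```

Picking the attribute with the highest gain is the classic ID3/C4.5 heuristic the slide alludes to: it greedily chooses the question that reduces uncertainty the most.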

Let's look at an excellent example of using information gain to pick the splitting attribute and the split point (for that attribute): http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15381-s06/www/dts.pdf (slide 7 onwards)

When to stop splitting?

Common strategies:
1. Pure and impure leaf nodes
   - All points belong to the same class; OR
   - All points from one class completely overlap with points from another class (i.e., they have the same attributes)
   - Output the majority class as this leaf's label
2. The node contains fewer points than some threshold
3. The node's purity is higher than some threshold
4. Further splits provide no improvement in training loss (loss(T) <= loss(T_L) + loss(T_R))

Graphics from: http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15381-s06/www/dts.pdf

Parameters vs. hyper-parameters

Example hyper-parameters (need to "experiment"):
- k-NN: k, the similarity function
- Decision tree: number of nodes, ...
These can be determined using CV and grid search (http://scikit-learn.org/stable/modules/grid_search.html).

Example parameters (can be learned / estimated / computed directly from data):
- Decision tree (entropy-based): which attribute to split on; the split point for an attribute

Summary on decision trees

Advantages:
- Easy to implement
- Interpretable
- Very fast test time
- Can work seamlessly with mixed attributes
- Works quite well in practice

Caveats:
- "Too basic" (but OK if it works!)
- Training can be very expensive
- Cross-validation is hard (node-level CV)