DATA MINING INTRODUCTION TO CLASSIFICATION USING LINEAR CLASSIFIERS

Classification: Definition
Given a collection of records (the training set), where each record contains a set of attributes, one of which is the class attribute, the goal is to model the class attribute as a function of the other attributes. Previously unseen records should then be assigned a class as accurately as possible (predictive accuracy). A test set is used to determine the accuracy of the model: the given labeled data is usually divided into a training set, used to build the model, and a test set, used to evaluate it.
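The train/test split described above can be sketched in a few lines of pure Python. The helper function is illustrative (not part of the lecture), and the tiny dataset reuses a few records from the insect collection that appears later in these slides:

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Split labeled records into a training set (used to build the model)
    and a test set (used to evaluate it)."""
    rng = random.Random(seed)
    shuffled = records[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

# Each record: (attribute vector, class label)
data = [((2.7, 5.5), "Grasshopper"), ((8.0, 9.1), "Katydid"),
        ((0.9, 4.7), "Grasshopper"), ((5.4, 8.5), "Katydid"),
        ((6.1, 6.6), "Katydid"), ((0.5, 1.0), "Grasshopper")]
train, test = train_test_split(data)
```

A real workflow would build the model on `train` only and report accuracy on `test` only, so the evaluation reflects performance on unseen records.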

Illustrating the Classification Task

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

A learning algorithm performs induction on the training set to learn a model; applying the model to the test set is deduction.

Classification Examples
Predicting tumor cells as benign or malignant
Classifying credit card transactions as legitimate or fraudulent
Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil
Categorizing news stories as finance, weather, entertainment, sports, etc.

Classification Techniques
Linear discriminant methods (e.g., support vector machines)
Decision tree based methods
Rule-based methods
Memory-based reasoning (nearest neighbor)
Neural networks
Naïve Bayes
We will start with a simple linear classifier. All of these methods will be covered in this course.

The Classification Problem
Given a collection of five instances of Katydids and five instances of Grasshoppers, decide which type of insect an unlabeled example corresponds to: Katydid or Grasshopper?

For any domain of interest, we can measure features. For insects:
Color {Green, Brown, Gray, Other}
Has Wings?
Abdomen Length
Thorax Length
Antennae Length
Mandible Size
Spiracle Diameter
Leg Length

We can store the features in a database. The classification problem can now be expressed as: given a training database, predict the class label of a previously unseen instance.

My_Collection:
Insect ID  Abdomen Length  Antennae Length  Insect Class
1          2.7             5.5              Grasshopper
2          8.0             9.1              Katydid
3          0.9             4.7              Grasshopper
4          1.1             3.1              Grasshopper
5          5.4             8.5              Katydid
6          2.9             1.9              Grasshopper
7          6.1             6.6              Katydid
8          0.5             1.0              Grasshopper
9          8.3             6.6              Katydid
10         8.1             4.7              Katydid

Previously unseen instance: ID 11, Abdomen Length 5.1, Antennae Length 7.0, Class ???

[Scatter plot: the Grasshopper and Katydid examples plotted by Abdomen Length (x-axis, 1-10) and Antenna Length (y-axis, 1-10).]

[Scatter plot: the same Grasshopper and Katydid examples, Abdomen Length vs. Antenna Length.]
Each of these data objects is called an exemplar, a (training) example, an instance, or a tuple.

We will return to the previous slide in two minutes. In the meantime, we are going to play a quick game.

Problem 1
[Figure: eight examples, each a pair of bars (left length, right length). Class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5). Class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3).]

Problem 1
[Figure: the same class A and class B bar-pair examples.] What class is this object: (8, 1.5)? What about this one, A or B: (4.5, 7)?

Problem 2
[Figure: bar-pair examples. Class A: (4, 4), (5, 5), (6, 6), (3, 3). Class B: (5, 2.5), (2, 5), (5, 3), (2.5, 3).] Oh! This one's hard: what class is (8, 1.5)?

Problem 3
[Figure: bar-pair examples of class A and class B, with lengths (6, 6), (4, 4), (5, 6), (1, 5), (7, 5), (6, 3), (4, 8), (3, 7), (7, 7).] This one is really hard! What is this, A or B?

Why did we spend so much time on this game? Because we wanted to show that almost all classification problems have a geometric interpretation; see the next three slides.

Problem 1
[Plot: the Problem 1 examples in the Right Bar (x-axis) vs. Left Bar (y-axis) plane, axes 1-10.]
Here is the rule again: if the left bar is smaller than the right bar, it is an A; otherwise it is a B.

Problem 2
[Plot: the Problem 2 examples in the Right Bar vs. Left Bar plane, axes 1-10.]
Let me look it up... here it is: the rule is, if the two bars are of equal size, it is an A; otherwise it is a B.

Problem 3
[Plot: the Problem 3 examples in the Right Bar vs. Left Bar plane, axes 10-100.]
The rule again: if the square of the sum of the two bars is less than or equal to 100, it is an A; otherwise it is a B.
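The three rules above can be written directly as tiny classifiers. This is a sketch; the `(left, right)` pair representation of each bar figure is an assumption of the example:

```python
def classify_problem1(left, right):
    # Rule 1: the left bar is smaller than the right bar -> A, otherwise B
    return "A" if left < right else "B"

def classify_problem2(left, right):
    # Rule 2: the two bars are of equal size -> A, otherwise B
    return "A" if left == right else "B"

def classify_problem3(left, right):
    # Rule 3: the square of the sum of the two bars is <= 100 -> A, otherwise B
    return "A" if (left + right) ** 2 <= 100 else "B"

print(classify_problem1(3, 4))   # left < right -> A
print(classify_problem3(3, 7))   # boundary case: (3 + 7)^2 == 100 -> A
```

Note how different the three rules are: rule 1 is a straight line through feature space, rule 2 is an exact-equality condition that no linear separator captures, and rule 3 has a nonlinear form that still reduces to a linear boundary for positive bar lengths.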

Back to the insects. [Scatter plot: the Grasshopper and Katydid examples, Abdomen Length vs. Antenna Length.]

We can project the previously unseen instance (ID 11: Abdomen Length 5.1, Antennae Length 7.0, class ???) into the same space as the database. [Plot: the unseen point placed among the Katydid and Grasshopper examples.] We have now abstracted away the details of our particular problem: it will be much easier to talk about points in space.

Simple Linear Classifier
[Plot: a straight line separating Katydids (above) from Grasshoppers (below).]
If the previously unseen instance is above the line, then the class is Katydid; else the class is Grasshopper. (R. A. Fisher, 1890-1962.)
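A minimal sketch of such a classifier. The slope and intercept below are illustrative values chosen by eye to separate the My_Collection examples, not the line actually fitted in the lecture:

```python
def classify_insect(abdomen_length, antenna_length,
                    slope=-1.0, intercept=10.0):
    """Classify by which side of the line
    antenna = slope * abdomen + intercept the instance falls on.
    The slope and intercept are illustrative, not learned."""
    boundary = slope * abdomen_length + intercept
    return "Katydid" if antenna_length > boundary else "Grasshopper"

print(classify_insect(8.0, 9.1))  # above the line -> Katydid
print(classify_insect(2.9, 1.9))  # below the line -> Grasshopper
print(classify_insect(5.1, 7.0))  # the previously unseen instance
```

With these values, every training record in My_Collection lands on the correct side, and the unseen instance (5.1, 7.0) falls on the Katydid side.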

Fitting a Model to Data
One way to build a predictive model is to specify the structure of the model with some parameters left missing; this is called parameter learning or parametric modeling. It is common in statistics, but it also covers data mining methods, since the two fields overlap. Examples: linear regression, logistic regression, support vector machines.

Linear Discriminant Functions
The equation of a line is y = mx + b. A classification function may look like:
Class +: if 1.0 * age - 1.5 * balance + 60 > 0
Class -: if 1.0 * age - 1.5 * balance + 60 <= 0
The general form is f(x) = w0 + w1*x1 + w2*x2 + ... This is a parameterized model where the weight of each feature is a parameter; the larger the magnitude of a weight, the more important the feature. The separator is a line in 2D, a plane in 3D, and a hyperplane in more than 3 dimensions.
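The decision rule above can be sketched directly, with the example weights treated as given rather than learned:

```python
def f(age, balance, w0=60.0, w_age=1.0, w_balance=-1.5):
    # General form f(x) = w0 + w1*x1 + w2*x2; the sign decides the class
    return w0 + w_age * age + w_balance * balance

def classify_sign(age, balance):
    return "+" if f(age, balance) > 0 else "-"

print(classify_sign(30, 50))  # f = 60 + 30 - 75 = 15 > 0  -> "+"
print(classify_sign(20, 60))  # f = 60 + 20 - 90 = -10     -> "-"
```

The magnitude of a weight controls how strongly that feature moves f(x): here a one-unit change in balance shifts the score 1.5 times as much as a one-unit change in age.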

What is the Best Separator?
[Plot: several candidate separating lines through the same data.]
Each separator has a different margin, which is its distance to the closest point; the orange line has the largest margin. For support vector machines, the line/plane with the largest margin is best. Optimization is done with a hinge loss function: there is no penalty for a point on the correct side of the separator and outside the margin, and once a point violates the margin the penalty increases linearly. In most real-world cases you will not have data that is linearly separable.

Scoring and Ranking Instances
Sometimes we want to know which examples are most likely to belong to a class, and linear discriminant functions can give us this: instances closer to the separator are less certain, and instances further away are more certain. In fact, the magnitude of f(x) gives us this, where larger values are more confident/likely.

Class Probability Estimation
A class probability estimate is also something you often want. It comes almost for free with data mining methods like decision trees, but it is more complicated with linear discriminant functions, since the distance from the separator is not a probability. Logistic regression solves this: it turns the output of a linear function into a class probability estimate. We will not go into the details in this class.
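Concretely, logistic regression passes the linear score f(x) through the logistic (sigmoid) function to obtain a probability. A minimal sketch:

```python
import math

def sigmoid(score):
    # Maps any real-valued score to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-score))

print(sigmoid(0.0))   # a point exactly on the separator -> 0.5
```

A point on the separator (score 0) gets probability 0.5, and the probability moves smoothly toward 1 or 0 as the point moves away from the separator on either side, matching the intuition that distance from the separator reflects confidence.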

Classification Accuracy

Confusion Matrix:
                        Predicted Katydid (1)   Predicted Grasshopper (0)
Actual Katydid (1)      f11                     f10
Actual Grasshopper (0)  f01                     f00

Accuracy = (number of correct predictions) / (total number of predictions)
         = (f11 + f00) / (f11 + f10 + f01 + f00)

Error rate = (number of wrong predictions) / (total number of predictions)
           = (f10 + f01) / (f11 + f10 + f01 + f00)

Confusion Matrix
In a binary decision problem, a classifier labels examples as either positive or negative. A classifier's results form a confusion (contingency) matrix, which shows four quantities: TP (true positives), TN (true negatives), FP (false positives), FN (false negatives).

                     Predicted Positive (+)   Predicted Negative (-)
Actual Positive (Y)  TP                       FN
Actual Negative (N)  FP                       TN

For now you are responsible for knowing recall and precision.
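All four metrics discussed here follow directly from the four counts. A sketch, with made-up example counts:

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, error rate, precision, and recall
    from a binary confusion matrix."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    error_rate = (fn + fp) / total
    precision = tp / (tp + fp)   # of predicted positives, how many are correct
    recall = tp / (tp + fn)      # of actual positives, how many are found
    return accuracy, error_rate, precision, recall

acc, err, prec, rec = metrics(tp=40, fn=10, fp=5, tn=45)
print(acc, err, rec)   # accuracy 0.85, error rate 0.15, recall 0.8
```

Precision and recall matter because accuracy alone can mislead when classes are imbalanced: a classifier that always predicts the majority class can score high accuracy while finding none of the positives.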

The simple linear classifier is also defined for higher-dimensional spaces

where we can visualize the separator as a hyperplane

Which of the problems can be solved by the simple linear classifier?
1) Problem 1: perfect
2) Problem 2: useless
3) Problem 3: pretty good
[Plots: a line separates Problem 1's classes exactly; no line helps with Problem 2; a line separates Problem 3's classes approximately.]
Problems that can be solved by a linear classifier are called linearly separable.

A Famous Problem
R. A. Fisher's Iris dataset: 3 classes (Setosa, Versicolor, Virginica), 50 instances of each. The task is to classify Iris plants into one of the 3 varieties using Petal Length and Petal Width. [Images: Iris Setosa, Iris Versicolor, Iris Virginica.]

We can generalize the piecewise linear classifier to N classes by fitting N - 1 lines. In this case we first learned the line to (perfectly) discriminate between Setosa and Virginica/Versicolor, then we learned a line to approximately discriminate between Virginica and Versicolor.
If petal width > 3.272 - (0.325 * petal length) then class = Virginica
Elseif petal width ...

How Do We Compare Classification Algorithms? What criteria do we care about? What matters?
Predictive accuracy
Speed and scalability: time to construct the model; time to use/apply the model
Interpretability: the understanding and insight provided by the model; the ability to explain/justify its results
Robustness: handling noise, missing values, irrelevant features, and streaming data