CS145: INTRODUCTION TO DATA MINING
08: Classification Evaluation and Practical Issues
Instructor: Yizhou Sun (yzsun@cs.ucla.edu)
October 24, 2017

Learnt Prediction and Classification Methods
- Classification (Vector Data): Logistic Regression; Decision Tree; KNN; SVM; NN
- Classification (Text Data): Naïve Bayes for Text
- Clustering (Vector Data): K-means; hierarchical clustering; DBSCAN; Mixture Models
- Clustering (Text Data): PLSA
- Prediction (Vector Data): Linear Regression; GLM*
- Frequent Pattern Mining (Set Data): Apriori; FP-growth
- Frequent Pattern Mining (Sequence Data): GSP; PrefixSpan
- Similarity Search (Sequence Data): DTW

Evaluation and Other Practical Issues
- Model Evaluation and Selection
- Other Issues
- Summary

Model Evaluation and Selection
- Evaluation metrics: How can we measure accuracy? What other metrics should we consider?
- Use a validation (test) set of class-labeled tuples, instead of the training set, when assessing accuracy
- Methods for estimating a classifier's accuracy:
  - Holdout method, random subsampling
  - Cross-validation

Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods
- Holdout method
  - The given data is randomly partitioned into two independent sets
    - Training set (e.g., 2/3) for model construction
    - Test set (e.g., 1/3) for accuracy estimation
  - Random subsampling: a variation of holdout; repeat holdout k times, accuracy = average of the accuracies obtained
- Cross-validation (k-fold, where k = 10 is most popular)
  - Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
  - At the i-th iteration, use D_i as the test set and the remaining folds as the training set
  - Leave-one-out: k folds where k = # of tuples, for small-sized data
  - Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the whole data
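A minimal sketch of the holdout and stratified k-fold procedures described above, assuming scikit-learn and an illustrative synthetic dataset (the data and classifier choices are placeholders, not part of the slides):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Illustrative data; in practice X, y come from the task at hand
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Holdout: 2/3 training set for model construction, 1/3 test set for accuracy estimation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Holdout accuracy:", clf.score(X_test, y_test))

# Stratified 10-fold cross-validation: each fold keeps roughly the overall class distribution
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("10-fold CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```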

Classifier Evaluation Metrics: Confusion Matrix

Confusion matrix (rows: actual class, columns: predicted class):

  Actual \ Predicted | C1                   | ¬C1
  C1                 | True Positives (TP)  | False Negatives (FN)
  ¬C1                | False Positives (FP) | True Negatives (TN)

Example confusion matrix:

  Actual \ Predicted | buy_computer = yes | buy_computer = no | Total
  buy_computer = yes | 6954               | 46                | 7000
  buy_computer = no  | 412                | 2588              | 3000
  Total              | 7366               | 2634              | 10000

- Given m classes, an entry CM(i, j) in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j
- The matrix may have extra rows/columns to provide totals
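A small sketch of building a confusion matrix from label vectors, assuming scikit-learn; the toy labels are purely illustrative:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy actual and predicted labels (1 = positive class C1, 0 = negative class)
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# With labels=[1, 0], rows/columns follow the slide layout:
# [[TP FN]
#  [FP TN]]
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn, fp, tn = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]
print(cm)
print("TP, FN, FP, TN =", tp, fn, fp, tn)
```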

Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity

Confusion matrix with totals (A = actual class, P = predicted class):

  A \ P | C   | ¬C  | Total
  C     | TP  | FN  | P
  ¬C    | FP  | TN  | N
  Total | P'  | N'  | All

- Classifier accuracy, or recognition rate: percentage of test set tuples that are correctly classified
  Accuracy = (TP + TN) / All
- Error rate: 1 − accuracy, i.e., Error rate = (FP + FN) / All
- Class imbalance problem:
  - One class may be rare, e.g., fraud or HIV-positive
  - There is a significant majority of the negative class and a minority of the positive class
  - Sensitivity: true positive recognition rate, Sensitivity = TP / P
  - Specificity: true negative recognition rate, Specificity = TN / N
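A quick numeric check of these definitions, plugging in the buy_computer confusion-matrix counts from the previous slide:

```python
# Counts from the buy_computer confusion matrix above
TP, FN, FP, TN = 6954, 46, 412, 2588
P, N = TP + FN, FP + TN        # actual positives and negatives
All = P + N

accuracy = (TP + TN) / All     # 0.9542
error_rate = (FP + FN) / All   # 0.0458
sensitivity = TP / P           # ~0.9934 (true positive recognition rate)
specificity = TN / N           # ~0.8627 (true negative recognition rate)
print(accuracy, error_rate, sensitivity, specificity)
```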

Classifier Evaluation Metrics: Precision and Recall, and F-measures
- Precision (exactness): what % of the tuples that the classifier labeled as positive are actually positive?
  Precision = TP / (TP + FP)
- Recall (completeness): what % of the positive tuples did the classifier label as positive?
  Recall = TP / (TP + FN)
- A perfect score is 1.0
- There is an inverse relationship between precision and recall
- F measure (F_1 or F-score): harmonic mean of precision and recall
  F_1 = 2 × Precision × Recall / (Precision + Recall)
- F_β: weighted measure of precision and recall; assigns β times as much weight to recall as to precision
  F_β = (1 + β²) × Precision × Recall / (β² × Precision + Recall)

Classifier Evaluation Metrics: Example

  Actual \ Predicted | cancer = yes | cancer = no | Total | Recognition (%)
  cancer = yes       | 90           | 210         | 300   | 30.00 (sensitivity)
  cancer = no        | 140          | 9560        | 9700  | 98.56 (specificity)
  Total              | 230          | 9770        | 10000 | 96.50 (accuracy)

- Precision = 90/230 = 39.13%
- Recall = 90/300 = 30.00%
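Verifying the example with the definitions above (the F_1 value is not on the slide but follows directly from the formulas):

```python
# Counts from the cancer example table
TP, FN, FP = 90, 210, 140

precision = TP / (TP + FP)                                  # 90/230 = 0.3913
recall = TP / (TP + FN)                                     # 90/300 = 0.3000
f1 = 2 * precision * recall / (precision + recall)          # ~0.3396
print(precision, recall, f1)
```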

Classifier Evaluation Metrics: ROC Curves
- ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
- Originated from signal detection theory
- Show the trade-off between the true positive rate and the false positive rate
- The area under the ROC curve (AUC) is a measure of the accuracy of the model
- Rank the test tuples in decreasing order: the one most likely to belong to the positive class appears at the top of the list
- The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate; the plot also shows a diagonal line
- A model with perfect accuracy has an area of 1.0; the closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model

Plotting an ROC Curve
- True positive rate: TPR = TP/P (sensitivity)
- False positive rate: FPR = FP/N (1 − specificity)
- Rank tuples according to how likely they are to be positive
- Idea: as we include more tuples as positive (moving down the ranked list), we are more likely to make mistakes; that is the trade-off
- Nice property: no threshold (cut-off) needs to be specified, only the rank matters
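A minimal sketch of tracing the (FPR, TPR) points from ranked scores, assuming scikit-learn; the scores and labels are illustrative placeholders:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Illustrative scores: higher score = more likely to be positive
y_true = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.95, 0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.1])

# Each threshold corresponds to "label the top-k ranked tuples as positive"
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(np.column_stack([fpr, tpr]))                 # points of the ROC curve
print("AUC =", roc_auc_score(y_true, y_score))     # 1.0 = perfect, 0.5 = random
```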

Example (figure not reproduced in the transcription)

Evaluation and Other Practical Issues
- Model Evaluation and Selection
- Other Issues
- Summary

Multiclass Classification
- Multiclass classification
  - Classification involving more than two classes (i.e., > 2 classes)
  - Each data point can belong to only one class
- Multilabel classification
  - Classification involving more than two classes (i.e., > 2 classes)
  - Each data point can belong to multiple classes
  - Can be considered as a set of binary classification problems, one per label (see the sketch below)
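A small illustration of the difference in label representation, using scikit-learn's MultiLabelBinarizer (the tooling and label names are assumptions, not part of the slides); each column of the indicator matrix defines one binary classification problem:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Multiclass: each data point has exactly one label
y_multiclass = ["cat", "dog", "bird", "dog"]
print(y_multiclass)

# Multilabel: each data point may carry several labels (or none)
y_multilabel = [{"cat"}, {"cat", "dog"}, set(), {"bird", "dog"}]
Y = MultiLabelBinarizer(classes=["bird", "cat", "dog"]).fit_transform(y_multilabel)
print(Y)   # one binary column per label: [[0 1 0], [0 1 1], [0 0 0], [1 0 1]]
```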

Solutions
- Method 1. One-vs.-all (OVA): learn one classifier at a time
  - Given m classes, train m classifiers: one for each class
  - Classifier j: treat tuples in class j as positive and all others as negative
  - To classify a tuple X, choose the classifier with the maximum value
- Method 2. All-vs.-all (AVA): learn a classifier for each pair of classes
  - Given m classes, construct m(m−1)/2 binary classifiers
  - Each classifier is trained using only the tuples of its two classes
  - To classify a tuple X, each classifier votes; X is assigned to the class with the maximal vote
- Comparison
  - All-vs.-all tends to be superior to one-vs.-all
  - Problem: a binary classifier is sensitive to errors, and errors affect the vote count
- A sketch of both schemes is given below
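A minimal sketch of both schemes, assuming scikit-learn's wrappers and the iris dataset as an illustrative 3-class task (none of these choices come from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)   # 3 classes

# One-vs.-all: m binary classifiers; predict the class whose classifier scores highest
ova = OneVsRestClassifier(LogisticRegression(max_iter=1000))
# All-vs.-all (one-vs.-one): m(m-1)/2 binary classifiers; predict by majority vote
ava = OneVsOneClassifier(LogisticRegression(max_iter=1000))

print("OVA accuracy:", cross_val_score(ova, X, y, cv=5).mean())
print("AVA accuracy:", cross_val_score(ava, X, y, cv=5).mean())
```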

Illustration of One-vs-All
- Train one scoring function per class, e.g., f_1(x), f_2(x), f_3(x)
- Classify x according to: f(x) = argmax_i f_i(x)

Illustration of All-vs-All
- Classify x according to majority voting among the pairwise classifiers

Extending to Multiclass Classification Directly
- Very straightforward for:
  - Logistic Regression
  - Decision Tree
  - Neural Network
  - KNN

Classification of Class-Imbalanced Data Sets
- Class-imbalance problem: rare positive examples but numerous negative ones, e.g., medical diagnosis, fraud, oil spills, faults, etc.
- Traditional methods assume a balanced distribution of classes and equal error costs: not suitable for class-imbalanced data
- (Figure: a balanced dataset vs. an imbalanced dataset.) How about predicting every data point as the blue (majority) class?
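To see why always predicting the majority class is tempting but useless, here is a tiny illustrative check (the 2% positive rate is an assumption for illustration):

```python
import numpy as np

# Illustrative imbalanced labels: ~2% positive (e.g., fraud), ~98% negative
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.02).astype(int)

# A "classifier" that always predicts the majority (negative) class
y_pred = np.zeros_like(y)

accuracy = (y == y_pred).mean()      # ~0.98, looks impressive
recall = y_pred[y == 1].mean()       # 0.0, it misses every positive example
print(accuracy, recall)
```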

Solutions
- Pick the right evaluation metric, e.g., ROC/AUC is better than accuracy
- Typical methods for imbalanced data in 2-class classification (applied to the training data):
  - Oversampling: re-sample data from the positive class
  - Under-sampling: randomly eliminate tuples from the negative class
  - Synthesize new data points for the minority class
- The class imbalance problem is still difficult for multiclass tasks
- Further reading: https://svds.com/learning-imbalanced-classes/ (a sketch of over- and under-sampling follows below)
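A minimal NumPy sketch of random oversampling and under-sampling; the helper names and the synthetic data are illustrative assumptions, not part of the slides:

```python
import numpy as np

def random_oversample(X, y, rng):
    """Duplicate minority-class rows (with replacement) until both classes are equal in size."""
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

def random_undersample(X, y, rng):
    """Randomly drop majority-class rows until both classes are equal in size."""
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    keep = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, keep])
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.1).astype(int)   # ~10% positive
X_over, y_over = random_oversample(X, y, rng)
X_under, y_under = random_undersample(X, y, rng)
print(np.bincount(y), np.bincount(y_over), np.bincount(y_under))
```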

Illustration of Oversampling and Undersampling (figure not reproduced in the transcription)

Illustration of Synthesizing New Data Points
- SMOTE: Synthetic Minority Over-sampling Technique (Chawla et al.)
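A SMOTE-style sketch of synthesizing minority points by interpolating between a minority sample and one of its k nearest minority neighbors. This is a simplified illustration under assumed data, not the reference implementation; the imbalanced-learn package provides a full SMOTE implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=5, rng=None):
    """Each synthetic point lies on the segment between a minority sample
    and one of its k nearest minority-class neighbors."""
    rng = rng or np.random.default_rng(0)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neigh = nn.kneighbors(X_min)               # column 0 is the point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))              # pick a minority sample
        j = neigh[i][rng.integers(1, k + 1)]      # pick one of its k neighbors
        lam = rng.random()                        # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 2))                  # illustrative minority-class points
X_new = smote_like(X_min, n_new=40, rng=rng)
print(X_new.shape)                                # (40, 2)
```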

Evaluation and Other Practical Issues
- Model Evaluation and Selection
- Other Issues
- Summary

Summary
- Model evaluation and selection
  - Evaluation metrics and cross-validation
- Other issues
  - Multi-class classification
  - Imbalanced classes