CS249: ADVANCED DATA MINING


CS249: ADVANCED DATA MINING. Classification Evaluation and Practical Issues. Instructor: Yizhou Sun, yzsun@cs.ucla.edu. April 24, 2017

Announcements: Homework 2 is out, due May 3rd (11:59pm). Course project proposal due May 8th (11:59pm).

Learnt Prediction and Classification Methods
Vector Data: Classification: Decision Tree; Naïve Bayes; Logistic Regression; SVM; NN. Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means. Prediction: Linear Regression; GLM.
Text Data: Clustering: PLSA; LDA. Feature Representation: Word embedding.
Recommender System: Clustering: Matrix Factorization. Prediction: Collaborative Filtering.
Graph & Network: Classification: Label Propagation. Clustering: SCAN; Spectral Clustering. Ranking: PageRank. Feature Representation: Network embedding.

Evaluation and Other Practical Issues (outline): Model Evaluation and Selection; Bias-Variance Trade-off; Other Issues; Summary.

Model Evaluation and Selection
Evaluation metrics: How can we measure accuracy? What other metrics should we consider?
Use a validation/test set of class-labeled tuples, rather than the training set, when assessing accuracy.
Methods for estimating a classifier's accuracy: holdout method and random subsampling; cross-validation.

Classifier Evaluation Metrics: Confusion Matrix
Confusion matrix (actual class \ predicted class) for a two-class problem: actual C1 predicted as C1: True Positives (TP); actual C1 predicted as not C1: False Negatives (FN); actual not C1 predicted as C1: False Positives (FP); actual not C1 predicted as not C1: True Negatives (TN).
Example confusion matrix (actual class \ predicted class):
buy_computer = yes: predicted yes 6954, predicted no 46, total 7000.
buy_computer = no: predicted yes 412, predicted no 2588, total 3000.
Total: predicted yes 7366, predicted no 2634, total 10000.
Given m classes, an entry CM_{i,j} in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j. Extra rows and columns may be added to provide totals.
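A minimal sketch (plain Python/NumPy, not the lecture's own code) of how such a matrix can be tallied from true and predicted labels:

import numpy as np

def confusion_matrix(y_true, y_pred, labels):
    # cm[i, j] = number of tuples of true class labels[i] predicted as labels[j]
    index = {c: i for i, c in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1
    return cm

# e.g. confusion_matrix(y_true, y_pred, labels=["yes", "no"]) for the buy_computer data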

Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity
Confusion matrix layout (A\P): actual C: TP, FN, row total P; actual not C: FP, TN, row total N; grand total All = P + N.
Classifier accuracy, or recognition rate: percentage of test set tuples that are correctly classified. Accuracy = (TP + TN)/All.
Error rate: 1 - accuracy, i.e., Error rate = (FP + FN)/All.
Class imbalance problem: one class may be rare, e.g., fraud or HIV-positive; the negative class forms a significant majority and the positive class a small minority.
Sensitivity: true positive recognition rate, Sensitivity = TP/P.
Specificity: true negative recognition rate, Specificity = TN/N.
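A small sketch of these metrics computed from the four confusion-matrix counts (illustrative variable names, not the lecture's code):

def basic_metrics(TP, FN, FP, TN):
    P, N = TP + FN, FP + TN          # actual positives and actual negatives
    All = P + N
    accuracy = (TP + TN) / All       # recognition rate
    error_rate = (FP + FN) / All     # equals 1 - accuracy
    sensitivity = TP / P             # true positive recognition rate
    specificity = TN / N             # true negative recognition rate
    return accuracy, error_rate, sensitivity, specificity

# Example with the buy_computer matrix: basic_metrics(6954, 46, 412, 2588)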

Classifier Evaluation Metrics: Precision, Recall, and F-measures
Precision (exactness): what percentage of tuples that the classifier labeled as positive are actually positive? Precision = TP/(TP + FP).
Recall (completeness): what percentage of positive tuples did the classifier label as positive? Recall = TP/(TP + FN). A perfect score is 1.0.
There is an inverse relationship between precision and recall.
F measure (F_1, or F-score): the harmonic mean of precision and recall, F_1 = 2 x Precision x Recall / (Precision + Recall).
F_β: a weighted measure of precision and recall that assigns β times as much weight to recall as to precision, F_β = (1 + β^2) x Precision x Recall / (β^2 x Precision + Recall).
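The same quantities in a short sketch (illustrative code; it assumes TP + FP and TP + FN are nonzero):

def precision_recall_f(TP, FP, FN, beta=1.0):
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    # F_beta weights recall relative to precision (beta > 1 favors recall);
    # beta = 1 gives the harmonic mean, i.e., F_1.
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f_beta

# For the cancer example below: precision_recall_f(90, 140, 210)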

Classifier Evaluation Metrics: Example
Actual class \ predicted class for the cancer data:
cancer = yes: predicted yes 90, predicted no 210, total 300, recognition 30.00% (sensitivity).
cancer = no: predicted yes 140, predicted no 9560, total 9700, recognition 98.56% (specificity).
Total: predicted yes 230, predicted no 9770, total 10000, recognition 96.50% (accuracy).
Precision = 90/230 = 39.13%; Recall = 90/300 = 30.00%.

Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods
Holdout method: the given data is randomly partitioned into two independent sets, a training set (e.g., 2/3) for model construction and a test set (e.g., 1/3) for accuracy estimation.
Random subsampling: a variation of holdout; repeat holdout k times, accuracy = average of the accuracies obtained.
Cross-validation (k-fold, where k = 10 is most popular): randomly partition the data into k mutually exclusive subsets of approximately equal size; at the i-th iteration, use D_i as the test set and the remaining folds as the training set.
Leave-one-out: k folds where k = number of tuples; used for small data sets.
Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as in the initial data.
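A minimal k-fold cross-validation sketch, assuming X and y are NumPy arrays and model_factory() returns a fresh estimator with fit/predict (e.g., a scikit-learn classifier); stratification is omitted for brevity and the names are illustrative:

import numpy as np

def k_fold_accuracy(model_factory, X, y, k=10, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))        # random partition of the data
    folds = np.array_split(idx, k)       # k mutually exclusive, roughly equal-size subsets
    accuracies = []
    for i in range(k):
        test = folds[i]                  # fold i serves as the test set
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = model_factory()
        model.fit(X[train], y[train])
        accuracies.append(np.mean(model.predict(X[test]) == y[test]))
    return float(np.mean(accuracies))    # average accuracy over the k train/test pairs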

Model Selection: ROC Curves
ROC (Receiver Operating Characteristic) curves allow visual comparison of classification models; they originated in signal detection theory.
An ROC curve shows the trade-off between the true positive rate and the false positive rate: the vertical axis represents the true positive rate, the horizontal axis the false positive rate, and the plot also shows a diagonal line.
To draw the curve, rank the test tuples in decreasing order of predicted score: the tuple most likely to belong to the positive class appears at the top of the list.
The area under the ROC curve is a measure of the accuracy of the model: a model with perfect accuracy has an area of 1.0, while the closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model.

Plotting an ROC Curve
True positive rate: TPR = TP/P (sensitivity). False positive rate: FPR = FP/N (1 - specificity).
Rank the tuples according to how likely they are to be positive. The idea: as we include more tuples from the top of the ranking, we are more likely to make mistakes; that is the trade-off.
Nice property: no threshold (cut-off) needs to be specified, only the ranking matters.
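A sketch of this ranking-based construction, assuming NumPy arrays of scores and 0/1 labels with at least one tuple of each class; ties in scores are ignored for simplicity, and the code is illustrative rather than the lecture's:

import numpy as np

def roc_points(scores, y):                    # y in {0, 1}, with 1 = positive
    order = np.argsort(-np.asarray(scores))   # most likely positive tuple first
    y = np.asarray(y)[order]
    P, N = y.sum(), len(y) - y.sum()
    tpr, fpr = [0.0], [0.0]
    tp = fp = 0
    for label in y:                           # sweep the cut-off down the ranking
        if label == 1:
            tp += 1
        else:
            fp += 1
        tpr.append(tp / P)
        fpr.append(fp / N)
    return fpr, tpr                           # plot fpr (x-axis) against tpr (y-axis)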

Example

Similar for prediction tasks.

Cross-Validation
Partition the data into K folds; use K - 1 folds for training and 1 fold for testing, and report the average accuracy over the K training-testing pairs.
Accuracy must be measured on the validation/test dataset, not on the training data!
For regression, mean squared error can again be used: Σ_i (x_i^T β̂ - y_i)^2 / n.

AIC & BIC
AIC and BIC can be used to assess the quality of statistical models.
AIC (Akaike information criterion): AIC = 2k - 2 ln(L̂), where k is the number of parameters in the model and L̂ is the likelihood under the estimated parameters.
BIC (Bayesian information criterion): BIC = k ln(n) - 2 ln(L̂), where n is the number of objects.
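A small sketch of the two criteria, assuming the maximized log-likelihood of each candidate model is already available (illustrative helper names):

import numpy as np

def aic(log_likelihood, k):
    # k = number of fitted parameters in the model
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    # n = number of objects (training tuples)
    return k * np.log(n) - 2 * log_likelihood

# Lower AIC/BIC indicates a better trade-off between fit and model complexity.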

Stepwise Feature Selection
Avoids brute-force search over all 2^p feature subsets.
Forward selection: start with the best single feature; repeatedly add the feature that improves performance the most; stop when no feature further improves performance.
Backward elimination: start with the full model; repeatedly remove the feature whose removal improves performance the most; stop when removing any feature would worsen performance.
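A forward-selection sketch: greedily add the feature that most improves a validation score until nothing helps. The names (score_fn, candidate_features) are illustrative; score_fn is assumed to evaluate a model trained on the given feature subset:

def forward_selection(candidate_features, score_fn):
    selected, best_score = [], float("-inf")
    while True:
        gains = {f: score_fn(selected + [f])
                 for f in candidate_features if f not in selected}
        if not gains:
            break
        f_best = max(gains, key=gains.get)
        if gains[f_best] <= best_score:   # stop if no feature further improves performance
            break
        selected.append(f_best)
        best_score = gains[f_best]
    return selected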

Evaluation and Other Practical Issues (outline): Model Evaluation and Selection; Bias-Variance Trade-off; Other Issues; Summary.

Model Selection Problem
Basic problem: how to choose between competing linear regression models.
Model too simple: underfits the data; poor predictions; high bias, low variance.
Model too complex: overfits the data; poor predictions; low bias, high variance.
Model just right: balances bias and variance to obtain good predictions.

Bias and Variance
True predictor f(x) = x^T β; estimated predictor f̂(x) = x^T β̂.
Bias: E[f̂(x)] - f(x). How far is the expectation of the estimator from the true value? The smaller, the better.
Variance: Var(f̂(x)) = E[(f̂(x) - E[f̂(x)])^2]. How variable is the estimator? The smaller, the better.
Reconsider the mean squared error J(β̂)/n = (1/n) Σ_i (x_i^T β̂ - y_i)^2.
It can be viewed as E[(f̂(x) - f(x) - ε)^2] = bias^2 + variance + noise, noting that E[ε] = 0 and Var(ε) = σ^2.
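Spelling the decomposition out (a standard derivation, assuming y = f(x) + ε with ε independent of the fitted model, E[ε] = 0 and Var(ε) = σ^2; written in LaTeX notation):

\begin{aligned}
E\big[(\hat f(x) - y)^2\big]
  &= E\big[(\hat f(x) - f(x) - \varepsilon)^2\big] \\
  &= E\big[(\hat f(x) - f(x))^2\big] + \sigma^2
     \quad \text{(the cross term vanishes because } E[\varepsilon] = 0\text{)} \\
  &= \big(E[\hat f(x)] - f(x)\big)^2 + E\big[(\hat f(x) - E[\hat f(x)])^2\big] + \sigma^2 \\
  &= \text{bias}^2 + \text{variance} + \text{noise}.
\end{aligned}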

Bias-Variance Trade-off

Example: degree d in regression. http://www.holehouse.org/mlclass/10_advice_for_applying_machine_learning.html

Example: regularization term in regression.

Evaluation and Other Practical Issues (outline): Model Evaluation and Selection; Bias-Variance Trade-off; Other Issues; Summary.

Classification of Class-Imbalanced Data Sets
Class-imbalance problem: rare positive examples but numerous negative ones, e.g., medical diagnosis, fraud, oil-spill, or fault detection.
Traditional methods assume a balanced class distribution and equal error costs, so they are not suitable for class-imbalanced data.
Typical methods for imbalanced data in two-class classification:
Oversampling: re-sample data from the positive class.
Under-sampling: randomly eliminate tuples from the negative class.
Threshold-moving: move the decision threshold t so that rare-class tuples are easier to classify, leaving less chance of costly false negative errors.
Ensemble techniques: combine multiple classifiers, as introduced above.
The class-imbalance problem remains difficult for multiclass tasks.
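A minimal sketch of the first two re-sampling strategies (random over-sampling and under-sampling); NumPy only, with illustrative function and parameter names. Under-sampling assumes the negative class is the majority:

import numpy as np

def rebalance(X, y, positive_label=1, mode="oversample", seed=0):
    rng = np.random.default_rng(seed)
    pos = np.where(y == positive_label)[0]
    neg = np.where(y != positive_label)[0]
    if mode == "oversample":       # re-sample the rare positive class with replacement
        pos = rng.choice(pos, size=len(neg), replace=True)
    else:                          # "undersample": randomly eliminate negative tuples
        neg = rng.choice(neg, size=len(pos), replace=False)
    idx = rng.permutation(np.concatenate([pos, neg]))
    return X[idx], y[idx]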

Multiclass Classification
Classification involving more than two classes (i.e., > 2 classes).
Method 1, one-vs.-all (OVA): learn one classifier at a time. Given m classes, train m classifiers, one per class; classifier j treats tuples in class j as positive and all others as negative. To classify a tuple X, the set of classifiers votes as an ensemble.
Method 2, all-vs.-all (AVA): learn a classifier for each pair of classes. Given m classes, construct m(m-1)/2 binary classifiers, each trained using the tuples of its two classes. To classify a tuple X, each classifier votes, and X is assigned to the class with the maximal vote.
Comparison: all-vs.-all tends to be superior to one-vs.-all. Problem: binary classifiers are sensitive to errors, and errors affect the vote count.
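A minimal one-vs.-all sketch, assuming scikit-learn and NumPy are available; the class and its attribute names are illustrative, not the lecture's code, and a logistic regression base learner is used only as an example:

import numpy as np
from sklearn.linear_model import LogisticRegression

class OneVsAll:
    def __init__(self, make_base=lambda: LogisticRegression(max_iter=1000)):
        self.make_base = make_base

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = []
        for c in self.classes_:
            # treat tuples in class c as positive and all others as negative
            clf = self.make_base()
            clf.fit(X, (y == c).astype(int))
            self.models_.append(clf)
        return self

    def predict(self, X):
        # each binary classifier scores its positive class; assign the highest-scoring class
        scores = np.column_stack([m.predict_proba(X)[:, 1] for m in self.models_])
        return self.classes_[np.argmax(scores, axis=1)]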

Evaluation and Other Practical Issues (outline): Model Evaluation and Selection; Bias-Variance Trade-off; Other Issues; Summary.

Summary
Model evaluation and selection: evaluation metrics and cross-validation.
Bias-Variance trade-off: Error = bias^2 + variance + noise.
Other issues: imbalanced classes; multi-class classification.