CS4491/CS 7265 BIG DATA ANALYTICS

EVALUATION
Dr. Mingon Kang, Computer Science, Kennesaw State University
* Some contents are adapted from Dr. Hung Huang and Dr. Chengkai Li at UT Arlington

Evaluation for Classification

Evaluation Metrics
Confusion Matrix: shows the performance of an algorithm, especially its predictive capability, rather than how fast it classifies, how fast it builds models, or how well it scales.

                             Predicted Class
                             Class = Yes        Class = No
Actual Class   Class = Yes   True Positive      False Negative
               Class = No    False Positive     True Negative
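As a concrete illustration, here is a minimal sketch of computing these four counts from predictions (Python is assumed; the slides do not name a language, and confusion_counts is an illustrative helper, not a library function):

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels, where 1 is the positive class."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, fp, tn, fn

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
print(confusion_counts(y_true, y_pred))   # (3, 1, 3, 1)
```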

Evaluation Metrics

Type I and II error
A Type I error is a false positive (predicting positive when the actual class is negative); a Type II error is a false negative (predicting negative when the actual class is positive).

Evaluation Metrics
Sensitivity or True Positive Rate (TPR) = TP / (TP + FN)
Specificity or True Negative Rate (TNR) = TN / (FP + TN)
Precision or Positive Predictive Value (PPV) = TP / (TP + FP)
Negative Predictive Value (NPV) = TN / (TN + FN)
Accuracy = (TP + TN) / (TP + FP + TN + FN)
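A minimal sketch that computes these metrics from raw confusion-matrix counts (classification_metrics is an illustrative helper, not part of any library):

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the slide's metrics from confusion-matrix counts."""
    return {
        "sensitivity (TPR)": tp / (tp + fn),
        "specificity (TNR)": tn / (fp + tn),
        "precision (PPV)": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical counts: 40 TP, 10 FP, 45 TN, 5 FN
print(classification_metrics(tp=40, fp=10, tn=45, fn=5))
```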

Limitation of Accuracy
Consider a binary classification problem:
  Number of Class 0 examples = 9990
  Number of Class 1 examples = 10
If the model predicts everything as Class 0, accuracy is 9990/10000 = 99.9%.
Alternatives: Precision, Recall, and Weighted Accuracy:
  Weighted Accuracy = (w_TP*TP + w_TN*TN) / (w_TP*TP + w_FP*FP + w_TN*TN + w_FN*FN)
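A short sketch of weighted accuracy on the 9990-vs-10 example above; the weights are hypothetical, chosen only to show how penalizing false negatives changes the score:

```python
def weighted_accuracy(tp, fp, tn, fn, w_tp=1.0, w_fp=1.0, w_tn=1.0, w_fn=1.0):
    """Weighted accuracy; plain accuracy is the special case where all weights are 1."""
    return (w_tp * tp + w_tn * tn) / (
        w_tp * tp + w_fp * fp + w_tn * tn + w_fn * fn)

# Predicting everything as Class 0 on the 9990-vs-10 problem:
# TP = 0, FP = 0, TN = 9990, FN = 10 (all Class 1 examples are missed)
print(weighted_accuracy(tp=0, fp=0, tn=9990, fn=10))               # 0.999
print(weighted_accuracy(tp=0, fp=0, tn=9990, fn=10, w_fn=1000.0))  # ~0.50
```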

ROC curve
Receiver Operating Characteristic: a graphical approach for displaying the tradeoff between the true positive rate (TPR) and the false positive rate (FPR) of a classifier.
TPR = positives correctly classified / total positives
FPR = negatives incorrectly classified / total negatives
TPR is plotted on the y-axis and FPR on the x-axis.

ROC curve (figures)

ROC curve
Points of interest, written as (TPR, FPR):
  (0, 0): everything is classified negative
  (1, 1): everything is classified positive
  (1, 0): perfect (ideal) classifier
Diagonal line: random guessing (50%)
Area Under Curve (AUC): measures how good the model is on average; useful for comparing against other methods.
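A minimal sketch of computing ROC points and AUC, assuming scikit-learn is available (the slides do not mandate a library); the labels and scores are hypothetical:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and classifier scores
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # FPR on x-axis, TPR on y-axis
print("AUC =", roc_auc_score(y_true, y_score))
```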

Evaluation
- Metrics for performance evaluation: How to evaluate the performance of a model?
- Performance estimation: How to obtain reliable estimates?
- Model selection: How to compare the relative performance of competing models?

Motivation
We often have only a finite set of data. If the entire dataset is used to fit the best model, the model normally overfits the training data, often giving almost 100% correct classification on the training data. It is better to split the data into disjoint subsets. Note that the test data must not be used in any way to create the classifier; using it would be cheating!

Methods of Validation
- Holdout: use 2/3 for training and 1/3 for testing
- Cross-validation:
  - Random subsampling
  - K-fold cross-validation
  - Leave-one-out
  - Stratified cross-validation (stratified 10-fold cross-validation is often the best)
- Bootstrapping: sampling with replacement
- Oversampling vs. undersampling

Holdout
Split the dataset into two groups:
- Training dataset: used to train the model
- Test dataset: used to estimate the error rate of the model
(Figure: the entire dataset is split into a training set and a test set.)
Drawback: when an unfortunate split happens, the holdout estimate of the error rate will be misleading.
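A minimal holdout sketch using scikit-learn's train_test_split (an assumed convenience; a manual random split works equally well):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(30).reshape(15, 2)   # 15 hypothetical samples, 2 features
y = np.array([0, 1] * 7 + [0])     # hypothetical binary labels

# 2/3 for training, 1/3 for testing, as on the slide
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42)
print(len(X_train), "training samples /", len(X_test), "test samples")
```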

Random Subsampling
Split the dataset into two groups by randomly selecting a number of samples without replacement. Usually, one third is used for testing and the rest for training.

K-Fold Cross-validation
Partition the data into K equal-sized subgroups. Use K-1 groups for training and the remaining one for testing, so that each fold serves as the test set exactly once.
(Figure: five experiments, each holding out a different fold as the test set.)

K-fold cross-validation
Suppose that E_i is the performance in the i-th experiment. The average error rate is
E = (1/K) Σ_{i=1}^{K} E_i
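A sketch of K-fold cross-validation computing the average error E above, assuming scikit-learn's KFold and an arbitrary classifier (LogisticRegression is just a stand-in):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X = np.random.RandomState(0).randn(100, 4)   # hypothetical data
y = (X[:, 0] + X[:, 1] > 0).astype(int)

K = 5
errors = []
for train_idx, test_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    errors.append(1 - model.score(X[test_idx], y[test_idx]))   # E_i: error on fold i

print("average error E =", np.mean(errors))   # E = (1/K) * sum of E_i
```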

Leave-one-out cross-validation
Use N-1 samples for training and the remaining sample for testing (i.e., there is only one sample for testing). The average error rate is
E = (1/N) Σ_{i=1}^{N} E_i
where N is the total number of samples.
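Leave-one-out is the special case K = N; a one-line sketch with scikit-learn's LeaveOneOut (assumed available):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(10).reshape(5, 2)   # 5 hypothetical samples
# Each split holds out exactly one sample, so there are N = 5 experiments
print(sum(1 for _ in LeaveOneOut().split(X)))   # 5
```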

How many folds?
With a large number of folds:
- Bias of the estimator with respect to the true value will be small (the estimator will be accurate)
- Computationally expensive
With a small number of folds:
- Computation time for the experiments is cheap
- Variance of the estimator will be small
- Bias will be large
5- or 10-fold CV is a common choice for K-fold CV.

Stratified cross-validation
When randomly selecting training or test sets, ensure that class proportions are maintained in each selected set.
1. Stratify instances by class
2. Randomly select instances from each class proportionally
Ref: http://pages.cs.wisc.edu/~dpage/cs760/evaluating.pdf
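A sketch of the two steps above with scikit-learn's StratifiedKFold (one common implementation; assumed, not prescribed by the slides):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)   # imbalanced hypothetical labels (75% / 25%)

# Each test fold preserves the 75%/25% class proportions
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    print("test fold class counts:", np.bincount(y[test_idx]))
```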

Bootstrapping: Oversampling
Amplify the minority-class samples (by sampling with replacement) so that the classes are equally distributed.

Bootstrapping: Undersampling
Keep fewer samples from the majority class so that the classes are equally distributed.
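A sketch of both balancing strategies with plain NumPy resampling; the index arrays are hypothetical stand-ins for the 9990-vs-10 example:

```python
import numpy as np

rng = np.random.default_rng(0)
majority = np.arange(9990)           # indices of Class 0 samples
minority = np.arange(9990, 10000)    # indices of Class 1 samples

# Oversampling: draw minority indices WITH replacement up to the majority size
oversampled = rng.choice(minority, size=len(majority), replace=True)

# Undersampling: draw majority indices WITHOUT replacement down to the minority size
undersampled = rng.choice(majority, size=len(minority), replace=False)

print(len(oversampled), len(undersampled))   # 9990 and 10: both classes balanced
```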

Thinking about experimental results