I211: Information infrastructure II

Data Mining: Classifier Evaluation

3-nearest neighbor: given labeled data, find the class labels for 4 unlabeled data points. [Slide shows a table of labeled data points with their class labels and the distances to the unlabeled points.] The closest 3 labeled data points to the first unlabeled data point are at distances 1, 1.7 and 2.1. We predict that its label is 0 based on the majority rule (0 twice, 1 once).

3-nearest neighbor used to output probabilities: given the same labeled data, find class labels for the 4 unlabeled data points. The closest 3 labeled data points to the first unlabeled data point are again those at distances 1, 1.7 and 2.1. We predict that the probability of class 0 is 2/3 and the probability of class 1 is 1/3.

K-nearest neighbor algorithm for regression: the same as for classification, except that we predict the target variable using the mean of the neighbors' targets instead of the majority rule. [Slide shows labeled data with a numeric target variable and 4 unlabeled data points whose targets are to be predicted.]

3-nearest neighbor for regression: find the target values for the 4 unlabeled data points. The closest 3 labeled data points to the first unlabeled data point are at distances 1, 1.7 and 2.1, and their targets are 13.2, 17.8 and 0.5. We predict that the target is 10.5, obtained as (13.2 + 17.8 + 0.5) / 3.

K-nearest neighbor algorithm: find the K closest data points and use the mean rule to assign the target. More formally, given a data set D = {(x_i, y_i), i = 1, ..., n}, where for example x_i is in R^k and y_i in R is the target variable, and given an integer K (preferably odd), find the target for every unlabeled data point u in R^k. Algorithm: find the K closest data points to u and assign its target to be the mean of the targets of those K nearest neighbors.
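A minimal MATLAB sketch of this procedure (saved as knn_predict.m; the function name, the variable names Xtrain, ytrain and u, and the use of Euclidean distance are illustrative assumptions, not from the slides):

% k-NN prediction for one query point u (1 x k row vector).
% Xtrain: n x k matrix of labeled points; ytrain: n x 1 class labels or targets.
function [yhat, p1] = knn_predict(Xtrain, ytrain, u, K, task)
    d = sqrt(sum((Xtrain - repmat(u, size(Xtrain, 1), 1)).^2, 2));  % Euclidean distances
    [~, idx] = sort(d);                    % neighbors ordered by distance
    nn = ytrain(idx(1:K));                 % labels/targets of the K nearest neighbors
    if strcmp(task, 'classification')
        yhat = mode(nn);                   % majority rule
        p1 = mean(nn == 1);                % estimated probability of class 1 (class 0: 1 - p1)
    else
        yhat = mean(nn);                   % regression: mean of the neighbors' targets
        p1 = NaN;
    end
end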

Classifier evaluation. Question 1: how good is our algorithm, and how will we estimate its performance? Question 2: what performance measure should we use? For classification we will use accuracy; for regression we will use the mean squared error and the R^2 measure.

4-fold cross-validation: data set D contains 20 data points and is randomly and evenly split into 4 non-overlapping partitions.

How to make a random split? Use a random number generator. For example, with a data set D of 5 data points, in MATLAB:
>> x = rand(1, 5);
>> [a b] = sort(x)
a = 0.1576 0.4854 0.8003 0.9572 0.9706
b = 1 4 5 3 2
>> p1 = b(1 : 2 : length(b))   % partition 1
p1 = 1 5 2
>> p2 = b(2 : 2 : length(b))   % partition 2
p2 = 4 3

4-fold cross-validation: for example, the 4 partitions might be:
Partition 1: data points 1, 3, 5, 15, 16
Partition 2: data points 6, 10, 11, 14, 17
Partition 3: data points 4, 9, 12, 19, 20
Partition 4: data points 2, 7, 8, 13, 18
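A sketch of how such a split can be generated for the 20-point data set, again in MATLAB (the cell array partition and the reshaping by fold size are implementation assumptions):

n = 20; nfolds = 4;
perm = randperm(n);                       % random permutation of 1..20
foldsize = n / nfolds;                    % assumes n is divisible by nfolds
for f = 1 : nfolds
    partition{f} = perm((f - 1) * foldsize + 1 : f * foldsize);   % 5 indices per fold
end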

4-fold cross-validation, steps 1-4: in step i, use partition i as the test set and the remaining three partitions as the training set. Pretend the class (target) is not known for the data points in the test set, use the training set to predict the class (target) for those data points, and count the number of misses (classification) or the error (regression). Repeat this for i = 1, 2, 3, 4, so that every partition serves as the test set exactly once.
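A minimal sketch of the full loop for classification, assuming the knn_predict function and the partition cell array from the sketches above, with X (20 x k features) and y (20 x 1 labels) as assumed variable names:

K = 3; misses = 0;
for f = 1 : nfolds
    testIdx  = partition{f};                  % current test fold
    trainIdx = setdiff(1 : n, testIdx);       % remaining three folds form the training set
    for j = testIdx
        yhat = knn_predict(X(trainIdx, :), y(trainIdx), X(j, :), K, 'classification');
        misses = misses + (yhat ~= y(j));     % count misclassified test points
    end
end
accuracy = 1 - misses / n;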

4-fold cross-validation: after all four steps, each of the 20 data points has a true class and a predicted class (for example: true class 1, predicted class 1; true class 0, predicted class 0; ...; true class 1, predicted class 0).

Confusion matrix (binary class). Rows are the predicted class and columns are the true class:

                    True class 0    True class 1
Predicted class 0       n00             n01
Predicted class 1       n10             n11

For example, n10 is the number of data points whose true class was 0 but whose predicted class was 1. Accuracy = (n00 + n11) / (n00 + n01 + n10 + n11). Error = 1 - Accuracy.

Classification accuracy and error. Accuracy = n_correct / n, where n_correct is the number of correctly classified data points and n is the total number of data points. Error = 1 - n_correct / n = 1 - Accuracy. Another naming convention: n00 = true negatives, n01 = false negatives, n10 = false positives, n11 = true positives.
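A sketch of computing these counts and the accuracy in MATLAB (ytrue and ypred are assumed to be vectors of 0/1 labels collected from the cross-validation loop):

n00 = sum(ypred == 0 & ytrue == 0);   % true negatives
n01 = sum(ypred == 0 & ytrue == 1);   % false negatives
n10 = sum(ypred == 1 & ytrue == 0);   % false positives
n11 = sum(ypred == 1 & ytrue == 1);   % true positives
accuracy = (n00 + n11) / (n00 + n01 + n10 + n11);
err = 1 - accuracy;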

More on accuracy, binary class case:
sn = n11 / (n01 + n11): sensitivity, the accuracy on data points whose class is 1; also called the true positive rate.
sp = n00 / (n10 + n00): specificity, the accuracy on data points whose class is 0; also called the true negative rate.
1 - sn = n01 / (n01 + n11): the false negative rate.
1 - sp = n10 / (n10 + n00): the false positive rate.
Accuracy_B = (sn + sp) / 2: the balanced-sample accuracy.
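Continuing the MATLAB sketch above, these rates follow directly from the four counts:

sn = n11 / (n01 + n11);              % sensitivity / true positive rate
sp = n00 / (n10 + n00);              % specificity / true negative rate
fnr = 1 - sn;                        % false negative rate
fpr = 1 - sp;                        % false positive rate
balanced_accuracy = (sn + sp) / 2;   % balanced-sample accuracy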

Trivial and random classifiers. Suppose 16 data points have class 0 (the majority class) and 4 data points have class 1 (the minority class). Trivial classifier: always predict the majority class; its accuracy is 16/20 = 80%. Random classifier: predict class 0 with probability 0.8 and class 1 with probability 0.2; its expected accuracy is 68% (0.8^2 + 0.2^2 = 0.68).

N-fold cross-validation: is there a chance that our accuracy estimate depends on the random partition we made at the beginning? Yes. One way to overcome this is to generate several random splits into folds, estimate the accuracy for each split, and finally average these estimates.

Leave-one-out performance estimation: leave-one-out is the extreme case of cross-validation in which the number of folds equals the number of data points in data set D. In step i, data point i alone is the test set and all remaining data points form the training set; there are as many steps as data points.
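In the cross-validation loop sketched earlier, leave-one-out corresponds simply to using n folds of one data point each:

nfolds = n;                  % leave-one-out: one fold per data point
for f = 1 : nfolds
    partition{f} = f;        % fold f contains only data point f
end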

In short:
1. Whatever N you use for N-fold cross-validation, every data point will be used exactly once as a test point.
2. Each data point will thus have one true value and one predicted value.
3. From these pairs of true and predicted values we compute the accuracy.
4. Terminology: we always say that we are estimating the classifier's performance. If the data set is large enough and representative, this will be a good estimate. The truth is, we can only hope that our test cases are representative and that our estimate is meaningful.

Error bars on accuracy estimates. For example, the data set contains 20 data points and the confusion matrix looks like:

                    True class 0    True class 1
Predicted class 0       10              1
Predicted class 1        4              5

Accuracy = (10 + 5) / 20 = 0.75. σ = sqrt(Accuracy * (1 - Accuracy) / n) = sqrt(0.75 * 0.25 / 20) ≈ 0.10.
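This is the usual binomial standard error; in MATLAB the computation is one line (the names acc and n are assumptions):

acc = 0.75; n = 20;
sigma = sqrt(acc * (1 - acc) / n);   % about 0.097, i.e. roughly 0.10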

How to report accuracy estimates. For example, the estimated accuracy is 0.75. Let's see what the error bar is for n = 10, 100, and 1000:
σ = sqrt(0.75 * 0.25 / 10) ≈ 0.14 for n = 10
σ = sqrt(0.75 * 0.25 / 100) ≈ 0.04 for n = 100
σ = sqrt(0.75 * 0.25 / 1000) ≈ 0.01 for n = 1000
For n = 100 we will report the accuracy as: Accuracy = 0.75 ± 0.04.

Regression. MSE = (1/n) Σ_{i=1}^{n} (p_i - t_i)^2, where p_i is the predicted target for data point i, t_i is the true target for data point i, and n is the total number of data points. R^2 = 1 - Σ_{i=1}^{n} (p_i - t_i)^2 / Σ_{i=1}^{n} (t_i - m)^2, where m is the mean value of the target variable. R^2 is the percentage of variance explained by the regression method; values close to 1 indicate a nearly perfect predictor.
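A sketch of both measures in MATLAB (p and t are assumed to be column vectors of predicted and true targets):

mse = mean((p - t).^2);                          % mean squared error
m   = mean(t);                                   % mean of the target variable
r2  = 1 - sum((p - t).^2) / sum((t - m).^2);     % R^2, variance explained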

Trivial classifiers for regression. The trivial classifier always predicts the mean of the target variable; R^2 for the trivial classifier is 0.