Big Data: Data Analysis Boot Camp. Classification with K-Nearest Neighbors and Naïve Bayes. Chuck Cartledge, PhD. 23 September 2017. 1/26

Table of contents (1 of 1) 1 Introduction 2 K-Nearest Neighbors 3 Naïve Bayes 4 Hands-on 5 Q & A 6 Conclusion 7 References 8 Files 9 Misc. 2/26

What are we going to cover? We're going to talk about: k-nearest neighbors (k-NN) as a classification technique, and the naïve Bayes classifier as a simple way to classify data. 3/26

Explanation and demonstration. How does classification work? Simply:
1 A known collection of data points has some collection of attributes, including a label.
2 The distance between data points can be computed (different distance computations will result in different neighbor sets) [2].
3 A new data point is added to the system.
4 The labels of the k nearest neighbors are used to assign a label to the new point.
(The same approach can be used for regression: instead of assigning a label, a numeric value, e.g., a weighted average of the neighbors' values, is assigned.) 4/26
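A minimal sketch of those four steps in R. This is illustrative code assuming Euclidean distance and a simple majority vote; it is not the attached chapter-10-classification-revised.r, and knn_predict is a made-up name.

```r
# Minimal k-NN sketch: Euclidean distance plus a majority vote over the
# k closest labeled points. All names here are illustrative.
knn_predict <- function(train_x, train_labels, new_x, k = 3) {
  # Step 2: distance from the new point to every known, labeled point
  dists <- sqrt(rowSums(sweep(as.matrix(train_x), 2, unlist(new_x))^2))
  # Step 4: take the labels of the k nearest neighbors and vote
  nearest <- order(dists)[seq_len(k)]
  votes <- table(train_labels[nearest])
  names(votes)[which.max(votes)]
}

# Example: classify the first iris flower using the other 149 as known points
knn_predict(iris[-1, 1:4], iris$Species[-1], iris[1, 1:4], k = 3)
```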

Explanation and demonstration. Our friend, the iris data. The attached file chapter-10-classification-revised.r expands on the code in the text. A manual k-NN implementation is compared to a built-in one. Different k values result in different classifications (e.g., main(3)). Differences between the classification and the truth are highlighted. Attached file. 5/26
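A sketch of what such a comparison might look like using knn() from the class package alongside the knn_predict() sketch above; this is not the attached script, and the hold-out of 30 flowers is an assumption for illustration.

```r
library(class)   # provides the built-in knn()

set.seed(42)
test_rows <- sample(nrow(iris), 30)            # hold out 30 flowers
train_x <- iris[-test_rows, 1:4]; train_y <- iris$Species[-test_rows]
test_x  <- iris[test_rows, 1:4];  truth   <- iris$Species[test_rows]

builtin <- knn(train_x, test_x, cl = train_y, k = 3)
manual  <- sapply(seq_len(nrow(test_x)),
                  function(i) knn_predict(train_x, train_y, test_x[i, ], k = 3))

# Highlight the rows where either classification differs from the truth
data.frame(truth, builtin, manual)[builtin != truth | manual != truth, ]
```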

Explanation and demonstration Same image. Attached file. 6/26

Explanation and demonstration. How well does k-NN do? We can vary both the training percentage and k to test the k-NN algorithm and evaluate the results. The red dot represents: 30% training, k = 5 for the k-NN processing, 10 misclassifications. Attached file. 7/26
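A sketch of one way to generate such a sweep, counting misclassifications over a grid of training percentages and k values. The grid values and the use of class::knn() are assumptions for illustration, not the attached script.

```r
library(class)

set.seed(1)
results <- expand.grid(train_pct = seq(0.2, 0.8, by = 0.1), k = c(1, 3, 5, 7, 9))
results$misclassified <- NA

for (i in seq_len(nrow(results))) {
  n_train <- round(results$train_pct[i] * nrow(iris))
  idx <- sample(nrow(iris), n_train)
  pred <- knn(iris[idx, 1:4], iris[-idx, 1:4],
              cl = iris$Species[idx], k = results$k[i])
  # Count how many held-out flowers were assigned the wrong species
  results$misclassified[i] <- sum(pred != iris$Species[-idx])
}

results[results$train_pct == 0.3 & results$k == 5, ]  # cf. the red dot on the plot
```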

Explanation and demonstration Same image. Attached file. 8/26

Explanation and demonstration. Looking at the ozone data. The help file claims that the data comes from [1], but the data itself cannot be seen there. k was varied from 1 to 30. Accuracy was computed as the percentage of assigned classifications that matched the original data. The highest accuracy was selected to determine the best k. The attached file (chapter-10-ozone-revised.r) replicates and expands on the text's approach. Attached file. 9/26
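A sketch of the best-k search described above. Since the ozone data itself is not reproduced here, iris stands in for it, and class::knn() is assumed; this is not the attached chapter-10-ozone-revised.r.

```r
library(class)

# iris is used here only as a stand-in for the ozone data.
set.seed(7)
idx     <- sample(nrow(iris), 100)
train_x <- iris[idx, 1:4];  train_y <- iris$Species[idx]
test_x  <- iris[-idx, 1:4]; test_y  <- iris$Species[-idx]

# Accuracy = percentage of assigned classifications matching the original labels
accuracy_for_k <- sapply(1:30, function(k) {
  mean(knn(train_x, test_x, cl = train_y, k = k) == test_y)
})

best_k <- which.max(accuracy_for_k)   # the k with the highest accuracy
c(best_k = best_k, accuracy = accuracy_for_k[best_k])
```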

Explanation and demonstration Same image. Attached file. 10/26

Background. What is naïve Bayes? [3] (1 of 2) Bayes' theorem is named after Rev. Thomas Bayes. It works on conditional probability. Conditional probability is the probability that something will happen, given that something else has already occurred. Using conditional probability, we can calculate the probability of an event from prior knowledge. R. Saxena [3] 11/26

Background. What is naïve Bayes? (2 of 2) Bayes' theorem is:
P(H|E) = P(E|H) × P(H) / P(E)
Where:
P(H) is the probability of hypothesis H being true. This is known as the prior probability.
P(E) is the probability of the evidence (regardless of the hypothesis).
P(E|H) is the probability of the evidence given that the hypothesis is true.
P(H|E) is the probability of the hypothesis given that the evidence is true.
12/26
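As a small worked illustration of the theorem (the numbers are invented for this example, not taken from the text): suppose a disease has a prior probability of 1%, a test detects it 90% of the time, and it falsely flags 5% of healthy people. Then:

```latex
P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}
            = \frac{0.90 \times 0.01}{0.90 \times 0.01 + 0.05 \times 0.99}
            = \frac{0.009}{0.0585} \approx 0.15
```

Even after a positive test, the hypothesis remains fairly unlikely because the prior P(H) is small; that prior plays the same role in the naïve Bayes calculations that follow.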

Background. Ultimately, naïve Bayes predicts membership of a class. Naïve Bayes is a kind of classifier which uses Bayes' theorem. It predicts membership probabilities for each class, i.e., the probability that a given record or data point belongs to a particular class. The class with the highest probability is considered the most likely class. R. Saxena [3] The implication is that you have to compute the class membership probability for all classes to identify the most likely membership. 13/26
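A minimal sketch of this idea using naiveBayes() from the e1071 package (an assumption here; the text's chapter-10-bayes.r may be written differently):

```r
library(e1071)   # provides naiveBayes()

# Fit a naive Bayes classifier on iris and look at the per-class probabilities.
model <- naiveBayes(Species ~ ., data = iris)

# type = "raw" returns the membership probability for every class;
# the predicted class is simply the one with the highest probability.
probs <- predict(model, iris[1:5, ], type = "raw")
probs
apply(probs, 1, function(p) colnames(probs)[which.max(p)])
```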

Examples. Load the chapter-10-bayes.r file. Things to notice:
A priori probabilities: the probabilities that DiseaseZ (column 10), for rows 1 through 10, is NO or YES.
Conditional probabilities: a confusion matrix for the named column and the DiseaseZ column (column 10). The table's diagonal is TN and TP.
Long-hand probabilities for person #11 (pg. 184 from the text): P(DiseaseZ == YES) = 0.00908, P(DiseaseZ == NO) = 0.00175. Person #11 is 0.00908 / 0.00175 ≈ 5 times more likely to have DiseaseZ than not to have it.
14/26

Examples. Looking at the Titanic results from the datasets package. Half of the data was used for training, half for testing. The Bayes model created from the training data was used to predict survival of the testing data. The results aren't encouraging: the Titanic predictions are 76% accurate. 15/26
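A hedged sketch of that pipeline, assuming R's built-in Titanic contingency table is expanded to one row per person and split roughly in half; the exact accuracy depends on the random split and will not necessarily match the text's 76%.

```r
library(e1071)

# Expand the 4-way Titanic contingency table to one row per person.
tab <- as.data.frame(Titanic)
passengers <- tab[rep(seq_len(nrow(tab)), tab$Freq), c("Class", "Sex", "Age", "Survived")]

set.seed(123)
train_idx <- sample(nrow(passengers), nrow(passengers) %/% 2)  # half for training
train <- passengers[train_idx, ]
test  <- passengers[-train_idx, ]

model <- naiveBayes(Survived ~ Class + Sex + Age, data = train)
pred  <- predict(model, test)

mean(pred == test$Survived)                     # overall accuracy on the held-out half
table(predicted = pred, actual = test$Survived)
```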

Examples. How well does naïve Bayes do? Wrap the text's code in a loop, varying the training percentage from low to high. A percentage < 17 is under-fitting; a percentage > 80 is over-fitting. Percentages between roughly 25 and 80 are OK; between 18 and 19 is best. It is hard to know the best value without checking them all. Attached file. 16/26
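A sketch of such a loop, again using the Titanic data and e1071's naiveBayes(); the percentages swept below are illustrative, not the text's.

```r
library(e1071)

tab <- as.data.frame(Titanic)
passengers <- tab[rep(seq_len(nrow(tab)), tab$Freq), c("Class", "Sex", "Age", "Survived")]

set.seed(321)
pcts <- seq(0.10, 0.90, by = 0.05)
accuracy_by_pct <- sapply(pcts, function(pct) {
  idx   <- sample(nrow(passengers), round(pct * nrow(passengers)))
  model <- naiveBayes(Survived ~ Class + Sex + Age, data = passengers[idx, ])
  mean(predict(model, passengers[-idx, ]) == passengers$Survived[-idx])
})

data.frame(train_pct = pcts, accuracy = accuracy_by_pct)
```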

Examples Same image. Attached file. 17/26

Some simple exercises to get familiar with naïve Bayes. HouseVotes84 is an R dataset [4] in the mlbench library, tabulating how members of the US House of Representatives voted on a series of issues. Create a naïve Bayes model that will predict party membership based on education spending, given that they voted for a synthetic-fuels cutback. Quantify how accurate your model is. 18/26
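One possible starting point for this exercise (a sketch, not the intended solution): the columns V11 and V12 are assumed to correspond to the synthetic-fuels-cutback and education-spending votes in mlbench's column ordering; check ?HouseVotes84 before relying on that mapping.

```r
library(mlbench)
library(e1071)

data(HouseVotes84)
# Assumed mapping (verify against the dataset documentation):
#   V11 = synfuels corporation cutback, V12 = education spending.
voted_for_cutback <- subset(HouseVotes84, V11 == "y")

model <- naiveBayes(Class ~ V12, data = voted_for_cutback)
pred  <- predict(model, voted_for_cutback)

# Rough in-sample accuracy; a proper train/test split would be better.
mean(pred == voted_for_cutback$Class, na.rm = TRUE)
table(predicted = pred, actual = voted_for_cutback$Class)
```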

Q & A time. Women and cats will do as they please, and men and dogs should relax and get used to the idea. Robert Heinlein 19/26

What have we covered? Explored k-NN using the iris and ozone datasets. Identified strengths and limitations of k-NN. Worked with the naïve Bayes classifier on artificial and Titanic datasets. Next: LPAR Chapter 11, classification trees. 20/26

References (1 of 2)
[1] Leo Breiman and Jerome H. Friedman, Estimating Optimal Transformations for Multiple Regression and Correlation, Journal of the American Statistical Association 80 (1985), no. 391, 580–598.
[2] Michel Marie Deza and Elena Deza, Encyclopedia of Distances, Springer, 2009, pp. 1–583.
[3] Rahul Saxena, How the Naive Bayes Classifier works in Machine Learning, http://dataaspirant.com/2017/02/06/naive-bayesclassifier-machine-learning/, 2017.
[4] UCI Staff, Congressional Voting Records Data Set, https://archive.ics.uci.edu/ml/datasets/congressional+voting+records, 2017.
21/26

References (2 of 2) [5] Wikipedia Staff, Confusion matrix, https://en.wikipedia.org/wiki/confusion_matrix, 2017. 22/26

Files of interest: 1 Revised iris k-NN scripts 2 Revised ozone k-NN scripts 3 Revised Bayes scripts 4 R library script file 23/26

Equations. Confusion matrix (1 of 2) [5]. Some definitions:
TP: True Positive (hit)
TN: True Negative (correct rejection)
FP: False Positive (false alarm, Type I error)
FN: False Negative (miss, Type II error)
TPR (sensitivity, recall, hit rate, or true positive rate) = TP / (TP + FN)
TNR (specificity or true negative rate) = TN / (TN + FP)
PPV (precision or positive predictive value) = TP / (TP + FP)
NPV (negative predictive value) = TN / (TN + FN)
FNR (miss rate or false negative rate) = 1 − TPR
24/26

Equations. Confusion matrix (2 of 2) [5].
FPR (fall-out or false positive rate) = 1 − TNR
FDR (false discovery rate) = 1 − PPV
FOR (false omission rate) = 1 − NPV
ACC (accuracy) = (TP + TN) / (TP + TN + FP + FN)
F1 (F1 score) = 2·TP / (2·TP + FP + FN)
25/26
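These quantities are straightforward to compute directly from the four cell counts; a small R sketch with invented counts for illustration:

```r
# Illustrative counts, following the TP/FN/FP/TN definitions above.
TP <- 40; FN <- 10
FP <- 5;  TN <- 45

TPR <- TP / (TP + FN)                    # sensitivity / recall
TNR <- TN / (TN + FP)                    # specificity
PPV <- TP / (TP + FP)                    # precision
ACC <- (TP + TN) / (TP + TN + FP + FN)   # accuracy
F1  <- 2 * TP / (2 * TP + FP + FN)       # F1 score

round(c(TPR = TPR, TNR = TNR, PPV = PPV, ACC = ACC, F1 = F1), 3)
```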

Plots Confusion matrix Image from [5]. 26/26