Wild Mushrooms Classification Edible or Poisonous

Yulin Shen
ECE 539 Project Report
Professor: Hu Yu Hen
2013 Fall
(I allow code to be released in the public domain.)

Contents

Introduction
  Practicality of the project
  Aim of the project
Work Performed
  Data analysis
  K-NN
  Naïve Bayes
Results
Discussion
References

Introduction

Practicality of the project

Mushrooms occupy an unusual place among foods because of their edibility. Some countries treat mushrooms as a highly nutritious food, yet only a small portion of them is edible, and eating a poisonous mushroom is genuinely dangerous. I therefore want to use classification algorithms to build the best possible model for predicting whether a newly encountered mushroom is edible, based on its observed attributes. The project is also an opportunity to compare the classifiers and to understand how they operate.

Aim of the project

The project uses the Mushroom dataset from the UCI Machine Learning Repository. I implement two classification algorithms to build predictive models, work to increase their accuracy, and compare the two classifiers to understand their advantages and disadvantages.

Work Performed

Data analysis

The original dataset records every attribute as an alphabetic character. Although each character can be mapped to an integer, the resulting values of the different features do not share a common scale.

First, I compute the least common multiple of the numbers of distinct categories in the features. Using this value as a common range, I convert each alphabetic character to an integer so that all features are rescaled to the same range, which helps K-NN produce a more meaningful ordering of the distances. In contrast to K-NN, Naïve Bayes does not need the dataset in numeric form, so I keep the original dataset for that implementation.

The dataset contains about 8,000 records. Although this is plenty of data, it still has to be divided into training and testing samples. I use 4-way cross-validation to obtain 4 pairs of training and testing sets, with a training-to-testing size ratio of 3:1.

K-NN

After obtaining the 4 pairs of training and testing sets, I compute the distance between every point in the training set and every point in the testing set. Because every feature is treated as equally important, I use unweighted Euclidean distance. I store the distances in a matrix, sort each row in ascending order, and then decide how to determine the final label. The first element of each sorted array, being the nearest neighbor, should carry the largest weight, so I experiment with weighting the elements of the matrix. Since there are 22 features, two samples can differ in at most 22 features, so I exhaustively try every K from 1 to 22 to find which value works best for the model. Finally, I compare the predicted labels with the true labels to obtain the classification rates.
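To make the procedure concrete, here is a minimal Python sketch of the pipeline described above: one reading of the LCM-based rescaling, plus 4-way cross-validation with Euclidean distances and an exhaustive search over K. It is not the project's original code; the function names, the random shuffling of the split, and the tie-breaking rule are my own assumptions.

```python
# Hedged sketch of the procedure described above, not the project's original code.
# X_chars: (N, 22) array of single-character strings; y: length-N array of
# integer labels (1 = edible, 0 = poisonous).
import numpy as np
from math import lcm  # Python 3.9+

def encode_features(X_chars):
    """One reading of the LCM rescaling: spread each feature's categories over
    a common range so that all features share the same scale."""
    counts = [len(np.unique(col)) for col in X_chars.T]
    common_range = lcm(*counts)
    X = np.empty(X_chars.shape, dtype=float)
    for j, col in enumerate(X_chars.T):
        index = {v: i for i, v in enumerate(np.unique(col))}
        step = common_range / len(index)
        X[:, j] = np.array([index[v] for v in col]) * step
    return X

def knn_cross_validation(X, y, n_folds=4, k_max=22, seed=0):
    """4-way cross-validation with Euclidean distances, trying every K from 1 to k_max."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, n_folds)              # training : testing = 3 : 1
    rates = np.zeros((n_folds, k_max))                # classification rate per fold and K
    for f, test_idx in enumerate(folds):
        train_idx = np.setdiff1d(idx, test_idx)
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_te, y_te = X[test_idx], y[test_idx]
        # squared Euclidean distances from every test point to every training point
        d2 = ((X_te ** 2).sum(1)[:, None] + (X_tr ** 2).sum(1)[None, :]
              - 2.0 * X_te @ X_tr.T)
        order = np.argsort(np.maximum(d2, 0.0), axis=1)  # ascending: nearest first
        for k in range(1, k_max + 1):
            votes = y_tr[order[:, :k]]
            pred = (votes.mean(axis=1) >= 0.5).astype(int)  # majority vote, ties -> edible
            rates[f, k - 1] = np.mean(pred == y_te)
    return rates
```

The best K for each fold is then simply the column index that maximizes that fold's row of `rates`.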

Naïve Bayes

This algorithm works directly on the original dataset, because it estimates the probability of each feature value rather than computing distances; in other words, it counts occurrences instead of doing arithmetic on the features. For example, in the first feature column I count how often the value b appears, and how often it appears when the mushroom is edible. From these counts I estimate two conditional probabilities: that b appears given the mushroom is edible, and that it appears given the mushroom is poisonous. After obtaining these two probabilities for every feature, I multiply them across features for each class and compare the two products; the larger product determines whether the mushroom is predicted to be edible or poisonous. The classification rate is then computed as above.
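A minimal Python sketch of this counting scheme follows; it is not the project's original implementation. It assumes the categorical feature matrix `X_cat` and label vector `y` are already loaded, and it adds Laplace smoothing, which the report does not mention, to avoid zero counts.

```python
# Hedged sketch of the Naive Bayes counting scheme described above.
# X_cat: (N, 22) array of single-character categorical features.
# y: length-N array of class labels, e.g. 'e' (edible) or 'p' (poisonous).
import numpy as np

def naive_bayes_fit(X_cat, y, alpha=1.0):
    """Return class priors and per-feature conditional probability tables."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    tables = []                                   # one dict per feature column
    for j in range(X_cat.shape[1]):
        values = np.unique(X_cat[:, j])
        table = {}
        for c in classes:
            col = X_cat[y == c, j]
            # Laplace-smoothed estimate of P(feature_j = v | class = c)
            table[c] = {v: (np.sum(col == v) + alpha) / (len(col) + alpha * len(values))
                        for v in values}
        tables.append(table)
    return classes, priors, tables

def naive_bayes_predict(x, classes, priors, tables):
    """Multiply the conditional probabilities for each class and pick the larger product."""
    scores = {}
    for c in classes:
        p = priors[c]
        for j, v in enumerate(x):
            p *= tables[j][c].get(v, 1e-9)        # unseen value: tiny probability
        scores[c] = p
    return max(scores, key=scores.get)
```

With 22 features the product of probabilities becomes very small, so in practice summing log-probabilities is the usual way to avoid numerical underflow.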

Results

K-NN:

Classification rate 1: 0.9764
Classification rate 2: 0.8966
Classification rate 3: 1
Classification rate 4: 0.8631

[Bar chart: K-NN classification rate for each of the 4 cross-validation folds.]

Naïve Bayes:

Classification rate 1: 0.9764
Classification rate 2: 0.8966
Classification rate 3: 1
Classification rate 4: 0.8631

[Bar chart: Naïve Bayes classification rate for each of the 4 cross-validation folds.]

Discussion

I first tried to run K-NN on the original, unconverted dataset; its accuracy was clearly lower than with the dataset rescaled to a common integer range. To increase prediction accuracy, how the data is prepared matters: the features should share the same range, which helps the classifier obtain better results.

The crucial steps in K-NN are how the distances are calculated and how the final label is determined. In other problems, some features may be more important than others; in that case the algorithm can give those features more weight, which improves the ordering of the distances. The algorithm can also apply weights when determining the result: the first distance is the smallest, so it belongs to the point closest to the test sample, and its label should be the most convincing. This is somewhat like linear regression. However, in this case I do not think fitting a regression in the last step is a good idea, because I would obtain 4 different regression models from the 4 dataset pairs and would have no reliable way to determine which one is better. We do not know the true test samples for future prediction, and the true training set should include both the testing and training samples used in this project. Hence, fitting the regression is not useful here, although finding some reasonable weights is still a good choice.

We can see that sample 3 has the highest classification rate and sample 1 also predicts well, while samples 2 and 4 are not as good. This is because I partition the dataset into 4 separate sets, and in some sets the data may not be evenly distributed. For example, set 3 may contain a large number of poisonous mushrooms, or feature 1 in that set may be almost entirely the value x; in such a case it is hard for the algorithm to find a similar point. However, the true training set would be comprehensive, so I think the model could achieve higher accuracy in a realistic setting.

In the project, I use an exhaustive search to detect which K gives better predictions. We can see that K = 1, 2, or 3 is better in samples 1, 2, and 4, while sample 2 is very stable because its data is in good condition.

[Four line charts: classification rate versus K (1 to 22), one per cross-validation fold.]


The classification rates of Naïve Bayes are all around 90%. Although I follow the same procedure as a reference implementation, my classification rates are a little lower than those reported in the reference. The algorithm is based mainly on probability, and I think the results suffer for the same reason as K-NN: the separate datasets produced by the split. Even setting that effect aside, the algorithm performs worse than K-NN on this problem. In my opinion, with more than 20 features, simply comparing the products of the individual probabilities can produce a large deviation, and it would be better to add some other model in the last step to limit that deviation for prediction. In the process, I calculate the probabilities for all features. Some of them are very high, meaning that when that feature value appears the mushroom is very likely to be edible. This is a hint about which features influence the result most, so we may add weights to them, as sketched below.
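As a hedged illustration of that suggestion, and not something the report implements, feature weights and inverse-distance vote weights could be folded into the K-NN step as follows; `feature_weights` is a hypothetical input, for example derived from how strongly each feature's values separate the two classes.

```python
# Hedged sketch of weighted K-NN, illustrating the weighting ideas discussed
# above; not part of the original project. feature_weights is hypothetical.
import numpy as np

def weighted_knn_predict(X_tr, y_tr, X_te, k=3, feature_weights=None):
    """Predict 0/1 labels for X_te using feature-weighted Euclidean distances
    and inverse-distance vote weights."""
    n_features = X_tr.shape[1]
    w = np.ones(n_features) if feature_weights is None else np.asarray(feature_weights)
    # weighted Euclidean distance: important features count for more
    d = np.sqrt((w * (X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=2))
    order = np.argsort(d, axis=1)[:, :k]
    preds = np.empty(len(X_te), dtype=int)
    for i, neighbors in enumerate(order):
        votes = y_tr[neighbors].astype(float)
        vote_w = 1.0 / (d[i, neighbors] + 1e-9)   # closer neighbors count more
        preds[i] = int(np.dot(votes, vote_w) / vote_w.sum() >= 0.5)
    return preds
```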

References

University of California, Irvine. Mushroom Dataset, May 1989. http://archive.ics.uci.edu/ml/datasets/mushroom
Grayson Leonard, Matt Schartman. Classifying Edibility of Mushrooms, June 2012.
Min-Ling Zhang, Zhi-Hua Zhou. A K-Nearest Neighbor Based Algorithm for Multi-label Classification.
Wikipedia. Naïve Bayes Classifier. http://en.wikipedia.org/wiki/naive_bayes_classifier