Waleed Pervaiz CSE 352


Classification Tool: WEKA (Waikato Environment for Knowledge Analysis), by the University of Waikato. Available at: http://www.cs.waikato.ac.nz/~ml/weka/index.html

The raw data does give us a lot of information. However, in this form most of that information is unusable and tells us very little.

To prepare the data for preprocessing, the following steps were taken. Any attribute missing more than 20% of its values was removed. The following attributes were thus removed: Pb, As, Cd, Ni, Sc, Co, Li, Mo.
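This filtering step can be sketched in pandas; the frame below is a hypothetical stand-in for the real spreadsheet, not the actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-table standing in for the geochemical data;
# NaN marks a missing measurement.
df = pd.DataFrame({
    "Fe2O3": [1.2, 3.4, 2.2, 4.1, 0.9],
    "Pb":    [np.nan, np.nan, 5.0, np.nan, np.nan],  # 80% missing
    "Zn":    [16.0, np.nan, 12.0, 20.0, 18.0],       # 20% missing
})

missing_frac = df.isna().mean()        # fraction of missing values per attribute
df = df.loc[:, missing_frac <= 0.20]   # drop anything over 20% missing
print(list(df.columns))                # Pb is dropped, Zn survives
```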

All other attributes with missing values were filled in with their mean. Missing values were inserted for the following attributes:
TiO2 (Carbonates), mean 0.005: inserted at E58 and E62
P2O5 (Carbonates), mean 0.74: inserted at L33
S (Carbonates), mean 423: inserted at M52, M66, M75
Zn (Carbonates), mean 16: inserted at N46, N53, N67
Cu (Carbonates), mean 3: inserted at O37, O43, O46, O48, O52, O55, O57, O60, O63, O67
Cr (Galene), mean 9.6: inserted at P85, P87
Cr (Spahlerite), mean 3.6: inserted at P90
V (Carbonates), mean 5.2: inserted at Q43, Q48, Q51, Q52, Q60, Q62, Q66, Q71, Q75
V (Galene), mean 2.5: inserted at Q86
V (Spahlerite), mean 9.4: inserted at Q90
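Note that each mean was computed within its own mineral group (the parenthesised names), not over the whole column. A minimal sketch of group-wise mean imputation, with hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; 'Class' is the mineral group, since each mean
# was taken within a group (e.g. Cu averaged over Carbonates only).
df = pd.DataFrame({
    "Class": ["Carbonates", "Carbonates", "Carbonates", "Galene", "Galene"],
    "Cu":    [2.0, np.nan, 4.0, 9.0, np.nan],
})

# Replace each missing value with the mean of its own group.
df["Cu"] = df.groupby("Class")["Cu"].transform(lambda s: s.fillna(s.mean()))
print(df["Cu"].tolist())  # [2.0, 3.0, 4.0, 9.0, 9.0]
```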

For easier reading, class values were replaced with simpler labels. The following values were changed:
R. carbonatées changed to C1
Pyrite changed to C2
Chalcopyrites changed to C3
Galène changed to C4
Spahlerite changed to C5
Sédiments terrigènes changed to C6
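The renaming is a straightforward value substitution; a sketch:

```python
# Mapping of the original class values to the simpler labels.
mapping = {
    "R. carbonatées": "C1",
    "Pyrite": "C2",
    "Chalcopyrites": "C3",
    "Galène": "C4",
    "Spahlerite": "C5",
    "Sédiments terrigènes": "C6",
}

classes = ["Pyrite", "Galène", "R. carbonatées"]
renamed = [mapping[c] for c in classes]
print(renamed)  # ['C2', 'C4', 'C1']
```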

Using WEKA, we remove any noisy data that would unnecessarily skew our results.

With all the missing data filled in and the noisy data eliminated, we discretize the data using the WEKA tool (3 equal-frequency bins).

Values in the bins were then replaced with the labels Low, Medium, and High. This makes the data easier to interpret, and decision-tree algorithms still work with these non-numerical values.
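The discretization and relabelling steps can be sketched with pandas' qcut, which mimics WEKA's equal-frequency binning (the values below are hypothetical):

```python
import pandas as pd

values = pd.Series([0.1, 0.4, 0.9, 1.5, 2.2, 3.8, 5.0, 7.1, 9.9])

# Equal-frequency binning: each of the 3 bins receives roughly the
# same number of values; the bins are then labelled with words.
binned = pd.qcut(values, q=3, labels=["Low", "Medium", "High"])
print(binned.tolist())  # three values land in each bin
```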

The following experiments will be carried out on our data:
Full Learning: construction of decision trees characterizing all classes.
Contrast Learning: using all attributes to compare class C1 with the rest of the classes.
Limited Learning: construction of a decision tree using only the major attributes.

Experiment 1: Full Learning. Decision trees were generated using the J48 algorithm.

We got the following discriminant rules:
IF Ta = Low AND Lu = Low AND Fe2O3 = Low THEN Class = C1
IF Ta = Low AND Lu = Low AND Fe2O3 = Medium THEN Class = C1
IF Ta = Low AND Lu = Low AND Fe2O3 = High AND Cu = Low THEN Class = C1
IF Ta = Low AND Lu = Low AND Fe2O3 = High AND Cu = Medium THEN Class = C2
IF Ta = Low AND Lu = Low AND Fe2O3 = High AND Cu = High AND Cr = Low THEN Class = C1
IF Ta = Low AND Lu = Low AND Fe2O3 = High AND Cu = High AND Cr = Medium THEN Class = C1
IF Ta = Low AND Lu = Low AND Fe2O3 = High AND Cu = High AND Cr = High THEN Class = C4
IF Ta = Low AND Lu = Medium THEN Class = C1
IF Ta = Low AND Lu = High THEN Class = C1
IF Ta = Medium AND CaO = High THEN Class = C1
IF Ta = Medium AND CaO = Medium THEN Class = C1
IF Ta = Medium AND CaO = Low THEN Class = C2
IF Ta = High AND Zn = Low THEN Class = C1
IF Ta = High AND Zn = Medium THEN Class = C6
IF Ta = High AND Zn = High THEN Class = C6
Predictive accuracy determined: 70.58%
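Read as code, the rules for this experiment are just the nested conditions of the J48 tree's branches; a sketch (attribute names taken from the rules above, inputs given as Low/Medium/High strings):

```python
def classify(row):
    """The Experiment 1 discriminant rules, written as nested
    conditions; row maps attribute name -> 'Low'/'Medium'/'High'."""
    if row["Ta"] == "Low":
        if row["Lu"] != "Low":            # Lu is Medium or High
            return "C1"
        if row["Fe2O3"] in ("Low", "Medium"):
            return "C1"
        # Fe2O3 is High: decide on Cu, then Cr
        if row["Cu"] == "Low":
            return "C1"
        if row["Cu"] == "Medium":
            return "C2"
        return "C4" if row["Cr"] == "High" else "C1"
    if row["Ta"] == "Medium":
        return "C2" if row["CaO"] == "Low" else "C1"
    # Ta is High: decide on Zn
    return "C1" if row["Zn"] == "Low" else "C6"

print(classify({"Ta": "Low", "Lu": "Low", "Fe2O3": "High", "Cu": "Medium"}))  # C2
```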

Experiment 2: Contrast Learning. Decision trees were generated using the J48 algorithm.

We got the following discriminant rules:
IF Fe2O3 = Low THEN Class = C1
IF Fe2O3 = Medium THEN Class = C1
IF Fe2O3 = High AND CaO = High THEN Class = C1
IF Fe2O3 = High AND CaO = Medium AND Rb = Medium THEN Class = NOT C1
IF Fe2O3 = High AND CaO = Medium AND Rb = Low THEN Class = C1
IF Fe2O3 = High AND CaO = Medium AND Rb = High THEN Class = C1
IF Fe2O3 = High AND CaO = Low AND Zn = Low THEN Class = C1
IF Fe2O3 = High AND CaO = Low AND Zn = Medium THEN Class = NOT C1
IF Fe2O3 = High AND CaO = Low AND Zn = High THEN Class = NOT C1
Predictive accuracy determined: 97.06%

Experiment 3: Using Major Attributes. Decision trees were generated using the J48 algorithm.

We got the following discriminant rules:
IF Fe2O3 = Low THEN Class = C1
IF Fe2O3 = Medium THEN Class = C1
IF Fe2O3 = High AND CaO = High THEN Class = C1
IF Fe2O3 = High AND CaO = Medium THEN Class = C1
IF Fe2O3 = High AND CaO = Low AND Zn = Low THEN Class = C1
IF Fe2O3 = High AND CaO = Low AND Zn = Medium AND S = Low THEN Class = C6
IF Fe2O3 = High AND CaO = Low AND Zn = Medium AND S = Medium THEN Class = C6
IF Fe2O3 = High AND CaO = Low AND Zn = Medium AND S = High THEN Class = C2
IF Fe2O3 = High AND CaO = Low AND Zn = High THEN Class = C6
Predictive accuracy determined: 82.35%

With all the missing data filled in and the noisy data eliminated, we use another method of data discretization (4 equal-width bins).
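In contrast to equal-frequency binning, equal-width binning splits the value range into intervals of identical width, regardless of how many values fall into each. A pandas sketch with hypothetical values (the bin labels B1 to B4 are placeholders, since the report does not name the four bins):

```python
import pandas as pd

values = pd.Series([0.0, 1.0, 2.5, 4.0, 6.0, 7.5, 10.0])

# Equal-width binning: 4 intervals each of width (10 - 0) / 4 = 2.5.
binned = pd.cut(values, bins=4, labels=["B1", "B2", "B3", "B4"])
print(binned.tolist())
```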

The following experiments will be carried out on our data:
Full Learning: construction of decision trees characterizing all classes.
Contrast Learning: using all attributes to compare class C1 with the rest of the classes.
Limited Learning: construction of a decision tree using only the major attributes.

Experiment 1: Full Learning. Decision trees were generated using the J48 algorithm.

We got the following discriminant rules:
IF Ta = Low AND Lu = Low AND Fe2O3 = Low THEN Class = C1
IF Ta = Low AND Lu = Low AND Fe2O3 = Medium THEN Class = C1
IF Ta = Low AND Lu = Low AND Fe2O3 = High AND Cu = Low THEN Class = C1
IF Ta = Low AND Lu = Low AND Fe2O3 = High AND Cu = Medium THEN Class = C2
IF Ta = Low AND Lu = Low AND Fe2O3 = High AND Cu = High AND Cr = Low THEN Class = C1
IF Ta = Low AND Lu = Low AND Fe2O3 = High AND Cu = High AND Cr = Medium THEN Class = C1
IF Ta = Low AND Lu = Low AND Fe2O3 = High AND Cu = High AND Cr = High THEN Class = C4
IF Ta = Low AND Lu = Medium THEN Class = C1
IF Ta = Low AND Lu = High THEN Class = C1
IF Ta = Medium AND CaO = High THEN Class = C1
IF Ta = Medium AND CaO = Medium THEN Class = C1
IF Ta = Medium AND CaO = Low THEN Class = C2
IF Ta = High AND Zn = Low THEN Class = C1
IF Ta = High AND Zn = Medium THEN Class = C6
IF Ta = High AND Zn = High THEN Class = C6
Predictive accuracy determined: 79.59%

Experiment 2: Contrast Learning. Decision trees were generated using the J48 algorithm.

We got the following discriminant rules:
IF Fe2O3 = Low THEN Class = C1
IF Fe2O3 = Medium THEN Class = C1
IF Fe2O3 = High AND CaO = High THEN Class = C1
IF Fe2O3 = High AND CaO = Medium AND Rb = Medium THEN Class = NOT C1
IF Fe2O3 = High AND CaO = Medium AND Rb = Low THEN Class = C1
IF Fe2O3 = High AND CaO = Medium AND Rb = High THEN Class = C1
IF Fe2O3 = High AND CaO = Low AND Zn = Low THEN Class = C1
IF Fe2O3 = High AND CaO = Low AND Zn = Medium AND Tb = Low THEN Class = NOT C1
IF Fe2O3 = High AND CaO = Low AND Zn = Medium AND Tb = Medium THEN Class = C1
IF Fe2O3 = High AND CaO = Low AND Zn = Medium AND Tb = High THEN Class = NOT C1
IF Fe2O3 = High AND CaO = Low AND Zn = High THEN Class = NOT C1
Predictive accuracy determined: 89.79%

Experiment 3: Using Major Attributes. Decision trees were generated using the J48 algorithm.

We got the following discriminant rules:
IF Fe2O3 = Low THEN Class = C1
IF Fe2O3 = Medium THEN Class = C1
IF Fe2O3 = High AND CaO = High THEN Class = C1
IF Fe2O3 = High AND CaO = Medium THEN Class = C1
IF Fe2O3 = High AND CaO = Low AND Zn = Low THEN Class = C1
IF Fe2O3 = High AND CaO = Low AND Zn = Medium THEN Class = NOT C1
IF Fe2O3 = High AND CaO = Low AND Zn = High THEN Class = NOT C1
Predictive accuracy determined: 94.90%

Here is a comparison of the accuracy achieved with each dataset:

              Dataset #1   Dataset #2
Experiment 1    70.58%       79.59%
Experiment 2    97.06%       89.79%
Experiment 3    82.35%       94.90%

Dataset 1 used 3-bin equal-frequency discretization. Dataset 2 used 4-bin equal-width discretization.

High accuracy for a particular value can sometimes be misleading, since there is far more data for C1 (77 records) than for the other classes (21 records). WEKA produces different rules depending on the techniques used for data preparation. Dataset 2 generally had better accuracy; thus, we can conclude that the 4-bin equal-width method was slightly more accurate than the 3-bin equal-frequency method. Comparing classes with each other (i.e. comparing C1 with all the other classes) gave the best overall accuracy.
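The imbalance noted above sets a high baseline: using the report's own counts, a trivial classifier that labels every record C1 would already be right 77 times out of 98.

```python
# 77 C1 records out of 77 + 21 = 98 total (figures from the report).
baseline = 77 / (77 + 21)
print(f"{baseline:.2%}")  # majority-class baseline, about 78.6%
```

Accuracies near this figure mostly reflect the class imbalance rather than structure the tree has learned.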