
Attribute Discretization and Selection Clustering NIKOLA MILIKIĆ nikola.milikic@fon.bg.ac.rs UROŠ KRČADINAC uros@krcadinac.com

Naive Bayes Features Intended primarily for work with nominal attributes. In the case of numeric attributes, either: use the probability distribution of each attribute (a normal distribution is the default) to estimate probabilities, or discretize the attribute's values.
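For reference, a minimal sketch of these two options in the Weka Java API (the file name and class index below are assumptions; setUseSupervisedDiscretization switches NaiveBayes from normal-distribution estimation to internal discretization):

import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesNumericOptions {
    public static void main(String[] args) throws Exception {
        // Hypothetical file name; the same dataset is used later in these slides
        Instances data = DataSource.read("FishersIrisDataset.arff");
        data.setClassIndex(data.numAttributes() - 1); // assume the class is the last attribute

        NaiveBayes nb = new NaiveBayes();
        // Default behaviour: each numeric attribute is modelled with a normal distribution.
        // Alternative: let NaiveBayes discretize numeric attributes internally.
        nb.setUseSupervisedDiscretization(true);
        nb.buildClassifier(data);
        System.out.println(nb);
    }
}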

Attribute Discretization Discretization is the process of transforming numeric data into nominal data by putting the numeric values into a fixed number of distinct groups (bins). Common approaches: Unsupervised: Equal-width binning, Equal-frequency binning. Supervised: the classes are taken into account.

Equal-Width Binning Equal-width binning divides the range of possible values into N intervals (bins) of the same width: width = (max value - min value) / N. Example: If the values range from 0 to 100 and we want to create 5 bins: width = (100 - 0) / 5 = 20. Bins: [0-20], (20-40], (40-60], (60-80], (80-100]. Usually, the first and the last bin are extended to include possible values outside the original range.
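A minimal plain-Java sketch of the width formula above (the values 0, 100 and 5 come from the example; everything else is illustrative):

public class EqualWidthBinning {
    // Compute the n + 1 bin edges for values in [min, max]
    static double[] binEdges(double min, double max, int n) {
        double width = (max - min) / n;
        double[] edges = new double[n + 1];
        for (int i = 0; i <= n; i++) {
            edges[i] = min + i * width;
        }
        return edges;
    }

    public static void main(String[] args) {
        double[] edges = binEdges(0, 100, 5);
        // Prints: 0.0 20.0 40.0 60.0 80.0 100.0
        for (double e : edges) {
            System.out.print(e + " ");
        }
    }
}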

Equal-Frequency Binning Equal-frequency binning (or equal-height binning) divides the range of possible values into N intervals (bins) so that each bin contains the same number of instances. Example: We want to put the following values into 5 bins: 5, 7, 12, 35, 65, 82, 84, 88, 90, 95. Each bin will therefore contain 2 instances: [5, 7], [12, 35], [65, 82], [84, 88], [90, 95].
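The same grouping can be sketched in plain Java (the values are taken from the example above; this simple split assumes the number of values is divisible by the number of bins):

import java.util.Arrays;

public class EqualFrequencyBinning {
    public static void main(String[] args) {
        double[] values = {5, 7, 12, 35, 65, 82, 84, 88, 90, 95};
        Arrays.sort(values);               // equal-frequency binning works on sorted values
        int bins = 5;
        int perBin = values.length / bins; // 2 instances per bin in this example
        for (int b = 0; b < bins; b++) {
            double[] bin = Arrays.copyOfRange(values, b * perBin, (b + 1) * perBin);
            System.out.println("Bin " + (b + 1) + ": " + Arrays.toString(bin));
        }
    }
}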

Discretization in Weka We apply a filter to the attributes we want to discretize. Preprocess tab, option Choose -> filters/unsupervised/attribute/Discretize. Dataset: FishersIrisDataset.arff

Discretization in Weka Equal-width binning is the default option. attributeIndices: the value first-last means that we are discretizing all attributes; we can also list specific attribute numbers. bins: the desired number of bins. useEqualFrequency: false by default; set to true to use equal-frequency binning.
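The same filter settings can be applied through the Weka Java API; a minimal sketch (the file name is taken from the slides, the number of bins is illustrative):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("FishersIrisDataset.arff");

        Discretize discretize = new Discretize();
        discretize.setAttributeIndices("first-last"); // discretize all attributes
        discretize.setBins(5);                        // desired number of bins
        discretize.setUseEqualFrequency(false);       // false = equal-width (the default)
        discretize.setInputFormat(data);

        Instances discretized = Filter.useFilter(data, discretize);
        System.out.println(discretized.toSummaryString());
    }
}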

Discretization in Weka Applying the filter The resulting bins

Data, before and after discretization Before After

Attribute Selection Attribute Selection (or Feature Selection) is the process of choosing a subset of relevant attributes to be used in the further analysis. It is applied when the dataset contains attributes that are redundant and/or irrelevant. Redundant attributes are those that do not provide more information than the attributes we already have in our dataset. Irrelevant attributes are those that are useless in the context of the current analysis.

Attribute Selection Advantages Excessive attributes can degrade the performance of the model. Advantages: Improves the readability of the model (because the model then contains only the relevant attributes). Shortens the training time. Improves generalization, because it lowers the possibility of overfitting. If the problem is well understood, the best way to select attributes is to do it manually. However, automated approaches also give good results.

Approaches to Attribute Selection Two approaches: Filter method: attributes are evaluated with a heuristic based on general characteristics of the data. Wrapper method: attribute subsets are evaluated by running the machine learning algorithm itself on the dataset. The name Wrapper comes from the fact that the algorithm is wrapped within the selection process. The chosen subset of attributes is the one for which the algorithm gives the best results.

Attribute Selection Example census90-income.arff

Attribute Selection Example We want to apply attribute selection

Attribute Selection Example ClassifierSubsetEval is our choice for the evaluator

Attribute Selection Example NaiveBayes classifier

Attribute Selection Example We need to discretize the numeric attributes

Attribute Selection Example As the search method we choose BestFirst

Attribute Selection Example The filter is set and can be applied

Attribute Selection Example The number of attributes is reduced to 7
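The same wrapper-style selection can be reproduced with the Weka Java API; a minimal sketch (file name and class index are assumptions, and the numeric attributes are assumed to have been discretized beforehand, as in the slides):

import weka.attributeSelection.BestFirst;
import weka.attributeSelection.ClassifierSubsetEval;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class WrapperSelectionSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("census90-income.arff");
        data.setClassIndex(data.numAttributes() - 1); // assume the class is the last attribute

        ClassifierSubsetEval evaluator = new ClassifierSubsetEval();
        evaluator.setClassifier(new NaiveBayes()); // NaiveBayes evaluates each candidate subset

        AttributeSelection selection = new AttributeSelection();
        selection.setEvaluator(evaluator);
        selection.setSearch(new BestFirst());      // BestFirst search through attribute subsets
        selection.setInputFormat(data);

        Instances reduced = Filter.useFilter(data, selection);
        System.out.println("Attributes kept: " + reduced.numAttributes());
    }
}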

Clustering Clustering belongs to the group of unsupervised learning techniques. It groups instances into groups that are not known in advance. These groups are called clusters. As a result of clustering, a new attribute is added to each instance: the cluster to which it belongs. Clustering is said to be successful if the final clusters make sense, i.e. if they can be given meaningful names.

K-Means algorithm in Weka FishersIrisDataset.arff

Choosing the clustering algorithm Cluster tab We choose the SimpleKMeans algorithm

Parameter settings numClusters: the number of desired clusters; we set it to 3 because we have 3 species. displayStdDevs: if true, the standard deviations will be displayed.

Running the Clustering Clustering over the imported data We ignore the Species attribute
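A minimal sketch of the same run with the Weka Java API (the file name and the Species attribute index are assumptions; the Explorer's "ignore attributes" step is approximated here with a Remove filter):

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KMeansSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("FishersIrisDataset.arff");

        Remove remove = new Remove();
        remove.setAttributeIndices("5"); // drop the Species attribute before clustering
        remove.setInputFormat(data);
        Instances numericOnly = Filter.useFilter(data, remove);

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(3);       // three clusters, one per expected species
        kmeans.setDisplayStdDevs(true); // report standard deviations alongside centroids
        kmeans.buildClusterer(numericOnly);
        System.out.println(kmeans);     // centroids, std. devs, within-cluster errors
    }
}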

Results of Clustering Centroids of each cluster and their standard deviations Number of instances in each cluster

Evaluation of Results Select the attribute against which we want to compare the clustering results. Which classes end up in which clusters. Names of the classes assigned to the clusters.

Visualization of Clusters Right click Visual representation of clusters

Was clustering successful? The within-cluster sum of squared errors gives us an assessment of quality. It is computed as the sum of squared differences between each instance's attribute values and the centroid values of its cluster, across all attributes.
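Written out in the usual notation (a standard formulation, not copied from the slides), with K clusters C_1, ..., C_K, centroid c_k of cluster C_k, and m numeric attributes:

\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x \in C_k} \sum_{j=1}^{m} \left( x_j - c_{kj} \right)^2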

How to figure out the number of clusters? Run the clustering for increasing numbers of clusters and look at the within-cluster error; stop where adding more clusters makes little difference in error. Errors per number of clusters: 1: 55.6, 2: 12.1, 3: 7.0, 4: 5.5, 5: 5.0, 6: 4.8, 7: 4.7, 8: 4.2, 9: 4.1, 10: 3.6, 20: 1.7, 50: 0.6. (Chart on the slide: error vs. number of clusters, with little difference in error beyond the first few clusters.)
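A sketch of how such an error-per-number-of-clusters table can be produced with the Weka Java API (the file name is an assumption, and the Species attribute is assumed to have been removed beforehand, as in the earlier slides):

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ElbowSketch {
    public static void main(String[] args) throws Exception {
        // Numeric attributes only (Species removed beforehand, e.g. with a Remove filter)
        Instances data = DataSource.read("FishersIrisDataset.arff");
        for (int k = 1; k <= 10; k++) {
            SimpleKMeans kmeans = new SimpleKMeans();
            kmeans.setNumClusters(k);
            kmeans.buildClusterer(data);
            // getSquaredError() is the within-cluster sum of squared errors
            System.out.println(k + " clusters -> error " + kmeans.getSquaredError());
        }
    }
}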

Using Clusters for Classification AddCluster is our choice of filter. Setting: no class.

Using Clusters for Classification We choose SimpleKMeans as the clustering algorithm. For clustering, we ignore attribute 5 (Species).

Using Clusters for Classification After the filter is applied (Apply), a new attribute named cluster is added.

Using Clusters for Classification Optional: this attribute can be removed before we create a classification model

Using Clusters for Classification We use the NaiveBayes classifier. We perform the classification with the cluster attribute as the class. The confusion matrix.
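The whole sequence can be sketched with the Weka Java API as well (file name, attribute index, and cross-validation setup are assumptions; the Species attribute is kept here, since its removal is optional in the slides):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddCluster;

public class ClusterThenClassifySketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("FishersIrisDataset.arff");

        // Add a new "cluster" attribute, ignoring attribute 5 (Species) during clustering
        AddCluster addCluster = new AddCluster();
        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(3);
        addCluster.setClusterer(kmeans);
        addCluster.setIgnoredAttributeIndices("5");
        addCluster.setInputFormat(data);
        Instances withCluster = Filter.useFilter(data, addCluster);

        // Classify according to the new cluster attribute (added as the last attribute)
        withCluster.setClassIndex(withCluster.numAttributes() - 1);
        Evaluation eval = new Evaluation(withCluster);
        eval.crossValidateModel(new NaiveBayes(), withCluster, 10, new Random(1));
        System.out.println(eval.toMatrixString()); // the confusion matrix
    }
}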

Thank you notes Weka Tutorials and Assignments @ The Technology Forge Link: http://www.technologyforge.net/wekatutorials/ Witten, Ian H., Eibe Frank, and Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Elsevier, 2011.

A survey for you, to judge us :) http://goo.gl/cqdp3i

Any questions? NIKOLA MILIKIĆ nikola.milikic@fon.bg.ac.rs UROŠ KRČADINAC uros@krcadinac.com