Analysis of Different Similarity Measure Functions and their Impacts on Shared Nearest Neighbor Clustering Approach


Anil Kumar Patidar, School of IT, Rajiv Gandhi Technical University, Bhopal (M.P.), India
Jitendra Agrawal, School of IT, Rajiv Gandhi Technical University, Bhopal (M.P.), India
Nishchol Mishra, School of IT, Rajiv Gandhi Technical University, Bhopal (M.P.), India

ABSTRACT

Clustering is a technique of grouping data with analogous content. In recent years, density-based clustering algorithms, especially the SNN clustering approach, have gained high popularity in the field of data mining. SNN finds clusters of different size, density, and shape in the presence of large amounts of noise and outliers, and it is widely used where large multidimensional and dynamic databases are maintained. A typical clustering technique utilizes a similarity function for comparing data items. Previously, many similarity functions, such as the Euclidean or Jaccard similarity measures, have been used for this comparison. In this paper, we evaluate the impact of four different similarity measure functions upon the Shared Nearest Neighbor (SNN) clustering approach and compare the results. Based on our analysis, we conclude that the Euclidean function works best with the SNN clustering approach, in contrast to the cosine, Jaccard and correlation distance measure functions.

Keywords: Data mining, Clustering, SNN (Shared Nearest Neighbor), Density, Noise, Outlier, Similarity Measure.

1. INTRODUCTION

1.1 Data Mining

Data mining is a new technology/process of finding novel, hidden, interesting, and useful information or knowledge from large volumes of raw data [6]. This information or knowledge can be used to predict or to tell us something new. Data is an essential asset of an organization, but only if we know how to retrieve or extract useful data from the large volumes of raw data. Data mining techniques help us accomplish this [7].

1.2 Clustering

Clustering is one of the most important techniques of data mining.
Clustering is a technique of grouping similar data objects together, so that the objects in each group (called a cluster) share the same pattern of information. Clustering is widely used in financial data classification, spatial data processing, satellite photo analysis, engineering and medical figure auto-detection, social network analysis, etc. [5]. There are two types of clustering techniques [8]: partitioning and hierarchical clustering.

In this paper, we have used the density-based SNN clustering approach. It is an efficient clustering approach for dynamic database mining. From previous results, it has been inferred that density-based clustering is very effective for analyzing large amounts of heterogeneous, complex data, for example clustering of complex objects [5].

1.3 Similarity Measures

A similarity measure is defined as the distance between data points. The performance of many algorithms depends upon selecting a good distance function over the input data set. While similarity is an amount that reflects the strength of the relationship between two data items, dissimilarity deals with the measurement of divergence between two data items [2] [3]. Here, we present a brief overview of the similarity measure functions used in this paper:

1. Euclidean distance: The Euclidean distance is the square root of the sum of squared differences between the coordinates of a pair of objects [2]. For vectors x and y, the distance d(x, y) is given by:

$$\mathrm{Sim}(x, y) = d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

where x and y are n-dimensional vectors.

2. Cosine distance: The cosine distance measure for text clustering determines the cosine of the angle between two vectors, given by the following formula [2]:

$$\mathrm{Sim}(x_i, x_j) = \cos\theta = \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|}$$

where θ is the angle between the two vectors and x_i, x_j are n-dimensional vectors.

3. Jaccard distance: The Jaccard distance measures similarity as the intersection divided by the union of the data items [3]. The formula can be stated as:

$$\mathrm{Sim}(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\|^2 + \|x_j\|^2 - x_i \cdot x_j}$$

4. Pearson correlation distance: Pearson's correlation distance is another measure of the extent to which two vectors are related [3]. The distance measure can be mathematically stated as:

$$\mathrm{Sim}(x, y) = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left(n \sum x^2 - \left(\sum x\right)^2\right)\left(n \sum y^2 - \left(\sum y\right)^2\right)}}$$

where the sums run over the n coordinates of x and y.

2. OUTLINE OF THE PAPER

This paper is composed of six sections in addition to the introduction. Section 3 describes the related work (literature survey) based on the notions of density and similarity measure. The SNN clustering approach is discussed in Section 4. Section 5 describes the experimental setup, and Section 6 presents the results and analysis. A short conclusion and directions for future work are presented in Section 7, and Section 8 lists the references.

3. LITERATURE SURVEY

There are a number of clustering algorithms based on the notion of density. In this paper, however, our focus is confined to the widely used SNN clustering approach. In this section, we present a brief overview of the work done in the area of density-based clustering and similarity measures.

Discovering clusters of different sizes and shapes is difficult in the presence of noise and outliers. Many recent clustering algorithms, such as DBSCAN [9], CURE [10], ROCK [11] and Chameleon [12], as well as other variations of the DBSCAN clustering approach, have tried to address this problem, but these algorithms do not work well with objects of varying density. Finding clusters of different shape, size, and density, especially in the presence of noise and outliers, is a problem addressed more recently by the SNN clustering approach.

Jarvis and Patrick [4] first introduced the idea of shared nearest neighbors. In the Jarvis-Patrick approach, an SNN (shared nearest neighbor) graph is created from the proximity matrix. A link is constructed between a pair of points a and b if and only if a and b have each other in their k-nearest-neighbor lists. This step is a k-nearest-neighbor sparsification. The number of near neighbors that two points share determines the weight of the link between the two points in the SNN graph.
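The Jarvis-Patrick construction described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not code from [4]: the function names `knn_lists` and `snn_graph` are hypothetical, the input is assumed to be a precomputed symmetric distance matrix, and a point is assumed not to appear in its own neighbor list.

```python
import numpy as np

def knn_lists(dist, k):
    """For each point, return the indices of its k nearest neighbours."""
    d = np.asarray(dist, dtype=float).copy()
    np.fill_diagonal(d, np.inf)  # assumption: a point is not its own neighbour
    return np.argsort(d, axis=1, kind="stable")[:, :k]

def snn_graph(dist, k):
    """Shared-nearest-neighbour graph as an n x n integer weight matrix.

    Points a and b are linked only if each is in the other's k-NN list;
    the link weight is the number of neighbours the two lists share.
    A weight of 0 therefore means "no link or no shared neighbours".
    """
    nbrs = knn_lists(dist, k)
    n = nbrs.shape[0]
    sets = [set(row.tolist()) for row in nbrs]
    w = np.zeros((n, n), dtype=int)
    for a in range(n):
        for b in sets[a]:
            if a in sets[b]:  # mutual k-nearest neighbours
                w[a, b] = len(sets[a] & sets[b])
    return w
```

Because the mutual-neighbor test is symmetric, the resulting weight matrix is symmetric as well, so it can be read directly as an undirected weighted graph.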
Martin Ester, Hans-Peter Kriegel, Joerg Sander, and Xiaowei Xu [9] demonstrated that the DBSCAN clustering approach finds clusters of arbitrary shapes and sizes, but it cannot work with clusters of differing densities, because its density-based definition of core points cannot capture the core points of clusters of varying density. In DBSCAN, the user defines the neighborhood of a point by a fixed radius and then looks for core points (core objects). With clusters of varying density, either only the points that satisfy the core-point condition are selected as core points while the rest are marked as noise, or else every point connected to a core point ends up in one cluster.

Sudipto Guha, Rajeev Rastogi and Kyuseok Shim [10] presented CURE (Clustering Using REpresentatives), which utilizes representative points to find non-globular clusters. One problem of the CURE clustering approach is that it cannot handle many types of globular shapes. This is due to the way CURE finds representative points: it finds points along the boundary of a cluster and then shrinks those points towards the center of the cluster.

George Karypis, Eui-Hong Han, and Vipin Kumar [12] observed that while DBSCAN uses the notion of core points and CURE uses representative points as its criterion, Chameleon explicitly uses neither core points nor representative points. All three approaches (DBSCAN, CURE, and Chameleon) share the common challenge of finding clusters of different shapes and sizes. The main idea of these three clustering approaches is to find points or subsets of points and then construct clusters around them. The Chameleon approach is important for spatial data: non-globular clusters cannot be represented by their centroid, so centroid-based schemes cannot handle them [12]. When using the DBSCAN, CURE, and Chameleon approaches, considerable attention must also be given to the handling of noise and outliers.

Anna Huang [2] evaluated the effects of many similarity functions on the k-means clustering algorithm.
Kazem Taghva and Rushikesh Veni [3] compared and analyzed the effectiveness of these measures in partitional clustering for text document datasets. In this paper, we describe the SNN clustering approach with four different similarity measure functions and compare the effects of these similarity measures on the SNN clustering approach.

4. SNN CLUSTERING APPROACH

Shared Nearest Neighbor (SNN) [1] is one of the most important and most common clustering approaches in the engineering and scientific literature, and it is able to produce clusters of different size, shape, and density. The SNN approach, like the DBSCAN approach [9], is a density-based clustering approach. The main difference between the two is that SNN deals with clusters of varying densities, whereas DBSCAN does not. SNN defines the similarity between two points by examining the number of nearest neighbors that the two points share. Using this similarity measure, the density of a point is defined as the sum of the similarities of its nearest neighbors. High-density points become core points, and low-density points become noise points. All other points that are strongly similar to particular core points are drawn into the clusters.

The SNN clustering approach [1] can be explained as follows.

1. Compute the similarity matrix: This corresponds to a similarity graph with data points for nodes and edges whose weights are the similarities between data points.
2. Sparsify the similarity matrix: This involves keeping only the k most similar neighbors of each data point. This corresponds to keeping only the k strongest links of the similarity graph.
3. Construct the shared nearest neighbor graph: The SNN graph is obtained from the sparsified similarity matrix. At this point, we could apply a similarity threshold and find the connected components to obtain the clusters (Jarvis-Patrick algorithm).
4. Find the SNN density of each point: For each point, the data points having an SNN similarity greater than or equal to Eps are counted; this count is the SNN density of the point.
5. Find the core points: All points that have an SNN density greater than MinPts are designated as core points.
6. Form clusters from the core points: If two core points are within a radius, Eps, of each other, they are placed in the same cluster.
7. Discard all noise points: All non-core points that are not within a radius of Eps of a cluster are discarded.
8. Assign all non-noise, non-core points to clusters: All these points are assigned to the nearest cluster.

The inputs and the corresponding output of the SNN clustering approach are as follows.

Input:
D: data set
k: maximum number of nearest neighbors of each point
Eps: density threshold (radius of cluster)
MinPts: core point threshold

Output:
K: a set of clusters

In this paper, we used four different similarity measure functions for calculating the similarity matrix and compared the similarity graphs and resultant clusters. The similarity measure functions are the Euclidean, Cosine, Jaccard and Correlation functions.

The SNN clustering approach has many good characteristics. First, it does not cluster all the points. In general, this is good, because much of the data is noise and needs to be removed. If a complete clustering is desired, the unclustered data can be inserted into the core clusters discovered by the SNN clustering approach by assigning each point to the cluster containing the closest representative point. Second, the approach is essentially partitional, although we have experimented some with creating a hierarchy of clusters. Finally, the time complexity is O(n^2), where n is the number of points, because the similarity matrix has to be computed [1] [4].

5. EXPERIMENTAL SETUP

We used several different types of datasets, including synthetic test data sets, KDD Cup 99, the Mushroom dataset, and some randomly generated datasets, to describe the effects of four different similarity measure functions upon the Shared Nearest Neighbor (SNN) clustering approach. All experiments were performed with MATLAB 2010a (MATLAB 7.10). For the experiments reported here, we used a 2D dataset containing 107 data points, as shown in Figure 1.
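To make steps 4 to 8 of Section 4 concrete, the following Python sketch labels points from a given SNN weight matrix. It is an illustration only and not the MATLAB implementation used for the experiments: the function name `snn_cluster` is hypothetical, "within a radius Eps" is interpreted here as "SNN similarity of at least Eps" (an assumption), and noise points receive the label -1.

```python
import numpy as np

def snn_cluster(w, eps, min_pts):
    """Return (labels, core) for an n x n symmetric SNN weight matrix w."""
    w = np.asarray(w)
    n = w.shape[0]
    strong = w >= eps                      # links with SNN similarity >= Eps
    np.fill_diagonal(strong, False)
    density = strong.sum(axis=1)           # step 4: SNN density of each point
    core = density > min_pts               # step 5: density greater than MinPts
    labels = np.full(n, -1)                # -1 marks noise / unassigned

    # Step 6: core points joined by strong links share a cluster
    # (connected components over the core points, depth-first search).
    next_label = 0
    for start in np.flatnonzero(core):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            p = stack.pop()
            for q in np.flatnonzero(strong[p] & core):
                if labels[q] == -1:
                    labels[q] = next_label
                    stack.append(q)
        next_label += 1

    # Steps 7-8: a non-core point joins the cluster of its most similar
    # strongly linked core point; with no such core point it stays noise.
    for p in np.flatnonzero(~core):
        candidates = np.flatnonzero(strong[p] & core)
        if candidates.size:
            best = candidates[np.argmax(w[p, candidates])]
            labels[p] = labels[best]
    return labels, core
```

With this interpretation, the parameters Eps and MinPts play the same roles as in the input list of Section 4, while k enters earlier, when the SNN weight matrix is built.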
Each result shown here was computed with the input parameters k = 7, Eps = 4 and MinPts = 5.

[Fig 1: 2D Data Set]

6. RESULT AND ANALYSIS

From the data set shown in Figure 1, we first compute the similarity matrix using each of the similarity measure functions (Euclidean, Cosine, Jaccard and Correlation) and construct the sparsified similarity graph based on the k-nearest-neighbor criterion. The similarity graphs generated by the different similarity measure functions are shown in Figure 2.

[Fig 2: Similarity graphs generated by different similarity functions: (a) Euclidean function, (b) Cosine function, (c) Jaccard function, (d) Correlation function]

[Fig 3: Clusters constructed by different similarity functions: (a) Euclidean function, (b) Cosine function, (c) Jaccard function, (d) Correlation function]

The similarity matrix calculation is the most important part of the SNN clustering approach. The differences between the similarity graphs can be seen clearly in Figure 2.
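The similarity-matrix and sparsification steps described above can be sketched in Python with NumPy. This is a hedged illustration rather than the authors' MATLAB code: the names `similarity_matrix` and `sparsify` are hypothetical, the rows of `X` are assumed to be non-zero data points, and the Euclidean distance is negated so that, for every measure, a larger value means "more similar" (an assumption made only to share one sparsification routine).

```python
import numpy as np

def similarity_matrix(X, measure):
    """Pairwise similarity between the rows of X for one of four measures."""
    X = np.asarray(X, dtype=float)
    if measure == "euclidean":
        diff = X[:, None, :] - X[None, :, :]
        # Negated distance: closer points get a larger (less negative) value.
        return -np.sqrt((diff ** 2).sum(axis=2))
    dots = X @ X.T                 # all inner products x_i . x_j
    sq = np.diag(dots)             # squared norms ||x_i||^2
    if measure == "cosine":
        norms = np.sqrt(sq)
        return dots / np.outer(norms, norms)
    if measure == "jaccard":
        return dots / (sq[:, None] + sq[None, :] - dots)
    if measure == "correlation":
        return np.corrcoef(X)      # Pearson correlation between rows
    raise ValueError(f"unknown measure: {measure}")

def sparsify(sim, k):
    """Boolean mask keeping, for each point, its k most similar neighbours."""
    s = np.asarray(sim, dtype=float).copy()
    np.fill_diagonal(s, -np.inf)   # never keep a point as its own neighbour
    keep = np.argsort(-s, axis=1, kind="stable")[:, :k]
    mask = np.zeros(s.shape, dtype=bool)
    mask[np.arange(s.shape[0])[:, None], keep] = True
    return mask
```

One caveat when trying this sketch on data with only two coordinates per point: the Pearson correlation between two such points can only be +1 or -1 (or undefined when a point has equal coordinates), so the correlation matrix carries very little ranking information in that setting.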

After construction of the similarity graph, we generate the SNN graph and, by applying the user-specified criteria Eps and MinPts to this SNN graph, we compute the core, non-core, and noise points. The clusters of core, non-core, and noise points obtained with the different similarity functions are shown in Figure 3. In Figure 3, X depicts a core point, a dot (.) a non-core point, and a star (*) a noise point. We compared the clusters constructed using the different similarity functions by their accuracy in generating clusters of core points. We observed the following:

1. Clusters constructed by the Jaccard and Cosine functions had no or very few noise points, the Euclidean function had some noise points, and clusters constructed using the correlation function had a lot of noise points, as shown in Figure 3.
2. In the SNN clustering approach, the Euclidean distance function performed better because not all the points are clustered in the SNN clustering approach. Most of the data points are noise and hence removed.
3. If a complete clustering is desired, it can be obtained in the following two ways:
   a. Using the Euclidean distance function, unclustered data can be inserted into the core clusters discovered by the SNN clustering approach by assigning each point to the cluster containing the closest representative point.
   b. Using the Jaccard or Cosine distance function, clusters can be constructed using the SNN clustering approach.
4. The generation of core, non-core, and noise points depends upon the data points included in the dataset and the user-specified criteria k, Eps and MinPts.
5. If some points are clustered and others are removed as noise according to the specified criteria, the clustering process runs faster.

7. CONCLUSION AND FUTURE WORK

In this paper, we have analyzed the impact of different similarity computation functions upon the SNN clustering approach and compared the resultant similarity graphs and clusters. From the above results, we can infer that the SNN clustering approach with the Euclidean similarity measure function provides better and faster results than the other distance functions described here.
In future, we hope to analyze the impacts of other similarity measure functions upon various popular clustering techniques.

8. REFERENCES

[1] Levent Ertoz, Michael Steinbach, Vipin Kumar, Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data, Second SIAM International Conference on Data Mining, San Francisco, CA, USA, 2003.
[2] Anna Huang, Similarity Measures for Text Document Clustering, NZCSRSC 2008, April 2008, Christchurch, New Zealand.
[3] Kazem Taghva and Rushikesh Veni, Effects of Similarity Metrics on Document Clustering, 2010 Seventh International Conference on Information Technology.
[4] R. A. Jarvis and E. A. Patrick, Clustering Using a Similarity Measure Based on Shared Nearest Neighbors, IEEE Transactions on Computers, Vol. C-22, No. 11, November 1973.
[5] M. R. Anderberg, Cluster Analysis for Applications, Academic Press, New York, 1973.
[6] Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, USA, 2001, ISBN 1558604898.
[7] Lori Bowen Ayre, Data Mining for Information Professionals, 2006.
[8] Arun K Pujari, Data Mining Techniques, Second Edition, Universities Press.
[9] Martin Ester, Hans-Peter Kriegel, Jorg Sander, Xiaowei Xu, A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, KDD 96, Portland, OR, pp. 226-231, 1996.
[10] Sudipto Guha, Rajeev Rastogi, Kyuseok Shim, CURE: An Efficient Clustering Algorithm for Large Databases, ACM, 1998.
[11] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim, ROCK: A Robust Clustering Algorithm for Categorical Attributes, In Proceedings of the 15th International Conference on Data Engineering, 1998.
[12] George Karypis, Eui-Hong Han, and Vipin Kumar, CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling, IEEE Computer, Vol. 32, No. 8, pp. 68-75, August 1999.