Analysis of Different Similarity Measure Functions and their Impacts on Shared Nearest Neighbor Clustering Approach

Anil Kumar Patidar
School of IT, Rajiv Gandhi Technical University, Bhopal (M.P.), India

Jitendra Agrawal
School of IT, Rajiv Gandhi Technical University, Bhopal (M.P.), India

Nishchol Mishra
School of IT, Rajiv Gandhi Technical University, Bhopal (M.P.), India

ABSTRACT
Clustering is a technique for grouping data with analogous content. In recent years, density-based clustering algorithms, and the SNN clustering approach in particular, have gained high popularity in the field of data mining. SNN finds clusters of different size, density, and shape in the presence of a large amount of noise and outliers. It is widely used where large, multidimensional, and dynamic databases are maintained. A typical clustering technique uses a similarity function to compare data items. Previously, many similarity functions, such as the Euclidean or Jaccard similarity measures, have been used for this comparison. In this paper, we evaluate the impact of four different similarity measure functions on the Shared Nearest Neighbor (SNN) clustering approach and compare the results. Based on our analysis, we conclude that the Euclidean function works best with the SNN clustering approach, in contrast to the cosine, Jaccard, and correlation distance measure functions.

Keywords
Data mining, Clustering, SNN (Shared Nearest Neighbor), Density, Noise, Outlier, Similarity Measure.

1. INTRODUCTION
1.1 Data Mining
Data mining is a new technology or process for finding novel, hidden, interesting, and useful information or knowledge in large volumes of raw data [6]. This information or knowledge can be used to make predictions or to tell us something new. Data is an essential asset of a corporation, but only if we know how to extract the useful data from the large volumes of raw data. Data mining techniques help us accomplish this [7].

1.2 Clustering
Clustering is one of the most important techniques of data mining.
Clustering groups similar data objects together so that the objects in each group (called a cluster) share the same pattern of information. It is widely used in financial data classification, spatial data processing, satellite photo analysis, engineering and medical figure auto-detection, social network analysis, etc. [5]. There are two types of clustering techniques [8]: partitioning and hierarchical clustering.

In this paper, we use the density-based SNN clustering approach. It is an efficient clustering approach for dynamic database mining. From previous results, it has been inferred that density-based clustering is very effective for analyzing large amounts of heterogeneous, complex data, for example the clustering of complex objects [5].

1.3 Similarity Measures
A similarity measure is defined here in terms of the distance between data points. The performance of many algorithms depends on selecting a good distance function over the input data set. While similarity reflects the strength of the relationship between two data items, dissimilarity measures the divergence between them [2] [3]. Here, we present a brief overview of the similarity measure functions used in this paper:

1. Euclidean distance: Euclidean distance is the square root of the sum of squared differences between the coordinates of a pair of objects [2]. For vectors x and y, the distance d(x, y) is given by:

$$\mathrm{Sim}(x, y) = d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

where x and y are n-dimensional vectors.

2. Cosine distance: The cosine distance measure for text clustering is the cosine of the angle between two vectors, given by the following formula [2]:

$$\mathrm{Sim}(x_i, x_j) = \cos\theta = \frac{x_i \cdot x_j}{\lVert x_i \rVert \, \lVert x_j \rVert}$$

where θ is the angle between the two vectors and x_i, x_j are n-dimensional vectors.

3. Jaccard distance: The Jaccard distance measures similarity as the intersection divided by the union of the data items [3]. The formula can be stated as:

$$\mathrm{Sim}(x_i, x_j) = \frac{x_i \cdot x_j}{\lVert x_i \rVert^2 + \lVert x_j \rVert^2 - x_i \cdot x_j}$$

4. Pearson correlation distance: Pearson's correlation distance is another measure of the extent to which two vectors are related [3]. The distance measure can be stated mathematically as:

$$\mathrm{Sim}(x, y) = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left(n \sum x^2 - \left(\sum x\right)^2\right)\left(n \sum y^2 - \left(\sum y\right)^2\right)}}$$

where the sums run over the n components of x and y.

2. OUTLINE OF THE PAPER
This paper is composed of 6 sections in addition to the introduction. Section 3 describes the related work (literature survey) based on the notions of density and similarity measures. The SNN clustering approach is discussed in Section 4. Section 5 describes the experimental setup, and Section 6 presents the results and analysis. A short conclusion and directions for future work are presented in Section 7, and Section 8 lists the references.

3. LITERATURE SURVEY
There are a number of clustering algorithms based on the notion of density. However, in this paper our focus is confined to the widely used SNN clustering approach. In this section, we present a brief overview of the work done in the area of density-based clustering and similarity measures.

Discovering clusters of different sizes and shapes is difficult in the presence of noise and outliers. Many recent clustering algorithms, such as DBSCAN [9], CURE [10], ROCK [11], and Chameleon [12], and other variations of the DBSCAN approach have tried to address this problem, but these algorithms do not work well with objects of varying density. Finding clusters of different shape, size, and density, especially in the presence of noise and outliers, is a problem addressed more recently by the SNN clustering approach.

Jarvis and Patrick [4] first introduced the idea of shared nearest neighbors. In the Jarvis-Patrick approach, an SNN (shared nearest neighbor) graph is created from the proximity matrix. A link is constructed between a pair of points a and b if and only if a and b have each other in their k-nearest neighbor lists. This step is called k-nearest neighbor sparsification. The weight of the link between two points in the SNN graph is derived from the number of near neighbors that the two points share.
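The Jarvis-Patrick link rule described above can be illustrated with a short sketch. It is written in plain Python for readability and is not the authors' implementation; the helper names `knn_lists` and `snn_weight` and the toy one-dimensional data are assumptions made only for this illustration.

```python
def knn_lists(dist, k):
    """Return, for each point, the set of its k nearest neighbors (itself excluded).

    dist is a square distance matrix given as a list of lists.
    """
    n = len(dist)
    lists = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        lists.append(set(others[:k]))
    return lists


def snn_weight(knn, a, b):
    """Link weight of a and b: the number of shared neighbors,
    counted only if a and b appear in each other's k-nearest neighbor lists."""
    if a in knn[b] and b in knn[a]:
        return len(knn[a] & knn[b])
    return 0


# Hypothetical toy data: five points on a line, the last one far away.
positions = [0, 1, 2, 3, 10]
dist = [[abs(p - q) for q in positions] for p in positions]
knn = knn_lists(dist, k=3)

print(snn_weight(knn, 0, 1))  # points 0 and 1 list each other and share two neighbors
print(snn_weight(knn, 3, 4))  # point 4 is not in the list of point 3, so there is no link
```

The isolated point at position 10 lists its three closest points as neighbors, but none of them lists it back, so it receives no links. This is the effect of the mutual-neighbor condition.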
Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu [9] demonstrated that the DBSCAN clustering approach finds clusters of arbitrary shapes and sizes, but it cannot work with clusters of differing densities, because its density-based definition of core points cannot identify the core points of clusters of varying density. In DBSCAN, the user defines the neighborhood of a point by giving a particular radius, and the algorithm then looks for core points (core objects). Points that satisfy the core-point condition are selected as core points, every point connected to a core point belongs to the same cluster, and the remaining points are marked as noise.

Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim [10] presented CURE (Clustering Using REpresentatives), which uses representative points to find non-globular clusters. One problem with the CURE approach is that it cannot handle many types of globular shapes. This is a consequence of how CURE finds representative points: it finds points along the boundary and then shrinks those points towards the center of the cluster.

George Karypis, Eui-Hong Han, and Vipin Kumar [12] observed that while DBSCAN uses the notion of core points and CURE uses representative points, Chameleon uses neither core points nor representative points explicitly. All three approaches (DBSCAN, CURE, and Chameleon) share the common challenge of finding clusters of different shapes and sizes. The main aim of these three approaches is to find points or subsets of points and then construct clusters around them. The Chameleon approach is important for spatial data, since non-globular clusters cannot be represented by their centroids and centroid-based schemes therefore cannot handle them [12]. When using the DBSCAN, CURE, and Chameleon approaches, considerable attention must also be given to the handling of noise and outliers.

Anna Huang [2] evaluated the effects of many similarity functions on the k-means clustering algorithm.
Kazem Taghva and Rushikesh Veni [3] compared and analyzed the effectiveness of these measures in partitional clustering for text document data sets. In this paper, we describe the SNN clustering approach with four different similarity measure functions and compare the effects of these similarity measures on it.

4. SNN CLUSTERING APPROACH
Shared Nearest Neighbor (SNN) [1] is one of the most important and most common clustering approaches in the engineering and scientific literature, and it can produce clusters of different size, shape, and density. The SNN approach, like the DBSCAN approach [9], is a density-based clustering approach. The main difference between them is that SNN handles clusters of varying density, whereas DBSCAN does not. SNN defines the similarity between two points by the number of nearest neighbors that they share. Using this similarity measure, the density of a point is defined as the sum of the similarities of its nearest neighbors. High-density points become core points, and low-density points become noise points. All other points that are strongly similar to particular core points are drawn into the clusters of those core points. The SNN clustering approach [1] can be explained as follows.

1. Compute the similarity matrix: This corresponds to a similarity graph with data points as nodes and edges whose weights are the similarities between data points.
2. Sparsify the similarity matrix: This involves keeping only the k most similar neighbors of each data point, which corresponds to keeping only the k strongest links of the similarity graph.
3. Construct the shared nearest neighbor graph: The SNN graph is obtained from the sparsified similarity matrix. At this point, we could apply a similarity threshold and find the connected components to obtain the clusters (Jarvis-Patrick algorithm).
4. Find the SNN density of each point: For each point, the data points having an SNN similarity to it greater than or equal to Eps are counted.
5. Find the core points: All points that have an SNN density greater than MinPts are designated as core points.
6. Form clusters from the core points: If two core points are within a radius, Eps, of each other, they are placed in the same cluster.
7. Discard all noise points: All non-core points that are not within a radius of Eps of a core point are discarded.
8. Assign all non-noise, non-core points to clusters: Each of these points is assigned to the nearest cluster.

The inputs and the corresponding output of the SNN clustering approach are as follows.

Input:
D - Data set
k - Maximum number of nearest neighbors of each point
Eps - Density threshold (radius of cluster)
MinPts - Core point threshold

Output:
K - A set of clusters

In this paper, we used four different similarity measure functions for calculating the similarity matrix and compared the resulting similarity graphs and clusters. The similarity measure functions are the Euclidean, Cosine, Jaccard, and Correlation functions.

The SNN clustering approach has many good characteristics. First, it does not cluster all the points. In general, this is good, because much of the data is noise and needs to be removed. If a complete clustering is desired, the unclustered data can be inserted into the core clusters discovered by the SNN clustering approach by assigning each point to the cluster containing the closest representative point. Second, the approach is essentially partitional, although we have experimented with creating a hierarchy of clusters. Finally, the time complexity is O(n^2), where n is the number of points, because the similarity matrix has to be computed [1] [4].

5. EXPERIMENTAL SETUP
We used several types of data sets, including synthetic test data sets, KDD Cup 99, the Mushroom data set, and some randomly generated data sets, to describe the effects of four different similarity measure functions on the Shared Nearest Neighbor (SNN) clustering approach. All experiments were performed with MATLAB 2010a (MATLAB 7.10). Here, for experimentation, we use a 2D data set containing 107 data points, as shown in Figure 1.
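Steps 4 to 8 of the approach in Section 4, on which the experiments rely, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the MATLAB code used for the experiments: the function name `snn_cluster`, the label -1 for noise, the exclusion of a point from its own density count, and the toy SNN matrix are all hypothetical choices.

```python
def snn_cluster(snn, eps, min_pts):
    """Label points from an SNN similarity matrix (steps 4 to 8).

    snn[i][j] is the SNN similarity of points i and j (for example a
    shared-neighbor count). Returns one label per point: a cluster id
    starting at 0, or -1 for noise.
    """
    n = len(snn)
    # Step 4: SNN density = number of other points with similarity >= eps.
    density = [sum(1 for j in range(n) if j != i and snn[i][j] >= eps)
               for i in range(n)]
    # Step 5: core points have a density greater than min_pts.
    core = [i for i in range(n) if density[i] > min_pts]
    labels = [-1] * n
    # Step 6: core points within eps of each other end up in one cluster.
    next_id = 0
    for c in core:
        if labels[c] != -1:
            continue
        labels[c] = next_id
        stack = [c]
        while stack:
            u = stack.pop()
            for v in core:
                if labels[v] == -1 and snn[u][v] >= eps:
                    labels[v] = next_id
                    stack.append(v)
        next_id += 1
    # Steps 7 and 8: a non-core point joins the cluster of its most similar
    # core point if that similarity reaches eps; otherwise it stays noise.
    core_set = set(core)
    for p in range(n):
        if p in core_set or not core:
            continue
        best = max(core, key=lambda c: snn[p][c])
        if snn[p][best] >= eps:
            labels[p] = labels[best]
    return labels


# Hypothetical SNN matrix: two tight groups, one border point, one isolated point.
n = 8
snn = [[0] * n for _ in range(n)]
for i, j in [(0, 1), (0, 2), (1, 2), (4, 5), (4, 6), (5, 6)]:
    snn[i][j] = snn[j][i] = 3
snn[3][0] = snn[0][3] = 2  # point 3 is similar to a single core point

print(snn_cluster(snn, eps=2, min_pts=1))
```

In this toy matrix, points 0 to 2 and 4 to 6 become core points of two clusters, point 3 is attached to the first cluster as a non-core point, and point 7 remains noise.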
We compute each result shown here with the input parameters k = 7, Eps = 4 and MinPts = 5.

Fig 1: 2D Data Set

6. RESULT AND ANALYSIS
From the data set shown in Figure 1, we first compute the similarity matrix using the Euclidean, Cosine, Jaccard, and Correlation similarity measure functions and construct the sparsified similarity graph based on the k-nearest neighbor criterion. The similarity graphs generated by the different similarity measure functions are shown in Figure 2.

2(a) Similarity graph generated by the Euclidean function
2(b) Similarity graph generated by the Cosine function
2(c) Similarity graph generated by the Jaccard function
2(d) Similarity graph generated by the Correlation function
Fig 2: Similarity graphs generated by different similarity functions

3(a) Clusters generated by the Euclidean function
3(b) Clusters generated by the Cosine function
3(c) Clusters generated by the Jaccard function
3(d) Clusters generated by the Correlation function
Fig 3: Clusters constructed by different similarity functions

Similarity matrix calculation is the most important part of the SNN clustering approach. The differences between the similarity graphs are evident from Figure 2.
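To make the similarity matrix and sparsification steps concrete, the four measures and the k-nearest neighbor selection can be sketched in plain Python. This is an illustrative sketch, not the MATLAB code used in the experiments. The function names and the `larger_is_closer` flag are assumptions introduced here: the flag is needed because a smaller Euclidean value means closer points, whereas larger cosine, Jaccard, and correlation values mean more similar points. The sketch also assumes non-zero vectors, and non-constant vectors for the correlation.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def euclidean(x, y):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine(x, y):
    # Cosine of the angle between x and y.
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


def jaccard(x, y):
    # Extended Jaccard coefficient for real-valued vectors.
    return dot(x, y) / (dot(x, x) + dot(y, y) - dot(x, y))


def pearson(x, y):
    # Pearson correlation coefficient over the n components of x and y.
    n = len(x)
    sx, sy = sum(x), sum(y)
    num = n * dot(x, y) - sx * sy
    den = math.sqrt((n * dot(x, x) - sx ** 2) * (n * dot(y, y) - sy ** 2))
    return num / den


def sparsify(points, measure, k, larger_is_closer=True):
    """Keep, for each point, the indices of its k closest neighbors."""
    n = len(points)
    graph = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: measure(points[i], points[j]),
                    reverse=larger_is_closer)
        graph.append(others[:k])
    return graph


x, y = [1, 2, 3], [2, 4, 7]
print(euclidean(x, y), cosine(x, y), jaccard(x, y), pearson(x, y))

points = [(0, 0), (1, 0), (0, 1), (5, 5)]
print(sparsify(points, euclidean, k=2, larger_is_closer=False))
```

One point worth noting for 2D data: the Pearson correlation of two vectors with only two components can only be +1, -1, or undefined, so the example above uses three-component vectors for the correlation.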

After constructing the similarity graph, we generate the SNN graph and, by applying the user-specified criteria Eps and MinPts to this SNN graph, we compute the core, non-core, and noise points. The clusters of core, non-core, and noise points obtained with the different similarity functions are shown in Figure 3. In Figure 3, X depicts a core point, a dot (.) a non-core point, and a star (*) a noise point. We compared the clusters constructed with the different similarity functions by their accuracy in generating clusters of core points. We observed the following:

1. Clusters constructed by the Jaccard and Cosine functions had no or very few noise points, the Euclidean function had some noise points, and clusters constructed with the correlation function had many noise points, as shown in Figure 3.
2. In the SNN clustering approach, the Euclidean distance function performed better because the approach does not cluster all the points. Most of the remaining data points are noise and are hence removed.
3. If a complete clustering is desired, it can be obtained in the following two ways:
   a. Using the Euclidean distance function, the unclustered data can be inserted into the core clusters discovered by the SNN clustering approach by assigning each point to the cluster containing the closest representative point.
   b. Using the Jaccard or Cosine distance function, the clusters can be constructed directly with the SNN clustering approach.
4. We observed that the generation of core, non-core, and noise points depends on the data points included in the data set and on the user-specified criteria k, Eps, and MinPts.
5. If some points are clustered and others are removed as noise according to the specified criteria, the clustering process runs faster.

7. CONCLUSION AND FUTURE WORK
In this paper, we analyzed the impact of different similarity computation functions on the SNN clustering approach and compared the resulting similarity graphs and clusters. From the above results, we can infer that the SNN clustering approach with the Euclidean similarity measure function provides better and faster results than the other distance functions described here.
In the future, we hope to analyze the impact of other similarity measure functions on various popular clustering techniques.

8. REFERENCES
[1] Levent Ertoz, Michael Steinbach, Vipin Kumar, Finding Clusters of Different Sizes, Shapes, and Density in Noisy, High Dimensional Data, Second SIAM International Conference on Data Mining, San Francisco, CA, USA, 2003.
[2] Anna Huang, Similarity Measures for Text Document Clustering, NZCSRSC 2008, April 2008, Christchurch, New Zealand.
[3] Kazem Taghva and Rushikesh Veni, Effects of Similarity Metrics on Document Clustering, 2010 Seventh International Conference on Information Technology.
[4] R. A. Jarvis and E. A. Patrick, Clustering Using a Similarity Measure Based on Shared Nearest Neighbors, IEEE Transactions on Computers, Vol. C-22, No. 11, November 1973.
[5] M. R. Anderberg, Cluster Analysis for Applications, Academic Press, New York, 1973.
[6] Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, USA, 2001, ISBN 1558604898.
[7] Lori Bowen Ayre, Data Mining for Information Professionals, 2006.
[8] Arun K Pujari, Data Mining Techniques, Second Edition, Universities Press.
[9] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, KDD '96, Portland, OR, pp. 226-231, 1996.
[10] Sudipto Guha, Rajeev Rastogi, Kyuseok Shim, CURE: An Efficient Clustering Algorithm for Large Databases, ACM, 1998.
[11] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim, ROCK: A Robust Clustering Algorithm for Categorical Attributes, In Proceedings of the 15th International Conference on Data Engineering, 1998.
[12] George Karypis, Eui-Hong Han, and Vipin Kumar, CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling, IEEE Computer, Vol. 32, No. 8, pp. 68-75, August 1999.