Survey of Cluster Analysis and its Various Aspects


International Journal of Computer Science and Mobile Computing
A Monthly Journal of Computer Science and Information Technology
IJCSMC, Vol. 4, Issue 10, October 2015, pg. 353-363, available online at www.ijcsmc.com
SURVEY ARTICLE    ISSN 2320-088X

Harminder Kaur, Jaspreet Singh
CSE & RIMT Mandi Gobindgarh, India; CSE & LCET Katani Kalan, India
harminder.mann@ymail.com; sprtsekhon@gmail.com

Abstract: Cluster analysis, or clustering, is a technique for storing logically similar objects together physically; these physical groupings are referred to as classes. The data available as input for clustering can be of various types, e.g. images, text, etc., and the process is carried out by different algorithms such as k-means, fuzzy c-means, etc. This paper surveys the main aspects of cluster analysis: the types of clusters, the types of data, and the clustering methods available. The aim is to give researchers who want to work in this area the basics of clustering in a single paper.

Keywords: Clustering; types of cluster; k-means; DBSCAN; Grid based

I. Introduction

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense or another, to each other than to those in other groups (clusters) [1]. It is a main task of exploratory data mining and a common technique for statistical data analysis, used in many fields including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. Cluster analysis itself is not one specific algorithm but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to find clusters efficiently. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals, or particular statistical distributions. Clustering

can therefore be formulated as a multi-objective optimization problem [3]. The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold, or the number of expected clusters) depend on the individual data set and the intended use of the results. Cluster analysis as such is not an automatic task but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure [2],[4]. It will often be necessary to modify data preprocessing and model parameters until the result achieves the desired properties.

II. Types of Clusters

a) Well separated clusters: A cluster is a collection of points that are similar to each other and dissimilar to points outside the cluster.

b) Center based clusters: The centroid is the measure of the center of a cluster. In a center based cluster, each object is closer to the centroid of its own cluster than to the centroid of any other cluster.

c) Contiguous clusters: Clusters in which each object is closer to one or more of its neighbors in the same cluster than to any object outside the cluster.

d) Density based clusters: A cluster is a dense region of points that is separated from other regions of high density by regions of low density. This notion is used when the clusters are irregular or intertwined and when noise and outliers are present.

e) Shared property or conceptual clusters: Clusters whose members share some common property or together represent a particular concept.

III. Types of Data in Cluster Analysis

- Interval-scaled variables
- Binary variables
- Nominal, ordinal, and ratio variables
- Variables of mixed types

a) Interval-scaled variables

To standardize the data for a variable f measured on n objects with values x_{1f}, x_{2f}, ..., x_{nf}:

- Calculate the mean: m_f = (1/n)(x_{1f} + x_{2f} + ... + x_{nf})
- Calculate the mean absolute deviation: s_f = (1/n)(|x_{1f} - m_f| + |x_{2f} - m_f| + ... + |x_{nf} - m_f|)
- Calculate the standardized measurement (z-score): z_{if} = (x_{if} - m_f) / s_f

Using the mean absolute deviation is more robust than using the standard deviation.
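As a concrete companion to the standardization formulas above, the following minimal Python sketch (our own illustration; the function name standardize and the sample values are assumptions, not from the paper) computes the mean, the mean absolute deviation, and the resulting z-scores for a single interval-scaled variable.

```python
# Standardization of one interval-scaled variable using the mean
# absolute deviation, as described above (illustrative sketch).

def standardize(values):
    n = len(values)
    m_f = sum(values) / n                          # mean of the variable
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    return [(x - m_f) / s_f for x in values]       # z-scores

# Example: heights in centimetres (hypothetical data)
print(standardize([150.0, 160.0, 170.0, 180.0, 190.0]))
```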

Similarity and Dissimilarity Between Objects

Distances are normally used to measure the similarity or dissimilarity between two data objects. A popular choice is the Minkowski distance:

d(i,j) = (|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + ... + |x_{ip} - x_{jp}|^q)^{1/q}

where i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}) are two p-dimensional data objects and q is a positive integer.

If q = 1, d is the Manhattan distance: d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + ... + |x_{ip} - x_{jp}|

If q = 2, d is the Euclidean distance: d(i,j) = sqrt(|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + ... + |x_{ip} - x_{jp}|^2)

Properties:
- d(i,j) >= 0
- d(i,i) = 0
- d(i,j) = d(j,i)
- d(i,j) <= d(i,k) + d(k,j)

One can also use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures, e.g. for attributes such as salary or height.
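The Minkowski family is straightforward to evaluate directly. The short Python sketch below is our own illustration (the helper name minkowski and the sample points are assumptions); q = 1 yields the Manhattan distance and q = 2 the Euclidean distance.

```python
# Minkowski distance between two p-dimensional points (illustrative sketch).

def minkowski(x, y, q):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

p1, p2 = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(p1, p2, 1))   # Manhattan distance: 7.0
print(minkowski(p1, p2, 2))   # Euclidean distance: 5.0
```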

b) Binary Variables

i) A contingency table for binary data (object i against object j):

                object j = 1   object j = 0   sum
object i = 1        a              b          a+b
object i = 0        c              d          c+d
sum                a+c            b+d          p

ii) Distance measure for symmetric binary variables: d(i,j) = (b + c) / (a + b + c + d)

iii) Distance measure for asymmetric binary variables: d(i,j) = (b + c) / (a + b + c)

iv) Jaccard coefficient (similarity measure for asymmetric binary variables): sim_Jaccard(i,j) = a / (a + b + c)

v) Example: dissimilarity between binary variables

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

Gender is a symmetric attribute; the remaining attributes are asymmetric binary. Let the values Y and P be set to 1 and the value N be set to 0. Using only the asymmetric attributes:

d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
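The counts a, b and c and the asymmetric dissimilarities of this example can be reproduced in a few lines. The Python sketch below is our own encoding of the Jack/Mary/Jim table (Y and P mapped to 1, N to 0, Gender excluded as a symmetric attribute); it recovers the three values computed above.

```python
# Asymmetric binary dissimilarity d(i,j) = (b + c) / (a + b + c)
# for the Jack/Mary/Jim example (Y and P coded as 1, N as 0).

def asymmetric_binary(i, j):
    a = sum(1 for u, v in zip(i, j) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(i, j) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(i, j) if u == 0 and v == 1)
    return (b + c) / (a + b + c)

# Attribute order: Fever, Cough, Test-1, Test-2, Test-3, Test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(asymmetric_binary(jack, mary), 2))  # 0.33
print(round(asymmetric_binary(jack, jim), 2))   # 0.67
print(round(asymmetric_binary(jim, mary), 2))   # 0.75
```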

c) Nominal Variables

A generalization of the binary variable in that it can take more than two states, e.g., red, yellow, blue, green.

Method 1: Simple matching: d(i,j) = (p - m) / p, where m is the number of matches and p is the total number of variables.
Method 2: Use a large number of binary variables, creating a new binary variable for each of the M nominal states.

d) Ordinal Variables

An ordinal variable can be discrete or continuous. Order is important (e.g., rank), and such a variable can be treated like an interval-scaled one:
- replace x_{if} by its rank r_{if} in {1, ..., M_f};
- map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by z_{if} = (r_{if} - 1) / (M_f - 1);
- compute the dissimilarity using the methods for interval-scaled variables.

e) Ratio-Scaled Variables

A ratio-scaled variable is a positive measurement on a nonlinear, approximately exponential scale, such as Ae^{Bt} or Ae^{-Bt}. One method is to treat such variables like interval-scaled variables, but this is not a good choice because the scale can be distorted. A better option is to apply a logarithmic transformation y_{if} = log(x_{if}) and treat the result as continuous ordinal data, treating the ranks as interval-scaled.
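The ordinal and ratio-scaled treatments above reduce to two small transformations; the following Python sketch (our own, purely illustrative, with hypothetical function names) maps a rank onto [0, 1] and applies the logarithmic transformation for a ratio-scaled value.

```python
import math

# Map the rank r_if in {1, ..., M_f} of an ordinal variable onto [0, 1],
# and apply the logarithmic transformation suggested for ratio-scaled data.

def ordinal_to_unit_interval(rank, m_f):
    return (rank - 1) / (m_f - 1)

def ratio_log_transform(x_if):
    return math.log(x_if)

print(ordinal_to_unit_interval(2, 4))   # rank 2 of 4 states -> 0.333...
print(ratio_log_transform(100.0))       # log-transformed ratio-scaled value
```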

f) Variables of Mixed Types

A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio. One may use a weighted formula to combine their effects:

d(i,j) = ( Σ_{f=1}^{p} δ_{ij}^{(f)} d_{ij}^{(f)} ) / ( Σ_{f=1}^{p} δ_{ij}^{(f)} )

where δ_{ij}^{(f)} is an indicator that is 0 when the measurement for variable f is missing for object i or j, and 1 otherwise. The per-variable contribution d_{ij}^{(f)} is computed as follows:
- if f is binary or nominal: d_{ij}^{(f)} = 0 if x_{if} = x_{jf}, and d_{ij}^{(f)} = 1 otherwise;
- if f is interval-based: use the normalized distance;
- if f is ordinal or ratio-scaled: compute the ranks r_{if}, set z_{if} = (r_{if} - 1) / (M_f - 1), and treat z_{if} as interval-scaled.

IV. Different Techniques of Data Clustering

a) Partitional or Representative-based Clustering

Given a dataset with n points in a d-dimensional space, D = {x_i}_{i=1}^{n}, and given the number of desired clusters k, the goal of representative-based clustering is to partition the dataset into k groups or clusters, called a clustering and denoted C = {C_1, C_2, ..., C_k}. Further, for each cluster C_i there exists a representative point that summarizes the cluster, a common choice being the mean (also called the centroid) of all points in the cluster,

μ_i = (1/n_i) Σ_{x_j ∈ C_i} x_j

where n_i = |C_i| is the number of points in cluster C_i. A brute-force or exhaustive algorithm for finding a good clustering would simply generate all possible partitions of the n points into k clusters, evaluate some optimization score for each of them, and retain the clustering that yields the best score; enumerating all clusterings in this way is not practically feasible. In this section we describe the most widely used representative-based approach, the K-means algorithm [4].

The K-means algorithm, probably the first clustering algorithm proposed, is based on a very simple idea: given a set of initial clusters, assign each point to one of them, then replace each cluster center by the mean of the points in the respective cluster; these two simple steps are repeated until convergence [5]. A point is assigned to the cluster whose center is closest to it in Euclidean distance. Although K-means has the great advantage of being easy to implement, it has two big drawbacks. First, it can be slow, since in each step the distance between every point and every cluster center has to be calculated, which can be expensive for a large dataset. Second, the method is very sensitive to the initial clusters provided; in recent years, however, this problem has been addressed with some degree of success [5].

(Figure: (A) Original Points; (B) Partitioning Clustering)
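To make the two alternating K-means steps concrete, here is a compact, self-contained Python sketch. It is our own illustrative implementation on made-up two-dimensional data, not the authors' code: each point is assigned to the nearest centroid in Euclidean distance, each centroid is then replaced by the mean of its cluster, and the loop stops when the assignment no longer changes.

```python
import random

# Minimal K-means sketch: assign each point to its nearest centroid,
# then replace each centroid by the mean of its cluster; repeat until
# the assignment stops changing. Illustrative only.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(points, k, max_iter=100):
    centroids = random.sample(points, k)          # initial cluster centers
    assignment = None
    for _ in range(max_iter):
        # Step 1: assign every point to the closest centroid.
        new_assignment = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:          # converged
            break
        assignment = new_assignment
        # Step 2: recompute each centroid as the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers, labels = kmeans(data, k=2)
print(centers, labels)
```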

b) Hierarchical Clustering

Given n points in a d-dimensional space, the goal of hierarchical clustering is to create a sequence of nested partitions, which can be conveniently visualized via a tree or hierarchy of clusters, also called the cluster dendrogram. The clusters in the hierarchy range from fine-grained to coarse-grained: the lowest level of the tree (the leaves) consists of each point in its own cluster, whereas the highest level (the root) consists of all points in one cluster. Both of these may be considered trivial clusterings. At some intermediate level we may find meaningful clusters; if the user supplies k, the desired number of clusters, we can choose the level at which there are k clusters [6-8]. There are two main algorithmic approaches to mining hierarchical clusters: agglomerative and divisive. Agglomerative strategies work in a bottom-up manner: starting with each of the n points in a separate cluster, they repeatedly merge the most similar pair of clusters until all points are members of the same cluster. Divisive strategies do just the opposite, working in a top-down manner: starting with all the points in the same cluster, they recursively split the clusters until all points are in separate clusters.

i) Number of Hierarchical Clusterings

The number of different nested or hierarchical clusterings corresponds to the number of different binary rooted trees (dendrograms) with n leaves carrying distinct labels [5]. Any tree with t nodes has t - 1 edges, and any rooted binary tree with m leaves has m - 1 internal nodes. Thus, a dendrogram with m leaf nodes has a total of t = m + (m - 1) = 2m - 1 nodes and consequently t - 1 = 2m - 2 edges.

ii) Agglomerative Hierarchical Clustering

In agglomerative hierarchical clustering, we begin with each of the n points in a separate cluster and repeatedly merge the two closest clusters until all points are members of the same cluster. Formally, given a set of clusters C = {C_1, C_2, ..., C_m}, we find the closest pair of clusters C_i and C_j and merge them into a new cluster C_{ij} = C_i ∪ C_j. Next, we update the set of clusters by removing C_i and C_j and adding C_{ij}, that is, C = (C \ {C_i, C_j}) ∪ {C_{ij}}. We repeat the process until C contains only one cluster. Because the number of clusters decreases by one in each step, this process results in a sequence of n nested clusterings. If specified, we can stop the merging process when exactly k clusters remain (a minimal code sketch of this merge loop is given after subsection iii below).

iii) Divisive Hierarchical Clustering

Divisive clustering starts with a single cluster that contains all data points and recursively splits the most appropriate cluster. The process repeats until a stopping criterion (frequently, the requested number k of clusters) is achieved.
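The merge loop of agglomerative clustering described in ii) can be sketched in a few lines. The Python fragment below is our own illustration on small one-dimensional data, using single linkage as the inter-cluster distance; it repeatedly merges the closest pair of clusters until k clusters remain.

```python
# Agglomerative clustering sketch: start with every point in its own
# cluster and repeatedly merge the closest pair until k clusters remain.
# Single linkage is used as the inter-cluster distance. Illustrative only.

def single_linkage(c1, c2):
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerative(points, k):
    clusters = [[p] for p in points]              # each point starts as its own cluster
    while len(clusters) > k:
        # find the closest pair of clusters C_i, C_j
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]   # C_ij = C_i U C_j
        del clusters[j]                           # C = (C \ {C_i, C_j}) U {C_ij}
    return clusters

print(agglomerative([1.0, 1.1, 1.2, 5.0, 5.1, 9.0], k=3))
```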

iv) Linkage Variations

Among the most used variations of hierarchical clustering, based on different distance measures, are:

1) Average linkage clustering: The dissimilarity between clusters is calculated using average values. The average distance is calculated from the distances between each point in one cluster and all points in the other cluster; the two clusters with the lowest average distance are joined to form the new cluster.
2) Centroid linkage clustering: This variation uses the group centroid as the average; the centroid is defined as the center of a cloud of points.
3) Complete linkage clustering (maximum or furthest-neighbor method): The dissimilarity between groups is equal to the greatest dissimilarity between a member of cluster i and a member of cluster j. This method tends to produce very tight clusters of similar cases.
4) Single linkage clustering (minimum or nearest-neighbor method): The dissimilarity between clusters is the minimum dissimilarity between members of the two clusters. This method produces long chains which form loose, straggly clusters.
5) Ward's method: Cluster membership is assigned by calculating the total sum of squared deviations from the mean of a cluster. The criterion for fusion is that it should produce the smallest possible increase in the error sum of squares [9].

c) Density Based Algorithms

Clusters can be thought of as regions of high density separated by regions of no or low density, where density is the number of data objects in the neighborhood. The most popular density-based algorithm is probably DBSCAN (Density-Based Spatial Clustering of Applications with Noise). The algorithm finds, for each object, the neighborhood that contains a minimum number of objects; considering all points whose neighborhood satisfies this condition, a cluster is defined as the set of all points transitively connected by their neighborhoods. DBSCAN finds arbitrary-shaped clusters while at the same time not being sensitive to the input order. Besides, it is incremental, since every newly inserted point only affects a certain neighborhood [9],[10]. On the other hand, it requires the user to specify the radius of the neighborhood and the minimum number of objects it should contain, and these optimal parameters are difficult to determine.

i) Computational Complexity

The main cost in DBSCAN is computing the ε-neighborhood for each point. If the dimensionality is not too high this can be done efficiently using a spatial index structure in O(n log n) time; when the dimensionality is high, it takes O(n^2) to compute the neighborhoods. Once N_ε(x) has been computed, the algorithm needs only a single pass over all the points to find the density-connected clusters, so the overall complexity of DBSCAN is O(n^2) in the worst case. The major features of this algorithm are that it discovers clusters of arbitrary shape and can handle noisy data in a single scan. Several interesting studies on density-based algorithms are DBSCAN, GDBSCAN, OPTICS, DENCLUE and CLIQUE. The two global parameters are Eps, the maximum radius of the neighbourhood, and MinPts, the minimum number of points in an Eps-neighbourhood of a point.

ii) Density Reachability and Density Connectivity

Density reachability: A point "p" is said to be density-reachable from a point "q" if "p" is within distance ε of "q" and "q" has a sufficient number of points in its neighbourhood that are within distance ε.

Density connectivity: Points "p" and "q" are said to be density-connected if there exists a point "r" which has a sufficient number of points in its neighbourhood and both "p" and "q" are within distance ε of it. This is a chaining process: if "q" is a neighbour of "r", "r" is a neighbour of "s", and "s" is a neighbour of "t", which in turn is a neighbour of "p", then "q" is connected to "p".

iii) Density Based Spatial Clustering of Applications with Noise (DBSCAN)

DBSCAN relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points. In spatial databases it discovers clusters of arbitrary shape in the presence of noise.

iv) Steps Involved in DBSCAN (a simplified code sketch follows at the end of this subsection)

- Arbitrarily select a point p.
- Retrieve all the points density-reachable from p with respect to Eps and MinPts.
- If p is a core object, a cluster is formed.
- If p is a border object, no points are density-reachable from p, and DBSCAN visits the next point of the database.
- Continue the process until all of the points have been processed.

Core object: an object with at least MinPts objects within its Eps-neighborhood.
Border object: an object that lies on the border of a cluster.

v) Pros and Cons of Density-Based Algorithms

The main advantages of density-based clustering are that it does not require an a-priori specification of the number of clusters and that it can identify noisy data while clustering. It fails on "neck"-type datasets and does not work well with high-dimensional data.
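The DBSCAN procedure outlined above (Eps region queries, expansion from core objects, noise labelling) can be sketched compactly. The Python fragment below is our own simplified illustration on made-up 2-D points, not a reference implementation of the original algorithm.

```python
# Simplified DBSCAN sketch: points with at least min_pts neighbours within
# eps are core objects; clusters are grown by expanding density-reachable
# points; remaining points are labelled as noise (-1). Illustrative only.

def region_query(points, idx, eps):
    x, y = points[idx]
    return [j for j, (a, b) in enumerate(points)
            if ((a - x) ** 2 + (b - y) ** 2) ** 0.5 <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)          # None = unvisited, -1 = noise
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = -1                 # border point or noise for now
            continue
        labels[i] = cluster_id             # i is a core object: start a cluster
        seeds = list(neighbours)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster_id     # previously noise -> border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:   # j is also a core object
                seeds.extend(j_neighbours)
        cluster_id += 1
    return labels

data = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (20, 20)]
print(dbscan(data, eps=1.5, min_pts=3))    # [0, 0, 0, 0, 1, 1, 1, -1]
```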

d) Grid Based Algorithms

The grid-based clustering approach uses a multi-resolution grid data structure. It quantizes the object space into a finite number of cells that form a grid structure on which all of the operations for clustering are performed. The grid-based approach differs from conventional clustering algorithms in that it is concerned not with the data points but with the value space that surrounds the data points. In general, a typical grid-based clustering algorithm consists of the following five basic steps (a toy code sketch is given at the end of this subsection):

- Creating the grid structure, i.e., partitioning the data space into a finite number of cells.
- Calculating the cell density for each cell.
- Sorting the cells according to their densities.
- Identifying cluster centers.
- Traversing neighbouring cells.

STING (A Statistical Information Grid Approach)

In STING the spatial area is divided into rectangular cells, with several levels of cells corresponding to different levels of resolution [11]. Each cell at a high level is partitioned into a number of smaller cells at the next lower level. The statistical information of each cell is calculated and stored beforehand and is used to answer queries; the parameters of a higher-level cell can easily be calculated from the parameters of its lower-level cells. STING uses a top-down approach to answer spatial data queries [10]. Its merits and demerits are as follows.

Merits:
- Fast processing time.
- Good cluster quality.
- No distance computations.
- Clustering is performed on summaries and not on individual objects; the complexity is usually O(number of populated grid cells) rather than O(number of objects).
- It is easy to determine which clusters are neighbouring.

Demerits:
- Shapes are limited to unions of grid cells; all cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected.
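A toy version of the five grid-based steps listed earlier in this subsection can be written in a few lines. The Python sketch below is our own simplified illustration (square cells, a density threshold, and merging of neighbouring dense cells); it is not STING itself.

```python
from collections import defaultdict

# Toy grid-based clustering sketch: bin 2-D points into square cells,
# keep cells whose density meets a threshold, and merge neighbouring
# dense cells into clusters. A simplified illustration, not STING.

def grid_clusters(points, cell_size, density_threshold):
    # Steps 1-2: create the grid structure and compute cell densities.
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    dense = {c for c, members in cells.items() if len(members) >= density_threshold}

    # Steps 3-5 (simplified): traverse neighbouring dense cells and
    # group them into clusters via a flood fill over the grid.
    clusters, seen = [], set()
    for cell in dense:
        if cell in seen:
            continue
        stack, cluster = [cell], []
        while stack:
            cx, cy = stack.pop()
            if (cx, cy) in seen or (cx, cy) not in dense:
                continue
            seen.add((cx, cy))
            cluster.extend(cells[(cx, cy)])
            stack.extend((cx + dx, cy + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1))
        clusters.append(cluster)
    return clusters

data = [(0.1, 0.2), (0.3, 0.4), (0.2, 0.1), (5.1, 5.2), (5.3, 5.1), (5.2, 5.4)]
print(grid_clusters(data, cell_size=1.0, density_threshold=2))
```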

V. Conclusion

The overall goal of the data mining process is to extract information from a large data set and transform it into an understandable form for further use. Clustering is an important task in data analysis and data mining applications: it is the task of grouping a set of objects so that objects in the same group are more similar to each other than to those in other groups (clusters). Clustering can be performed by different families of algorithms, such as hierarchical, partitioning-based, grid-based and density-based algorithms. Hierarchical clustering is connectivity-based clustering, while partitioning-based algorithms perform centroid-based clustering. Density-based clusters are defined as areas of higher density than the remainder of the data set. Grid-based clustering partitions the space into a finite number of cells that form a grid structure on which all of the clustering operations are performed.

REFERENCES

[1] Pavel Berkhin, "A Survey of Clustering Data Mining Techniques", pp. 25-71, 2002.
[2] Amandeep Kaur Mann, "Review Paper on Clustering Techniques", Global Journal of Computer Science and Technology: Software & Data Engineering.
[3] K. Kameshwaran et al., "Survey on Clustering Techniques in Data Mining", (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (2), 2014, pp. 2272-2276.
[4] Osama Abu Abbas, "Comparison Between Data Clustering Algorithms", The International Arab Journal of Information Technology, Vol. 5, No. 3, July 2008.
[5] Ramandeep Kaur and Gurjit Singh Bhathal, "A Survey of Clustering Techniques", International Journal on Computer Science and Engineering, Vol. 02, No. 09, 2010, pp. 2976-2980.
[6] Megha Gupta and Vishal Shrivastava, "Review of Various Techniques in Clustering", International Journal of Advanced Computer Research (ISSN (print): 2249-7277, ISSN (online): 2277-7970), Volume 3, Number 2, Issue 10, June 2013.
[7] Wael M.S. Yafooz, Siti Z.Z. Abidin, Nasiroh Omar and Rosenah A. Halim, "Dynamic Semantic Textual Document Clustering Using Frequent Terms and Named Entity", 2013 IEEE 3rd International Conference on System Engineering and Technology, 19-20 Aug. 2013, Shah Alam, Malaysia.
[8] Aastha Joshi and Rajneet Kaur, "A Review: Comparative Study of Various Clustering Techniques in Data Mining", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 3, Issue 3, March 2013.
[9] Manju Kaushik and Bhawana Mathur, "Comparative Study of K-Means and Hierarchical Clustering Techniques", International Journal of Hardware and Software Research in Engineering, Volume 2, Issue 6, June 2014.
[10] T. Soni Madhulatha, "An Overview on Clustering Methods", IOSR Journal of Engineering, Apr. 2012, Vol. 2(4), pp. 719-725.
[11] Pradeep Rai and Shubha Singh, "A Survey of Clustering Techniques", International Journal of Computer Applications (0975-8887), Volume 7, No. 12, October 2010.