International Journal of Computer Applications (0975 - 8887)
Volume 6 - No.9, September 2010

An Optimised Density Based Clustering Algorithm

J. Hencil Peter
Department of Computer Science
St. Xavier's College, Palayamkottai, India

A. Antonysamy
Department of Mathematics
St. Xavier's College, Kathmandu, Nepal

ABSTRACT
The DBSCAN [1] algorithm is popular in the data mining field because it can mine noise-free clusters of arbitrary shape in an elegant way. Because the original DBSCAN algorithm uses distance measures to compute the distance between objects, it consumes a great deal of processing time, and its computational complexity is $O(N^2)$. In this paper we propose a new algorithm to improve the performance of DBSCAN. Two existing algorithms, A Fast DBSCAN Algorithm [6] and Memory Effect in DBSCAN Algorithm [7], are combined in the new solution to speed up the computation as well as improve the quality of the output. Because the RegionQuery operation takes a long time to process the objects, only a few objects are considered for expansion, and the remaining missed border objects are handled differently during cluster expansion. The performance analysis and the cluster output show that the proposed solution is better than the existing algorithms.

Keywords
Optimised DBSCAN, Density Cluster, Optimised RegionQuery, RegionQuery.

1. INTRODUCTION
Data mining is a fast-growing field in which clustering plays a very important role. Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects [2]. Among the many algorithms proposed in the clustering field, DBSCAN is one of the most popular because of the high quality of its noise-free output clusters. Because the RegionQuery function of the original DBSCAN algorithm is very expensive in terms of time, we propose a solution that minimises the number of RegionQuery calls while still covering the maximum number of neighbours.
The RegionQuery of the seed objects selected by the Fast DBSCAN algorithm [6] has been improved to give better output in less time by using the memory effect in the DBSCAN algorithm [7]. The remaining objects in the border area are examined separately during cluster expansion, which is not done in the Fast DBSCAN algorithm. The new algorithm is therefore capable of better performance than the existing DBSCAN algorithms.

The rest of the paper is organised as follows. Section 2 gives a brief history of the related work in the same area. Section 3 introduces the original DBSCAN, and Section 4 explains the proposed algorithm. Section 5 shows the experimental results, and Section 6 presents the conclusion and the future work associated with this algorithm.

2. RELATED WORK
DBSCAN (Density Based Spatial Clustering of Applications with Noise) [1] is the basic clustering algorithm for mining clusters based on object density. In this algorithm, the number of objects within the neighbour region (Eps) is computed first. If the neighbour count is below the given threshold value, the object is marked as NOISE. Otherwise a new cluster is formed from the core object by finding the group of density-connected objects that is maximal w.r.t. density-reachability.

CHAMELEON [3] is a two-phase algorithm. It generates a k-nearest graph in the first phase, and a hierarchical clustering algorithm is used in the second phase to find the clusters by combining the sub-clusters.

The OPTICS [4] algorithm adapts the original DBSCAN algorithm to deal with clusters of varying density. It computes an ordering of the objects based on the reachability distance to represent the intrinsic hierarchical clustering structure. The valleys in the plot indicate the clusters, but the input parameter ξ is critical for identifying the valleys as ξ-clusters.

The DENCLUE [5] algorithm uses kernel density estimation.
The result of the density function gives the local density maxima, and these local density values are used to form the clusters. If the local density value is very small, the objects of the cluster are discarded as NOISE.

A Fast DBSCAN (FDBSCAN) algorithm [6] was devised to improve the speed of the original DBSCAN algorithm. The improvement is achieved by considering only a few selected representative objects inside a core object's neighbour region as seed objects for further expansion. This algorithm is therefore faster than the basic version of DBSCAN, but it suffers from a loss of result accuracy.

The MEDBSCAN [7] algorithm was proposed recently to improve the performance of DBSCAN without losing result accuracy. It uses three queues in total: the first stores the neighbours of the core object that lie within Eps distance, the second stores the neighbours of the core object that lie within 2*Eps distance, and the third is the seeds queue, which stores the unhandled objects for further expansion. This algorithm guarantees a notable performance improvement if the Eps value is not very sensitive.

Although the complexity of DBSCAN can be reduced to $O(N \log N)$ using spatial trees, constructing and organising the tree is extra effort, and the tree requires additional memory to hold the objects. The new algorithm achieves good performance while keeping the original computational complexity of $O(N^2)$.

3. INTRODUCTION TO DBSCAN ALGORITHM
The following definitions use a database D of points in a k-dimensional space S. To find the neighbours of an object that lie within the given radius (Eps), the Euclidean distance function dist(p, q) is used, where p and q are two objects. This function takes two objects and returns the distance between them.
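As a minimal sketch (not the authors' Visual C++ code), the Euclidean dist(p, q) and the brute-force radius query built on it might look as follows in Python. The names `region_query` and `is_core`, and the representation of objects as tuples, are assumptions made for illustration; `min_objs` stands for the threshold that the definitions below call MinObjs. Each query scans all N objects, which is where the overall $O(N^2)$ cost comes from.

```python
import math

def dist(p, q):
    """Euclidean distance between two k-dimensional points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def region_query(data, p, eps):
    """Brute-force Eps neighbourhood: every object within eps of p (p included)."""
    return [q for q in data if dist(p, q) <= eps]

def is_core(data, p, eps, min_objs):
    """Core-object test: the neighbourhood holds at least min_objs objects."""
    return len(region_query(data, p, eps)) >= min_objs

# Illustrative data: four objects on a unit square and one isolated object.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
print(region_query(points, (0, 0), 1.0))
print(is_core(points, (0, 0), 1.5, 4), is_core(points, (5, 5), 1.5, 4))
```

Whether an object counts as its own neighbour is a convention; the sketch includes it, as the set definition with dist(p, p) = 0 does.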
Definition 1: Eps Neighbourhood of an object p
The Eps Neighbourhood of an object p is referred to as NEps(p), defined as
$N_{Eps}(p) = \{ q \in D \mid dist(p, q) \le Eps \}$.

Definition 2: Core Object Condition
An object p is referred to as a core object if its neighbour count is >= the given threshold value (MinObjs), i.e. $|N_{Eps}(p)| \ge MinObjs$, where MinObjs is the minimum number of neighbour objects needed to satisfy the core object condition. In other words, if the number of neighbours of p within the Eps radius is >= MinObjs, p is a core object.

Definition 3: Directly Density Reachable Object
An object p is directly density reachable from another object q w.r.t. Eps and MinObjs if
$p \in N_{Eps}(q)$ and $|N_{Eps}(q)| \ge MinObjs$ (core object condition).

Definition 4: Density Reachable Object
An object p is density reachable from another object q w.r.t. Eps and MinObjs if there is a chain of objects $p_1, \ldots, p_n$ with $p_1 = q$ and $p_n = p$ such that $p_{i+1}$ is directly density reachable from $p_i$.

Definition 5: Density Connected Object
An object p is density connected to another object q if there is an object o such that both p and q are density reachable from o w.r.t. Eps and MinObjs.

Definition 6: Cluster
A cluster C is a non-empty subset of a database D w.r.t. Eps and MinObjs that satisfies the following conditions.
For every p and q, if $p \in C$ and q is density reachable from p w.r.t. Eps and MinObjs, then $q \in C$.
For every p and q with $p, q \in C$, p is density connected to q w.r.t. Eps and MinObjs.

Definition 7: Noise
An object that does not belong to any cluster is called noise.

The DBSCAN algorithm finds the Eps Neighbourhood of each object in the database during the clustering process. Before cluster expansion, any non-core object the algorithm finds is marked as NOISE. With a core object, the algorithm initiates a cluster, and the surrounding objects are added to a queue for further expansion. Each object in the queue is popped, and its Eps neighbours are found. When the popped object is a core object, all its neighbours are assigned the current cluster id, and its unprocessed neighbours are pushed onto the queue for further processing. This process is repeated until the queue is empty.

4. PROPOSED SOLUTION
A new algorithm is proposed in this paper to overcome the performance problem of density based clustering algorithms. In this algorithm, the number of RegionQuery calls is reduced, and some RegionQuery calls are made faster. To reduce the number of RegionQuery function calls, the approach of the FDBSCAN algorithm [6], which selects representative objects as seed objects during cluster expansion, is used in this solution, and this approach is justified theoretically by Lemmas 1 and 2 below. Because RegionQuery retrieves the neighbour objects that lie inside the Eps radius, the lemmas are stated for circles and can be used directly in the RegionQuery optimisation.

Lemma 1: The minimum number of identical circles required to cover the circumference of a circle of the same radius that passes through the centres of the other circles is three.

Proof: Let C and C1 be identical circles of radius r with centres at O and O1 respectively. Assume that C passes through the centre O1 of C1 and that C1 passes through the centre O of C. Let the circles intersect at P and Q.

Figure 1. Two identical circles intersecting with respect to the first circle's centre point.

Clearly, $OP = OQ = r$, $O_1P = O_1Q = r$ and $OO_1 = r$.
$\therefore \triangle O_1OP$ and $\triangle O_1QO$ are equilateral.
$\therefore \angle POO_1 = \angle QOO_1 = 60^\circ$.
$\therefore \angle POQ = \angle POO_1 + \angle QOO_1 = 120^\circ$.
Now the length of arc $PO_1Q = \frac{120}{360} \times 2\pi r = \frac{2\pi r}{3}$.
Thus an arc equal to 1/3 of the circumference of the given circle C is covered by C1.
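The fraction of C covered by C1 in this step can be checked numerically. The Python sketch below is an illustration only; `covered_fraction` is a hypothetical helper that samples points on C and counts those lying inside any circle of the same radius whose centre sits on C at a given angle.

```python
import math

def covered_fraction(centre_angles_deg, r=1.0, samples=3600):
    """Fraction of the circumference of C (centre O, radius r) covered by
    equal circles whose centres lie on C at the given angles (degrees)."""
    centres = [(r * math.cos(math.radians(a)), r * math.sin(math.radians(a)))
               for a in centre_angles_deg]
    hit = 0
    for k in range(samples):
        t = 2 * math.pi * (k + 0.5) / samples   # sample point on C
        x, y = r * math.cos(t), r * math.sin(t)
        if any(math.hypot(x - cx, y - cy) <= r for cx, cy in centres):
            hit += 1
    return hit / samples

print(covered_fraction([0]))   # a single circle C1: close to 1/3
```

Passing the angles `[0, 120]` or `[0, 120, 240]` models the further circles that the remainder of the proof constructs.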
To cover the remaining part of the circumference of circle C, draw a circle C2 of the same radius r with centre O2 that passes through O and P. Let C2 intersect C at another point, say R.

Figure 2. Four identical circles intersecting w.r.t. the first circle's centre point.
Length of arc $PO_2R = \frac{2\pi r}{3}$, i.e. an arc equal to 1/3 of the circumference of the given circle C is covered by C2 (proceeding as above). Thus the circles C1 and C2 together cover only 2/3 of the circumference of the circle C. To cover the complete circumference of C, one more circle C3 of radius r with centre O3 is required, passing through O, Q and R.
Length of arc $RO_3Q = \frac{2\pi r}{3}$.
Now, length of arc $PO_1Q$ + length of arc $PO_2R$ + length of arc $RO_3Q = 3 \times \frac{2\pi r}{3} = 2\pi r$, which is the circumference of the circle C.
Hence a minimum of three identical circles is required to cover the circumference of a circle of the same radius that passes through the centres of the other circles.

Lemma 1 establishes the minimum number of circles required to cover the circumference of the centre circle, and the selection of these circles is equivalent to the RegionQuery calls in the DBSCAN algorithm. In a real scenario, three RegionQuery calls are not sufficient to cover most of the neighbours of the centre object's neighbours when the objects in the data set are distributed uniformly (assume that the objects are distributed uniformly and that the distance between an object and its neighbour is 1). Moreover, three RegionQuery calls are not sufficient to cover the immediate neighbours of the centre object's neighbours. This problem is explained below.

Figure 3. Missing immediate neighbour.

The figure shows a circle (the original circle) intersected by three other identical circles. Even though the three circles cover the full circumference of the original circle, they do not cover the centre circle's immediate neighbours, which are marked in red (p1, p2 and p3). That is, even if the distance between the intersection point and the immediate neighbour point is 1, this configuration cannot cover all the immediate neighbours. Lemma 2 is therefore introduced to establish the minimum number of circles required to cover all the immediate neighbours.

Lemma 2: Four identical circles are sufficient to cover all the immediate neighbour objects of the original circle when the objects are distributed uniformly.

Figure 4. Minimum circles to cover the immediate neighbours.

Proof: Clearly, $O_1OO_2P$ is a square of side r.
$OP$ = diagonal of the square of side r = $\sqrt{2}\,r$.
Distance $AP = \sqrt{2}\,r - r = (\sqrt{2} - 1)\,r = 0.414\,r$, taking A as the point where OP meets the circumference of C (so that $OA = r$).
Thus four circles cover the objects that lie at most $0.414\,r$ beyond the circumference of the original circle C.

So a minimum of four RegionQuery calls is needed to cover all the immediate neighbours of the centre object's neighbours, and this covers more than 80% of the neighbour objects of the centre object's neighbours. This can be proved as follows.

Lemma 3: Four identical circles are sufficient to cover more than 80% of the neighbour objects of the centre circle when the objects are distributed uniformly.

Proof:

Figure 5. Neighbours' unreachable area.

Area of the outer circle (with radius 2r) = $\pi (2r)^2 = 4\pi r^2$.
Area of the square PQRS (with side 2r) = $(2r)^2 = 4r^2$.
Total area of the four semicircles (each with radius r) = $4 \times \frac{\pi r^2}{2} = 2\pi r^2$.
Hence the area of the unmarked region = $4r^2 + 2\pi r^2$.
Area of the marked region = $4\pi r^2 - (4r^2 + 2\pi r^2) = 2\pi r^2 - 4r^2$.
Percentage of the outer circle's area covered by the marked area = $\frac{2\pi r^2 - 4r^2}{4\pi r^2} \times 100\% = \frac{\pi - 2}{\pi} \times 50\% = 18.169\%$.
Hence the area occupied by the marked region is less than 20 percent.

So in a real scenario, if four seed objects are selected from the centre object's neighbours for cluster expansion, there is a chance of ignoring about 20% of the objects in the border region, and the previous FDBSCAN algorithm does ignore these objects. In this solution the problem is rectified: all the border objects are considered in the clustering operation.

To improve the performance of the algorithm, the approach of the MEDBSCAN algorithm [7] is applied. Two types of RegionQuery function are introduced in this algorithm, namely LongRegionQuery and ShortRegionQuery. First, LongRegionQuery is called to get the objects in the Eps neighbourhood as well as the 2*Eps neighbourhood of the given object. The neighbours within Eps distance of the centre object are stored in the InnerRegion queue, and the objects that are more than Eps and at most 2*Eps away from the centre object are stored in the OuterRegion queue. Later, the selected seed objects in the Eps neighbour region are processed using ShortRegionQuery. A ShortRegionQuery call is always faster than a LongRegionQuery call, because it needs to process only the few objects in the InnerRegion and OuterRegion rather than all the objects in the data set.

Another change made in the proposed solution to improve the speed is a modification of the queue structure: the InnerRegion and OuterRegion queues are each a combination of four sub-queues.

RegionQueue
{
    TopRightQueue;
    RightBottomQueue;
    BottomLeftQueue;
    LeftTopQueue;
}

4.1 Proposed Queue Structure
The InnerRegion and OuterRegion queues maintain their region objects internally in four queues. Figure 6 shows each queue's object storage area.

Figure 6. RegionQueue's storage area classification.

This separation helps to minimise unwanted distance computations while processing the border objects: while processing the unprocessed objects of the OuterRegion queue, only the adjacent portions of the InnerRegion queue need to be considered, and the objects in the non-adjacent portion can be ignored. This concept is explained as follows.

4.2 Neighbour computation Ignore Case
Let O be an outer circle with radius 2r and I an inner circle with radius r. The two circles share the same centre point C, and each is divided equally into four parts, as shown in Figure 7 (to perform the RegionQuery operation).

Figure 7. Inner and Outer Region.

Here the neighbours of the inner circle's objects lie in the marked (brown) area of the outer circle and in the inner circle itself. Any object in one of the inner circle's quarters (I1, I2, I3 or I4) therefore has its neighbours in the three adjacent quarters of the outer circle and in the inner circle itself (all four quarters). Thus the non-adjacent quarter of the outer circle can be excluded from the computation.
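A minimal Python sketch of the four-way RegionQueue and of LongRegionQuery is given below. It is an illustration under stated assumptions, not the authors' implementation: the function names, the use of a dict of four lists for a RegionQueue, the geometric meaning given to the four queue names, and the rule for objects lying exactly on a dividing axis are all choices made here. The `OPPOSITE` table encodes the ignore case: for an object in one inner quarter, the diagonally opposite outer quarter cannot hold an Eps neighbour.

```python
import math

QUEUES = ("TopRight", "RightBottom", "BottomLeft", "LeftTop")
# Diagonally opposite (non-adjacent) quarter for each quarter.
OPPOSITE = {"TopRight": "BottomLeft", "RightBottom": "LeftTop",
            "BottomLeft": "TopRight", "LeftTop": "RightBottom"}

def quarter(centre, p):
    """Name of the quarter around `centre` that holds p.
    Assumption: objects on a dividing axis go to the dx >= 0 / dy >= 0 side."""
    dx, dy = p[0] - centre[0], p[1] - centre[1]
    if dy >= 0:
        return "TopRight" if dx >= 0 else "LeftTop"
    return "RightBottom" if dx >= 0 else "BottomLeft"

def long_region_query(data, centre, eps):
    """Scan the whole data set once and return (inner, outer) RegionQueues.
    inner: dist <= eps; outer: eps < dist <= 2 * eps."""
    inner = {name: [] for name in QUEUES}
    outer = {name: [] for name in QUEUES}
    for p in data:
        d = math.dist(centre, p)
        if d <= eps:
            inner[quarter(centre, p)].append(p)
        elif d <= 2 * eps:
            outer[quarter(centre, p)].append(p)
    return inner, outer

# Check the ignore case on a 7 x 7 unit grid centred on (0, 0).
grid = [(x, y) for x in range(-3, 4) for y in range(-3, 4)]
centre, eps = (0, 0), 1.5
inner, outer = long_region_query(grid, centre, eps)
for name in QUEUES:
    for p in inner[name]:
        # No object of the opposite outer quarter is within eps of p.
        assert all(math.dist(p, q) > eps for q in outer[OPPOSITE[name]])
print({name: len(inner[name]) for name in QUEUES})
```

The ignore case holds for any data, not only this grid: relative to the centre, an inner object and an object in the opposite outer quarter have a non-positive dot product, so their separation is at least the outer object's distance from the centre, which exceeds Eps.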
Figure 8. Availability of neighbours in the circle's portions.

In the above diagram, the I3 quarter of the inner circle is considered for the neighbour computation. An object in the I3 quarter has its neighbours in O4, O3, O2 and the inner circle itself (I1, I2, I3 and I4), but not in the O1 portion. That is, r is the maximum valid distance for the neighbour computation, and an object in I3 needs a distance of at least r + 1 to reach an object in the O1 portion, which is not possible (the invalid condition is shown in red in the diagram). Similarly, while processing the border objects in the OuterRegion, only the inner-region objects in the adjacent quarters are needed to determine whether a border object is density reachable from any object in the InnerRegion. This is another optimisation made in the new algorithm to speed up the computation as well as improve the accuracy of the output. In the FDBSCAN algorithm both core objects and border objects may be missed, whereas in the new approach all the border objects are covered. The loss of core objects is also shown to be a very rare case, and the new solution is better in most real scenarios.

4.3 Algorithm
1. Read D, Eps and MinObjs.
2. Initialize the Cluster ID field of all objects as UNCLASSIFIED.
3. For each UNCLASSIFIED object $o \in D$
4.   Call the LongRegionQuery function with D, Eps and o as parameters to obtain the InnerRegion and OuterRegion.
5.   IF o is a core object Then
6.     Get the ClusterID for the new cluster.
7.     Select four UNCLASSIFIED objects, one each from the InnerRegion's TopRight, RightBottom, BottomLeft and LeftTop queues, for further cluster expansion and push the selected objects to FourQueue. Each selected object should have the maximum distance from the centre object o.
8.     Assign ClusterID to all the UNCLASSIFIED and NOISE type objects present in the InnerRegion.
9.     For each object $T \in$ FourQueue
10.      Call the ShortRegionQuery function with InnerRegion, OuterRegion, Eps and object T to obtain the ShortRegion.
11.      Select four UNCLASSIFIED objects from the ShortRegion's TopRight, RightBottom, BottomLeft and LeftTop queues for further cluster expansion. Each selected object should have the maximum distance from the centre object T. Push the selected objects to SeedQueue for further processing.
12.      Assign ClusterID to all the UNCLASSIFIED and NOISE type objects present in the ShortRegion.
13.    End For
14.    Remove the clustered objects from the OuterRegion and process the remaining (UNCLASSIFIED and NOISE type) objects to find out whether any of them has a neighbour in the InnerRegion; i.e. if any remaining object in the OuterRegion is density reachable from one of the centre object o's neighbours, assign ClusterID to that object.
15.    Pop an object s from SeedQueue and repeat steps 4-14 until the SeedQueue is empty. In all the above steps, replace the object o with the SeedQueue object s wherever applicable.
16.  Else
17.    Mark o as NOISE
18.  End If
19. End For

This algorithm reads the same input as the original DBSCAN, and all the objects are initialised as UNCLASSIFIED at the beginning. Afterwards the UNCLASSIFIED objects are processed one by one. The algorithm starts with a LongRegionQuery call to obtain the neighbour objects (the InnerRegion and OuterRegion objects), and cluster expansion happens only if the current object is a core object; otherwise the current object is marked as NOISE. During cluster expansion, a new Cluster ID is created and four UNCLASSIFIED objects are selected, one from each of the InnerRegion's four queues; each should have the maximum distance from the centre object. After the Cluster ID has been assigned to all the objects in the InnerRegion queue, the four selected objects are processed. Four is the maximum count: if there is no UNCLASSIFIED object in one or more of the queues, fewer than four objects are selected.
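The seed-selection rule can be sketched as follows. This is an illustrative Python fragment under assumptions: a RegionQueue is modelled as a dict of four lists of object indices, `labels` holds each object's Cluster ID, `select_seeds` is a hypothetical name, and the centre object is excluded from its own candidates. It returns at most four indices, one per sub-queue, each the UNCLASSIFIED object farthest from the centre.

```python
import math

UNCLASSIFIED = -1
QUEUES = ("TopRight", "RightBottom", "BottomLeft", "LeftTop")

def select_seeds(data, region, centre_idx, labels):
    """From each of the four sub-queues, pick the UNCLASSIFIED object
    farthest from the centre object; queues with no candidate are skipped."""
    centre = data[centre_idx]
    seeds = []
    for name in QUEUES:
        candidates = [i for i in region[name]
                      if labels[i] == UNCLASSIFIED and i != centre_idx]
        if candidates:
            seeds.append(max(candidates,
                             key=lambda i: math.dist(centre, data[i])))
    return seeds

# Illustrative data: object 0 is the centre, object 4 is already clustered.
data = [(0, 0), (1, 0), (1, 1), (0, -1), (-1, -1), (-1, 0.5)]
labels = [UNCLASSIFIED] * len(data)
labels[4] = 1
inner = {"TopRight": [0, 1, 2], "RightBottom": [3],
         "BottomLeft": [4], "LeftTop": [5]}
print(select_seeds(data, inner, 0, labels))
```

In this example the BottomLeft queue contributes no seed, so only three objects are returned, which is the "fewer than four" case described above.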
To process these objects, ShortRegionQuery is used. In each ShortRegionQuery operation, at most four seed objects that meet the above condition are selected and pushed into the seed queue for further cluster expansion. ShortRegionQuery takes the array of objects returned by LongRegionQuery and does not process the whole data set in the subsequent iteration. A performance improvement is thus guaranteed when the Eps value is reasonably insensitive. The Cluster ID is assigned to the objects returned by ShortRegionQuery if they are either UNCLASSIFIED or NOISE. The remaining UNCLASSIFIED or NOISE type objects in the OuterRegion queue are then processed using the Neighbour computation Ignore Case approach to minimise the computation and speed up the performance. After these steps have been repeated as described in the algorithm and the SeedQueue becomes empty, the current cluster expansion stops and control moves to the next UNCLASSIFIED object in the parent For loop. The whole clustering process is over once the main loop has visited all N objects in the data set.
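Why ShortRegionQuery may skip the rest of the data set can be illustrated with a short sketch. For a seed T within Eps of the centre o, the triangle inequality places every Eps neighbour of T within 2*Eps of o, so searching only the InnerRegion and OuterRegion objects returned by LongRegionQuery finds the same neighbours as a full scan. The Python below is an illustration only: the flat-list representation, the function names and the random test data are assumptions, and the four-way split of the queues is omitted for brevity.

```python
import math
import random

def long_region_query(data, o, eps):
    """Full scan: indices within eps of data[o] (inner)
    and within (eps, 2 * eps] of data[o] (outer)."""
    inner, outer = [], []
    for i, p in enumerate(data):
        d = math.dist(data[o], p)
        if d <= eps:
            inner.append(i)
        elif d <= 2 * eps:
            outer.append(i)
    return inner, outer

def short_region_query(data, inner, outer, t, eps):
    """Eps neighbours of data[t], searching only the long query's result."""
    return [i for i in inner + outer if math.dist(data[t], data[i]) <= eps]

random.seed(0)
data = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(500)]
eps, o = 1.0, 0
inner, outer = long_region_query(data, o, eps)
for t in inner:   # any seed taken from the InnerRegion
    full_scan = [i for i, p in enumerate(data) if math.dist(data[t], p) <= eps]
    assert sorted(short_region_query(data, inner, outer, t, eps)) == full_scan
print(len(inner) + len(outer), "objects examined per short query instead of", len(data))
```

The saving depends on how many objects fall within 2*Eps of the centre, which is consistent with the remark above that the gain relies on the Eps value.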
5. PERFORMANCE ANALYSIS
The basic DBSCAN, Fast DBSCAN and proposed Optimized DBSCAN algorithms were implemented in Visual C++ (2008) on Windows Vista and tested using a two-dimensional data set. To measure the real performance difference achieved by the new algorithm, no additional data structures (such as spatial trees) were used to improve the performance. The algorithms were tested on a two-dimensional synthetic data set, and the performance results are shown below.

Table 1. Running time of algorithms in seconds

Number of    DBSCAN              FDBSCAN             ODBSCAN
objects      Time      Loss      Time      Loss      Time      Loss
300          0.096     0         0.078     3         0.064     0
500          0.74      0         0.185     11        0.18      1
700          0.483     0         0.56      6         0.177     3
100          1.04      0         0.581     34        0.345     7
500          4.850     0         1.01      77        0.66      13

The table shows that the new algorithm performs better than the existing algorithms in terms of computation time and that it loses fewer objects than the Fast DBSCAN algorithm.

6. CONCLUSION AND FUTURE WORK
In this paper we have proposed the ODBSCAN algorithm to improve performance with a smaller amount of object loss. The new algorithm uses the approaches of the FDBSCAN and MEDBSCAN algorithms to improve performance, and it introduces some new techniques to minimise the distance computation during the RegionQuery function call. The performance analysis and the output show that the proposed ODBSCAN algorithm gives better output with good performance. In this algorithm, all the border objects are considered in the clustering process, but there remain a few possibilities of missing core objects, which causes some loss of objects. Although the new algorithm gives a better result than the previous FDBSCAN algorithm, this problem needs to be resolved in future work to give an accurate result with the same performance.

7. REFERENCES
[1] Ester M., Kriegel H.-P., Sander J., and Xu X. (1996) "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD'96), Portland, Oregon, pp. 226-231.
[2] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006.
[3] G. Karypis, E. H. Han, and V. Kumar, "CHAMELEON: A hierarchical clustering algorithm using dynamic modeling", Computer, vol. 32, no. 8, pp. 68-75, 1999.
[4] M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander, "OPTICS: Ordering Points to Identify the Clustering Structure", in Proc. ACM SIGMOD International Conference on Management of Data, 1999, pp. 49-60.
[5] A. Hinneburg and D. Keim, "An efficient approach to clustering in large multimedia data sets with noise", in 4th International Conference on Knowledge Discovery and Data Mining, 1998, pp. 58-65.
[6] ZHOU Shui-geng, ZHOU Ao-ying, JIN Wen, FAN Ye and QIAN Wei-ning (2000) "A Fast DBSCAN Algorithm", Journal of Software: 735-744.
[7] Li Jian, Yu Wei, and Yan Bao-Ping, "Memory effect in DBSCAN algorithm", in 4th International Conference on Computer Science & Education (ICCSE '09), pp. 31-36, 25-28 July 2009.

AUTHOR PROFILES
J. Hencil Peter is a Research Scholar at St. Xavier's College (Autonomous), Palayamkottai, Tirunelveli, India. He earned his MCA (Master of Computer Applications) degree from Manonmaniam Sundaranar University, Tirunelveli. He is now pursuing a Ph.D. in Computer Applications and Mathematics (Interdisciplinary) at Manonmaniam Sundaranar University, Tirunelveli. His research interest is the invention of algorithms in data mining.

Dr. A. Antonysamy is Principal of St. Xavier's College, Kathmandu, Nepal. He completed his Ph.D. in Mathematics for research on "An algorithmic study of some classes of intersection graphs". He has guided, and continues to guide, many research students in Computer Science and Mathematics. He has published many research papers in national and international journals and has organised seminars and conferences at state and national level.