pp.412-418
http://dx.doi.org/10.14257/astl.2014.53.86

Study of Data Stream Clustering Based on Bio-inspired Model

Yingmei Li, Min Li, Jingbo Shao, Gaoyang Wang

College of Computer Science and Information Engineering, Harbin Normal University, 150025 Harbin, China
{Yingmei Li, yingmei_li2013}@163.com

Abstract. With the rapid development of wireless sensor networks and network traffic monitoring, stream data has gradually become one of the most popular data models. Stream data differs from traditional static data. Clustering analysis is an important data mining technique, so many researchers have turned their attention to the clustering of stream data. In this paper, the MSFS (Multiple Species Flocking on Stream) algorithm is proposed. Experimental verification shows that MSFS, which is based on a biologically inspired computational model, achieves higher clustering purity on both the real dataset and the simulated datasets. In other words, the clustering result of MSFS is better.

Key Words: stream data; clustering analysis; the model of MSF; cluster purity

1 Introduction

Recently, with advances in communication and data collection techniques, people receive large amounts of real-time data at very high rates. Many data mining techniques exist, but they must be tuned and adapted to work on data streams, because data stream mining differs from regular static data mining. These distinguishing features bring new challenges to stream data processing. Clustering analysis is an important data mining technique, so many researchers have turned their attention to the clustering of stream data [1].

In this paper, the MSFS algorithm is proposed. It combines the MSF model with DenStream, a density-based clustering algorithm. The MSF model is a swarm intelligence model for text clustering, and we take advantage of its feature similarity rule to make MSFS suitable for data stream clustering.

This article is organized as follows.
Section 2 describes the work related to the proposed algorithm: the DenStream algorithm and the MSF (Multiple Species Flocking) model. Section 3 describes our algorithm. Section 4 presents the results of the method on synthetic and real-life data sets. The last section discusses the advantages of the approach and concludes the article.

ISSN: 2287-1233 ASTL
Copyright 2014 SERSC

2 Related work

In recent years, much attention has been paid to finding efficient and effective methods for clustering data streams [2]. In 2000, Guha et al. proposed a data stream clustering algorithm based on k-means [2]. O'Callaghan et al. proposed an algorithm for real-time data streams called STREAM [3]. In 2003, CluStream was proposed in [4]. It treats data stream clustering as a dynamic process that changes over time. In the following year, HPStream was proposed [5].

Cao et al. proposed a density-based clustering algorithm for evolving data streams called DenStream, which captures synopsis information about the nature of the data stream by using summary statistics [6]. As in CluStream, the clustering process is divided into an online part and an offline part. In the online part, if the density of a cluster is greater than a certain threshold, the algorithm treats the cluster as a potential micro-cluster (p-micro-cluster); otherwise, the cluster is treated as an outlier micro-cluster (o-micro-cluster). In the offline part, when a query request arrives, the algorithm processes the p-micro-clusters and the o-micro-clusters and outputs the result. The offline part essentially follows the method of DBSCAN [7].

In this paper, the MSF model is based on the Flocking clustering algorithm, and the Flocking model is a bionic model. The Flocking model was developed by Reynolds and others through the study of the group behavior of birds; it can also be seen as the prototype of PSO, which was proposed in 1995. Cui studied the Flocking model and proposed the MSF model, which has been applied to text clustering [8]. In the FlockStream algorithm, however, Forestiero et al. proposed a rule that does not rely on the modification introduced by the fourth principle [9].
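
The online density test described above for DenStream can be illustrated with a minimal Java sketch. It assumes the conventions of the DenStream paper [6]: a fading function 2^(-λ·Δt) and a potential threshold βμ. The class and method names are hypothetical and are not taken from the authors' implementation.

```java
/** Minimal sketch of DenStream-style micro-cluster weighting (illustrative names only). */
final class MicroClusterSketch {

    /** Fading function f(dt) = 2^(-lambda * dt): older points contribute less weight. */
    static double fade(double lambda, double elapsed) {
        return Math.pow(2.0, -lambda * elapsed);
    }

    /** Weight of a micro-cluster: the sum of the faded contributions of its points. */
    static double weight(double[] arrivalTimes, double now, double lambda) {
        double w = 0.0;
        for (double t : arrivalTimes) {
            w += fade(lambda, now - t);
        }
        return w;
    }

    /** Assumed test: a weight of at least beta * mu marks a p-micro-cluster, otherwise an o-micro-cluster. */
    static boolean isPotential(double weight, double beta, double mu) {
        return weight >= beta * mu;
    }

    public static void main(String[] args) {
        double[] times = {0.0, 1.0, 2.0, 3.0};
        double w = weight(times, 4.0, 0.25);
        System.out.println("weight = " + w + ", potential = " + isPotential(w, 0.5, 4.0));
    }
}
```

With λ = 0 nothing fades, so the weight reduces to a plain point count; larger values of λ make the summary forget old points faster.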

3 MSFS algorithm

In addition to using the rules of the MSF model, this algorithm takes the differences between agents into account and defines four agent types: the data agent (representing a data point), the p-micro-cluster agent (representing a potential core micro-cluster, that is, a potential c-micro-cluster), the o-micro-cluster agent (representing an outlier micro-cluster), and the c-cluster agent (representing a final cluster).
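
As a small illustration, the four agent types listed above could be modelled as a Java enum. The identifiers and the helper method are hypothetical; the paper does not give the authors' actual data structures.

```java
/** Sketch of the four MSFS agent types; the identifiers are illustrative, not the authors'. */
enum AgentType {
    DATA,             // represents a single data point
    P_MICRO_CLUSTER,  // represents a potential core micro-cluster
    O_MICRO_CLUSTER,  // represents an outlier micro-cluster
    CLUSTER;          // represents a final cluster

    /** True for the two types that summarize a micro-cluster rather than a point or a final cluster. */
    boolean isMicroCluster() {
        return this == P_MICRO_CLUSTER || this == O_MICRO_CLUSTER;
    }

    public static void main(String[] args) {
        for (AgentType type : values()) {
            System.out.println(type + " summarizes a micro-cluster: " + type.isMicroCluster());
        }
    }
}
```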
During execution, the algorithm changes agent types according to the relevant constraints, responds to clustering requests, and generates clustering results. During initialization, each multidimensional data point is associated with one data agent; the agents corresponding to the collected data are then randomly deployed on a two-dimensional virtual grid. The location P of each agent A = (P, v) in the grid is randomly generated, and its velocity vector is defined as v = (m, θ), where the magnitude m is initialized to 1 and the direction θ to a value in [0, 2π]. After the parameters of the data agents are predefined, the data agents move according to the MSF rules. The specific process can be represented by the following algorithm.

MSFS(DS, ε, β, μ, λ) {
  For i = 1, 2, 3, ..., Max(iteration) {
    Init();
    AgentsMerging();
    $T_p = \lceil \frac{1}{\lambda} \log \frac{\beta\mu}{\beta\mu - 1} \rceil$;
    If (t mod $T_p$ == 0) {
      For each p-agent
        If ($w_p < \beta\mu$) Change the p-agent to an o-agent;
      $\xi = \frac{2^{-\lambda (t - t_o + T_p)} - 1}{2^{-\lambda T_p} - 1}$;
      For each o-agent {
        If ($w_o \geq \beta\mu$) Change the o-agent to a p-agent;
        Else if ($w_o < \xi$) Delete the cluster o that the o-agent represents; } } }
  If a clustering request arrives
    Return the clusters that the c-agents represent;
}

Here $w_p$ and $w_o$ are the weights of the micro-clusters represented by a p-agent and an o-agent, $t_o$ is the creation time of the o-micro-cluster, and $\xi$ is the lower weight limit used in DenStream [6].

The AgentsMerging() procedure works as follows:

(1) When a data agent A representing data point $P_A$ comes across another data agent B representing $P_B$, and $dist(P_A, P_B) \leq \varepsilon$, that is, the Euclidean distance between them is less than or equal to ε, then A and B are combined into one o-agent.
(2) When a data agent A comes across a p-agent B representing a p-micro-cluster (or an o-agent B representing an o-micro-cluster), A is combined with B if the radius of the new micro-cluster generated by A and B is less than or equal to ε.
(3) If A is a p-agent or o-agent rather than a data agent and it encounters another p-agent or o-agent, the two are merged into a cluster agent, since they are sufficiently similar, when the distance between the corresponding micro-clusters is less than ε.
(4) If a p-agent or o-agent A comes across a data agent B, the same test as in (2) decides whether B can be combined with A.
(5) Finally, once a merge operation has been performed, the velocity vector of the agent is recalculated according to the MSF rules, and the agent is then adjusted according to the four principles.

4 Experimental results

We implemented the MSFS algorithm in Java. The experiments were run on a computer with an Intel(R) Core i3-2120 processor, the Windows 7 operating system, and 4.00 GB of system memory.

The experiment is divided into two parts: one on a real data set and one on synthetic data sets that contain some noise. The real data set is KDD CUP99, which was used in the KDD (Knowledge Discovery) contest in 1999. It is widely used to evaluate real-time detection of computer attacks in data stream clustering.

In the experiment, we use the average purity to compare the clustering quality of the algorithms. The clustering purity is defined as follows:

$$purity = \frac{\sum_{i=1}^{K} \frac{|C_i^d|}{|C_i|}}{K} \times 100\%$$

where $K$ denotes the number of clusters, $|C_i^d|$ indicates the number of points with the dominant class label in cluster $i$, and $|C_i|$ indicates the number of points in cluster $i$.

Experimental data show that the clustering purity of MSFS is always better than that of DenStream on the network intrusion dataset KDD CUP99. The results are shown in Fig. 1.

[Figure: cluster purity (%) versus time unit for MSFS and DenStream; KDD CUP 1999 dataset, v=1000, H=1]
Fig. 1. The cluster purity of MSFS and DenStream with H=1

In this paper, three artificial datasets DS1, DS2, and DS3 are selected for a fairer comparison. A new evolving data set, EDS, is produced by random selection. In real applications, some unavoidable noise data is generated for unexpected reasons. Therefore, we added 5% noise data to EDS and observed the experimental results. Fig. 3 shows the experimental results.
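
The average purity measure used in this section can be sketched in Java as follows. This is a minimal illustration, not the authors' evaluation code: the class name, the method name, and the representation of a cluster as a list of true class labels are assumptions, as is the choice to skip empty clusters.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal sketch of the average cluster purity (illustrative names only). */
final class PuritySketch {

    /**
     * Each inner list holds the true class labels of the points assigned to one cluster.
     * Returns the mean fraction of dominant-label points per cluster, as a percentage.
     */
    static double averagePurity(List<List<String>> clusters) {
        double sum = 0.0;
        int counted = 0;
        for (List<String> labels : clusters) {
            if (labels.isEmpty()) {
                continue; // assumption: empty clusters are skipped to avoid dividing by zero
            }
            Map<String, Integer> counts = new HashMap<>();
            int dominant = 0;
            for (String label : labels) {
                int c = counts.merge(label, 1, Integer::sum); // updated count for this label
                dominant = Math.max(dominant, c);
            }
            sum += (double) dominant / labels.size();
            counted++;
        }
        return counted == 0 ? 0.0 : 100.0 * sum / counted;
    }

    public static void main(String[] args) {
        List<List<String>> clusters = List.of(
                List.of("normal", "normal", "attack", "normal"),
                List.of("attack", "attack"));
        System.out.println("average purity (%) = " + averagePurity(clusters));
    }
}
```

In the example in `main`, the first cluster has three of four points in its dominant class and the second is pure, so the two per-cluster fractions being averaged are 0.75 and 1.0.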

[Figure: cluster purity (%) versus time unit for MSFS and DenStream; Evolving Data Stream, v=1000, H=2, noise=5%]
Fig. 3. The experiment result on EDS with noise=5%

5 Discussion and Conclusions

In the experimental comparison, MSFS produces better clustering results than the DenStream algorithm. When the experiments are performed on the real data set, MSFS achieves higher clustering purity. Moreover, the advantage of MSFS is more pronounced when the data contain some noise. However, because the parameters are pre-defined, the proposed algorithm is highly sensitive to its parameters. We will address this issue in future work.

This work is supported by the Heilongjiang Provincial Department of Education Science Research Project (No. 12541239).

References

1 Shifei Ding, Fulin Wu, Jun Qian, Hongjie Jia, Fengxiang Jin. Research on data stream clustering algorithms. Springer Science, (2013)
2 Guha S, Meyerson A, Mishra N. Clustering data streams. Proceedings of the 41st Annual Symposium on Foundations of Computer Science. Washington DC: IEEE Computer Society, pp. 359-366 (2000)
3 O'Callaghan L. Streaming data algorithms for high quality clustering. Proc of the 18th International Conference on Data Engineering. Massachusetts: IEEE Computer Society, pp. 685-694 (2002)
4 Aggarwal CC, Han J, Wang J et al. A framework for clustering evolving data streams. In: Proceedings of VLDB. pp. 81-92 (2003)
5 Aggarwal CC, Han J, Wang J, Yu PS. A framework for projected clustering of high dimensional data streams. In: Proceedings of the 30th international conference on very large data bases. pp. 852-863 (2004)
6 Cao F, Ester M, Qian W, Zhou A. Density-based clustering over evolving data stream with noise. Proceedings of the sixth SIAM international conference on data mining (SIAM 06), Bethesda, pp. 326-337 (2006)
7 Ester M, Kriegel H-P, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the second ACM SIGKDD international conference on knowledge discovery and data mining (KDD 96). pp. 373-382 (1996)
8 Cui X, Potok TE. A distributed agent implementation of multiple species flocking model for document partitioning clustering. Cooperative information agents. Edinburgh, (2006)
9 Agostino Forestiero, Clara Pizzuti and Giandomenico Spezzano. A single pass algorithm for clustering evolving data streams based on swarm intelligence. 26, 1-26 (2013)