Study of Data Stream Clustering Based on Bio-inspired Model


ISSN: 2287-1233 ASTL, Copyright 2014 SERSC, pp. 412-418
http://dx.doi.org/10.14257/astl.2014.53.86

Yingmei Li, Min Li, Jingbo Shao, Gaoyang Wang

College of Computer Science and Information Engineering, Harbin Normal University, 150025 Harbin, China
{Yingmei Li, yingmei_li2013}@163.com

Abstract. With the rapid development of wireless sensor networks and network traffic monitoring, stream data has gradually become one of the most popular data models. Stream data differs from traditional static data. Clustering analysis is an important data mining technology, and many researchers have turned their attention to the clustering of stream data. In this paper, the MSFS (Multiple Species Flocking on Stream) algorithm is proposed. Experimental verification shows that the MSFS algorithm, which is based on a biologically inspired computational model, achieves higher clustering purity on both the real dataset and the simulated datasets. In other words, MSFS produces better clustering results.

Key Words: stream data; clustering analysis; the model of MSF; cluster purity

1 Introduction

Recently, with advances in communication and data collection techniques, people receive large amounts of real-time data at very high rates. Data mining offers many techniques, but they must be tuned and changed to work on data streams, because data stream mining differs from regular static data mining. These distinguishing features bring new challenges to stream data processing. Clustering analysis is an important data mining technology, and many researchers have turned their attention to the clustering of stream data [1].

In this paper, the MSFS algorithm is proposed. It combines the MSF model with DenStream, a density-based clustering algorithm. The MSF model is a swarm intelligence model for text clustering, and we take advantage of its feature similarity rule to make MSFS suitable for data stream clustering.

This article is organized as follows. Section 2 describes the work related to the proposed algorithm: the DenStream algorithm and the MSF (Multiple Species Flocking) model. Section 3 describes our algorithm. Section 4 presents the results of the method on synthetic and real-life data sets. The last section discusses the advantages of the approach and concludes the article.

2 Related work

In recent years, much attention has been paid to finding efficient and effective methods for clustering data streams [2]. In 2000, Guha et al. proposed a data stream clustering algorithm based on k-means [2]. O'Callaghan et al. proposed an algorithm for real-time data streams called STREAM [3]. In 2003, CluStream was proposed in [4]. It treats data stream clustering as a dynamic process that changes over time. In the following year, HPStream was proposed [5].

Cao et al. proposed a density-based clustering algorithm for evolving data streams called DenStream, which captures synopsis information about the nature of the data stream by using summary statistics [6]. As in CluStream, the clustering process is divided into an online part and an offline part. In the online part, if the density of a cluster is greater than a certain threshold, the algorithm treats the cluster as a potential micro-cluster (p-micro-cluster); otherwise, the cluster is treated as an outlier micro-cluster (o-micro-cluster). In the offline part, when a query request arrives, the algorithm processes the p-micro-clusters and o-micro-clusters and outputs the result. The offline part essentially follows the method of DBSCAN [7].

In this paper, the MSF model is based on a flocking clustering algorithm, and the flocking model is a bionic model. The flocking model was developed by Reynolds and others through the study of the group behavior of birds; it can also be seen as the prototype of PSO, proposed in 1995. Cui studied the flocking model and proposed an MSF model that has been applied to text clustering [8]. In the FlockStream algorithm, Agostino Forestiero et al. also proposed a rule, but it does not refer to the rule set modified by a fourth principle [9].
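The online step of DenStream described in this section can be illustrated with a small sketch. The section only says that density is compared with "a certain threshold"; the sketch assumes the exponential fading function 2^(-λ·Δt) and the threshold β·μ used in the original DenStream paper [6]. The class and function names (MicroCluster, faded_weight, classify) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class MicroCluster:
    weight: float        # decayed number of points absorbed so far
    last_update: float   # time stamp of the most recent update


def faded_weight(mc: MicroCluster, now: float, lam: float) -> float:
    """Weight of mc at time `now`, assuming the fading function 2^(-lam * dt)."""
    return mc.weight * 2.0 ** (-lam * (now - mc.last_update))


def classify(mc: MicroCluster, now: float, lam: float, beta: float, mu: float) -> str:
    """Return 'p' for a potential micro-cluster and 'o' for an outlier micro-cluster.

    Assumption: the density threshold is beta * mu, as in DenStream.
    """
    return "p" if faded_weight(mc, now, lam) >= beta * mu else "o"


mc = MicroCluster(weight=8.0, last_update=0.0)
print(classify(mc, now=0.0, lam=0.25, beta=0.5, mu=10.0))  # weight 8.0 against threshold 5.0
print(classify(mc, now=4.0, lam=0.25, beta=0.5, mu=10.0))  # weight has decayed to 8 * 2^-1
```

The same micro-cluster changes category simply because time passes without new points, which is why the online part has to re-check the micro-clusters periodically.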
3 MSFS algorithm

In this algorithm, in addition to using the rules of the MSF model, we take the differences between agents into account and distinguish four agent types: the data agent (representing a data point), the p-micro-cluster agent (representing a potential core-micro-cluster, that is, a potential c-micro-cluster), the o-micro-cluster agent (representing an outlier micro-cluster), and the c-cluster agent (representing a final cluster).
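The four agent types can be pictured as a small data structure. This is only a sketch under stated assumptions: the names AgentKind, Agent, members, and make_data_agents are hypothetical, and the paper's own Java implementation is not shown in the text.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class AgentKind(Enum):
    DATA = "data"            # stands for a single data point
    P_MICRO = "p-micro"      # stands for a potential micro-cluster
    O_MICRO = "o-micro"      # stands for an outlier micro-cluster
    CLUSTER = "c-cluster"    # stands for a final cluster


@dataclass
class Agent:
    kind: AgentKind
    # Data points this agent currently represents (hypothetical field).
    members: List[Tuple[float, ...]] = field(default_factory=list)


def make_data_agents(points):
    """Wrap every incoming data point in its own data agent."""
    return [Agent(AgentKind.DATA, [tuple(p)]) for p in points]


agents = make_data_agents([(0.1, 0.2), (0.4, 0.9)])
print([a.kind.value for a in agents])
```

Keeping the type as a field rather than as separate classes makes it cheap to change an agent's type in place, which the algorithm does repeatedly.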

During execution, the algorithm changes agent types according to the relevant constraints, responds to clustering requests, and generates clustering results. During initialization, each multidimensional data point is associated with one data agent; the agents are then randomly deployed onto a two-dimensional virtual grid. The location of each agent A = (P, v) in the grid is randomly generated, and its velocity vector is defined as v = (m, θ), with m initialized to 1 and θ in [0, 2π]. After the parameters of a data agent are set, the agent moves according to the MSF rules. The process can be represented by the following algorithm.

MSFS(DS, ε, β, μ, λ) {
    For i = 1, 2, 3, ..., Max(iteration) {
        Init();
        AgentsMerging();
        T_p = ⌈(1/λ) log(βμ / (βμ − 1))⌉;
        If (t mod T_p == 0) {
            For each p-agent
                If (w_p < βμ) Change the p-agent to an o-agent;
            ξ = (2^(−λ(t − t_o + T_p)) − 1) / (2^(−λ T_p) − 1);
            For each o-agent {
                If (w_o > βμ) Change the o-agent to a p-agent;
                Else if (w_o < ξ) Delete the cluster that the o-agent represents;
            }
        }
    }
    If a clustering request arrives
        Return the clusters that the c-agents represent;
}

Here w_p and w_o are the weights of the micro-clusters represented by a p-agent and an o-agent, t is the current time, and t_o is the creation time of the o-micro-cluster, following the definitions of DenStream [6].

The AgentsMerging() procedure works as follows:

(1) When a data agent A representing data point P_A encounters another data agent B representing P_B, if dist(P_A, P_B) ≤ ε, that is, the Euclidean distance between them is less than or equal to ε, then A and B are combined into one o-agent.
(2) When a data agent A encounters a p-agent B representing a p-micro-cluster (or an o-agent B representing an o-micro-cluster), if the radius of the new micro-cluster generated by A and B is less than or equal to ε, then A is combined with B.
(3) If A is not a data agent but a p-agent or an o-agent, and it encounters another p-agent or o-agent, then, if the distance between the corresponding micro-clusters is less than ε, the two are merged into a clustering agent, since they have a certain similarity.
(4) If a p-agent or o-agent A encounters a data agent B, check, as in (2), whether B can be combined with A.
(5) Finally, once a merge operation has been performed, the velocity vector of the agent is recalculated according to the MSF rules, and the agent is then adjusted according to the four principles.

4 Experimental results

We implemented the MSFS algorithm in Java to obtain the experimental results. The computer configuration is as follows: the processor is an Intel(R) Core i3-2120, the operating system is Windows 7, and the system memory is 4.00 GB. The experiment is divided into two parts: one on real data sets and one on synthetic data sets that contain some noise. The real data set is KDD CUP99, which was used in the KDD (Knowledge Discovery) contest in 1999. It is widely used to evaluate the real-time detection of computer attacks in data stream clustering research. In the experiments, we use the average purity to compare the clustering quality of the algorithms. The clustering purity is defined as follows:

$$\text{purity} = \frac{\sum_{i=1}^{K} \frac{|C_i^d|}{|C_i|}}{K} \times 100\%$$

where K denotes the number of clusters, |C_i^d| indicates the number of points with the dominant class label in cluster i, and |C_i| indicates the number of points in cluster i.

Experimental data shows that the clustering purity of MSFS is always better than that of DenStream on the network intrusion dataset KDD CUP99. The results are shown in Fig. 1.

Fig. 1. The cluster purity of MSFS and DenStream with H=1 (KDD CUP 1999 dataset, v=1000, H=1; x-axis: time unit, y-axis: cluster purity %).

In this paper, three artificial datasets, DS1, DS2, and DS3, are selected for a fairer comparison. A new evolutionary data set, EDS, is produced by random selection. In real applications, some unavoidable noise data is generated for unexpected reasons. Therefore, we added 5% noise data to the EDS and observed the experimental results, which are shown in Fig. 3.
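The average purity measure used in this section can be computed with a few lines of code. The sketch below assumes that each cluster is given as a list of the true class labels of its points; the function name average_purity and the example labels are hypothetical and not taken from the paper.

```python
from collections import Counter


def average_purity(clusters):
    """Average purity in percent.

    clusters: list of non-empty lists of class labels, one inner list per cluster.
    """
    if not clusters:
        raise ValueError("need at least one cluster")
    ratios = []
    for labels in clusters:
        dominant = Counter(labels).most_common(1)[0][1]  # |C_i^d|
        ratios.append(dominant / len(labels))            # |C_i^d| / |C_i|
    return sum(ratios) / len(clusters) * 100.0


# Hypothetical example: two clusters with mixed and pure labels.
example = [["normal", "normal", "attack", "normal"],
           ["attack", "attack"]]
print(average_purity(example))  # (3/4 + 2/2) / 2 * 100
```

Because every cluster contributes equally regardless of its size, a single small but impure cluster lowers the average as much as a large one would.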

Fig. 3. The experimental result on EDS with noise=5% (Evolving Data Stream, v=1000, H=2, noise=5%; x-axis: time unit, y-axis: cluster purity %).

5 Discussion and Conclusions

MSFS produces a better clustering result than the DenStream algorithm in the experimental comparison. In the experiments on real data sets, the MSFS algorithm achieves higher clustering purity. Moreover, the MSFS algorithm performs particularly well on data that contains some noise. However, because its parameters are predefined, the proposed algorithm is highly sensitive to parameter settings. We will address this issue in future work.

This work is supported by the Heilongjiang Provincial Department of Education Science Research Project (No. 12541239).

References

1. Shifei Ding, Fulin Wu, Jun Qian, Hongjie Jia, Fengxiang Jin. Research on data stream clustering algorithms. Springer Science (2013)
2. Guha S, Meyerson A, Mishra N. Clustering data streams. Proceedings of the 41st Annual Symposium on Foundations of Computer Science. Washington DC: IEEE Computer Society, pp. 359-366 (2000)
3. O'Callaghan L. Streaming data algorithms for high quality clustering. Proc. of the 18th International Conference on Data Engineering. Massachusetts: IEEE Computer Society, pp. 685-694 (2002)
4. Aggarwal C, Han J, Wang J et al. A framework for clustering evolving data streams. In: Proceedings of VLDB, pp. 81-92 (2003)
5. Aggarwal C, Han J, Wang J, Yu PS. A framework for projected clustering of high dimensional data streams. In: Proceedings of the 30th international conference on very large data bases, pp. 852-863 (2004)
6. Cao F, Ester M, Qian W, Zhou A. Density-based clustering over evolving data stream with noise. Proceedings of the sixth SIAM international conference on data mining (SIAM 06), Bethesda, pp. 326-337 (2006)
7. Ester M, Kriegel H-P, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the second ACM SIGKDD international conference on knowledge discovery and data mining (KDD 96), pp. 373-382 (1996)
8. Cui X, Potok TE. A distributed agent implementation of multiple species flocking model for document partitioning clustering. Cooperative information agents. Edinburgh (2006)
9. Agostino Forestiero, Clara Pizzuti and Giandomenico Spezzano. A single pass algorithm for clustering evolving data streams based on swarm intelligence. 26, 1-26 (2013)