A Deflected Grid-based Algorithm for Clustering Analysis


NANCY P. LIN, CHUNG-I CHANG, HAO-EN CHUEH, HUNG-JEN CHEN, WEI-HUA HAO
Department of Computer Science and Information Engineering
Tamkang University
151 Ying-chuan Road, Tamsui, Taipei County
TAIWAN, R.O.C.
nancylin@mail.tku.edu.tw, taftdc@mail.tku.edu.tw, 8909034@s90.tku.edu.tw, chenh@mal.su.edu.tw, 88990@s89.tku.edu.tw

Abstract: - The grid-based clustering algorithm, which partitions the data space into a finite number of cells to form a grid structure and then performs all clustering operations on this grid structure, is efficient, but its result is seriously influenced by the size of the cells. To cluster efficiently and, at the same time, reduce the influence of the cell size, a new grid-based clustering algorithm, called DGD, is proposed in this paper. The main idea of the DGD algorithm is to deflect the original grid structure in each dimension of the data space after the clusters generated from the original structure have been obtained. The deflected grid structure can be considered a dynamic adjustment of the size of the original cells, and thus the clusters generated from the deflected grid structure can be used to revise the originally obtained clusters. The experimental results verify that the result of the DGD algorithm is indeed less influenced by the size of the cells than that of other grid-based algorithms.

Key-Words: - Data Mining, Clustering Algorithm, Grid-based Clustering, Significant Cell, Grid Structure

1 Introduction
Clustering analysis, which groups data points into clusters, has become an important task of data mining. Unlike classification, which analyzes labeled data, clustering analysis deals with data points without consulting a previously known label. In general, data points are grouped only on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity. Clusters are thus formed so that data points within a cluster are highly similar to each other but very dissimilar to the data points in other clusters.
Up to now, many clustering algorithms have been proposed [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13], and the so-called grid-based algorithms are generally the most computationally efficient ones. The main procedure of a grid-based clustering algorithm is to partition the data space into a finite number of cells to form a grid structure, then find the significant cells whose densities exceed a predefined threshold, and finally group nearby significant cells into clusters. Because the grid-based algorithm performs all clustering operations on the generated grid structure, its time complexity depends only on the number of cells in each dimension of the data space. That is, if the number of cells in each dimension can be kept small, the time complexity of the grid-based algorithm will be low. Some famous grid-based clustering algorithms are STING [11], WaveCluster [12], and CLIQUE [13].

As mentioned above, the grid-based clustering algorithm is efficient, but its result is seriously influenced by the size of the cells (or the value of the predefined threshold). If the cells are small, many cells must be connected to form one cluster, so more connections between cells are needed. Whether two cells are connected depends mainly on the number of data points in each cell, so the more cells there are, the stronger this effect becomes; and within the same data space, more cells means smaller cells.

To cluster data points efficiently and to reduce the influence of the cell size at the same time, a new grid-based clustering algorithm, called DGD, is proposed here. The main idea of the DGD algorithm is to deflect the original grid structure in each dimension of the data space after the clusters generated from the original grid structure have been obtained. The deflected grid structure is then used to find the new significant cells. Next, the nearby significant cells are grouped as well to form some new clusters.
Finally, these newly generated clusters are used to revise the originally generated clusters.

The rest of the paper is organized as follows. Section 2 introduces some famous grid-based clustering algorithms. Section 3 presents the proposed clustering algorithm, DGD. Section 4 reports experiments and discussion. Conclusions are given in section 5.

ISSN: 1109-2750, Issue 3, Volume 7, March 2008

2 Grid-based Clustering Algorithm
In this section, two popular grid-based clustering algorithms, STING [11] and CLIQUE [13], are introduced.

STING (Statistical Information Grid-based algorithm) (Wang et al., 1997) exploits the clustering properties of index structures. It employs a hierarchical grid structure and uses longitude and latitude to divide the data space into rectangular cells. STING first selects a layer to begin with. Each cell of this layer is labeled as relevant if the confidence interval of its probability is higher than the threshold. The algorithm then goes down the hierarchy one level at a time, checking whether the cells below the relevant cells are themselves relevant, until the bottom level is reached. It returns the regions that meet the requirement of the query and, finally, retrieves the data that fall into the relevant cells.

CLIQUE (Clustering In QUEst) (Agrawal et al., 1998) is a density- and grid-based approach that provides automatic sub-space clustering of high dimensional data sets. It consists of the following steps. First, it uses a bottom-up algorithm that exploits the monotonicity of the clustering criterion with respect to dimensionality to find dense units in different subspaces. Second, it uses a depth-first search algorithm to find all clusters, where dense units in the same connected component of the graph belong to the same cluster. Finally, it generates a minimal description of each cluster.

In fact, the results of these two algorithms are seriously influenced by the size of the predefined grids and the threshold of the significant cells.
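The generic procedure that grid-based algorithms share (partition the space, keep the dense cells, connect neighbouring dense cells) can be sketched as follows. This is a minimal illustration, not the implementation of STING or CLIQUE: the function names `cell_counts`, `significant_cells` and `connect_cells` are hypothetical, the data space is assumed to be the same half-open interval `[lo, hi)` in every dimension, and two cells are assumed to be neighbours when their indices differ by at most 1 in every dimension (the $3^d - 1$ surrounding cells). The grouping uses a depth-first search over connected dense cells.

```python
from collections import Counter
from itertools import product


def cell_counts(points, k, lo, hi):
    """Count points per cell when [lo, hi) is split into k equal parts per dimension."""
    width = (hi - lo) / k
    return Counter(
        tuple(min(int((x - lo) / width), k - 1) for x in p) for p in points
    )


def significant_cells(counts, threshold):
    """Cells whose density (point count) exceeds the threshold."""
    return {cell for cell, n in counts.items() if n > threshold}


def connect_cells(cells):
    """Group cells into connected components with a depth-first search.

    Assumption: cells are neighbours if every index differs by at most 1.
    """
    unvisited = set(cells)
    clusters = []
    while unvisited:
        stack = [unvisited.pop()]
        component = set(stack)
        while stack:
            cur = stack.pop()
            for delta in product((-1, 0, 1), repeat=len(cur)):
                nb = tuple(c + d for c, d in zip(cur, delta))
                if nb in unvisited:
                    unvisited.remove(nb)
                    component.add(nb)
                    stack.append(nb)
        clusters.append(component)
    return clusters


# Toy data (made up for illustration): two touching dense cells and one isolated dense cell.
points = [(0.1, 0.1), (0.2, 0.2), (0.05, 0.2),
          (0.3, 0.3), (0.4, 0.4), (0.35, 0.45),
          (0.9, 0.9), (0.8, 0.95), (0.99, 0.8),
          (0.1, 0.9)]
dense = significant_cells(cell_counts(points, k=4, lo=0.0, hi=1.0), threshold=2)
clusters = connect_cells(dense)
```

Because every operation after the counting pass works on cells rather than on points, the cost of the grouping step is governed by the number of cells, which is why the choice of cell size and threshold matters so much for the result.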
To reduce the influence of the size of the predefined grids and the threshold of the significant cells, we propose a new grid-based clustering algorithm, called the Deflected Grid-based (DGD) algorithm, in this paper.

3 A Deflected Grid-based Algorithm
After the grid structure is built and clustered, the deflected grid-based algorithm (DGD) deflects the cell margins by half a cell width in each dimension to obtain a new grid structure, clusters again on the new structure, and then combines the two sets of clusters into the final result. The procedure of DGD is shown in the following steps.

Step 1: Generate a grid structure. By dividing each dimension into $k$ equal parts, the $n$-dimensional data space is partitioned into $k^n$ non-overlapping cells, which form the grid structure.

Step 2: Identify significant cells. Next, the density of each cell is calculated to find the significant cells, whose densities exceed a predefined threshold.

Step 3: Generate the set of clusters. The nearby significant cells that are connected to each other are grouped into clusters. The set of these clusters is denoted as $S_1$.

Step 4: Deflect the grid structure. The original grid structure is then deflected by distance $d$ in each dimension of the data space.

Step 5: Generate the set of new clusters. Step 2 and step 3 are applied again to the deflected grid structure to generate the set of new clusters, denoted as $S_2$.

Step 6: Revise the original clusters. The clusters generated from the deflected grid structure are used to revise the originally obtained clusters as follows.

Step 6a: For each $C_{1i} \in S_1$, find each overlapped cluster $C_{2j} \in S_2$, where $C_{1i} \cap C_{2j} \neq \emptyset$, and generate the rule $C_{1i} \to C_{2j}$. The rule $C_{1i} \to C_{2j}$ means that cluster $C_{1i}$ overlaps cluster $C_{2j}$. Similarly, for each $C_{2j} \in S_2$, find each overlapped cluster $C_{1i}$, where $C_{2j} \cap C_{1i} \neq \emptyset$, and generate the rule $C_{2j} \to C_{1i}$.

Step 6b: The set of all the rules generated in step 6a is denoted as $R_0$. Next, each cluster $C_{1i} \in S_1$ is revised by using the cluster revised function CR(), which is shown in fig. 1.

Step 7: Generate the clustering result. After all clusters of $S_1$ have been revised, $S_1$ is the set of final clusters.

for each $C_{1i} \in S_1$
    Let $X^+ := C_{1i}$;
    Repeat
        $oldX^+ := X^+$;
        For each $Y \to Z$ in $R_0$ Do
            If $Y \subseteq X^+$ then
                $X^+ := X^+ \cup Z$;
                If $Z \in S_1$ then $S_1 := S_1 - \{Z\}$;
            Endif
    Until ($oldX^+ = X^+$);
    $C_{1i} := X^+$;
End

Fig. 1 The CR algorithm

3.1 Example
Here, a two-dimensional example with 600 points, shown in fig. 2, which is easily divided into two clusters, is used to walk through the algorithm.

Fig. 2 Two-dimensional example

Step 1: Generate a grid structure. By dividing each dimension into 20 equal parts, the two-dimensional data space in this example is partitioned into $20^2$ non-overlapping cells, which form the grid structure shown in fig. 3.

Fig. 3 The grid structure of $20^2$ cells

Step 2: Identify significant cells. Next, the density of each cell, shown in fig. 4, is calculated to find the significant cells whose densities exceed a predefined threshold; here the threshold is 4.

Fig. 4 The density of each cell

Step 3: Generate the set of clusters. The nearby significant cells that are connected to each other are grouped into 5 clusters. The set of these clusters is denoted as $S_1 = \{C_{11}, C_{12}, \ldots, C_{15}\}$, shown in fig. 5.

Step 4: Deflect the grid structure. The original grid structure is deflected by distance $d$ in each dimension of the data space. In this example, $d$ is equal to half the side length of a cell. After the deflection, the new grid structure is partitioned into $21^2$ cells, shown in fig. 6.

Fig. 5 Result of the first clustering

Fig. 6 The new grid structure with $21^2$ cells

Fig. 7 The cell density of the new grid structure

Fig. 8 Result of the second clustering

Step 5: Generate the set of new clusters. The cell density of the new grid structure is shown in fig. 7. The significant cells, whose densities exceed the predefined threshold of 4, are found as before, and the nearby significant cells that are connected to each other are grouped into 4 clusters. The set of these clusters is denoted as $S_2 = \{C_{21}, C_{22}, \ldots, C_{24}\}$, shown in fig. 8.

Step 6: Revise the original clusters. The clusters generated from the deflected grid structure are used to revise the originally obtained clusters as in steps 6a and 6b. $R_0$ is composed of the rules $C_{1i} \to C_{2j}$, shown in table 1, and $C_{2j} \to C_{1i}$, shown in table 2.

Step 7: Generate the clustering result. After all clusters of $S_1$ have been revised by using the cluster revised function CR(), the revised $S_1$ is shown in table 3, and the final clustering result is shown in fig. 9.
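The seven steps can be put together in a short sketch. It is an illustration under stated assumptions rather than the authors' implementation: the names `cell_index`, `grid_clusters`, `revise` and `dgd` are hypothetical, clusters are represented as sets of point indices so that "overlap" means sharing at least one data point, neighbouring cells are those whose indices differ by at most 1 in every dimension, and the deflection distance is fixed at half a cell width. The `revise` function computes the same closure as CR() in fig. 1, but drops absorbed clusters after each closure instead of inside the loop.

```python
from collections import deque
from itertools import product


def cell_index(point, width, origin):
    """Index of the cell containing `point` in a grid anchored at `origin`."""
    return tuple(int((x - origin) // width) for x in point)


def grid_clusters(points, width, origin, threshold):
    """Steps 2-3: keep cells denser than `threshold`, connect neighbouring
    cells, and return each cluster as a frozenset of point indices."""
    members = {}
    for i, p in enumerate(points):
        members.setdefault(cell_index(p, width, origin), set()).add(i)
    dense = {c for c, pts in members.items() if len(pts) > threshold}
    clusters = []
    while dense:
        seed = dense.pop()
        cells, queue = [seed], deque([seed])
        while queue:
            cur = queue.popleft()
            for delta in product((-1, 0, 1), repeat=len(cur)):
                nb = tuple(c + d for c, d in zip(cur, delta))
                if nb in dense:
                    dense.remove(nb)
                    cells.append(nb)
                    queue.append(nb)
        clusters.append(frozenset().union(*(members[c] for c in cells)))
    return clusters


def revise(s1, s2):
    """Step 6: build the overlap rules R0 in both directions, then close
    each cluster of S1 under them (a simplified variant of CR())."""
    rules = [(a, b) for a in s1 for b in s2 if a & b]
    rules += [(b, a) for a, b in rules]
    remaining, final = list(s1), []
    while remaining:
        closure = set(remaining.pop(0))
        changed = True
        while changed:
            changed = False
            for y, z in rules:
                if y <= closure and not z <= closure:
                    closure |= z
                    changed = True
        # Clusters of S1 absorbed into this closure are not revised again.
        remaining = [c for c in remaining if not c <= closure]
        final.append(frozenset(closure))
    return final


def dgd(points, lo, hi, k, threshold):
    """Steps 1-7 with the deflection distance d set to half a cell width."""
    width = (hi - lo) / k
    s1 = grid_clusters(points, width, lo, threshold)              # original grid
    s2 = grid_clusters(points, width, lo - width / 2, threshold)  # deflected grid
    return revise(s1, s2)


# Toy 1-D data (made up): the original grid splits the left group in two,
# while the deflected grid bridges the gap.
points = [(0.6,), (0.9,), (1.6,), (2.2,), (2.4,), (5.1,), (5.2,), (5.3,)]
single_grid = grid_clusters(points, width=1.0, origin=0.0, threshold=1)
final = dgd(points, lo=0.0, hi=6.0, k=6, threshold=1)
```

In the toy data the single original grid yields three clusters, because the cell holding the point 1.6 is not dense enough to link its neighbours, while the revised result has two. Anchoring the deflected grid at `lo - width / 2` is also what produces one more cell per dimension, matching the $20^2$ versus $21^2$ cell counts of the example.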

Table 1 Rules $C_{1i} \to C_{2j}$ of $R_0$

Table 2 Rules $C_{2j} \to C_{1i}$ of $R_0$

Table 3 The set of final clusters

Fig. 9 The final clustering result

4 Experiment and Discussions
Here, we experiment with seven different data sets, whose features are shown in table 4.

Table 4 Experimental data features

Fig. 10 Experiment 1

Fig. 11 Experiment 2

Fig. 12 Experiment 3

Fig. 13 Experiment 4

Fig. 14 Experiment 5

Fig. 15 Experiment 6

Fig. 16 Experiment 7

4.1 Experiment
Figure 17 shows the correct rates of DGD and SDG, where the correct clustering result of SDG is obtained by using only one of the two grid structures (the original or the new one) in the experiment. The correct rates of DGD are all higher than those of SDG. In the experiments, the correct rates are compared by using 100 random sets of parameters (density threshold, number of dividing parts in each dimension) from (6, ) to (55, 3).

Fig. 17 Correct rates of DGD and SDG

Table 5 is the correct rate comparison sheet of experiment 1, obtained with 100 random sets of parameters. The correct rate of DGD is 47%, which is higher than that of SDG, whose correct rate is only 2%. The rate at which both are correct for the same set of parameters is also only 2%. So, in experiment 1, the correct results of SDG are part of the correct results of DGD: there is no set of parameters for which the result is wrong when using DGD but correct when using SDG.

Table 5 The correct rate comparison sheet of experiment 1

In tables 6, 7, 8, and 9, however, it is possible to find sets of parameters for which the result is correct when using SDG but wrong when using DGD. Though these values are low, the results differ from those of experiment 1 in table 5. So, the results of SDG are not always part of the clustering results of DGD.

Table 6 The correct rate comparison sheet of experiment 2

Table 7 The correct rate comparison sheet of experiment 3

Because the correct rate of DGD is always higher than that of SDG, using DGD raises the correct rate compared with using other grid-based algorithms. In other words, the experimental results verify that the result of the DGD algorithm is less influenced by the size of the cells than that of other grid-based algorithms.
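The inference drawn from table 5 can be written as a set relation. As a sketch, let $P$ be the sampled parameter sets and let $D \subseteq P$ and $S \subseteq P$ be those for which DGD and SDG, respectively, return the correct clustering; these symbols are introduced here for illustration and do not appear in the original tables.

$$
\frac{|S \cap D|}{|P|} = \frac{|S|}{|P|}
\;\Longrightarrow\; |S \setminus D| = |S| - |S \cap D| = 0
\;\Longrightarrow\; S \subseteq D .
$$

Read this way, equality of the "both correct" rate and the SDG rate is exactly the condition under which every parameter set that works for SDG also works for DGD, while a strictly smaller "both correct" rate means that $S \setminus D$ is non-empty.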

Table 8 The correct rate comparison sheet of experiment 4

Table 9 The correct rate comparison sheet of experiment 5

Table 10 The correct rate comparison sheet of experiment 6

Table 11 The correct rate comparison sheet of experiment 7

4.2 Discussion
In the DGD algorithm, for each data point $\alpha$, only those points that are in the same cell as $\alpha$ are considered. The density of each cell is calculated first. When the total number of data points is $n$ and each of the $d$ dimensions is divided into $m$ intervals, there are $m^d$ cells. The time for checking the density of all cells is $k_0 [m^d + (m+1)^d]$. If $p = 3^d - 1$ is the number of nearby cells of one cell, the time for comparing the connected significant cells to generate the two original clustering results is at most $k_1 p [m^d + (m+1)^d]$. The time of the cluster revised function CR() is $k_2 r$, where $r$ is the number of rules $C_{1i} \to C_{2j}$ and $C_{2j} \to C_{1i}$ in $R_0$, with $r \ll m^d \ll n$. Finally, the time for checking the cluster number of all data is $k_3 n$. So the total time complexity is $O(m^d) + O(n)$.

5 Conclusion and Future Work
In this paper, a new grid-based clustering algorithm, called the Deflected Grid-based (DGD) algorithm, is proposed. It works over clearly wider ranges of cell size and density threshold, and the experimental results verify that the result of the DGD algorithm is less influenced by the size of the cells than that of other grid-based algorithms. At the same time, the DGD algorithm still inherits the advantage of low time complexity. There are many interesting research problems related to the DGD algorithm. One is to find a non-parametric algorithm that is at least as efficient as the DGD algorithm. Another is to use parallel algorithms to reduce the computational cost.

References:
[1] J. MacQueen. Some methods for classification and analysis of multivariate observations. Proc. 5th Berkeley Symp. Math. Statist. Prob., 1:281-297, 1967.
[2] L. Kaufman and P.J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley & Sons, 1990.
[3] Charu C. Aggarwal, Philip S. Yu. An effective and efficient algorithm for high-dimensional outlier detection. The VLDB Journal, 14:211-221, 2005.
[4] M. Ester, H. Kriegel, J. Sander, and X. Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proc. of 2nd Int. Conf. on KDD, 1996, pages 226-231.
[5] A. Hinneburg and D. A. Keim. An Efficient Approach to Clustering in Large Multimedia Databases with Noise. In Knowledge Discovery and Data Mining, 1998, pages 58-65.
[6] M. Ankerst et al. OPTICS: Ordering Points to Identify the Clustering Structure. In Proc. ACM SIGMOD Int. Conf. on MOD, 1999, pages 49-60.
[7] A. H. Pilevar, M. Sukumar. GCHL: A grid-clustering algorithm for high-dimensional very large spatial data bases. Pattern Recognition Letters 26 (2005), 999-1010.
[8] Zhao Y.C., Song J. GDILC: A Grid-based Density-Isoline Clustering Algorithm. In Proc. Internat. Conf. on Info-net, Vol. 3, pp. 140-145, 2001.
[9] Ma, W.M., Eden, Chow, Tommy, W.S. A new shifting grid clustering algorithm. Pattern Recognition 37 (3), 2004, 503-514.
[10] Alevizos, P., Boutsinas, B., Tasoulis, D., Vrahatis, M.N. Improving the K-windows clustering algorithm. In Proc. 14th IEEE Internat. Conf. on Tools with Artificial Intell., pp. 239-245, 2002.
[11] Wei Wang, Jiong Yang, and Richard R. Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining. In Proc. of 23rd Int. Conf. on VLDB, 1997, pages 186-195.
[12] G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: a wavelet-based clustering approach for spatial data in very large databases. In VLDB Journal: Very Large Data Bases, 2000, pages 289-304.
[13] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic sub-space clustering of high dimensional data for data mining applications. In Proc. of ACM SIGMOD Int. Conf. MOD, 1998, pages 94-105.