On the Two-level Hybrid Clustering Algorithm

Eng Yeow Cheu, Chee Keong Kwoh, Zonglin Zhou
Bioinformatics Research Centre, School of Computer Engineering, Nanyang Technological University, Singapore 639798
ezlzhou@ntu.edu.sg

ABSTRACT

In this paper, we design hybrid clustering algorithms which involve two levels of clustering. At each of the levels, users can select the k-means, hierarchical or SOM clustering techniques. Unlike existing cluster analysis techniques, the hybrid clustering approach developed here represents the original data set by a smaller set of prototype vectors (cluster means) formed at the first level, which allows a clustering algorithm to divide the prototypes into groups efficiently at the second level. Since the clustering at the first level provides data abstraction, it reduces the number of samples passed to the second-level clustering. This reduction in the number of samples, and hence in computational cost, is especially important when hierarchical clustering is used in the second stage. The prototypes clustered at the first level are local averages of the data and are therefore less sensitive to random variations than the original data. The two-level hybrid clustering algorithms are evaluated empirically on four data sets.

1. INTRODUCTION

Over the years, extensive research has been carried out on determining optimal cluster analyses. Techniques for clustering have developed very rapidly, spurred mostly by the availability of computers to carry out the formidable calculations involved. These research efforts have resulted in a number of well-known algorithms, and variants are continuously being developed, each addressing specific shortcomings of its ancestors. In this paper, three general methods are selected, namely: (1) k-means, an iterative partitioning method; (2) agglomerative hierarchical clustering, a method that builds a hierarchical clustering tree from the bottom up; and (3) the Self-Organizing Map (SOM), a prominent unsupervised neural network model mapping high-dimensional data onto a two-dimensional plane. Our hybrid clustering techniques are designed on the basis of these three. An analysis of the differences in performance between the three general methods and our hybrid clustering algorithms is also given.

2. CLUSTERING ALGORITHMS

Many different algorithms are available today. Two of the algorithms that we investigate fall into two general categories: hierarchical and nonhierarchical. The third is an unsupervised clustering method, the SOM, used to find clusters in the input data and to identify an unknown data vector with one of the clusters [1].

2.1. HIERARCHICAL CLUSTERING PROCEDURE

There are basically two types of hierarchical clustering procedures: agglomerative and divisive. In agglomerative hierarchical methods, each observation starts out as its own cluster. In subsequent steps, the two closest clusters are combined into a new aggregate cluster, reducing the number of clusters by one at each step. Two groups of individuals formed at an earlier stage may join together in a new cluster. Eventually, all individuals are fused into one large cluster. In divisive methods, an initial single group of objects is divided into two subgroups such that the objects in one subgroup are far from the objects in the other. These subgroups are then further divided into dissimilar subgroups; the process continues until there are as many subgroups as objects (each object forms a cluster). In both hierarchical methods, a hierarchy with a tree-like structure is constructed, usually represented as a dendrogram or tree graph. The dendrogram illustrates the mergers or divisions made at successive levels.
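
As a minimal illustration (ours, not the paper's), the snippet below builds such a dendrogram for a small synthetic data set using SciPy's standard hierarchical clustering routines; the toy data and parameter choices are assumptions made for the example.

    # Minimal sketch: bottom-up merging and the resulting dendrogram.
    # SciPy's linkage/dendrogram are real APIs; the data is illustrative.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    rng = np.random.default_rng(0)
    # Twenty 2-D "objects": two loose groups of ten points each.
    X = np.vstack([rng.normal(0.0, 0.5, (10, 2)),
                   rng.normal(3.0, 0.5, (10, 2))])

    Z = linkage(X, method="average")  # one merge per row, bottom-up
    dendrogram(Z)                     # tree graph of the successive mergers
    plt.show()
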
In particular, Wishart [6] contends that the top-down decision tree approach carries an inherently greater risk of misclassification than the bottom-up approach, because it splits inefficiently on a single variable. Each classification generated in a decision tree is univariate by definition, and this limits the range of possible segments available for consideration. By comparison, the agglomerative approach is multivariate and exploratory, and allows more feasible segments to be investigated in terms of the actual distribution of the scatter. Hence, this project concentrates mainly on agglomerative hierarchical algorithms (divisive methods act almost as agglomerative methods in reverse). The following are the steps in the agglomerative hierarchical clustering algorithm for grouping N objects (a code sketch of these steps appears after the linkage definitions below):

1. Start with N clusters, each containing a single entity, and an N x N symmetric matrix of distances (or similarities) $D = \{d_{ik}\}$.

2. Search the distance matrix for the nearest (most similar) pair of clusters. Let the distance between the most similar clusters U and V be $d_{UV}$.

3. Merge clusters U and V. Label the newly formed cluster (UV). Update the entries in the distance matrix by (a) deleting the rows and columns corresponding to clusters U and V, and (b) adding a row and column giving the distances between cluster (UV) and the remaining clusters.

4. Repeat Steps 2 and 3 a total of N - 1 times. (All objects will be in a single cluster after the algorithm terminates.) Record the identity of the clusters that are merged and the levels (distances or similarities) at which the mergers take place.

2.2. VARIATIONS OF THE HIERARCHICAL ALGORITHM

This section describes the variants of the agglomerative hierarchical clustering algorithm implemented here: single, complete and average linkage, and Ward's method (ESS).

2.2.1. LINKAGE METHODS

The inputs to a linkage algorithm can be distances or similarities between pairs of objects. Single, complete and average linkage are the three linkage-based hierarchical clustering algorithms implemented. Table 1 shows the between-cluster distance definition for each of the methods; in this case, a dissimilarity coefficient is employed. The selection of the distance criterion or similarity coefficient depends on the application.

Table 1: Between-cluster distances $d(Q_k, Q_l)$, where $x_i \in Q_k$, $x_j \in Q_l$, and $N_k$ is the number of samples in cluster $Q_k$.

  Single:   $d_s = \min_{i,j} \, d(x_i, x_j)$
  Complete: $d_c = \max_{i,j} \, d(x_i, x_j)$
  Average:  $d_a = \frac{1}{N_k N_l} \sum_{x_i \in Q_k} \sum_{x_j \in Q_l} d(x_i, x_j)$

Single Linkage: Groups are formed from the individual entities by merging nearest neighbours, where the term nearest neighbour connotes the smallest distance or largest similarity.

Complete Linkage: The distance (similarity) between clusters is determined by the distance (similarity) between the two elements, one from each cluster, that are most distant (or least similar).

Average Linkage: Average linkage treats the distance between two clusters as the average distance between all pairs of items where one member of each pair belongs to each cluster.
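
The sketch below implements the four steps above directly, with the between-cluster distance supplied by one of the Table 1 rules. It is a naive illustrative implementation (quadratic distance matrix, cubic merge search), not the authors' code.

    # From-scratch agglomerative clustering following Steps 1-4, with the
    # Table 1 linkage rules as the between-cluster distance. Illustrative.
    import numpy as np

    def pairwise(X):
        # Euclidean distance matrix between all pairs of objects.
        diff = X[:, None, :] - X[None, :, :]
        return np.sqrt((diff ** 2).sum(-1))

    LINKAGES = {
        "single":   lambda d: d.min(),    # smallest pairwise distance
        "complete": lambda d: d.max(),    # largest pairwise distance
        "average":  lambda d: d.mean(),   # mean over all N_k * N_l pairs
    }

    def agglomerate(X, linkage="single"):
        rule = LINKAGES[linkage]
        D = pairwise(X)
        clusters = [[i] for i in range(len(X))]  # Step 1: N singleton clusters
        merges = []
        while len(clusters) > 1:
            # Step 2: find the nearest pair of clusters (U, V).
            best = None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    d = rule(D[np.ix_(clusters[a], clusters[b])])
                    if best is None or d < best[0]:
                        best = (d, a, b)
            d, a, b = best
            # Step 4's bookkeeping: record identity of merged clusters + level.
            merges.append((clusters[a], clusters[b], d))
            # Step 3: replace U and V with the merged cluster (UV).
            clusters[a] = clusters[a] + clusters[b]
            del clusters[b]
        return merges  # the loop above ran exactly N - 1 times
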
2.2.2. WARD'S METHOD (EUCLIDEAN SUM OF SQUARES)

In Ward's method, the distance between two clusters is the sum of squares between the two clusters, summed over all variables. At each stage of the clustering procedure, the within-cluster sum of squares is minimized over all partitions obtainable by combining two clusters from the previous stage. The Euclidean Sum of Squares (ESS) for a cluster $k$ is given by:

$E_k = \sum_i c_i \sum_j w_j (x_{ijk} - \mu_{jk})^2$

where $x_{ijk}$ is the value of variable $j$ in case $i$ within cluster $k$, $c_i$ is an optional differential weight for case $i$, $w_j$ is an optional differential weight for variable $j$, and $\mu_{jk}$ is the mean of variable $j$ for cluster $k$. The total ESS over all clusters is $E = \sum_k E_k$, and the increase in the Euclidean Sum of Squares at the union of two clusters $p$ and $q$ is:

$I_{pq} = E_{p \cup q} - E_p - E_q$

Ward considers hierarchical clustering procedures based on minimizing the loss of information from joining two groups. The method is usually implemented with the loss of information taken to be an increase in an error sum of squares criterion. At each step, the union of every possible pair of clusters is considered, and the two clusters whose combination results in the smallest increase in ESS are joined.
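
As a small illustration, the following computes $E_k$ and the merge increase $I_{pq}$ for the unweighted case ($c_i = w_j = 1$); it is a sketch under that assumption, not the authors' implementation.

    # ESS of one cluster and the increase I_pq for a candidate merge.
    # Unweighted case (c_i = w_j = 1) is assumed for brevity.
    import numpy as np

    def ess(X):
        # Sum over all variables of squared deviations from the cluster mean.
        return float(((X - X.mean(axis=0)) ** 2).sum())

    def ess_increase(Xp, Xq):
        # Ward joins the pair of clusters whose union yields the smallest increase.
        return ess(np.vstack([Xp, Xq])) - ess(Xp) - ess(Xq)
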
2.3. NONHIERARCHICAL CLUSTERING PROCEDURE

Nonhierarchical procedures do not involve the tree-like construction process. Instead, these methods assign objects into clusters once the number of clusters to be formed is specified. The number of clusters may either be specified in advance or determined as part of the clustering procedure. Nonhierarchical methods start either from (1) an initial partition of items into groups or (2) an initial set of seed points, which will form the nuclei of the clusters. Nonhierarchical clustering procedures are frequently referred to as k-means clustering. MacQueen [5] suggests the term k-means for describing an algorithm of his that assigns each item to the cluster having the nearest centroid (mean). In its simplest form, the process is composed of three steps (a code sketch follows the list):

1. Partition the items into k initial clusters (or specify k initial centroids (seed points)).

2. Proceed through the list of items, assigning each item to the cluster whose centroid (mean) is nearest. (Distance is usually computed using Euclidean distance with either standardized or unstandardized observations.) Recalculate the centroid for the cluster receiving the new item and for the cluster losing the item.

3. Repeat Step 2 until no more reassignments take place.

Because a matrix of distances (similarities) does not have to be determined, and the basic data do not have to be stored during the computer run, nonhierarchical methods can be applied to larger data sets than hierarchical techniques can.
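
A sketch of these steps in batch form (all items are reassigned, then all centroids recomputed) is given below; MacQueen's original procedure instead updates the two affected centroids after every single reassignment. The helper is illustrative and assumes no cluster becomes empty.

    # Batch k-means sketch: assign to nearest centroid, recompute centroids,
    # stop when no reassignment occurs. Assumes no cluster becomes empty.
    import numpy as np

    def k_means(X, k, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]  # seed points
        labels = np.full(len(X), -1)
        while True:
            # Step 2: Euclidean distance of every item to every centroid.
            d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = d.argmin(axis=1)
            if np.array_equal(new_labels, labels):  # Step 3: stable, stop
                return labels, centroids
            labels = new_labels
            centroids = np.vstack([X[labels == j].mean(axis=0) for j in range(k)])
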
2.4. SELF-ORGANIZING MAP (SOM)

The Self-Organizing Map (SOM) is an unsupervised neural network mapping high-dimensional input data onto a usually two-dimensional output space while preserving the relations between the data items. The cluster structure within the data, as well as the inter-cluster similarity, is visible from the resulting topology-preserving mapping [3, 4]. The SOM consists of units (neurons) arranged as a two-dimensional rectangular or hexagonal grid. During the training process, vectors from the data set are presented to the map in random order. The unit most similar to a chosen vector is selected as the winner and adapted to match the vector even better. Units in the neighborhood of the winner are then slightly adapted as well. The trained SOM provides a mapping of the data space onto a two-dimensional plane in such a way that similar data points are located close to each other.
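
The training process just described can be sketched as follows; the rectangular grid, Gaussian neighborhood, and exponential decay schedule are our illustrative choices rather than settings taken from the paper.

    # Minimal SOM training sketch: pick a vector, find the winning unit,
    # adapt the winner and its grid neighborhood toward the vector.
    import numpy as np

    def train_som(X, rows=10, cols=10, epochs=20, lr=0.5, radius=3.0, seed=0):
        rng = np.random.default_rng(seed)
        W = rng.normal(size=(rows, cols, X.shape[1]))  # unit weight vectors
        grid = np.dstack(np.meshgrid(np.arange(rows), np.arange(cols),
                                     indexing="ij")).astype(float)
        for t in range(epochs):
            decay = np.exp(-t / epochs)                # shrink lr and radius
            for x in rng.permutation(X):               # random presentation order
                # Winner: the unit whose weight vector is most similar to x.
                dist = np.linalg.norm(W - x, axis=2)
                wi, wj = np.unravel_index(dist.argmin(), dist.shape)
                # Units near the winner on the 2-D grid adapt too, more
                # weakly the farther they sit from the winner.
                g = np.linalg.norm(grid - np.array([wi, wj], float), axis=2)
                h = np.exp(-(g ** 2) / (2 * (radius * decay) ** 2))
                W += (lr * decay) * h[..., None] * (x - W)
        return W
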
3. THE EMPIRICAL STUDY

The software for all the clustering algorithms evaluated in this paper is available at [2].

Data set 1 is artificially generated to see how the algorithms perform when there are two well-separated but non-homogeneous clusters. The hybrid approach on data set 1 is performed using Ward's hierarchical clustering and single hierarchical clustering. During the first stage of the hybrid approach, Ward's method is used to find ten smaller clusters on the standardized data set 1. As can be seen from Figure 1, ten small clusters are found; no small cluster is formed with elements in both elongated clusters of data set 1. During the second-stage single hierarchical clustering, cluster analysis is performed on the ten cluster means, which are treated as new input vectors to the second stage. This hybrid approach exploits the properties of both Ward's method and single hierarchical clustering: Ward's method tends to find relatively equal-sized, hyper-spherical clusters, whereas single linkage tends to form long, elongated clusters. In this test, by combining the features of both clustering methods, the two elongated clusters of data set 1 are found (Figure 2).

Figure 1: Result after the 1st-stage Ward's hierarchical clustering on data set 1.

Figure 2: Result after the 2nd-stage single hierarchical clustering on data set 1.
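
The two-stage run just described can be sketched with SciPy as follows (Ward's method down to ten prototypes, then single linkage on the cluster means); the SciPy calls are real, but the helper itself is ours, written to mirror the description above, and it assumes no prototype cluster comes out empty.

    # Two-level hybrid sketch: Ward's method -> ten prototypes (cluster
    # means), then single linkage on the means. Illustrative only.
    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    def two_level(X, n_prototypes=10, n_final=2):
        X = (X - X.mean(axis=0)) / X.std(axis=0)           # standardize
        # Stage 1: Ward's method -> small, roughly hyper-spherical clusters.
        lab1 = fcluster(linkage(X, method="ward"),
                        n_prototypes, criterion="maxclust")
        # The cluster means become the input vectors of the second stage.
        means = np.vstack([X[lab1 == c].mean(axis=0)
                           for c in range(1, n_prototypes + 1)])
        # Stage 2: single linkage on the means recovers the elongated shapes.
        lab2 = fcluster(linkage(means, method="single"),
                        n_final, criterion="maxclust")
        return lab2[lab1 - 1]                              # final label per sample
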
Data set 2 contains three classes of 50 instances each, where each class refers to a type of iris plant. Each instance has four continuous attributes. One class is linearly separable from the other two; the latter are not linearly separable from each other. Table 2 summarizes the results achieved by each of the clustering techniques carried out in this experimental setup, including two two-level hybrid clustering algorithms.

Table 2: Results of the clustering techniques on the raw data set 2

  Clustering method              Percentage of samples
  K-means                        89.3%
  Single                         68%
  Complete                       96%
  Average                        74%
  Ward's method                  89.3%
  Hybrid, 2nd stage complete     92.6%
  Hybrid                         82%

Data set 3 contains two classes of 690 samples. In this data set there is a good mix of attributes: continuous, nominal with small numbers of values, and nominal with larger numbers of values. In Table 3, the results achieved by clustering this data set directly with the complete and average hierarchical techniques are not as good as the results achieved using the hybrid approach. In this experimental setup, the hybrid approach combining SOM with complete hierarchical clustering achieves a better result than complete clustering on data set 3. A better result is likewise achieved by combining SOM with average hierarchical clustering than by direct average hierarchical clustering on data set 3.

Table 3: Results of the clustering techniques on data set 3

  Clustering method              Percentage of samples
  K-means                        84%
  Single                         55%
  Complete                       55%
  Average                        55%
  Ward's method                  79%
  Hybrid, 2nd stage single       55%
  Hybrid, 2nd stage complete     55%
  Hybrid, 2nd stage complete     80%
  Hybrid, 2nd stage average      76%
  Hybrid                         84%

Data set 4 contains two classes of samples, where one class is the group of patients diagnosed positively for diabetes. Each sample has eight continuous attributes. In this experimental setup, the results in Table 4 achieved by all the clustering techniques are about the same. There is a slight improvement using the hybrid approach that combines k-means clustering with complete hierarchical clustering, compared to the result achieved using complete hierarchical clustering alone on data set 4.

Table 4: Results of each of the clustering techniques on data set 4

  Clustering method              Percentage of samples
  K-means                        70%
  Single                         65%
  Complete                       67%
  Average                        65%
  Ward's method                  66%
  Hybrid, 2nd stage single       65%
  Hybrid, 2nd stage complete     70%
  Hybrid, 2nd stage complete     63%
  Hybrid, 2nd stage average      65%
  Hybrid                         65%
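
The tables report one percentage per method. The paper does not spell out how this score is computed; a common choice, assumed in the sketch below, maps each cluster to its majority true class and counts the matching samples.

    # Assumed scoring rule (not stated in the paper): each cluster counts
    # as its majority true class. Expects integer class labels 0..C-1.
    import numpy as np

    def percent_matched(true_labels, cluster_labels):
        true_labels = np.asarray(true_labels, dtype=int)
        cluster_labels = np.asarray(cluster_labels)
        hits = 0
        for c in np.unique(cluster_labels):
            members = true_labels[cluster_labels == c]
            hits += np.bincount(members).max()  # majority class in this cluster
        return 100.0 * hits / len(true_labels)
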

4. CONCLUSIONS

We compared, on the four data sets, the performance of the two-level hybrid clustering algorithms against the other clustering algorithms: k-means, SOM, single, complete, average, and Ward's hierarchical clustering. The two-level hybrid clustering algorithms hit the highest percentage of samples on all the data sets as compared to each of the other clustering algorithms alone. In particular, on data set 1, the hybrid approach using Ward's method in the first stage and single hierarchical clustering in the second stage is able to find the two well-separated, non-homogeneous clusters of the data set, whereas the other clustering methods, apart from single clustering, are not able to find the clusters for this type of data set.

REFERENCES

[1] M. S. Aldenderfer and R. K. Blashfield, Cluster Analysis. Beverly Hills: Sage Publications, 1984.

[2] Clustan Clustering Software. Available: www.clustan.com, 2003.

[3] T. Kohonen, "Self-organizing maps: Optimization approaches," in Proceedings of the International Conference on Artificial Neural Networks, Finland, pp. 981-990, 1991.

[4] T. Kohonen, J. Hynninen, J. Kangas, and J. Laaksonen, The Self-Organizing Map Program Package, Laboratory of Computer and Information Science, Helsinki University of Technology, 1995.

[5] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. 5th Berkeley Symposium, Vol. I, pp. 281-297, 1967.

[6] D. Wishart, "Efficient hierarchical cluster analysis for data mining and knowledge discovery," presented at Interface 1998, Minneapolis, USA, 1998.