Bidirectional Hierarchical Clustering for Web Mining

ZHONGMEI YAO & BEN CHOI
Computer Science, College of Engineering and Science
Louisiana Tech University, Ruston, LA 71272, USA
zya001@latech.edu, pro@benchoi.org

Abstract

In this paper we propose a new bidirectional hierarchical clustering system for addressing the challenges of web mining. The key feature of our approach is that it aims to maximize the intra-cluster similarity in the bottom-up cluster-merging phase and to minimize the inter-cluster similarity in the top-down refinement phase. This two-pass approach achieves better clustering than existing one-pass approaches. We also propose a new cluster-merging criterion that allows more than two clusters to be merged in each step, and a new measure of similarity that takes into consideration not only the inter-connectivity between clusters but also the internal connectivity within the clusters. These reduce the average complexity for creating the final hierarchical structure of clusters from O(n^2) to O(n). The hierarchical structure represents a semantic structure between concepts of clusters and is directly applicable to the future of the semantic net.

1. Introduction

The World Wide Web, with its explosive growth and ever-broadening reach, has become the default knowledge resource for many areas of endeavor. It is becoming increasingly important to devise sophisticated schemes to find interesting concepts and relations between concepts in this resource. Clustering is one of the techniques that can solve this problem. Clustering is an unsupervised discovery process for partitioning a set of data such that the intra-cluster similarity is maximized and the inter-cluster similarity is minimized [1,2]. The application of clustering techniques to web mining has been facing a number of challenges [3,4], such as the huge amount of resources, retrieval time, high dimensionality, quality, and meaningful interpretation. In this paper we propose a new Bidirectional Hierarchical Clustering system for a high dimensional space, based in part on the graph partitioning model [5,6].
Our system first uses the all-k-nearest neighbors [7] to sparsify the graph and to eliminate outliers. In our bottom-up cluster-merging phase, we define a new edge matching [5,6] method that takes into consideration not only the inter-connectivity between vertices but also the internal connectivity within the vertices. This edge matching method also discovers the hierarchical structure of clusters much faster than the usual hierarchical clustering. Our top-down refinement processing then eliminates errors that occurred in the greedy cluster-merging phase. The final step is to extract concepts from the clusters organized in the hierarchical structure.

The rest of this paper is organized as follows. Section 2 reviews related work. Our proposed Bidirectional Hierarchical Clustering system is presented in Section 3. Section 4 discusses the computational complexity of our algorithm. Section 5 contains conclusions and future work.

2. Related Work

Numerous clustering algorithms appear in the literature [1-4,8-19]. Clustering techniques can be broadly categorized into partitional clustering and hierarchical clustering [1,2], which differ in whether they produce flat partitions or a hierarchy of clusters. The k-means is a partitional clustering algorithm which has O(n) time complexity in terms of the number of data points [8,19]. While the k-means is sensitive to outliers, the medoid-based method, typified by PAM and CLARANS [9], eliminates this problem. But the k-medoids have O(n^2) time complexity. The limitations of these two partitional schemes are that they are sensitive to initial seeds and they fail when clusters have arbitrary shapes or largely different sizes.

This research was supported in part by the Center for Entrepreneurship and Information Technology (CEnIT), Louisiana Tech University, Grant CSe 200123.

Proceedings of the IEEE/WIC International Conference on Web Intelligence (WI'03), 0-7695-1932-6/03 $17.00 © 2003 IEEE
Hierarchical clustering creates a nested sequence of clusters. There are variations of hierarchical agglomerative clustering (HAC) algorithms which differ primarily in how they compute the distance between clusters [1,2]. For instance, the single link method can find clusters of arbitrary shape or different sizes, but it is susceptible to noise and outliers. The complete link method is less used because of its O(n^3) time complexity. An efficient method is the group average method, which defines the average pair-wise distance as the cluster distance. Other density-based or grid-based clustering methods have been presented, e.g. GDBSCAN [13] and OptiGrid [14]. Nevertheless they do not work effectively in a very high dimensional space [4,15]. Another clustering approach is the probabilistic approach [16]. This approach tends to impose structure on the data, and the selected distribution family may not be appropriate [4]. More recently, clustering algorithms for mining large databases have been proposed [10-12]. Most of these are variants of hierarchical clustering, e.g. BIRCH [11], CURE [12], and CHAMELEON [10].

In summary, only k-means methods, HAC methods and graph partitioning algorithms [10] have been applied to very high dimensional datasets. HAC algorithms achieve higher quality and are more versatile than the k-means algorithm. The major limitations of HAC methods [8,10,17] are their O(n^2) time complexity and the errors that may occur during the greedy cluster-merging procedure (Figure 1).

In the following sections we present our new algorithm, which overcomes the limitations of common HAC methods.

(Figure: an eight-vertex weighted graph on vertices A-H being greedily merged step by step.)
Figure 1. An example of HAC. Note that the greedy decision can lead to an incorrect solution. The correct solution in this case is (ABCD) and (EFGH).

3. Our New Bidirectional Hierarchical Clustering (BHC) System

In this section we propose our new BHC approach.
This approach consists of the following five major steps: (1) representing web pages by the vector-space model; (2) generating the matrix of k-nearest neighbors of web pages; (3) the bottom-up cluster-merging phase; (4) the top-down refinement phase; and (5) extracting concepts of clusters.

3.1. Representing Web Pages

We convert a web page into a vector of features, and only text in the web page is represented:

    d_i = (w_i1, ..., w_ik, ..., w_im)

where w_ik is the weight of the term t_k in the i-th web page, and m is the number of distinct terms (the dimensionality) in the dataset. Hereafter m denotes the dimensionality and n denotes the number of web pages. A major difficulty of text clustering is the high dimensionality of the feature space [20,21]. After removing stop terms and stemming the terms in web pages, we remove those terms whose document frequencies [21] are less than a threshold, in order to reduce dimensionality. Document frequency thresholding can be used reliably, because it eliminates 90% or more of the unique terms with either an improvement or no loss in accuracy, and it also has the lowest cost [21]. We then use the term frequency inverse document frequency (tf-idf) [19] to determine w_ik (1 ≤ k ≤ m, 1 ≤ i ≤ n):

    w_ik = tf_ik · log(n / df_k) / s_i

where tf_ik is the frequency of the term t_k in the i-th web page, df_k is the document frequency of the term t_k, and s_i is the normalization component. Finally, the length of each web page vector is normalized to have unit L2 norm [19], that is,

    s_i = ( Σ_{k=1..m} ( tf_ik · log(n / df_k) )^2 )^{1/2}.

The normalization ensures that web pages dealing with the same subject matter, but differing in length, lead to similar web page vectors [19]. The cosine measure is then applied to compute the similarity between vectors d_i and d_j:

    cos(d_i, d_j) = (d_i · d_j) / (|d_i| |d_j|)

where · denotes the dot product of vectors and |d_i| is the length of the vector. Since the length of each web page vector is normalized to unit length, the above formula is simplified to cos(d_i, d_j) = d_i · d_j.
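As an illustration, the weighting and normalization above can be sketched in Python. The helper names (`tfidf_vectors`, `cosine`) and the document-frequency threshold value are our own choices for this sketch, not part of the paper's implementation.

```python
import math

def tfidf_vectors(pages, df_threshold=2):
    """Build unit-length sparse tf-idf vectors for a list of token lists.

    Terms whose document frequency is below df_threshold are dropped,
    weights are tf * log(n / df), and each vector is L2-normalized,
    as in Section 3.1.
    """
    n = len(pages)
    df = {}                               # document frequency of each term
    for page in pages:
        for term in set(page):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for page in pages:
        tf = {}
        for term in page:
            if df[term] >= df_threshold:  # dimensionality reduction
                tf[term] = tf.get(term, 0) + 1
        w = {t: f * math.log(n / df[t]) for t, f in tf.items()}
        s = math.sqrt(sum(x * x for x in w.values())) or 1.0
        vectors.append({t: x / s for t, x in w.items()})
    return vectors

def cosine(d1, d2):
    """Dot product of two unit-length sparse vectors (= their cosine)."""
    if len(d2) < len(d1):
        d1, d2 = d2, d1                   # iterate over the shorter vector
    return sum(x * d2.get(t, 0.0) for t, x in d1.items())
```

Because the vectors come out unit-length, `cosine` reduces to a sparse dot product, matching the simplified formula cos(d_i, d_j) = d_i · d_j.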
The cosine measure has been tested to be one of the best similarity measures in the web page domain, compared to the Dice coefficient, extended Jaccard, Euclidean, and Pearson correlation measures [4,20,22].

3.2. Generating the All-k-nearest-neighbor Matrix

Finding the k nearest neighbors of each web page can be solved by brute force using O(n^2) similarity computations. Fortunately, there are fast algorithms to solve the all-k-nearest-neighbor (Aknn) problem. We apply the fast algorithm presented in [7]. The fast Aknn algorithm starts with a rough guess of the set of k-nearest neighbors and refines it as more information becomes available throughout the process. A pivot-based index is used to index the set of nearest neighbors. The pivot-based indexing algorithm [7] relies on the triangle inequality. However, general similarity functions do not obey the triangle inequality. Thus we have to transform the similarity s into the distance t = −log(s) [4], just for the Aknn problem. The algorithm has O(n^α) (α < 2) time complexity. The value of α depends on how good the index is for searching the vector space.

The n × k Aknn matrix is used to construct a sparse graph, in which a vertex represents a web page and each vertex is connected with its k nearest neighbors. Edges in the graph are weighted by the pair-wise similarities among vertices. We denote the maximum edge weight as Max, which will be used to determine thresholds in the following phase. This k-nearest-neighbor graph approach reduces redundancy, outliers and overall execution time.

3.3. The Bottom-up Cluster Merging Phase

Our bottom-up cluster merging approach operates on the idea of matching [5,6] in graph partitioning. If the edge between two vertices in the graph G_i = (V_i, E_i) (V_i is the set of vertices and E_i is the set of edges) has been matched, it is collapsed and a multi-node consisting of these two vertices is created. A coarser graph G_{i+1} is obtained by collapsing the matched adjacent vertices in G_i (Figure 2 [5,6]). Each vertex in the original graph G_0 is regarded as a single cluster. A hierarchical structure of clusters is created by the graph coarsening procedure.

Figure 2. Matching vertices to coarsen a graph.

We define a new matching method called Heavy Connectivity Matching (HCM) that allows more than two vertices to be merged in each stage. We also define a new similarity measure called edge connectivity that takes into consideration the inter-connectivity between vertices and the internal connectivity within the vertices.
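For illustration, the sparse k-nearest-neighbor graph of Section 3.2 can be sketched as below. The paper uses the fast pivot-based Aknn algorithm of [7]; the brute-force O(n^2) loop here is only a stand-in for it, and the function name and edge representation are our own choices.

```python
def knn_graph(vectors, k, sim):
    """Sparse all-k-nearest-neighbor graph as a dict of weighted edges.

    Each vertex i is connected to its k most similar vertices; an edge
    is stored once under the key (min(i, j), max(i, j)) with its
    pairwise similarity as the weight.
    """
    n = len(vectors)
    edges = {}
    for i in range(n):
        sims = sorted(((sim(vectors[i], vectors[j]), j)
                       for j in range(n) if j != i), reverse=True)
        for s, j in sims[:k]:
            if s > 0:                     # skip entirely dissimilar pairs
                edges[(min(i, j), max(i, j))] = s
    return edges
```

With small k this graph is sparse (its edge count is linear in n), which is what makes the later merging phase cheap.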
The edge connectivity between vertices u and v is defined as:

    EC(u, v) = ( in_edge(u) + in_edge(v) + cr_edge(u, v) ) / |u ∪ v|

where in_edge(u) is the sum of the weights of the edges connecting sub-vertices within vertex u if u contains more than one vertex, and 0 otherwise; cr_edge(u, v) is the weight of the edge crossing between vertices u and v; and |u ∪ v| is the number of edges in the union of u and v.

Our Heavy Connectivity Matching method proceeds by visiting vertices in an arbitrary order. If a vertex u has not been matched yet, we select its unmatched adjacent vertices such that the edge connectivity between u and those vertices is larger than a threshold. Vertex u and its matched adjacent vertices are then combined to form a multi-node for the next coarser graph. In order to preserve the connectivity information in the coarser graph, we update the edge weights after each stage of coarsening the graph. Let V_v be the set of vertices of G_i combined to form vertex v of the next coarser G_{i+1}. We compute in_edge(v) as the sum of the weights of the edges connecting the vertices within V_v. In the case where more than one vertex of V_v contains edges to another vertex u, the weight cr_edge(v, u) is updated as the sum of the weights of the edges connecting v and u.

HCM is applied successively to coarsen the graph. In each stage the threshold is divided by a decay factor γ (γ > 1) [18]. The threshold equals Max/γ for the first stage. During the i-th stage, the edges whose weights fall between Max/γ^{i+1} and Max/γ^i are matched and collapsed. γ controls the speed of coarsening and guarantees that a certain number of edges are matched and collapsed during each stage. The hierarchical structure of clusters is thus created during this merging phase. The merging procedure stops when the highest edge connectivity in the coarsest graph falls below a stopping factor that is a function of Max/γ.

3.4. The Top-down Refinement Phase

After grouping vertices in the greedy hierarchical way, we successively refine the clusters as we project the coarser graph G_{i+1} down to the larger, finer graph G_i.
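The edge-connectivity measure and a single HCM coarsening stage of Section 3.3 might be sketched as follows. Representing multi-nodes as vertex sets, and collapsing the per-stage threshold schedule into a single `threshold` argument, are our simplifications of the paper's method.

```python
def edge_connectivity(u, v, graph):
    """EC(u, v) = (in_edge(u) + in_edge(v) + cr_edge(u, v)) / |u union v|.

    u and v are sets of original vertices; graph maps (a, b) vertex
    pairs to edge weights. The denominator counts the edges lying in
    the union of u and v.
    """
    weight_sum = 0.0
    n_edges = 0
    for (a, b), w in graph.items():
        internal = (a in u and b in u) or (a in v and b in v)
        crossing = (a in u and b in v) or (a in v and b in u)
        if internal or crossing:
            weight_sum += w
            n_edges += 1
    return weight_sum / n_edges if n_edges else 0.0

def hcm_stage(clusters, graph, threshold):
    """One Heavy Connectivity Matching stage: visit clusters in order,
    merging each unmatched cluster with every unmatched neighbor whose
    edge connectivity with it exceeds the threshold (so more than two
    clusters may merge at once)."""
    matched, coarser = set(), []
    for u in clusters:
        if u in matched:
            continue
        group = set(u)
        for v in clusters:
            if v is u or v in matched:
                continue
            if edge_connectivity(u, v, graph) > threshold:
                group |= v
                matched.add(v)
        matched.add(u)
        coarser.append(frozenset(group))
    return coarser
```

Applying `hcm_stage` repeatedly with a threshold divided by γ each time reproduces the coarsening schedule described above.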
Obtaining the larger G_i from the coarser G_{i+1} is done simply by transforming each multi-node v of G_{i+1} back into its individual vertices V_v of G_i. Since G_i is finer, it provides more degrees of freedom that can be used to refine the clustering. The refinement algorithm is used to reduce the inter-connectivity between clusters (or multi-nodes). The inter-connectivity between clusters A and B is defined as:

    gain(A, B) = ( Σ_{i∈A, j∈B} weight(i, j) ) / (|A| · |B|)

where vertex i belongs to cluster A (or multi-node A), vertex j belongs to cluster B, and |A| and |B| are the sizes of clusters A and B. If swapping a vertex in A to cluster B decreases the value of the gain, then the vertex should be moved to
cluster B. The gain is similar to the ratio-cut heuristic in [17]. Given the definition of the gain, for each vertex u we compute the improvement if u is moved from the cluster it belongs to, to one of the other clusters that u is connected to. The improvement is indicated by the heuristic value of (gain-before-swap − gain-after-swap). The Kernighan-Lin (KL) algorithm [5,6] then proceeds by repeatedly selecting a vertex u with the highest heuristic value and moving it to the desired cluster. After moving u, u won't be moved again, and the heuristic values of the vertices adjacent to u are updated to reflect the change. In each finer graph, the KL algorithm terminates when no further vertex move will decrease the inter-connectivity between clusters. This refinement algorithm is applied at each successively finer graph.

For the example in Figure 1, we compute the improvement if any vertex is moved to the other cluster to which it is connected. As we can see, D will be moved to the other cluster (A,B,C), since its value of (gain-before-swap − gain-after-swap) is positive (0.37). Further moving won't yield any improvement. This illustrates that our refinement method can improve the clustering and obtain the correct solution. We can see that the top-down (coarsest-to-finest) refinement approach operates at different representation scales and can easily identify groups of vertices to be moved together. Thus this multi-level refinement approach can climb out of local minima very effectively [5,6].

3.5. Concept Extraction

We extract a cluster concept by selecting the most important terms from each cluster. We apply the most frequent and predictive term method to extract the concepts of clusters, since it achieved the best performance over the χ² method, the most frequent term method, and the most predictive term method [24]. The most frequent and predictive term method selects terms based on the product of local frequency and predictiveness:

    score(term) = p(term | cluster) · ( p(term | cluster) / p(term) )

where p(term | cluster) is the frequency of the term in the cluster and p(term) is the term's frequency in the whole collection.
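The frequent-and-predictive scoring above can be sketched directly; the function name and the flat-term-list representation of cluster and collection are our own, illustrative choices.

```python
from collections import Counter

def frequent_and_predictive(cluster_terms, corpus_terms, k=5):
    """Rank terms by p(term|cluster) * (p(term|cluster) / p(term)).

    cluster_terms and corpus_terms are flat lists of (stemmed) terms;
    the k highest-scoring terms are returned as the cluster concept.
    """
    c = Counter(cluster_terms)
    g = Counter(corpus_terms)
    nc, ng = len(cluster_terms), len(corpus_terms)
    score = {t: (f / nc) * ((f / nc) / (g[t] / ng)) for t, f in c.items()}
    ranked = sorted(score.items(), key=lambda kv: kv[1], reverse=True)
    return [t for t, _ in ranked[:k]]
```

Note how the predictiveness factor p(term|cluster)/p(term) demotes terms (e.g. stop-word-like terms) that are frequent everywhere, not just in the cluster.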
The k highest-ranking terms are thus extracted from the cluster to represent the concept.

4. Analysis of Computational Complexity

The overall computational complexity of our new algorithm depends on the time complexity of building the all-k-nearest-neighbor matrix and on the amount of time required to perform the bottom-up and top-down phases of the clustering algorithm. The time complexity of finding the Aknn has been discussed in the previous section, and is O(n^α) (α < 2). The amount of time required by the merging phase depends on the rate at which the size of successively coarser graphs decreases. If the size of successively coarser graphs decreases by a constant factor, then the complexity of the algorithm is linear in the number of vertices and the number of edges in the graph [5,6]. In our new edge matching approach, since Max and the decay factor γ control the speed of coarsening the graph, an appropriate value of γ may guarantee that a number of edges are matched and collapsed during each stage. In this case, the bottom-up cluster merging phase has O(n) time complexity, because in an Aknn sparse graph the number of edges is linear in the number of vertices. In the worst case, when the size of successively coarser graphs decreases by only a few vertices at a time, the complexity of the merging algorithm will be quadratic in the number of vertices in the graph. The complexity of the refinement phase is the same as that of the merging phase, since both of them are multilevel algorithms. (The KL algorithm as improved by Fiduccia and Mattheyses [23] reduces the complexity to O(|E|) by using appropriate data structures.) Therefore the average complexity of the overall procedure is determined by constructing the Aknn graph, which takes O(n^α) (α < 2) time.

5. Conclusions and Future Work

In this paper we presented a comprehensive process for clustering web pages and extracting cluster concepts. More importantly, we proposed a new BHC algorithm based in part on multilevel graph partitioning.
We defined a new edge matching method that prefers merging the sub-clusters whose edge connectivity is high in the bottom-up cluster-merging phase. We also used an objective function for the top-down refinement procedure that decreases the inter-connectivity between different clusters. Thus the new algorithm tries to maximize the intra-cluster similarity in the bottom-up cluster-merging phase and to minimize the inter-cluster similarity in the top-down refinement phase. The advantages of our algorithm are that it eliminates the errors occurring in greedy clustering algorithms, and that its multilevel refinement procedure is very effective in climbing out of local minima. The average time complexity of our new algorithm is O(n^α) (α < 2), which is also faster than the common HAC algorithms (O(n^2)).
We believe that the new algorithm will have good performance in the near future, since the Aknn algorithm, multilevel graph partitioning and the KL algorithm are well studied and implemented. However, as we can see, the choice of proper objective functions is essential for the overall success of our algorithm. Another open problem is the method we used to extract the concepts of clusters. Using important terms to represent the concepts of clusters is the simplest way, but more sophisticated methods remain to be developed. Our future work includes investigating more sophisticated methods for clustering based on the contextual meaning of web pages and incorporating them, together with our proposed classification system [25,26], into our web-page Classification and Search Engine.

References

[1] B. S. Everitt, S. Landau, and M. Leese, Cluster Analysis, Arnold, London, Great Britain, 2001.
[2] A. K. Jain, M. N. Murty, and P. J. Flynn, Data Clustering: A Review, ACM Computing Surveys, Vol. 31, No. 3, September 1999, pp. 264-323.
[3] O. Zamir and O. Etzioni, Web Document Clustering: A Feasibility Demonstration, in Proc. 21st Annu. Int. ACM SIGIR Conf., 1998, pp. 46-54.
[4] A. Strehl, Relationship-based Clustering and Cluster Ensembles for High-dimensional Data Mining, Dissertation, The University of Texas at Austin, May 2002.
[5] G. Karypis and V. Kumar, Multilevel k-way Partitioning Scheme for Irregular Graphs, Journal of Parallel and Distributed Computing, 48(1), 1998, pp. 96-129.
[6] G. Karypis and V. Kumar, A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs, SIAM Journal of Scientific Computing, 20(1), 1999, pp. 359-392.
[7] E. Chavez, K. Figueroa, and G. Navarro, A Fast Algorithm for the All K Nearest Neighbors Problem in General Metric Spaces, http://garota.fismat.umich.mx/~elchavez/publica/.
[8] M. Steinbach, G. Karypis, and V. Kumar, A Comparison of Document Clustering Techniques, KDD 2000, Technical Report, University of Minnesota.
[9] R. Ng and J. Han, "Efficient and Effective Clustering Methods for Spatial Data Mining", VLDB-94.
[10] G. Karypis, E.-H. Han, V.
Kumar, CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling, IEEE Computer, 32(8), August 1999, pp. 68-75.
[11] T. Zhang, R. Ramakrishnan and M. Livny, BIRCH: An Efficient Data Clustering Method for Very Large Databases, Proceedings of the ACM SIGMOD Conference on Management of Data, Montreal, Canada, 1996, pp. 103-114.
[12] S. Guha, R. Rastogi, and K. Shim, CURE: A Clustering Algorithm for Large Databases, Proceedings of the ACM SIGMOD Conference on Management of Data, 1998, pp. 73-84.
[13] J. Sander, M. Ester, H. P. Kriegel and X. Xu, Density-based Clustering in Spatial Databases: The Algorithm GDBSCAN and its Applications, Data Mining and Knowledge Discovery, An International Journal, 2(2), Kluwer Academic Publishers, Norwell, MA, June 1998, pp. 169-194.
[14] A. Hinneburg and D. A. Keim, An Optimal Grid-clustering: Towards Breaking the Curse of Dimensionality in High-dimensional Clustering, VLDB-99, 1999.
[15] B. Liu, Y. Xia, P. S. Yu, Clustering Through Decision Tree Construction, SIGMOD 2000.
[16] M. Goldszmidt and M. Sahami, A Probabilistic Approach to Full-Text Document Clustering, Technical Report ITAD-433-MS-98-044, SRI International, http://citeseer.nj.nec.com/goldszmidt98probabilistic.html.
[17] G. Karypis, E.-H. Han, and V. Kumar, Multilevel Refinement for Hierarchical Clustering, http://garota.fismat.umich.mx/~elchavez/publica/.
[18] K. Rajaraman and H. Pan, Document Clustering using 3-tuples, PRICAI'2000 International Workshop on Text and Web Mining, Melbourne, Australia, Sep. 2000, pp. 88-95.
[19] I. S. Dhillon, J. Fan and Y. Guan, Efficient Clustering of Very Large Document Collections, Data Mining for Scientific and Engineering Applications, Kluwer Academic Publishers, 2001.
[20] C. J. van Rijsbergen, Information Retrieval, Butterworths, 1979.
[21] Y. Yang and J. O. Pedersen, A Comparative Study on Feature Selection in Text Categorization, www.cs.cmu.edu/~yiming/papers.yy/ml97.ps.
[22] A. Strehl, J. Ghosh, and R.
Mooney, Impact of Similarity Measures on Web-page Clustering, Proceedings of the AAAI-2002 Workshop on Artificial Intelligence for Web Search, AAAI/MIT Press, Austin, Texas, July 2002, pp. 58-64.
[23] C. M. Fiduccia and R. M. Mattheyses, A Linear Time Heuristic for Improving Network Partitions, Proceedings of the 19th IEEE Design Automation Conference, 1982, pp. 175-181.
[24] A. Popescul and L. H. Ungar, Automatic Labeling of Document Clusters, http://www.cs.upenn.edu/~popescul/publications.html.
[25] X. Peng & B. Choi, Automatic Web Page Classification in a Dynamic and Hierarchical Way, IEEE International Conference on Data Mining, 2002, pp. 386-393.
[26] B. Choi, Making Sense of Search Results by Automatic Web-page Classifications, WebNet 2001, 2001, pp. 184-186.