/02/$ IEEE


A Modified Fuzzy ART for Soft Document Clustering

Ravikumar Kondadadi and Robert Kozma
Division of Computer Science, Department of Mathematical Sciences
University of Memphis, Memphis, TN

ABSTRACT

Document clustering has become a very useful application, especially with the advent of the World Wide Web. Most of the existing document clustering algorithms either produce clusters of poor quality or are computationally very expensive. In this paper we propose a document-clustering algorithm, KMART, that uses an unsupervised Fuzzy Adaptive Resonance Theory (Fuzzy ART) neural network. A modified version of Fuzzy ART is used to enable a document to be in multiple clusters. The number of clusters is determined dynamically. Experiments are reported that compare the efficiency and execution time of our algorithm with other document-clustering algorithms such as Fuzzy C-Means. The results show that KMART is both effective and efficient.

1. INTRODUCTION

Clustering is an important tool in data mining and knowledge discovery. The ability to automatically group similar items together enables one to discover hidden similarity and key concepts. Clustering also enables one to summarize a large amount of data into a small number of groups, which is an invaluable aid for users trying to comprehend a large amount of data. World Wide Web search engines are a good example. Clustering is used in many different fields, such as data mining [5], image compression [15] and information retrieval [16]. Reference [10] provides an extensive survey of various clustering techniques.

The World Wide Web is a large repository of many kinds of information. Its sheer size makes it hard for any user to find information relevant to him or her. Nowadays many search engines exist to allow users to query the Web, usually via keyword search. However, since each keyword is associated with many different subjects, and the typical amount of information (web documents) returned is very large, the user is not able to get a good grasp of the output.
Usually the search results are listed by some sort of relevance measure. However, even documents on vastly different subjects can share the same high relevance scores. Thus, one needs a way to cluster the results from the web search engine to help users. Some search engines have pre-defined subjects that are used to categorize their output (for instance, yahoo.com). However, few search engines (such as Teoma.com and wisenut.com) provide a dynamic clustering mechanism, i.e. one in which clustering algorithms are applied only to the documents returned by the query. We believe that this is an important service for any search engine over the Web and is highly beneficial to users.

While there are many traditional clustering algorithms available, document clustering brings along many distinctive issues. One such issue is representation. A document is typically represented as a vector (document vector), where each dimension corresponds to a term (word), and the value denotes whether the term is present or not. In addition, similarity between documents is typically measured by some non-Euclidean measure between the vectors. This means that a document vector cannot be manipulated like a normal vector. For instance, we cannot average document vectors. This implies that algorithms that require a cluster center, like K-means [9,19], need to be modified significantly.

There are multiple ways of looking at the clustering problem. According to [11], there are four different kinds of clustering algorithms: agglomerative hierarchical algorithms, partition algorithms, model fitting algorithms and density-based algorithms. Agglomerative hierarchical clustering algorithms [7] use a bottom-up methodology to merge smaller clusters into larger ones, using techniques such as the minimal spanning tree. Partition algorithms such as K-means try to divide data into subgroups such that the partition optimizes certain criteria, like inter-cluster or intra-cluster distances. They typically take an iterative approach. Model fitting algorithms attempt to fit the data as a mixture of easily parameterized distributions (e.g.
multivariate normal) and estimate their parameters. Density-based algorithms, such as DBSCAN [8], view clustering as locating high-density regions.

The goal of document clustering is to categorize the documents so that all the documents in a cluster are similar. Most of the early work [9,19] applied traditional clustering algorithms like K-means to the sets of documents to be clustered. Willett [24] provided a survey on applying hierarchical clustering algorithms to document clustering. Cutting et al. [6] proposed speeding up partition-based clustering by using techniques that provide good initial clusters. Two techniques, Buckshot and Fractionation, are mentioned. Buckshot selects a small sample of documents, pre-clusters them using a standard clustering algorithm, and assigns the rest of the documents to the clusters formed. Fractionation splits the N documents into m buckets, where each bucket contains N/m documents. Fractionation takes an input parameter ρ, which indicates the reduction factor for each bucket. The standard clustering algorithm is applied so that if there are n documents in each bucket, they are clustered into n/ρ clusters. Each of these clusters is then treated as if it were an individual document, and the whole process is repeated until there are only K clusters.

Most of the algorithms above use a word-based approach to find the similarity between two documents. In [26] a phrase-based approach called STC (suffix-tree clustering) was proposed. STC is a linear-time clustering algorithm. Its phrase-based approach allows STC to form clusters depending not only on individual words but also on the ordering of the words.
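To make the Fractionation arithmetic described above concrete, the following is a minimal sketch that only tracks how many items remain after each round. It does not perform any clustering. The function name `fractionation_schedule` is hypothetical, and the sketch assumes that rounding up and stopping at exactly K are acceptable simplifications; note that the number of buckets m does not affect the totals, since m buckets of n items each shrink to m · n/ρ = N/ρ items.

```python
import math


def fractionation_schedule(n_docs, rho, k):
    """Return the item counts after each Fractionation round.

    Assumption (not from the paper): counts are rounded up and the
    process is cut off at exactly k clusters.
    """
    if rho <= 1:
        raise ValueError("rho must be greater than 1 for the count to shrink")
    counts = [n_docs]
    while counts[-1] > k:
        # n items are clustered into n / rho clusters, never fewer than k.
        counts.append(max(k, math.ceil(counts[-1] / rho)))
    return counts
```

For example, calling `fractionation_schedule(2000, 4, 10)` walks the count down from 2000 documents in four rounds, each round treating the clusters of the previous round as single documents.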

In [18], a new method was proposed for clustering related documents using association rules and hyper-graph partitioning. This method first finds sets of terms that frequently occur together in documents, using the Apriori algorithm [1]. These frequent item sets are then used to group items into hyper-graph edges, and a hyper-graph partitioning algorithm is used to find the item clusters. The similarity among items is captured implicitly by the frequent item sets. The main advantage of this method is that it does not require any distance measure to find the similarity between documents.

The clustering techniques above can be categorized as hard clustering, as every item is assigned to a single cluster. Soft clustering allows each item to be associated with multiple clusters, by introducing a membership function $W_{ij}$ between each cluster-item pair to measure the degree of association. In this paper, we propose a soft document-clustering algorithm using a modified Fuzzy Adaptive Resonance Theory network [4]. A brief description of soft clustering and of some soft document clustering algorithms is given in the next section. In the rest of this paper, we briefly discuss ART networks and then present our proposed algorithm, together with our experimental results. We show that our clustering technique overcomes the problems of standard hard clustering algorithms mentioned above, without paying any price in efficiency.

2. SOFT DOCUMENT CLUSTERING

A single document very often contains multiple themes. For example, this paper can be classified into the field of fuzzy clustering as well as that of neural networks. Many of the clustering algorithms mentioned above assign each document to a single cluster, making it hard for a user to discover such information. To remedy this situation, we can employ soft clustering. That is, each document can belong to multiple clusters, and there is a measure of the association between each cluster and each document. This has the following advantages:

- A document can belong to multiple clusters, so we can discover the multiple themes of a document.
- Clusters can contain combinations of themes. For instance, in our experiments, when the document set contained documents related to baseball, movies and baseball-movies respectively, KMART formed three clusters, one each for baseball, movies and baseball-movies, whereas hard clustering algorithms like K-means failed to produce a cluster for baseball-movies.
- The measure of association between clusters and documents can be used as a relevance measure to order the documents appropriately.

Many soft clustering algorithms employ the idea of fuzziness in their methods. One of the most common fuzzy clustering algorithms is Fuzzy C-Means (FCM). It was first reported by Dunn in 1972 and subsequently generalized by Bezdek [3]. FCM is based on the partition clustering approach, iterating over the data set until the values of the membership function stabilize. FCM has been used in many applications such as medical diagnosis, image analysis, irrigation design and automatic target recognition. Other fuzzy techniques, such as Self-Organizing Maps [14], also abound. Baraldi and Blonda [2] provide a good survey of such algorithms.

However, one drawback of fuzzy algorithms is that they are slow compared to non-fuzzy algorithms. Fuzzy clustering algorithms tend to be iterative, and typical fuzzy clustering algorithms require repeatedly calculating the associations between every cluster/document pair.

SISC and WBSC [12,13] are two soft document-clustering algorithms developed by one of the authors of this paper. SISC uses a modified Fuzzy C-Means algorithm to cluster documents. It uses a randomization approach that enables it to avoid many of the computations needed in a traditional fuzzy clustering algorithm. At each iteration, it computes the similarity measure between a cluster and a document with a probability proportional to the proximity of the similarity measure to the threshold. It also has a robust outlier-handling mechanism. WBSC [13] uses a word-based approach. It starts with each term as a cluster and clusters the terms depending on the documents they appear in.
It is a hierarchical clustering algorithm.

There has also been work on applying self-organizing maps to cluster documents. For instance, [20] discusses an approach, called the Adaptive approach, that uses self-organizing maps to cluster documents, takes feedback from the user, and re-clusters the documents. Approaches based on neural networks include one based on an adaptive bilinear retrieval model [25], and a hierarchical model based on fuzzy adaptive resonance theory [17].

In this paper, we propose a modification to the traditional Fuzzy ART algorithm, which is a hard clustering algorithm, to make it a soft clustering algorithm. The modification also removes an iterative search process from Fuzzy ART, making it much faster than some of the existing document-clustering algorithms. We briefly discuss ART networks in the next section.

3. ART NETWORKS

ART (Adaptive Resonance Theory) neural networks were developed by Grossberg [4] to address the stability-plasticity dilemma. A network is plastic if it can adapt to new inputs indefinitely. A network is stable if it can withstand noise. A traditional neural network uses the training data to adapt to the input, but does not do so for test data, so it is not plastic. Also, if the training data contains some erroneous information, the network adapts to that erroneous data, so it is not stable. The stability-plasticity dilemma can be posed as follows: how can a learning system be designed to remain plastic, or adaptive, and at the same time remain stable in the face of irrelevant events? The ART networks proposed by Grossberg solve this problem. ART is an incremental algorithm, so it adapts to new

inputs indefinitely. At the same time, it does not let new inputs change any stored pattern unless the input pattern matches the stored pattern within a certain tolerance. This means that an ART network has both plasticity and stability: new categories can be formed when the environment does not match any of the stored patterns, but the environment cannot change stored patterns unless they are sufficiently similar. The general structure of an ART network is shown in Figure 1.

[Figure 1: Architecture of an ART network]

A typical ART network consists of two layers: an input layer (F1) and an output layer (F2). There are no hidden layers. The input layer contains N nodes, where N is the number of components in each input pattern. The number of nodes in the output layer is decided dynamically. Every node in the output layer has a corresponding prototype vector. The network's dynamics are governed by two subsystems: an attention subsystem and an orienting subsystem. The attention subsystem proposes a winning neuron (or category) and the orienting subsystem decides whether to accept it or not. The network is said to be in a resonant state when the orienting subsystem accepts a winning category (i.e. when the winning prototype vector matches the current input pattern closely enough).

There are many versions of ART algorithms: ART1, ART2, ARTMAP, Fuzzy ART, Fuzzy ARTMAP, etc. ART1 is the basic ART network and is used for binary data. Fuzzy ART is an extension of ART1 for analog data. It uses the fuzzy AND operator instead of the crisp one. The basic Fuzzy ART algorithm is described below.

Fuzzy ART takes three input parameters: a choice parameter ($\beta > 0$), a vigilance parameter ($0 \le \rho \le 1$) and a learning rate ($0 \le \lambda \le 1$).

Step 1: Initialization. Initialize all the parameters.

Step 2: Apply input pattern. Let $I$ be the next input vector, and let $\mathcal{P}$ be the set of candidate prototype vectors.

Step 3: Category choice. Find the closest prototype vector $P \in \mathcal{P}$, i.e. the one that maximizes

$$\frac{|I \wedge P|}{\beta + |P|} \qquad (1)$$

where $\wedge$ is the fuzzy AND (component-wise minimum) and $|\cdot|$ is the sum of a vector's components. $\beta$ acts as a tie breaker when multiple prototype vectors are subsets of the input pattern, and favors larger-magnitude prototypes.
Step 4: Vigilance test. The prototype selected in the previous step undergoes a vigilance test, which compares the similarity between the winning prototype and the current input pattern against the user-defined vigilance parameter:

$$\frac{|I \wedge P|}{|I|} \ge \rho \qquad (2)$$

If the prototype passes the vigilance test, it is adapted to the given input pattern (Step 5). Otherwise, the current prototype is deactivated for the current input pattern, and the other prototypes in the F2 layer undergo the vigilance test in turn until one of them passes. If none of them passes the test, a new prototype is created for the current input pattern, and the algorithm continues with the next input (Step 2).

Step 5: Matched prototype update. The matched prototype is updated to move closer to the current input pattern according to

$$P = \lambda (I \wedge P) + (1 - \lambda) P \qquad (3)$$

where the $P$ on the right-hand side is the prototype before the update and $\lambda$ is the learning rate. If $\lambda$ is 1, this is called fast learning. After the update, all the prototypes are reactivated and the algorithm continues with the next input (Step 2).

The Fuzzy ART algorithm described above is a hard clustering algorithm. We modified Fuzzy ART to make it a soft clustering algorithm. The resulting algorithm is called KMART (Kondadadi & Kozma Modified ART). In the next section we present KMART.

4. KMART

Although Fuzzy ART has "fuzzy" in its name, the term refers only to its ability to work with fuzzy (analog) data; it still categorizes a given set of data items into disjoint partitions (i.e. it is a hard clustering algorithm). So it cannot be used for document clustering effectively. KMART can be broadly divided into three stages: pre-processing, cluster building and keyword selection.

4.1 Pre-processing: In this stage, stop words are removed from all the documents. The algorithm maintains a common list of stop words such as articles, prepositions and auxiliary verbs. Then all the words in all the documents are combined and duplicate terms are removed to form a list of the unique words across the whole collection. Document vectors are formed for each document.
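A minimal sketch of this pre-processing stage follows, assuming a simple lowercase, letters-only tokenizer and term-frequency vector entries. The stop-word list shown is a tiny hypothetical stand-in (the list actually used is not given in the paper), and the names `tokenize` and `build_document_vectors` are illustrative only.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; the paper's list is not given.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "is", "are", "was", "and", "to"}


def tokenize(text):
    """Lowercase the text, keep runs of letters, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_document_vectors(documents):
    """Return (vocabulary, vectors).

    vocabulary is the sorted list of unique words over all documents;
    vectors[d][t] is how often vocabulary[t] occurs in documents[d].
    """
    token_lists = [tokenize(doc) for doc in documents]
    vocabulary = sorted({w for tokens in token_lists for w in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)  # missing words count as zero
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors
```

Every vector has the same length as the shared vocabulary, so documents can be compared component by component in the later stages.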
The length of each vector is the total number of unique words in all the documents, and the value of each component is the frequency of the corresponding word if the word appears in the document, and zero otherwise.

4.2 Cluster Building: A modified version of Fuzzy ART is used for cluster building. We propose a change to the existing Fuzzy ART algorithm to make it a soft clustering algorithm. Instead of choosing the maximum-similarity category and applying the vigilance test to check whether it is close enough to the input pattern, we check every category in the F2 layer: the vigilance test is applied to each one, and if a category passes, the input document is put into that category. The similarity measure computed in the vigilance test defines the degree of membership of the given input pattern in that cluster. This enables a document to be in multiple clusters with varying degrees of membership. All the prototypes that pass the vigilance test are updated according to (3).

This modification has other advantages apart from allowing soft clustering. Fuzzy ART is generally time consuming because it involves an iterative search for a winning category that satisfies the vigilance test. In our modification, there is no search because every F2 node is checked, which makes it computationally less expensive. Another advantage is that by eliminating the category choice step, we avoid the use of the choice parameter, thereby reducing the number of user-defined parameters in the system.

The modification also does not violate the underlying principle of ART networks, i.e. resolving the stability-plasticity dilemma. KMART is still an incremental clustering algorithm and is therefore plastic. It is also stable: before learning a new input it checks the input, and a stored pattern learns the input pattern only if the two match within a certain tolerance.

4.3 Keyword selection: The final step in KMART is to display representative keywords for each cluster formed in the previous stage. This allows users to distinguish among different clusters.
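Returning briefly to the cluster-building stage of Section 4.2, the loop it describes can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: input components are assumed to be scaled to [0, 1] (no normalization step is described in the text), a new category is created when no prototype passes the vigilance test (carried over from Step 4 of Fuzzy ART), and a document is given membership 1.0 in a category it creates. The names `fuzzy_and`, `l1_norm` and `kmart_cluster` are hypothetical.

```python
def fuzzy_and(a, b):
    """Fuzzy AND: component-wise minimum of two vectors."""
    return [min(x, y) for x, y in zip(a, b)]


def l1_norm(v):
    """Sum of absolute component values."""
    return sum(abs(x) for x in v)


def kmart_cluster(inputs, rho, lam):
    """Soft-cluster the inputs; return (prototypes, memberships).

    memberships[i] maps a cluster index to the vigilance score of input i.
    """
    prototypes = []
    memberships = []
    for pattern in inputs:
        norm_i = l1_norm(pattern)
        member_of = {}
        # No category choice step: every existing F2 node is checked.
        for j, proto in enumerate(prototypes):
            match = fuzzy_and(pattern, proto)
            # Vigilance test (2): |I ^ P| / |I| >= rho
            score = l1_norm(match) / norm_i if norm_i > 0 else 0.0
            if score >= rho:
                member_of[j] = score
                # Prototype update (3): P = lam (I ^ P) + (1 - lam) P
                prototypes[j] = [
                    lam * m + (1 - lam) * p for m, p in zip(match, proto)
                ]
        if not member_of:
            # Assumption: open a new category, as in Fuzzy ART Step 4.
            prototypes.append(list(pattern))
            member_of[len(prototypes) - 1] = 1.0
        memberships.append(member_of)
    return prototypes, memberships
```

With two terms (say "baseball" and "movie"), inputs `[1, 0]`, `[0, 1]` and `[1, 1]`, `rho = 0.5` and `lam = 1.0`, the first two inputs each open a category and the third joins both with membership 0.5, which is the soft assignment that a single-winner scheme cannot express.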
For each cluster, we rank the words in that cluster according to the number of documents in the cluster in which the word appears and the similarity (as defined by the vigilance test) of the documents in which the word appears. We generally display the first 7-10 words as keywords.

5. EXPERIMENTS

In this section, we describe the experiments conducted and analyze the results. We compared our algorithm with both soft clustering algorithms like SISC [12] and hard clustering algorithms like K-means [19] and Fractionation [6].

5.1 Data & Experimental Environment: We manually downloaded 2000 documents from the World Wide Web belonging to different categories such as food, agents, virus, cricket, football and genetic algorithms. We also downloaded another 2000 documents from the UCI KDD archive [22], which has documents from different newsgroups. All the experiments were carried out on a 733 MHz PC with 256 MB of RAM. We ran the algorithms to get the clusters and compared the quality of the clusters formed. We also compared the execution times of all the algorithms for document sets of different sizes. To be more accurate, we ran all the algorithms on several different document sets. Since all the clustering algorithms except ours take the number of clusters as input, we made all of them produce the same number of clusters. All the results shown are averages taken over 20 different runs.

5.2 Quality of the Clusters: We compared the clusters formed by each algorithm against the original categories of the documents and matched the clusters with the categories one-to-one. The number of matches can be used to measure the quality of the clusters formed. The matching was computed using a bipartite matching algorithm [21]. Figure 2 compares the quality of the clusters formed by KMART with that of Fuzzy ART, SISC, K-means and Fractionation.
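One way to read this quality measure is sketched below: pair each cluster with a distinct category so that the total number of shared documents is as large as possible, and report that total. This interpretation of "number of matches" is an assumption, and the sketch brute-forces all one-to-one pairings rather than using the bipartite matching algorithm of [21], so it is only suitable for a handful of clusters. The name `count_matches` is hypothetical.

```python
from itertools import permutations


def count_matches(clusters, categories):
    """Return the best one-to-one cluster/category match count.

    clusters and categories are equally long lists of sets of document ids.
    """
    if len(clusters) != len(categories):
        raise ValueError("this sketch assumes equally many clusters and categories")
    # overlap[i][j] = number of documents shared by cluster i and category j
    overlap = [[len(c & g) for g in categories] for c in clusters]
    best = 0
    # Brute force over every one-to-one assignment of clusters to categories.
    for assignment in permutations(range(len(categories))):
        total = sum(overlap[i][j] for i, j in enumerate(assignment))
        best = max(best, total)
    return best
```

For soft clusters, the same document id may appear in several of the `clusters` sets; the sketch counts it once per matched pair in which it occurs.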
[Figure 2: Comparison of the quality of the clusters. Number of matches versus number of documents for KMART, Fuzzy ART, SISC, K-Means and Fractionation.]

As the figure shows, KMART formed clusters of better quality than SISC, K-means and Fractionation, and of quality almost comparable to the traditional Fuzzy ART.

5.3 Execution time: We also compared the execution time of our approach with that of Fuzzy ART, SISC, K-Means and Fractionation. Figure 3 shows the comparison. The execution time of KMART is linear in the number of documents. It can be clearly seen from the figure that our algorithm runs much faster than all the hard clustering algorithms, and its execution time is almost comparable to that of SISC. KMART also runs much faster

than Fuzzy ART. This is because KMART avoids the expensive, time-consuming search in the category choice step by eliminating that step from the Fuzzy ART algorithm.

[Figure 3: Comparison of execution times. Execution time (in minutes) versus number of documents for SISC, KMART, Fuzzy ART, Fractionation and K-Means.]

This shows that KMART is effective and efficient in terms of both the quality of the clusters and the execution time.

6. CONCLUSIONS AND FUTURE WORK

We proposed a modification to the traditional Fuzzy ART that adapts it to the document-clustering domain, makes it a soft clustering algorithm and also reduces the execution time. The experimental results show that our approach forms clusters of better quality, and forms them faster, than the other algorithms. The main advantage of KMART over most other fuzzy clustering algorithms is that the number of clusters is decided dynamically. Currently it is practical to work with around 1500 documents from a web search perspective. Our future work involves making the algorithm more efficient and reducing the response time by adopting better data structures. We are also considering ways of automatically tuning the values of the vigilance and learning rate parameters depending on the input document set, deriving a parameter-free Fuzzy ART network.

References:

[1] Rakesh Agrawal and Ramakrishnan Srikant, Fast Algorithms for Mining Association Rules in Large Databases, In Proceedings of the 1994 International Conference on Very Large Databases.
[2] A. Baraldi, P. Blonda, A survey of fuzzy clustering algorithms for pattern recognition, Technical Report TR, International Computer Science Institute, Berkeley, CA, Oct.
[3] J.C. Bezdek, Pattern Recognition With Fuzzy Objective Function Algorithms, Plenum Press, New York, NY.
[4] G.A. Carpenter, S. Grossberg, D. Rosen, Fuzzy ART: Fast Stable Learning of Analog Patterns by an Adaptive Resonance System, Neural Networks, 4.
[5] M.S. Chen, J. Han, and P.S. Yu, Data Mining: An Overview from a Database Perspective, IEEE Transactions on Knowledge and Data Engineering, 8(6).
[6] Douglass R. Cutting, David R. Karger, Jan O. Pedersen, John W. Tukey, Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, In Proceedings of the Fifteenth Annual International ACM SIGIR Conference, June.
[7] F. Murtagh, A survey of recent advances in hierarchical clustering algorithms, The Computer Journal, 26(4).
[8] Martin Ester, Hans-Peter Kriegel, Jorg Sander, and Xiaowei Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), AAAI Press.
[9] D. R. Hill, A vector clustering technique, in: Samuelson (Ed.), Mechanized Information Storage, Retrieval and Dissemination, North-Holland, Amsterdam.
[10] A.K. Jain, M.N. Murty and P.J. Flynn, Data Clustering: A Review, ACM Computing Surveys, 31(3), Sept.
[11] W.J. Krzanowski and F.H. Marriott, Multivariate Analysis: Classification, Covariance Structures and Repeated Measurements, Arnold, London.
[12] King-Ip Lin, Ravikumar Kondadadi, A Similarity based Soft clustering algorithm for documents, In Proceedings of the 7th International Conference on Database Systems for Advanced Applications (DASFAA-2001), pp. 40-47, April.
[13] King-Ip Lin, Ravikumar Kondadadi, A Word based soft clustering algorithm for documents, In Proceedings of the 16th International Conference on Computers and Their Applications (CATA-2001), March.
[14] T. Kohonen, The self-organizing map, Proceedings of the IEEE, 78(9).
[15] Y. Linde, A. Buzo and R.M. Gray, An Algorithm for Vector Quantization Design, IEEE Transactions on Communications, 28(1).
[16] M.N. Murty and A.K. Jain, Knowledge-based clustering scheme for collection management and retrieval of library books, Pattern Recognition, 28.
[17] Alberto Munoz, Compound key word generation from document databases using a Hierarchical clustering ART Model, Intelligent Data Analysis, 1(1), Jan.
[18] Jerome Moore, Eui-Hong (Sam) Han, Daniel Boley, Maria Gini, Robert Gross, Kyle Hastings, George Karypis, Vipin Kumar, and Bamshad Mobasher, Web Page Categorization and Feature Selection Using Association Rule and Principal Component Clustering, In Proceedings of the Seventh Workshop on Information Technologies and Systems (WITS'97), December.
[19] J. J. Rocchio, Document retrieval systems optimization and evaluation, Ph.D. Thesis, Harvard University.
[20] Dmitri Roussinov, Kristine Tolle, Marshall Ramsey and Hsinchun Chen, Interactive Internet search through Automatic clustering: an empirical study, In Proceedings of the International ACM SIGIR Conference.
[21] Robert E. Tarjan, Data Structures and Network Algorithms, Society for Industrial and Applied Mathematics.
[22] UCI.
[23] P. Willett, V. Winterman and D. Bawden, Implementation of Nearest Neighbour Searching in an Online Chemical Structure Search System, Journal of Chemical Information and Computer Sciences, 26, 36-41, 1986.
[24] P. Willett, Recent trends in hierarchical document clustering: a critical review, Information Processing and Management, 24.
[25] S.K.M. Wong, Y.J. Cai, and Y.Y. Yao, Computation of Term Association by Neural Network, In Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
[26] O. Zamir, O. Etzioni, Web document clustering: a feasibility demonstration, In Proceedings of the 19th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 98), 1998.
