Keyword-based Document Clustering

Seung-Shik Kang
School of Computer Science, Kookmin University & AITrc
Chungnung-dong, Songbuk-gu, Seoul 136-702, Korea
sskang@kookmin.ac.kr

This work was supported by the Korea Science and Engineering Foundation (KOSEF) through the Advanced Information Technology Research Center (AITrc).

Abstract

Document clustering is the aggregation of related documents into clusters, based on evaluating the similarity between documents and the representatives of clusters. Terms and their discriminating features are the clues for clustering, and those features are derived from term and document frequencies. Feature selection based on frequency statistics limits how far a clustering algorithm can be improved, because it does not consider the contents of the objects being clustered. In this paper, we adopt a content-based analytic approach to refine the similarity computation and propose a keyword-based clustering algorithm. Experimental results show that content-based keyword weighting outperforms frequency-based weighting.

Keywords: Document Clustering, Weighting Scheme, Feature Selection

1 Introduction

Document clustering is the aggregation of documents by discriminating the relevant documents from the irrelevant ones. The criteria for deciding the relevance of any two documents are a similarity measure and the representatives of the documents [2, 3, 4]. Common similarity measures include the Dice coefficient, Jaccard's coefficient, and the cosine measure. These measures require that documents be represented as document vectors, and the similarity of two documents is calculated by an operation on those vectors. In general, the representative of a document or a cluster is a document vector that consists of <term, weight> pairs, and document similarities are determined by the terms extracted from the document and their weighting values [7, 9]. In previous studies on document clustering, we focused on the clustering algorithm, but the document representation methodology was not treated as an important issue.
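As an illustration of the three similarity measures named in this section, the following minimal Python sketch computes them for documents represented as dictionaries of <term, weight> pairs. The weighted forms of the Dice and Jaccard coefficients are the usual vector-space generalisations and are an assumption here, since the exact formulas are not given in the text; the function names and the toy documents are illustrative only.

```python
import math


def _dot(a, b):
    """Sum of weight products over the terms the two vectors share."""
    return sum(a[t] * b[t] for t in a.keys() & b.keys())


def _sq(a):
    """Sum of squared weights of one vector."""
    return sum(w * w for w in a.values())


def dice(a, b):
    """Weighted Dice coefficient (assumed vector-space form)."""
    denom = _sq(a) + _sq(b)
    return 2.0 * _dot(a, b) / denom if denom else 0.0


def jaccard(a, b):
    """Weighted Jaccard coefficient (assumed vector-space form)."""
    shared = _dot(a, b)
    denom = _sq(a) + _sq(b) - shared
    return shared / denom if denom else 0.0


def cosine(a, b):
    """Cosine of the angle between the two weight vectors."""
    denom = math.sqrt(_sq(a)) * math.sqrt(_sq(b))
    return _dot(a, b) / denom if denom else 0.0


# Two toy documents as <term, weight> pairs (illustrative values only).
d1 = {"cluster": 1.0, "keyword": 1.0}
d2 = {"cluster": 1.0, "weight": 1.0}
scores = (dice(d1, d2), jaccard(d1, d2), cosine(d1, d2))
```

All three functions operate only on the terms the two documents share, so their cost depends on the vector sizes rather than on the vocabulary size.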
Document vectors are simply constructed from the term frequency (TF) and the inverted document frequency (IDF). This term weighting method starts from the precondition that the terms or keywords representing a document are computed by TF-IDF. TF-IDF weighting is generally used to construct a document vector, but we cannot say that it is the best way of representing a document. We therefore suppose that there is a limit to how much the accuracy of a clustering system can be improved by improving only the clustering algorithm, without changing the document/cluster representation method.

Document clustering also requires a large amount of memory to keep the representatives of documents/clusters and the similarity measures [6, 8, 10]. Given $N$ documents to be clustered, an $N \times N$ similarity matrix is needed to store the document similarity measures. In addition, the repeated iteration of similarity calculation and reconstruction of the cluster representatives requires a huge number of computations.

In this paper, we propose a new clustering method based on a keyword weighting approach. The clustering algorithm starts from seed documents, and a cluster is expanded through keyword relationships. The evolution of the cluster stops when no more documents are added to the cluster and the irrelevant documents have been removed from the cluster candidates.

2 Keyword-based Weighting Scheme

In general, the construction of a document vector depends on term frequency and document frequency. If keywords are determined by the frequency information of the document, we are apt to make the error of extracting high-frequency words, because some nouns are used often regardless of the substance of the document. A clustering method that focuses on similarity calculation considers all words except stopwords as the representatives of the document, and constitutes a document vector whose weights are calculated from term frequency and document frequency.
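A frequency-based document vector of the kind described here can be sketched as follows. The IDF form log(N/df) is one common variant and is an assumption, since the exact formula is not stated; the function name `tf_idf_vectors` and the toy documents are illustrative only.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build one <term, weight> dictionary per document.

    docs is a list of token lists, with stopwords already removed.
    """
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw term frequency times log(N / df).
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


docs = [["korea", "news", "news"], ["korea", "sports"]]
vectors = tf_idf_vectors(docs)
```

A term that occurs in every document, such as "korea" in this toy collection, receives weight zero, which shows that the weight depends on collection statistics alone and not on the role the term plays in the document.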
It is common for terms and their weight values to represent a document, with <term, weight> pairs as the unique elements of the document vector. When we construct a document vector, term frequency and document frequency are the most important features for calculating the weight of a term. As for the terms and their weight values, the weight of a term is a ranking score that expresses its importance to the document. Term weighting can therefore be seen as an evaluation of the term as a keyword or a stopword for the document. The weighting function $w(t)$, from a term to its weight, is described in expression (1).

$$w : \text{term} \rightarrow \text{weight}$$

$$w(t) = \begin{cases} 0 & \text{if } t \text{ is a stopword} \\ 1 & \text{if } t \text{ is a keyword} \\ a & \text{otherwise} \end{cases} \qquad (1)$$

where $a$ lies between 0 and 1.

For the weighting of terms, there are two points of view on the representation of a document: (1) a discriminative value that distinguishes or characterizes the document from others; (2) an importance measure as a keyword or a stopword.

Frequency-based term weighting (FBW) is a statistical measure of terms in an inter-document relationship. It is a very efficient method for distinguishing and characterizing a document from others, and it performs well in document classification and clustering applications in information retrieval systems. The only evaluation measure for characterizing a document in the frequency-based weighting scheme is the frequency statistic, but term frequencies are not the best measure for characterizing a document by its terms.

Another weighting scheme is keyword-based term weighting (KBW), which is based on keyword importance factors in a document. It is an analytic approach that analyses the contents of a document to obtain a keyword list. The weight of a word is a combination of keyword-weighting factors, and the terms are ordered by keyword ranking score. The ranking scores in this scheme are calculated from the results of analysing the document. Keyword-based term weighting is a promising way to overcome the limitation of the frequency-based weighting scheme. Keywords in a text are the terms that represent a document, and the candidate keywords are extracted from the analysis results of the document.
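The weighting function of expression (1) can be sketched in a few lines, assuming that it assigns 0 to a stopword, 1 to a keyword, and an intermediate constant a to every other term. The default a = 0.5, the function name `term_weight`, and the word lists are hypothetical choices for illustration.

```python
def term_weight(term, keywords, stopwords, a=0.5):
    """Return 0 for a stopword, 1 for a keyword, and the constant a otherwise."""
    if term in stopwords:
        return 0.0
    if term in keywords:
        return 1.0
    # Neither keyword nor stopword: assumed intermediate importance.
    return a


stopwords = {"the", "of"}
keywords = {"clustering"}
weights = {t: term_weight(t, keywords, stopwords)
           for t in ["the", "clustering", "method"]}
```

In a keyword-based scheme the binary keyword test would be replaced by a graded ranking score, but the three-way split already shows how the weight expresses importance to the document rather than frequency.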
Keyword ranking depends on several factors of a term, such as the type of document and the location and role of words in a sentence or a paragraph [5]. Thematic words of a document are representative terms for the document. Thematic words are extracted from a text by analysing its contents, but keyword extraction depends on the type of text. In a research paper, which consists of a title, abstract, body, experiment, and conclusion, keywords are easily found in the title or the abstract. Likewise, a newspaper article contains keywords in the title or the first part of the text.

There are several clues for determining a keyword, and we may classify them as word-level, sentence-level, paragraph-level, and text-level features. Word-level features are the part-of-speech type and case-role information. Korean nouns are divided by part of speech into common nouns, compound nouns, proper nouns, and numerals. Syntactic or sentence-level features are the type of phrase or clause, the sentence location, and the sentence type. The importance of a sentence is computed from the rhetoric words in it, and the terms in a sentence are affected by the type of the sentence. Also, the weight of a term in the subjective clause is not equal to that of the same term appearing in an auxiliary clause or a modifying clause. A basic term weight is assigned by the type of the term and recomputed according to the features that accompany it in the text. That is, the weight of a term is also determined by the characteristics of the word, sentence, phrase, and clause from which the term is extracted.

3 Keyword-based Document Clustering

Keyword-based document clustering creates a cluster from the keywords of each document. Suppose that $C$ is the set of clusters finally created by the clustering algorithm. If $n$ is the number of clusters in $C$, then

$$C = \{C_1, C_2, \ldots, C_n\}$$

Each cluster $C_i$ is initialised by a document $d$ that is not assigned to any existing cluster, and $d$ is the seed document of $C_i$.
When a new cluster is created, expansion and reduction steps are repeated until the cluster reaches a stable state from its start state. At each evolution step of cluster $C_i$, $C_i^j$ denotes the $j$-th state of $C_i$.

$$C_i^j : \text{the } j\text{-th state of cluster } C_i$$

The characteristic vector of a cluster is a set of <keyword, weight value> pairs that represents the cluster. If $K_d$ is the keyword set of a document $d$ and $K_{C_i}$ is the keyword set of cluster $C_i$, then $K_{C_i}^j$ is the $j$-th state of the keyword set of $C_i$. Figure 1 shows the keyword-based clustering algorithm for cluster $C_i$. Given the keyword sets of each document, cluster $C_i$ is created by a self-expanding algorithm.

3.1 Cluster Initialisation

The first step of the clustering algorithm is the creation and initialisation of a new cluster. A document that does not belong to any other cluster is selected and assigned to a new cluster $C_i^0$, the initial state of cluster $C_i$.

$$C_i^0 = \{d\}$$

The document $d$, the first document in the new cluster, is called the seed document (or initialisation document). The seed document is randomly selected among the documents that do not belong to the clusters $C_1 \sim C_{i-1}$. The keyword set $K_d$ of a document $d$ is the set of keywords $k_1, k_2, \ldots, k_n$ extracted from $d$. The initial state $K_{C_i}^0$ of the keyword set is initialised by $K_d$.

$$K_{C_i}^0 = K_d, \qquad K_d = \{\, k \mid k \text{ is a keyword extracted from } d \,\}$$

    C_i^0 = { d }
    K_{C_i}^0 = K_d
    C_i^1 = { x | x is a document with k in K_x for some k in K_{C_i}^0 }
    j = 1
    do {
        K_{C_i}^j = union of K_x, where x in C_i^j
        C_i^{j+1} = C_i^j
        for all x in C_i^j
        begin
            s = sim(x, K_{C_i}^j)
            if (s < threshold) C_i^{j+1} = C_i^{j+1} - { x }
        end for
        j = j + 1
    } while ( isDeleteDocument() )
    C_i = C_i^j

Figure 1. Keyword-based clustering algorithm

3.2 Expanding the Cluster

In the initialisation step, a new cluster $C_i^0$, the initial state of cluster $C_i$, is established from the seed document, and the keyword set $K_{C_i}^0$ is initialised with the keyword set of the seed document. In the expanding step, the cluster is expanded by adding documents that include the keywords of the seed document, treating them as documents related to the seed document. That is, all documents in which a keyword of $K_{C_i}^0$ (the keywords extracted from the seed document) appears are added to form $C_i^1$, the next state of $C_i^0$.

$$C_i^1 = \{\, x \mid k \in K_x \text{ for some } k \in K_{C_i}^0 \,\}$$

Cluster expansion is performed by iterating keyword expansion and cluster expansion. More documents are added to a cluster through the similarity evaluation between the keyword set and the document. If a new document is added to a cluster, the keywords in the added document are also added to the keyword set of the cluster. The first expansion is performed with the keyword set extracted from the seed document. The second expansion is performed with the new keywords added to the cluster as a result of the first expansion. In general, the $j$-th expansion is performed with the $(j-1)$-th state of the keyword set. The number of iterations is decided through experiment.
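The initialise, expand, and prune cycle described in this section can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: it performs a single keyword-based expansion from the seed and then repeats the pruning pass until no document is removed. The similarity used here (the average share of cluster members containing each keyword of a document) is a stand-in chosen for brevity, and `build_cluster`, the threshold of 0.5, and the toy documents are all hypothetical.

```python
from collections import Counter


def build_cluster(seed, doc_keywords, unassigned, threshold=0.5):
    """Grow one cluster from a seed document, then prune weakly related documents.

    doc_keywords maps a document id to its set of keywords.
    """
    seed_kw = doc_keywords[seed]
    # Expansion: every unassigned document that shares a keyword with the seed.
    cluster = {seed} | {x for x in unassigned if doc_keywords[x] & seed_kw}
    while True:
        # Rebuild the cluster keyword statistics from the current members.
        counts = Counter(k for x in cluster for k in doc_keywords[x])
        removed = set()
        for x in cluster:
            kws = doc_keywords[x]
            # Stand-in similarity (an assumption): average share of cluster
            # members that contain each keyword of document x.
            s = sum(counts[k] for k in kws) / (len(kws) * len(cluster)) if kws else 0.0
            if x != seed and s < threshold:
                removed.add(x)
        if not removed:
            return cluster  # stable state: the last pass deleted nothing
        cluster -= removed


docs = {
    "d1": {"election", "vote", "party"},
    "d2": {"election", "vote", "poll"},
    "d3": {"party", "music", "concert"},
    "d4": {"football", "goal"},
}
cluster = build_cluster("d1", docs, {"d2", "d3", "d4"})
```

In this toy run, "d3" enters the candidate set through the shared keyword "party" but is pruned in the first pass because its other keywords are foreign to the cluster, while "d4" never becomes a candidate at all.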
When a cluster is expanded from one state to the next, its keyword set is also expanded to a new keyword set made up of the keywords that appear in all documents of the cluster. The keyword set $K_{C_i}^j$ of $C_i^j$ (the $j$-th state of cluster $C_i$) is the union of the keyword sets of all documents in $C_i^j$.

$$K_{C_i}^j = \bigcup_{x \in C_i^j} K_x$$

The keyword set of the cluster is used to calculate the characteristic vector of the cluster. The characteristic vector consists of the weight values calculated from the term frequency (TF) and inverted document frequency (IDF) of the keywords, and it is used to calculate the similarity between a document and the cluster.

3.3 Cluster Reduction and Completion

This step produces a complete cluster by removing the documents that are not related to it. For the cluster $C_i^j$, documents with low similarity to the cluster are removed through the similarity computation with the cluster. The result of cluster reduction is a filtering of the documents not related to $C_i^j$, and the cluster $C_i^{j+1}$ is generated as the next state of $C_i^j$. Ultimately, the cluster $C_i$ is completed, consisting of the related documents that remain after the non-related documents are filtered out. Once a cluster $C_i$ is completed, the next cluster $C_{i+1}$ is created through the same process. Clustering terminates when all the documents are clustered or no more clusters are created.
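One reduction pass can be sketched as follows: the keywords of all cluster members are pooled into a TF-IDF characteristic vector, and each member is kept only if its own vector is similar enough to it. The cosine measure, the log(N/df) form of IDF, the threshold, and the document-frequency counts are assumptions made for this illustration; the text does not fix any of them.

```python
import math
from collections import Counter


def tf_idf(terms, df, n_docs):
    """TF-IDF weights for a bag of terms; df holds collection-wide document frequencies."""
    tf = Counter(terms)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}


def cosine(a, b):
    """Cosine similarity of two <term, weight> dictionaries."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def reduce_cluster(cluster, doc_terms, df, n_docs, threshold):
    """Keep only the documents whose similarity to the cluster vector reaches the threshold."""
    # Characteristic vector: TF-IDF over the pooled keywords of all members.
    pooled = [t for x in cluster for t in doc_terms[x]]
    characteristic = tf_idf(pooled, df, n_docs)
    return {x for x in cluster
            if cosine(tf_idf(doc_terms[x], df, n_docs), characteristic) >= threshold}


doc_terms = {"a": ["tax", "budget"], "b": ["tax", "budget", "vote"], "c": ["tax", "opera"]}
df = {"tax": 5, "budget": 2, "vote": 2, "opera": 1}  # hypothetical counts over 10 documents
kept = reduce_cluster({"a", "b", "c"}, doc_terms, df, n_docs=10, threshold=0.7)
```

Repeating this pass until the returned set stops changing gives the stable state that completes the cluster.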

[Figure 2. Overall architecture of keyword-based clustering. Stages shown: input documents, keyword extraction, create inverted file, create and initialise a cluster, expand cluster, reduce/complete cluster, clusters.]

4 Design and Implementation

The structure of the keyword-based clustering system is shown in Figure 2. First, keywords are extracted from each input document and their weight values are computed. The keywords and their scores are stored in an inverted-file structure. The inverted-file structure is well suited to expanding a cluster, that is, to adding the documents that include a given keyword to the initial cluster.

Figure 3 shows an example of the operation of the document clustering system: initialization, expansion, reduction, and completion of clusters. A new cluster is created and it includes a seed document $d$. The initial set of keywords for the initial state of the cluster is the keyword set $K_d$ of document $d$.

$$K_d = \{T_1, T_2, \ldots, T_n\}$$

For the terms in $K_d$, the documents that contain the same term are added as candidate documents of the cluster. Let the candidate documents be $D_{1a}, D_{1b}, D_{2a}, D_{2b}, \ldots, D_{na}, D_{nb}$; then $D_{xy}$ is a document that is added through term $T_x$. The keyword set of the cluster is reconstructed from the new set of documents. In each step of the cluster expansion, the number of keywords used for the expansion and the threshold of the weight value are decided through experiments, considering the maximum number of document candidates in a cluster. Also, the <keyword, weight> pairs, as an intermediate representative of the cluster, are a very important factor in cluster expansion.

[Figure 3. Example of keyword-based clustering]

Next, a new keyword set limited to the cluster candidates is constructed to obtain the cluster documents. Through the similarity calculation between each document and the candidate centroid of the cluster, the relevant documents are selected as members of the cluster.
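The role of the inverted file in cluster expansion can be sketched as below: each keyword points to the documents that contain it, so the candidates for a seed document are found by direct lookup rather than by comparing the seed with every document. The names `build_inverted_file` and `candidates`, the `top_k` and `min_score` parameters, and the scores are hypothetical; they stand in for the experimentally chosen number of keywords and weight threshold mentioned in this section.

```python
from collections import defaultdict


def build_inverted_file(doc_keywords):
    """Map each keyword to a postings list of (document id, keyword score)."""
    inverted = defaultdict(list)
    for doc_id, kws in doc_keywords.items():
        for kw, score in kws.items():
            inverted[kw].append((doc_id, score))
    return inverted


def candidates(seed_id, doc_keywords, inverted, top_k=None, min_score=0.0):
    """Documents that share one of the seed's top_k keywords with a score >= min_score."""
    # Rank the seed's keywords by score; top_k=None keeps all of them.
    ranked = sorted(doc_keywords[seed_id].items(), key=lambda p: -p[1])[:top_k]
    return {d for kw, _ in ranked
            for d, s in inverted[kw]
            if s >= min_score and d != seed_id}


doc_keywords = {
    "d1": {"election": 0.9, "party": 0.4},
    "d2": {"election": 0.8},
    "d3": {"party": 0.7, "music": 0.9},
}
inverted = build_inverted_file(doc_keywords)
all_candidates = candidates("d1", doc_keywords, inverted)
top_candidates = candidates("d1", doc_keywords, inverted, top_k=1)
```

Restricting the expansion to the highest-ranked keywords of the seed keeps the candidate set small, which is the trade-off the text describes between the number of expansion keywords and the maximum number of candidates.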
Through the iterations of keyword selection and reconstruction of the related documents, a new cluster is completed when it reaches a stable state with a strong relationship between the keyword set and the document set.

5 The Experiments

We implemented our clustering algorithm and applied it to the clustering of similar documents. The test documents for the experiment were collected from three days of newspaper articles. The total number of articles is 383, and on average 32 terms are extracted from each article. We performed document clustering by applying different criteria for term selection: 1) frequency-based term selection; 2) percentage-based keyword selection; and 3) keyword selection by absolute number of keywords.

Figure 4 shows the result of similarity clustering by frequency-based term selection. In this experiment, three types of term selection are performed.

- all terms are used for the clustering
- terms with a frequency of more than 2
- terms with a frequency of more than 3

In each experiment, we varied the similarity decision ratio by the percentage of term matches. Figure 4 shows that term selection by frequency 2 or 3 is not good for the representation of a document.

[Figure 4. Frequency-based keyword selection]

In the experiment on percentage-based keyword selection, the terms with high weight values are selected for the similarity calculation of the document. All the curves in Figure 5 have a similar shape except for the [...]% selection. In the case of the [...]% selection, we guess that fewer than [...]% of the keywords are not sufficient for the similarity decision, and that auxiliary keywords are also needed for accuracy. Another point in this experiment is that 3%~6% keyword selection gave better results than the selection of all terms.

[Figure 5. Percentage-based keyword selection]

We compared the F-measure for the selection of a maximum number of keywords. All the experiments in Figure 6 gave better results than the experiment using all the terms in the document. Also, 3~7 keywords with a 6%~7% match ratio performed well for the comparison of document similarity.

[Figure 6. Keyword selection by maximum term match ratio]

6 Conclusion

Clustering algorithms are commonly based on similarity computation by frequency-based statistics to aggregate related documents, and this metric is an important factor in term weighting. We proposed a term weighting method based on keyword features, and tried to compensate for the drawback of the frequency-based metric. Based on the keyword weighting scheme, documents with the same keywords are grouped into a cluster candidate, and a new cluster is created by removing the irrelevant documents. We performed an experiment on the clustering of similar documents, and the results showed that the keyword-based weighting scheme is better than the frequency-based method.
Our keyword-based algorithm uses 3%~6% of the terms for clustering and does not need a similarity matrix, so it should be well suited to clustering a huge number of documents. We also expect that this algorithm will be good for topic tracking of special events. In the experiment, we randomly selected the seed document, and the algorithm is somewhat sensitive to that choice. Our next research will therefore focus on minimizing the effect of the seed document by obtaining representative keywords before starting the clustering.

References

[1] Anderberg, M. R., Cluster Analysis for Applications, New York: Academic, 1973.
[2] Can, F. and E. A. Ozkarahan, Dynamic Cluster Maintenance, Information Processing & Management, Vol. 25, pp. 275-291, 1989.
[3] Dubes, R. and A. K. Jain, Clustering Methodologies in Exploratory Data Analysis, Advances in Computers, Vol. 19, pp. 113-227, 1980.
[4] Frakes, W. B. and R. Baeza-Yates, Information Retrieval, Prentice Hall, 1992.
[5] Kang, S. S., H. G. Lee, S. H. Son, G.. Hong and B. J. Moon, Term Weighting Method by Postposition and Compound Noun Recognition, Proceedings of the 13th Conference on Korean Language Computing, pp. 96-98, 2001.
[6] Murtagh, F., Complexities of Hierarchic Clustering Algorithms: State of the Art, Computational Statistics Quarterly, Vol. 1, pp. 101-113, 1984.
[7] Perry, S. A. and P. Willett, A Review of the Use of Inverted Files for Best Match Searching in Information Retrieval Systems, Journal of Information Science, Vol. 6, pp. 59-66, 1983.
[8] Sibson, R., SLINK: an Optimally Efficient Algorithm for the Single-Link Cluster Method, Computer Journal, Vol. 16, pp. 328-342, 1973.
[9] Willett, P., Document Clustering Using an Inverted File Approach, Journal of Information Science, Vol. 2, pp. 223-231, 1980.
[10] Willett, P., Recent Trends in Hierarchic Document Clustering: A Critical Review, Information Processing and Management, Vol. 24, No. 5, pp. 577-597, 1988.