Online Text Mining System based on M2VSM


FR-E2-1, SCIS & ISIS 2008

Yasufumi Takama 1, Takashi Okada 1, Toru Ishibashi 2
1. Tokyo Metropolitan University, 2. Tokyo Metropolitan Institute of Technology
6-6 Asahigaoka, Hino, Tokyo, Japan
email: ytakama@sd.tmu.ac.jp

Abstract

This paper proposes an online text mining system developed on the basis of M2VSM (Meta keyword-based Modified VSM). When the conventional vector space model (VSM) is applied to document clustering, it is difficult to adjust the granularity of clusters in terms of topic. To solve this problem, M2VSM is proposed as an extended VSM that can consider meta keywords, such as adjectives and adverbs, as additional values of indexing terms. The similarity between documents is calculated by considering the matching of meta keywords for each index term, which makes it possible to cluster documents with various granularities in terms of topic. The online text mining system is developed with MUSASHI, one of the most popular open source data mining tools. Using the system, users can perform a series of text mining processes online, including preprocessing, feature selection, clustering, and visualization of results. Experimental results show that the clustering results of M2VSM match those of test subjects in both rough and detailed clustering. It is also shown that the system can process a database containing 5,000 documents within 7 minutes.

I. INTRODUCTION

In recent years, huge databases can easily be found on the Web, owing to breakthroughs in techniques for information acquisition and the dramatic fall in the price of mass storage devices. The volume of such databases is already beyond the human ability to process information, and intelligent support by information technologies, including information retrieval and data mining, is required. Various kinds of data mining and information retrieval techniques have been developed based on the vector space model (VSM) because of its several advantages. One of them is the ability to rank documents in order of the expectation that they are appropriate to a user's query.
However, with the conventional VSM it is difficult to adjust the granularity of clusters in terms of topic. For example, when VSM is applied to a document database of a specific field such as medicine, the documents tend to form dense clusters in the vector space because of the high similarity between them; therefore, performance decreases [3]. Employing additional words as index terms is one of the usual solutions, because increasing the number of dimensions by increasing the number of index terms can make the vector space sparse. However, this sometimes leads to problems such as the curse of dimensionality, which prevents the accurate expression of the relationships between documents. Furthermore, clusters found in such a sparse space tend not to have a corresponding topic, which makes them hard for humans to interpret.

To solve the above-mentioned problems, M2VSM (Meta keyword-based Modified VSM) has been proposed by extending the conventional VSM [3, 4]. M2VSM makes use of meta keywords as additional values of indexing terms, and the similarity between documents is calculated by considering the matching of meta keywords for each index term.

This paper proposes a text mining system developed on the basis of M2VSM. It is designed for analyzing a large volume of documents, from preprocessing such as the selection of index terms and meta keywords, to document clustering. It is developed with MUSASHI, one of the most popular open source data mining tools. Using the system, users can perform a series of text mining processes online, including preprocessing, feature selection, clustering, and visualization of results. Experimental results show that M2VSM can generate clusters that match those generated by test subjects in both rough and detailed clustering. It is also shown that the developed system can analyze 5,000 documents within 400 seconds, which means it is suitable for practical use in terms of processing speed.

II. M2VSM

A. Vector Space Model

The VSM has been widely used in the traditional information retrieval field.
The VSM creates a multi-dimensional space in which both documents and queries are represented by vectors. For a fixed collection of documents, an $N_w$-dimensional vector is generated for each document and query from sets of terms with associated weights, where $N_w$ is the number of indexing terms in the document collection. The similarity between documents, including a query, is then calculated by the cosine measure. In VSM, the weight $w_{ij}$ associated with the term $t_i$ in document $D_j$ is often calculated by the TFIDF (Term Frequency Inverse Document Frequency) measure [11], which is given by Eq. (1),

$$\mathrm{TFIDF}(t_i, D_j) = \frac{m_{ij}}{M_j} \log \frac{N_D}{DF(t_i)}, \tag{1}$$

where $m_{ij}$ represents the number of occurrences (frequency) of term $t_i$ in document $D_j$, $M_j$ represents the total frequency of indexing terms in $D_j$, $N_D$ is the total number of documents, and $DF(t_i)$ is the number of documents containing $t_i$. The similarity $\mathrm{sim}(D_i, D_j)$ between documents $D_i$ and $D_j$ is defined as the cosine of the document vectors (Eq. (2)).

$$\mathrm{sim}(D_i, D_j) = \frac{\sum_{n=1}^{N_w} w_{ni} w_{nj}}{|D_i|\,|D_j|}. \tag{2}$$

B. Outline of M2VSM

As mentioned in Section I, when the conventional VSM is applied to cluster documents, it is difficult to adjust the granularity of clusters in terms of topic. In particular, VSM is not good at dividing documents in terms of detailed topics. Therefore, when it is applied to a database in a specific field such as medicine, the documents can become crowded in the vector space [3, 4]. Furthermore, even when hierarchical clustering such as AHC [1] is employed, the obtained hierarchy does not always correspond to a topical hierarchy like that of Web directory services. One of the reasons for this problem is the existence of indexing terms that appear in many documents because they have general meanings in the field. Therefore, increasing the number of indexing terms not only fails to resolve the problem but, at worst, also causes the curse of dimensionality.

M2VSM assumes that if the same indexing term has different meta keywords (adjectives, adverbs, etc.) as its modifiers in different documents, the documents refer to different topics. In other words, index terms on their own represent a topic in a general sense, whereas index terms combined with meta keywords represent a topic in more detail. That is, meta keywords give additional values to indexing terms. Given the collection of meta keywords $S_M$, we define the similarity as Eqs. (3)-(5),

$$\mathrm{sim}(D_i, D_j) = \frac{\sum_{n=1}^{N_w} \alpha_n w_{ni} w_{nj}}{|D_i|_\alpha\,|D_j|_\alpha}, \tag{3}$$

$$|D_i|_\alpha = \sqrt{\sum_{n=1}^{N_w} \alpha_n w_{ni}^2}, \tag{4}$$

$$\alpha_n = \alpha^{k}, \quad k = |ME_{ni} \cap ME_{nj}|, \tag{5}$$

where $ME_{ni}$ (a subset of $S_M$) represents the set of meta keywords of indexing term $t_n$ in a document $D_i$, and $\alpha_n$ reflects the degree of co-occurring meta keywords of $t_n$ between $D_i$ and $D_j$ in the document similarity calculation. The $\alpha$ (> 1) is a parameter for adjusting the effect of meta keywords, which is set to 3 in the experiments of this paper. In the previous study [3, 4], the range of $\alpha_n$ is defined as [0,1], which means that the existence of a meta keyword appearing in either document (i.e., either $D_i$ or $D_j$) decreases the similarity.
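The TFIDF weight and the meta keyword-weighted similarity defined above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dictionary-based document representation, the natural logarithm, and the choice to weight each squared term in the norm by the same pairwise $\alpha_n$ are all assumptions made for the example.

```python
import math


def tfidf(m, M, n_docs, df):
    """TFIDF weight of one term in one document: (m / M) * log(N_D / DF(t)).

    The logarithm base is not stated in the text; natural log is assumed here.
    """
    return (m / M) * math.log(n_docs / df)


def m2vsm_similarity(w_i, w_j, me_i, me_j, alpha=3.0):
    """Meta keyword-weighted cosine similarity between two documents.

    w_i, w_j   : dict mapping index term -> weight (hypothetical representation)
    me_i, me_j : dict mapping index term -> set of meta keywords modifying it
    alpha      : parameter (> 1); the paper uses 3 in its experiments
    """
    num = norm_i = norm_j = 0.0
    for term in set(w_i) | set(w_j):
        # k = number of meta keywords the term shares between the two documents
        k = len(me_i.get(term, set()) & me_j.get(term, set()))
        a_n = alpha ** k  # a_n >= 1, and equals 1 when nothing is shared
        wi = w_i.get(term, 0.0)
        wj = w_j.get(term, 0.0)
        num += a_n * wi * wj
        # Assumption: the norm uses the same pairwise a_n as the numerator.
        norm_i += a_n * wi * wi
        norm_j += a_n * wj * wj
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0
    return num / math.sqrt(norm_i * norm_j)
```

For two documents that share one of two index terms with equal weights, the plain cosine is 0.5; if that shared term also carries the same meta keyword in both documents, the weighted similarity rises, which is the effect that lets a high threshold separate documents by detailed topic.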
Compared with the previous study, $\alpha_n$ (≥ 1) reflects the influence of meta keywords more positively in the similarity calculation.

C. Selection of Meta Keyword

In Sec. II-B, adjectives and adverbs are referred to as meta keywords. More precisely, we select meta keywords from adjectives, adverbs, adnominal nouns, and adjective verbs. These parts of speech are used for describing the characteristics or state of a target object, sentiment, emotion, etc., which makes them suitable as meta keywords.

This paper focuses on processing documents written in Japanese. Basically, a meta keyword of an index term t is defined as an adjective, adverb, adnominal noun, or adjective verb that has a modification relation with t. In this paper, nouns are used as index terms unless they are used as adnominal nouns. Note that the text mining system in Sec. III allows the parts of speech for index terms and meta keywords to be specified interactively. In order to identify index terms and meta keywords in Japanese documents, this paper employs the Japanese dependency parser CaboCha [6].

III. TEXT MINING SYSTEM BASED ON M2VSM

Fig. 1 shows the system architecture of the developed text mining system based on M2VSM. The current version of the system can only handle Japanese documents. It consists of four processing components: preprocessing, indexing / meta keyword selection, clustering, and visualization. Given a set of documents to be analyzed, the preprocessing component performs morphological analysis, syntactic and dependency parsing, removal of words belonging to parts of speech that are not used as indexing terms or meta keywords, and rejoining of words that are excessively segmented. The result is stored in a database in order to speed up the subsequent processing.

Fig. 1. System architecture of text mining system based on M2VSM (components: Preprocessing, Indexing / Meta-keyword selection, Clustering, Visualization; user inputs: Document Set Selection, Method Selection, Cluster Selection)

In the next step, a set of index terms and a set of meta keywords, which are used for similarity calculation by M2VSM, are selected. This step is performed interactively with the help of a user. In the third step, document clusters are generated from the target document set based on the document-document similarity calculated by M2VSM. The system employs single-pass clustering [8, 9] in order to process a large number of documents within a reasonable time. When performing clustering, three similarity measures can be applied: the single-linkage, complete-linkage, and average-linkage methods. It is also possible to calculate the similarity based on the ordinary VSM. The result of clustering is presented to a user either in table format or using information visualization [5, 10].
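The clustering step described above can be sketched as follows. The paper does not spell out the assignment rule, so this assumes the common form of single-pass clustering: each document is compared once against the clusters built so far and joins the best one if its similarity reaches the threshold, otherwise it starts a new cluster. The function name, the `sim` callback, and the `LINKAGE` table are hypothetical names introduced for illustration.

```python
from statistics import mean

# Document-to-cluster similarity under the three linkage options:
# single = most similar member, complete = least similar member,
# average = mean over all members.
LINKAGE = {"single": max, "complete": min, "average": mean}


def single_pass_cluster(docs, sim, threshold, linkage="average"):
    """Cluster docs in one pass; returns a list of clusters of document indices.

    docs      : sequence of documents in arrival order
    sim       : function (doc, doc) -> similarity, e.g. VSM or M2VSM similarity
    threshold : minimum document-to-cluster similarity needed to join a cluster
    """
    aggregate = LINKAGE[linkage]
    clusters = []
    for idx in range(len(docs)):
        best, best_score = None, threshold
        for cluster in clusters:
            score = aggregate([sim(docs[idx], docs[m]) for m in cluster])
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append([idx])  # no cluster is similar enough
        else:
            best.append(idx)
    return clusters
```

Because each document is visited only once, the cost grows with the number of documents times the number of cluster members compared, rather than requiring a full dendrogram, which is consistent with the stated goal of processing many documents in reasonable time. Raising `threshold` yields more, finer clusters; lowering it yields fewer, rougher ones.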

The system is implemented using MUSASHI [2, 7, 12], a well-known open source data mining tool. MUSASHI provides a set of commands for processing vast amounts of data, as shown in Table 1. Using MUSASHI is expected to make it possible to develop a stable and efficient system in a relatively short development time.

Table 1. Part of command set of MUSASHI

Command          Brief description
xtagg            Aggregation of records
xtbar            Generation of bar graph (SVG format)
xtcat            Merge of multiple XML tables
xtcomb           Calculation of combination
xtcount          Counting the number of rows
xtcut            Selection of items
xtcorrelation    Calculation of correlation coefficient

A. Selection of index terms / meta keywords

In the developed system, a user can interactively select the index terms and meta keywords to be used for the analysis. First, the user selects index terms, and then selects meta keywords from the remaining words. When selecting index terms, words are filtered based on the parts of speech specified by the user. The result is further filtered by specifying minimum and maximum df values (DF(t) in Eq. (1)). To help the user specify appropriate df values, the histogram of df values is presented to the user, as shown in Fig. 2. Finally, the user can examine each of the words obtained by those filtering processes and remove unwanted words.

Fig. 2. Histogram of DF values (x-axis: DF; y-axis: # of index terms; annotation: filtering by DF values)

The selection of meta keywords is performed in a similar way, except that typical index terms are presented for each candidate meta keyword in the last step. A user can select meta keywords by examining their relations with the corresponding index terms.

B. Output of clustering results

The result of clustering is presented to a user in two types of formats: a table format and visualization by Keyword Map [5]. Tables are generated as HTML files, which a user can view with an ordinary HTML browser. There are three types of tables: a table comparing the clustering results with different thresholds, a table showing the summary of a clustering result, and a table showing the detail of a cluster.

As one of the advantages of M2VSM is that it can perform both rough and detailed clustering, as discussed in Sec. II-B, the developed system can perform several clustering processes with different thresholds in the same trial. The first table summarizes these clustering results. For each clustering result, the threshold used for clustering, the number of obtained clusters, and the number of documents in each cluster are presented.

By selecting one or more interesting clustering results, a summary of the results is shown as the second table. This table contains the number of documents, the numbers of index terms and meta keywords, and the frequencies of index terms and meta keywords for each cluster.

By selecting one or more interesting clusters, detailed information about the clusters is shown as the third table. This table contains typical index terms together with the corresponding meta keywords, and typical documents for the selected clusters. Typical index terms and meta keywords are selected based on their frequencies, and up to 5 documents close to the centroid of a cluster are selected as typical documents of the cluster.

Other detailed information about a cluster, such as the relationships between index terms, between index terms and meta keywords, between meta keywords, and between cluster centroids and index terms, is also displayed by Keyword Map, as shown in Fig. 3. Keyword Map treats an index term, meta keyword, or cluster centroid as a node, which is arranged according to its relationships with other nodes so that related nodes form a cluster on the map. Keyword Map employs the spring model for drawing a map.

Fig. 3. Analyzed result presented by Keyword Map

IV. EXPERIMENTS

A. Performance of M2VSM

Experiments are performed with three document sets written in Japanese. The purpose of the experiments is to show the effectiveness of the proposed M2VSM against the conventional VSM in document clustering at two levels: clustering by general topic (rough clustering) and by detailed topic (detailed clustering). For that purpose, the clustering results of M2VSM, VSM, PVSM (Phrase-based VSM), and test subjects are compared. PVSM is a simple extension of VSM in which a phrase (a combination of a noun and its meta keywords) is used as an index term instead of an independent word. It is expected that PVSM could generate clusters corresponding to more detailed topics than the normal VSM.

The documents used for the experiments are editorial articles of 7 Japanese newspaper companies: Asahi Shimbun, Yomiuri Shimbun, Nikkei Shimbun, Kobe Shimbun, Chugoku Shimbun, Hokkaido Shimbun, and Kahoku Shimpo. The total number of articles, which were collected from June 1, 2005 to December 29, 2005, is 2,298. In order to reduce the burden on test subjects, we first applied M2VSM, VSM, and PVSM to the collected documents, and found small subsets of documents that belong to the same cluster by any of the 3 methods. By adding some noise articles to those subsets, we obtained the 3 document sets A, B, and C, each of which contains 20 documents. That is, each document set forms a single cluster under a general topic, but is divided into several clusters under specific topics. In this process, the single-linkage method is applied and the same threshold is used for the 3 methods. The topics of the document sets are as follows. Here, RC (rough cluster) means the topic of the entire document set, and DC (detailed clusters) means the topics of the clusters when a document set is divided in detail.

- Set A: (RC) international issues, (DC) six-party talks, postwar period, Iran
- Set B: (RC) North Korea, (DC) Japan-North Korea talks, six-party talks
- Set C: (RC) IT, (DC) Rakuten-TBS problem, media and the Internet

Ten test subjects are asked to cluster each of the 3 document sets at two levels. First, they are asked to roughly divide the documents in terms of topic. Then, the obtained clusters are further divided into clusters in terms of more detailed topics. There is no constraint on the number of clusters at each level.
The clustering results of M2VSM, VSM, and PVSM and those of the test subjects are compared with the following measure,

$$\mathrm{Match}(\mathit{method}, \mathit{subject}_i, D) = \frac{d_p(\mathit{method}, \mathit{subject}_i)}{{}_{|D|}C_2}, \tag{6}$$

where method indicates either M2VSM, VSM, or PVSM, and $\mathit{subject}_i$ indicates a subject (i = 1, ..., 10). D is a document set (A, B, or C), and $d_p(\mathit{method}, \mathit{subject}_i)$ is the number of document pairs that are clustered in the same way by both the method and the subject. For example, consider the case where a document set contains 3 documents {1, 2, 3}, a method divides it as {1, 2} and {3}, and a subject divides it as {1, 2, 3}. In this case, the total number of document pairs (i.e., the denominator in Eq. (6)) is 3 (1-2, 1-3, and 2-3), and only the pair 1-2 is clustered in the same way (i.e., belongs to the same cluster) by both of them, so the matching score is 1/3 = 0.33. When the clustering results of a method and a subject are completely the same, Eq. (6) is equal to 1.

Each method is compared with the 10 test subjects using Eq. (6). Tables 2, 3, and 4 summarize the comparison results for document sets A, B, and C, respectively. In these tables, the left column (RC) for each method shows the result for rough clustering, and the right column (DC) shows the result for detailed clustering. That is, the single-linkage method is applied to the document set, and rough clustering is performed by cutting the obtained dendrogram with a low similarity threshold, whereas detailed clustering is performed with a high similarity threshold. Thresholds are determined for each method so that the average score (Eq. (6)) over test subjects is as high as possible. In the tables, NUM shows the number of clusters generated by each method, TH is the threshold used, and Avg., Max, and Min are the average, maximum, and minimum matching scores over the 10 test subjects, respectively (the number in parentheses is the rank among the three methods).

Table 2. Experimental results for set A (columns: NUM, TH, Avg, Max, Min)

Table 3. Experimental results for set B (columns: NUM, TH, Avg, Max, Min)

It can be seen from the tables that M2VSM and VSM obtain the same results for all data sets in the case of rough clustering.
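The matching score of Eq. (6) can be sketched as follows. The sketch assumes that a document pair counts as "clustered in the same way" when the method and the subject agree on whether the two documents belong together (both place them in the same cluster, or both separate them); this reading reproduces the 1/3 worked example and makes identical clusterings score 1, but it is an interpretation rather than a definition stated in the text. The function name and the label-list representation are hypothetical.

```python
from itertools import combinations


def match_score(labels_method, labels_subject):
    """Fraction of document pairs on which two clusterings agree.

    Each argument is a list of cluster labels, one per document, in the same
    document order. Label values themselves need not match between the lists.
    """
    n = len(labels_method)
    if n != len(labels_subject) or n < 2:
        raise ValueError("need two label lists of equal length >= 2")
    pairs = list(combinations(range(n), 2))  # |D| choose 2 pairs
    agree = sum(
        1
        for p, q in pairs
        if (labels_method[p] == labels_method[q])
        == (labels_subject[p] == labels_subject[q])
    )
    return agree / len(pairs)
```

With the example from the text, a method producing {1, 2} and {3} corresponds to labels `[0, 0, 1]` and a subject producing {1, 2, 3} corresponds to `[0, 0, 0]`; only the pair 1-2 is treated identically, giving 1/3.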
However, when detailed clustering is performed, the performance of VSM is lower than that of M2VSM. PVSM tends to outperform VSM when detailed clustering is performed, but its performance tends to be worse than that of the other 2 methods in rough clustering. Most importantly, the proposed M2VSM obtains the best results for all 3 data sets, in both rough and detailed clustering. These results show that M2VSM is capable of adjusting the granularity of clusters in terms of topic.

Table 4. Experimental results for set C (columns: NUM, TH, Avg, Max, Min)

B. Evaluation of M2VSM-based Text Mining System

The performance of the developed text mining system is evaluated in terms of processing time. The documents used for the experiments are editorial articles of the same 7 Japanese newspaper companies. The total number of articles is 5,672, which were collected from May 1, 2005 to Jan 31. Table 5 shows the parameter values used and the specification of the machine on which the system was run. The processing time throughout the analysis process, i.e., from preprocessing to document clustering, is measured while varying the number of documents from 200 to 5,000. Note that the time the user spends interacting with the system is omitted from the processing time. Fig. 4 shows the relationship between processing time and the number of documents. It can be seen that the developed system can process 5,000 documents within 400 seconds.

Table 5. Parameters used for experiment

# of index terms / # of meta keywords    2,
Clustering method                        Average linkage
Threshold for clustering                 0.7
CPU                                      Pentium  GHz
Cache size                               512KB
Memory                                   755MB
Swap                                     1.5GB

Fig. 4. Relationship between the number of documents and processing time (x-axis: # of documents; y-axis: Time (s))

V. CONCLUSIONS

M2VSM, a modified VSM based on meta keywords, is proposed for clustering documents with various granularities in terms of topic. M2VSM makes use of adjectives, adverbs, adnominal nouns, and adjective verbs as meta keywords for index terms, and calculates document similarity while considering the effect of meta keywords. Experiments are performed with sets of editorial articles of Japanese newspapers, and the results show that the obtained clustering results correspond to the results of test subjects in both rough and detailed clustering. A text mining system is developed based on M2VSM, and the experimental result shows that it has sufficient processing speed for practical use. In future work, we are going to provide the system to experts in a specific domain, such as management engineers.
Although only documents written in Japanese are considered in this paper, M2VSM itself can be applied to documents written in other languages, such as English. As preprocessing, i.e., meta keyword extraction, differs from language to language, it should be studied for each language. Our future work includes the application of M2VSM to English documents. In the experiments reported in this paper, clustering at only two levels, rough and detailed, is performed. Applying M2VSM to the generation of multi-level (more than 2 levels) topical structures, such as those of Web directory services, is also a challenging task.

REFERENCES

[1] S. Chakrabarti, Chapter 4: Similarity and Clustering, Mining the Web, Morgan Kaufmann.
[2] Y. Hamuro, N. Katoh, K. Yada, MUSASHI: Flexible and Efficient Data Preprocessing Tool for KDD based on XML, Proceedings of the First International Workshop on Data Cleaning and Preprocessing, pp. 38-49.
[3] T. Ishibashi and Y. Takama, Proposal of M2VSM and Its Comparison with Conventional VSM, AM2004, Vol ICS-128, pp. 1-6.
[4] T. Ishibashi and Y. Takama, Proposal of M2VSM for Information Retrieval in the Specific Field, SCIS&ISIS2004, THP-3-3.
[5] T. Kajinami and Y. Takama, Interactive Keyword Map Equipped with Keywords Arrangement Support Functions for Emphasizing User's Intention, Trans. Information Processing Society of Japan, Vol. 48, No. 3, 2007 (written in Japanese).
[6] T. Kudo and Y. Matsumoto, Japanese dependency analysis using cascaded chunking, Proc. of 6th Conference on Natural Language Learning, Vol. 20, pp. 1-7.
[7] MUSASHI: Mining Utilities and System Architecture for Scalable processing of HIstorical data.
[8] R. Papka and J. Allan, On-line New Event Detection Using Single-pass Clustering, UMASS Computer Science Technical Report, UM-CS.
[9] M. Spitters and W. Kraaij, TNO at TDT2001: Language Model-Based Topic Detection, Topic Detection and Tracking (TDT) Workshop 2001.
[10] Y. Takama, T. Kajinami, and A. Matsumura, Application of Keyword Map-based Relevance Feedback to Interactive Blog Search, AMT 2005.
[11] T. Joachims, A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, in Proceedings of the 14th International Conference on Machine Learning.
[12] K. Yada, Y. Hamuro, N. Katoh, T. Washio, I. Fusamoto, Data Mining Oriented CRM System Based on MUSASHI: C-MUSASHI, Proceedings of Second International Workshop on Active Mining, pp. 52-61.
