Single Document Keyphrase Extraction Using Neighborhood Knowledge

Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008)

Xiaojun Wan and Jianguo Xiao
Institute of Computer Science and Technology
Peking University, Beijing, China
{wanxiaojun,

Copyright 2008, American Association for Artificial Intelligence. All rights reserved.

Abstract

Existing methods for single document keyphrase extraction usually make use of only the information contained in the specified document. This paper proposes to use a small number of nearest neighbor documents to provide more knowledge to improve single document keyphrase extraction. A specified document is expanded to a small document set by adding a few neighbor documents close to the document, and the graph-based ranking algorithm is then applied on the expanded document set to make use of both the local information in the specified document and the global information in the neighbor documents. Experimental results demonstrate the good effectiveness and robustness of our proposed approach.

Introduction

A keyphrase is defined as a meaningful and significant expression consisting of one or more words in a document. Appropriate keyphrases can serve as a highly condensed summary for a document, and they can be used as a label for the document to supplement or replace the title or summary, or they can be highlighted within the body of the document to facilitate users' fast browsing and reading. Moreover, document keyphrases have been successfully used in the following IR and NLP tasks: document indexing (Gutwin et al., 1999), document classification (Krulwich and Burkey, 1996), document clustering (Hammouda et al., 2005) and document summarization (Berger and Mittal, 2000).

Keyphrases are usually manually assigned by authors, especially for journal or conference articles. However, the vast majority of documents (e.g. news articles, magazine articles) do not have keyphrases, so it is beneficial to automatically extract a few keyphrases from a given document to deliver the main content of the document. Here, keyphrases are selected from within the body of the input document, without a predefined list (i.e. controlled vocabulary). Though keyphrase extraction is an important research topic in the NLP and IR fields, it has received less attention than it deserves. Most previous works focus on keyphrase extraction for journal or conference articles, while this paper focuses on keyphrase extraction for news articles, because the news article is one of the most popular document genres on the web and most news articles have no author-assigned keyphrases.

Existing methods conduct the keyphrase extraction task using only the information contained in the specified document, including the phrase's TFIDF, position and other syntactic information in the document. One common assumption of existing methods is that the documents are independent of each other, and the keyphrase extraction task is conducted separately for each document without interactions. However, some topic-related documents actually have mutual influences and contain useful clues which can help to extract keyphrases from each other. For example, two documents about the same topic "earthquake" would share a few common phrases, e.g. "earthquake" and "victim", and they can provide additional knowledge for each other to better evaluate and extract salient keyphrases. Therefore, given a specified document, we can retrieve a few documents topically close to the document from a large corpus through search engines, and these neighbor documents are deemed beneficial for evaluating and extracting keyphrases from the document because they can provide more knowledge and clues for keyphrase extraction from the specified document.

This study proposes to construct an appropriate knowledge context for a specified document by leveraging a few neighbor documents close to the specified document. The neighborhood knowledge can be used in the keyphrase extraction process and help to extract salient keyphrases from the document. In particular, the graph-based ranking algorithm is employed for single document keyphrase extraction by making use of both the word relationships in the specified document and the word relationships in the neighbor documents, where the former relationships reflect the local information existing in the specified document and the latter relationships reflect the global information existing in the neighborhood.

Experiments have been performed on a dataset consisting of 308 news articles and human-annotated keyphrases, and the results demonstrate the good effectiveness of the proposed approach. The use of the neighborhood knowledge can significantly improve the performance of single document keyphrase extraction. We also investigate how the size of the neighborhood influences the keyphrase extraction performance, and it is encouraging that a small number of neighbor documents can improve the performance.

Related Work

The methods for keyphrase (or keyword) extraction can be roughly categorized into either unsupervised or supervised. In this study, we focus on unsupervised methods.

Unsupervised methods usually involve assigning a saliency score to each candidate phrase by considering various features. Krulwich and Burkey (1996) use heuristics to extract keyphrases from a document. The heuristics are based on syntactic clues, such as the use of italics, the presence of phrases in section headers, and the use of acronyms. Barker and Cornacchia (2000) propose a simple system for choosing noun phrases from a document as keyphrases. Muñoz (1996) uses an unsupervised learning algorithm to discover two-word keyphrases. The algorithm is based on Adaptive Resonance Theory (ART) neural networks. Steier and Belew (1993) use the mutual information statistics to discover two-word keyphrases. Tomokiyo and Hurst (2003) use pointwise KL-divergence between multiple language models for scoring both phraseness and informativeness of phrases. More recently, Mihalcea and Tarau (2004) propose the TextRank model to rank keywords based on the co-occurrence links between words. Such algorithms make use of "voting" or "recommendations" between words to extract keyphrases.

Supervised machine learning algorithms have been proposed to classify a candidate phrase as either a keyphrase or not. GenEx (Turney, 2000) and Kea (Frank et al., 1999; Witten et al., 1999) are two typical systems, and the most important features for classifying a candidate phrase are the frequency and location of the phrase in the document. More linguistic knowledge has been explored by Hulth (2003). Statistical associations between keyphrases have been used to enhance the coherence of the extracted keyphrases (Turney, 2003). Song et al. (2003) present an information gain-based keyphrase extraction system called KPSpotter. Medelyan and Witten (2006) propose KEA++, which enhances automatic keyphrase extraction by using semantic information on terms and phrases gleaned from a domain-specific thesaurus. Nguyen and Kan (2007) focus on keyphrase extraction in scientific publications by using new features that capture salient morphological phenomena found in scientific keyphrases.

All the above methods make use of only the information contained in the specified document. The use of neighbor documents to improve single document keyphrase extraction has not been investigated yet.

Other related works include web page keyword extraction (Kelleher and Luz, 2005) and advertising keyword finding (Yih et al., 2006). It is noteworthy that collaborative techniques have been successfully used in the tasks of information filtering (Xue et al., 2005), document summarization (Wan et al., 2007) and web mining (Wong et al., 2006).

Proposed Approach

Overview

Given a specified document $d_0$ for keyphrase extraction, the proposed approach first finds a few neighbor documents for document $d_0$. The neighbor documents are topically close to the specified document and they construct the neighborhood knowledge context for the specified document. In other words, document $d_0$ is expanded to a small document set $D$ which provides more knowledge and clues for keyphrase extraction from $d_0$. Given the expanded document set, the proposed approach adopts the graph-based ranking algorithm to incorporate both the word relationships in $d_0$ (local information) and the word relationships in neighbor documents (global information) for keyphrase extraction from $d_0$. Figure 1 gives the framework of the proposed approach.

1. Neighborhood Construction: Expand the specified document $d_0$ to a small document set $D=\{d_0, d_1, d_2, \ldots, d_k\}$ by adding $k$ neighbor documents. The neighbor documents $d_1, d_2, \ldots, d_k$ can be obtained by using document similarity search techniques;
2. Keyphrase Extraction: Given document $d_0$ and the expanded document set $D$, perform the following steps to extract keyphrases for $d_0$:
   a) Neighborhood-level Word Evaluation: Build a global affinity graph $G$ based on all candidate words restricted by syntactic filters in all the documents of the expanded document set $D$, and employ the graph-based ranking algorithm to compute the global saliency score for each word.
   b) Document-level Keyphrase Extraction: For the specified document $d_0$, evaluate the candidate phrases in the document based on the scores of the words contained in the phrases, and finally choose a few phrases with the highest scores as the keyphrases of the document.

Figure 1: The framework of the proposed approach

For the first step in the above framework, different similarity search techniques can be adopted to obtain neighbor documents close to the specified document. The number $k$ of the neighbor documents influences the keyphrase extraction performance and will be investigated in the experiments.

For the second step in the above framework, substep a) aims to evaluate all candidate words in the expanded document set based on the graph-based ranking algorithm. The global affinity graph aims to reflect the neighborhood-level co-occurrence relationships between all candidate words in the expanded document set. The saliency scores of the words are computed based on the global affinity graph to indicate how much information about the main topic the words reflect. Substep b) aims to evaluate the candidate phrases in the specified document based on the neighborhood-level word scores, and then choose a few salient phrases as the keyphrases of the document.
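Step 1 of the framework above could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes TF-IDF term weights with cosine similarity as the similarity search technique, and the function names (`tfidf_vectors`, `cosine`, `nearest_neighbors`), the smoothed IDF formula and the toy corpus are all hypothetical choices made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF is an assumption; the exact TF-IDF variant is not specified.
        vectors.append({t: c * math.log(1 + n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Normalized inner product of two sparse term vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def nearest_neighbors(docs, target, k):
    """Return [(index, similarity)] for the k documents most similar to docs[target]."""
    vecs = tfidf_vectors(docs)
    sims = [(i, cosine(vecs[target], vecs[i]))
            for i in range(len(docs)) if i != target]
    sims.sort(key=lambda pair: pair[1], reverse=True)
    return sims[:k]

# Toy corpus (hypothetical): document 0 plays the role of d_0.
corpus = [
    "earthquake victim rescue city".split(),
    "earthquake victim aid relief".split(),
    "stock market price rally".split(),
]
neighbors = nearest_neighbors(corpus, target=0, k=1)
```

The returned similarity values can be kept alongside the neighbor indices, since the approach later reuses them when weighting the contribution of each neighbor document.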

Neighborhood Construction

Given a specified document $d_0$, neighborhood construction aims to find a few nearest neighbors for the document from a text corpus or on the Web. The $k$ neighbor documents $d_1, d_2, \ldots, d_k$ and the specified document $d_0$ build the expanded document set $D=\{d_0, d_1, d_2, \ldots, d_k\}$ for $d_0$, which can be considered as the expanded knowledge context for document $d_0$.

The neighbor documents can be obtained by using the technique of document similarity search. Document similarity search finds documents similar to a query document in a text corpus and returns a ranked list of similar documents to users. The effectiveness of document similarity search relies on the function for evaluating the similarity between two documents. In this study, we use the widely-used cosine measure to evaluate document similarity, and the term weight is computed by TFIDF. The similarity $sim_{doc}(d_i, d_j)$ between documents $d_i$ and $d_j$ can be defined as the normalized inner product of the two term vectors $\vec{d_i}$ and $\vec{d_j}$:

$$sim_{doc}(d_i, d_j) = \frac{\vec{d_i} \cdot \vec{d_j}}{\|\vec{d_i}\| \times \|\vec{d_j}\|} \qquad (1)$$

In the experiments, we simply use the cosine measure to compute the pairwise similarity value between the specified document $d_0$ and the documents in the corpus, and then choose $k$ documents (different from $d_0$) with the largest similarity values as the nearest neighbors for $d_0$. In total, there are $k+1$ documents in the expanded document set. For the document set $D=\{d_0, d_1, d_2, \ldots, d_k\}$, the pairwise cosine similarity values between documents are calculated and recorded for later use. The efficiency of document similarity search can be significantly improved by adopting some index structure in the implemented system, such as K-D-B tree, R-tree, SS-tree, SR-tree and X-tree (Böhm & Berchtold, 2001).

The use of neighborhood information is worth more discussion. Because neighbor documents might not be sampled from the same generative model as the specified document, we probably do not want to trust them as much as the specified document. Thus a confidence value is associated with every document in the expanded document set, which reflects our belief that the document is sampled from the same underlying model as the specified document. When a document is close to the specified one, the confidence value is high, but when it is farther apart, the confidence value is reduced. Heuristically, we use the cosine similarity between a document and the specified document as the confidence value. The confidence values of the neighbor documents are incorporated in the keyphrase extraction algorithm.

Keyphrase Extraction

a) Neighborhood-Level Word Evaluation

Like the PageRank algorithm (Page et al., 1998), the graph-based ranking algorithm employed in this study is essentially a way of deciding the importance of a vertex within a graph based on global information recursively drawn from the entire graph. The basic idea is that of "voting" or "recommendation" between the vertices. A link between two vertices is considered as a vote cast from one vertex to the other vertex. The score associated with a vertex is determined by the votes that are cast for it, and the scores of the vertices casting these votes.

Formally, given the expanded document set $D$, let $G=(V, E)$ be an undirected graph reflecting the relationships between words in the document set. $V$ is the set of vertices and each vertex is a candidate word [1] in the document set. Because not all words in the documents are good indicators of keyphrases, the words added to the graph are restricted with syntactic filters, i.e., only the words with a certain part of speech are added. As in Mihalcea and Tarau (2004), the documents are tagged by a POS tagger, and only the nouns and adjectives are added into the vertex set [2]. $E$ is the set of edges, which is a subset of $V \times V$. Each edge $e_{ij}$ in $E$ is associated with an affinity weight $aff(v_i, v_j)$ between words $v_i$ and $v_j$. The weight is computed based on the co-occurrence relation between the two words, controlled by the distance between word occurrences. The co-occurrence relation can express cohesion relationships between words. Two vertices are connected if the corresponding words co-occur at least once within a window of maximum $w$ words, where $w$ can be set anywhere from 2 to 20 words. The affinity weight $aff(v_i, v_j)$ is simply set to the count of the controlled co-occurrences between the words $v_i$ and $v_j$ in the whole document set, with each document's count weighted by its confidence value, as follows:

$$aff(v_i, v_j) = \sum_{d_p \in D} sim_{doc}(d_0, d_p) \times count_{d_p}(v_i, v_j) \qquad (2)$$

where $count_{d_p}(v_i, v_j)$ is the count of the controlled co-occurrences between words $v_i$ and $v_j$ in document $d_p$, and $sim_{doc}(d_0, d_p)$ is the similarity factor reflecting the confidence value for using document $d_p$ ($0 \le p \le k$) in the expanded document set.

[1] The original words are used without stemming.
[2] The corresponding POS tags of the candidate words include JJ, NN, NNS, NNP, NNPS. We used the Stanford log-linear POS tagger (Toutanova and Manning, 2000) in this study.

The graph is built based on the whole document set and it can reflect the global information in the neighborhood; it is called the Global Affinity Graph. We use an affinity matrix $M$ to describe $G$, with each entry corresponding to the weight of an edge in the graph. $M = (M_{i,j})_{|V| \times |V|}$ is defined as follows:

$$M_{i,j} = \begin{cases} aff(v_i, v_j), & \text{if } v_i \text{ links with } v_j \text{ and } i \neq j; \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

Then $M$ is normalized to $\tilde{M}$ as follows to make the sum of each row equal to 1:

$$\tilde{M}_{i,j} = \begin{cases} M_{i,j} \Big/ \sum_{j=1}^{|V|} M_{i,j}, & \text{if } \sum_{j=1}^{|V|} M_{i,j} \neq 0; \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$

Based on the global affinity graph $G$, the saliency score $WordScore(v_i)$ for word $v_i$ can be deduced from those of all other words linked with it, and it can be formulated in a recursive form as in the PageRank algorithm:

$$WordScore(v_i) = \mu \cdot \sum_{\text{all } j \neq i} WordScore(v_j) \cdot \tilde{M}_{j,i} + \frac{(1-\mu)}{|V|} \qquad (5)$$

And the matrix form is:

$$\vec{\lambda} = \mu \tilde{M}^{T} \vec{\lambda} + \frac{(1-\mu)}{|V|} \vec{e} \qquad (6)$$

where $\vec{\lambda} = [WordScore(v_i)]_{|V| \times 1}$ is the vector of word saliency scores, $\vec{e}$ is a vector with all elements equal to 1, and $\mu$ is the damping factor, usually set to 0.85 as in the PageRank algorithm.

The above process can be considered as a Markov chain by taking the words as the states, and the corresponding transition matrix is given by $\mu \tilde{M}^{T} + \frac{(1-\mu)}{|V|} \vec{e}\,\vec{e}^{T}$. The stationary probability distribution of each state is obtained by the principal eigenvector of the transition matrix.

For implementation, the initial scores of all words are set to 1 and the iteration algorithm in Equation (5) is adopted to compute the new scores of the words. Usually the convergence of the iteration algorithm is achieved when the difference between the scores computed at two successive iterations for any word falls below a given threshold ([value missing] in this study).

b) Document-Level Keyphrase Extraction

After the scores of all candidate words in the document set have been computed, candidate phrases (either single-word or multi-word) are selected and evaluated for the specified document $d_0$. The candidate words (i.e. nouns and adjectives) of $d_0$, which are a subset of $V$, are marked in the text of document $d_0$, and sequences of adjacent candidate words are collapsed into a multi-word phrase. Phrases ending with an adjective are not allowed, and only the phrases ending with a noun are collected as candidate phrases for the document. For instance, in the sentence "Mad/JJ cow/NN disease/NN has/VBZ killed/VBN 10,000/CD cattle/NNS", the candidate phrases are "Mad cow disease" and "cattle". The score of a candidate phrase $p_i$ is computed by summing the neighborhood-level saliency scores of the words contained in the phrase.
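The word evaluation and phrase scoring steps described above can be sketched as follows. This is a hedged illustration under stated assumptions rather than the authors' code: the function names (`affinity`, `rank_words`, `rank_phrases`) are hypothetical, the input is assumed to be already POS-tagged as `(word, tag)` pairs, the co-occurrence window is taken over the full token sequence, the convergence tolerance `tol=1e-4` is an assumed value, and the neighbor document `d1` with confidence 0.5 is invented for the example.

```python
from collections import defaultdict

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
CANDIDATE_TAGS = NOUN_TAGS | {"JJ"}

def affinity(tagged_docs, confidences, w=10):
    """Equation (2): windowed co-occurrence counts between candidate words,
    each document weighted by its confidence value sim_doc(d_0, d_p)."""
    aff = defaultdict(float)
    for tagged, conf in zip(tagged_docs, confidences):
        for i, (u, tag_u) in enumerate(tagged):
            if tag_u not in CANDIDATE_TAGS:
                continue
            for v, tag_v in tagged[i + 1 : i + w]:  # words within a window of w
                if tag_v in CANDIDATE_TAGS and u != v:
                    aff[(u, v)] += conf
                    aff[(v, u)] += conf  # undirected graph
    return aff

def rank_words(aff, mu=0.85, tol=1e-4, max_iter=200):
    """Iterate Equation (5) on the row-normalized affinity matrix of Equation (4).
    Candidate words without any edge are left out of this sketch."""
    vocab = sorted({u for u, _ in aff})
    if not vocab:
        return {}
    row_sum = defaultdict(float)
    for (u, _), weight in aff.items():
        row_sum[u] += weight
    score = {v: 1.0 for v in vocab}  # initial scores are all 1
    for _ in range(max_iter):
        new = {v: (1 - mu) / len(vocab) for v in vocab}
        for (u, v), weight in aff.items():
            if row_sum[u]:
                new[v] += mu * score[u] * weight / row_sum[u]
        delta = max(abs(new[v] - score[v]) for v in vocab)
        score = new
        if delta < tol:
            break
    return score

def rank_phrases(tagged, word_score):
    """Collapse runs of adjacent candidate words, keep runs ending with a noun,
    and score each phrase by the sum of its word scores."""
    phrases, run = set(), []
    for word, tag in list(tagged) + [("", "")]:  # sentinel flushes the last run
        if tag in CANDIDATE_TAGS:
            run.append((word, tag))
            continue
        if run and run[-1][1] in NOUN_TAGS:
            phrases.add(tuple(w for w, _ in run))
        run = []
    scored = [(" ".join(p), sum(word_score.get(w, 0.0) for w in p)) for p in phrases]
    return sorted(scored, key=lambda item: item[1], reverse=True)

d0 = [("Mad", "JJ"), ("cow", "NN"), ("disease", "NN"), ("has", "VBZ"),
      ("killed", "VBN"), ("10,000", "CD"), ("cattle", "NNS")]
d1 = [("cow", "NN"), ("disease", "NN"), ("outbreak", "NN")]  # hypothetical neighbor
scores = rank_words(affinity([d0, d1], confidences=[1.0, 0.5], w=10))
ranked = rank_phrases(d0, scores)
```

On the example sentence from the text, the sketch yields the two candidate phrases "Mad cow disease" and "cattle". A run of candidate words that ends with an adjective is discarded whole here, which is one literal reading of the rule; trimming the trailing adjective would be another.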
In equation form, the phrase score is

$$PhraseScore(p_i) = \sum_{v_j \in p_i} WordScore(v_j) \qquad (7)$$

All the candidate phrases in document $d_0$ are ranked in decreasing order of the phrase scores and the top $m$ phrases are selected as the keyphrases of $d_0$. $m$ ranges from 1 to 20 in this study.

Empirical Evaluation

Evaluation Setup

To our knowledge, there was no gold standard news dataset with assigned keyphrases for evaluation. So we manually annotated the DUC2001 dataset (Over, 2001) and used the annotated dataset for evaluation in this study. The dataset was originally used for document summarization. It consisted of 309 news articles collected from TREC-9, of which two articles were duplicates (i.e. d05a\fbis and d05a\fbis-41815~), so the actual number of documents was 308. The articles could be categorized into 30 news topics and the average length of the documents was 740 words.

Two graduate students were employed to manually label the keyphrases for each document. At most 10 keyphrases could be assigned to each document. The annotation process lasted two weeks. The Kappa statistic for measuring inter-agreement among annotators was [value missing]. The annotation conflicts between the two subjects were then resolved by discussion. Finally, 2488 keyphrases were labeled for the dataset. The average number of keyphrases per document was 8.08 and the average number of words per keyphrase was [value missing].

In the experiments, the DUC2001 dataset was used as the corpus for document expansion; it could easily be expanded by adding more documents. Each specified document was expanded by adding the $k$ documents (different from the specified document) most similar to it.

For evaluation of keyphrase extraction results, the automatically extracted keyphrases were compared with the manually labeled keyphrases. The words in a keyphrase were converted to their corresponding basic forms using word stemming before comparison.
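The comparison step just described can be sketched as follows. This is only an illustration: the source does not say which stemmer was used, so `toy_stem` is a deliberately crude stand-in, and the function names and the sample keyphrases are hypothetical.

```python
def toy_stem(word):
    """Crude plural stripping; a stand-in for a real stemmer (assumption)."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def normalize(phrase):
    """Map a keyphrase string to a tuple of stemmed words."""
    return tuple(toy_stem(w) for w in phrase.split())

def match_counts(system, human):
    """system, human: dicts mapping a document id to a list of keyphrase strings.
    Returns (correct, extracted by the system, labeled by humans), summed over documents."""
    correct = n_system = n_human = 0
    for doc_id, gold in human.items():
        gold_set = {normalize(p) for p in gold}
        sys_set = {normalize(p) for p in system.get(doc_id, [])}
        correct += len(gold_set & sys_set)
        n_system += len(sys_set)
        n_human += len(gold_set)
    return correct, n_system, n_human

# Hypothetical example: two of the three extracted phrases match after stemming.
system_out = {"d1": ["earthquake victims", "rescue team", "city"]}
human_out = {"d1": ["earthquake victim", "rescue teams"]}
counts = match_counts(system_out, human_out)
```

The three totals returned are the quantities on which the evaluation metrics are computed.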
Precision $p = count_{correct}/count_{system}$, recall $r = count_{correct}/count_{human}$ and F-measure $F = 2pr/(p+r)$ were used as evaluation metrics, where $count_{correct}$ was the total number of correct keyphrases extracted by the system, $count_{system}$ was the total number of automatically extracted keyphrases, and $count_{human}$ was the total number of human-labeled keyphrases.

Evaluation Results

The proposed approach (i.e. ExpandRank) is compared with baseline methods relying only on the specified document (i.e. SingleRank and TFIDF). The SingleRank baseline uses the graph-based ranking algorithm to compute the word scores for each single document based on the local graph for the specified document. The TFIDF baseline computes the word scores for each single document based on the word's TFIDF value in the specified document. The two baselines do not make use of the neighborhood knowledge.

Table 1 gives the comparison results of the baseline methods and the proposed ExpandRank methods with different neighbor numbers (k=1, 5, 10). In the experiments, the keyphrase number $m$ is typically set to 10 because at most 10 keyphrases can be manually labeled for each document, and the co-occurrence window size $w$ is also simply set to 10.

Table 1. Keyphrase Extraction Results
[Systems compared: TFIDF, SingleRank, ExpandRank (k=1), ExpandRank (k=5), ExpandRank (k=10); the numeric entries are missing.]

As seen from Table 1, the ExpandRank methods with different neighbor numbers always outperform the baseline methods of SingleRank and TFIDF over all three metrics. The results demonstrate the good effectiveness of the proposed method.

In order to investigate how the size of the neighborhood influences the keyphrase extraction performance, we conduct experiments with different values of the neighbor number $k$. Figure 2 shows the performance curves for the ExpandRank method. In the figure, $k$ ranges from 0 to 15. Note that when k=0, the ExpandRank method degenerates into the baseline SingleRank method. We can see from the figure that ExpandRank (i.e. k>0) always outperforms the baseline SingleRank method (i.e. k=0), no matter how many neighbor documents are used. We can also see that the performance of ExpandRank first increases and then decreases with the increase of $k$. The trend demonstrates that very few or very many neighbors will deteriorate the results, because very few neighbors cannot provide sufficient knowledge and very many neighbors may introduce noisy knowledge. As seen from the figure, it is not necessary to use many neighbors for ExpandRank, and the neighbor number can be set to a comparatively small number (i.e. 5), which improves the computational efficiency and makes the proposed approach more applicable.

Figure 2: ExpandRank (m=10, w=10) performance vs. neighbor number k

In order to investigate how the co-occurrence window size influences the keyphrase extraction performance, we conduct experiments with different window sizes $w$. Figures 3 and 4 show the performance curves for ExpandRank when $w$ ranges from 2 to 20. In Figure 3 the neighbor number is set to 5 and in Figure 4 the neighbor number is set to 10.
We can see from the figures that the performance is almost unaffected by the window size, except when $w$ is set to [value missing].

Figure 3: ExpandRank (k=5, m=10) performance vs. window size w

Figure 4: ExpandRank (k=10, m=10) performance vs. window size w

In the above experiments, the keyphrase number is set to 10. We further conduct experiments with different keyphrase numbers $m$ to investigate how the keyphrase number influences the keyphrase extraction performance. Figures 5 and 6 show the performance curves for ExpandRank when $m$ ranges from 1 to 20. In Figure 5 the neighbor number is set to 5 and in Figure 6 the neighbor number is set to 10. We can see from the figures that the precision values decrease with the increase of $m$, and the recall values increase with the increase of $m$, while the F-measure values first increase and then tend to decrease with the increase of $m$.

Figure 5: ExpandRank (k=5, w=10) performance vs. keyphrase number m

Figure 6: ExpandRank (k=10, w=10) performance vs. keyphrase number m

It is noteworthy that the proposed approach has higher computational complexity than the baseline approach because it involves more documents, and we can improve its efficiency by collaboratively conducting single document keyphrase extractions in a batch mode. Suppose keyphrases are to be extracted from multiple documents separately. We can group the documents into clusters, and for each cluster, we can use all other documents as the neighbors for a specified document. Thus the mutual influences between all documents can be incorporated into the keyphrase extraction algorithm, and all the words and phrases in the documents of a cluster are evaluated collaboratively, resulting in keyphrase extraction for all the single documents in a batch mode.

Conclusion and Future Work

This paper proposes a novel approach to single document keyphrase extraction by leveraging the neighborhood knowledge of the specified document. In future work, other keyphrase extraction algorithms will be integrated into the proposed framework, and we will use more test data for evaluation to validate the robustness of the proposed approach.

Acknowledgements

This work was supported by the National Science Foundation of China (No. ) and the Research Fund for the Doctoral Program of Higher Education of China (No. ).

References

Berger, A., and Mittal, V. 2000. OCELOT: A system for summarizing Web Pages. In Proceedings of SIGIR2000.
Barker, K., and Cornacchia, N. 2000. Using noun phrase heads to extract document keyphrases. In Canadian Conference on AI.
Böhm, C., and Berchtold, S. 2001. Searching in high-dimensional spaces: index structures for improving the performance of multimedia databases. ACM Computing Surveys, 33(3).
Frank, E.; Paynter, G. W.; Witten, I. H.; Gutwin, C.; and Nevill-Manning, C. G. 1999. Domain-specific keyphrase extraction. Proceedings of IJCAI-99.
Gutwin, C.; Paynter, G. W.; Witten, I. H.; Nevill-Manning, C. G.; and Frank, E. 1999. Improving browsing in digital libraries with keyphrase indexes. Journal of Decision Support Systems, 27.
Hammouda, K. M.; Matute, D. N.; and Kamel, M. S. 2005. CorePhrase: keyphrase extraction for document clustering. In Proceedings of MLDM2005.
Hulth, A. 2003. Improved automatic keyword extraction given more linguistic knowledge. In Proceedings of EMNLP2003.
Kelleher, D., and Luz, S. 2005. Automatic hypertext keyphrase detection. In Proceedings of IJCAI2005.
Krulwich, B., and Burkey, C. 1996. Learning user information interests through the extraction of semantically significant phrases. In AAAI 1996 Spring Symposium on Machine Learning in Information Access.
Medelyan, O., and Witten, I. H. 2006. Thesaurus based automatic keyphrase indexing. In Proceedings of JCDL2006.
Mihalcea, R., and Tarau, P. 2004. TextRank: Bringing order into texts. In Proceedings of EMNLP2004.
Muñoz, A. 1996. Compound key word generation from document databases using a hierarchical clustering ART model. Intelligent Data Analysis, 1(1).
Nguyen, T. D., and Kan, M.-Y. 2007. Keyphrase extraction in scientific publications. In Proceedings of ICADL2007.
Over, P. 2001. Introduction to DUC-2001: an intrinsic evaluation of generic news text summarization systems. In Proceedings of DUC2001.
Page, L.; Brin, S.; Motwani, R.; and Winograd, T. 1998. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford Digital Libraries.
Song, M.; Song, I.-Y.; and Hu, X. 2003. KPSpotter: a flexible information gain-based keyphrase extraction system. In Proceedings of WIDM2003.
Steier, A. M., and Belew, R. K. 1993. Exporting phrases: A statistical analysis of topical language. In Proceedings of Second Symposium on Document Analysis and Information Retrieval.
Tomokiyo, T., and Hurst, M. 2003. A language model approach to keyphrase extraction. In Proceedings of ACL Workshop on Multiword Expressions.
Toutanova, K., and Manning, C. D. 2000. Enriching the knowledge sources used in a maximum entropy Part-of-Speech tagger. In Proceedings of EMNLP/VLC.
Turney, P. D. 2000. Learning algorithms for keyphrase extraction. Information Retrieval, 2.
Turney, P. D. 2003. Coherent keyphrase extraction via web mining. In Proc. of IJCAI-03.
Wan, X.; Yang, J.; and Xiao, J. 2007. Single document summarization with document expansion. In Proceedings of AAAI2007.
Witten, I. H.; Paynter, G. W.; Frank, E.; Gutwin, C.; and Nevill-Manning, C. G. 1999. KEA: Practical automatic keyphrase extraction. Proceedings of Digital Libraries 99 (DL'99).
Wong, T.-L.; Lam, W.; and Chan, S.-K. 2006. Collaborative information extraction and mining from multiple web documents. In Proceedings of SDM2006.
Xue, G.-R.; Lin, C.; Yang, Q.; Xi, W.; Zeng, H.-J.; Yu, Y.; and Chen, Z. 2005. Scalable collaborative filtering using cluster-based smoothing. In Proceedings of SIGIR2005.
Yih, W.-T.; Goodman, J.; and Carvalho, V. R. 2006. Finding advertising keywords on web pages. In Proceedings of WWW.
