Using Wikipedia Anchor Text and Weighted Clustering Coefficient to Enhance the Traditional Multi-Document Summarization


Using Wikipedia Anchor Text and Weighted Clustering Coefficient to Enhance the Traditional Multi-Document Summarization
by Niraj Kumar, Kannan Srinathan, Vasudeva Varma
in 13th International Conference on Intelligent Text Processing and Computational Linguistics, Indian Institute of Technology Delhi, New Delhi, India
Report No: IIIT/TR/2012/-1
Centre for Search and Information Extraction Lab, International Institute of Information Technology, Hyderabad, INDIA
March 2012

Niraj Kumar, Kannan Srinathan, Vasudeva Varma
IIIT-Hyderabad, Hyderabad, INDIA

Abstract. Similar to the traditional approach, we treat summarization as the selection of top ranked sentences from ranked sentence clusters. To achieve this goal, we rank the sentence clusters by the importance of their words, calculated with the PageRank algorithm on a reverse directed word graph of sentences. Next, to rank the sentences in every cluster, we introduce the use of a weighted clustering coefficient, which we compute from the PageRank scores of words. Finally, the most important issue is the presence of many noisy entries in the text, which degrades the performance of most text mining algorithms. To solve this problem, we introduce a Wikipedia anchor text based phrase mapping scheme. Our experimental results on the DUC-2002 and DUC-2004 datasets show that our system performs better than unsupervised systems and better than or comparably with novel supervised systems of this area.

Keywords: Multi-document summarization, sentence clusters, weighted clustering coefficient, PageRank, Wikipedia anchor text.

1 Introduction

Generic summaries reflect the main topics of a document without any additional clues or prior knowledge. According to [5], generic summaries outperform (1) query-based and (2) hybrid summaries in browsing tasks, so the document context of generic summaries helps users in browsing. Digital libraries and the internet now contain a huge amount of text resources, such as text articles, web pages, news documents and educational materials. Each of these contains a large amount of information, and readers have little time to go through it. It is worth noting that such documents do not always come with human-supplied summaries. We believe that an unsupervised approach that generates an extract summary using limited linguistic resources is essential, because it gives quick access to large quantities of such information. Finally, learning/training based systems make us dependent on a corpus or dataset. That is why we focus on developing an unsupervised generic multi-document summarization system that can generate a high quality extract summary without heavy linguistic resources and without learning/training.

1.1 Related Work

Many methods have been proposed for multi-document summarization. The most frequently used techniques are sentence vector representations (where each row represents a sentence and each column represents a term) and graph based methods (where each node is a sentence and each edge represents the pairwise relationship between the corresponding sentences). All these methods rank the sentences according to scores calculated from a set of predefined features, such as term frequency inverse sentence frequency (TF-ISF) [16], [14], sentence or term position [20], and number of keywords [20]. Some state-of-the-art methods and their key features are: centroid-based methods (e.g., MEAD [16]), graph-ranking based methods (e.g., LexPageRank [10]), non-negative matrix factorization (NMF) based methods (e.g., [11]), conditional random field (CRF) based summarization [18], and LSA based methods [11].

1.2 Problem Setup and Motivation

In this section we present some basic issues and problems related to traditional multi-document summarization, and the motivation behind the techniques used to solve them.

Using Wikipedia anchor texts and document titles to handle noisy terms: The presence of noisy words in documents generally reduces the performance of most summarization algorithms, because noisy words often receive good scores from linguistic, statistical or graph theoretical scoring systems. The use of TF-IDF (term frequency and inverse document frequency), WordNet and similar resources shows some improvement, but more is still needed. To solve this issue, we use Wikipedia anchor text and the titles of documents. With their help, we identify the informative terms in the given documents. The anchor texts in Wikipedia have great semantic value, i.e. they provide alternative names, morphological variations and related phrases for the target article. This step has two benefits: (1) it reduces the chance that noisy words receive high importance and (2) it improves the performance of the overall system.

Using PageRank score on reverse directed word graph of sentences to rank the sentence clusters: The use of sentence clusters in multi-document summarization is not new. We use GAAC (group average agglomerative clustering algorithm) to cluster the sentences. To rank the identified sentence clusters, we use the PageRank score of words, calculated on a reverse directed word graph of sentences. This scheme helps in effective ranking of words through voting. In general writing behaviour, we describe a term after writing it. The PageRank score on the reverse directed word graph of sentences effectively captures this.

Use of weighted clustering coefficient: The weighted clustering coefficient helps us identify the strength of ties with strong nodes. Before going into detail, we first describe the clustering coefficient and then describe the need for a weighted clustering coefficient.

The clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster. There are two types of clustering coefficients:
a) Global clustering coefficient: it is designed to give an overall indication of the clustering in the network.
b) Local clustering coefficient: it gives an indication of the embeddedness of a single node.

We use the notion of local clustering coefficient. It can be defined as follows.
a) In an undirected network, the local clustering coefficient $C_{V_i}$ of a node $V_i$ is:

$$C_{V_i} = \frac{2\, e_{V_i}}{K_{V_i}\,(K_{V_i} - 1)} \tag{1}$$

b) In a directed network, the local clustering coefficient $C_{V_i}$ of a node $V_i$ is:

$$C_{V_i} = \frac{e_{V_i}}{K_{V_i}\,(K_{V_i} - 1)} \tag{2}$$

where $K_{V_i}$ = number of neighbours (degree) of $V_i$ and $e_{V_i}$ = number of connected pairs between all neighbours of $V_i$.

Main aim behind the use of weighted clustering coefficients: We believe that each word in a document may have a different level of importance (beyond what is captured by the degree of its node in the graph), and this fact cannot be ignored. The unweighted clustering coefficient obtained from the word graph of sentences helps us identify the embeddedness strength of words with other words in the graph. Using the importance of words in the clustering coefficient (i.e. the weighted clustering coefficient) helps us identify the embeddedness strength of words with other important words in the graph. This is a general social networking behaviour, where the strength or status of any node or person depends upon (1) the strength of that person/node and (2) the strength of its ties with strong friends. By using the PageRank score of words in the calculation of the weighted clustering coefficient, we try to capture both levels of strength. Our system uses the weighted clustering coefficient score of words to calculate the importance of sentences in a sentence cluster. The improvements in the quality of results also support our view (see sub-section 4.2 for results).

2 Framework and Algorithm

2.1 Input Cleaning

Our input cleaning task includes: (1) removal of noisy entries from the entire document collection and (2) sentence filtration. Finally, we stem the entire text by using the Porter stemming algorithm.
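As a small illustration of the unweighted local clustering coefficient of Eq. (1) in sub-section 1.2, the sketch below counts the connected neighbour pairs of a node in an undirected graph. The function name `local_clustering`, the adjacency-set representation and the toy graph are illustrative assumptions, not part of the original system.

```python
from itertools import combinations


def local_clustering(neighbours, node):
    """Eq. (1): 2 * e / (K * (K - 1)) for a node of an undirected graph."""
    adjacent = neighbours[node]
    k = len(adjacent)
    if k < 2:
        return 0.0  # assumption: nodes with fewer than two neighbours score 0
    # e = number of neighbour pairs that are themselves connected
    connected_pairs = sum(
        1 for u, v in combinations(adjacent, 2) if v in neighbours[u]
    )
    return 2.0 * connected_pairs / (k * (k - 1))


# Toy undirected graph (hypothetical): a triangle a-b-c plus a pendant node d.
toy_graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
coefficient_a = local_clustering(toy_graph, "a")  # one of three pairs is linked
coefficient_b = local_clustering(toy_graph, "b")  # its only pair is linked
```

Node `a` has three neighbours of which only the pair (b, c) is connected, so its coefficient is lower than that of `b`, whose two neighbours are linked to each other.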

2.2 Calculation of Importance of Words

The importance of words matters because we use it to calculate the importance of the identified sentence clusters in the next step. To calculate the importance of all distinct words of the given collection, we concatenate all documents of the collection into a single file. Next, we calculate the PageRank score of every word on the reverse directed word graph of sentences. The preparation of this graph and the calculation of PageRank are described below.

Preparing reverse directed word graph of sentences: Let $S = \{S_1, S_2, \ldots, S_n\}$ be the set of sentences from the given collection. To prepare the reverse directed word graph of sentences, we add a reverse directed link for every adjacent word pair of every sentence in the set (see Figure-1). We denote the directed graph as $G = (V, E)$, where $V = \{V_1, V_2, \ldots, V_n\}$ denotes the vertex set and $E$ denotes the link set, with $(V_j, V_i) \in E$ if there is a link from $V_j$ to $V_i$.

Figure-1: Reverse directed word graph of sentences. Here S1, S2 and S3 represent the sentences of the document and a, b, c, d, e, f, g, h and i represent the distinct words.

Calculating PageRank score: For any given vertex $V_i$, let $IN(V_i)$ be the set of vertices that point to it (predecessors), and let $OUT(V_i)$ be the set of vertices that $V_i$ points to (successors). Then the PageRank score of vertex $V_i$ can be defined as [3]:

$$S(V_i) = \frac{1 - d}{N} + d \sum_{j \in IN(V_i)} \frac{S(V_j)}{|OUT(V_j)|} \tag{3}$$

where:
$S(V_i)$ = rank/score of word/vertex $V_i$.
$S(V_j)$ = rank/score of word/vertex $V_j$, from which an incoming link comes to word/vertex $V_i$.
$N$ = number of words/vertices in the word graph of sentences.
$d$ = damping factor (we use a fixed value of 0.85, as used in [3]).
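The two steps of this sub-section can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names, the fixed iteration count, the skipping of self-loops and the handling of words without outgoing links are all assumptions made for illustration.

```python
from collections import defaultdict


def build_reverse_word_graph(sentences):
    """Map each word to the set of words it points to.

    For every adjacent pair (w1, w2) in a sentence the link is reversed,
    i.e. w2 -> w1, so later words "vote" for the words that precede them.
    """
    out_links = defaultdict(set)
    for words in sentences:
        for word in words:
            out_links[word]  # register the node even if it has no out-link
        for first, second in zip(words, words[1:]):
            if first != second:  # assumption: self-loops are skipped
                out_links[second].add(first)
    return dict(out_links)


def pagerank(out_links, damping=0.85, iterations=50):
    """Iterate S(V_i) = (1 - d)/N + d * sum_{j in IN(V_i)} S(V_j)/|OUT(V_j)|."""
    nodes = list(out_links)
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):  # assumption: a fixed number of sweeps
        new_scores = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = out_links[node]
            if not targets:
                continue  # assumption: words without out-links pass on nothing
            share = damping * scores[node] / len(targets)
            for target in targets:
                new_scores[target] += share
        scores = new_scores
    return scores


# Toy sentences, already tokenised and stemmed (hypothetical data).
sentences = [["a", "b", "c"], ["b", "c", "d"], ["a", "d"]]
graph = build_reverse_word_graph(sentences)
word_scores = pagerank(graph)
```

In this toy collection the word `a` is followed by other words in two sentences, so it collects votes from `b` and `d`, while `d` never precedes another word and keeps only the base score $(1 - d)/N$.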

2.3 Preparing Sentence Clusters and Ranking

To identify the topics covered in the documents, we use the group average agglomerative clustering scheme (GAAC). In our case, a topic is considered as a set of sentences related to the same concept. There are three major agglomerative clustering algorithms: single-link, complete-link and average-link clustering. Single-link clustering can lead to elongated clusters. Complete-link clustering is strongly affected by outliers. Average-link clustering is a compromise between the two extremes, which generally avoids both problems. This is the main reason for using the group average agglomerative clustering algorithm to cluster the sentences. GAAC uses the average similarity across all pairs within the merged cluster to measure the similarity of two clusters. In this scheme, the average similarity between two clusters (say, $c_i$ and $c_j$) can be computed as:

$$sim(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{x \in (c_i \cup c_j)} \; \sum_{y \in (c_i \cup c_j):\, y \neq x} sim(x, y) \tag{4}$$

where $sim(x, y)$ = count of co-occurring words in $x$ and $y$.

To apply GAAC to sentences, we use a sentence vector representation of the documents of the entire collection. Here, each row represents a sentence and each column represents a term. In the entire evaluation, we use the threshold 0.4.

Calculating importance of sentence clusters or topics: To calculate the weighted importance of any sentence cluster or topic, we sum the weighted importance of all words in the given sentence cluster:

$$W_C = \sum_{wd \in C} W_{wd} \tag{5}$$

where $W_C$ = weight of the given sentence cluster $C$ and $W_{wd}$ = weight of a word $wd$ in the given sentence cluster (see sub-section 2.2, Eq. 3, for the weight of words).

Next, we calculate the percentage of weighted information of every identified sentence cluster. The percentage weighted importance of any identified sentence cluster can be calculated as:

$$\%W_C = \frac{W_C}{\sum_{C'} W_{C'}} \times 100 \tag{6}$$

where:
$\%W_C$ = percentage weight of the given sentence cluster $C$.
$\sum_{C'} W_{C'}$ = sum of the weights of all identified sentence clusters.
$W_C$ = weight of the given sentence cluster $C$.
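The quantities of Eqs. (4) to (6) can be illustrated with the following sketch. It only evaluates the formulas on toy data and does not perform the full agglomerative merge loop or apply the 0.4 threshold. The function names, the toy clusters and the word scores are hypothetical, and counting each distinct word once per cluster in Eq. (5) is an assumption.

```python
from itertools import combinations


def cooccurrence(x, y):
    """sim(x, y): number of distinct words shared by sentences x and y."""
    return len(set(x) & set(y))


def group_average_similarity(cluster_a, cluster_b):
    """Eq. (4): average pairwise similarity inside the merged cluster."""
    merged = list(cluster_a) + list(cluster_b)
    size = len(merged)
    if size < 2:
        return 0.0
    # Each unordered pair appears twice in the double sum of Eq. (4).
    total = 2 * sum(cooccurrence(x, y) for x, y in combinations(merged, 2))
    return total / (size * (size - 1))


def cluster_weight(cluster, word_scores):
    """Eq. (5): sum of word scores over the words of the cluster.

    Assumption: each distinct word is counted once per cluster.
    """
    words = {word for sentence in cluster for word in sentence}
    return sum(word_scores.get(word, 0.0) for word in words)


def percentage_weights(clusters, word_scores):
    """Eq. (6): share of each cluster in the total weight, in percent."""
    weights = [cluster_weight(cluster, word_scores) for cluster in clusters]
    total = sum(weights)
    return [100.0 * w / total if total else 0.0 for w in weights]


# Hypothetical clusters of tokenised sentences and hypothetical word scores.
c1 = [["a", "b", "c"], ["b", "c", "d"]]
c2 = [["a", "d"]]
toy_scores = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}

merged_similarity = group_average_similarity(c1, c2)
cluster_percentages = percentage_weights([c1, c2], toy_scores)
```

Here the three sentences share 2, 1 and 1 words pairwise, so the group average similarity is 4/3, and the first cluster carries two thirds of the total weight.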

2.4 Mapping Phrases by Using Wikipedia Anchor Text

We use Wikipedia anchor text to identify the informative terms in every identified sentence cluster. For this, we first fix the phrase boundary. Following the scheme defined in [2], we consider stopwords and punctuation marks as phrase boundaries. Next, we stem the entire anchor text collection and find the longest matching Wikipedia anchor text sequence in every word sequence within a phrase boundary. We repeat this process for every word sequence inside the predefined phrase boundaries. We also find the words that match the titles of the entire collection. We remove the rest of the words from every sentence. Thus every sentence in the collection contains a sequence of Wikipedia anchor texts or words from the titles of the collection. We use this mapping of phrases in the calculation of weighted clustering coefficients.

2.5 Calculating Weighted Clustering Coefficient

After step 2.4, the sentences of every identified sentence cluster contain sequences of Wikipedia anchor text words or words from document titles. We now calculate the weighted clustering coefficient of all such words in every sentence cluster. For this, we create an undirected word graph of sentences. The sparse nature of the word graph of sentences is the main reason for selecting an undirected graph for the calculation of the weighted clustering coefficient. The process to calculate the weighted clustering coefficient of every distinct word of a given sentence cluster is given below.

Preparing word graph of sentences: We treat every distinct word as a node of the graph and prepare the undirected word graph of sentences by adding an undirected edge for every adjacent word pair.

Figure-2: Undirected word graph of sentences. Here S1, S2 and S3 denote the sentences and A, B, C, D and N denote the words that are common to Wikipedia anchor text or titles of documents.
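The phrase mapping of sub-section 2.4 and the graph preparation described above can be sketched as follows. This is an illustrative sketch under several assumptions: the stopword list is a toy one, stemming is omitted, anchor texts are given as a set of word tuples, and all function names are hypothetical.

```python
import string

STOPWORDS = {"the", "of", "in", "a", "is", "and"}  # toy list, an assumption


def split_on_boundaries(tokens):
    """Split a token list into word sequences at stopwords and punctuation."""
    chunk, chunks = [], []
    for token in tokens:
        if token in STOPWORDS or all(ch in string.punctuation for ch in token):
            if chunk:
                chunks.append(chunk)
            chunk = []
        else:
            chunk.append(token)
    if chunk:
        chunks.append(chunk)
    return chunks


def map_phrases(tokens, anchor_phrases, title_words):
    """Keep only the longest anchor-text matches and the title words."""
    max_len = max((len(phrase) for phrase in anchor_phrases), default=1)
    kept = []
    for chunk in split_on_boundaries(tokens):
        i = 0
        while i < len(chunk):
            match = None
            # Try the longest candidate first, then shorter ones.
            for length in range(min(max_len, len(chunk) - i), 0, -1):
                candidate = tuple(chunk[i:i + length])
                if candidate in anchor_phrases:
                    match = candidate
                    break
            if match:
                kept.extend(match)
                i += len(match)
            elif chunk[i] in title_words:
                kept.append(chunk[i])
                i += 1
            else:
                i += 1  # neither anchor text nor title word: drop it
    return kept


def build_undirected_graph(sentences):
    """Add an undirected edge for every adjacent word pair."""
    neighbours = {}
    for words in sentences:
        for word in words:
            neighbours.setdefault(word, set())
        for first, second in zip(words, words[1:]):
            if first != second:
                neighbours[first].add(second)
                neighbours[second].add(first)
    return neighbours


# Hypothetical anchor texts, title words and sentence.
anchors = {("page", "rank"), ("text", "summarization"), ("graph",)}
titles = {"wikipedia"}
tokens = "the page rank of a noisy graph , wikipedia text summarization".split()
mapped = map_phrases(tokens, anchors, titles)
word_graph = build_undirected_graph([mapped])
```

In this toy sentence the word `noisy` matches neither an anchor text nor a title word, so it is removed before the graph is built.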
Graph theoretical notation: We denote $G = (V, E)$ as an undirected word graph of sentences, where $V = \{V_1, V_2, \ldots, V_n\}$ denotes the vertex set and $E$ denotes the link set, with $(V_i, V_j) \in E$ if there is a link between $V_i$ and $V_j$.

Calculating link weight: We use the PageRank score of words (see sub-section 2.2 for the calculation of the weight of every word) in the calculation of link weight. The link weight of any edge $E(V_i, V_j)$ can be calculated as:

$$W(V_i, V_j) = \left( \frac{Score(V_i)}{Degree(V_i)} + \frac{Score(V_j)}{Degree(V_j)} \right) \times \frac{L_c(V_i, V_j)}{2} \tag{7}$$

where:
$W(V_i, V_j)$ = link weight of the link between nodes $V_i$ and $V_j$.

$Score(V_i)$ = PageRank score of node (word) $V_i$.
$Score(V_j)$ = PageRank score of node (word) $V_j$.
$Degree(V_i)$ = degree of node (word) $V_i$.
$Degree(V_j)$ = degree of node (word) $V_j$.
$L_c(V_i, V_j)$ = number of links between nodes $V_i$ and $V_j$.

By using this scheme, we calculate the link weight of every edge of the graph.

Calculating weighted clustering coefficient: We use the link weight, calculated from the PageRank score, in the calculation of the weighted clustering coefficient. In doing so, we maintain the properties of unweighted clustering coefficients on undirected graphs (as described in [4]). The value of the weighted clustering coefficient of any node lies in the unit interval, i.e. $\tilde{C} \in [0, 1]$. In the unweighted case, the number of triangles at a node determines its clustering property. In the weighted case, clustering should be determined by some weighted characteristic of triangles. For each triangle, all three edges should be taken into account. For each triangle, the weighted characteristic should be invariant to permutation of the weights. When any of the edge weights of a triangle approaches zero, the weighted characteristic of that triangle should likewise approach zero. When vertex $V_i$ participates in the maximum number $\frac{1}{2} K_{V_i}(K_{V_i} - 1)$ of triangles, and each edge weight is maximal, the weighted clustering coefficient should also be maximal, i.e. $\tilde{C}_{V_i} = 1$.

To obtain the weighted clustering coefficient, [4] replaces $e_{V_i}$ (see Eq. 1) by the sum of triangle intensities. The weighted clustering coefficient of any node $V_i$ can then be defined as:

$$\tilde{C}_{V_i} = \frac{2}{K_{V_i}\,(K_{V_i} - 1)} \sum_{j,k} \left( \tilde{W}(V_i, V_j)\, \tilde{W}(V_j, V_k)\, \tilde{W}(V_k, V_i) \right)^{1/3} \tag{8}$$

where

$$\tilde{W}(V_i, V_j) = \frac{W(V_i, V_j)}{\max(W)} \tag{9}$$

and $W(V_i, V_j)$ is the link weight of the link between nodes $V_i$ and $V_j$ (see Eq. 7). In these equations, $\max(W)$ is the maximum of all edge weights in the given graph. The normalization used in the above equations and the use of the sum of triangle intensities fulfil the conditions given in [4].

2.6 Ranking Sentences inside Every Sentence Cluster

To rank the sentences in every sentence cluster, we use the weighted clustering coefficient of the words in the sentences. We add the weighted clustering coefficient scores of the words to calculate the weight of a sentence.
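The chain from link weights to sentence weights can be sketched as below. The sketch rests on stated assumptions: the link weight follows one reading of Eq. (7), namely the mean of the two degree-normalised PageRank scores multiplied by the link count; the degree of a word is taken as its number of distinct neighbours; each distinct word of a sentence is counted once; and all function names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_link_counts(sentences):
    """Count how often each unordered adjacent word pair occurs (L_c)."""
    counts = Counter()
    for words in sentences:
        for first, second in zip(words, words[1:]):
            if first != second:
                counts[frozenset((first, second))] += 1
    return counts


def link_weights(counts, scores):
    """One reading of Eq. (7): mean degree-normalised score times link count."""
    degree = Counter()
    for pair in counts:
        for word in pair:
            degree[word] += 1  # assumption: degree = distinct neighbours
    weights = {}
    for pair, count in counts.items():
        u, v = tuple(pair)
        weights[pair] = count * (scores[u] / degree[u] + scores[v] / degree[v]) / 2
    return weights


def weighted_clustering(weights):
    """Eqs. (8)-(9): sum of triangle intensities divided by K(K-1)/2."""
    if not weights:
        return {}
    max_weight = max(weights.values())
    norm = {pair: w / max_weight for pair, w in weights.items()}
    neighbours = {}
    for pair in norm:
        u, v = tuple(pair)
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    coefficients = {}
    for node, adjacent in neighbours.items():
        k = len(adjacent)
        if k < 2:
            coefficients[node] = 0.0
            continue
        intensity = 0.0
        for j, m in combinations(sorted(adjacent), 2):
            closing = frozenset((j, m))
            if closing in norm:  # node, j and m form a triangle
                product = (norm[frozenset((node, j))] * norm[closing]
                           * norm[frozenset((node, m))])
                intensity += product ** (1.0 / 3.0)
        coefficients[node] = 2.0 * intensity / (k * (k - 1))
    return coefficients


def sentence_weight(words, coefficients):
    """Sum of the coefficients of the distinct words of a sentence."""
    return sum(coefficients.get(word, 0.0) for word in set(words))


# Hypothetical cluster of mapped sentences with uniform PageRank scores.
cluster = [["a", "b", "c", "a"], ["c", "d"]]
counts = build_link_counts(cluster)
weights = link_weights(counts, {word: 0.25 for word in "abcd"})
coefficients = weighted_clustering(weights)
ranked = sorted(cluster, key=lambda s: sentence_weight(s, coefficients),
                reverse=True)
```

In this toy cluster the words `a`, `b` and `c` form a triangle, while `d` hangs off `c` and therefore receives a coefficient of zero, so the first sentence ranks above the second.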
We finally rank the sentences of the given sentence cluster in descending order of their weight. The weight of a sentence is calculated as:

$$Wt(S_r) = \sum_{w \in S_r} W_{WCC}(w) \tag{10}$$

where:
$Wt(S_r)$ = weight of sentence $S_r$ in the given sentence cluster.
$W_{WCC}(w)$ = weight of a word (node/vertex) $w$ that exists in the given sentence $S_r$, obtained by using the weighted clustering coefficient (see sub-section 2.5, Eq. 8).

2.7 Generating Extract Summary

To generate the extract summary, we select the single top ranked sentence from every identified sentence cluster and arrange the selected sentences according to the rank of their parent sentence cluster (see sub-section 2.3 for the ranking of identified sentence clusters). If the number of sentence clusters is small, we use the percentage weight of every sentence cluster to fix the number of top sentences to be extracted from each cluster. To calculate the percentage weight/importance of any given sentence cluster $C$, we use the following scheme:

$$\%W_C = \frac{W_C}{\sum_{C'} W_{C'}} \times 100 \tag{11}$$

where:
$\%W_C$ = percentage weight of the given sentence cluster $C$.
$\sum_{C'} W_{C'}$ = sum of the weighted importance of all identified sentence clusters.
$W_C$ = weight of the given sentence cluster $C$ (see sub-section 2.3 for the weight of any given sentence cluster).

The number of sentences to be extracted from sentence cluster $C$ is then the nearest higher integer value of $(\%W_C / 100) \times$ (total number of required sentences).

NOTE: If the length of a sentence is more than 40 words, then we discard it and pick the next highest ranked sentence from the same sentence cluster.

3 Pseudo Code

INPUT: ASCII text document.
OUTPUT: Required number of extracted sentences as summary. We truncate the final output to meet the required number of words.
ALGORITHM:

Step 1. Apply input cleaning (see sub-section 2.1).
Step 2. Calculate the importance/weight of every distinct word of the entire text collection (see sub-section 2.2).
Step 3. Identify all sentence clusters in the given collection and rank every identified sentence cluster in descending order of importance/score (see sub-section 2.3).
Step 4. Use Wikipedia anchor text and words from the titles of the document collection to identify the informative words in every identified sentence cluster (see sub-section 2.4).
Step 5. Calculate the weighted clustering coefficients of the informative words of every identified sentence cluster (see sub-section 2.5).
Step 6. Use the weighted clustering coefficient of informative words to rank the sentences in descending order of their weight in every identified sentence cluster (see sub-section 2.6).
Step 7. Apply the sentence extraction scheme to produce the required number of sentences (see sub-section 2.7).

4 Evaluation

We performed two different experiments. In the first experiment, we compare our devised system with state-of-the-art supervised and unsupervised systems. In the second experiment, we test the effect of the weighted clustering coefficient. The details of the dataset, evaluation metrics and results are given below.

Details of dataset: We use the DUC2002 and DUC2004 data sets to evaluate our devised system. The DUC data sets are open benchmark data sets from the Document Understanding Conference (DUC) for generic automatic summarization. Table-1 gives a brief description of the datasets.

Table-1: Details of DUC-2002 and DUC-2004 datasets
                                          DUC2002      DUC2004
number of document collections
number of documents in each collection
data source                               TREC         TDT
summary length                            200 words    665 bytes

Evaluation metric: We use the ROUGE toolkit (version 1.5.5) to measure the summarization performance. To properly evaluate the summaries, we use ROUGE-1, ROUGE-2, ROUGE-SU and ROUGE-L based measures. The rest of the details and the package are available at [13].

4.1 Experiment-1

In this experiment, we empirically compare our devised system's results with the published results of [6].
The systems used in the experimental evaluation of [6] are described below.

Systems used in evaluation. We use the published results of the following widely used document summarization methods as baseline systems to compare with our devised system.
(1) Random: The method selects sentences randomly for each document collection.
(2) Centroid: The method applies the MEAD algorithm [16] to extract sentences according to the following three parameters: centroid value, positional value, and first-sentence overlap.
(3) LexPageRank: The method first constructs a sentence connectivity graph based on cosine similarity and then selects important sentences based on the concept of eigenvector centrality [10].
(4) LSA: The method performs latent semantic analysis on the terms-by-sentences matrix to select sentences having the greatest combined weights across all important topics [11].
(5) NMF: The method performs non-negative matrix factorization (NMF) on the terms-by-sentences matrix and then ranks the sentences by their weighted scores [12].
(6) KM: The method performs the K-means algorithm on the terms-by-sentences matrix to cluster the sentences and then chooses the centroids for each sentence cluster.
(7) FGB: The FGB method is proposed in [19].
(8) BSTM: The published results of the BSTM method [6].

Results: Results are given in Table-2 and Table-3. Table-2 contains the evaluation results on the DUC-2002 dataset. Table-3 contains the evaluation results on the DUC-2004 dataset. The highest evaluation score for every ROUGE evaluation metric is presented in bold font. From the experimental results (as given in Table-2 and Table-3), it is clear that our devised system performs better than all unsupervised systems and better than or comparably with a supervised system like BSTM [6].

Table-2: Evaluation results on DUC-2002 dataset
Systems        ROUGE-1    ROUGE-2    ROUGE-L    ROUGE-SU
DUC Best
Random
Centroid
LexPageRank
LSA
NMF
KM
FGB
BSTM
Our System

Table-3: Evaluation results on DUC-2004 dataset
Systems        ROUGE-1    ROUGE-2    ROUGE-L    ROUGE-SU
DUC Best
Random
Centroid
LexPageRank
LSA
NMF
KM
FGB
BSTM
Our System

4.2 Experiment-2

We use this experiment to justify the use of the weighted clustering coefficient for ranking the sentences in every identified sentence cluster. For this, we make a simple change: we use the unweighted clustering coefficient given in Eq. 1 in place of Eq. 8 (see sub-section 2.5) and run the entire system. The comparative results (i.e. with the weighted clustering coefficient and with the unweighted clustering coefficient) on the DUC-2002 and DUC-2004 datasets are given in Figure-3 and Figure-4 respectively. The results given in Figure-3 and Figure-4 clearly indicate the benefits of using the weighted clustering coefficient.

Figure-3: Experiments using DUC-2002 dataset
Figure-4: Experiments using DUC-2004 dataset

5 Conclusion and Future Work

In this paper we introduce the use of Wikipedia anchor text and the weighted clustering coefficient for multi-document summarization. Additionally, we limit the use of linguistic resources to stopwords, stemmers and punctuation marks. The experimental results show that our devised system performs better than unsupervised systems and better than or comparably with supervised systems of this area. As future work, we are planning to use the relation between Wikipedia anchor texts to improve summary quality. We believe that such relations can improve the weighted clustering coefficient score of informative terms and hence may improve the summary quality.

References

1. D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. In Advances in Neural Information Processing Systems.
2. Niraj Kumar and Kannan Srinathan. Automatic keyphrase extraction from scientific documents using N-gram filtration technique. In Proceedings of the eighth ACM symposium on Document engineering, DocEng '08, Sao Paulo, Brazil.
3. L. Page, S. Brin, R. Motwani and T. Winograd. The PageRank citation ranking: bringing order to the web. Technical report, Stanford Digital Library Technologies Project.
4. Jari Saramaki, Jukka-Pekka Onnela, Janos Kertesz and Kimmo Kaski. Characterizing Motifs in Weighted Complex Networks.
5. Daniel M. McDonald and Hsinchun Chen. Summary in context: searching versus browsing. ACM Transactions on Information Systems, vol. 24, no. 1, January 2006.
6. Dingding Wang, Shenghuo Zhu, Tao Li, Yihong Gong. Multi-Document Summarization using Sentence-based Topic Models. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, Suntec, Singapore, 4 August 2009. ACL and AFNLP.
7. C. Ding and X. He. K-means clustering and principal component analysis. In Proceedings of ICML.
8. Chris Ding, Xiaofeng He, and Horst Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. In Proceedings of SIAM Data Mining.
9. Chris Ding, Tao Li, Wei Peng, and Haesun Park. Orthogonal nonnegative matrix tri-factorizations for clustering. In Proceedings of SIGKDD.
10. G. Erkan and D. Radev. LexPageRank: Prestige in multi-document text summarization. In Proceedings of EMNLP.
11. Y. Gong and X. Liu. Generic text summarization using relevance measure and latent semantic analysis. In Proceedings of SIGIR.
12. Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems.
13. C-Y. Lin and E. Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of HLT-NAACL.
14. C-Y. Lin and E. Hovy. From single to multi-document summarization: A prototype system and its evaluation. In Proceedings of ACL.
15. I. Mani. Automatic summarization. John Benjamins Publishing Company.
16. D. Radev, H. Jing, M. Stys, and D. Tam. Centroid-based summarization of multiple documents. Information Processing and Management.
17. B. Ricardo and R. Berthier. Modern information retrieval. ACM Press.
18. D. Shen, J-T. Sun, H. Li, Q. Yang, and Z. Chen. Document summarization using conditional random fields. In Proceedings of IJCAI.
19. Dingding Wang, Shenghuo Zhu, Tao Li, Yun Chi, and Yihong Gong. Integrating clustering and multi-document summarization to improve document understanding. In Proceedings of CIKM.
20. W-T. Yih, J. Goodman, L. Vanderwende, and H. Suzuki. Multi-document summarization by maximizing informative content-words. In Proceedings of IJCAI 2007.
