Using Wikipedia Anchor Text and Weighted Clustering Coefficient to Enhance the Traditional Multi-Document Summarization


Using Wikipedia Anchor Text and Weighted Clustering Coefficient to Enhance the Traditional Multi-Document Summarization, by Niraj Kumar, Kannan Srinathan, Vasudeva Varma, in 13th International Conference on Intelligent Text Processing and Computational Linguistics, Indian Institute of Technology Delhi, New Delhi, India. Report No: IIIT/TR/2012/-1. Centre for Search and Information Extraction Lab, International Institute of Information Technology, Hyderabad - 500 032, INDIA. March 2012.

Using Wikipedia Anchor Text and Weighted Clustering Coefficient to Enhance the Traditional Multi-Document Summarization

Niraj Kumar, Kannan Srinathan, Vasudeva Varma
niraj_kumar@research.iiit.ac.in, srinathan@iiit.ac.in, vv@iiit.ac.in
IIIT-Hyderabad, Hyderabad-500032, INDIA

Abstract. Similar to the traditional approach, we treat summarization as the selection of top-ranked sentences from ranked sentence clusters. To achieve this goal, we rank the sentence clusters by using the importance of words, calculated with the PageRank algorithm on a reverse directed word graph of sentences. Next, to rank the sentences in every cluster we introduce the use of the weighted clustering coefficient, computed from the PageRank scores of words. Finally, the most important issue is the presence of many noisy entries in the text, which degrades the performance of most text mining algorithms. To solve this problem, we introduce a Wikipedia anchor-text based phrase mapping scheme. Our experimental results on the DUC-2002 and DUC-2004 datasets show that our system performs better than unsupervised systems and better than or comparable with recent supervised systems in this area.

Keywords: Multi-document summarization, sentence clusters, weighted clustering coefficient, PageRank, Wikipedia anchor text.

1 Introduction

Generic summaries reflect the main topics of a document without any additional clues or prior knowledge. According to [5], generic summaries outperform (1) query-based and (2) hybrid summaries in browsing tasks, so the document context of generic summaries helps users in browsing. Digital libraries, the Internet and similar sources contain a huge amount of text resources, such as text articles, web pages, news documents and educational materials. These in turn contain a huge amount of information, and we have little time to go through it all. It is remarkable that such documents do not always come with human-supplied summaries. We believe that an unsupervised approach to generating extract summaries using limited linguistic resources is essential: it improves quick access to large quantities of such information. Finally, the use of learning/training based systems makes us dependent on a corpus or dataset. That is why we focus our attention on the development of an unsupervised generic multi-document summarization system that can generate a high-quality extract summary without using heavy linguistic resources or learning/training.

1.1 Related Work

Many methods have been proposed for multi-document summarization. The most frequently used techniques are sentence vector representations (where each row represents a sentence and each column represents a term) and graph-based methods (where each node is a sentence and each edge represents the pairwise relationship between the corresponding sentences). Finally, all these methods rank the sentences according to scores calculated from a set of predefined features, such as term frequency-inverse sentence frequency (TF-ISF) [16]; [14], sentence or term position [20], and number of keywords [20]. Some state-of-the-art methods, with their key features, are: centroid-based methods (e.g., MEAD [16]), graph-ranking based methods (e.g., LexPageRank [10]), non-negative matrix factorization (NMF) based methods (e.g., [11]), conditional random field (CRF) based summarization [18], and LSA based methods [11].

1.2 Problem Setup and Motivation

In this section we present some basic issues and problems related to traditional multi-document summarization, and the basic motivation behind the techniques used to solve them.

Using Wikipedia anchor texts and document titles to handle noisy terms: The presence of noisy words in documents generally reduces the performance of most summarization algorithms, because noisy words often obtain good scores under linguistic, statistical or graph-theoretic scoring systems. The use of TF-IDF (term frequency-inverse document frequency), WordNet, etc., shows some improvement, but more is still required. To solve this issue, we use Wikipedia anchor text and the titles of the documents: with their help, we identify the informative terms in the given documents. The anchor texts in Wikipedia have great semantic value, i.e., they provide alternative names, morphological variations and related phrases for the target article. This step has two benefits: (1) it reduces the chance that noisy words acquire high importance, and (2) it improves the performance of the overall system.

Using the PageRank score on a reverse directed word graph of sentences to rank the sentence clusters: The use of sentence clusters in multi-document summarization is not new. We use GAAC (group average agglomerative clustering) to cluster the sentences. To rank the identified sentence clusters, we use PageRank scores of words, calculated on a reverse directed word graph of sentences. This scheme helps in the effective ranking of words through voting. In general writing behaviour, we describe a term after writing it; the PageRank score on a reverse directed word graph of sentences effectively captures this.

Use of the weighted clustering coefficient: The weighted clustering coefficient helps us identify the strength of ties with strong nodes. Before going into detail, we first describe the clustering coefficient and then the need for a weighted clustering coefficient.

The clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster. There are two types of clustering coefficients: (a) the global clustering coefficient, designed to give an overall indication of the clustering in the network, and (b) the local clustering coefficient, which indicates the embeddedness of a single node. We use the notion of the local clustering coefficient, defined as follows:

a) In an undirected network, the local clustering coefficient $C_V$ of a node $V$ can be defined as:

$$C_V = \frac{2 e_V}{K_V (K_V - 1)} \quad (1)$$

where $K_V$ is the number of neighbours (the degree) of $V$ and $e_V$ is the number of connected pairs among the neighbours of $V$.

b) In a directed network, the local clustering coefficient $C_V$ of a node $V$ can be defined as:

$$C_V = \frac{e_V}{K_V (K_V - 1)} \quad (2)$$

Main aim behind the use of weighted clustering coefficients: We believe that each word in a document may have a different level of importance (beyond what is captured by the degree of its node in the graph), and we cannot ignore this fact. The unweighted clustering coefficient obtained from the word graph of sentences helps us identify the embeddedness strength of words with other words in the graph; incorporating the importance of words into the clustering coefficient (i.e., the weighted clustering coefficient) helps us identify the embeddedness strength of words with other important words in the graph. This mirrors general social-network behaviour, where the strength or status of any node or person depends upon (1) the strength of that person/node and (2) the strength of ties with strong friends. By using the PageRank scores of words in the calculation of the weighted clustering coefficient, we try to achieve both levels of strength. Our system uses the weighted clustering coefficient scores of words to calculate the importance of sentences in a sentence cluster. The clear improvement in the quality of the results also supports our view (see sub-section 4.2 for results).

2 Framework and Algorithm

2.1 Input Cleaning

Our input cleaning task includes: (1) removal of noisy entries from the entire document collection and (2) sentence filtration. Finally, we stem the entire text using the Porter stemming algorithm.
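A minimal sketch of this cleaning step in Python. The paper does not specify its noise filters, so the short-sentence filter, the NLTK sentence splitter and NLTK's Porter stemmer below are our assumptions:

```python
import re

from nltk.stem import PorterStemmer        # requires: pip install nltk
from nltk.tokenize import sent_tokenize    # requires the 'punkt' data package

stemmer = PorterStemmer()

def clean_collection(raw_text, min_words=3):
    """Split the concatenated collection into sentences, drop very short
    (likely noisy) ones, lowercase, and Porter-stem every word."""
    sentences = []
    for sent in sent_tokenize(raw_text):
        tokens = re.findall(r"[a-z0-9]+", sent.lower())
        if len(tokens) < min_words:        # sentence filtration (assumed rule)
            continue
        sentences.append([stemmer.stem(t) for t in tokens])
    return sentences                       # list of stemmed word lists
```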

2.2 Calculation of Importance of Words

The calculation of the importance of words is very important, as we use it to calculate the importance of the identified sentence clusters in the next step. To calculate the importance of all distinct words of the given collection, we concatenate all the documents of the collection into a single file. Next, we calculate the PageRank score of every word on the reverse directed word graph of sentences. The construction of the graph and the PageRank calculation are given below.

Preparing the reverse directed word graph of sentences: Let S = {S1, S2, ..., Sn} be the set of sentences from the given collection. To prepare the reverse directed word graph of sentences, we add a reverse directed link for every adjacent word pair of every sentence in the set (see Figure 1). We denote the directed graph as G = (V, E), where V = {V_1, V_2, ..., V_n} denotes the vertex set and E the link set; (V_j, V_i) is in E if there is a link from V_j to V_i.

Figure 1: Reverse directed word graph of sentences. Here S1, S2 and S3 represent the sentences of the document, and a, b, c, d, e, f, g, h and i represent the distinct words.

Calculating the PageRank score: For any given vertex V_i, let IN(V_i) be the set of vertices that point to it (predecessors), and let OUT(V_i) be the set of vertices that V_i points to (successors). Then the PageRank score of vertex V_i can be defined as [3]:

$$S(V_i) = \frac{1-d}{N} + d \sum_{j \in IN(V_i)} \frac{S(V_j)}{|OUT(V_j)|} \quad (3)$$

where $S(V_i)$ is the rank/score of word/vertex $V_i$; $S(V_j)$ is the rank/score of word/vertex $V_j$, from which an incoming link comes to $V_i$; $N$ is the number of words/vertices in the word graph of sentences; and $d$ is the damping factor (we use a fixed damping factor of 0.85, as used in [3]).
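A minimal sketch of Eq. 3 using networkx (an assumption; the paper does not name a library). Reversing each adjacent-word link makes a word collect votes from the words that follow it, matching the writing behaviour described in sub-section 1.2:

```python
import networkx as nx

def word_pagerank(sentences, damping=0.85):
    """PageRank scores of words on the reverse directed word graph (Eq. 3)."""
    graph = nx.DiGraph()
    for sent in sentences:
        for a, b in zip(sent, sent[1:]):   # adjacent word pair a -> b ...
            graph.add_edge(b, a)           # ... stored as the reverse link
    return nx.pagerank(graph, alpha=damping)   # dict: word -> score
```

Note that nx.pagerank divides each vertex's score among its out-links, which corresponds to the $S(V_j)/|OUT(V_j)|$ term above.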

2.3 Preparing Sentence Clusters and Ranking

To identify the topics covered in the documents we use the group average agglomerative clustering (GAAC) scheme; in our case a topic is a set of sentences related to the same concept. Among the three major agglomerative clustering algorithms, i.e., single-link, complete-link, and average-link clustering: single-link clustering can lead to elongated clusters; complete-link clustering is strongly affected by outliers; average-link clustering is a compromise between the two extremes that generally avoids both problems. This is the main reason for using group average agglomerative clustering to cluster the sentences. GAAC uses the average similarity across all pairs within the merged cluster to measure the similarity of two clusters. In this scheme the average similarity between two clusters $c_i$ and $c_j$ is computed as:

$$sim(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{x \in (c_i \cup c_j)} \; \sum_{y \in (c_i \cup c_j):\, y \neq x} sim(x, y) \quad (4)$$

where $sim(x, y)$ is the count of co-occurring words in sentences $x$ and $y$. To apply GAAC to the sentences we use a sentence vector representation of the documents of the entire collection, where each row represents a sentence and each column represents a term. In the entire evaluation, we use a threshold of 0.4.

Calculating the importance of sentence clusters or topics: To calculate the weighted importance of any sentence cluster or topic, we sum the weighted importance of all words in the given sentence cluster:

$$W_C = \sum W_{wd} \quad (5)$$

where $W_C$ is the weight of the given sentence cluster C and $W_{wd}$ is the weight of a word in the given sentence cluster (see sub-section 2.2, Eq. 3, for the calculation of word weights). Next, we calculate the percentage of weighted information of every identified sentence cluster:

$$\%W_C = \frac{W_C}{\sum W_C} \times 100 \quad (6)$$

where $\%W_C$ is the percentage weight of the given sentence cluster C, $\sum W_C$ is the sum of the weights of all identified sentence clusters, and $W_C$ is the weight of the given sentence cluster C.
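A minimal sketch of the clustering and cluster-weighting steps (Eqs. 4-5). The co-occurrence similarity and our reading of the 0.4 threshold as a merge-stopping criterion are assumptions:

```python
def cooccur_sim(x, y):
    """Pairwise sentence similarity: count of co-occurring (shared) words."""
    return len(set(x) & set(y))

def cluster_weight(cluster, sentences, word_score):
    """Eq. 5: cluster weight = sum of PageRank weights of its words."""
    return sum(word_score.get(w, 0.0) for i in cluster for w in sentences[i])

def gaac(sentences, threshold=0.4):
    """Group average agglomerative clustering over sentence indices.

    Repeatedly merges the pair of clusters whose merged cluster has the
    highest average pairwise similarity (Eq. 4); merging stops once that
    value drops below `threshold` (our reading of the paper's 0.4)."""
    clusters = [[i] for i in range(len(sentences))]

    def merged_avg_sim(a, b):
        members = a + b
        n = len(members)
        total = sum(cooccur_sim(sentences[x], sentences[y])
                    for x in members for y in members if x != y)
        return total / (n * (n - 1))

    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = merged_avg_sim(clusters[i], clusters[j])
                if s > best_sim:
                    best_sim, best_pair = s, (i, j)
        if best_sim < threshold:
            break
        i, j = best_pair
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters                        # lists of sentence indices
```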

2.4 Mapping Phrases by Using Wikipedia Anchor Text

We use Wikipedia anchor text to identify the informative terms in every identified sentence cluster. For this, we first fix the phrase boundaries: following the scheme defined in [2], we treat stopwords and punctuation marks as phrase boundaries. Next, we stem the entire anchor text collection and find the longest matching Wikipedia anchor text sequence in every word sequence within a phrase boundary. We repeat this process with every word sequence inside the predefined phrase boundaries. We also find the words matching the titles of the documents of the collection. We remove the rest of the words from every sentence. Thus every sentence in the collection contains a sequence of Wikipedia anchor texts or words from the titles of the collection. We use this mapping of phrases in the calculation of the weighted clustering coefficient; a sketch of the step follows below.
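A minimal sketch of the phrase-mapping step, assuming `anchors` is a pre-stemmed set of Wikipedia anchor texts stored as word tuples, `title_words` is the set of stemmed title words, and the stopword list is an illustrative stand-in:

```python
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}  # illustrative subset

def split_phrases(sentence):
    """Break a stemmed sentence into word runs bounded by stopwords
    (punctuation is assumed to be gone after input cleaning)."""
    phrase = []
    for w in sentence:
        if w in STOPWORDS:
            if phrase:
                yield phrase
            phrase = []
        else:
            phrase.append(w)
    if phrase:
        yield phrase

def map_anchors(sentence, anchors, title_words, max_len=5):
    """Keep only longest-matching anchor-text sequences and title words."""
    kept = []
    for phrase in split_phrases(sentence):
        i = 0
        while i < len(phrase):
            match = None
            for L in range(min(max_len, len(phrase) - i), 0, -1):
                if tuple(phrase[i:i + L]) in anchors:   # longest match first
                    match = phrase[i:i + L]
                    break
            if match:
                kept.extend(match)
                i += len(match)
            else:
                if phrase[i] in title_words:            # title-word fallback
                    kept.append(phrase[i])
                i += 1
    return kept
```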

2.5 Calculating the Weighted Clustering Coefficient

After step 2.4 we have, in the sentences of every identified sentence cluster, sequences of Wikipedia anchor text words or words from the titles of documents. Now we calculate the weighted clustering coefficient of all such words in every sentence cluster. For this we create an undirected word graph of sentences; the sparse nature of the word graph of sentences is the main reason for selecting an undirected graph for the calculation of the weighted clustering coefficient. The process for calculating the weighted clustering coefficient of every distinct word of a given sentence cluster is given below.

Preparing the word graph of sentences: We treat every distinct word as a node of the graph and prepare an undirected word graph of sentences by adding an undirected edge for every adjacent word pair.

Figure 2: Undirected word graph of sentences. Here S1, S2 and S3 denote the sentences, and A, B, C, D and N denote the words that are common to Wikipedia anchor texts or titles of documents.

Graph theoretical notation: We denote G = (V, E) as the undirected word graph of sentences, where V = {V_1, V_2, ..., V_n} denotes the vertex set and E the link set; (V_j, V_i) is in E if there is a link between V_j and V_i.

Calculating link weight: We use the PageRank scores of words (see sub-section 2.2 for the calculation of the weight of every word) in the calculation of link weights. The link weight of any edge (V_i, V_j) can be calculated as:

$$W_{V_i,V_j} = \frac{1}{2}\left(\frac{Score(V_i)}{Degree(V_i)} + \frac{Score(V_j)}{Degree(V_j)}\right) \times L_c(V_i, V_j) \quad (7)$$

where $W_{V_i,V_j}$ is the link weight of the link between nodes $V_i$ and $V_j$; $Score(V_i)$ and $Score(V_j)$ are the PageRank scores of nodes (words) $V_i$ and $V_j$; $Degree(V_i)$ and $Degree(V_j)$ are the degrees of nodes $V_i$ and $V_j$; and $L_c(V_i, V_j)$ is the count of the number of links between nodes $V_i$ and $V_j$. Using this scheme, we calculate the link weight of every edge of the graph.

Calculating the weighted clustering coefficient: We use the link weights calculated from the PageRank scores in the calculation of the weighted clustering coefficient. In this vein, we maintain the properties of the unweighted clustering coefficient on an undirected graph (as described in [4]). The value of the weighted clustering coefficient of any node lies in [0, 1]. In the unweighted case, the number of triangles at a node determines its clustering property. In the weighted case, clustering should be determined by some weighted characteristic of the triangles, where for each triangle all three edges are taken into account. For each triangle, the weighted characteristic should be invariant to permutation of the weights. When any edge weight of a triangle approaches zero, the weighted characteristic of that triangle should likewise approach zero. When vertex $V$ participates in the maximum number $\frac{1}{2} K_V (K_V - 1)$ of triangles, where each edge weight is maximal, the weighted clustering coefficient should also be maximal, i.e., $\tilde{C}_V = 1$. To achieve this, the weighted clustering coefficient of [4] replaces $e_V$ (see Eq. 1) by the sum of triangle intensities. The weighted clustering coefficient of any node $V_i$ can then be defined as:

$$\tilde{C}_{V_i} = \frac{2}{K_{V_i}(K_{V_i} - 1)} \sum_{j,k} \left(\tilde{W}_{V_i,V_j}\, \tilde{W}_{V_j,V_k}\, \tilde{W}_{V_k,V_i}\right)^{1/3} \quad (8)$$

where

$$\tilde{W}_{V_i,V_j} = \frac{W_{V_i,V_j}}{W_{\max}} \quad (9)$$

$W_{V_i,V_j}$ is the link weight of the link between nodes $V_i$ and $V_j$ (see Eq. 7), and $W_{\max}$ is the maximum of all edge weights in the given graph. The normalization used above and the use of the sum of triangle intensities fulfil the conditions given in [4].
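A minimal sketch of Eqs. 7-9 on an undirected networkx graph, with `scores` taken from the PageRank step of sub-section 2.2 and the link count $L_c$ assumed to be 1 in a simple graph:

```python
import itertools
import networkx as nx

def weighted_clustering(graph, scores):
    """Weighted local clustering coefficient of every node (Eq. 8)."""
    # Eq. 7: edge weight from degree-normalized PageRank scores (L_c = 1).
    w = {}
    for u, v in graph.edges():
        w[frozenset((u, v))] = 0.5 * (scores.get(u, 0.0) / graph.degree(u)
                                      + scores.get(v, 0.0) / graph.degree(v))
    w_max = max(w.values()) if w else 1.0

    def wn(u, v):                          # Eq. 9: normalized edge weight
        return w[frozenset((u, v))] / w_max

    cc = {}
    for v in graph.nodes():
        k = graph.degree(v)
        if k < 2:
            cc[v] = 0.0                    # no triangles possible
            continue
        intensity = 0.0
        for a, b in itertools.combinations(list(graph.neighbors(v)), 2):
            if graph.has_edge(a, b):       # triangle (v, a, b)
                intensity += (wn(v, a) * wn(a, b) * wn(b, v)) ** (1.0 / 3.0)
        cc[v] = 2.0 * intensity / (k * (k - 1))
    return cc
```

Each triangle contributes the geometric mean of its three normalized edge weights (its "intensity"), so the coefficient reaches 1 only when a node sits in the maximum number of triangles with maximal weights, as required above.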

2.6 Ranking Sentences inside Every Sentence Cluster

To rank the sentences in every sentence cluster, we use the weighted clustering coefficients of the words in the sentences. We add the weighted clustering coefficient scores of the words to calculate the weight of a sentence, and finally rank the sentences in descending order of their weight:

$$Wt_{S_r} = \sum W_{WCC} \quad (10)$$

where $Wt_{S_r}$ is the weight of sentence $S_r$ in the given sentence cluster, and $\sum W_{WCC}$ is the sum of the weights of all words (nodes/vertices) that exist in sentence $S_r$, obtained from the weighted clustering coefficient (see sub-section 2.5, Eq. 8). Next, we rank the sentences of the given sentence cluster in descending order of their weight.

2.7 Generating the Extract Summary

To generate the extract summary, we select the single top-ranked sentence from every identified sentence cluster and arrange the selected sentences according to the rank of their parent sentence clusters (see sub-section 2.3 for the ranking of identified sentence clusters). If the number of sentence clusters is small, then we use the percentage weight of every sentence cluster to fix the number of top sentences to be extracted from each sentence cluster. To calculate the percentage weight/importance of any given sentence cluster C we use:

$$\%W_C = \frac{W_C}{\sum W_C} \times 100 \quad (11)$$

where $\%W_C$ is the percentage weight of the given sentence cluster C, $\sum W_C$ is the sum of the weighted importance of all identified sentence clusters, and $W_C$ is the weight of the given sentence cluster C (see sub-section 2.3 for the calculation of cluster weights). The count of sentences to be extracted from sentence cluster C is then the nearest higher integer value of $\%W_C \times$ (total number of required sentences) / 100.

NOTE: If the length of a sentence is more than 40 words, we discard it and pick the next highest-ranked sentence from the same sentence cluster.
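A minimal sketch of the selection scheme of sub-sections 2.6-2.7, assuming the helpers sketched earlier; `word_cc_per_cluster[c]` maps each word of cluster `c` to its weighted clustering coefficient, and the final word/byte truncation is left to the caller:

```python
import math

def summarize(clusters, sentences, raw_sentences,
              word_cc_per_cluster, cluster_weights, n_required):
    """Select top-ranked sentences per cluster, clusters in rank order."""
    total_w = sum(cluster_weights)
    order = sorted(range(len(clusters)),
                   key=lambda c: cluster_weights[c], reverse=True)
    summary = []
    for c in order:
        cc = word_cc_per_cluster[c]
        # Eq. 10: sentence weight = sum of its words' weighted CC scores.
        ranked = sorted(clusters[c], reverse=True,
                        key=lambda s: sum(cc.get(w, 0.0) for w in sentences[s]))
        # Eq. 11: per-cluster quota, the nearest higher integer of the
        # cluster's share of the required sentence budget.
        quota = max(1, math.ceil(cluster_weights[c] / total_w * n_required))
        for s in ranked:
            if len(sentences[s]) > 40:     # discard over-long sentences
                continue
            summary.append(raw_sentences[s])
            quota -= 1
            if quota == 0 or len(summary) == n_required:
                break
        if len(summary) == n_required:
            break
    return summary
```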

3 Pseudo Code

INPUT: ASCII text documents.
OUTPUT: Required number of extracted sentences as summary. We truncate the final output to meet the required number of words.
ALGORITHM:

Step 1. Apply input cleaning (see sub-section 2.1).
Step 2. Calculate the importance/weight of every distinct word of the entire text collection (see sub-section 2.2).
Step 3. Identify all sentence clusters from the given collection and rank every identified sentence cluster in descending order of its importance/score (see sub-section 2.3).
Step 4. Use Wikipedia anchor text and words from the titles of the document collection to identify the informative words in every identified sentence cluster (see sub-section 2.4).
Step 5. Calculate the weighted clustering coefficients of the informative words of every identified sentence cluster (see sub-section 2.5).
Step 6. Use the weighted clustering coefficients of the informative words to rank the sentences in every identified sentence cluster in descending order of their weight (see sub-section 2.6).
Step 7. Apply the sentence extraction scheme to produce the required number of sentences (see sub-section 2.7).

4 Evaluation

We performed two different experiments. In the first experiment we compare our devised system with state-of-the-art supervised and unsupervised systems. In the second experiment, we test the effect of the weighted clustering coefficient. The details of the datasets, evaluation metrics and results are given below.

Details of the datasets: We use the DUC-2002 and DUC-2004 datasets to evaluate our devised system. The DUC datasets are open benchmark datasets from the Document Understanding Conference (DUC) for generic automatic summarization. Table 1 gives a brief description of the datasets.

Table 1: Details of the DUC-2002 and DUC-2004 datasets

                                   DUC-2002    DUC-2004
Number of document collections     59          50
Documents in each collection       10          10
Data source                        TREC        TDT
Summary length                     200 words   665 bytes

Evaluation metric: We use the ROUGE toolkit (version 1.5.5) to measure summarization performance. To properly evaluate the summaries we use the ROUGE-1, ROUGE-2, ROUGE-SU and ROUGE-L measures. The remaining details and the package are available at [13].

4.1 Experiment-1

In this experiment we empirically compare our devised system's results with the published results of [6]. The systems used in the experimental evaluation of [6] are described below.

Systems used in evaluation. We use the published results of the following widely used document summarization methods as the baseline systems to compare with our devised system. (1) Random: The method selects sentences randomly for

each document collection. (2) Centroid: The method applies the MEAD algorithm [16] to extract sentences according to the following three parameters: centroid value, positional value, and first-sentence overlap. (3) LexPageRank: The method first constructs a sentence connectivity graph based on cosine similarity and then selects important sentences based on the concept of eigenvector centrality [10]. (4) LSA: The method performs latent semantic analysis on the terms-by-sentences matrix to select sentences having the greatest combined weights across all important topics [11]. (5) NMF: The method performs non-negative matrix factorization (NMF) on the terms-by-sentences matrix and then ranks the sentences by their weighted scores [12]. (6) KM: The method performs the K-means algorithm on the terms-by-sentences matrix to cluster the sentences and then chooses the centroids for each sentence cluster. (7) FGB: The FGB method proposed in [19]. (8) BSTM: The published results of the BSTM method [6].

Results: The results are given in Table 2 and Table 3. Table 2 contains the evaluation results on the DUC-2002 dataset; Table 3 contains the evaluation results on the DUC-2004 dataset. The highest score for every ROUGE evaluation metric is shown in bold. From the experimental results (as given in Tables 2 and 3), it is clear that our devised system performs better than all unsupervised systems and better than or comparable with supervised systems like BSTM [6].

Table 2: Evaluation results on the DUC-2002 dataset

Systems        ROUGE-1    ROUGE-2    ROUGE-L    ROUGE-SU
DUC Best       0.49869    0.25229    0.46803    0.28406
Random         0.38475    0.11692    0.37218    0.18057
Centroid       0.45379    0.19181    0.43237    0.23629
LexPageRank    0.47963    0.22949    0.44332    0.26198
LSA            0.43078    0.15022    0.40507    0.20226
NMF            0.44587    0.16280    0.41513    0.21687
KM             0.43156    0.15135    0.40376    0.20144
FGB            0.48507    0.24103    0.45080    0.26860
BSTM           0.48812    0.24571    0.45516    0.27018
Our System     0.51746    0.24245    0.47252    0.28642

Table 3: Evaluation results on the DUC-2004 dataset

Systems        ROUGE-1    ROUGE-2    ROUGE-L    ROUGE-SU
DUC Best       0.38224    0.09216    0.38687    0.13233
Random         0.31865    0.06377    0.34521    0.11779
Centroid       0.36728    0.07379    0.36182    0.12511
LexPageRank    0.37842    0.08572    0.37531    0.13097
LSA            0.34145    0.06538    0.34973    0.11946
NMF            0.36747    0.07261    0.36749    0.12918
KM             0.34872    0.06937    0.35882    0.12115
FGB            0.38724    0.08115    0.38423    0.12957
BSTM           0.39065    0.09010    0.38799    0.13218
Our System     0.41413    0.093017   0.39032    0.13846

4.2 Experiment-2

We use this experiment to justify the use of the weighted clustering coefficient for ranking the sentences in every identified sentence cluster. For this we make a simple change: we use the unweighted clustering coefficient given in Eq. 1 in place of Eq. 8 (see sub-section 2.5) and run the entire system. The comparative results (i.e., with the weighted clustering coefficient and with the unweighted clustering coefficient) on the DUC-2002 and DUC-2004 datasets are given in Figure 3 and Figure 4 respectively. The results in Figures 3 and 4 clearly indicate the benefits of using the weighted clustering coefficient.

Figure 3: Experiments using the DUC-2002 dataset.
Figure 4: Experiments using the DUC-2004 dataset.

5 Conclusion and Future Work

In this paper we introduced the use of Wikipedia anchor text and the weighted clustering coefficient for multi-document summarization. Additionally, we limit the use of linguistic resources to stopwords, stemmers and punctuation marks. The experimental results show that our devised system performs better than unsupervised systems and better than or comparable with supervised systems in this area. As future work we plan to use the relations between Wikipedia anchor texts to improve summary quality. We believe that such relations can

improve the weighted clustering coefficient scores of informative terms and hence may improve the summary quality.

References

1. D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. In Advances in Neural Information Processing Systems 14.
2. N. Kumar and K. Srinathan. Automatic keyphrase extraction from scientific documents using N-gram filtration technique. In Proceedings of the Eighth ACM Symposium on Document Engineering (DocEng '08), Sao Paulo, Brazil, pages 199-208.
3. L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.
4. J. Saramaki, J.-P. Onnela, J. Kertesz, and K. Kaski. Characterizing motifs in weighted complex networks.
5. D. M. McDonald and H. Chen. Summary in context: Searching versus browsing. ACM Transactions on Information Systems, vol. 24, no. 1, January 2006, pages 111-141.
6. D. Wang, S. Zhu, T. Li, and Y. Gong. Multi-document summarization using sentence-based topic models. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 297-300, Suntec, Singapore, 4 August 2009. ACL and AFNLP.
7. C. Ding and X. He. K-means clustering and principal component analysis. In Proceedings of ICML 2004.
8. C. Ding, X. He, and H. Simon. 2005. On the equivalence of nonnegative matrix factorization and spectral clustering. In Proceedings of SIAM Data Mining.
9. C. Ding, T. Li, W. Peng, and H. Park. 2006. Orthogonal nonnegative matrix tri-factorizations for clustering. In Proceedings of SIGKDD 2006.
10. G. Erkan and D. Radev. 2004. LexPageRank: Prestige in multi-document text summarization. In Proceedings of EMNLP 2004.
11. Y. Gong and X. Liu. 2001. Generic text summarization using relevance measure and latent semantic analysis. In Proceedings of SIGIR.
12. D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems 13.
13. C.-Y. Lin and E. Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of HLT-NAACL 2003.
14. C.-Y. Lin and E. Hovy. 2002. From single to multi-document summarization: A prototype system and its evaluation. In Proceedings of ACL 2002.
15. I. Mani. 2001. Automatic Summarization. John Benjamins Publishing Company.
16. D. Radev, H. Jing, M. Stys, and D. Tam. 2004. Centroid-based summarization of multiple documents. Information Processing and Management, pages 919-938.
17. R. Baeza-Yates and B. Ribeiro-Neto. 1999. Modern Information Retrieval. ACM Press.
18. D. Shen, J.-T. Sun, H. Li, Q. Yang, and Z. Chen. 2007. Document summarization using conditional random fields. In Proceedings of IJCAI 2007.
19. D. Wang, S. Zhu, T. Li, Y. Chi, and Y. Gong. 2008. Integrating clustering and multi-document summarization to improve document understanding. In Proceedings of CIKM 2008.
20. W.-T. Yih, J. Goodman, L. Vanderwende, and H. Suzuki. 2007. Multi-document summarization by maximizing informative content-words. In Proceedings of IJCAI 2007.