Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance


Chong Long, Minlie Huang, Xiaoyan Zhu
State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, China

Abstract

This paper presents our extractive summarization system for the update summarization track of TAC 2009. The system is based on our newly developed document summarization framework under the theory of conditional information distance among many objects. The best summary is defined in this paper as the one that has the minimum information distance to the entire document set. The best update summary has the minimum conditional information distance to a document cluster, given that a prior document cluster has already been read. Experiments on the TAC dataset show that our method performs well in many categories.

1 Introduction

We participated in the update summarization track of TAC 2009. The update summarization task is to write a short summary (not more than 100 words) of a set of newswire articles, under the assumption that the user has already read a given set of earlier articles. The summaries are evaluated for readability and content (based on Columbia University's Pyramid Method) [1]. We first proposed an information distance based approach in TAC 2008. This year we have developed a framework in which multi-document summarization can be modeled by information distance theory. The best summary is defined as having the minimal information distance to the entire document set (or the minimal conditional information distance, if a prior document set is given).

The paper is organized as follows. Section 2 introduces our method in TAC 2008.

Our newly developed theory is described in Section 3.1. Section 3 presents the summarization method under the new theory, and the experiments in Section 4 show the advantages of our work. Conclusions and future work are outlined in Section 5.

2 Overview of Our Method in TAC 2008

In TAC 2008, we first proposed using information distance to solve the summarization problem [2]. Fix a universal Turing machine $U$. The Kolmogorov complexity [3] of a binary string $x$ conditioned on another binary string $y$, $K_U(x \mid y)$, is the length of the shortest (prefix-free) program for $U$ that outputs $x$ with input $y$. It can be shown that for a different universal Turing machine $U'$, for all $x, y$, $K_U(x \mid y) = K_{U'}(x \mid y) + C$, where the constant $C$ depends only on $U'$. Thus $K_U(x \mid y)$ can be simply written as $K(x \mid y)$. We write $K(x \mid \epsilon)$, where $\epsilon$ is the empty string, as $K(x)$.

In [4], the energy to convert between $x$ and $y$ is defined as the smallest number of bits needed to convert $x$ to $y$ and vice versa. That is, with respect to a universal Turing machine $U$, the cost of conversion between $x$ and $y$ is:

$$E(x, y) = \min\{|p| : U(x, p) = y,\ U(y, p) = x\} \tag{1}$$

The following theorem has been proved in [4]:

Theorem 1. $E(x, y) = \max\{K(x \mid y), K(y \mid x)\}$.

Thus, the max distance was defined in [4]:

$$D_{\max}(x, y) = \max\{K(x \mid y), K(y \mid x)\} \tag{2}$$

The TAC update summarization task is to write a short summary $S$ of $n$ newswire articles $B_1, B_2, \ldots, B_n$, under the assumption that the user has already read a given set of $m$ earlier articles $A_1, A_2, \ldots, A_m$. In TAC 2008, we used the following criterion to select the best summary $S$:

$$\min D_{\max}(S,\ B_1 B_2 \ldots B_n \mid A_1 A_2 \ldots A_m), \quad |S| \le \theta \tag{3}$$

$S$ is selected from sentences of articles $A_1, A_2, \ldots, A_m$. However, this is a more or less intuitive method. This year we have set up a relatively complete information distance summarization framework. Our new summarization model in TAC 2009 is based on our newly developed theory instead of the empirical formula (Equation 3) used in TAC 2008. Next we introduce this new framework.
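
Because Kolmogorov complexity is not computable, Equation 2 can only be approximated in practice. The following is a minimal sketch, assuming a general-purpose compressor (zlib) as a stand-in for $K$. This is a common illustrative substitute and not the approximation used by the system described in this paper, which relies on frequency counts and the Shannon-Fano code. The function names `c`, `cond`, and `d_max` are hypothetical.

```python
import zlib


def c(data: bytes) -> int:
    """Compressed length in bytes: a crude, computable stand-in for K(x)."""
    return len(zlib.compress(data, 9))


def cond(x: bytes, y: bytes) -> int:
    """Approximate K(x|y) as C(yx) - C(y), clamped at zero.

    Assumption: the extra bytes needed to compress x once y is already
    known serve as a proxy for the conditional complexity.
    """
    return max(c(y + x) - c(y), 0)


def d_max(x: bytes, y: bytes) -> int:
    """Approximate D_max(x, y) = max{K(x|y), K(y|x)} (Equation 2)."""
    return max(cond(x, y), cond(y, x))


if __name__ == "__main__":
    a = b"The hurricane made landfall near the coast on Monday."
    b = b"Stock markets in Asia fell sharply on Tuesday."
    print(d_max(a, a), d_max(a, b))
```

With this stand-in, a text is close to a copy of itself and farther from an unrelated text, which is the qualitative behavior Equation 2 is meant to capture. Short inputs compress poorly, so the numbers are only indicative.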
3 New Summarization Framework

Our new framework is based on our newly developed theory of conditional information distance among many objects. In this section we first introduce the theory and then our summarization model based on it.

3.1 New Theory

In [5], the authors generalize the theory of information distance to more than two objects. Similar to Equation 1, given strings $x_1, \ldots, x_n$, they define the minimal amount of thermodynamic energy needed to convert any $x_i$ to any $x_j$ as:

$$E_m(x_1, \ldots, x_n) = \min\{|p| : U(x_i, p, j) = x_j \text{ for all } i, j\} \tag{4}$$

Then it is proved in [5] that:

Theorem 2. Modulo an $O(\log n)$ additive factor,

$$\min_i K(x_1 \ldots x_n \mid x_i) \le E_m(x_1, \ldots, x_n) \le \min_i \sum_k D_{\max}(x_i, x_k) \tag{5}$$

In update summarization, the summary should contain new information that the earlier documents have not mentioned, so we extended Equation 5 in [6] to:

Theorem 3. Modulo an $O(\log n)$ additive factor,

$$\min_i K(x_1 \ldots x_n \mid x_i, c) \le E_m(x_1, \ldots, x_n \mid c) \le \min_i \sum_k D_{\max}(x_i, x_k \mid c) \tag{6}$$

where $c$ is the conditional sequence that is given for free when computing from sequence $x$ to $y$ and from $y$ to $x$.

Given $n$ objects and a conditional sequence $c$, the left-hand side of Equation 6 may be interpreted as the most comprehensive object, the one that contains the most information about all of the others. The right-hand side may be interpreted as the most typical object, the one that is similar to all of the others.

3.2 Modeling

We have developed the theory of conditional information distance among many objects. In this subsection, a new summarization model is built on this theory.

3.2.1 Modeling Traditional Summarization

The task of traditional multi-document summarization can be described as follows: given $n$ documents $B = \{B_1, B_2, \ldots, B_n\}$, the system is required to generate a summary $S$ of $B$. According to our theory, the conditional information distance among $B_1, B_2, \ldots, B_n$ is $E_m(B)$. However, $E_m$ is very difficult to compute. Moreover, $E_m$ itself does not tell us how to generate a summary. Equation 5 provides a feasible way to approximate $E_m$: the most comprehensive object and the most typical one are the left and right sides of Equation 6, respectively. The most comprehensive object is long enough to cover as much information in $B$ as possible, while the most typical object is a concise one that expresses the most common idea shared by those objects. Since we aim to produce a short summary that represents the general information, the right-hand side of Equation 5 should be used. The most typical document is the $B_j$ that attains

$$\min_j \sum_{i \ne j} D_{\max}(B_i, B_j)$$

However, $B_j$ is far from a good summary. A good method should be able to select information from $B_1$ to $B_n$ to form the best $S$. We view this $S$ as a document in the set. Since $S$ is a short summary, it does not contain extra information outside $B$. The best traditional summary $S_{trad}$ should satisfy:

$$S_{trad} = \arg\min_S D_{\max}(B, S) \tag{7}$$

In most applications, the length of $S$ is constrained by $|S| \le \theta$ ($\theta$ is a constant integer) or $|S| \le \alpha |B|$ ($\alpha$ is a constant real number between 0 and 1).

3.2.2 Modeling Update Summarization

Given a set of $m$ earlier articles $A = \{A_1, A_2, \ldots, A_m\}$, the update summarization task is to summarize the new content presented by a document set $B = \{B_1, B_2, \ldots, B_n\}$. The earlier article set $A$ can be viewed as a precondition, so the task is well modeled by the conditional version of information distance. The best summary $S_{best}$ should satisfy:

$$S_{best} = \arg\min_S D_{\max}(B, S \mid A) \tag{8}$$

If $m = 0$ ($A = \emptyset$), this is a traditional multi-document summarization problem. If $m > 0$ ($A \ne \emptyset$), it is a multi-document update summarization problem. Therefore, traditional summarization can be viewed as a special case of Equation 8.

According to [7], from Equation 8 we get

$$D_{\max}(B, S \mid A) = D_{\max}(B_A, S)$$

where $B$ is mapped to $B_A$ under the condition of $A$. For a document $B_i$ and a document set $A$, $B_{i,A}$ is the set of those sentences of $B_i$ (the $B_{i,k}$) that differ from all the sentences in $A_1$ to $A_m$:

$$B_{i,A} = \{B_{i,k} \mid \forall\, sen \in A,\ D_{\max}(B_{i,k}, sen) > \varphi\} \tag{9}$$

where $A$ here denotes the set of sentences of the documents in $A$ and $\varphi$ is a threshold.

We have now developed a framework for summarization. However, the problem is that neither $K(\cdot)$ nor $D_{\max}(\cdot, \cdot)$ is computable.
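
The filtering step of Equation 9 and the length-constrained selection of Equation 8 can be sketched as follows. This is an illustration under stated assumptions, not the system's implementation: semantic elements are taken to be lowercase words, $K(s \mid t)$ is approximated by the summed code length $-\log_2 p$ of the elements of $s$ missing from $t$ (with add-one smoothing), and selection is a simple greedy loop. All function names and the smoothing scheme are hypothetical.

```python
import math
from collections import Counter


def elements(sentence):
    """Hypothetical semantic-element extractor: the set of lowercase words."""
    return frozenset(sentence.lower().split())


def build_counts(sentences):
    """Word frequency table and total word count for the collection."""
    freq = Counter(w for s in sentences for w in s.lower().split())
    return freq, sum(freq.values())


def code_length(elems, freq, total):
    """Sum of -log2 p over the elements, using add-one smoothed frequencies."""
    denom = total + len(freq) + 1
    return sum(-math.log2((freq[e] + 1) / denom) for e in elems)


def d_max(s, t, freq, total):
    """Stand-in for D_max: cost of the elements one sentence has and the other lacks."""
    es, et = elements(s), elements(t)
    return max(code_length(es - et, freq, total),
               code_length(et - es, freq, total))


def filter_new(b_sentences, a_sentences, phi, freq, total):
    """Equation 9: keep sentences of B farther than phi from every sentence of A."""
    return [b for b in b_sentences
            if all(d_max(b, a, freq, total) > phi for a in a_sentences)]


def greedy_summary(sentences, theta, freq, total):
    """Greedy sketch of Equation 8 with |S| <= theta words.

    Repeatedly picks the sentence covering the most costly uncovered
    elements and keeps it only if it still fits the word budget.
    """
    uncovered = set().union(*(elements(s) for s in sentences))
    candidates, summary, length = list(sentences), [], 0
    while candidates:
        best = max(candidates,
                   key=lambda s: code_length(elements(s) & uncovered, freq, total))
        gain = code_length(elements(best) & uncovered, freq, total)
        candidates.remove(best)
        if gain <= 0:
            break  # nothing new left to cover
        if length + len(best.split()) > theta:
            continue  # does not fit; try the remaining candidates
        summary.append(best)
        length += len(best.split())
        uncovered -= elements(best)
    return summary
```

A typical use would build the counts over all sentences of $A$ and $B$, call `filter_new` to drop sentences already covered by $A$, and pass the survivors to `greedy_summary` with the word limit. The greedy loop is one simple way to search for a low-distance summary and carries no optimality guarantee.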
We can use frequency counts and the Shannon-Fano code [8] to encode a phrase that occurs with probability $p$ in approximately $-\log p$ bits, which yields a short description. This approximation method can handle a sentence at word and phrase granularities. Therefore, we first divide a sentence into semantic elements; the information distance between two sentences is then estimated through their semantic element sets [6].

Semantic element extraction was implemented simply in TAC 2008 [2], using named entity recognition and counting the overlap of words and entities. However, an entity may have different names. For example, "George Bush" and "George W. Bush" were viewed as different entities, and "May 15th, 2008", "May 15, 2008" and "5/15/2008" were recognized as different dates in our TAC 2008 system. This year we added coreference resolution to our system. First, named entities are normalized using Wikipedia [9]; then different writing styles of dates, such as "May 15th, 2008", "May 15, 2008" and "5/15/2008", are normalized into the same date through regular expressions. The experimental results reported in [6] demonstrate the effectiveness of our coreference resolution method.

4 Experimental Results

In this section, we first compare our two summarization methods (developed for TAC 2008 and TAC 2009) and then provide the evaluation results on TAC 2009.

[Figure 1. Comparisons. Two panels, DUC 2007 (clusters A, B, C, All) and TAC 2008 (clusters A, B, All), plotting ROUGE-1 recall per cluster for the old method and the new method.]

Table 1. Evaluation Results

                                         Cluster: Traditional      Cluster: Update
Evaluation Method                        Best    Ours    Rank      Best    Ours    Rank
AVG Modified Score                       0.383   0.311   9         0.307   0.296   4
MacroAVG Modified Score with 3 Models    0.377   0.316   9         0.303   0.292   4
AVG Linguistic Quality                   5.932   5.682   3         5.886   5.886   1
AVG Overall Responsiveness               5.159   4.955   2         5.023   5.023   1

4.1 Comparison with the TAC 2008 Method

First, our newly developed method (called the "new method") is compared with the original one from TAC 2008 [2] (called the "old method"). We compare the two methods on the DUC 2007 and TAC 2008 update datasets under the ROUGE-1 recall criterion. Figure 1 shows that our system achieves higher ROUGE-1 recall after adopting the method based on the newly developed theoretical framework.

4.2 Results on TAC 2009

Finally, our new method is tested on the TAC 2009 dataset. The experimental results under the pyramid evaluation methods are shown in Table 1. The results of traditional summarization (Cluster A) and update summarization (Cluster B) are listed separately. "Best" is the best result among all 52 submissions, "Ours" is our system's result, and "Rank" is the rank of our result. The table shows that our system ranks higher on the update datasets than on the traditional datasets. Our system obtained the best result for average linguistic quality and average overall responsiveness on the update datasets.

5 Conclusion and Future Work

In this paper, we have built a document summarization framework based on the theory of information distance. Experiments show that our approach performs well on the TAC 2009 dataset. In future work, we will further study our framework and develop a better approximation method for information distance.

Acknowledgment

The work was supported by NSFC under grant No. 60803075 and the National Basic Research Program ("973" project in China) under grant No. 2007CB311003. The work was also supported by IRCI from the International Development Research Center, Canada.

References

[1] A. Nenkova, R. Passonneau, and K. McKeown, "The pyramid method: Incorporating human content selection variation in summarization evaluation," ACM Transactions on Speech and Language Processing, vol. 4, no. 2, 2007.
[2] S. Chen, Y. Yu, C. Long, F. Jin, L. Qin, M. Huang, and X. Zhu, "Tsinghua University at the summarization track of TAC 2008," in TAC, 2008.
[3] M. Li and P. M. Vitányi, An Introduction to Kolmogorov Complexity and its Applications. Springer-Verlag, 1997.
[4] C. H. Bennett, P. Gács, M. Li, P. M. Vitányi, and W. H. Zurek, "Information distance," IEEE Transactions on Information Theory, vol. 44, no. 4, pp. 1407-1423, July 1998.
[5] C. Long, X. Zhu, M. Li, and B. Ma, "Information shared by many objects," in CIKM, 2008, pp. 1213-1220.
[6] C. Long, M. Huang, X. Zhu, and M. Li, "Multi-document summarization by information distance," accepted by ICDM, 2009.
[7] X. Zhang, Y. Hao, X. Zhu, and M. Li, "Information distance from a question to an answer," in SIGKDD, August 2007.
[8] R. L. Cilibrasi and P. M. Vitányi, "The Google similarity distance," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 3, pp. 370-383, March 2007.
[9] F. Li, Z. Zheng, Y. Tang, F. Bu, R. Ge, X. Zhu, X. Zhang, and M. Huang, "THU QUANTA at TAC 2008 QA and RTE track," in TAC.