Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

Chong Long, Minlie Huang, Xiaoyan Zhu
State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, China

Abstract

This paper presents our extractive summarization system for the update summarization track of TAC 2009. The system is based on our newly developed document summarization framework, which builds on the theory of conditional information distance among many objects. The best summary is defined in this paper as the one that has the minimum information distance to the entire document set. The best update summary has the minimum conditional information distance to a document cluster, given that a prior document cluster has already been read. Experiments on the TAC dataset show that our method performs well in several categories.

1 Introduction

We participated in the update summarization track of TAC 2009. The update summarization task is to write a short summary (no more than 100 words) of a set of newswire articles, under the assumption that the user has already read a given set of earlier articles. The summaries are evaluated for readability and content (based on Columbia University's Pyramid Method) [1].

We first proposed an information distance based approach in TAC 2008. This year we have developed a framework in which multi-document summarization is modeled by information distance theory. The best summary is defined as the one with the minimal information distance to the entire document set, or the minimal conditional information distance if a prior document set is given.

The paper is organized as follows. Section 2 introduces our method in TAC 2008.

Our newly developed theory is described in Section 3.1. Section 3.2 presents the summarization model under the new theory, and the experiments in Section 4 demonstrate the advantages of our work. Conclusions and future work are outlined in Section 5.

2 Overview of Our Method in TAC 2008

In TAC 2008, we first proposed using information distance to solve the summarization problem [2].

Fix a universal Turing machine $U$. The Kolmogorov complexity [3] of a binary string $x$ conditioned on another binary string $y$, $K_U(x|y)$, is the length of the shortest (prefix-free) program for $U$ that outputs $x$ with input $y$. It can be shown that for a different universal Turing machine $U'$, for all $x, y$,

$$K_U(x|y) = K_{U'}(x|y) + C,$$

where the constant $C$ depends only on $U'$ (that is, the two complexities differ by at most an additive constant). Thus $K_U(x|y)$ can be written simply as $K(x|y)$. We write $K(x|\epsilon)$, where $\epsilon$ is the empty string, as $K(x)$.

In [4], the energy needed to convert between $x$ and $y$ is defined as the smallest number of bits needed to convert $x$ to $y$ and vice versa. That is, with respect to a universal Turing machine $U$, the cost of conversion between $x$ and $y$ is:

$$E(x, y) = \min\{\, |p| : U(x, p) = y,\ U(y, p) = x \,\} \tag{1}$$

The following theorem has been proved in [4]:

Theorem 1. Up to an additive logarithmic term, $E(x, y) = \max\{K(x|y), K(y|x)\}$.

Thus, the max distance was defined in [4] as:

$$D_{\max}(x, y) = \max\{K(x|y), K(y|x)\} \tag{2}$$

The TAC update summarization task is to write a short summary $S$ of $n$ newswire articles $B_1, B_2, \ldots, B_n$, under the assumption that the user has already read a given set of $m$ earlier articles $A_1, A_2, \ldots, A_m$. In TAC 2008, we used the following criterion to select the best summary $S$:

$$\min D_{\max}(S,\ B_1 B_2 \ldots B_n\ A_1 A_2 \ldots A_m), \quad |S| \le \theta \tag{3}$$

$S$ is selected from sentences of articles $A_1, A_2, \ldots, A_m$. However, this is a more or less intuitive method. This year we have set up a relatively complete information distance summarization framework. Our new summarization model in TAC 2009 is based on our newly developed theory instead of the empirical formula (Equation 3) used in TAC 2008. Next we introduce this new framework.
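Neither $K$ nor $D_{\max}$ can be computed exactly, so any concrete illustration of Equation 2 needs a stand-in. The sketch below uses the length of zlib-compressed data in place of Kolmogorov complexity. This is a common heuristic and is not the approximation used in this paper. The helper names `c`, `k_cond`, and `d_max` are hypothetical.

```python
import zlib


def c(data: bytes) -> int:
    """Compressed length in bytes: a computable stand-in for K(.)."""
    return len(zlib.compress(data, 9))


def k_cond(x: bytes, y: bytes) -> int:
    """Stand-in for K(x|y): extra compressed bytes needed for x once y is known."""
    return max(c(y + x) - c(y), 0)


def d_max(x: bytes, y: bytes) -> int:
    """Stand-in for D_max(x, y) = max{K(x|y), K(y|x)} (Equation 2)."""
    return max(k_cond(x, y), k_cond(y, x))


a = b"The president visited Beijing on Monday to discuss trade policy with officials."
b = b"Heavy rain flooded several villages in the south, forcing thousands to evacuate."

print(d_max(a, a), d_max(a, b))
```

Under this proxy, a text is close to itself and far from an unrelated text, which is the qualitative behaviour Equation 2 is meant to capture. The absolute numbers depend on the compressor and carry no meaning on their own.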
3 New Summarization Framework

Our new framework is based on our newly developed theory of conditional information distance among many objects. In this section we first introduce the theory and then the summarization model built on it.

3.1 New Theory

In [5], the authors generalize the theory of information distance to more than two objects. Similar to Equation 1, given strings $x_1, \ldots, x_n$, they define the minimal amount of thermodynamic energy needed to convert any $x_i$ to any $x_j$ as:

$$E_m(x_1, \ldots, x_n) = \min\{\, |p| : U(x_i, p, j) = x_j \text{ for all } i, j \,\} \tag{4}$$

It is then proved in [5] that:

Theorem 2. Modulo an $O(\log n)$ additive factor,

$$\min_i K(x_1 \ldots x_n | x_i) \le E_m(x_1, \ldots, x_n) \le \min_i \sum_{k \ne i} D_{\max}(x_i, x_k) \tag{5}$$

In update summarization, the summary should contain new information that the former documents have not mentioned, so in [6] we extended Equation 5 to:

Theorem 3. Modulo an $O(\log n)$ additive factor,

$$\min_i K(x_1 \ldots x_n | x_i, c) \le E_m(x_1, \ldots, x_n | c) \le \min_i \sum_{k \ne i} D_{\max}(x_i, x_k | c) \tag{6}$$

where $c$ is the conditional sequence that is given for free when computing from sequence $x$ to $y$ and from $y$ to $x$.

Given $n$ objects and a conditional sequence $c$, the left-hand side of Equation 6 may be interpreted as the most comprehensive object, the one that contains the most information about all of the others. The right-hand side may be interpreted as the most typical object, the one that is similar to all of the others.

3.2 Modeling

We have developed the theory of conditional information distance among many objects. In this subsection, a new summarization model is built on this theory.

3.2.1 Modeling Traditional Summarization

The task of traditional multi-document summarization can be described as follows: given $n$ documents $B = \{B_1, B_2, \ldots, B_n\}$, the system must generate a summary $S$ of $B$. According to our theory, the information distance among $B_1, B_2, \ldots, B_n$ is $E_m(B)$. However, $E_m$ is very difficult to compute. Moreover, $E_m$ itself does not tell us how to generate a summary. Equation 5 provides a feasible way to approximate $E_m$: the most comprehensive object and the most typical one correspond to its left-hand and right-hand sides, respectively. The most comprehensive object is long enough to cover as much information in $B$ as possible, while the most typical object is a concise one that expresses the most common idea shared by those objects. Since we aim to produce a short summary that represents the general information, the right-hand side of Equation 5 should be used. The most typical document is the $B_j$ that attains

$$\min_j \sum_{i \ne j} D_{\max}(B_i, B_j)$$

However, $B_j$ is far from enough to be a good summary. A good method should be able to select information from $B_1$ to $B_n$ to form the best $S$. We view this $S$ as a document in the set. Since $S$ is a short summary, it does not contain extra information outside $B$. The best traditional summary $S_{trad}$ should satisfy:

$$S_{trad} = \arg\min_S D_{\max}(B, S) \tag{7}$$

In most applications, the length of $S$ is constrained by $|S| \le \theta$ ($\theta$ is a constant integer) or $|S| \le \alpha |B|$ ($\alpha$ is a constant real number between 0 and 1).

3.2.2 Modeling Update Summarization

Given a set of $m$ earlier articles $A = \{A_1, A_2, \ldots, A_m\}$, the update summarization task is to summarize the new content presented by a document set $B = \{B_1, B_2, \ldots, B_n\}$. The earlier article set $A$ can be viewed as a precondition, so this task is well modeled by the conditional version of information distance. The best summary $S_{best}$ should satisfy:

$$S_{best} = \arg\min_S D_{\max}(B, S | A) \tag{8}$$

If $m = 0$ ($A = \emptyset$), this is a traditional multi-document summarization problem. If $m > 0$ ($A \ne \emptyset$), it is a multi-document update summarization problem. Traditional summarization can therefore be viewed as a special case of Equation 8.

According to [7], from Equation 8 we can get:

$$D_{\max}(B, S | A) = D_{\max}(B_A, S)$$

where $B$ is mapped to $B_A$ under the condition of $A$. For a document $B_i$ and a document set $A$, $B_{i,A}$ is the set of $B_i$'s sentences (the $B_{i,k}$'s) that are different from all the sentences in $A_1$ to $A_m$:

$$B_{i,A} = \{\, B_{i,k} \mid \forall\, sen \in A,\ D_{\max}(B_{i,k}, sen) > \varphi \,\} \tag{9}$$

where $sen$ ranges over the sentences of the documents in $A$ and $\varphi$ is a threshold.

We have now developed a framework for summarization. However, the problem is that neither $K(\cdot)$ nor $D_{\max}(\cdot, \cdot)$ is computable.
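To make Equations 8 and 9 concrete, here is a minimal sketch that first filters the sentences of B against A with a threshold and then greedily picks sentences that reduce the distance to the filtered set. Everything computable in it is an assumption for illustration, not the system described in this paper: the distance is a crude stand-in that charges each word about $-\log_2 p$ bits from frequency counts, the search is greedy rather than an exact minimization, summary length is counted in words, and the function names (`code_lengths`, `k_cond`, `d_max`, `words`, `update_summary`) are hypothetical.

```python
import math
from collections import Counter


def code_lengths(sentences):
    """Cost of each word in bits: -log2 of its relative frequency."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    total = sum(counts.values())
    return {w: -math.log2(n / total) for w, n in counts.items()}


def words(sentence):
    return set(sentence.lower().split())


def k_cond(x, y, cost):
    """Stand-in for K(x|y): bits for the words of x that y does not contain."""
    return sum(cost.get(w, 0.0) for w in x - y)


def d_max(x, y, cost):
    return max(k_cond(x, y, cost), k_cond(y, x, cost))


def update_summary(b_sents, a_sents, theta, phi):
    """Greedy sketch of Equation 8 with a length limit of theta words."""
    cost = code_lengths(b_sents + a_sents)
    # Equation 9: keep sentences of B that differ from every sentence of A.
    b_a = [s for s in b_sents
           if all(d_max(words(s), words(t), cost) > phi for t in a_sents)]
    target = set().union(*(words(s) for s in b_a))
    summary, covered, length = [], set(), 0
    while True:
        best, best_d = None, d_max(target, covered, cost)
        for s in b_a:
            n = len(s.split())
            if s in summary or length + n > theta:
                continue
            d = d_max(target, covered | words(s), cost)
            if d < best_d:
                best, best_d = s, d
        if best is None:
            return summary
        summary.append(best)
        covered |= words(best)
        length += len(best.split())


b = ["the storm hit the coast on monday",
     "rescue teams reached the flooded town on tuesday",
     "the storm hit the coast on monday morning"]
a = ["the storm hit the coast on monday"]
print(update_summary(b, a, theta=10, phi=5.0))
```

With an empty `a_sents` the filter keeps every sentence of B, which mirrors the remark that traditional summarization is the special case $m = 0$ of Equation 8.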
We can use frequency counts, and use the Shannon-Fano code [8] to encode a phrase that occurs with probability $p$ in approximately $-\log p$ bits to obtain a short description. This approximation method can deal with a sentence at word and phrase granularities. Therefore, we first divide a sentence into semantic elements; the information distance between two sentences is then estimated through their semantic element sets [6].

Semantic element extraction was implemented simply in TAC 2008 [2], by using named entity recognition and counting the overlap of words and entities. However, an entity may have different names. For example, "George Bush" and "George W. Bush" were viewed as different entities, and "May 15th, 2008", "May 15, 2008" and "5/15/2008" were recognized as different dates in our TAC 2008 system. This year we add coreference resolution to our system. First, named entities are normalized using Wikipedia [9]; then different writing styles of dates, such as "May 15th, 2008", "May 15, 2008" and "5/15/2008", are normalized into the same date through regular expressions. The experimental results in [6] demonstrate the effectiveness of our coreference resolution method.

4 Experimental Results

In this section, we first compare our two summarization methods (developed for TAC 2008 and TAC 2009) and then provide the evaluation results on TAC 2009.

[Figure 1. Comparisons. ROUGE-1 recall of the old method and the new method by cluster, on DUC 2007 (clusters A, B, C, All) and TAC 2008 (clusters A, B, All).]

Table 1. Evaluation Results

Cluster                                   Traditional                Update
Evaluation Method                         Best    Ours    Rank       Best    Ours    Rank
AVG Modified Score                        0.383   0.311   9          0.307   0.296   4
MacroAVG Modified Score with 3 Models     0.377   0.316   9          0.303   0.292   4
AVG Linguistic Quality                    5.932   5.682   3          5.886   5.886   1
AVG Overall Responsiveness                5.159   4.955   2          5.023   5.023   1
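The comparison in Section 4.1 is reported under ROUGE-1 recall. For readers unfamiliar with the criterion, the sketch below computes a simplified version: the fraction of reference unigrams that also appear in the candidate summary, with clipped counts. It assumes a single reference, whitespace tokenization, and no stemming or stop-word handling, so it will not reproduce the scores of the official ROUGE toolkit. The function name `rouge1_recall` is hypothetical.

```python
from collections import Counter


def rouge1_recall(candidate: str, reference: str) -> float:
    """Clipped unigram overlap divided by the number of reference unigrams."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    overlap = sum(min(n, cand[w]) for w, n in ref.items())
    return overlap / sum(ref.values())


print(rouge1_recall("the cat sat", "the cat sat on the mat"))  # 0.5
```

Because the measure is recall oriented, a longer candidate can only keep or raise the score, which is why the task fixes a length limit on the summaries being compared.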

4.1 Comparison with the TAC 2008 Method

First, our newly developed method (called the "new method") is compared with the original one from TAC 2008 [2] (called the "old method"). We compare the two methods on the DUC 2007 and TAC 2008 update datasets under the ROUGE-1 recall criterion. Figure 1 shows that our system performs much better with the method based on the newly developed theoretical framework.

4.2 Results of TAC 2009

Finally, our new method is tested on the TAC 2009 dataset. The experimental results under the pyramid evaluation methods are shown in Table 1. The results of traditional summarization (Cluster A) and update summarization (Cluster B) are listed separately. "Best" means the best result among all 52 submissions, "Ours" means our system's result, and "Rank" means the ranking of our result. The table shows that our system performs better on the update datasets than on the traditional datasets. Our system achieved the best result for average linguistic quality and average overall responsiveness on the update datasets.

5 Conclusion and Future Work

In this paper, we have built a document summarization framework based on the theory of information distance. Experiments show that our approach performs well on the TAC 2009 dataset. In future work, we will study our framework further and develop a better information distance approximation method.

Acknowledgment

The work was supported by NSFC under grant No. 60803075 and by the National Basic Research Program (973 Project in China) under grant No. 2007CB311003. The work was also supported by IRCI from the International Development Research Center, Canada.

References

[1] A. Nenkova, R. Passonneau, and K. McKeown, "The pyramid method: Incorporating human content selection variation in summarization evaluation," ACM Transactions on Speech and Language Processing, vol. 4, no. 2, 2007.
[2] S. Chen, Y. Yu, C. Long, F. Jin, L. Qin, M. Huang, and X. Zhu, "Tsinghua University at the summarization track of TAC 2008," in TAC, 2008.
[3] M. Li and P. M. Vitányi, An Introduction to Kolmogorov Complexity and its Applications. Springer-Verlag, 1997.
[4] C. H. Bennett, P. Gács, M. Li, P. M. Vitányi, and W. H. Zurek, "Information distance," IEEE Transactions on Information Theory, vol. 44, no. 4, pp. 1407-1423, July 1998.
[5] C. Long, X. Zhu, M. Li, and B. Ma, "Information shared by many objects," in CIKM, 2008, pp. 1213-1220.
[6] C. Long, M. Huang, X. Zhu, and M. Li, "Multi-document summarization by information distance," accepted by ICDM, 2009.
[7] X. Zhang, Y. Hao, X. Zhu, and M. Li, "Information distance from a question to an answer," in SIGKDD, August 2007.
[8] R. L. Cilibrasi and P. M. Vitányi, "The Google similarity distance," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 3, pp. 370-383, March 2007.
[9] F. Li, Z. Zheng, Y. Tang, F. Bu, R. Ge, X. Zhu, X. Zhang, and M. Huang, "THU QUANTA at TAC 2008 QA and RTE track," in TAC.