From CLIR to CLIE: Some Lessons in NTCIR Evaluation

Hsin-Hsi Chen
Department of Computer Science and Information Engineering
National Taiwan University
Taipei, Taiwan
+886-2-33664888 ext 311
hhchen@csie.ntu.edu.tw

ABSTRACT
Cross-language information retrieval (CLIR) facilitates the use of one language to access documents in other languages. Cross-language information extraction (CLIE) extracts relevant information at a finer granularity from multilingual documents for specific applications such as summarization, question answering, and opinion extraction. This paper reviews the CLIR, CLQA, and opinion analysis tasks in NTCIR evaluation, and reports their design methodologies and some key technologies.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Search process.

General Terms
Algorithms, Measurement, Performance.

Keywords
CLIE, CLIR, Evaluation, Opinion Analysis, Question Answering.

1. INTRODUCTION
Cross-language information retrieval (CLIR) facilitates the use of one language to access documents in other languages. Cross-language information extraction (CLIE) extracts relevant information at a finer granularity from multilingual documents for specific applications such as summarization, question answering, and opinion extraction. NTCIR (http://research.nii.ac.jp/ntcir/index-en.html) started evaluating CLIR tasks on Chinese, English, Japanese, and Korean in 2001. Over these five years (2001-2005), four CLIR test collections, namely the NTCIR-2, NTCIR-3, NTCIR-4, and NTCIR-5 evaluation sets [2][3][4][5], were developed. In NTCIR-5 (2004-2005), we extended the CLIR task to a CLQA (Cross-Lingual Question Answering) task [8], which is an application of CLIE. In NTCIR-6 (2005-2006), we further reused the past NTCIR CLIR test collections to build a corpus for opinion analysis [6][7], another application of CLIE.

When setting up an evaluation test set for multilingual information access, several issues have to be considered, including data sources, languages, genres, criteria for topic/question creation, and relevance granularity. This paper reviews the CLIR, CLQA, and opinion analysis tasks in NTCIR evaluation, and reports their design methodologies and some key technologies in Sections 2-4, respectively. Each section discusses the definitions of the subtasks, the collection of the document sets, the formulation of topics, the evaluation metrics, and the technologies explored.

2. CLIR
2.1 CLIR evaluation
In CLIR, the topics are in source languages and the documents are in target languages, where the target languages differ from the source languages. Compared with TREC (http://trec.nist.gov/) and CLEF (http://www.clef-campaign.org/), which also provide CLIR evaluation, NTCIR focuses on Asian languages and English.

In 2001, Hsin-Hsi Chen and Kuang-hua Chen [2], from the Department of Computer Science and Information Engineering and the Department of Library and Information Science of National Taiwan University (NTU), organized two subtasks in NTCIR-2: Chinese-Chinese IR (CHIR) and English-Chinese IR (ECIR). They collected a Chinese document set, CIRB010, from five news agencies in Taiwan. Its statistics are shown in Table 1.

Table 1. CIRB010 document set.
News Agency            #Documents      Percentage
China Times                38,163           28.8%
Commercial Times           25,812           19.5%
China Times Express         5,747            4.4%
Central Daily News         27,770           21.0%
China Daily News           34,728           26.3%
Total                     132,173 (200MB)

The creation of the CIRB010 topic set consisted of three stages: collecting information requests through a questionnaire on the web, selecting information requests, and constructing topics. From 405 information requests, researchers filtered out 163 unsuitable requests.

A full-text retrieval system then filtered out a further 173 information requests based on the number of relevant documents reported. Finally, researchers selected 50 topics from the remaining 69 information requests.

We adopted the pooling method to collect candidate documents from the participants' submissions. To speed up the evaluation procedure, we designed an evaluation platform, shown in Figure 1, for the assessors. The upper left part shows the name of the assessor assessing the designated document, the topic ID, the pool file, the j-th document in the pool file, and the document number. The degree of relevance, i.e., highly relevant (score 3), relevant (score 2), partially relevant (score 1), or irrelevant (score 0), is assigned by the assessors. Assessors can consult previous judgments to make their decisions, or correct their judgments. In addition, they can attach comments to their decisions. The log is kept for further analysis to improve the evaluation procedure. The upper right part and the bottom part list the topic description and the document being judged, respectively.

Figure 1. Evaluation platform.

Each topic is judged by three assessors. In total, 23 assessors spent 799 hours judging the relevance of 44,924 documents. The three scores for each document in the pool file were integrated as follows:

    R = (X_A + X_B + X_C) / (3 × 3)

where R is the integrated score and X_A, X_B, and X_C are the three scores assigned by the assessors. In the rigid case, a document is considered correct when its R score is between 0.6667 and 1; in the relaxed case, a document is considered correct when its R score is between 0.3333 and 1.
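
The integration scheme above can be illustrated with a short sketch. This is a minimal illustration of the formula and the rigid/relaxed thresholds, not the original NTCIR evaluation code; the function and variable names are my own.

```python
# Sketch of the relevance-score integration described above (illustrative only).
# Each of the three assessors assigns a score from 0 (irrelevant) to 3 (highly
# relevant); the sum is normalized to [0, 1], and the rigid/relaxed judgments
# follow from the 0.6667 and 0.3333 thresholds.

def integrated_score(x_a: int, x_b: int, x_c: int) -> float:
    """Combine three assessor scores (0..3) into R in [0, 1]."""
    return (x_a + x_b + x_c) / (3 * 3)   # average score divided by the maximum score 3

def judge(x_a: int, x_b: int, x_c: int) -> dict:
    r = integrated_score(x_a, x_b, x_c)
    return {
        "R": round(r, 4),
        "rigid": 0.6667 <= r <= 1.0,     # relevant under the rigid criterion
        "relaxed": 0.3333 <= r <= 1.0,   # relevant under the relaxed criterion
    }

if __name__ == "__main__":
    # e.g. two assessors say "relevant" (2) and one says "partially relevant" (1)
    print(judge(2, 2, 1))   # R = 0.5556 -> relaxed-relevant only
```
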
In NTCIR-3, CLIR became an international joint effort: research groups from Japan, Korea, and Taiwan were involved in the design. We began to evaluate CLIR problems on three Asian languages, Chinese (C), Japanese (J), and Korean (K), and on English (E). Three subtasks were designed [3]:

(1) Single Language IR (SLIR): The language of the search topics is identical to that of the documents.

(2) Bilingual CLIR (BLIR): The document set to be searched is in a single language different from the language of the topic set.

(3) Multilingual CLIR (MLIR): The target collection consists of documents in two or more languages. MLIR evaluation checks whether systems can retrieve documents in more than one language relevant to the same topic.

We therefore tried to restrict the document sets used in the evaluation to the same publication periods. Unfortunately, not all documents in these four languages were available, due to copyright issues. Table 2 summarizes the document sets used in the NTCIR-3 CLIR task. The Chinese, Japanese, and English (CJE) news articles were published in 1998-1999, whereas the Korean news articles were published in 1994. Thus, we divided the collection into a CJE part (1998-1999) and a Korean part (1994), and created different topic sets for each part. As before, four relevance grades, i.e., highly relevant (S), relevant (A), partially relevant (B), and irrelevant (C), were adopted. Instead of using the above formula, documents graded S or A are regarded as correct in the rigid case; in the relaxed case, documents graded S, A, or B are considered correct.

Table 2. NTCIR-3 CLIR document sets.

Region   Source (publication period): Language                                    #Documents
Japan    Mainichi Newspaper (1998-1999): Japanese                                    220,078
Japan    Mainichi Daily News (1998-1999): English                                     12,723
Korea    Korea Economic Daily (1994): Korean                                          66,146
Taiwan   CIRB011 (1998-1999): Chinese                                                132,173
Taiwan   United Daily News (CIRB020, 1998-1999): Chinese                              249,508
Taiwan   Taiwan News and Chinatimes English News (EIRB010, 1998-1999): English         10,204

In the NTCIR-3 test data set, the number of English documents is 16.65 times and 9.60 times smaller than the numbers of Chinese and Japanese documents, respectively. Eighteen topics (36%) have no relevant documents in the English data set, and the number of relevant documents from the English collection is far smaller than from the Chinese and Japanese collections.

The document sets come from the same periods in the NTCIR-4 and NTCIR-5 CLIR tasks. In NTCIR-4, 254,438 Korean news articles published in 1998-1999 were added to the collection, and the Chinese, Japanese, and English document sets were extended to 381,375, 593,636, and 347,376 documents, respectively [4]. In NTCIR-5, the publication period was changed to 2000-2001, and the numbers of Chinese, Japanese, Korean, and English documents were further expanded to 901,446, 858,400, 220,374, and 259,050, respectively [5].

The research issues that have been explored are as follows [3][4][5]:

(1) Index methods: indexing of CJK text, the decompounding problem, identification of named entities, dictionaries for indexing, and so on.

(2) Translation: query/document translation, translation methods and sources, term disambiguation, multiword translation, the out-of-vocabulary problem, transliteration methods, conversion of Kanji codes, cognate matching, pivot-language approaches, and so on.

(3) Retrieval models: Okapi (BM25 and its variations), the vector model, the logistic regression model, language models, data fusion, and so on.

(4) Query expansion and re-ranking: pseudo-relevance feedback, web-based expansion, statistical thesauri, pre-translation expansion, document re-ranking, and so on.

2.2 The Web as a Translation Aid
Translation is necessary for CLIR. In addition to bilingual dictionaries and machine translation systems, the web also serves as a translation aid. After bilingual dictionary lookup, the out-of-vocabulary (OOV) query terms are translated by using the web as a multilingual corpus. For example, a named entity is often an important query term but is missing from the bilingual dictionary. Figure 2 shows a snapshot after a Google search, where snippets are returned in a sorted sequence, and Figure 3 shows one of the snippets in which the corresponding English translation appears. Here, a snippet consists of title, type, body, and source fields. The following describes how to extract translation pairs from snippets.

Figure 2. A snapshot after a Google search.

Figure 3. A snippet containing the translation of the named entity.

The basic algorithm is as follows. The top-k snippets returned by Google are analyzed. For each snippet, we collect runs of continuous capitalized words and regard them as candidates. We then count the total occurrences of each candidate in the k snippets and sort the candidates by frequency. The candidates with the most occurrences are considered the translations of the query term.

The basic algorithm does not consider the distance between the query term and the corresponding candidate in a snippet. Intuitively, the larger the distance, the less plausible the candidate. We therefore modify the basic algorithm as follows. We drop candidates whose distances exceed a predefined threshold; in this way, a snippet may not contribute any candidates. To collect enough candidates, say cnum, we may have to examine more than k snippets. Because cnum candidates may not always exist, we stop collecting when a maximum number (max) of snippets has been examined. We prefer candidates with more co-occurrences with the query term and smaller average distances.
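
The following sketch illustrates the snippet-mining heuristic just described. It is not the original system: the snippets are assumed to be plain-text strings already retrieved for the OOV query term (fetching them from a search engine is out of scope), and the parameter names (k, max_snippets, cnum, max_distance) simply mirror the quantities in the text.

```python
# Sketch of the snippet-based translation mining heuristic in Section 2.2
# (illustrative only). Candidates are runs of continuous capitalized words;
# they are counted across snippets, filtered by distance to the query term,
# and ranked by frequency.
import re
from collections import Counter

def candidate_spans(snippet: str):
    """Runs of continuous capitalized words, e.g. 'Keizo Obuchi', with positions."""
    return [(m.group(), m.start())
            for m in re.finditer(r"(?:[A-Z][A-Za-z]+)(?:\s+[A-Z][A-Za-z]+)*", snippet)]

def mine_translations(query_term, snippets, k=30, max_snippets=100,
                      cnum=10, max_distance=40):
    """Count capitalized-word candidates near the query term, rank by frequency."""
    counts, collected = Counter(), 0
    for i, snippet in enumerate(snippets[:max_snippets]):
        if i >= k and collected >= cnum:      # enough snippets examined and candidates found
            break
        q_pos = snippet.find(query_term)
        for cand, pos in candidate_spans(snippet):
            if q_pos >= 0 and abs(pos - q_pos) > max_distance:
                continue                       # drop candidates too far from the query term
            counts[cand] += 1
            collected += 1
    return counts.most_common()               # higher frequency = more likely translation
```
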
3. CLQA
3.1 CLQA evaluation
Question answering (QA) attracts much attention because a huge, heterogeneous data collection is available on the Internet. In NTCIR-5, we initiated CLQA as a pilot task. Five subtasks, JE, EJ, CE, CC, and EC, where C, E, and J stand for Chinese, English, and Japanese, respectively, were evaluated. For each subtask XY, a question in source language X is submitted to a QA system, and answers are extracted from documents in target language Y [8].

The document collection consists of materials in three languages: a Chinese data set of 901,446 news articles (from UDN.com), a Japanese data set of 658,719 news articles (from the Yomiuri Newspaper), and an English data set of 17,741 news articles (from the Daily Yomiuri). Because the English data set is comparatively small, we have to check whether answers exist in the English corpus when designing questions. For the CE subtask, the Chinese questions are translated from English questions by human translators. For CC and EC, we refer to CLIR topics and the logs kept for an online Chinese QA system.

In the NTCIR-5 CLQA task, the answer types were restricted to named entities (NEs) such as PERSON, LOCATION, ORGANIZATION, ARTIFACT (product name, book title, law, etc.), DATE, TIME, MONEY, PERCENT, and NUMEX. In total, 200 questions were provided for each subtask in the formal run evaluation. Finding the correct answer is the major concern of this pilot task, so participants were asked to submit the answer in the target language rather than translate the answers back into the original language. The evaluation criteria for each answer are as follows:

(1) Right: The answer is correct, and the document containing the answer supports it.

(2) Unsupported: The answer is correct, but the document containing the answer does not support it.

(3) Wrong: The answer is incorrect.

The answers were evaluated with different metrics: accuracy in the official runs, and MRR and Top-5 in the unofficial runs.

The challenges in the CLQA task are two-fold: machine translation and question answering. We have to translate questions from the source language into the target language and retrieve the relevant documents containing the answers, which is similar to CLIR. The translated questions are also employed to extract the answers, so translation errors may result in poor IR and IE performance. Techniques such as machine-readable dictionaries, on-line machine translation systems, collocation on search results from the web, and so on, have been explored.
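
For reference, the metrics mentioned above (accuracy, MRR, Top-5) can be computed as in the following generic sketch. This is not the official NTCIR scoring script; the data layout (a ranked list of per-answer judgments per question) is an assumption made for illustration.

```python
# Generic sketch of the CLQA metrics mentioned above (illustrative only).
# `runs` maps each question ID to a ranked list of judgments, where
# "R" marks a Right (correct and supported) answer.

def accuracy(runs):
    """Fraction of questions whose top-ranked answer is Right."""
    return sum(1 for a in runs.values() if a and a[0] == "R") / len(runs)

def mrr(runs, cutoff=5):
    """Mean reciprocal rank of the first Right answer within the cutoff."""
    total = 0.0
    for answers in runs.values():
        for rank, judgment in enumerate(answers[:cutoff], start=1):
            if judgment == "R":
                total += 1.0 / rank
                break
    return total / len(runs)

def top_k(runs, k=5):
    """Fraction of questions with at least one Right answer in the top k."""
    return sum(1 for a in runs.values() if "R" in a[:k]) / len(runs)

# Example: three questions, judgments per ranked answer (R=Right, U=Unsupported, W=Wrong)
runs = {"Q1": ["R", "W", "W"], "Q2": ["W", "U", "R"], "Q3": ["W", "W", "W"]}
print(accuracy(runs), mrr(runs), top_k(runs))   # 0.33..., 0.44..., 0.66...
```
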

3.2 Answer translation and fusion
Ultimately, CLQA should return answers in the source language; in other words, the extracted answers have to be translated back into the language in which the questions are asked. In this pilot study, we focused on the performance of question translation and answer retrieval, and did not ask participants to perform answer translation.

For multilingual CLQA, we submit a question to extract plausible answers from a multilingual document collection, and the same named entities may be reported in different languages. For example, for the Chinese question asking who the Japanese Prime Minister was in 1997, Table 3 lists the first five answers from the English and Chinese document sets, respectively. Merging answers from multiple sources is thus an additional task in multilingual CLQA, and an extension of the methodology in Section 2.2 may be adopted.

Table 3. Answers in different languages.

In this example, the Chinese answer strings denote the same persons as "Yoshiro Mori", "Keizo Obuchi", and "Ryutaro Hashimoto", respectively. We can merge the two sets of answers in the following way:

(1) Multiply out the English answers E_i (1 ≤ i ≤ 5) and the Chinese answers C_j (1 ≤ j ≤ 5), generating 25 combinations.

(2) For a combination (E_i, C_j), submit E_i and C_j together to Google, and verify, in a way similar to the method in Section 2.2, whether E_i and C_j appear in each other's neighborhood. If the combination has strong collocation, delete (E_i, X) (where X ≠ C_j) and (X, C_j) (where X ≠ E_i), and try the remaining combinations.

Figure 4 shows an example of submitting the Chinese answer together with "Keizo Obuchi" to Google; the collocation is marked in red.

Figure 4. Collocation example of the Chinese answer and "Keizo Obuchi".
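
The merging procedure above can be sketched as follows. This is not the original system: collocation_count is a placeholder for submitting both strings to a search engine and counting snippets in which they appear near each other (e.g., using the routine from Section 2.2), and the threshold value is an arbitrary illustration.

```python
# Sketch of the answer-merging idea in Section 3.2 (illustrative only):
# pair up English and Chinese answers whose web collocation is strong.
from itertools import product

def collocation_count(english_answer: str, chinese_answer: str) -> int:
    """Placeholder: how often the two strings co-occur in retrieved snippets.
    Replace with a snippet-based co-occurrence count as in Section 2.2."""
    return 0

def merge_answers(english_answers, chinese_answers, threshold=3):
    """Treat (E_i, C_j) pairs with strong collocation as the same entity."""
    merged, used_e, used_c = [], set(), set()
    # try all |E| x |C| combinations, strongest collocations first
    scored = sorted(((collocation_count(e, c), e, c)
                     for e, c in product(english_answers, chinese_answers)),
                    reverse=True)
    for score, e, c in scored:
        if score >= threshold and e not in used_e and c not in used_c:
            merged.append((e, c, score))   # e and c denote the same person
            used_e.add(e)                  # drop remaining pairs involving e ...
            used_c.add(c)                  # ... or c, as in step (2) above
    return merged
```
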
4. OPINION ANALYSIS
4.1 Opinion extraction evaluation
Humans like to express their opinions and are eager to know others' opinions. An opinion is a word string that someone expresses to declare his or her stance toward a specific target. The named entity who expresses an opinion is called the opinion holder; there may not always be an opinion holder in a sentence. A target may be a product, a person, an event, and so on. Automatically mining and organizing opinions from heterogeneous information sources is very useful for individuals, organizations, and even governments [7][8].

Opinion extraction, opinion summarization, and opinion tracking are three important techniques for understanding opinions. Opinion extraction identifies opinion holders, extracts the relevant opinion sentences, and decides their polarity. Opinion summarization recognizes the major events embedded in documents and summarizes the supportive and non-supportive evidence. Opinion tracking captures subjective information from various genres and monitors the development of opinions along spatial and temporal dimensions. Applications include polls on public issues, product review analysis, collecting the opinions of famous people, monitoring changes in public opinion, analyzing opinions toward candidates in an election, summarizing the opinions of different social classes, and so on.

In 2005, the open submission session at NTCIR-5 [1] collected researchers' comments about a new pilot task. This pilot task aims to build an opinion extraction corpus based on the past NTCIR CLIR test collections and to promote the investigation of opinionated information access. Opinion extraction is one of the kernel technologies of opinion analysis tasks.

In this pilot task, sentences are the basic units for extraction and relevance judgments, and opinions and their holders are the information we focus on. The test collection consists of topics and the relevant documents. For each topic and its relevant documents, systems have to report which sentences contain subjective information relevant to the designated topic, their polarity, and the explicit opinion holder.

To evaluate the technologies involved, we divide opinion analysis into five subtasks. Extracting opinion holders and opinionated sentences is mandatory. Indicating the relevance of opinionated sentences to the given topics and/or determining the polarity of relevant opinionated sentences is optional. In addition, there is an optional application-oriented subtask. The name of an opinion task describes the language, type, and granularity of the task, in the format L-T-G, where L denotes the language of the material, T the type of task, and G the granularity of the analyzed unit of material. For example, C/J/E-OE-S denotes Chinese/Japanese/English Opinion Extraction at the Sentence level.

We selected 32 opinionated topics from the NTCIR-3, -4, and -5 CLIR tasks, and extracted the relevant documents for these topics. There are 872 Chinese documents meeting our requirements. Each Chinese document is tagged by three annotators with the information shown in Table 4.

Table 4. Tags for corpus annotation.

Tag (Level)                                Attribute   Values          Description
<SEN_OP></SEN_OP> (Sentence)               TYPE        YES, NO         Sentence Opinion: whether this sentence is an opinion sentence.
<SEN_ATTITUDE></SEN_ATTITUDE> (Sentence)   TYPE        SUP, NSP, NEU   Sentence Attitude: the opinion polarity of the sentence.
<SEN_REL></SEN_REL> (Sentence)             TYPE        YES, NO         Sentence Relevance: whether this sentence is relevant to the topic.
<OPINION_SRC></OPINION_SRC> (Subsentence)  TYPE        EXP, IMP        Opinion Source: the opinion holder of a specific opinion.

To speed up the annotation, an opinion annotation tool, shown in Figure 5, was designed. With this friendly interface, users can click the appropriate buttons to annotate the different values.

Figure 5. An opinion annotation tool.

The functions of the buttons are as follows.

(1) Opinion subsentence: Three buttons, i.e., "Supportive subsentence", "Nonsupportive subsentence", and "Neutral subsentence", are provided. They annotate the target text with the <SEN_ATTITUDE></SEN_ATTITUDE> tag pair.

(2) Opinion keyword: Four buttons, "Positive keyword", "Negative keyword", "Neutral keyword", and "Opinion operator", are provided. They annotate the target text with the <SENTIMENT_KW></SENTIMENT_KW> tag pair.

(3) Semantics conversion: Three buttons, i.e., "Converted to positive", "Converted to negative", and "Converted to neutral", are provided. They annotate the target text with the <CXT_ATTITUDE></CXT_ATTITUDE> tag pair, which is applied when the sentiment polarity of an opinion keyword is converted from one polarity to another because the keyword co-occurs with another word.

(4) Opinion holder: Two buttons, "Explicit holder" and "Implicit holder", are provided. They annotate the target text with the <OPINION_SRC></OPINION_SRC> tag pair.

The tool currently supports Chinese, English, and Japanese, and is easy to extend to other languages by simply including language files.

To evaluate the quality of the human-tagged corpora, the agreement among annotations has to be analyzed. Inter-annotator agreement is measured at different levels with the following metric:

    Agreement(A, B) = |A ∩ B| / |samples|

where |A ∩ B| is the number of samples annotated identically by annotators A and B, and |samples| is the total number of samples. Three annotators, denoted A, B, and C, examined the samples. Under the lenient metric, neutral and positive are considered to be in the same category; the strict metric treats all three categories (positive, neutral, and negative) as distinct. Annotations are called strongly inconsistent if positive polarity and negative polarity are assigned to the same constituent by different annotators. The kappa value gives a quantitative measure of the magnitude of inter-annotator agreement.
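
The pairwise agreement computation and the lenient/strict distinction can be illustrated with the sketch below. It is illustrative only; the label names ("POS", "NEU", "NEG") and the list-based data layout are assumptions, and the kappa computation is omitted.

```python
# Sketch of the pairwise agreement metric described above (illustrative only).
# Each annotation is one of "POS", "NEU", "NEG" per sample; under the lenient
# metric, NEU is collapsed into POS, while the strict metric keeps all labels.

def agreement(ann_a, ann_b, lenient=False):
    """|{samples annotated identically by A and B}| / |samples|."""
    def norm(label):
        return "POS" if lenient and label == "NEU" else label
    same = sum(1 for a, b in zip(ann_a, ann_b) if norm(a) == norm(b))
    return same / len(ann_a)

def strongly_inconsistent(ann_a, ann_b):
    """Indices of samples given POS by one annotator and NEG by the other."""
    return [i for i, (a, b) in enumerate(zip(ann_a, ann_b)) if {a, b} == {"POS", "NEG"}]

# Example with three samples
a = ["POS", "NEU", "NEG"]
b = ["POS", "POS", "POS"]
print(agreement(a, b))                 # strict: 1/3
print(agreement(a, b, lenient=True))   # lenient: 2/3
print(strongly_inconsistent(a, b))     # [2]
```
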
4.2 Opinion extraction algorithms
Extracting an opinion passage and determining its tendency is not trivial. We have to consider the topic specification, the keywords, and the surrounding words. The topic specification consists of two parts: the focus and the contents of an event. The focus has a strong relationship with the opinion types, and the contents determine whether a document or a passage is related to the topic specification. The opinion extraction algorithm employs the following cues:

CW: a set of concept words in a topic
SW: a set of supportive keywords
NS: a set of not-supportive keywords
OW: a set of opinion keywords
NW: a set of neutral keywords
NG: a set of negation operators
F: the focus of a topic

CW and F are topic dependent, while SW, NS, OW, NW, and NG are topic independent. The word-based opinion extraction algorithm is as follows.

(a) Passage level

(1) Determine a passage by a full stop, question mark, or exclamation mark, segment the passage, and perform steps (2)-(8) until all the passages in the document have been read.

(2) If the passage does not contain any keywords, it is not an opinion. Go to step (1) for the next passage.

(3) If the passage contains a designated number of concept words, it is related to the topic; go ahead and determine its type. Otherwise, go to step (1) for the next passage: although it may be an opinion, it is not related to the topic.

(4) If all the keywords in the passage are supportive, check whether the surrounding words contain a negation operator. If no such operator exists, the passage is a positive opinion; increment the corresponding opinion counter and go to step (1). Otherwise, change the type of the negated supportive keyword to not-supportive and go to step (6).

(5) If all the keywords in the passage are not-supportive, check whether the surrounding words contain a negation operator. If no such operator exists, the passage is a negative opinion; increment the corresponding opinion counter and go to step (1). Otherwise, change the type of the negated not-supportive keyword to supportive and go to step (6).

(6) If supportive and not-supportive keywords are mixed in the passage, the majority determines the passage type, i.e., supportive or not-supportive; go to step (1).

(7) If the passage contains only neutral keywords, set the passage as neutral, increment the neutral counter, and go to step (1).

(8) If the passage contains only opinion keywords, increment the neutral counter and go to step (1).

(b) Document level

(1) If the topic focus is "anti", as in the topic "Anti-Meinung Dam Construction" related to environmental protection, reverse the types of the passages and exchange the corresponding counters.

(2) If the number of positive opinions is larger than the number of negative opinions, the document is positive.

(3) If the number of negative opinions is larger than the number of positive opinions, the document is negative.

(4) If the number of neutral opinions is the largest among the passages, or the numbers of positive and negative opinions are equal, the document is neutral.
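
A compact sketch of the passage-level rules is given below. It is illustrative only and simplifies the steps above: negation is handled by swapping the supportive and not-supportive counts rather than flipping individual keywords, ties are treated as neutral, and min_concepts stands in for the "designated number of concept words".

```python
# Sketch of the word-based passage-level rules above (illustrative only).
# CW, SW, NS, NW, OW, NG are sets of cue words as defined in the text.

def classify_passage(words, CW, SW, NS, NW, OW, NG, min_concepts=1):
    """Return 'positive', 'negative', 'neutral', or None (non-opinion / off-topic)."""
    keywords = SW | NS | NW | OW
    if not any(w in keywords for w in words):
        return None                                   # step (2): no keywords at all
    if sum(1 for w in words if w in CW) < min_concepts:
        return None                                   # step (3): not related to the topic
    sup = sum(1 for w in words if w in SW)
    nsp = sum(1 for w in words if w in NS)
    if any(w in NG for w in words):
        sup, nsp = nsp, sup                           # steps (4)-(5): negation flips polarity
    if sup > nsp:
        return "positive"                             # step (6): majority decides
    if nsp > sup:
        return "negative"
    return "neutral"                                  # steps (7)-(8): only neutral/opinion cues
```

The document-level decision then compares the positive, negative, and neutral passage counters, reversing them first when the topic focus is "anti".
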
5. CONCLUSION
In the design of CLIR and CLIE evaluations, the cost of preparing the answer sets is always an issue. We try to reuse the test beds set up in previous NTCIR evaluations. This not only reduces the cost of developing test data, but also makes it feasible to evaluate the individual modules of a complex system. Take an opinion extraction system as an example. When answering the question "Why is Seed in favor of human cloning?", an opinion extraction system has to find the documents relevant to the topic "human cloning" and then report the positive opinions. The performance of the back-end information retrieval system thus has a great effect on the front-end opinion extraction system. Consistent topics on the same document sets enable researchers to test pipelined modules incrementally.

6. ACKNOWLEDGMENTS
The author is very thankful for the efforts of the co-organizers of the NTCIR CLIR, CLQA, and opinion analysis tasks.

7. REFERENCES
[1] Chen, H. H. and Koga, T. Open submission session. In Proceedings of the 5th NTCIR Workshop Meeting on Evaluation of Information Access Technologies (Tokyo, Japan, December 6-9, 2005). National Institute of Informatics, Tokyo, Japan, 2005. http://research.nii.ac.jp/ntcir/workshop/onlineproceedings5/index.html.
[2] Chen, K. H. and Chen, H. H. Cross-language Chinese text retrieval in NTCIR workshop: towards cross-language multilingual text retrieval. ACM SIGIR Forum, 35, 2 (Fall 2001), 12-19.
[3] Kishida, K., Chen, K. H., Lee, S., Chen, H. H., Kando, N., Kuriyama, K., Myaeng, S. H., and Eguchi, K. Cross-lingual information retrieval (CLIR) task at the NTCIR workshop 3. ACM SIGIR Forum, 38, 1 (June 2004), 17-20.
[4] Kishida, K., Chen, K. H., Lee, S., Kuriyama, K., Kando, N., Chen, H. H., Myaeng, S. H., and Eguchi, K. Overview of CLIR task at the fourth NTCIR workshop. In Proceedings of the 4th NTCIR Workshop Meeting on Evaluation of Information Access Technologies (Tokyo, Japan, June 2-4, 2004). National Institute of Informatics, Tokyo, Japan, 2004, 1-59.
[5] Kishida, K., Chen, K. H., Lee, S., Kuriyama, K., Kando, N., Chen, H. H., and Myaeng, S. H. Overview of CLIR task at the fifth NTCIR workshop. In Proceedings of the 5th NTCIR Workshop Meeting on Evaluation of Information Access Technologies (Tokyo, Japan, December 6-9, 2005). National Institute of Informatics, Tokyo, Japan, 2005, 1-38.
[6] Ku, L. W., Ho, H. W., and Chen, H. H. Novel relationship discovery using opinions mined from the web. In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06) (Boston, Massachusetts, July 16-20, 2006). AAAI Press, Menlo Park, California, 2006, 1357-1362.
[7] Ku, L. W., Liang, Y. T., and Chen, H. H. Opinion extraction, summarization and tracking in news and blog corpora. In Proceedings of the AAAI-2006 Spring Symposium on Computational Approaches to Analyzing Weblogs (Stanford, California, March 27-29, 2006). AAAI Technical Report SS-06-03, 2006, 100-107.
[8] Sasaki, Y., Chen, H. H., Chen, K. H., and Lin, C. J. Overview of the NTCIR-5 cross-lingual question answering task. In Proceedings of the 5th NTCIR Workshop Meeting on Evaluation of Information Access Technologies (Tokyo, Japan, December 6-9, 2005). National Institute of Informatics, Tokyo, Japan, 2005, 175-185.