
Keith A. Gatlin. Enhancing Cross-Language Retrieval of Comparable Corpora Through Thesaurus-Based Translation and Citation Indexing. A master's paper for the M.S. in I.S. degree. April, 2005. 23 pages. Advisor: Robert Losee.

This paper studies methods to enhance cross-language retrieval of domain-specific documents. English- and German-language comparable corpora are used as the subject of the study. A multilingual thesaurus is developed to facilitate query translation, and reference citations are indexed to provide a language-neutral method to retrieve documents. These new retrieval methods are tested against actual user queries to measure the improvement of retrieval quality over an existing Boolean system. Experimental results suggest that a manually produced thesaurus can greatly increase the recall of documents, while the use of the citation index leads to high precision retrieval when compared to a standard Boolean system without these enhancements. Both methods provide cross-language retrieval of documents given monolingual search terms, thus automatically expanding the scope of a user's query.

Headings:
Information retrieval -- cross-language
Information retrieval -- comparable corpora
Thesaurus compilation
Citation indexing

ENHANCING CROSS-LANGUAGE RETRIEVAL OF COMPARABLE CORPORA THROUGH THESAURUS-BASED TRANSLATION AND CITATION INDEXING

by Keith A. Gatlin

A Master's paper submitted to the faculty of the School of Information and Library Science of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree of Master of Science in Information Science.

Chapel Hill, North Carolina
April 2005

Approved by Robert Losee

TABLE OF CONTENTS
INTRODUCTION
BACKGROUND
THESAURUS CREATION AND EVALUATION
THESAURUS IMPLEMENTATION
REFERENCE CITATION INDEXING
METHODOLOGY
RESULTS
DISCUSSION
REFERENCES

INTRODUCTION

The problem of cross-language retrieval is one that has become increasingly important as the volume and availability of machine-readable source data have increased. We now have vast amounts of computer-readable text in multiple languages. To provide convenient access to this information, a retrieval system must be able to return documents in one or more target languages for any query.

In this paper, I will examine methods to enhance the cross-language retrieval of text from domain-specific English and German corpora. The focus of this study is on increasing retrieval effectiveness when English-language queries are used to retrieve documents from both corpora. As such, the target user of this system is a person who is most comfortable phrasing queries in English, but has enough knowledge of German to interpret the German-language results.

The corpora examined in this study are composed of 90,000 English and German catalog entries extracted from online sources. The catalog entries are domain specific and represent comparable corpora. Thus, they share the same concepts and ideas, but documents in one corpus are not direct translations of those in the other.

Without the help of a cross-language retrieval system, users must take one of two approaches to retrieving documents from the corpora: 1) phrase the search query using only language-neutral terms so it will match documents in both languages, or 2) express the query using terms in more than one language. Both methods are unsatisfactory. The first method will almost never achieve perfect recall, as documents are rarely so similar as to allow for such a one-size-fits-all query. The second method, on the other hand, can achieve very high recall, but it assumes the user is very familiar with both languages (and the domain-specific terms within each), an unlikely scenario. Existing search log evidence suggests that most users employ the first retrieval method. Therefore, we can assume that users are not retrieving all relevant documents in the corpora.

Another barrier to retrieval is the nature of information representation in the corpora. Most catalog entries include citations to major reference works in the body of the text. Because of the domain-specific nature of the corpora, these citations can be crucial to locating relevant documents. Users will often know which reference number they are seeking when they perform a search. However, because the documents are captured from various sources, they do not use a consistent citation format for these references. Therefore, users can never be sure which format to use for their query, even if they know exactly which reference they are seeking.

The problem this paper attempts to solve is how to create a bridge between the documents to join similar concepts, terms, and reference citations. The methods I propose to achieve this are twofold. First, a multilanguage thesaurus can be used to translate English query terms into their German counterparts. This translation will automatically expand a user's query to include German terms. Second, the language-neutral reference citations can be identified and indexed separately from the textual portion of the catalog entry. This will allow users to input reference queries in a uniform format to be retrieved separately from the document text without the risk of mismatches.

BACKGROUND

One of the most troublesome areas of cross-language IR, and one that has surfaced many times in previous studies, is its need for advanced disambiguation of terms (Rogati and Yang, 2004). For a system to be successful, we must be able to specify the correct translation for a term having more than one meaning. This sentiment is further reinforced by Ballesteros and Croft (1998), who link the success of cross-language retrieval with its ability to resolve term ambiguity.

This ambiguity can be at least partially avoided if we can make assumptions about the contents of the corpora. If the corpora are focused on the same, narrow domain, as they are in this study, then term meanings are unlikely to vary with context. In this situation, a thesaurus can serve as a controlled vocabulary, specifying the synonyms, translations, and preferred variants for a standard set of commonly encountered terms. With the thesaurus as a base, terms are constrained to their domain-specific meanings, thus avoiding the problem of ambiguity.

Multilingual thesauri have been used extensively in previous cross-language IR research. Most experiments have used thesauri as a method to translate query terms on the fly, thus expanding the query to include one or more target languages (Ballesteros and Croft, 1996). The foundation of this method is automatic query expansion, a technique long used in IR systems to add relevant terms to a query (Qiu and Frei, 1993). When terms are added onto an existing query, we would expect the number of documents returned to increase. This reasoning has been applied to multilingual retrieval, where query expansion is used to append translations of query terms to the user's original input (Han, Fujii, and Croft, 1994).

This query expansion method operates separately on individual terms and does not provide a context-sensitive translation of the query. Research has shown that such dictionary-based translations tend to work better for short, targeted queries than for long ones (Oard, 1998). This constraint is not a problem in the context of this study, because the documents themselves are very short (averaging about 100 terms), and users' queries are unlikely to contain long, complex ideas.

The performance of a thesaurus-based cross-language IR system depends most heavily on the coverage and accuracy of the underlying thesaurus. The UMLS metathesaurus has been successfully applied as an automated method for cross-language retrieval (Eichmann, Ruiz, and Srinivasan, 1998). However, such high quality thesauri are not always readily available for a given application. In this case, the thesaurus must be custom-built. Soergel (1997) provides a framework for building multilingual thesauri. He emphasizes the user-centered approach to indexing, in which actual user queries and interests are used as a basis for thesaurus construction. Sager (1990) advocates a similar approach to term compilation and stresses the importance of high quality translations that are reversible (that is, approachable from either language).

To quickly produce a thesaurus based on available corpora, we might consider automated methods. Attempts to automatically construct monolingual thesauri have met with some success (Jing and Croft, 1994). However, the creation of a multilingual thesaurus is much more difficult, especially when we need to ensure that the thesaurus contains a domain-specific, controlled vocabulary in two languages. Although some parts of the thesaurus construction process can be automated (e.g., term compilation), the actual translation and evaluation process requires manual effort by someone with knowledge of the domain (Soergel, 1997). We can therefore expect that the thesaurus construction process will not be a straightforward, data-driven task. Instead, it will require terminological research and some familiarity with both the English and German terms related to the field.

Citation analysis and indexing have long been studied as methods to find relationships among documents (Small, 1973). Garfield (1979) suggested bibliographic citations as a basis for retrieval. Naturally, citation indexing can be expanded to include cross-language applications as well, assuming that the same citations are used among documents in different languages. The biggest challenge in citation indexing is the parsing of the citations, which may appear differently depending on their context and source. Lawrence et al. (1999) discuss this non-trivial problem in the context of autonomously identifying citations in Web-based articles. One of the techniques suggested is the development of heuristics based on regular expressions to handle the variations in citation styles. This is a relatively standard technique that can be augmented by term frequency analysis and lists of commonly encountered citation components (authors, journal titles, etc.).

Once a thesaurus and citation index have been built, they must be integrated into a retrieval system. One method to achieve this is by mapping Boolean operators to SQL queries to retrieve documents from a relational database. Grossman et al. (1997) discuss the advantages of using SQL queries to retrieve both unstructured and structured data. In this study, we can use a similar method to convert user queries for both unstructured text and structured reference citations into SQL queries. This method allows us to support Boolean operators and easily integrate structured information into user searches.

THESAURUS CREATION AND EVALUATION

The most important quality of a multilingual thesaurus is that it includes all concepts relevant to a domain as they exist in each of the languages (Soergel, 1997). For this experiment, our intent is to improve retrieval within a relatively limited domain. Therefore, we do not have to capture every concept, just the most important ones. In addition, we know that concepts in German are not always expressed in lexically similar ways as the same concepts in English. This is a major concern for the creators of general thesauri. However, our application is targeted at comparable corpora, where the concepts expressed in one corpus have reliable and discernible counterparts in the other corpus. Users of the system are expected to know the vocabulary of the domain (i.e., a controlled vocabulary), and this assumption greatly decreases the possible scope of the thesaurus.

Construction of a multilingual thesaurus typically begins with an analysis of search requests, common document terms, and other thesauri (Soergel, 1997). The goal of this process is to create a list of the most important terms related to a particular domain. As a starting point, I concentrated on a frequency analysis of user query terms as extracted from a search log. This ensures that the thesaurus will cover, at the very least, concepts that have been most frequently requested by users.

The next step in the process is to group terms by category (e.g., geographical locations, proper names, etc.) and identify synonyms within each language. My term frequency analysis of the search query log uncovered several series of terms relevant to the domain. Many of these terms fit into some framework, or subset, of the domain. For example, series of chronological events, names, and places arise when the terms are grouped by topic. I was able to use both my domain knowledge and some existing thesauri and indices to help fill in missing elements of these series.

Once a monolingual term list is available, the next step is to translate the terms into their target language (in this case, German). I carried out this task almost entirely by hand, using cross-language dictionaries to help find term translations. The validity of these translations is solely dependent on whether they actually appear in the target corpus. Therefore, we must ensure that any translations of the English terms are present in the German corpus. In addition, they must be used in the same context. Otherwise, the translations will not facilitate retrieval across the corpora.

After compiling the thesaurus, I exported it as a text file in a format the retrieval system can read. The thesaurus is arranged according to base terms in English. These are the preferred terms and are most likely to be encountered in an English corpus. Each row of the thesaurus begins with a base term. After the base term, any applicable English synonyms are listed. Next come the German translations, with the preferred translation first followed by any variations (such as spelling variants). The layout of the thesaurus thus allows the search system to match English query terms with thesaurus entries and expand them to include synonyms and German translations.

Ultimately, the success of a multilingual thesaurus will be reflected in the performance of the retrieval system into which it is integrated. Because of its query expansion effect, we would expect that a good thesaurus will greatly increase the number of relevant results for any monolingual query. Another good measure of the thesaurus's effectiveness is its coverage, i.e., what proportion of users' queries have a match in the thesaurus. Finally, we can evaluate the accuracy of the thesaural translations to see if they reflect the true translation of a term as it appears in the corpus of the target language. This evaluation step should occur during thesaurus construction, possibly with the help of experts fluent in the target language.
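The row layout described in this section can be illustrated with a minimal sketch. The paper does not specify the actual file format, so the tab-separated columns, the semicolon-separated lists, and the names ThesaurusEntry, parse_row, and load_thesaurus are all assumptions made for illustration. The sample row extends the branch/zweig pair with the hypothetical additions "bough" and "ast".

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    base_term: str                                          # preferred English term
    synonyms: list[str] = field(default_factory=list)       # English synonyms
    translations: list[str] = field(default_factory=list)   # German, preferred first


def _split_list(cell: str) -> list[str]:
    """Split a semicolon-separated cell into lowercased, trimmed terms."""
    return [term.strip().lower() for term in cell.split(";") if term.strip()]


def parse_row(row: str) -> ThesaurusEntry:
    """Parse one row laid out as base<TAB>synonyms<TAB>translations (assumed)."""
    cells = row.rstrip("\n").split("\t")
    cells += [""] * (3 - len(cells))        # tolerate rows with missing columns
    base, synonyms, translations = cells[:3]
    return ThesaurusEntry(base.strip().lower(),
                          _split_list(synonyms),
                          _split_list(translations))


def load_thesaurus(lines) -> dict[str, ThesaurusEntry]:
    """Index every entry under its base term and under each English synonym."""
    index = {}
    for line in lines:
        if not line.strip():
            continue                        # skip blank rows
        entry = parse_row(line)
        for key in [entry.base_term, *entry.synonyms]:
            index[key] = entry
    return index


# Hypothetical sample row; only "branch" and "zweig" come from the paper.
SAMPLE_ROWS = ["branch\tbough\tzweig;ast\n"]
```

Indexing each entry under its synonyms as well as its base term means that a query term can be looked up with a single dictionary access, whichever English variant the user typed.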

In summary, I propose three methods of thesaurus evaluation:

1. Measure the number of results returned for a monolingual query both before and after thesaurus implementation
2. Calculate the proportion of user query terms that appear in the thesaurus (and thus can be translated by it)
3. Evaluate thesaurus translations to ensure they are reversible and appear in the target corpus

The first evaluation method will occur after the thesaurus is implemented and will be discussed in the Methodology section. The second and third evaluation methods should be undertaken as an ongoing part of the thesaurus compilation process. We would assume that a good thesaurus would cover as many user query terms as possible within the limits of time and expense. In addition, high quality translations should be the basis for thesaurus development. Finally, evaluation must be carried out continuously, so the thesaurus can be updated in response to any changes in user queries or corpus content.
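The second evaluation method, together with the query-log frequency analysis it builds on, might be sketched as follows. This is only an illustration under stated assumptions: the log is taken to be a list of query strings, terms are split on whitespace, and the function names term_frequencies and coverage are hypothetical. The paper does not say whether coverage was computed over distinct terms or over term occurrences; this sketch counts occurrences.

```python
from collections import Counter

BOOLEAN_OPERATORS = {"and", "or", "not"}


def term_frequencies(queries):
    """Count how often each non-operator term occurs in the query log."""
    counts = Counter()
    for query in queries:
        for token in query.lower().split():
            if token not in BOOLEAN_OPERATORS:
                counts[token] += 1
    return counts


def coverage(counts, thesaurus_terms):
    """Fraction of query-term occurrences that have a thesaurus match."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for term, n in counts.items() if term in thesaurus_terms)
    return covered / total
```

Sorting the output of term_frequencies with counts.most_common() gives a ranked candidate list for thesaurus entries, and the terms that contribute nothing to coverage are the ones to review for possible addition.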

THESAURUS IMPLEMENTATION

The system into which I implemented the thesaurus uses Boolean retrieval. It accepts queries with the AND, NOT, and OR operators. For each term or phrase in the query string, the system does a lookup in the thesaurus. Query expansion occurs only when a term or phrase has an exact match. In this case, the original query term is expanded to include preferred synonyms and translations appearing in the thesaurus.

Some research in this area has assigned weights to the terms and phrases used in the query expansion process, but experimentation is required to find optimum values for the weights (Jing and Croft, 1994). The system in this experiment is much simpler, using no weighting for the terms. Instead, the query expansion uses the Boolean OR operator to add translated terms or phrases to the user's original query. For example, a user query with the English noun "branch" would become "branch OR zweig" after the thesaurus lookup and translation. This Boolean query would then be mapped to an equivalent SQL query to retrieve results from the underlying relational database.

For this study, the translation process occurs automatically in the background, expanding user terms to include entries from the thesaurus. The user does not see the expanded query, nor are they allowed to choose which of the suggested query terms to include. As such, this system does not offer a method for user feedback.
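The expansion and SQL mapping steps can be sketched roughly as below. This is not the system's actual code. The table name docs, the column name body, the in-memory THESAURUS dictionary, and the use of substring matching with LIKE are assumptions chosen to keep the example short; the paper does not describe the database schema or the matching operator.

```python
import shlex
import sqlite3

OPERATORS = {"AND", "OR", "NOT"}

# Hypothetical lookup table: English term -> synonyms and German translations.
THESAURUS = {"branch": ["zweig"]}


def expand_query(query, thesaurus=THESAURUS):
    """Tokenize a Boolean query; each term becomes a list of OR-alternatives."""
    tokens = []
    for token in shlex.split(query):          # keeps "quoted phrases" together
        if token in OPERATORS:
            tokens.append(token)
        else:
            term = token.lower()
            # Expansion happens only on an exact thesaurus match.
            tokens.append([term, *thesaurus.get(term, [])])
    return tokens


def to_sql(tokens, column="body"):
    """Map expanded tokens to a parameterized WHERE clause and its parameters."""
    parts, params = [], []
    prev_was_term = False
    for token in tokens:
        if isinstance(token, str):            # Boolean operator
            # "a NOT b" is read as "a AND NOT b".
            parts.append("AND NOT" if token == "NOT" and prev_was_term else token)
            prev_was_term = False
        else:
            if prev_was_term:                 # adjacent bare terms: implicit AND
                parts.append("AND")
            clause = " OR ".join(f"{column} LIKE ?" for _ in token)
            parts.append(f"({clause})")
            params.extend(f"%{alternative}%" for alternative in token)
            prev_was_term = True
    return " ".join(parts), params


def search(conn, query):
    """Run an expanded query against the assumed docs(id, body) table."""
    where, params = to_sql(expand_query(query))
    if not where:
        return []
    sql = f"SELECT id FROM docs WHERE {where} ORDER BY id"
    return [row[0] for row in conn.execute(sql, params)]
```

With this sketch, the query "branch" yields the clause (body LIKE ? OR body LIKE ?) with the parameters %branch% and %zweig%, which mirrors the "branch OR zweig" expansion described above. Search terms are passed as bound parameters rather than concatenated into the SQL string; only the fixed column name is interpolated.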

REFERENCE CITATION INDEXING

As a method to bridge between languages, reference citations can be very powerful. In the domain of this study, these references appear in catalog entries to refer the reader to standard, paper-based sources. For example, in the legal domain, Congressional bills are often cited with an abbreviation and a number, such as "H. R. 145", which refers to bill number 145 in the House of Representatives. Another example of such a citation is a reference to a specific Bible verse, as in "1 Cor. 1:13". In both examples, the numerical portions of the reference citations are language-neutral and can be extracted from a document in any language. The German and English corpora in this study take their references from similar, well known sources. As a result, reference numbers provide a good example of a language-neutral bridge between documents in different languages.

The trouble with the reference citations as they exist in catalog entries is in their formatting: documents captured from different sources tend to use different citation styles. Returning to the example of Congressional bills, some sources use punctuation in their citations, such as "H. R. 326" or "S. 120", while other sources leave out the periods and use different spacing, such as "HR 326" or "S 120". If users do not phrase their search queries exactly as citations appear in the source documents, then the retrieval system will not find them.

To address this problem, I chose the seven most frequently cited reference works, removed their reference citations from the corpora, and indexed them separately from the document text. I carried out this process by generating a list of regular expressions that would match the numerical portion of each reference citation. The citations were then placed in their own database fields so they can be searched separately from the document text. This index of citations is similar to database normalization, in that its goal is to extract atomic, structured data from an unstructured text field.

Despite their ability to serve as a bridge between documents in different languages, reference citations do have one drawback: they are not always unique. Therefore, they may not produce the desired documents if used alone in a search query. Instead, they often need to be used in combination with other query terms to identify relevant catalog entries. For example, when referring to bills under consideration in the U.S. Senate, a citation such as "S. 648" might be used in the text of a document. However, this citation is incomplete when taken out of context (i.e., when extracted programmatically). To refer to a unique bill, the reference citation must be accompanied by the number of the Congress, such as "109th Congress", because bill numbers are reset before each new Congress. For this study, we can assume that users of the system are familiar with these references and their limitations, and that they will phrase their queries accordingly.
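A regular expression of the kind described here might look like the following sketch, which uses the Congressional bill example from this section rather than the actual reference works of the study. The pattern BILL_RE and the function extract_citations are hypothetical, and the pattern only covers the punctuation and spacing variants mentioned in the text.

```python
import re

# Matches variants such as "H. R. 326", "HR 326", "H.R.326", "S. 120", "S 120".
# The lookbehind rejects matches that start inside a longer abbreviation,
# for example the "S. 648" inside "U.S. 648".
BILL_RE = re.compile(r"(?<![\w.])(H\.?\s*R\.?|S\.?)\s*(\d+)\b")


def extract_citations(text):
    """Return (work, number) pairs in one uniform format, e.g. ("HR", 326)."""
    citations = []
    for match in BILL_RE.finditer(text):
        # Drop periods and spaces so every style maps to the same key.
        work = re.sub(r"[.\s]", "", match.group(1)).upper()
        citations.append((work, int(match.group(2))))
    return citations
```

Each extracted pair could then be stored in its own database field alongside the document identifier, so that a query for bill 326 matches a document regardless of whether its text reads "H. R. 326" or "HR 326", and regardless of the language of the surrounding text.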

METHODOLOGY

The methodology I used to test and evaluate retrieval performance is focused on the two improvements I made to the system. First, I measured the performance of the multilingual thesaurus. Second, I gauged how well the system performs when using the index of reference citations to retrieve documents.

To adequately test the multilingual thesaurus, we must evaluate it based on the metrics discussed in the thesaurus compilation process. We are most interested in how successfully the thesaurus uses query expansion to find new documents. This can be measured by counting the number of documents returned for each search query. We can compare the result of this test to the old search method (which did not use a thesaurus) to see how the use of thesaural translations increases document recall. To carry out this evaluation, I randomly selected 200 search queries from the existing query log. For each query, I ran a search using both the old and new systems and recorded the number of hits.

The second evaluation component focuses on the effect of reference citations in cross-language retrieval. To test the effect of these references, I randomly selected 50 user search queries that include a reference citation. I then broke these queries into two components: the numerical citation, and any other search terms that were part of the query. I used the numerical portion as an input to search for matches among the reference entries in the database. The other search terms were input into the thesaurus-based translation system. For a query to match a document, both the numerical reference citation and the query terms must have matches in the database. In this way, we take advantage of the thesaural query expansion as well as the index of citations. For each query, I ensured that enough information was available to guarantee that the reference citation would be unique. This restriction allows us to judge the relevance of the results, because we know which catalog entries a user was trying to find based on their query. Therefore, for the list of results of each query, we can calculate the number of relevant and nonrelevant documents.
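The two measurements described in this section can be expressed as a short sketch. The function names are hypothetical, document results are assumed to be sets of identifiers, and two details that the paper leaves open are resolved by assumption: queries with zero hits in the old system are excluded from the percentage increase, and precision is averaged per query rather than pooled over all retrieved documents.

```python
def percent_increase(old_hits, new_hits):
    """Relative growth in hit count, in percent; None if undefined (old == 0)."""
    if old_hits == 0:
        return None
    return 100.0 * (new_hits - old_hits) / old_hits


def combined_match(citation_matches, term_matches):
    """A document matches only if both the citation and the terms match it."""
    return set(citation_matches) & set(term_matches)


def precision(retrieved, relevant):
    """Share of retrieved documents that are relevant (0.0 if none retrieved)."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant)) / len(retrieved)


def mean_precision(runs):
    """Average per-query precision over (retrieved, relevant) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(precision(ret, rel) for ret, rel in runs) / len(runs)
```

For example, a query that returned 20 documents under the old system and 43 under the new one shows an increase of 115 percent, meaning the new result list is 2.15 times the size of the old one.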

RESULTS

The thesaurus-based query expansion system improved the search results for 37 of the 200 randomly selected queries (18.5 percent). No changes were observed for the other 163 queries. Of those queries that saw improved results, the number of hits increased by an average of 115 percent, so the thesaurus more than doubled the average number of documents retrieved for those queries. Although I did not evaluate the relevance of each retrieved document, initial results indicate that the query expansion of English terms is very effective at returning German documents that would not have been retrieved without query translation. Therefore, we can confidently say that the thesaurus does a good job of expanding search results when it can translate at least one query term.

Inevitably, though, not all user queries have matches in the thesaurus. Some queries simply contain concepts not covered by the thesaurus. Others do not match thesaurus entries because of misspellings. Analysis of these non-matching terms is valuable, as they may prove to be candidates for addition to the thesaurus, either as new entries or as extensions of an existing entry.

Whereas the thesaurus evaluation offers information about the recall of the retrieval system, the reference citations can be employed to evaluate its precision. The results of this portion of the evaluation were very good. Of the 50 citation-specific search queries examined, all but two produced at least some relevant results. Overall, the 50 queries achieved very high precision. On average, 87 percent of the retrieved documents were relevant to the query. As with the evaluation of the thesaurus-based query expansion, the documents retrieved in this test were highly representative of the corpora from which they were drawn; that is, they consisted of both German and English documents with varying styles of citations. From these results, we can conclude that the index of reference citations greatly improves cross-language retrieval because of its ability to reconcile differing reference styles. When implemented alongside the thesaurus-based translation system, this citation index provides a reliable, language-neutral method to retrieve specific catalog entries from the corpora.

DISCUSSION

This study has demonstrated how cross-language retrieval of German and English documents can be enhanced by the use of a multilingual thesaurus and an index of reference citations. The thesaurus compilation process was completed almost entirely with manual methods. Manual compilation, though very labor intensive, gives a thesaurus several important properties. First, the thesaurus was built with the help of domain-specific knowledge of the corpora. This allows us to ensure the quality of translations and, at least subjectively, judge the completeness of the thesaurus entries. In addition, we can limit the scope of the thesaurus to cover only the most important domain concepts. These properties of a manually constructed thesaurus allow it to fit well into the framework of such a narrowly defined domain, where the use of general thesauri or machine translation would not be able to handle the specialized vocabulary.

Like thesaurus-based translations, the index of reference citations is an attempt to bridge between the English and German documents. Reference citations are a particularly powerful retrieval method because they represent structured information within unstructured document text. That is, the references cited in a document appear in a structured format that we can extract and interpret programmatically. This structure does show subtle variations among documents (reflecting the differences in citation styles), but most can be matched with the use of regular expressions. We can therefore parse a document for the numerical portion of a citation and index this separately from the document text, providing a key on which to find documents containing a particular citation. In any corpus of unstructured text, such citations are valuable tools for retrieval. If they can be extracted and indexed apart from the document text in a normalized fashion, then we can rely on simple database queries rather than retrieval algorithms to match documents to a query.

Several aspects of the implemented retrieval methods could benefit from improvement. To allow for an expansion of the number of user queries that can be translated by the thesaurus, term compilation and thesaurus evaluation have to be ongoing processes. As the coverage of the thesaurus increases, cross-language retrieval performance will similarly improve. The query expansion process might also benefit from user feedback; instead of being an automatic, behind-the-scenes process, the system could show users suggested synonyms and translations. This function would allow users to customize their search by choosing which terms to add to a query or by editing the suggested terms. Finally, the scope of the reference citations index could be expanded to include more reference works. As implemented, the system indexes the top seven citation sources. We could greatly expand the coverage of this resource by indexing other, less frequently cited reference works.

As a result of this study, we recommend that designers of domain-specific, cross-language retrieval systems carefully evaluate the potential for developing a custom thesaurus and citation index to enhance retrieval performance. Within a narrow domain, thesaurus-based translations of important query terms can provide an extremely powerful method to present users with documents that otherwise would not have been returned by a monolingual query. Similarly, language-neutral reference citations should be exploited wherever possible by developing parsing techniques that can identify citations and index them. This method allows users who are familiar with standard references to find documents in multiple languages.

REFERENCES

Ballesteros, L. and W. B. Croft. (1996). Dictionary methods for cross-lingual information retrieval. Proceedings of the 7th International DEXA Conference on Database and Expert Systems Applications.

Ballesteros, L. and W. B. Croft. (1998). Resolving ambiguity for cross-language retrieval. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 64-71.

Eichmann, D., M. Ruiz, and P. Srinivasan. (1998). Cross-language information retrieval with the UMLS metathesaurus. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

Garfield, E. (1979). Citation indexing: Its theory and application in science, technology, and humanities. New York: Wiley-Interscience.

Grossman, D. A., O. Frieder, D. O. Holmes, and D. C. Roberts. (1997). Integrating structured data and text: A relational approach. Journal of the American Society for Information Science, 48(2), 122-132.

Han, C., H. Fujii, and W. B. Croft. (1994). Automatic query expansion for Japanese text retrieval. Technical report, Department of Computer Science, University of Massachusetts, Amherst.

Jing, Y. and B. Croft. (1994). An association thesaurus for information retrieval. Proceedings of the Intelligent Multimedia Retrieval Systems and Management Conference, 146-160.

Lawrence, S., C. L. Giles, and K. Bollacker. (1999). Digital libraries and autonomous citation indexing. IEEE Computer, 32(6), 67-71.

Oard, D. W. (1998). A comparative study of query and document translation for cross-language information retrieval. Proceedings of the 3rd Conference of the Association for Machine Translation in the Americas, 472-483.

Qiu, Y. and H. P. Frei. (1993). Concept based query expansion. Proceedings of the 16th International Conference on Research and Development in IR (SIGIR), 160-169.

Rogati, M. and Y. Yang. (2004). Resource Selection for Domain Specific CLIR. Proceedings of the 2004 International Conference on Research and Development in IR.

Sager, J. C. (1990). A Practical Course in Terminology Processing. Amsterdam: John Benjamins Publishing Company.

Small, H. (1973). Co-citation in scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 24, 265-269.

Soergel, D. (1997). Multilingual thesauri in cross-language text and speech retrieval. Proceedings of the AAAI Symposium on Cross-Language Text and Speech Retrieval.