Dictionary Based Transitive Cross-Language Information Retrieval Using Lexical Triangulation

A study submitted in partial fulfilment of the requirements for the degree of Master of Science in Information Management at The University of Sheffield by Timothy John Gollins, September 1st 2000

Abstract

The purpose of this dissertation is to investigate dictionary based cross language information retrieval using the technique of lexical triangulation. Lexical triangulation is a technique for combining the results of different transitive translations. A transitive translation uses an intermediate or pivot language to translate between two languages when no direct translation resource is available. The research took queries in German and translated them via Spanish, Dutch, or Italian into English. The research compared the results of retrieval experiments using these transitive queries with other queries created by combining the transitive translations (lexical triangulation) or created by direct translation.

Direct dictionary translation of a query introduces considerable ambiguity that damages retrieval results compared to the monolingual case, with average precision 79% or more below monolingual in this research. Transitive translation introduces more ambiguity, giving results worse than 80% below the direct translation in this case. This research demonstrates that lexical triangulation between two transitive translations can eliminate the additional ambiguity introduced by transitive translation and achieve performance comparable with direct translation. Lexical triangulation between three transitive translations in some circumstances outperformed direct translation by 22%, achieving results 59% below monolingual.

This research demonstrated that the technique of pre-translation pseudo-relevance feedback can be combined with direct and triangulated translation to achieve considerable performance improvements. Direct translation results improved by up to 77% and triangulated translation by over 73%. The direct and triangulated results remain comparable under these changes.

The retrieval experiments used the GLASS system developed by Dr Mark Sanderson and INQUERY from the University of Massachusetts. Use of the INQUERY synonym operator eliminated the ambiguity of transitive translation. However, the synonym operator did not improve the direct translation results significantly and degraded the results of lexical triangulation.

These experiments used language resources from the EuroWordNet and CELEX databases and text collections from both the TREC8 CLIR track and CLEF 2000 workshops. This research submitted results to the CLEF 2000 workshop.

Acknowledgements

I would like to thank the following people for help and support with this dissertation. Dr Mark Sanderson, for the excellent supervision, encouragement and support he has given me throughout this dissertation. Wim Peters, for his support and encouragement, and for the supply of the language resources and advice that made much of the work possible. Jessica Peel-Yates, for some excellent suggestions concerning comparisons between pivot languages. Asaad Alberair, for his friendship and advice on matters concerning IR evaluation measures. Mark McCree, for reading and commenting on the draft of this dissertation. Laura Tassoni, for her friendship and support during the dark days of Semester 1. Karen McFarlane, for her support and encouragement to take up the Masters Course and, through her, my employers for their financial support throughout the course that made it possible. Jeremy, who had to put up with Daddy going off to the Sheffield Factory every Sunday night. Most importantly, I must thank my wife Hazel, for her unending support and encouragement. She first encouraged me to take up the course and secondly put up with the enormous disruption caused by my weekly commuting. Thank you.

Table Of Contents

1 Introduction Overview Motivation and Background for CLIR CLIR - A summary of techniques Controlled Vocabulary Early CLIR More Recent approaches Machine Translation (MT) Translate the document corpus Translate the queries Comparable or Parallel Corpus Based techniques Translating using parallel corpora No Translation Needed - Vector approaches Dictionary-based techniques Background Pseudo-Relevance feedback The INQUERY synonym operator Why Transitive CLIR? Lexical Triangulation This dissertation Background Aims Structure of the Work Report Structure Objectives Methodology and Resources Underlying Philosophy Evaluation Methodology Choice of Languages Evaluation Resources Corpora Queries Relevance Judgements Language Resources EuroWordNet Background EuroWordNet in this investigation CELEX CELEX Background CELEX as used in this investigation SDA German Language Corpora for Pseudo-relevance feedback Technical Resources IR Systems GLASS INQUERY Programming Languages Awk Shell Scripts Experimental Environment: An Overview The Processing Pipeline Pre-translation Processing Translation Processing Merge Processing Pre Retrieval Processing Experimental Components File Formats GLASS Format Extended GLASS Format INQUERY (batch) query file format Pre-translation Parsing the Queries Normalisation Lemmatisation background The basic lemmatiser The new lemmatiser (a variant of the above) Lemmatiser Coverage German stop-words module (Pre-translation) Pseudo-Relevance Feedback process Translation Pre Processing of EuroWordNet Simple Translation Translation with Cognate Spotting Merging Intersection Only Merge Full Merge Multi-Merge 2plus Multi-Merge Full Pre Retrieval Format conversion Stop-words English Stemming Retrieval and Post Retrieval GLASS INQUERY trec_eval_new Corpus Indexing GLASS English GLASS German (for pre-translation feedback) INQUERY The Experimental Story - A Step By Step Approach Phase 1 Is the basic idea sound? An Initial Study Experiment Results Analysis and Discussion First Automatic Translations Experiment Results Discussion Phase 2A The basic experiments - GLASS First approach Experiment Results Analysis Full or intersection merging Experiment Results and Analysis Porter Vs WordNet Experiment Results Analysis To flatten or not to flatten Experiment Results Analysis Introducing German Stop-words Experiment Results Analysis Phase 2B - The basic experiments INQUERY Query Structure Simple approach Experiment Results Analysis Phrases Experiment Results Analysis Synonym Experiment Results Discussion and Analysis Synonym and Phrases Experiment Results Analysis Interim Evaluation The CLEF Submission Experiment Results Analysis and Discussion Phase 3 - The New Experimental Direction Coverage of the German WordNet The New Lemmatiser Coverage of the German WordNet Retrieval Evaluation of New lemmatiser GLASS Experiment GLASS Results GLASS Analysis INQUERY Experiment INQUERY Results INQUERY Analysis How effective is the compound splitting Experiment Results Analysis Are cognates important? Only Translated terms Experiment Results Analysis Cognate spotting Translation Experiment Results Analysis Pre-translation Query Expansion (Pseudo-Relevance Feedback) 5.9.1 Background Approach Initial development Optimising the GLASS parameters Optimising Experiment Optimising Results Evaluation Experiment Results Analysis Focused Feedback Optimising Experiment Optimising Results Evaluation Experiment Results Analysis Optimising INQUERY parameters Normal Feedback Experiment Optimising Results Focused Feedback Experiment Optimising Results Evaluation Experiment Discussion Multiple Transitive Translation GLASS Experiments Results Analysis INQUERY Experiment Results Analysis Discussion Combining Pre-translation Expansion with Multiple Transitive Translation GLASS Experiments Experiment Results Analysis INQUERY Experiments Experiment Results Analysis Discussion Discussion Transitive CLIR and ambiguity How to discuss translation effectiveness Lexical Triangulation Why did bilingual do so poorly The problems (possibly) Possible solutions The Synonym operator The different tools available to the experimenter Conclusion and Recommendations Objectives Achieved The Future for Lexical Triangulation Research Bibliography

1 Introduction

1.1 Overview

The field of "Cross Language Information Retrieval" (CLIR) has emerged at the intersection of research into Machine Translation and conventional, or monolingual, Information Retrieval (IR) (Grefenstette (1998c)). CLIR addresses the situation where the query that a user presents to an IR system is not in the same language as the corpus of documents they wish to search. This situation presents a number of challenges (Grefenstette (1998c)), but primary amongst these is the problem of crossing the language barrier (Schauble & Sheridan (1997)). Almost all the approaches to this problem require access to some form of rich translation resource to map terms in the query language (the source) to terms in the corpus (the target).

Transitive CLIR aims to address the situation where there are limited direct translation resources available (Ballesteros (2000)). A transitive CLIR system translates the source language terms by first translating the terms into an intermediate or "pivot" language and then translating the resulting terms into the target language. Thus, a transitive system could translate a query from German to English via Dutch, Spanish, or another language.

The main aim of this work is to combine translations from two different transitive routes to discover if this can reduce the ambiguity inevitably introduced by transitive translation. Ballesteros suggested the possibility of using this approach in the summary to her recent paper (Ballesteros (2000)). I have chosen to call this approach lexical triangulation. See figure 1 below.

[Figure 1. Lexical Triangulation - an example. The German query is translated to Dutch and to Spanish; the Dutch query terms and the Spanish query terms are each translated to English; the two sets of English terms are combined to form the English query.]
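The flow in figure 1 can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function names, the placeholder vocabulary (`g`, `n1`, `s1`, `e_shared` and so on), and the choice of set intersection as the "combine" step are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Set, Tuple

Lexicon = Dict[str, Set[str]]  # hypothetical: one term -> its set of translations


def translate(terms: Iterable[str], lexicon: Lexicon) -> Set[str]:
    """Word-by-word lookup: the union of every translation of every term."""
    out: Set[str] = set()
    for term in terms:
        out |= lexicon.get(term, set())
    return out


def transitive(term: str, src_to_pivot: Lexicon, pivot_to_tgt: Lexicon) -> Set[str]:
    """Translate source -> pivot -> target for a single source term."""
    return translate(translate([term], src_to_pivot), pivot_to_tgt)


def triangulate(term: str, routes: List[Tuple[Lexicon, Lexicon]]) -> Set[str]:
    """Keep only the target terms produced by every transitive route (assumed merge rule)."""
    results = [transitive(term, a, b) for a, b in routes]
    return set.intersection(*results) if results else set()


# Placeholder lexicons; the tokens are invented and carry no linguistic meaning.
de_nl: Lexicon = {"g": {"n1"}}
nl_en: Lexicon = {"n1": {"e_shared", "e_noise_nl"}}
de_es: Lexicon = {"g": {"s1", "s2"}}
es_en: Lexicon = {"s1": {"e_shared"}, "s2": {"e_shared", "e_noise_es"}}

via_spanish = transitive("g", de_es, es_en)
combined = triangulate("g", [(de_nl, nl_en), (de_es, es_en)])
```

Each route on its own carries a spurious term (`e_noise_nl` or `e_noise_es`), while the intersection keeps only the term that both routes agree on. Other merge rules, such as a union of the routes, would fit the same skeleton.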

1.2 Motivation and Background for CLIR

In an enlarging European community, and with the United Kingdom's new emphasis on regional politics, information in a multiplicity of languages is becoming more important. As these various institutions grow, mature, and generate information, the need for native speakers of one language to find information recorded in another will increase. Grefenstette (1998b) points out that the elimination of language barriers by the adoption of a single universal language has been a pipe dream held by many. However, the reality of the world is that these barriers exist and will remain. He also observes that, with the explosion of electronic information on the World Wide Web and multinational corporate Intranets, the ability to find information across language barriers is becoming a commercial necessity. This view is echoed in the introduction to several papers by Oard (Oard (1997a), Oard (1997b), Oard (1998)) and by Ballesteros & Croft in the introduction to one of their papers (Ballesteros & Croft (1996)).

A scenario often presented for the use of CLIR is one where a user can understand documents in the language of the corpus, but is unable to express a query in that language (Oard (1997a), Braschler, et al. (1998)). Both Oard and Braschler et al. also describe the scenario where a user has no skills in the corpus language, but has access to expensive translation resources. In this scenario, the user will only wish to submit the best documents for translation. In both these situations, the expectation that the retrieved documents will match the information need is paramount.

One can also envisage other scenarios (Oard & Dorr (1996)), for example a polyglot who wishes to conduct a search in many languages simultaneously and does not want the significant additional effort of reformulating their request into all of the target languages. Finally, there is the scenario where the primary information is not language specific in nature; that is, it may be multimedia information such as sound, image or video. In this case, the primary index to the material may be textual, but not in a language known to the user. The results of this sort of search may not require any translation at all.

1.3 CLIR - A summary of techniques

A number of papers have summarised the field of CLIR in recent years (Oard & Dorr (1996), Braschler et al. (1998), Grefenstette (1998a), Hull & Grefenstette (1996), Oard (1997a)). Each of these has attempted a categorisation of the techniques of CLIR. If these different categorisations are examined, it becomes clear that they are based on the language resources used to cross the language barrier. The following list gives an overview of CLIR techniques.

Controlled Vocabulary. These techniques use the terms in a thesaurus to index the documents in the collection. The system maps the terms in the query through the thesaurus in the same way. The system then compares document and query based on the thesaurus terms they have in common. Fluhr (1996) provides a short general description of this approach.

Machine Translation. These techniques use a conventional machine translation system to translate either the query or the document corpus so that both are in the same language. This approach then uses a conventional monolingual retrieval system to index and retrieve the documents (Oard (1998)).

Comparable or Parallel Corpus Based techniques. The common link between all corpus techniques is the use of corpus resources as the basis for training the IR system, or for constructing information structures to be used in retrieval or translation (Braschler et al. (1998)).

Dictionary Based techniques. As far as I am aware, the only work done in this field is based on the translation of the queries into the language of the document corpus. The motivation behind this approach is that the resources needed for conventional Machine Translation and for work with parallel corpora are both rare and expensive. The researchers in this area believe that bilingual dictionaries in a machine-readable form are more widely available (Ballesteros & Croft (1996), Ballesteros & Croft (1997), Ballesteros & Croft (1998a)). These dictionaries are often the electronic analogues of the normal, hardcopy, bilingual dictionaries used by linguists everywhere.

In the following sub-sections, I will briefly examine some of the techniques that fall into these categories.

1.3.1 Controlled Vocabulary

Early CLIR

The earliest work on CLIR used controlled vocabularies. The work of Salton on English, French and German and of Pevsner on English and Russian during the 1970s is described in Oard & Dorr (1996). The techniques involved taking controlled vocabulary thesauri already developed for the monolingual retrieval systems SMART and PNP-2 respectively, and manually adding terms in the foreign language to each thesaurus concept category. Retrieval then proceeded in the same way as the monolingual case. The effect of augmenting the thesaurus in this way can be seen as translating both document and query into the common language defined by the thesaurus concepts.

More Recent approaches

The approach taken by the TextWise LLC team at TREC 7 typifies this approach (Diekema, et al. (1998)). Diekema et al. have developed a system called CINDOR based on the Princeton WordNet as a central thesaurus. The original WordNet consists of a hierarchically arranged thesaurus with the different English terms arranged into synonym sets (synsets), each with a unique identifier (id number) (Miller, et al. (2000)). Each of these synsets represents one concept in the thesaurus-like hierarchy. In their work Diekema et al. (1998) regard this synset hierarchy as a language-independent conceptual inter-lingua. In order to proceed with CLIR, each synset (keeping its unique meaning and id number) is populated with terms from the new language of interest. This produces a parallel thesaurus in the new language joined to the original through the conceptual inter-lingua (i.e. the synset-ids and their hierarchical relationship). Diekema et al. (1998) repeated this process for all the languages they required. The system then indexes documents in the languages of interest using the synset-ids of the synsets that contain the terms in the document. This effectively translates the document into the inter-lingua. Retrieval can then proceed by similarly translating the query into synset-ids and then matching the synset-ids using a conventional retrieval system.

There are a number of detailed issues associated with ambiguity and the lack of coverage of the thesaurus that can cause difficulties with this approach; these are dealt with by Diekema et al. (1998). This approach is significant because the translation resource chosen for my investigation is EuroWordNet, a resource developed with a very similar structure to the one described by Diekema et al. (1998).

1.3.2 Machine Translation (MT)

There are two basic approaches to MT: one translates either the documents or the queries.

Translate the document corpus

This approach is typified by the work of Oard & Hackett (1997) and Oard (1998). There are some drawbacks to this approach, as compared to translating the queries, including the extensive processing required and, in the case of multiple query languages, the need to duplicate the documents in all of the potential query languages. Despite these drawbacks, there are features of this approach that are appealing, in particular the hope that translation ambiguity will be less pronounced in longer documents, as machine translation is designed to work with whole sentences and documents. Another advantage of this approach is that the user may immediately receive the documents in their preferred language to enable them to skim or read them as appropriate (Oard (1998)). Oard reports that in his comparisons, for longer queries, machine translation of the document corpus is very effective (Oard (1998)). Unfortunately, the differences between this and other query translation techniques are not statistically significant in his experiment. Fluhr (1996) comments on the tendency for MT systems to make errors. He is also reported in Gachot, et al. (1998) as observing that machine translation is best applied in limited subject domains where MT can be specialised to the domain to reduce ambiguity.

Translate the queries

Oard (1998) discusses this technique and concludes that it is significantly less costly than translating the documents. However, the technique is clearly less effective than some other dictionary based techniques for short queries (Oard (1998)). Yamabana, et al. (1998) comment that the problem of resolving ambiguity in machine translation systems has been a major challenge in that field. They observe that the techniques successfully adopted by the MT community are totally unsuited to translating queries, since queries are rarely sentences and more often just a sequence of words.

Gey, et al. (1998) report on their use of the Globalink machine translation system for translating queries in a CLIR experiment. The absence of some language pairs in the Globalink lexicons forced them to use English as a universal intermediate or pivot language. Gey et al. make no comment as to the impact this process may have had on their results, although there appears to be some evidence in their results that the effect was to reduce the average precision.

1.3.3 Comparable or Parallel Corpus Based techniques

There are approaches that concentrate on translating the query (e.g. Nie (1998), Nie (1999), Sheridan & Ballerini (1996)). There are also approaches that translate the target corpus (Franz, et al. (1998)). Finally, there are techniques that involve no direct translations at all (Yang, et al. (1998), Landauer & Littman (1990)).

Translating using parallel corpora

The aim of these techniques is to use similar corpora from two different languages to generate a probabilistic model that can map terms in one language into their most likely translations in the other. For this method to succeed the corpora need to be quite similar. In general, the methods proceed by aligning the corpora sentence by sentence. Then, based on the positions of the words in the sentences, together with anchor points such as cognates [1] and numbers, the system estimates, for every pair of terms in the aligned sentences, the probability that the two terms are translations of each other. By combining these probabilities, the system creates an overall mapping that translates one term into a set of likely others. Nie (1998), Nie (1999), and Nie, et al. (1999) use this technique to translate queries, and Franz et al. (1998) use it to translate the whole target corpora. All report considerable success.

No Translation Needed - Vector approaches

Vector approaches are quite different from the other approaches as they rely on mapping the query and documents into a combined vector space, which is usually defined by a parallel training corpus of some sort. No translation takes place. The approaches include the "Generalised Vector Space Model" (GVSM) as described by Carbonell, et al. (1997) and Yang et al. (1998). They also include the Latent Semantic Indexing approach as described by Landauer & Littman (1990), and Rehder, et al. (1997). Landauer & Littman (1990) reported considerable success with this approach; however, Yang et al. (1998) concluded that the mate-finding evaluation technique used was a poor one, as it was rather too optimistic in its results. I have not elaborated on these vector approaches as they use techniques far removed from the techniques used in this investigation. For further information see the papers by Carbonell et al. (1997), Yang et al. (1998), Landauer & Littman (1990), and Rehder et al. (1997).

[1] These are terms, often names, spelled the same (or nearly the same) in both languages.

1.3.4 Dictionary-based techniques

Background

The basic approach is to take each term [1] in the query and translate it by looking it up in a Machine Readable Dictionary (MRD) [2]. This usually results in a significant expansion of the query, as terms inevitably have many possible translations. This is not only because terms may have several synonyms, but also because terms tend to have a number of different senses that a naive approach cannot distinguish. This basic approach is outlined by Grefenstette (1998b), and Ballesteros & Croft (1996).

Ballesteros & Croft (1996) and Ballesteros & Croft (1997) report that MRD translation of queries can lead to a drop in effectiveness of between 40% and 60% as compared with monolingual performance. They ascribe this to three primary factors: a lack of specialised vocabulary in the dictionary, the introduction of ambiguity from the translation process, and the failure to translate multi-term concepts such as phrases.

Ballesteros & Croft (1996) report significant improvement in effectiveness if dictionary translation is augmented with pseudo-relevance feedback (see the section on pseudo-relevance feedback below) both before and after translation. They report improvements of between 16% and 34% for pseudo-relevance feedback applied before translation and between 14.3% and 47.5% when applied after translation. When combined, the two stages of pseudo-relevance feedback produce improvements of between 40% and 51% (Ballesteros & Croft (1996)).

Ballesteros & Croft (1997) show that although phrasal translation may improve effectiveness, it is extremely sensitive to poor translation. A single poor phrase translation may undo the good work of several accurate translations (Ballesteros & Croft (1997)). They also observe that MRDs do not provide sufficient context for good phrasal translation of most sorts of phrase.

[1] In different work terms can be words, phrases, or either.

[2] These are usually the electronic analogues of the traditional bilingual dictionaries used by linguists.
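The basic word-by-word lookup described above, and the query expansion it causes, can be illustrated with a short Python sketch. The dictionary contents are placeholder tokens and the helper names (`translate_query`, `expansion_factor`) are hypothetical; nothing here reproduces a real MRD or any system cited in this section.

```python
from typing import Dict, List


def translate_query(query: List[str], mrd: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """For each source term, list every translation the dictionary offers.

    Terms missing from the dictionary map to an empty list, which mirrors
    the coverage problem (e.g. missing specialised vocabulary).
    """
    return {term: sorted(mrd.get(term, [])) for term in query}


def expansion_factor(translated: Dict[str, List[str]]) -> float:
    """Average number of target terms generated per source term."""
    if not translated:
        return 0.0
    return sum(len(v) for v in translated.values()) / len(translated)


# Placeholder dictionary: "s1" is ambiguous, "s2" is not, "s3" is not covered.
toy_mrd = {"s1": ["t1a", "t1b", "t1c"], "s2": ["t2a"]}
translated = translate_query(["s1", "s2", "s3"], toy_mrd)
factor = expansion_factor(translated)
```

Keeping the translations grouped by source term, rather than flattening them into one bag of words, is a design choice that makes later steps such as grouping alternatives under a single operator straightforward.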

Ballesteros & Croft (1998a) continued their work using MRDs as the basis for translation with great success. By using a combination of the relevance feedback techniques with part of speech tagging [1], the INQUERY synonym operator (see the section on the INQUERY synonym operator below), and better phrase translation, they have achieved better than 90% of monolingual performance in their experiments. The advanced phrase translation technique Ballesteros & Croft (1998a) employed uses the hypothesis that terms correctly translated from a phrase will preferentially co-occur in the target corpus. They achieve this translation by testing all the combinations of the various translation options for the terms of a phrase and choosing the combination that co-occurs most frequently in the target corpus. By combining this output with their existing dictionary phrase translation approach, they improve average precision by some 31% over simple word-for-word translation.

Pseudo-Relevance feedback

The idea behind relevance feedback dates back at least 20 years (Salton & Buckley (1990)). The basic concept is that the query received by a retrieval system is often quite short, and better retrieval will result if the system can assist the user in extending the query with additional appropriate terms (Sparck Jones & Willett (1997)). Numerous different techniques can select appropriate candidate terms; however, they all aim to select terms that will be good discriminators of relevant documents. The principle is that terms occurring frequently in relevant documents and infrequently in non-relevant documents are good discriminators for the relevant documents (Sparck Jones & Willett (1997)).

In a normal relevance feedback system, after the user has retrieved an initial list of documents they can indicate which are actually relevant. The system then uses this information to determine the most discriminating terms for the relevant set as compared to the rest of the corpus. The system then adds these terms to the query (sometimes giving different weights or priorities to the various terms) and re-executes the query. Experiments have shown that this technique is generally very effective at improving retrieval (Sparck Jones & Willett (1997)).

[1] Part of speech tagging is a technique where the words in the query are processed to determine whether they are nouns, verbs, adjectives etc. This allows any subsequent translation to take account of the part of speech, thus reducing ambiguity.
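The term-selection principle behind relevance feedback can be shown with a small Python sketch. The scoring rule used here (the fraction of relevant documents containing a term minus the fraction of non-relevant documents containing it) is an assumption chosen for simplicity; it is not the weighting used by any of the systems cited above, and the function name and example documents are invented.

```python
from collections import Counter
from typing import List, Set


def expansion_terms(relevant: List[List[str]],
                    non_relevant: List[List[str]],
                    query: Set[str],
                    k: int = 2) -> List[str]:
    """Pick the k terms that best separate relevant from non-relevant documents."""
    if not relevant:
        return []
    # Document frequencies: count each term once per document.
    rel_df = Counter(t for doc in relevant for t in set(doc))
    non_df = Counter(t for doc in non_relevant for t in set(doc))

    def score(term: str) -> float:
        return rel_df[term] / len(relevant) - non_df[term] / max(len(non_relevant), 1)

    # Only terms seen in relevant documents and not already in the query are candidates.
    candidates = [t for t in rel_df if t not in query]
    candidates.sort(key=lambda t: (-score(t), t))  # best score first, ties alphabetical
    return candidates[:k]


# Invented toy documents, already tokenised.
relevant_docs = [["fish", "river", "trout"], ["fish", "trout", "net"]]
other_docs = [["fish", "market", "price"], ["river", "bank", "price"]]
new_terms = expansion_terms(relevant_docs, other_docs, {"fish"})
```

In this toy case `trout` appears in every relevant document and in no other, so it ranks first, while `river` is equally common in both sets and is not selected. The expanded query would be the original terms plus `new_terms`.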

Pseudo-relevance feedback (sometimes called local feedback) differs from normal relevance feedback in that it assumes that the top n retrieved documents are relevant (Sparck Jones & Willett (1997), Ballesteros & Croft (1997)). The system then completes the process of determining the discriminating terms and re-executing the query without any intervention by the user.

As discussed above, Ballesteros & Croft (1996) introduced pseudo-relevance feedback as a pre-translation step. By using a corpus in the source language, the relevance feedback step introduces further terms that are relevant to the query and whose translations may act to disambiguate the translations of the other query terms. Ballesteros & Croft (1997) found that pre-translation pseudo-relevance feedback strengthened the base for the translation and improved precision. However, they also found that the effect was limited by the tendency to introduce inappropriate translation terms (Ballesteros & Croft (1996)).

The INQUERY synonym operator

Within many IR systems, the main factor in determining the relevance of a document to a query is the frequency with which the query terms occur within the document corpus (Salton & Buckley (1988)). Systems measure the frequency of a term in two ways. The term frequency or "tf" reflects how many times a term occurs within a particular document. The inverse document frequency or "idf" is inversely proportional to the number of documents containing the term (Salton & Buckley (1988), Ballesteros (2000)). The INQUERY [1] system uses a "belief score" (Ballesteros (2000 pg. 8)) based on these two measures. It is also normal for IR systems to give weight to a term proportional to the number of times it occurs within a query (Ballesteros (2000)).

The INQUERY synonym operator groups together a set of words within a query. When used to determine the belief score of a document, INQUERY treats all occurrences of the words in the synonym operator as occurrences of a single pseudo-term whose document frequency ("df") is the sum of the df's for each word. This has the effect of de-emphasising those words in the group that occur infrequently within the corpus (Ballesteros & Croft (1998a), Ballesteros (2000)). As a second effect, the synonym operator normalises for the number of words representing a concept. Consider the situation where different numbers of terms represent two different concepts of equal importance in a query. If each group of terms is enclosed in a synonym operator, the INQUERY system will give the two concepts equal weight (Ballesteros (2000)).

The first effect is useful in de-emphasising archaic senses of a translated term that may be present within an MRD (Ballesteros & Croft (1998a)) but that occur infrequently in the corpus. The second effect is useful for normalising the different numbers of translations that two separate terms may generate (Ballesteros (2000)). Ballesteros (2000) reports that using the synonym operator in MRD based CLIR yields improvements of greater than 45% over simple word-by-word translation alone. Ballesteros & Croft (1998a) report similar results.

[1] For further detail on the INQUERY system see Broglio, et al. (1994), and INQUERY (2000).
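The first effect of the synonym operator can be illustrated numerically. The Python sketch below pools the statistics of a synonym group as described above and scores them with a plain tf x log(N/df) weight. That weight is a stand-in chosen for the example and is not INQUERY's belief function; the function names, the corpus figures, and the cap of the pooled df at the collection size are likewise assumptions.

```python
import math
from typing import Dict, List, Tuple


def synonym_stats(group: List[str], doc: List[str],
                  df: Dict[str, int], n_docs: int) -> Tuple[int, int]:
    """Pool a synonym group into one pseudo-term: summed tf, summed df."""
    tf = sum(doc.count(word) for word in group)
    pooled_df = min(sum(df.get(word, 0) for word in group), n_docs)  # cap is an assumption
    return tf, pooled_df


def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """Stand-in weight (not INQUERY's belief score): tf * log(N / df)."""
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(n_docs / df)


# Invented corpus statistics: a common translation and a rare, archaic one.
N_DOCS = 1000
doc_freq = {"common": 200, "archaic": 2}
document = ["archaic", "text"]

alone = tf_idf(document.count("archaic"), doc_freq["archaic"], N_DOCS)
tf, pooled_df = synonym_stats(["common", "archaic"], document, doc_freq, N_DOCS)
grouped = tf_idf(tf, pooled_df, N_DOCS)
```

On its own the rare word receives a very high weight because its df is tiny; inside the group it shares the pooled df of 202, so a document that matches only the archaic translation scores far lower than it would under ungrouped word-by-word translation.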

18 1.4 Why Transitive CLIR? The European community has 11 official languages (Siebelink (1997)). This suggests that to translate queries between all possible pairs of languages would require 55 different bilingual resources. If the many unofficial European languages were included, the effort to maintain let alone create these resources would clearly become untenable. Even without this numerical consideration, many pairs of common languages have quite limited translation resources. By using an intermediate or pivot language for which good translation resources exist, transitive CLIR aims to reduce this problem (Ballesteros (2000)). The use of such pivot languages has been reported by a number of researchers (Braschler, et al. (1999a), Fluhr, et al. (1997), Hiemstra & Kraaij (1998), Gey et al. (1998), Franz et al. (1998), Littman, et al. (1998), Ballesteros (2000)). Significantly, apart from Ballesteros, these researchers have only used a transitive approach because no other resources were available to them. This illustrates that good translation resources can be hard to find even for well-funded researchers using common EU languages. Of those researchers cited above, only Ballesteros (2000) has researched explicitly the effect of using a transitive scheme and techniques to overcome some of its shortcomings. The principal problem Ballesteros discusses is the introduction of ambiguity. Concern with this issue was also reflected in comments by Fluhr et al. (1997). Ballesteros (2000) initially sets out to confirm for the language pair Spanish and French, the earlier work by Ballesteros & Croft (1996), Ballesteros & Croft (1997), and Ballesteros & Croft (1998a) which reported on dictionary based CLIR with Spanish and English. In doing so Ballesteros (2000) reports that word-by-word translation achieves 50%- 60% monolingual performance and that word ambiguity accounts for some 29% of the shortfall. 
Ballesteros (2000) attributes 40% of the shortfall to the failure to translate phrases. Ballesteros (2000) goes on to examine the impact of transitive translation, discovering that simple word-by-word transitive translation from Spanish to English to French degrades performance by 91% when compared to word-by-word translation direct from Spanish to French. Ballesteros (2000) attributes this to the increase in ambiguity brought by transitive translation.

Ballesteros (2000) goes further, attempting to reduce the ambiguity introduced by transitive translation using the techniques developed by Ballesteros & Croft (1996), Ballesteros & Croft (1997), and Ballesteros & Croft (1998a). These techniques include the use of the INQUERY synonym operator and pseudo-relevance feedback. The synonym operator is particularly effective at reducing the ambiguity, reducing the differential between the direct translation and the transitive from -91% to -34%. By applying all of the various disambiguation techniques developed by Ballesteros & Croft (1996), Ballesteros & Croft (1997), and Ballesteros & Croft (1998a) at different stages in the transitive translation, the results can be further improved. Ballesteros (2000) is able to obtain an average precision figure for transitive translation at 67% of the monolingual performance in the target language. This compares favourably with the 79% of monolingual performance obtained from a direct translation approach.

It is interesting to note that the institutions of the European Community frequently use a form of pivot or "relay" simultaneous interpretation to support meetings and conferences. In this approach, interpreters translate into a common pivot language, and other interpreters then take this spoken text and interpret it for the various target listeners. This technique is used when an interpreter for a particular pair of languages is not available for a particular meeting or conference (MacKintosh (1998/99)).

1.5 Lexical Triangulation

I hope, by adopting a scheme of lexical triangulation, to be able to demonstrate a reduction in the ambiguity introduced by transitive translation and consequently demonstrate an improvement in retrieval effectiveness. Informally, the principle behind the approach is to average out the random noise introduced by transitive translation via the different pivots, leaving only the common signal present in both translation routes.

Consider the German word fisch: a German to Spanish translation gives the two terms pez and pescado, whereas translating to Dutch gives vis.
Taking each of these in turn, translating the Spanish terms to English gives pitch, fish, tar, food fish, while translating the Dutch to English gives pisces the fishes, pisces, fish. Each of the transitive translations has introduced quite a lot of translation noise and ambiguity. If we take the term that is in common from the two transitive translations, we have fish, a good and unambiguous translation of the original German word. This illustrates the principle of lexical triangulation.
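The fisch example can be expressed as a set intersection. The sketch below is illustrative only: the function name `triangulate` is hypothetical, the grouping of the English output into multi-word terms ("food fish", "pisces the fishes") is one reading of the lists above, and the behaviour when the routes share no term is simply left as an empty result, since the example does not cover that case.

```python
def triangulate(*routes):
    """Return the target-language terms common to every transitive route."""
    term_sets = [set(route) for route in routes]
    if not term_sets:
        return set()
    # Terms that survive are those produced via every pivot language.
    return set.intersection(*term_sets)


# English terms reached from the German word "fisch" via each pivot.
via_spanish = {"pitch", "fish", "tar", "food fish"}
via_dutch = {"pisces the fishes", "pisces", "fish"}

common = triangulate(via_spanish, via_dutch)
print(common)  # {'fish'}
```

Because the function accepts any number of routes, the same sketch extends to triangulation between three transitive translations by passing a third set of terms.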

1.6 This dissertation

1.6.1 Background

The genesis of this dissertation came from Dr Mark Sanderson, with the idea that lexical triangulation could be used to reduce the ambiguity introduced by transitive translation. An initial search of the literature in the autumn of 1999 revealed that there was little published work on transitive CLIR and, at the time, no mention of lexical triangulation or similar techniques. Subsequently Ballesteros published her paper on transitive CLIR using Machine Readable Dictionaries (Ballesteros (2000)), in the summary to which she suggests the possibility of comparing the output of two different transitive translations.

The success reported by Ballesteros (2000) still leaves a gap in performance between direct translation approaches and transitive translation. If this investigation can demonstrate a beneficial effect from using lexical triangulation alone to disambiguate MRD based transitive translation, then it is also important to demonstrate that lexical triangulation can be combined with other techniques to further improve performance.

There are at least two IR systems available to researchers within the Information Studies (IS) department at Sheffield University: the INQUERY system as used by Ballesteros (2000), and the GLASS system developed by Dr Mark Sanderson (Sanderson (2000)). The presence of the INQUERY system makes it possible to examine the effect of the synonym operator. The GLASS system has components that can be re-configured to implement simple pseudo-relevance feedback. Thus, in addition to the basic aim of demonstrating the positive effect of lexical triangulation, this investigation will examine the effect of combining lexical triangulation with the INQUERY synonym operator and pre-translation pseudo-relevance feedback. This will enable comparison between the results of this investigation and the results reported by Ballesteros (2000).
The history of IR in general runs parallel with the history of IR evaluation (Ellis (1996)), and in that respect I feel that CLIR is no different. For the last nine years TREC has been pre-eminent in the evaluation of IR systems (NIST (2000)). In particular the TREC CLIR track has both supported and motivated much of the recent upsurge in CLIR research (Braschler et al. (1999b), Braschler et al. (1998), Schauble & Sheridan (1997)). This year, TREC's work on CLIR with European languages has moved to Europe under the auspices of the Cross-Language Evaluation Forum (CLEF) (Peters (2000)).

The Information Studies (IS) department at Sheffield University is beginning a major CLIR project (CLARITY) 1. For the time being, the IS department does not have a strong presence in the CLIR community. Dr Mark Sanderson has registered the IS department with the CLEF to enable this and other work to access the evaluation resources provided by the CLEF and submit an entry to the CLEF 2000 workshop. By submitting a contribution to CLEF, I hope to raise Sheffield's profile in the CLIR community in advance of results from the CLARITY project. The CLEF queries and corpus will provide another evaluation environment to confirm any results found in the other experiments.

Ballesteros (2000) reports a number of results that suggest that some of the techniques of transitive translation may not be applied easily to all European languages. By examining triangulation between more than one pair of European languages, I hope to be able to illuminate this issue.

1.6.2 Aims

The overall aim is to see if lexical triangulation does produce a beneficial reduction in ambiguity and to examine any interaction with the INQUERY synonym operator and pre-translation pseudo-relevance feedback. A secondary aim is to submit a contribution to CLEF on behalf of the IS department. Finally, as time and resources permit, I aim to examine triangulation between different pairs of pivot languages, and triangulation between three transitive translations, to see if this provides further improvements in transitive translation for CLIR. These aims give rise to the Objectives outlined in section 2.

In keeping with the work of Ballesteros & Croft (1996), Ballesteros & Croft (1997), Ballesteros & Croft (1998a), and Ballesteros (2000), this investigation will simulate a basic Machine-Readable Dictionary (MRD) approach to CLIR. This investigation will examine transitive cross-language information retrieval between German queries and an English corpus using Dutch, Spanish and Italian as intermediate pivot languages.
1 The CLARITY project is currently in the contract negotiation phase with the European Commission.

1.6.3 Structure of the Work

The work for this dissertation is divided into three main phases.

- The initial phase of trial and error development, and confirmation that the basic concept of lexical triangulation was sound.
- The development of the basic transitive translation processing, followed by the experiments to measure the effects of lexical triangulation. This culminated in the production of the CLEF submissions.
- The further development of the processing pipelines to introduce pre-translation pseudo-relevance feedback, and triangulation between three translation routes. This culminated in further experiments to discover the effectiveness of these different techniques.

1.6.4 Report Structure

The report is structured as follows:

- Introduction.
- The Objectives for the work.
- The Methodology and Resources used.
- A description of the Experimental Environment, including a description of all of the components created or used. This section will provide an overall understanding of the components used in the different individual experiments.
- A description of the various experiments conducted, including some discussion and analysis of the results they produced. This will show how the results obtained from each experiment motivated further experiments. These sections will reflect the three main phases outlined above.
- An overall discussion of the most interesting results and a drawing together of the analysis.
- Conclusions and Recommendations.
- Finally a Bibliography and Appendixes.

2 Objectives

- To discover if adopting a lexical triangulation approach can reduce the ambiguity introduced in MRD based transitive CLIR and improve retrieval effectiveness.
- To confirm previous work that indicates a loss of effectiveness when using a transitive approach to CLIR (as compared to a normal direct translation approach) (Ballesteros (2000)).
- To investigate whether any beneficial effects of lexical triangulation are affected by the use of disambiguation techniques applied by others in the field of CLIR. Such techniques include pre-translation query expansion using pseudo-relevance feedback (Ballesteros & Croft (1996)), and the use of the synonym operator in the INQUERY system (Pirkola (1998), Ballesteros & Croft (1998a), Oard et al. (1999)).
- To investigate the effect of triangulating between three different transitive translations.
- To investigate whether different language pairings affect the overall results of lexical triangulation.
- To represent Sheffield University Information Studies Department at the CLEF 2000 Workshop (Peters (2000)).

3 Methodology and Resources

3.1 Underlying Philosophy

The underlying philosophy of this investigation is the KIS concept (Keep It Simple). This philosophy particularly drives the choice of techniques investigated but also, to some extent, the methods of evaluation used. The hope is to discover how successful a CLIR system could be using only the minimum of language resources and the minimum of sophisticated processing.

In keeping with this simple philosophy, the basic approach adopted uses resources in a form that simulates a Machine Readable Dictionary (MRD). The systems will use this MRD to translate terms in the query into the language of the corpus, in a word-by-word fashion. This basic approach is outlined by Grefenstette (1998b) and Ballesteros & Croft (1996).

The aim is to evaluate the underlying basic algorithms that a developer might use in future to develop a usable CLIR system. As such, there is no "user interface" to any of the systems developed by this investigation. The systems process all of the queries or results of retrievals as files for batch execution or later examination.

In the same spirit of simplicity, I have not attempted to make any of the algorithms used particularly efficient; of more importance is the transparency and clarity of the processing, so that any effects can be analysed and processes modified easily.
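Word-by-word translation with a simulated MRD, and its chaining through a pivot language, can be sketched as below. This is a minimal illustration under stated assumptions, not the processing actually built for this investigation: the function names are hypothetical, keeping untranslatable terms unchanged is an assumed policy, and the two tiny dictionaries contain only the Dutch route of the fisch example from section 1.5.

```python
def translate_word_by_word(terms, dictionary):
    """Replace each term by all of its dictionary translations, in order."""
    translated = []
    for term in terms:
        # Assumption: a term missing from the dictionary is kept unchanged.
        translated.extend(dictionary.get(term, [term]))
    return translated


def translate_transitively(terms, source_to_pivot, pivot_to_target):
    """Translate source terms into the pivot language, then into the target."""
    pivot_terms = translate_word_by_word(terms, source_to_pivot)
    return translate_word_by_word(pivot_terms, pivot_to_target)


# Toy dictionaries holding only the entries from the fisch example.
german_to_dutch = {"fisch": ["vis"]}
dutch_to_english = {"vis": ["pisces the fishes", "pisces", "fish"]}

english_terms = translate_transitively(["fisch"], german_to_dutch, dutch_to_english)
print(english_terms)
```

Every translation of every pivot term is carried forward, which is how the transitive route accumulates the extra ambiguity discussed in section 1.4.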

3.2 Evaluation Methodology

A traditional Cranfield style methodology is the basic evaluation technique used in this investigation (Ellis (1996)). This uses a set of queries and corpora that have predetermined relevance judgements, together known as a collection. The significant effects of user interaction and difficulties in obtaining relevance judgements are thereby minimised. The intention is to measure the effects of the different techniques independently of these important factors.

The investigation uses collections derived from TREC8 and the CLEF (see below), and the experiments adopt the specific methodology used by the TREC and CLEF for comparing different runs of a retrieval engine. Van Rijsbergen (1979) and Harman (1994) describe the methodology in detail. The aim is to compare the Recall and Precision behaviour of a CLIR system under controlled conditions. The different experiments implement different aspects of the system and different techniques. The experiments can thus compare the results of different runs against a common baseline to enable conclusions to be drawn.

Zobel (1998) has examined the TREC evaluation approach and concluded that the results produced are reliable. He also observes that the Wilcoxon signed-rank test is a reliable test for significance and a good discriminator of systems. Zobel (1998), however, raises some concerns about measures based on Recall. Voorhees (1998) has also examined the TREC methodology and confirmed the ability of the TREC collections to discriminate between different retrieval strategies, despite possible variations in relevance judgements.

The investigation uses average precision as the measure for comparing the different runs, although I also report interesting features of other measures if appropriate. In line with Zobel (1998), I report statistical significance from the Wilcoxon test, although I also report significance from the sign test where it occurs.
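For readers unfamiliar with the measure, the sketch below computes non-interpolated average precision for a single query and the percentage difference of a run against a baseline. It assumes the usual TREC-style definition (the mean of the precision values at the rank of each relevant document retrieved, divided by the total number of relevant documents); the function names and the document identifiers are invented for the illustration, and the per-query figures would then be averaged over all queries in a run.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Non-interpolated average precision for one query."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank where this relevant document appears.
            precision_sum += hits / rank
    # Relevant documents never retrieved contribute zero to the sum.
    return precision_sum / len(relevant)


def percent_change(run_score, baseline_score):
    """Difference of a run from a baseline, as a percentage of the baseline."""
    return 100.0 * (run_score - baseline_score) / baseline_score


# Invented ranking and relevance judgements for one query.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d3", "d2", "d9"}

ap = average_precision(ranking, relevant)
print(ap)                         # 0.5
print(percent_change(0.25, 0.5))  # -50.0
```

A run described as a given percentage below monolingual corresponds to a negative value from `percent_change` with the monolingual run as the baseline.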

3.3 Choice of Languages

The availability of resources was the overriding factor determining the choice of languages for this investigation. Having registered to take part in the CLEF workshop, a number of evaluation resources became available. The CLEF, in conjunction with the TREC, made available to participants some of the collections from previous TREC CLIR evaluations. The CLEF offers participants a number of possible "tracks" in which to participate. The IS department registered for the "bilingual" track. The "bilingual" track experiments involve CLIR between one of a set of European languages and a collection in English. The European languages concerned are English, French, German, Italian, Dutch, Finnish, Spanish, and Swedish. In addition to the main CLEF experiment, the CLEF made available a training collection for the bilingual track consisting of the English TREC8 CLIR corpus and queries, with matching relevance judgements, in English, French, German, and Italian.

The other constraining factor in the choice of languages for this investigation was the availability of translation resources. The Department of Computer Science at Sheffield University (DCS) was one of the collaborators on the EuroWordNet project. Discussions with Wim Peters of the DCS confirmed that EuroWordNet was available for this investigation. EuroWordNet is a multilingual database consisting of WordNets for various European languages (Dutch, Italian, Spanish, German, French, Czech and Estonian) (Vossen (1999)). The intention of the EuroWordNet project (in 1997) was to develop a database with WordNets for a number of European languages similar to, and linked with, the Princeton WordNet 1.5 (Vossen (1997), Miller et al. (2000)).

Discussions with Wim Peters, who was involved in the EuroWordNet project, suggested that the best choice of query language would be German, as the coverage of German in EuroWordNet is reasonable.
Further discussion indicated that Dutch, Spanish and Italian would be good choices as pivot languages since they offered the best coverage in EuroWordNet. I have described the structure of EuroWordNet and its processing to simulate an MRD, together with other language resources, in a section below.
