Dictionary Based Transitive Cross-Language Information Retrieval Using Lexical Triangulation

A study submitted in partial fulfilment of the requirements for the degree of Master of Science in Information Management at The University of Sheffield by Timothy John Gollins, September 1st 2000

Abstract

The purpose of this dissertation is to investigate dictionary based cross language information retrieval using the technique of lexical triangulation. Lexical triangulation is a technique for combining the results of different transitive translations. A transitive translation uses an intermediate or pivot language to translate between two languages when no direct translation resource is available. The research took queries in German and translated them via Spanish, Dutch, or Italian into English. It compared the results of retrieval experiments using these transitive queries with other queries created by combining the transitive translations (lexical triangulation) or created by direct translation.

Direct dictionary translation of a query introduces considerable ambiguity that damages retrieval results compared to the monolingual case; in this research, average precision was 79% or more below monolingual. Transitive translation introduces more ambiguity, giving results more than 80% below direct translation in this case. This research demonstrates that lexical triangulation between two transitive translations can eliminate the additional ambiguity introduced by transitive translation and achieve performance comparable with direct translation. Lexical triangulation between three transitive translations in some circumstances outperformed direct translation by 22%, achieving results 59% below monolingual.

This research also demonstrated that pre-translation pseudo-relevance feedback can be combined with direct and triangulated translation to achieve considerable performance improvements. Direct translation results improved by up to 77% and triangulated translation by over 73%. The direct and triangulated results remain comparable under these changes.

The retrieval experiments used the GLASS system developed by Dr Mark Sanderson and INQUERY from the University of Massachusetts. Use of the INQUERY synonym operator eliminated the ambiguity of transitive translation. However, the synonym operator did not improve the direct translation results significantly and degraded the results of lexical triangulation.

These experiments used language resources from the EuroWordNet and CELEX databases and text collections from both the TREC8 CLIR track and CLEF 2000 workshops. This research submitted results to the CLEF 2000 workshop.

Acknowledgements

I would like to thank the following people for help and support with this dissertation. Dr Mark Sanderson, for the excellent supervision, encouragement and support he has given me throughout this dissertation. Wim Peters, for his support and encouragement, and for the supply of the language resources and advice that made much of the work possible. Jessica Peel-Yates, for some excellent suggestions concerning comparisons between pivot languages. Asaad Alberair, for his friendship and advice on matters concerning IR evaluation measures. Mark McCree, for reading and commenting on the draft of this dissertation. Laura Tassoni, for her friendship and support during the dark days of Semester 1. Karen McFarlane, for her support and encouragement to take up the Masters Course and, through her, my employers for their financial support throughout the course that made it possible. Jeremy, who had to put up with Daddy going off to the Sheffield Factory every Sunday night.

Most importantly, I must thank my wife Hazel, for her unending support and encouragement. She first encouraged me to take up the course and secondly put up with the enormous disruption caused by my weekly commuting. Thank you.

Table Of Contents

1 Introduction Overview Motivation and Background for CLIR CLIR - A summary of techniques Controlled Vocabulary Early CLIR More Recent approaches Machine Translation (MT) Translate the document corpus Translate the queries Comparable or Parallel Corpus Based techniques Translating using parallel corpora No Translation Needed - Vector approaches Dictionary-based techniques Background Pseudo-Relevance feedback The INQUERY synonym operator Why Transitive CLIR? Lexical Triangulation This dissertation Background Aims Structure of the Work Report Structure Objectives Methodology and Resources Underlying Philosophy Evaluation Methodology Choice of Languages Evaluation Resources Corpora Queries Relevance Judgements Language Resources EuroWordNet Background EuroWordNet in this investigation CELEX CELEX Background CELEX as used in this investigation SDA German Language Corpora for Pseudo-relevance feedback Technical Resources IR Systems GLASS INQUERY Programming Languages Awk Shell Scripts Experimental Environment: An Overview The Processing Pipeline Pre-translation Processing

Translation Processing Merge Processing Pre Retrieval Processing Experimental Components File Formats GLASS Format Extended GLASS Format INQUERY (batch) query file format Pre-translation Parsing the Queries Normalisation Lemmatisation background The basic lemmatiser The new lemmatiser (a variant of the above) Lemmatiser Coverage German stop-words module (Pre-translation) Pseudo-Relevance Feedback process Translation Pre Processing of EuroWordNet Simple Translation Translation with Cognate Spotting Merging Intersection Only Merge Full Merge Multi-Merge 2plus Multi-Merge Full Pre Retrieval Format conversion Stop-words English Stemming Retrieval and Post Retrieval GLASS INQUERY trec_eval_new Corpus Indexing GLASS English GLASS German (for pre-translation feedback) INQUERY The Experimental Story - A Step By Step Approach Phase 1 Is the basic idea sound? An Initial Study Experiment Results Analysis and Discussion First Automatic Translations Experiment Results Discussion Phase 2A The basic experiments - GLASS First approach Experiment Results Analysis Full or intersection merging Experiment Results and Analysis Porter Vs WordNet Experiment

Results Analysis To flatten or not to flatten Experiment Results Analysis Introducing German Stop-words Experiment Results Analysis Phase 2B - The basic experiments INQUERY Query Structure Simple approach Experiment Results Analysis Phrases Experiment Results Analysis Synonym Experiment Results Discussion and Analysis Synonym and Phrases Experiment Results Analysis Interim Evaluation The CLEF Submission Experiment Results Analysis and Discussion Phase 3 - The New Experimental Direction Coverage of the German WordNet The New Lemmatiser Coverage of the German WordNet Retrieval Evaluation of New lemmatiser GLASS Experiment GLASS Results GLASS Analysis INQUERY Experiment INQUERY Results INQUERY Analysis How effective is the compound splitting Experiment Results Analysis Are cognates important? Only Translated terms Experiment Results Analysis Cognate spotting Translation Experiment Results Analysis Pre-translation Query Expansion (Pseudo-Relevance Feedback)

5.9.1 Background Approach Initial development Optimising the GLASS parameters Optimising Experiment Optimising Results Evaluation Experiment Results Analysis Focused Feedback Optimising Experiment Optimising Results Evaluation Experiment Results Analysis Optimising INQUERY parameters Normal Feedback Experiment Optimising Results Focused Feedback Experiment Optimising Results Evaluation Experiment Discussion Multiple Transitive Translation GLASS Experiments Results Analysis INQUERY Experiment Results Analysis Discussion Combining Pre-translation Expansion with Multiple Transitive Translation GLASS Experiments Experiment Results Analysis INQUERY Experiments Experiment Results Analysis Discussion Discussion Transitive CLIR and ambiguity How to discuss translation effectiveness Lexical Triangulation Why did bilingual do so poorly The problems (possibly) Possible solutions The Synonym operator The different tools available to the experimenter Conclusion and Recommendations Objectives Achieved The Future for Lexical Triangulation Research Bibliography

1 Introduction

1.1 Overview

The field of "Cross Language Information Retrieval" (CLIR) has emerged at the intersection of research into Machine Translation and conventional, or monolingual, Information Retrieval (IR) (Grefenstette (1998c)). CLIR addresses the situation where the query that a user presents to an IR system is not in the same language as the corpus of documents they wish to search. This situation presents a number of challenges (Grefenstette (1998c)), but primary amongst these is the problem of crossing the language barrier (Schauble & Sheridan (1997)). Almost all the approaches to this problem require access to some form of rich translation resource to map terms in the query language (the source) to terms in the corpus (the target).

Transitive CLIR aims to address the situation where there are limited direct translation resources available (Ballesteros (2000)). A transitive CLIR system translates the source language terms by first translating the terms into an intermediate or "pivot" language and then translating the resulting terms into the target language. Thus, a transitive system could translate a query from German to English via Dutch, Spanish, or another language.

The main aim of this work is to combine translations from two different transitive routes to discover if this can reduce the ambiguity inevitably introduced by transitive translation. Ballesteros suggested the possibility of using this approach in the summary to her recent paper (Ballesteros (2000)). I have chosen to call this approach lexical triangulation. See figure 1 below.

[Figure 1. Lexical Triangulation - an example. The German query is translated to Dutch and to Spanish; the Dutch and Spanish query terms are each translated to English; the two sets of English terms are combined to form the English query.]
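The flow in Figure 1 can be sketched in a few lines of Python. This is only a minimal illustration, not the processing pipeline used in this work. The dictionary entries are invented toy data, the function names are hypothetical, and the sketch assumes that "combining" means keeping only the English terms that every transitive route produces (a set intersection).

```python
# Toy bilingual dictionaries; every entry is invented for illustration only.
DE_NL = {"fisch": ["vis"]}
DE_ES = {"fisch": ["pez", "pescado"]}
NL_EN = {"vis": ["fish", "pisces"]}
ES_EN = {"pez": ["fish", "pitch"], "pescado": ["fish"]}


def transitive(term, source_to_pivot, pivot_to_target):
    """Translate a term into the pivot language, then each pivot term into the target."""
    targets = set()
    for pivot_term in source_to_pivot.get(term, []):
        targets.update(pivot_to_target.get(pivot_term, []))
    return targets


def triangulate(term, routes):
    """Keep only the target terms produced by every transitive route (assumed merge rule)."""
    results = [transitive(term, first, second) for first, second in routes]
    return set.intersection(*results) if results else set()


via_dutch = transitive("fisch", DE_NL, NL_EN)
via_spanish = transitive("fisch", DE_ES, ES_EN)
combined = triangulate("fisch", [(DE_NL, NL_EN), (DE_ES, ES_EN)])
print(sorted(via_dutch), sorted(via_spanish), sorted(combined))
```

In this toy data each route contributes one spurious term of its own, and only the term common to both routes survives the combination.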

1.2 Motivation and Background for CLIR

In an enlarging European community, and with the United Kingdom's new emphasis on regional politics, information in a multiplicity of languages is becoming more important. As these various institutions grow, mature, and generate information, the need for native speakers of one language to find information recorded in another will increase. Grefenstette (1998b) points out that the elimination of language barriers by the adoption of a single universal language has been a pipe dream held by many. However, the reality of the world is that these barriers exist and will remain. He also observes that, with the explosion of electronic information on the World Wide Web and multinational corporate Intranets, the ability to find information across language barriers is becoming a commercial necessity. This view is echoed in the introduction to several papers by Oard (Oard (1997a), Oard (1997b), Oard (1998)) and by Ballesteros & Croft in the introduction to one of their papers (Ballesteros & Croft (1996)).

A scenario often presented for the use of CLIR is one where a user can understand documents in the language of the corpus, but is unable to express a query in that language (Oard (1997a), Braschler, et al. (1998)). Both Oard and Braschler et al. also describe the scenario where a user has no skills in the corpus language, but has access to expensive translation resources. In this scenario, the user will only wish to submit the best documents for translation. In both these situations, the expectation that the retrieved documents will match the information need is paramount.

One can also envisage other scenarios (Oard & Dorr (1996)), for example a polyglot who wishes to conduct a search in many languages simultaneously and does not want the significant additional effort of reformulating their request into all of the target languages. Finally, there is the scenario where the primary information is not language specific in nature; that is, it may be multimedia information such as sound, image or video. In this case, the primary index to the material may be textual, but not in a language known to the user. The results of this sort of search may not require any translation at all.

1.3 CLIR - A summary of techniques

A number of papers have summarised the field of CLIR in recent years (Oard & Dorr (1996), Braschler et al. (1998), Grefenstette (1998a), Hull & Grefenstette (1996), Oard (1997a)). Each of these has attempted a categorisation of the techniques of CLIR. If these different categorisations are examined, it becomes clear that they are based on the language resources used to cross the language barrier. The following list gives an overview of CLIR techniques.

Controlled Vocabulary. These techniques use the terms in a thesaurus to index the documents in the collection. The system maps the terms in the query through the thesaurus in the same way. The system then compares document and query based on the thesaurus terms they have in common. Fluhr (1996) provides a short general description of this approach.

Machine Translation. These techniques use a conventional machine translation system to translate either the query or the document corpus so that both are in the same language. This approach then uses a conventional monolingual retrieval system to index and retrieve the documents (Oard (1998)).

Comparable or Parallel Corpus Based techniques. The common link between all corpus techniques is the use of corpus resources as the basis for training the IR system, or in constructing information structures, to be used in retrieval or translation (Braschler et al. (1998)).

Dictionary Based techniques. As far as I am aware, all work in this field is based on translating the queries into the language of the document corpus. The motivation behind this approach is that the resources needed for conventional Machine Translation and for work with parallel corpora are both rare and expensive. The researchers in this area believe that bilingual dictionaries in a machine-readable form are more widely available (Ballesteros & Croft (1996), Ballesteros & Croft (1997), Ballesteros & Croft (1998a)). These dictionaries are often the electronic analogues of the normal, hardcopy, bilingual dictionaries used by linguists everywhere.

In the following sub-sections, I will briefly examine some of the techniques that fall into these categories.

1.3.1 Controlled Vocabulary

Early CLIR

The earliest work on CLIR used controlled vocabularies. The work of Salton on English, French and German and Pevsner on English and Russian during the 1970s is described in Oard & Dorr (1996). The techniques involved taking controlled vocabulary thesauri already developed for the monolingual retrieval systems SMART and PNP-2 respectively, and manually adding terms in the foreign language to each thesaurus concept category. Retrieval then proceeded in the same way as the monolingual case. The effect of augmenting the thesaurus in this way can be seen as translating both document and query into the common language defined by the thesaurus concepts.

More Recent approaches

The approach taken by the TextWise LLC team at TREC 7 typifies the more recent work (Diekema, et al. (1998)). Diekema et al. have developed a system called CINDOR based on the Princeton WordNet as a central thesaurus. The original WordNet consists of a hierarchically arranged thesaurus with the different English terms arranged into synonym sets (synsets), each with a unique identifier (id number) (Miller, et al. (2000)). Each of these synsets represents one concept in the thesaurus-like hierarchy. In their work Diekema et al. (1998) regard this synset hierarchy as a language-independent conceptual inter-lingua.

In order to proceed with CLIR, each synset (keeping its unique meaning and id number) is populated with terms from the new language of interest. This produces a parallel thesaurus in the new language joined to the original through the conceptual inter-lingua (i.e. the synset-ids and their hierarchical relationship). Diekema et al. (1998) repeated this process for all the languages they required. The system then indexes documents in the languages of interest using the synset-ids of the synsets that contain the terms in the document. This effectively translates the document into the inter-lingua. Retrieval can then proceed by similarly translating the query into synset-ids and then matching the synset-ids using a conventional retrieval system.

There are a number of detailed issues associated with ambiguity and the lack of coverage of the thesaurus that can cause difficulties with this approach; these are dealt with by Diekema et al. (1998). This approach is significant as the translation resource chosen for my investigation is EuroWordNet, a resource developed with a very similar structure to the one described by Diekema et al. (1998).
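The idea of indexing and querying through synset-ids can be illustrated with a small Python sketch. It is an assumption-laden toy, not a description of CINDOR or EuroWordNet: the synset ids, their contents, and the function names are all hypothetical, and the ambiguity and coverage problems mentioned above are ignored (an unknown term is simply dropped).

```python
from collections import defaultdict

# Hypothetical synsets: id -> terms per language (invented entries).
SYNSETS = {
    "s001": {"en": ["fish"], "de": ["fisch"]},
    "s002": {"en": ["river", "stream"], "de": ["fluss"]},
}


def build_lookup(synsets):
    """Map (language, term) to the set of synset ids that contain the term."""
    lookup = defaultdict(set)
    for synset_id, by_language in synsets.items():
        for language, terms in by_language.items():
            for term in terms:
                lookup[(language, term)].add(synset_id)
    return lookup


def to_interlingua(tokens, language, lookup):
    """Replace each token by its synset ids; tokens with no synset are dropped."""
    ids = set()
    for token in tokens:
        ids |= lookup.get((language, token), set())
    return ids


lookup = build_lookup(SYNSETS)
document_ids = to_interlingua(["fish", "stream"], "en", lookup)
query_ids = to_interlingua(["fisch"], "de", lookup)
shared = document_ids & query_ids  # concepts common to query and document
print(sorted(shared))
```

Because both the English document and the German query are reduced to the same identifiers, matching needs no direct German to English resource.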

1.3.2 Machine Translation (MT)

There are two basic approaches to MT: one translates either the documents or the queries.

Translate the document corpus

This approach is typified by the work of Oard & Hackett (1997) and Oard (1998). There are some drawbacks to this approach, as compared to translating the queries, including the extensive processing required and, in the case of multiple query languages, the need to duplicate the documents in all of the potential query languages. Despite these drawbacks there are features of this approach that are appealing, in particular the hope that translation ambiguity will be less pronounced in longer documents, as machine translation is designed to work with whole sentences and documents. Another advantage of this approach is that the user may immediately receive the documents in their preferred language to enable them to skim or read them as appropriate (Oard (1998)). Oard reports that in his comparisons, for longer queries, machine translation of the document corpus is very effective (Oard (1998)). Unfortunately, the differences between this and other query translation techniques are not statistically significant in his experiment. Fluhr (1996) comments on the tendency for MT systems to make errors. He is also reported in Gachot, et al. (1998) as observing that machine translation is best applied in limited subject domains where MT can be specialised to the domain to reduce ambiguity.

Translate the queries

Oard (1998) discusses this technique and concludes that it is significantly less costly than translating the documents. However, the technique is clearly less effective than some other dictionary based techniques for short queries (Oard (1998)). Yamabana, et al. (1998) comment that the problem of resolving ambiguity in machine translation systems has been a major challenge in that field. They observe that the techniques successfully adopted by the MT community are totally unsuited to translating queries, since queries are rarely sentences and more often just a sequence of words.

Gey, et al. (1998) report on their use of the Globalink machine translation system for translating queries in a CLIR experiment. The absence of some language pairs in the Globalink lexicons forced them to use English as a universal intermediate or pivot language. Gey et al. make no comment as to the impact this process may have had on their results, although there appears to be some evidence in their results that the effect was to reduce the average precision.

1.3.3 Comparable or Parallel Corpus Based techniques

Some approaches concentrate on translating the query (e.g. Nie (1998), Nie (1999), Sheridan & Ballerini (1996)). There are also approaches that translate the target corpus (Franz, et al. (1998)). Finally, there are techniques that involve no direct translations at all (Yang, et al. (1998), Landauer & Littman (1990)).

Translating using parallel corpora

The aim of these techniques is to use similar corpora from two different languages to generate a probabilistic model that can map terms in one language into their most likely translations in the other. For this method to succeed the corpora need to be quite similar. In general, the methods proceed by aligning the corpora sentence by sentence. Then, based on the positions of the words in the sentences, together with anchor points such as cognates (terms, often names, spelled the same or nearly the same in both languages) and numbers, the system estimates, for each pair of terms in the aligned sentences, the probability that they are translations of each other. By combining these probabilities, the system creates an overall mapping that translates one term into a set of likely others. Nie (1998), Nie (1999), Nie, et al. (1999) use this technique to translate queries, and Franz et al. (1998) to translate the whole target corpus. All report considerable success.

No Translation Needed - Vector approaches

Vector approaches are quite different from the other approaches as they rely on mapping the query and documents into a combined vector space, which is usually defined by a parallel training corpus of some sort. No translation takes place. The approaches include the "Generalised Vector Space Model" (GVSM) as described by Carbonell, et al. (1997) and Yang et al. (1998). They also include the Latent Semantic Indexing approach as described by Landauer & Littman (1990), and Rehder, et al. (1997). Landauer & Littman (1990) reported considerable success based on this approach; however, Yang et al. (1998) concluded that the mate finding evaluation technique used was a poor one, as it was rather too optimistic in its results. I have not elaborated on these vector approaches as they use techniques far removed from the techniques used in this investigation. For further information see the papers by Carbonell et al. (1997), Yang et al. (1998), Landauer & Littman (1990), and Rehder et al. (1997).

1.3.4 Dictionary-based techniques

Background

The basic approach is to take each term in the query (in different work, terms can be words, phrases, or either) and translate it by looking it up in a Machine Readable Dictionary (MRD). MRDs are usually the electronic analogues of the traditional bilingual dictionaries used by linguists. This usually results in a significant expansion of the query, as terms inevitably have many possible translations. This is not only because terms may have several synonyms, but also because terms tend to have a number of different senses that a naive approach cannot distinguish. This basic approach is outlined by Grefenstette (1998b), and Ballesteros & Croft (1996).

Ballesteros & Croft (1996) and Ballesteros & Croft (1997) report that MRD translation of queries can lead to a drop in effectiveness of between 40-60% as compared with monolingual performance. They ascribe this to three primary factors: a lack of specialised vocabulary in the dictionary, the introduction of ambiguity from the translation process, and the failure to translate multi-term concepts such as phrases.

Ballesteros & Croft (1996) report significant improvement in effectiveness if dictionary translation is augmented with pseudo-relevance feedback (see the section on pseudo-relevance feedback below) both before and after translation. They report improvements of between 16% and 34% for pseudo-relevance feedback applied before translation and between 14.3% and 47.5% when applied after translation. When combined, the two stages of pseudo-relevance feedback produce improvements of between 40% and 51% (Ballesteros & Croft (1996)).

Ballesteros & Croft (1997) show that although phrasal translation may improve effectiveness, it is extremely sensitive to poor translation. A single, poor phrase translation may undo the good work of several accurate translations (Ballesteros & Croft (1997)). They also observe that MRDs do not provide sufficient context for good phrasal translation of most sorts of phrase.
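The query expansion caused by naive word-by-word lookup can be seen in a short Python sketch. The dictionary entries and the function name are invented for illustration, and passing untranslated terms through unchanged is an assumption of this sketch rather than something stated by the cited authors.

```python
# Hypothetical German-to-English MRD entries, invented for illustration.
MRD = {
    "bank": ["bank", "bench"],
    "schloss": ["castle", "lock"],
}


def translate_query(terms, dictionary):
    """Replace each term with all of its dictionary translations.

    Terms missing from the dictionary are kept unchanged (an assumption).
    """
    translated = []
    for term in terms:
        translated.extend(dictionary.get(term, [term]))
    return translated


query = ["schloss", "bank", "berlin"]
english = translate_query(query, MRD)
print(len(query), "source terms ->", len(english), "target terms:", english)
```

Every alternative sense enters the translated query with equal standing, which is the ambiguity the rest of this section tries to reduce.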

Ballesteros & Croft (1998a) continued their work using MRDs as the basis for translation with great success. By using a combination of the relevance feedback techniques with part of speech tagging, the INQUERY synonym operator (see the section on the synonym operator below), and better phrase translation, they have achieved better than 90% of monolingual performance in their experiments. (Part of speech tagging is a technique where the words in the query are processed to determine whether they are nouns, verbs, adjectives etc. This allows any subsequent translation to take account of the part of speech, thus reducing ambiguity.)

The advanced phrase translation technique Ballesteros & Croft (1998a) employed uses the hypothesis that terms correctly translated from a phrase will preferentially co-occur in the target corpus. They achieve this translation by testing all the combinations of the various translation options for the terms of a phrase and choosing the combination that co-occurs most frequently in the target corpus. By combining this output with their existing dictionary phrase translation approach, they improve average precision by some 31% over simple word for word translation.

Pseudo-Relevance feedback

The idea behind relevance feedback dates back at least 20 years (Salton & Buckley (1990)). The basic concept is that the query received by a retrieval system is often quite short, and better retrieval will result if the system can assist the user in extending the query with additional appropriate terms (Sparck Jones & Willett (1997)). Numerous different techniques can select appropriate candidate terms; however, they all aim to select terms that will be good discriminators of relevant documents. The principle is that terms occurring frequently in relevant documents and infrequently in non-relevant documents are good discriminators for the relevant documents (Sparck Jones & Willett (1997)).

In a normal relevance feedback system, after the user has retrieved an initial list of documents they can indicate which are actually relevant. The system then uses this information to determine the most discriminating terms for the relevant set as compared to the rest of the corpus. The system then adds these terms to the query (sometimes giving different weights or priorities to the various terms) and re-executes the query. Experiments have shown that this technique is generally very effective at improving retrieval (Sparck Jones & Willett (1997)).
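The term selection step of relevance feedback can be sketched as follows. This is a deliberately simple illustration: the scoring rule (share of relevant documents containing a term minus share of non-relevant documents containing it) is an assumption chosen for clarity, not one of the weighting schemes from the cited literature, and the documents and function names are invented.

```python
from collections import Counter


def document_frequency(documents):
    """Count, for each term, how many documents contain it."""
    counts = Counter()
    for tokens in documents:
        counts.update(set(tokens))
    return counts


def expand_query(query, relevant, non_relevant, k=2):
    """Append the k terms that best separate relevant from non-relevant documents."""
    rel_df = document_frequency(relevant)
    non_df = document_frequency(non_relevant)
    scores = {}
    for term, count in rel_df.items():
        if term in query:
            continue  # already part of the query
        rel_share = count / len(relevant)
        non_share = non_df[term] / len(non_relevant) if non_relevant else 0.0
        scores[term] = rel_share - non_share
    # Highest score first; ties broken alphabetically for a stable result.
    best = sorted(scores, key=lambda t: (-scores[t], t))[:k]
    return list(query) + best


relevant = [["salmon", "river", "fishing"], ["salmon", "fishing", "quota"]]
non_relevant = [["river", "bank", "loan"], ["quota", "trade"]]
expanded = expand_query(["fish"], relevant, non_relevant)
print(expanded)
```

Terms such as "river" and "quota" appear in relevant documents but are just as common in the non-relevant ones, so they score zero and are not added.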

Pseudo-relevance feedback (sometimes called local feedback) differs from normal relevance feedback in that it assumes that the top n retrieved documents are relevant (Sparck Jones & Willett (1997), Ballesteros & Croft (1997)). The system then completes the process of determining the discriminating terms and re-executing the query without any intervention by the user.

As discussed above, Ballesteros & Croft (1996) introduced pseudo-relevance feedback as a pre-translation step. By using a corpus in the source language, the relevance feedback step introduces further terms that are relevant to the query and the translations of which may act to disambiguate the translations of the other query terms. Ballesteros & Croft (1997) found that pre-translation pseudo-relevance feedback strengthened the base for the translation and improved precision. However, they also found that the effect was limited by the tendency to introduce inappropriate translation terms (Ballesteros & Croft (1996)).

The INQUERY synonym operator

Within many IR systems, the main factor in determining the relevance of a document to a query is the frequency with which the query terms occur within the document corpus (Salton & Buckley (1988)). Systems measure the frequency of a term in two ways. The term-frequency or "tf" reflects how many times a term occurs within a particular document. The inverse document frequency or "idf" is inversely proportional to the number of documents containing the term (Salton & Buckley (1988), Ballesteros (2000)). The INQUERY system (for further detail see Broglio, et al. (1994), and INQUERY (2000)) uses a "belief score" (Ballesteros (2000 pg. 8)) based on these two measures. It is also normal for IR systems to give weight to a term proportional to the number of times it occurs within a query (Ballesteros (2000)).

The INQUERY synonym operator groups together a set of words within a query. When used to determine the belief-score of a document, INQUERY treats all occurrences of the words in the synonym operator as occurrences of a single pseudo-term whose document frequency ("df") is the sum of the df's for each word. This has the effect of de-emphasising those words in the group that occur infrequently within the corpus (Ballesteros & Croft (1998a), Ballesteros (2000)). As a second effect, the synonym operator normalises for the number of words representing a concept. Consider the situation where different numbers of terms represent two different concepts of equal importance in a query. If each group of terms is enclosed in a synonym operator, the INQUERY system will give the two concepts equal weight (Ballesteros (2000)).

The first effect is useful in de-emphasising archaic senses of a translated term that may be present within a MRD (Ballesteros & Croft (1998a)) but that occur infrequently in the corpus. The second effect is useful for normalising the different numbers of translations that two separate terms may generate (Ballesteros (2000)). Ballesteros (2000) reports that using the synonym operator in MRD based CLIR yields improvements of greater than 45% over simple word-by-word translation alone. Ballesteros & Croft (1998a) report similar results.
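The first effect of the synonym operator, pooling occurrences and summing document frequencies, can be illustrated numerically. The sketch below is not INQUERY's belief function: it assumes a plain logarithmic idf, caps the summed df at the collection size, and uses invented counts and hypothetical function names.

```python
import math


def idf(df, num_docs):
    """Plain logarithmic idf (an assumption; not INQUERY's belief formula)."""
    if df == 0:
        return 0.0
    return math.log(num_docs / df)


def synonym_group_stats(words, doc_tokens, df_table, num_docs):
    """Treat a group of words as one pseudo-term: pooled tf and summed df."""
    tf = sum(doc_tokens.count(word) for word in words)
    # The summed df is capped at the collection size (an assumption of this sketch).
    df = min(sum(df_table.get(word, 0) for word in words), num_docs)
    return tf, df


NUM_DOCS = 1000
DF_TABLE = {"fish": 200, "pisces": 2}  # invented document frequencies
document = ["pisces", "fish", "fish"]

tf, df = synonym_group_stats(["fish", "pisces"], document, DF_TABLE, NUM_DOCS)
rare_alone = idf(DF_TABLE["pisces"], NUM_DOCS)
grouped = idf(df, NUM_DOCS)
print(tf, df, round(rare_alone, 2), round(grouped, 2))
```

On its own, the rare word would receive a much larger idf than the common one; inside the group it shares the pooled, lower weight, which is how an infrequent or archaic translation is de-emphasised.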

1.4 Why Transitive CLIR?

The European community has 11 official languages (Siebelink (1997)). This suggests that to translate queries between all possible pairs of languages would require 55 different bilingual resources (11 x 10 / 2). If the many unofficial European languages were included, the effort to maintain, let alone create, these resources would clearly become untenable. Even without this numerical consideration, many pairs of common languages have quite limited translation resources. By using an intermediate or pivot language for which good translation resources exist, transitive CLIR aims to reduce this problem (Ballesteros (2000)).

The use of such pivot languages has been reported by a number of researchers (Braschler, et al. (1999a), Fluhr, et al. (1997), Hiemstra & Kraaij (1998), Gey et al. (1998), Franz et al. (1998), Littman, et al. (1998), Ballesteros (2000)). Significantly, apart from Ballesteros, these researchers have only used a transitive approach because no other resources were available to them. This illustrates that good translation resources can be hard to find even for well-funded researchers using common EU languages.

Of those researchers cited above, only Ballesteros (2000) has researched explicitly the effect of using a transitive scheme and techniques to overcome some of its shortcomings. The principal problem Ballesteros discusses is the introduction of ambiguity. Concern with this issue was also reflected in comments by Fluhr et al. (1997). Ballesteros (2000) initially sets out to confirm, for the language pair Spanish and French, the earlier work by Ballesteros & Croft (1996), Ballesteros & Croft (1997), and Ballesteros & Croft (1998a), which reported on dictionary based CLIR with Spanish and English. In doing so Ballesteros (2000) reports that word-by-word translation achieves 50%-60% of monolingual performance and that word ambiguity accounts for some 29% of the shortfall.
Ballesteros (2000) attributes 40% of the shortfall to the failure to translate phrases. Ballesteros (2000) goes on to examine the impact of transitive translation, discovering that using simple word-by-word transitive translation from Spanish to English to French degrades performance by 91% when compared to word-by-word translation direct from Spanish to French. Ballesteros (2000) attributes this to the increase in ambiguity brought by transitive translation. Ballesteros (2000) goes further to attempt to reduce the ambiguity introduced by transitive translation using the techniques developed by Ballesteros & Croft (1996), Ballesteros & Croft (1997), and Ballesteros & Croft (1998a). These techniques include the Page 18 of 284

19 use of the INQUERY synonym operator, and pseudo relevance feedback. The synonym operator is particularly effective at reducing the ambiguity, reducing the differential between the direct translation and the transitive from -91% to -34%. By applying all of the various disambiguation techniques developed by Ballesteros & Croft (1996), Ballesteros & Croft (1997), and Ballesteros & Croft (1998a) at different stages in the transitive translation the results can be further improved. Ballesteros (2000) is able to obtain an average precision figure for transitive translation at 67% of the monolingual performance in the target language. This compare favourably with the 79% monolingual performance obtained from a direct translation approach. It is interesting to note that the institutions of the European community frequently use a form of pivot or "relay" simultaneous interpretation to support meetings and conferences. In this approach, interpreters translate into a common pivot language and then other interpreters then take this spoken text and interpret it for the various target listeners. This technique is used when an interpreter for a particular pair of languages is not available for a particular meeting or conference (MacKintosh (1998/99)). 1.5 Lexical Triangulation I hope, by adopting a scheme of lexical triangulation, to be able to demonstrate a reduction in the ambiguity introduced by transitive translation and consequently demonstrate an improvement in retrieval effectiveness. Informally the principal behind the approach is to average out the random noise introduced by transitive translation via the different pivots, leaving only the common signal present in both translation routes Consider the German word fisch, a German to Spanish translation gives the two terms pez, pescado whereas translating to Dutch gives vis. 
Taking each of these in turn, translating the Spanish terms to English gives pitch, fish, tar, food fish, while translating the Dutch to English gives pisces the fishes, pisces, fish. Each of the transitive translations has introduced quite a lot of translation noise and ambiguity. If we take the term that is in common from the two transitive translations, we have fish, a good and unambiguous translation of the original German word. This illustrates the principal of lexical triangulation. Page 19 of 284
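The fisch example can be sketched as a set intersection over the outputs of the two transitive routes. This is a minimal illustration under stated assumptions: the toy dictionaries contain only the terms quoted in the example, the split of the English terms between pez and pescado is not given in the text and is chosen here for illustration, and the function names are hypothetical. How an empty intersection should be handled is not covered by the example and is left open.

```python
from typing import Dict, Iterable, Set

Dictionary = Dict[str, Set[str]]

# Toy dictionaries holding only the entries from the fisch example.
DE_ES: Dictionary = {"fisch": {"pez", "pescado"}}
DE_NL: Dictionary = {"fisch": {"vis"}}
# Assumption: the text gives the combined English output for the Spanish
# terms, not which term produced which translation; this split is illustrative.
ES_EN: Dictionary = {
    "pez": {"pitch", "fish", "tar"},
    "pescado": {"fish", "food fish"},
}
NL_EN: Dictionary = {"vis": {"pisces the fishes", "pisces", "fish"}}


def translate(terms: Iterable[str], dictionary: Dictionary) -> Set[str]:
    """Word-by-word lookup: the union of every translation of every term."""
    result: Set[str] = set()
    for term in terms:
        result |= dictionary.get(term, set())
    return result


def transitive(term: str, source_to_pivot: Dictionary,
               pivot_to_target: Dictionary) -> Set[str]:
    """Translate source -> pivot -> target, keeping every alternative."""
    return translate(translate([term], source_to_pivot), pivot_to_target)


def triangulate(*routes: Set[str]) -> Set[str]:
    """Keep only the translations that every transitive route agrees on."""
    if not routes:
        return set()
    return set.intersection(*routes)


via_spanish = transitive("fisch", DE_ES, ES_EN)
via_dutch = transitive("fisch", DE_NL, NL_EN)
print(sorted(via_spanish))  # ['fish', 'food fish', 'pitch', 'tar']
print(sorted(via_dutch))    # ['fish', 'pisces', 'pisces the fishes']
print(triangulate(via_spanish, via_dutch))  # {'fish'}
```

Because `triangulate` accepts any number of routes, the same sketch extends to a third pivot language by passing a third translation set.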

1.6 This dissertation

1.6.1 Background

The genesis of this dissertation came from Dr Mark Sanderson, with the idea that lexical triangulation could be used to reduce the ambiguity introduced by transitive translation. An initial search of the literature in the autumn of 1999 revealed that there was little published work on transitive CLIR and, at the time, no mention of lexical triangulation or similar techniques. Subsequently Ballesteros published her paper on transitive CLIR using Machine Readable Dictionaries (Ballesteros (2000)), in the summary to which she suggests the possibility of comparing the output of two different transitive translations.

The success reported by Ballesteros (2000) still leaves a gap in performance between direct translation approaches and transitive translation. If this investigation can demonstrate a beneficial effect from using lexical triangulation alone to disambiguate MRD based transitive translation, then it is also important to demonstrate that lexical triangulation can be combined with other techniques to further improve performance.

There are at least two IR systems available to researchers within the Information Studies (IS) department at Sheffield University: the INQUERY system as used by Ballesteros (2000) and the GLASS system developed by Dr Mark Sanderson (Sanderson (2000)). The presence of the INQUERY system makes it possible to examine the effect of the synonym operator. The GLASS system has components that can be re-configured to implement simple pseudo-relevance feedback. Thus, in addition to the basic aim of demonstrating the positive effect of lexical triangulation, this investigation will examine the effect of combining lexical triangulation with the INQUERY synonym operator and pre-translation pseudo-relevance feedback. This will enable comparison between the results of this investigation and the results reported by Ballesteros (2000).

The history of IR in general runs parallel with the history of IR Evaluation (Ellis (1996)), and in that respect I feel that CLIR is no different. For the last 9 years the TREC has been pre-eminent in the evaluation of IR systems (NIST (2000)). In particular the TREC CLIR track has both supported and motivated much of the recent upsurge in CLIR research (Braschler, et al. (1999b), Braschler et al. (1998), Schauble & Sheridan (1997)). This year, the TREC's work on CLIR with European languages has moved to Europe under the auspices of the Cross-Language Evaluation Forum (CLEF) (Peters (2000)).

The Information Studies (IS) department at Sheffield University is beginning a major CLIR project (CLARITY)¹. For the time being, the IS department does not have a strong presence in the CLIR community. Dr Mark Sanderson has registered the IS department with the CLEF to enable this and other work to access the evaluation resources provided by the CLEF and submit an entry to the CLEF 2000 workshop. By submitting a contribution to CLEF, I hope to raise Sheffield's profile in the CLIR community in advance of results from the CLARITY project. The CLEF queries and corpus will provide another evaluation environment to confirm any results found in the other experiments.

Ballesteros (2000) reports a number of results that suggest that some of the techniques of transitive translation may not be applied easily to all European languages. By examining triangulation between more than one pair of European languages, I hope to be able to illuminate this issue.

1.6.2 Aims

The overall aim is to see if lexical triangulation does produce a beneficial reduction in ambiguity and to examine any interaction with the INQUERY synonym operator and pre-translation pseudo-relevance feedback. A secondary aim is to submit a contribution to CLEF on behalf of the IS department. Finally, as time and resources permit, I aim to examine triangulation between different pairs of pivot languages, and triangulation between three transitive translations, to see if this provides further improvements in transitive translation for CLIR. These aims give rise to the Objectives outlined in section 2.

In keeping with the work of Ballesteros & Croft (1996), Ballesteros & Croft (1997), Ballesteros & Croft (1998a), and Ballesteros (2000), this investigation will simulate a basic Machine-Readable Dictionary (MRD) approach to CLIR. This investigation will examine transitive cross-language information retrieval between German queries and an English corpus using Dutch, Spanish and Italian as intermediate pivot languages.

¹ The CLARITY project is currently in the contract negotiation phase with the European Commission.

1.6.3 Structure of the Work

The work for this dissertation is divided into three main phases.

The initial phase of trial and error development, and confirmation that the basic concept of lexical triangulation was sound.
The development of the basic transitive translation processing, followed by the experiments to measure the effects of lexical triangulation. This culminated in the production of the CLEF submissions.
The further development of the processing pipelines to introduce pre-translation pseudo-relevance feedback, and triangulation between three translation routes. This culminated in further experiments to discover the effectiveness of these different techniques.

1.6.4 Report Structure

The report is structured as follows:

Introduction.
The Objectives for the work.
The Methodology and Resources used.
A description of the Experimental Environment, including a description of all of the components created or used. This section will provide an overall understanding of the components used in the different individual experiments.
A description of the various experiments conducted, including some discussion and analysis of the results they produced. This will show how the results obtained from each experiment motivated further experiments. These sections will reflect the three main phases outlined above.
An overall discussion of the most interesting results and a drawing together of the analysis.
Conclusions and Recommendations.
Finally a Bibliography and Appendixes.

2 Objectives

To discover if adopting a lexical triangulation approach can reduce the ambiguity introduced in MRD based transitive CLIR and improve retrieval effectiveness.
To confirm previous work that indicates a loss of effectiveness when using a transitive approach to CLIR (as compared to a normal direct translation approach) (Ballesteros (2000)).
To investigate whether any beneficial effects of lexical triangulation are affected by the use of disambiguation techniques applied by others in the field of CLIR. Such techniques include pre-translation query expansion using pseudo-relevance feedback (Ballesteros & Croft (1996)), and the use of the synonym operator in the INQUERY system (Pirkola (1998), Ballesteros & Croft (1998a), Oard, et al. (1999)).
To investigate the effect of triangulating between three different transitive translations.
To investigate whether different language pairings affect the overall results of lexical triangulation.
To represent Sheffield University Information Studies Department at the CLEF 2000 Workshop (Peters (2000)).

3 Methodology and Resources

3.1 Underlying Philosophy

The underlying philosophy of this investigation is the KIS concept (Keep It Simple). This philosophy particularly drives the choice of techniques investigated but also, to some extent, the methods of evaluation used. The hope is to discover how successful a CLIR system could be using only the minimum of language resources and the minimum of sophisticated processing.

In keeping with this simple philosophy, the basic approach adopted uses resources in a form that simulates a Machine Readable Dictionary (MRD). The systems will use this MRD to translate terms in the query into the language of the corpus, in a word-by-word fashion. This basic approach is outlined by Grefenstette (1998b), and Ballesteros & Croft (1996).

The aim is to evaluate the underlying basic algorithms that a developer might use in future to develop a useable CLIR system. As such, there is no "user interface" to any of the systems developed by this investigation. The systems process all of the queries or results of retrievals as files for batch execution or later examination.

In the same spirit of simplicity, I have not attempted to make any of the algorithms used particularly efficient. Of more importance is the transparency and clarity of the processing, so that any effects can be analysed and processes modified easily.
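The word-by-word approach described above can be sketched in a few lines. This is an illustration only, not the processing used in the investigation: the function name `translate_query`, the sample dictionary entries, the whitespace tokenisation, and the choice to pass untranslatable terms through unchanged are all assumptions made for the example.

```python
from typing import Dict, List


def translate_query(query: str, mrd: Dict[str, List[str]]) -> List[List[str]]:
    """Translate a query one term at a time using a simulated MRD.

    Returns one list of target-language alternatives per query term, so
    that later stages can still see which alternatives belong together.
    """
    translated: List[List[str]] = []
    for term in query.lower().split():
        # Assumption: a term missing from the dictionary is kept as it is
        # (useful for proper nouns); the text does not specify this.
        translated.append(mrd.get(term, [term]))
    return translated


def flatten(translated: List[List[str]]) -> List[str]:
    """Simple word-by-word query: every alternative of every term."""
    return [target for group in translated for target in group]


# Hypothetical German to English entries for illustration.
SAMPLE_MRD = {"fisch": ["fish"], "markt": ["market", "marketplace"]}

groups = translate_query("Fisch Markt Europa", SAMPLE_MRD)
print(groups)           # [['fish'], ['market', 'marketplace'], ['europa']]
print(flatten(groups))  # ['fish', 'market', 'marketplace', 'europa']
```

Keeping the per-term grouping, rather than flattening immediately, costs nothing and leaves the processing transparent, in line with the emphasis on clarity over efficiency.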

3.2 Evaluation Methodology

A traditional Cranfield style methodology is the basic evaluation technique used in this investigation (Ellis (1996)). This uses a set of queries and corpora that have predetermined relevance judgements, together known as a collection. The significant effects of user interaction and difficulties in obtaining relevance judgements are thereby minimised. The intention is to measure the effects of the different techniques independently of these important factors.

The investigation uses collections derived from TREC8 and the CLEF (see below), and the experiments adopt the specific methodology used by the TREC and CLEF for comparing different runs of a retrieval engine. Van Rijsbergen (1979) and Harman (1994) describe the methodology in detail.

The aim is to compare the Recall and Precision behaviour of a CLIR system under controlled conditions. The different experiments implement different aspects of the system and different techniques. The experiments can thus compare the results of different runs against a common baseline to enable conclusions to be drawn.

Zobel (1998) has examined the TREC evaluation approach and concluded that the results produced are reliable. He also observes that Wilcoxon's signed-rank test is a reliable test for significance and a good discriminator of systems. Zobel (1998), however, raises some concerns about measures based on Recall. Voorhees (1998) has also examined the TREC methodology and confirmed the ability of the TREC collections to discriminate between different retrieval strategies, despite possible variations in relevance judgements.

The investigation uses average precision as the measure for comparing the different runs, although I also report interesting features of other measures where appropriate. In line with Zobel (1998) I report statistical significance from Wilcoxon's test, although I also report significance from the sign test where it occurs.
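The measures named above can be illustrated with a short sketch. It assumes the standard non-interpolated definition of average precision, expresses differences between runs as a percentage change against a baseline, and implements a two-sided sign test over per-query scores using only the standard library. The function names and sample values are hypothetical, and this is not the evaluation software used in the investigation; the Wilcoxon signed-rank test is omitted to keep the example dependency-free.

```python
from math import comb
from typing import Sequence, Set


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Non-interpolated average precision for one query.

    Precision is sampled at the rank of each relevant document retrieved
    and divided by the total number of relevant documents, so relevant
    documents that are never retrieved count as zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def percent_change(run: float, baseline: float) -> float:
    """Relative difference of a run against a baseline, in percent."""
    return 100.0 * (run - baseline) / baseline


def sign_test_p_value(run_a: Sequence[float], run_b: Sequence[float]) -> float:
    """Two-sided sign test on paired per-query scores; ties are dropped."""
    wins = sum(a > b for a, b in zip(run_a, run_b))
    losses = sum(a < b for a, b in zip(run_a, run_b))
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Relevant documents d1 and d3 retrieved at ranks 1 and 3: (1/1 + 2/3) / 2
print(average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"}))
# A run that beats the baseline on all five queries: p = 2 * (1/32)
print(sign_test_p_value([0.5, 0.6, 0.7, 0.8, 0.9], [0.1] * 5))  # 0.0625
```

The per-run figure compared across experiments is then the mean of `average_precision` over all queries in the collection.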

3.3 Choice of Languages

The availability of resources was the overriding factor determining the choice of languages for this investigation. Having registered to take part in the CLEF workshop, a number of evaluation resources became available. The CLEF, in conjunction with the TREC, made available to participants some of the collections from previous TREC CLIR evaluations.

The CLEF offers participants a number of possible "tracks" in which to participate. The IS department registered for the "bilingual" track. The "bilingual" track experiments involve CLIR between one of a set of European languages, and a collection in English. The European languages concerned are English, French, German, Italian, Dutch, Finnish, Spanish, and Swedish. In addition to the main CLEF experiment, the CLEF made available a training collection for the bilingual track consisting of the English TREC8 CLIR corpus and queries, with matching relevance judgements, in English, French, German, and Italian.

The other constraining factor in the choice of languages for this investigation was the availability of translation resources. The Department of Computer Science at Sheffield University (DCS) was one of the collaborators on the EuroWordNet project. Discussions with Wim Peters of the DCS confirmed that EuroWordNet was available for this investigation. EuroWordNet is a multilingual database consisting of WordNets for various European languages (Dutch, Italian, Spanish, German, French, Czech and Estonian) (Vossen (1999)). The intention of the EuroWordNet project (in 1997) was to develop a database with WordNets for a number of European languages similar to, and linked with, the Princeton WordNet 1.5 (Vossen (1997), Miller et al. (2000)).

Discussions with Wim Peters, who was involved in the EuroWordNet project, suggested that the best choice of query language would be German, as the coverage of German in EuroWordNet is reasonable. Further discussion indicated that Dutch, Spanish and Italian would be good choices as pivot languages since they offered the best coverage in EuroWordNet. I describe the structure of EuroWordNet, its processing to simulate a MRD, and the other language resources in a section below.


More information

WordNet-based User Profiles for Semantic Personalization

WordNet-based User Profiles for Semantic Personalization PIA 2005 Workshop on New Technologies for Personalized Information Access WordNet-based User Profiles for Semantic Personalization Giovanni Semeraro, Marco Degemmis, Pasquale Lops, Ignazio Palmisano LACAM

More information

Evaluating wordnets in Cross-Language Information Retrieval: the ITEM search engine

Evaluating wordnets in Cross-Language Information Retrieval: the ITEM search engine Evaluating wordnets in Cross-Language Information Retrieval: the ITEM search engine Felisa Verdejo, Julio Gonzalo, Anselmo Peñas, Fernando López and David Fernández Depto. de Ingeniería Eléctrica, Electrónica

More information

A New Measure of the Cluster Hypothesis

A New Measure of the Cluster Hypothesis A New Measure of the Cluster Hypothesis Mark D. Smucker 1 and James Allan 2 1 Department of Management Sciences University of Waterloo 2 Center for Intelligent Information Retrieval Department of Computer

More information

Error annotation in adjective noun (AN) combinations

Error annotation in adjective noun (AN) combinations Error annotation in adjective noun (AN) combinations This document describes the annotation scheme devised for annotating errors in AN combinations and explains how the inter-annotator agreement has been

More information

Department of Electronic Engineering FINAL YEAR PROJECT REPORT

Department of Electronic Engineering FINAL YEAR PROJECT REPORT Department of Electronic Engineering FINAL YEAR PROJECT REPORT BEngCE-2007/08-HCS-HCS-03-BECE Natural Language Understanding for Query in Web Search 1 Student Name: Sit Wing Sum Student ID: Supervisor:

More information

RMIT University at TREC 2006: Terabyte Track

RMIT University at TREC 2006: Terabyte Track RMIT University at TREC 2006: Terabyte Track Steven Garcia Falk Scholer Nicholas Lester Milad Shokouhi School of Computer Science and IT RMIT University, GPO Box 2476V Melbourne 3001, Australia 1 Introduction

More information

Information Retrieval

Information Retrieval Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,

More information

Enhanced retrieval using semantic technologies:

Enhanced retrieval using semantic technologies: Enhanced retrieval using semantic technologies: Ontology based retrieval as a new search paradigm? - Considerations based on new projects at the Bavarian State Library Dr. Berthold Gillitzer 28. Mai 2008

More information

DCU at FIRE 2013: Cross-Language!ndian News Story Search

DCU at FIRE 2013: Cross-Language!ndian News Story Search DCU at FIRE 2013: Cross-Language!ndian News Story Search Piyush Arora, Jennifer Foster, and Gareth J. F. Jones CNGL Centre for Global Intelligent Content School of Computing, Dublin City University Glasnevin,

More information

CS 6320 Natural Language Processing

CS 6320 Natural Language Processing CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic

More information

Term Frequency Normalisation Tuning for BM25 and DFR Models

Term Frequency Normalisation Tuning for BM25 and DFR Models Term Frequency Normalisation Tuning for BM25 and DFR Models Ben He and Iadh Ounis Department of Computing Science University of Glasgow United Kingdom Abstract. The term frequency normalisation parameter

More information

second_language research_teaching sla vivian_cook language_department idl

second_language research_teaching sla vivian_cook language_department idl Using Implicit Relevance Feedback in a Web Search Assistant Maria Fasli and Udo Kruschwitz Department of Computer Science, University of Essex, Wivenhoe Park, Colchester, CO4 3SQ, United Kingdom fmfasli

More information

Improving the Effectiveness of Information Retrieval with Local Context Analysis

Improving the Effectiveness of Information Retrieval with Local Context Analysis Improving the Effectiveness of Information Retrieval with Local Context Analysis JINXI XU BBN Technologies and W. BRUCE CROFT University of Massachusetts Amherst Techniques for automatic query expansion

More information

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes CS630 Representing and Accessing Digital Information Information Retrieval: Retrieval Models Information Retrieval Basics Data Structures and Access Indexing and Preprocessing Retrieval Models Thorsten

More information

Prior Art Retrieval Using Various Patent Document Fields Contents

Prior Art Retrieval Using Various Patent Document Fields Contents Prior Art Retrieval Using Various Patent Document Fields Contents Metti Zakaria Wanagiri and Mirna Adriani Fakultas Ilmu Komputer, Universitas Indonesia Depok 16424, Indonesia metti.zakaria@ui.edu, mirna@cs.ui.ac.id

More information

Perfect Timing. Alejandra Pardo : Manager Andrew Emrazian : Testing Brant Nielsen : Design Eric Budd : Documentation

Perfect Timing. Alejandra Pardo : Manager Andrew Emrazian : Testing Brant Nielsen : Design Eric Budd : Documentation Perfect Timing Alejandra Pardo : Manager Andrew Emrazian : Testing Brant Nielsen : Design Eric Budd : Documentation Problem & Solution College students do their best to plan out their daily tasks, but

More information

Document Clustering for Mediated Information Access The WebCluster Project

Document Clustering for Mediated Information Access The WebCluster Project Document Clustering for Mediated Information Access The WebCluster Project School of Communication, Information and Library Sciences Rutgers University The original WebCluster project was conducted at

More information

led to different techniques for cross-language retrieval, ones which utilized the power of human indexing of documents to improve retrieval via bi-lin

led to different techniques for cross-language retrieval, ones which utilized the power of human indexing of documents to improve retrieval via bi-lin Cross-Language Retrieval for the CLEF Collections Comparing Multiple Methods of Retrieval Fredric C. Gey 1, Hailing Jiang 2, Vivien Petras 2 and Aitao Chen 2 1 UC Data Archive & Technical Assistance, 2

More information

Student retention in distance education using on-line communication.

Student retention in distance education using on-line communication. Doctor of Philosophy (Education) Student retention in distance education using on-line communication. Kylie Twyford AAPI BBus BEd (Hons) 2007 Certificate of Originality I certify that the work in this

More information

A Practical Passage-based Approach for Chinese Document Retrieval

A Practical Passage-based Approach for Chinese Document Retrieval A Practical Passage-based Approach for Chinese Document Retrieval Szu-Yuan Chi 1, Chung-Li Hsiao 1, Lee-Feng Chien 1,2 1. Department of Information Management, National Taiwan University 2. Institute of

More information

Word Indexing Versus Conceptual Indexing in Medical Image Retrieval

Word Indexing Versus Conceptual Indexing in Medical Image Retrieval Word Indexing Versus Conceptual Indexing in Medical Image Retrieval (ReDCAD participation at ImageCLEF Medical Image Retrieval 2012) Karim Gasmi, Mouna Torjmen-Khemakhem, and Maher Ben Jemaa Research unit

More information

Lecture 7: Relevance Feedback and Query Expansion

Lecture 7: Relevance Feedback and Query Expansion Lecture 7: Relevance Feedback and Query Expansion Information Retrieval Computer Science Tripos Part II Ronan Cummins Natural Language and Information Processing (NLIP) Group ronan.cummins@cl.cam.ac.uk

More information

Coursework Master s Thesis Proposal

Coursework Master s Thesis Proposal Coursework Master s Thesis Proposal December 1999 University of South Australia School of Computer and Information Science Student: David Benn (9809422R) Supervisor: Dan Corbett Introduction Sowa s [1984]

More information

Euripides G.M. Petrakis 1, Angelos Hliaoutakis 2

Euripides G.M. Petrakis 1, Angelos Hliaoutakis 2 Automatic Document Categorisation by User Profile in Medline Euripides G.M. Petrakis 1, Angelos Hliaoutakis 2 Dept. Of Electronic and Comp. Engineering, Technical Univ. of Crete (TUC), Chania, Crete, Greece,

More information

M erg in g C lassifiers for Im p ro v ed In fo rm a tio n R e triev a l

M erg in g C lassifiers for Im p ro v ed In fo rm a tio n R e triev a l M erg in g C lassifiers for Im p ro v ed In fo rm a tio n R e triev a l Anette Hulth, Lars Asker Dept, of Computer and Systems Sciences Stockholm University [hulthi asker]ø dsv.su.s e Jussi Karlgren Swedish

More information

R 2 D 2 at NTCIR-4 Web Retrieval Task

R 2 D 2 at NTCIR-4 Web Retrieval Task R 2 D 2 at NTCIR-4 Web Retrieval Task Teruhito Kanazawa KYA group Corporation 5 29 7 Koishikawa, Bunkyo-ku, Tokyo 112 0002, Japan tkana@kyagroup.com Tomonari Masada University of Tokyo 7 3 1 Hongo, Bunkyo-ku,

More information

Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching

Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching Wolfgang Tannebaum, Parvaz Madabi and Andreas Rauber Institute of Software Technology and Interactive Systems, Vienna

More information

Pseudo-Relevance Feedback and Title Re-Ranking for Chinese Information Retrieval

Pseudo-Relevance Feedback and Title Re-Ranking for Chinese Information Retrieval Pseudo-Relevance Feedback and Title Re-Ranking Chinese Inmation Retrieval Robert W.P. Luk Department of Computing The Hong Kong Polytechnic University Email: csrluk@comp.polyu.edu.hk K.F. Wong Dept. Systems

More information

Submission to the International Integrated Reporting Council regarding the Consultation Draft of the International Integrated Reporting Framework

Submission to the International Integrated Reporting Council regarding the Consultation Draft of the International Integrated Reporting Framework Submission to the International Integrated Reporting Council regarding the Consultation Draft of the International Integrated Reporting Framework JULY 2013 Business Council of Australia July 2013 1 About

More information

ESANN'2001 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2001, D-Facto public., ISBN ,

ESANN'2001 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2001, D-Facto public., ISBN , An Integrated Neural IR System. Victoria J. Hodge Dept. of Computer Science, University ofyork, UK vicky@cs.york.ac.uk Jim Austin Dept. of Computer Science, University ofyork, UK austin@cs.york.ac.uk Abstract.

More information

Challenge. Case Study. The fabric of space and time has collapsed. What s the big deal? Miami University of Ohio

Challenge. Case Study. The fabric of space and time has collapsed. What s the big deal? Miami University of Ohio Case Study Use Case: Recruiting Segment: Recruiting Products: Rosette Challenge CareerBuilder, the global leader in human capital solutions, operates the largest job board in the U.S. and has an extensive

More information

Contents 1. INTRODUCTION... 3

Contents 1. INTRODUCTION... 3 Contents 1. INTRODUCTION... 3 2. WHAT IS INFORMATION RETRIEVAL?... 4 2.1 FIRST: A DEFINITION... 4 2.1 HISTORY... 4 2.3 THE RISE OF COMPUTER TECHNOLOGY... 4 2.4 DATA RETRIEVAL VERSUS INFORMATION RETRIEVAL...

More information

SEMANTIC WEB POWERED PORTAL INFRASTRUCTURE

SEMANTIC WEB POWERED PORTAL INFRASTRUCTURE SEMANTIC WEB POWERED PORTAL INFRASTRUCTURE YING DING 1 Digital Enterprise Research Institute Leopold-Franzens Universität Innsbruck Austria DIETER FENSEL Digital Enterprise Research Institute National

More information

Multilingual Web Retrieval: An Experiment in English Chinese Business Intelligence

Multilingual Web Retrieval: An Experiment in English Chinese Business Intelligence Multilingual Web Retrieval: An Experiment in English Chinese Business Intelligence Jialun Qin and Yilu Zhou Department of Management Information Systems, The University of Arizona, Tucson, AZ 85721. E-mail:

More information

Question Answering Approach Using a WordNet-based Answer Type Taxonomy

Question Answering Approach Using a WordNet-based Answer Type Taxonomy Question Answering Approach Using a WordNet-based Answer Type Taxonomy Seung-Hoon Na, In-Su Kang, Sang-Yool Lee, Jong-Hyeok Lee Department of Computer Science and Engineering, Electrical and Computer Engineering

More information

A Session-based Ontology Alignment Approach for Aligning Large Ontologies

A Session-based Ontology Alignment Approach for Aligning Large Ontologies Undefined 1 (2009) 1 5 1 IOS Press A Session-based Ontology Alignment Approach for Aligning Large Ontologies Editor(s): Name Surname, University, Country Solicited review(s): Name Surname, University,

More information

What is this Song About?: Identification of Keywords in Bollywood Lyrics

What is this Song About?: Identification of Keywords in Bollywood Lyrics What is this Song About?: Identification of Keywords in Bollywood Lyrics by Drushti Apoorva G, Kritik Mathur, Priyansh Agrawal, Radhika Mamidi in 19th International Conference on Computational Linguistics

More information

arxiv:cmp-lg/ v1 5 Aug 1998

arxiv:cmp-lg/ v1 5 Aug 1998 Indexing with WordNet synsets can improve text retrieval Julio Gonzalo and Felisa Verdejo and Irina Chugur and Juan Cigarrán UNED Ciudad Universitaria, s.n. 28040 Madrid - Spain {julio,felisa,irina,juanci}@ieec.uned.es

More information

Automatic Wordnet Mapping: from CoreNet to Princeton WordNet

Automatic Wordnet Mapping: from CoreNet to Princeton WordNet Automatic Wordnet Mapping: from CoreNet to Princeton WordNet Jiseong Kim, Younggyun Hahm, Sunggoo Kwon, Key-Sun Choi Semantic Web Research Center, School of Computing, KAIST 291 Daehak-ro, Yuseong-gu,

More information

This literature review provides an overview of the various topics related to using implicit

This literature review provides an overview of the various topics related to using implicit Vijay Deepak Dollu. Implicit Feedback in Information Retrieval: A Literature Analysis. A Master s Paper for the M.S. in I.S. degree. April 2005. 56 pages. Advisor: Stephanie W. Haas This literature review

More information

Design and Implementation of Search Engine Using Vector Space Model for Personalized Search

Design and Implementation of Search Engine Using Vector Space Model for Personalized Search Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 1, January 2014,

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Multilingual Image Search from a user s perspective

Multilingual Image Search from a user s perspective Multilingual Image Search from a user s perspective Julio Gonzalo, Paul Clough, Jussi Karlgren QUAERO-Image CLEF workshop, 16/09/08 Finding is a matter of two fast stupid smart slow great potential for

More information

Using an Image-Text Parallel Corpus and the Web for Query Expansion in Cross-Language Image Retrieval

Using an Image-Text Parallel Corpus and the Web for Query Expansion in Cross-Language Image Retrieval Using an Image-Text Parallel Corpus and the Web for Query Expansion in Cross-Language Image Retrieval Yih-Chen Chang and Hsin-Hsi Chen * Department of Computer Science and Information Engineering National

More information

Boolean Model. Hongning Wang

Boolean Model. Hongning Wang Boolean Model Hongning Wang CS@UVa Abstraction of search engine architecture Indexed corpus Crawler Ranking procedure Doc Analyzer Doc Representation Query Rep Feedback (Query) Evaluation User Indexer

More information

How SPICE Language Modeling Works

How SPICE Language Modeling Works How SPICE Language Modeling Works Abstract Enhancement of the Language Model is a first step towards enhancing the performance of an Automatic Speech Recognition system. This report describes an integrated

More information

Putting ontologies to work in NLP

Putting ontologies to work in NLP Putting ontologies to work in NLP The lemon model and its future John P. McCrae National University of Ireland, Galway Introduction In natural language processing we are doing three main things Understanding

More information

Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy

Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy Martin Rajman, Pierre Andrews, María del Mar Pérez Almenta, and Florian Seydoux Artificial Intelligence

More information

CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING

CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING 43 CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING 3.1 INTRODUCTION This chapter emphasizes the Information Retrieval based on Query Expansion (QE) and Latent Semantic

More information

A Deep Relevance Matching Model for Ad-hoc Retrieval

A Deep Relevance Matching Model for Ad-hoc Retrieval A Deep Relevance Matching Model for Ad-hoc Retrieval Jiafeng Guo 1, Yixing Fan 1, Qingyao Ai 2, W. Bruce Croft 2 1 CAS Key Lab of Web Data Science and Technology, Institute of Computing Technology, Chinese

More information