Optimizing information retrieval in question answering using syntactic annotation


Jörg Tiedemann
Alfa Informatica, University of Groningen

Abstract

One of the bottlenecks in open-domain question answering (QA) systems is the performance of the information retrieval (IR) component. In QA, IR is used to reduce the search space for answer extraction modules, and its performance is therefore crucial for the success of the overall system. However, natural language questions differ from the sets of keywords used in traditional IR. In this study we explore the possibilities of integrating linguistic information taken from machine-annotated Dutch newspaper text into information retrieval. Various types of morphological and syntactic features are stored in a multi-layer index to improve IR queries derived from natural language input. The paper describes a genetic algorithm for optimizing queries sent to such an enriched IR index. The experiments are based on the CLEF test sets for Dutch QA from the last two years. We show an absolute improvement of about 8% in mean reciprocal rank scores compared to the baseline using traditional IR with plain text keywords.

1 Introduction

One of the strategies in question answering (QA) systems is to identify possible answers in large document collections. The task of the information retrieval (IR) component in such a system is to reduce the search space for information extraction modules that look for possible answers in relevant text passages. Obviously, the system fails if IR does not provide appropriate segments to the subsequent modules. Hence, the performance of IR is crucial for the entire system. The main problem for IR is to match a given query with relevant documents. This is usually done in a bag-of-words approach, i.e. sets of query keywords are matched with word type vectors describing documents in the collection.
However, in QA we start with a well-formed natural language question from which an appropriate query has to be formulated and sent to the IR component. The baseline approach is simply to use all content words in the question as keywords to run traditional IR. In many cases this is not satisfactory, especially where questions are short with only a few informative content words. In some cases we want to restrict the query to narrow down possible matches (to improve precision). In other cases, where keywords from the question are too restrictive, we want to widen the query to increase recall. Natural language questions are more than bags of words and contain additional information besides possible keywords. Syntactic constructions and dependencies between constituents in the question bear valuable information about the given request. The challenge for QA is to take advantage of any linguistic clue in the question that might be necessary to find an appropriate answer. Therefore, natural language processing (NLP) is used in many components of QA systems, for example in question analysis, answer extraction and off-line information extraction (see e.g. (Moldovan et al. 02; Jijkoun et al. 04; Bouma et al. 05)). The use of NLP tools in information retrieval has been studied mainly to find better and/or additional index terms, e.g. complex noun phrases, named entities and disambiguated root forms (see e.g. (Zhai 97; Prager et al. 00; Neumann & Sacaleanu 04)). Several studies also investigate the use of other syntactically derived word pairs (Fagan 87; Strzalkowski et al. 96). (Katz & Lin 03) argue that syntactic relations can be very effective in information retrieval for question answering when selected carefully. Following up on these ideas, we would like to combine various features and relations that can be extracted from linguistically analyzed documents in our IR component to find better matches between natural language questions and relevant text passages.
Our investigations focus on open-domain question answering for Dutch using dependency relations. We use the wide-coverage dependency parser Alpino (Bouma et al. 01) to produce linguistic analyses of both questions and sentences in documents in which we expect to find the answers. An example of a syntactic dependency tree produced by Alpino can be seen in figure 1.

Figure 1: A dependency tree produced by Alpino for a Dutch CLEF question (When did the German re-unification take place?).

From the dependency trees produced by the parser we can extract features and relations that might be useful for IR, for example part-of-speech information, named-entity labels and, of course, syntactic dependency relations. The idea is to add this information to the index in some way to make it searchable via the IR component. Questions are analyzed in the same way, and similar features and relations can be extracted from their annotation. Hence, we can match them with the enriched IR index to find relevant text passages. For this we assume that questions share not only lexical items with relevant text passages but also other linguistic features such as syntactic relations. For example, if the question is about winning the world cup, we might want to look for documents that include sentences where world cup is the direct object of any inflectional form of to win. This would narrow down the query compared to a plain keyword search for world, cup and winning. The advantage of relevance ranking in IR is that we can also combine traditional keyword queries with more restrictive queries using, e.g., dependency relations. Documents that match both types will be ranked higher than the ones where only one type is matched. In this way we influence the ranking but do not reduce the number of selected documents. Linguistic annotation can be used in many other ways. For example, part-of-speech information can be useful for disambiguation and weighting of keywords. Certain keyword types (e.g. nouns and names) can be marked as required or as more important than others. Named-entity labels can be used to search for text passages that contain certain name types (for example, to match the expected answer type provided by question analysis).
Morphological analyses can be used to split compositional compounds. There is a large variety of possible features and feature combinations that can be included in a linguistically enriched IR index. There is also a wide range of possible queries to such an index using all the features extracted from analyzed questions. Finding appropriate features and query parameters is certainly not straightforward. In our experiments, we use data from the CLEF (Cross-Language Evaluation Forum) competition on Dutch QA to measure the success of linguistically extended queries. The following sections describe the IR component in our QA system and an iterative learning approach to feature selection and query formulation in the QA task.

2 The IR component

The IR component in our QA system Joost (Bouma et al. 05) is implemented as an interface to several off-the-shelf IR engines. The system may switch between seven engines that have been integrated in the system. One of them is based on the IR library Lucene from the Apache Jakarta project (Jakarta 04). Lucene is implemented in Java with a well-documented API. It implements a powerful query engine with relevance ranking and many additional interesting features. For example, Lucene indices may include several data fields connected to each document. This makes it very useful for our approach, in which we want to store several layers of linguistic information for each document in the collection. Besides the data fields, Lucene also implements a powerful query language that makes it possible to adjust queries in various ways. For example, query terms can be weighted (using numeric boost factors), Boolean operators are supported, and proximity searches can be formulated as well. It also allows for phrase searches and fuzzy matching. The support of data fields and the flexible query language are the main reasons for selecting Lucene as the IR engine in this study.
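To make the data-field idea concrete, here is a toy Python sketch of an index that stores several layers per paragraph, each with its own inverted index. It only illustrates the concept (Lucene's actual implementation is different), and the paragraph IDs and items are invented examples:

```python
from collections import defaultdict

class MultiLayerIndex:
    """Toy multi-layer ("multi-field") index: each document is stored
    under several named layers, each with its own inverted index.
    A conceptual sketch only, not how Lucene implements data fields."""

    def __init__(self):
        # layer name -> item -> set of document ids containing that item
        self.layers = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, fields):
        """fields maps a layer name to the list of items for that layer."""
        for layer, items in fields.items():
            for item in items:
                self.layers[layer][item].add(doc_id)

    def search(self, layer, item):
        """Return the ids of all documents whose given layer contains item."""
        return set(self.layers[layer].get(item, set()))

# Two invented paragraphs with a text layer and a named-entity label layer.
index = MultiLayerIndex()
index.add("p1", {"text": ["duitse", "hereniging"], "netypes": ["LOC"]})
index.add("p2", {"text": ["hereniging"], "netypes": []})
```

Searching the text layer for "hereniging" returns both paragraphs, while searching the netypes layer for "LOC" returns only the first; a multi-layer query can then combine evidence from several layers when ranking.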

The IR interface can be used independently from Joost. In this way we can run batch calls on pre-defined queries without requiring other modules of the QA system.

2.1 The multi-layer index

The QA task in CLEF is corpus-based question answering. The corpus for the Dutch competition contains several years of newspaper text, including about 190,000 documents with about 77 million words. Documents are marked with paragraph boundaries (which might be headers as well). We decided to use paragraphs as the retrieval unit, which gave the best balance between IR recall and precision. Paragraphs also seem to be a natural segmentation level for answer extraction, even though the mark-up does not seem to be very homogeneous in the corpus. The entire corpus consists of about 1.1 million paragraphs that include altogether about 4 million sentences. The sentences have been parsed by Alpino and stored in XML tree structures.¹ From the parse trees, we extracted various kinds of features and feature combinations to be stored in different data fields in the Lucene index. Henceforth, we will call these data fields index layers and, thus, the index will be called a multi-layer index. We distinguish between token layers, type layers and annotation layers. Token layers include one item per token in the corpus. Table 1 lists the token layers defined in our index.

Table 1: Token layers

  text         stemmed plain text tokens
  root         root forms
  RootPos      root form + POS tag
  RootHead     root form + head word
  RootRel      root form + relation name
  RootRelHead  root form + relation + head

As shown in the table above, certain features may appear in several layers combined with others. Features are simply concatenated (using special delimiting symbols between the various parts) to create individual items within the layer.
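The concatenation of features into layer items can be sketched as follows; the token dictionary and the "/" delimiter are illustrative assumptions about the annotation, following the token layers of Table 1:

```python
def layer_item(token, layer):
    """Build one index item for a token by concatenating its features
    with a delimiter, following the token layers of Table 1. The token
    representation here is an invented example, not Alpino's output format."""
    patterns = {
        "root":        "{root}",
        "RootPos":     "{root}/{pos}",
        "RootHead":    "{root}/{head}",
        "RootRel":     "{root}/{rel}",
        "RootRelHead": "{root}/{rel}/{head}",
    }
    return patterns[layer].format(**token)

# A token for "hereniging" as in the question of Figure 1
# (feature values are illustrative).
token = {"root": "hereniging", "pos": "noun", "rel": "su", "head": "vind_plaats"}
```

For this token, the RootPos layer would hold "hereniging/noun" and the RootRelHead layer "hereniging/su/vind_plaats".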
Tokens in the text layer and in the root layer have also been split at hyphens and underscores to separate compositional compounds (Alpino adds underscores between the compositional parts of words that have been identified as compounds).

¹ About 0.35% of the sentences could not be analyzed because of parsing timeouts.

Type layers include only specific types of tokens in the corpus, e.g. named entities or compounds (see table 2).

Table 2: Type layers

  compound  compounds (non-split root forms)
  ne        named entities (non-split roots)
  neloc     location names
  neper     person names
  neorg     organization names

Annotation layers include only the labels of (certain) token types. So far, we have defined only one annotation layer, for named-entity labels. This layer may contain the items ORG, PER or LOC if such a named entity occurs in the paragraph.

2.2 Multi-layer IR queries

Features are extracted from analyzed questions in the same way as was done for the entire corpus when creating the IR index (see section 2.1). Complex queries can now be sent to the multi-layer index described above. Each individual layer can be queried using keywords of the same type. Furthermore, we can restrict keywords to exclude or include certain types using the linguistic labels of the analyzed question. For example, we can restrict RootPos keywords to nouns only. We can also add a further restriction on the relation of these nouns within the dependency tree; we can, for example, use only the nouns that are in an object relation to their head in the tree. Finally, we can change the weights of certain types (using Lucene's boost factors) and run proximity searches using pre-defined token window sizes.
Here is a summary of the query items that we may use in IR queries:

basic: a keyword in one of the index layers
restricted: token-layer keywords can be restricted to a certain word class (noun, name, adj, verb) and/or a certain relation type (obj1 (direct object), mod (modifier), app (apposition), su (subject))
weighted: keywords can be weighted using a boost factor
proximity: a window can be defined for each set of (restricted) token-layer keywords

The restriction features (second keyword type) are limited to the ones listed above. We could easily extend the list with additional POS labels or relation types. However, we want to keep the number of possible keyword types at a reasonable level. Altogether there would be 304 different keyword types using all combinations of restrictions and basic index layers, although some of them are pointless because they cannot be instantiated. For example, a verb is not to be found in an object relation to its head, and therefore such a combination of restrictions is useless. For simplification, we consider only a small pre-defined set of combined POS/relation-type restrictions: noun-obj1, name-obj1 (nouns or names as objects); noun-mod, name-mod (nouns or names as modifiers); noun-app, name-app (nouns or names as appositions); and noun-su, name-su (nouns or names as subjects). In this way we get a total set of 208 keyword types. Figure 2 shows a rather simple example query using different keyword types, weights and one proximity query.

Wanneer vond de Duitse hereniging plaats? (When did the German re-unification take place?)

RootRelHead:(Duits/mod/hereniging hereniging/su/vind_plaats)
root:((vind plaats)^0.2 Duits^0.2 hereniging^3)
text:("vond Duitse hereniging"~50)

Figure 2: An example query using linguistic features derived from a dependency tree: root-relation-head triples, roots with boost factor 0.2, noun roots with boost factor 3, and text tokens in a window of 50 words (stop words have been removed).

Note that all parts of the query are combined disjunctively (the default operator in Lucene). In this way, each sub-query may influence the relevance of matching documents but does not restrict the result to documents for which every sub-query matches. In other words, no sub-query is required, but all of them may influence the ranking according to their weights. An extension would be to allow conjunctive parts in the query for items that are required in relevant documents. However, this is not part of the present study.
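A query like the one in Figure 2 could be assembled from extracted features roughly as follows, using Lucene's field (layer:), boost (^) and proximity (~) syntax. The compose_query helper and its tuple format are assumptions for illustration, not part of the system described here:

```python
def compose_query(parts):
    """Compose a disjunctive Lucene-style query string from
    (layer, keywords, boost, window) tuples. Sub-queries are joined by
    whitespace, i.e. Lucene's default (disjunctive) operator."""
    subqueries = []
    for layer, keywords, boost, window in parts:
        if window is not None:
            # proximity query: all keywords within a token window
            sub = '{}:("{}"~{})'.format(layer, " ".join(keywords), window)
        elif boost is not None and boost != 1:
            # weighted query: boost each keyword in this layer
            terms = " ".join("{}^{}".format(kw, boost) for kw in keywords)
            sub = "{}:({})".format(layer, terms)
        else:
            sub = "{}:({})".format(layer, " ".join(keywords))
        subqueries.append(sub)
    return " ".join(subqueries)

# A simplified version of the Figure 2 query (one boost per layer here).
query = compose_query([
    ("RootRelHead", ["Duits/mod/hereniging", "hereniging/su/vind_plaats"], None, None),
    ("root", ["hereniging"], 3, None),
    ("text", ["vond", "Duitse", "hereniging"], None, 50),
])
```

Because every sub-query is optional, a paragraph matching only the text window can still be retrieved, but one that also matches the RootRelHead triples will be ranked higher.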
3 The CLEF test set

We used the CLEF test sets from the Dutch QA tracks of 2003 and 2004 as training and evaluation data. Both collections contain Dutch questions from the CLEF competitions that have been answered by the participating systems. The test sets include the answer string(s) and document ID(s) of possible answers in the CLEF corpus. We excluded the questions for which no answer had been found. Most of the questions are factoid questions such as Hoeveel inwoners heeft Zweden? (How many inhabitants does Sweden have?). Altogether there are 570 questions with 821 answers.²

² Each question may have multiple possible answers. We also added some obvious answers which were not in the original test set when we encountered them in the corpus. For example, names and numbers can be spelled differently (Kim Jong Il vs. Kim Jong-Il, Saoedi-Arabië vs. Saudi-Arabië, bijna vijftig jaar vs. bijna 50 jaar).

For evaluation we used the mean reciprocal rank (MRR) of relevant paragraphs retrieved by IR:

    MRR = (1/|Q|) * Σ_q 1/rank_q(first answer)

where Q is the set of questions and rank_q(first answer) is the rank of the first retrieved paragraph containing an answer to question q. We used the provided answer string rather than the document ID to judge whether a retrieved paragraph is relevant. In this way, the IR engine may provide passages with correct answers from documents other than the ones marked in the test set. We do simple string matching between answer strings and words in the retrieved paragraphs. Obviously, this introduces errors where the matching string does not correspond to a valid answer in context. However, we believe that this does not influence the global evaluation figures significantly, and we therefore use this approach as a reasonable compromise for automatic evaluation.

4 Automatic query optimization

As described above, there is a large variety of possible keyword types that can be combined to query the multi-layer index. It would be possible to use intuition to set keyword restrictions, weights and window sizes. However, we prefer a more systematic search for optimal queries over the possible types and parameters. For this we use a simplified genetic algorithm in the form of an iterative trial-and-error beam search. The optimization loop works as follows (using a subset of the CLEF questions):

1. Run initial queries (one keyword type per IR run) with default weights and default window settings.

2. Combine the parameters of two of the N best IR runs (crossover). For simplicity, we require each setting to be unique (i.e. we do not have to run a setting twice; the good ones survive anyway). Apply mutation operations (see next step) if crossover does not produce a unique setting. Do crossovers until we have a maximum number of new settings.
3. Change some settings at random (mutation).
4. Run the queries using the new settings and evaluate (determine fitness).
5. Continue with step 2 until bored.

This setup is very simple and straightforward. However, some additional parameters of the algorithm have to be set initially. First of all, we have to decide how many IR runs (individuals) we want to keep in our population. We decided to keep only a very small set of 25 individuals. Fitness is measured using the MRR scores for finding the answer strings in retrieved documents. Parents for the combination of settings are selected by randomly picking two of the 25 living individuals. We compute the arithmetic mean of weights (or window sizes) if we encounter identical keyword types in both parents. We also have to set the number of new settings (children) that should be produced at a time. We set this value to a maximum of 50. Selection according to the fitness scores is done immediately when a new IR run is finished. Finally, we have to define mutation operations and their probabilities. Settings may be mutated by adding a keyword type (with a probability of 0.2), removing a keyword type (with a probability of 0.1), or by increasing/decreasing weights or window sizes (with a probability of 0.2). Window sizes are changed by a random value between 1 and 10 (shrinking or expanding) and weights by a random real value between 0 and 5 (decreasing or increasing). The initial weight is 1 (which is also the default in Lucene) and the initial window size is 20. The optimization parameters were chosen intuitively.
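A compressed sketch of this loop, with a setting represented simply as a mapping from keyword types to weights (window sizes and the actual IR runs are omitted). The helper names, the toy fitness interface and any constants not stated above are assumptions for illustration:

```python
import random

def mrr(ranks):
    """Mean reciprocal rank; an entry is the rank of the first relevant
    paragraph for one question, or None if no answer was retrieved."""
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)

def optimize(initial, fitness, generations=30, population=25, children=50,
             p_add=0.2, p_remove=0.1, p_change=0.2, rng=random.Random(0)):
    """Simplified genetic search over query settings: crossover averages
    the weights of shared keyword types, mutation adds/removes/perturbs
    keyword types, and survival is strict top-N selection. The fitness
    function stands in for running IR and scoring the results with mrr()."""
    pop = sorted(initial, key=fitness, reverse=True)[:population]
    seen = {frozenset(s.items()) for s in pop}
    for _ in range(generations):
        offspring = []
        while len(offspring) < children:
            a, b = rng.sample(pop, 2)
            child = dict(a)
            for ktype, weight in b.items():  # crossover: average shared types
                child[ktype] = (child[ktype] + weight) / 2 if ktype in child else weight
            if rng.random() < p_add:         # mutation: add a keyword type
                child["kw%d" % rng.randrange(100)] = 1.0
            if child and rng.random() < p_remove:   # mutation: remove one type
                del child[rng.choice(sorted(child))]
            if child and rng.random() < p_change:   # mutation: perturb a weight
                ktype = rng.choice(sorted(child))
                child[ktype] = max(0.0, child[ktype] + rng.uniform(-5, 5))
            key = frozenset(child.items())
            if child and key not in seen:    # each setting must be unique
                seen.add(key)
                offspring.append(child)
        pop = sorted(pop + offspring, key=fitness, reverse=True)[:population]
    return pop[0]
```

Because survivors are the strict top N of parents plus children, the best setting never gets worse from one generation to the next, mirroring the selection scheme described above.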
Probabilities for mutations are set to rather high values to enforce quicker changes within the process. Natural selection is simplified to a top-N selection, without giving individuals with lower fitness values a chance to survive. Experimentally, we found that this improves the convergence of the optimization process compared to a probabilistic selection method. Note that there is no obvious termination condition. A simple approach would be to stop when the fitness scores cannot be improved anymore. However, this condition is too strict and would cause the process to stop too early. We simply stop after a certain number of runs, especially when we observe that the optimization levels out.

5 Experiments

For our experiments, we put together the CLEF questions from the last two years of the competition. From this we randomly selected a training set of 420 questions (and their answers) and an evaluation set of 150 questions with answers (held-out data). The main reason for merging both sets, rather than using one year's data for training and the other's for evaluation, is to avoid unwanted training/evaluation-set mismatches. Each year, the CLEF tasks are slightly different from previous years to avoid over-training on certain question types. By merging both sets and selecting at random, we hope to create a more general training set with properties similar to the evaluation set. For optimization, we used the algorithm described in the previous section together with the multi-layer index and the full set of keyword types listed earlier. IR was run in parallel (3-7 Linux workstations on a local network) and a top-15 list was printed after every 10 runs. For each setting we also computed the fitness on the test data to compare training scores with scores on held-out data. Table 3 summarizes the optimization process by means of MRR scores and compares it to the baseline of using traditional IR with plain text keywords.
The algorithm was stopped after 1000 different settings. Figure 3 plots the training curve for 1000 settings on a logarithmic scale (left plot). In addition, the curve of the corresponding evaluation scores is plotted in the right part of the figure. The thin lines in figure 3 refer to the scores of individual settings tested in the optimization process. The solid bold lines refer to the top scores using the optimized queries.

Figure 3: Parameter optimization. Training (left, baseline 46.27%) and evaluation (right, baseline 46.71%); answer string MRR in % plotted against the number of settings.

Table 3: Optimization of query parameters (MRR scores of answer strings) for training and evaluation by number of settings. Baseline: IR with plain text tokens (+ stop word removal & Dutch stemming). All scores are in %.

Both plots illustrate the nature of the iterative optimization process. Random modifications cause the oscillation of the fitness scores (see the thin black lines). However, the algorithm picks up advantageous features and promotes them in the competitive selection process. The scores after optimization are well above the baseline of using plain text tokens only (about 8% measured in MRR). Most of the improvement can be observed at the beginning of the optimization process,³ which is very common in machine learning approaches. The training curve levels out after about 300 settings. The two plots also illustrate the relation between scores in training and evaluation. There seems to be a strong correlation between training and evaluation data. The general tendency of the evaluation scores is similar to the training curve, with step-wise improvements throughout the optimization process, even though the development of the evaluation score is not monotonic. Besides the drops in evaluation scores, we can also observe a slight tendency for values to decrease after about 500 settings, which is probably due to over-fitting. Note also that the evaluation scores for the optimized queries do not always reach the top scores among all tested individuals. However, the optimized queries are close to the best possible query according to the fitness scores measured on the evaluation data.

³ Note that the X scale in figure 3 is logarithmic for both training and evaluation.

We are now interested in the features that have been selected in the optimization process. Table 4 shows the settings of the best query after trying 1000 different settings.
It also lists the total number of keywords that were produced for the questions in the training set for each keyword type using these settings.

Table 4: Optimized query parameters after 1000 settings, including the number of keywords produced for the training set using these parameters.

  layer        POS    relation   nr of keywords   weight/window
  text         -      -          1478             weight =
  text         -      -          1478             window = 13
  text         verb   -          237              window = 9
  text         noun   mod        3                weight = 2.75
  root         -      -          1570             weight = 1.66
  root         -      -          1570             window = 12
  root         noun   -          520              weight = 1
  root         -      su         237              window = 20
  root         adj    -          125              window = 21
  root         name   -          88               window = 26
  root         noun   su         72               weight = 1
  root         name   obj1       54               window = 18
  root         name   su         20               weight = 4.21
  RootPos      -      -          1209             weight = 1
  RootPos      -      -          1209             window = 29
  RootPos      noun   -          472              weight = 3.65
  RootPos      noun   obj1       216              window = 36
  RootPos      name   -          44               weight = 1
  RootRel      -      -          1209             weight = 3.12
  RootRel      -      obj1       413              weight = 1.76
  RootRel      -      obj1       413              window = 18
  RootHead     -      -          1209             weight = 1
  RootHead     -      obj1       413              weight = 1
  RootRelHead  -      -          1209             weight = 1
  RootRelHead  -      obj1       413              weight = 2.55
  RootRelHead  noun   su         60               weight = 5.19
  RootRelHead  name   obj1       23               weight = 1
  RootRelHead  name   obj1       23               window = 10
  compound     -      -          208              weight = 2.52
  netypes      -      -          306              weight = 3.25

Eight token layers are used in the final query. It is somewhat surprising that the named-entity layers are not used at all, except for the meta-layer that contains the named-entity labels (netypes). However, features captured in these layers are also used in some of the query parts where the POS restriction is set to name. This overlap probably makes it unnecessary to add named-entity keywords to the query. Most keyword restrictions are applied to the root layer. The largest weight, however, is set to the plain text token layer. This seems reasonable when looking at the performance of the single layers (the text layer performs best, followed by the root layer). The most popular constraints are nouns and names among the POS labels, and subject (su) and direct object (obj1) among the relation types. This also seems natural, as noun phrases usually include the most informative parts of a sentence. Looking at the proximity queries, we can observe that the optimized query is quite strict, with rather small windows. Many proximity queries use a window size below the initial setting of 20 tokens. However, it is hard to judge the influence of the proximity queries and their parameters on the entire query, where all parts are combined disjunctively. Altogether, the system makes extensive use of the enriched index layers and also gives them significant weights (see for example the RootPos layer for nouns and the RootRelHead layer for noun subjects). They seem to contribute to IR performance in a positive way.

6 Conclusions and future work

In this paper we describe the information retrieval component of our open-domain question answering system. We integrated linguistic features produced by Alpino, a wide-coverage parser for Dutch, into the IR index to improve the retrieval of relevant paragraphs. These features are stored in several index layers that can be queried by the QA system in various ways. We also use word class labels and syntactic relations to restrict keywords in queries. Furthermore, we use keyword weights and proximity queries in the retrieval system.
We have also demonstrated an iterative algorithm for optimizing query parameters to take advantage of the enriched IR index. Queries are optimized on a training set of questions annotated with answers, taken from the CLEF competitions on Dutch question answering. We showed that the performance of the IR component can be improved by about 8% on unseen evaluation data, measured as the mean reciprocal rank of retrieved relevant paragraphs. We believe that this improvement will also boost the performance of the entire QA system. It will be part of future work to test the QA system with the adjusted IR component and the improved ranking of relevant passages. We would also like to explore further techniques for integrating linguistic information in IR to improve retrieval recall and precision even further.

References

(Bouma et al. 01) Gosse Bouma, Gertjan van Noord, and Robert Malouf. Alpino: Wide coverage computational analysis of Dutch. In Computational Linguistics in the Netherlands (CLIN), Rodopi, 2001.
(Bouma et al. 05) Gosse Bouma, Jori Mur, and Gertjan van Noord. Reasoning over dependency relations for QA. In Knowledge and Reasoning for Answering Questions (KRAQ 05), IJCAI Workshop, Edinburgh, Scotland, 2005.
(Fagan 87) Joel L. Fagan. Automatic phrase indexing for document retrieval. In SIGIR 87: Proceedings of the 10th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, ACM Press, 1987.
(Jakarta 04) Apache Jakarta. Apache Lucene - a high-performance, full-featured text search engine library. 2004.
(Jijkoun et al. 04) Valentin Jijkoun, Jori Mur, and Maarten de Rijke. Information extraction for question answering: Improving recall through syntactic patterns. In Proceedings of COLING-2004, 2004.
(Katz & Lin 03) Boris Katz and Jimmy Lin. Selectively using relations to improve precision in question answering. In Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
(Moldovan et al. 02) Dan Moldovan, Sanda Harabagiu, Roxana Girju, Paul Morarescu, Finley Lacatusu, Adrian Novischi, Adriana Badulescu, and Orest Bolohan. LCC tools for question answering. In Proceedings of TREC-11, 2002.
(Neumann & Sacaleanu 04) Günter Neumann and Bogdan Sacaleanu. Experiments on robust NL question interpretation and multi-layered document annotation for a cross-language question/answering system. In Working Notes of QA@CLEF 2004, Bath, 2004.
(Prager et al. 00) John Prager, Eric Brown, Anni Coden, Dragomir Radev, and Valerie Samn. Question-answering by predictive annotation. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece, 2000.
(Strzalkowski et al. 96) Tomek Strzalkowski, Louise Guthrie, Jussi Karlgren, Jim Leistensnider, Fang Lin, José Pérez-Carballo, Troy Straszheim, Jin Wang, and Jon Wilding. Natural language information retrieval: TREC-5 report. 1996.
(Zhai 97) Chengxiang Zhai. Fast statistical parsing of noun phrases for document indexing. In Proceedings of the Fifth Conference on Applied Natural Language Processing, San Francisco, CA, USA, Morgan Kaufmann Publishers Inc., 1997.


More information

Making Sense Out of the Web

Making Sense Out of the Web Making Sense Out of the Web Rada Mihalcea University of North Texas Department of Computer Science rada@cs.unt.edu Abstract. In the past few years, we have witnessed a tremendous growth of the World Wide

More information

Processing Structural Constraints

Processing Structural Constraints SYNONYMS None Processing Structural Constraints Andrew Trotman Department of Computer Science University of Otago Dunedin New Zealand DEFINITION When searching unstructured plain-text the user is limited

More information

NTUBROWS System for NTCIR-7. Information Retrieval for Question Answering

NTUBROWS System for NTCIR-7. Information Retrieval for Question Answering NTUBROWS System for NTCIR-7 Information Retrieval for Question Answering I-Chien Liu, Lun-Wei Ku, *Kuang-hua Chen, and Hsin-Hsi Chen Department of Computer Science and Information Engineering, *Department

More information

DCU at FIRE 2013: Cross-Language!ndian News Story Search

DCU at FIRE 2013: Cross-Language!ndian News Story Search DCU at FIRE 2013: Cross-Language!ndian News Story Search Piyush Arora, Jennifer Foster, and Gareth J. F. Jones CNGL Centre for Global Intelligent Content School of Computing, Dublin City University Glasnevin,

More information

Tokenization and Sentence Segmentation. Yan Shao Department of Linguistics and Philology, Uppsala University 29 March 2017

Tokenization and Sentence Segmentation. Yan Shao Department of Linguistics and Philology, Uppsala University 29 March 2017 Tokenization and Sentence Segmentation Yan Shao Department of Linguistics and Philology, Uppsala University 29 March 2017 Outline 1 Tokenization Introduction Exercise Evaluation Summary 2 Sentence segmentation

More information

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK Qing Guo 1, 2 1 Nanyang Technological University, Singapore 2 SAP Innovation Center Network,Singapore ABSTRACT Literature review is part of scientific

More information

Parsing tree matching based question answering

Parsing tree matching based question answering Parsing tree matching based question answering Ping Chen Dept. of Computer and Math Sciences University of Houston-Downtown chenp@uhd.edu Wei Ding Dept. of Computer Science University of Massachusetts

More information

Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy

Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy Martin Rajman, Pierre Andrews, María del Mar Pérez Almenta, and Florian Seydoux Artificial Intelligence

More information

CADIAL Search Engine at INEX

CADIAL Search Engine at INEX CADIAL Search Engine at INEX Jure Mijić 1, Marie-Francine Moens 2, and Bojana Dalbelo Bašić 1 1 Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Croatia {jure.mijic,bojana.dalbelo}@fer.hr

More information

VK Multimedia Information Systems

VK Multimedia Information Systems VK Multimedia Information Systems Mathias Lux, mlux@itec.uni-klu.ac.at This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Results Exercise 01 Exercise 02 Retrieval

More information

Information Retrieval

Information Retrieval Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,

More information

Challenge. Case Study. The fabric of space and time has collapsed. What s the big deal? Miami University of Ohio

Challenge. Case Study. The fabric of space and time has collapsed. What s the big deal? Miami University of Ohio Case Study Use Case: Recruiting Segment: Recruiting Products: Rosette Challenge CareerBuilder, the global leader in human capital solutions, operates the largest job board in the U.S. and has an extensive

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

NUS-I2R: Learning a Combined System for Entity Linking

NUS-I2R: Learning a Combined System for Entity Linking NUS-I2R: Learning a Combined System for Entity Linking Wei Zhang Yan Chuan Sim Jian Su Chew Lim Tan School of Computing National University of Singapore {z-wei, tancl} @comp.nus.edu.sg Institute for Infocomm

More information

QANUS A GENERIC QUESTION-ANSWERING FRAMEWORK

QANUS A GENERIC QUESTION-ANSWERING FRAMEWORK QANUS A GENERIC QUESTION-ANSWERING FRAMEWORK NG, Jun Ping National University of Singapore ngjp@nus.edu.sg 30 November 2009 The latest version of QANUS and this documentation can always be downloaded from

More information

Learning to Answer Biomedical Factoid & List Questions: OAQA at BioASQ 3B

Learning to Answer Biomedical Factoid & List Questions: OAQA at BioASQ 3B Learning to Answer Biomedical Factoid & List Questions: OAQA at BioASQ 3B Zi Yang, Niloy Gupta, Xiangyu Sun, Di Xu, Chi Zhang, Eric Nyberg Language Technologies Institute School of Computer Science Carnegie

More information

Informativeness for Adhoc IR Evaluation:

Informativeness for Adhoc IR Evaluation: Informativeness for Adhoc IR Evaluation: A measure that prevents assessing individual documents Romain Deveaud 1, Véronique Moriceau 2, Josiane Mothe 3, and Eric SanJuan 1 1 LIA, Univ. Avignon, France,

More information

Ranking in a Domain Specific Search Engine

Ranking in a Domain Specific Search Engine Ranking in a Domain Specific Search Engine CS6998-03 - NLP for the Web Spring 2008, Final Report Sara Stolbach, ss3067 [at] columbia.edu Abstract A search engine that runs over all domains must give equal

More information

Natural Language Processing SoSe Question Answering. (based on the slides of Dr. Saeedeh Momtazi)

Natural Language Processing SoSe Question Answering. (based on the slides of Dr. Saeedeh Momtazi) Natural Language Processing SoSe 2015 Question Answering Dr. Mariana Neves July 6th, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline 2 Introduction History QA Architecture Outline 3 Introduction

More information

From Passages into Elements in XML Retrieval

From Passages into Elements in XML Retrieval From Passages into Elements in XML Retrieval Kelly Y. Itakura David R. Cheriton School of Computer Science, University of Waterloo 200 Univ. Ave. W. Waterloo, ON, Canada yitakura@cs.uwaterloo.ca Charles

More information

QUALIFIER: Question Answering by Lexical Fabric and External Resources

QUALIFIER: Question Answering by Lexical Fabric and External Resources QUALIFIER: Question Answering by Lexical Fabric and External Resources Hui Yang Department of Computer Science National University of Singapore 3 Science Drive 2, Singapore 117543 yangh@comp.nus.edu.sg

More information

Sheffield University and the TREC 2004 Genomics Track: Query Expansion Using Synonymous Terms

Sheffield University and the TREC 2004 Genomics Track: Query Expansion Using Synonymous Terms Sheffield University and the TREC 2004 Genomics Track: Query Expansion Using Synonymous Terms Yikun Guo, Henk Harkema, Rob Gaizauskas University of Sheffield, UK {guo, harkema, gaizauskas}@dcs.shef.ac.uk

More information

CS674 Natural Language Processing

CS674 Natural Language Processing CS674 Natural Language Processing A question What was the name of the enchanter played by John Cleese in the movie Monty Python and the Holy Grail? Question Answering Eric Breck Cornell University Slides

More information

Ortolang Tools : MarsaTag

Ortolang Tools : MarsaTag Ortolang Tools : MarsaTag Stéphane Rauzy, Philippe Blache, Grégoire de Montcheuil SECOND VARIAMU WORKSHOP LPL, Aix-en-Provence August 20th & 21st, 2014 ORTOLANG received a State aid under the «Investissements

More information

An Attempt to Identify Weakest and Strongest Queries

An Attempt to Identify Weakest and Strongest Queries An Attempt to Identify Weakest and Strongest Queries K. L. Kwok Queens College, City University of NY 65-30 Kissena Boulevard Flushing, NY 11367, USA kwok@ir.cs.qc.edu ABSTRACT We explore some term statistics

More information

Incremental Integer Linear Programming for Non-projective Dependency Parsing

Incremental Integer Linear Programming for Non-projective Dependency Parsing Incremental Integer Linear Programming for Non-projective Dependency Parsing Sebastian Riedel James Clarke ICCS, University of Edinburgh 22. July 2006 EMNLP 2006 S. Riedel, J. Clarke (ICCS, Edinburgh)

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

JU_CSE_TE: System Description 2010 ResPubliQA

JU_CSE_TE: System Description 2010 ResPubliQA JU_CSE_TE: System Description QA@CLEF 2010 ResPubliQA Partha Pakray 1, Pinaki Bhaskar 1, Santanu Pal 1, Dipankar Das 1, Sivaji Bandyopadhyay 1, Alexander Gelbukh 2 Department of Computer Science & Engineering

More information

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,

More information

USING SEMANTIC REPRESENTATIONS IN QUESTION ANSWERING. Sameer S. Pradhan, Valerie Krugler, Wayne Ward, Dan Jurafsky and James H.

USING SEMANTIC REPRESENTATIONS IN QUESTION ANSWERING. Sameer S. Pradhan, Valerie Krugler, Wayne Ward, Dan Jurafsky and James H. USING SEMANTIC REPRESENTATIONS IN QUESTION ANSWERING Sameer S. Pradhan, Valerie Krugler, Wayne Ward, Dan Jurafsky and James H. Martin Center for Spoken Language Research University of Colorado Boulder,

More information

To search and summarize on Internet with Human Language Technology

To search and summarize on Internet with Human Language Technology To search and summarize on Internet with Human Language Technology Hercules DALIANIS Department of Computer and System Sciences KTH and Stockholm University, Forum 100, 164 40 Kista, Sweden Email:hercules@kth.se

More information

A fully-automatic approach to answer geographic queries: GIRSA-WP at GikiP

A fully-automatic approach to answer geographic queries: GIRSA-WP at GikiP A fully-automatic approach to answer geographic queries: at GikiP Johannes Leveling Sven Hartrumpf Intelligent Information and Communication Systems (IICS) University of Hagen (FernUniversität in Hagen)

More information

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information

More information

Natural Language Processing Pipelines to Annotate BioC Collections with an Application to the NCBI Disease Corpus

Natural Language Processing Pipelines to Annotate BioC Collections with an Application to the NCBI Disease Corpus Natural Language Processing Pipelines to Annotate BioC Collections with an Application to the NCBI Disease Corpus Donald C. Comeau *, Haibin Liu, Rezarta Islamaj Doğan and W. John Wilbur National Center

More information

University of Amsterdam at INEX 2010: Ad hoc and Book Tracks

University of Amsterdam at INEX 2010: Ad hoc and Book Tracks University of Amsterdam at INEX 2010: Ad hoc and Book Tracks Jaap Kamps 1,2 and Marijn Koolen 1 1 Archives and Information Studies, Faculty of Humanities, University of Amsterdam 2 ISLA, Faculty of Science,

More information

Precise Medication Extraction using Agile Text Mining

Precise Medication Extraction using Agile Text Mining Precise Medication Extraction using Agile Text Mining Chaitanya Shivade *, James Cormack, David Milward * The Ohio State University, Columbus, Ohio, USA Linguamatics Ltd, Cambridge, UK shivade@cse.ohio-state.edu,

More information

Towards open-domain QA. Question answering. TReC QA framework. TReC QA: evaluation

Towards open-domain QA. Question answering. TReC QA framework. TReC QA: evaluation Question ing Overview and task definition History Open-domain question ing Basic system architecture Watson s architecture Techniques Predictive indexing methods Pattern-matching methods Advanced techniques

More information

Passage Reranking for Question Answering Using Syntactic Structures and Answer Types

Passage Reranking for Question Answering Using Syntactic Structures and Answer Types Passage Reranking for Question Answering Using Syntactic Structures and Answer Types Elif Aktolga, James Allan, and David A. Smith Center for Intelligent Information Retrieval, Department of Computer Science,

More information

Sentiment analysis under temporal shift

Sentiment analysis under temporal shift Sentiment analysis under temporal shift Jan Lukes and Anders Søgaard Dpt. of Computer Science University of Copenhagen Copenhagen, Denmark smx262@alumni.ku.dk Abstract Sentiment analysis models often rely

More information

GTE: DESCRIPTION OF THE TIA SYSTEM USED FOR MUC- 3

GTE: DESCRIPTION OF THE TIA SYSTEM USED FOR MUC- 3 GTE: DESCRIPTION OF THE TIA SYSTEM USED FOR MUC- 3 INTRODUCTIO N Robert Dietz GTE Government Systems Corporatio n 100 Ferguson Drive Mountain View, CA 9403 9 dietz%gtewd.dnet@gte.com (415) 966-2825 This

More information

Classification and Comparative Analysis of Passage Retrieval Methods

Classification and Comparative Analysis of Passage Retrieval Methods Classification and Comparative Analysis of Passage Retrieval Methods Dr. K. Saruladha, C. Siva Sankar, G. Chezhian, L. Lebon Iniyavan, N. Kadiresan Computer Science Department, Pondicherry Engineering

More information

Machine Learning in GATE

Machine Learning in GATE Machine Learning in GATE Angus Roberts, Horacio Saggion, Genevieve Gorrell Recap Previous two days looked at knowledge engineered IE This session looks at machine learned IE Supervised learning Effort

More information

Query-based Text Normalization Selection Models for Enhanced Retrieval Accuracy

Query-based Text Normalization Selection Models for Enhanced Retrieval Accuracy Query-based Text Normalization Selection Models for Enhanced Retrieval Accuracy Si-Chi Chin Rhonda DeCook W. Nick Street David Eichmann The University of Iowa Iowa City, USA. {si-chi-chin, rhonda-decook,

More information

Final Project Report: Learning optimal parameters of Graph-Based Image Segmentation

Final Project Report: Learning optimal parameters of Graph-Based Image Segmentation Final Project Report: Learning optimal parameters of Graph-Based Image Segmentation Stefan Zickler szickler@cs.cmu.edu Abstract The performance of many modern image segmentation algorithms depends greatly

More information

The University of Amsterdam at the CLEF 2008 Domain Specific Track

The University of Amsterdam at the CLEF 2008 Domain Specific Track The University of Amsterdam at the CLEF 2008 Domain Specific Track Parsimonious Relevance and Concept Models Edgar Meij emeij@science.uva.nl ISLA, University of Amsterdam Maarten de Rijke mdr@science.uva.nl

More information

RACAI s QA System at the Romanian-Romanian Main Task

RACAI s QA System at the Romanian-Romanian Main Task Radu Ion, Dan Ştefănescu, Alexandru Ceauşu and Dan Tufiş. RACAI s QA System at the RomanianRomanian QA@CLEF2008 Main Task. In Carol Peters, Thomas Deselaers, Nicola Ferro, Julio Gonzalo, Gareth J. F. Jones,

More information

I Know Your Name: Named Entity Recognition and Structural Parsing

I Know Your Name: Named Entity Recognition and Structural Parsing I Know Your Name: Named Entity Recognition and Structural Parsing David Philipson and Nikil Viswanathan {pdavid2, nikil}@stanford.edu CS224N Fall 2011 Introduction In this project, we explore a Maximum

More information

Evaluation of Web Search Engines with Thai Queries

Evaluation of Web Search Engines with Thai Queries Evaluation of Web Search Engines with Thai Queries Virach Sornlertlamvanich, Shisanu Tongchim and Hitoshi Isahara Thai Computational Linguistics Laboratory 112 Paholyothin Road, Klong Luang, Pathumthani,

More information

TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback

TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback RMIT @ TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback Ameer Albahem ameer.albahem@rmit.edu.au Lawrence Cavedon lawrence.cavedon@rmit.edu.au Damiano

More information

Passage Retrieval and other XML-Retrieval Tasks

Passage Retrieval and other XML-Retrieval Tasks Passage Retrieval and other XML-Retrieval Tasks Andrew Trotman Department of Computer Science University of Otago Dunedin, New Zealand andrew@cs.otago.ac.nz Shlomo Geva Faculty of Information Technology

More information

Answer Extraction. NLP Systems and Applications Ling573 May 13, 2014

Answer Extraction. NLP Systems and Applications Ling573 May 13, 2014 Answer Extraction NLP Systems and Applications Ling573 May 13, 2014 Answer Extraction Goal: Given a passage, find the specific answer in passage Go from ~1000 chars -> short answer span Example: Q: What

More information

Linguistic Graph Similarity for News Sentence Searching

Linguistic Graph Similarity for News Sentence Searching Lingutic Graph Similarity for News Sentence Searching Kim Schouten & Flavius Frasincar schouten@ese.eur.nl frasincar@ese.eur.nl Web News Sentence Searching Using Lingutic Graph Similarity, Kim Schouten

More information

MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion

MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion Sara Lana-Serrano 1,3, Julio Villena-Román 2,3, José C. González-Cristóbal 1,3 1 Universidad Politécnica de Madrid 2 Universidad

More information

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Information Retrieval Information Retrieval (IR) is finding material of an unstructured

More information

NUSIS at TREC 2011 Microblog Track: Refining Query Results with Hashtags

NUSIS at TREC 2011 Microblog Track: Refining Query Results with Hashtags NUSIS at TREC 2011 Microblog Track: Refining Query Results with Hashtags Hadi Amiri 1,, Yang Bao 2,, Anqi Cui 3,,*, Anindya Datta 2,, Fang Fang 2,, Xiaoying Xu 2, 1 Department of Computer Science, School

More information

Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task

Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task Walid Magdy, Gareth J.F. Jones Centre for Next Generation Localisation School of Computing Dublin City University,

More information

Charles University at CLEF 2007 CL-SR Track

Charles University at CLEF 2007 CL-SR Track Charles University at CLEF 2007 CL-SR Track Pavel Češka and Pavel Pecina Institute of Formal and Applied Linguistics Charles University, 118 00 Praha 1, Czech Republic {ceska,pecina}@ufal.mff.cuni.cz Abstract

More information

Chinese Question Answering using the DLT System at NTCIR 2005

Chinese Question Answering using the DLT System at NTCIR 2005 Chinese Question Answering using the DLT System at NTCIR 2005 Richard F. E. Sutcliffe* Natural Language Engineering and Web Applications Group Department of Computer Science University of Essex, Wivenhoe

More information

Document Structure Analysis in Associative Patent Retrieval

Document Structure Analysis in Associative Patent Retrieval Document Structure Analysis in Associative Patent Retrieval Atsushi Fujii and Tetsuya Ishikawa Graduate School of Library, Information and Media Studies University of Tsukuba 1-2 Kasuga, Tsukuba, 305-8550,

More information

TEXT CHAPTER 5. W. Bruce Croft BACKGROUND

TEXT CHAPTER 5. W. Bruce Croft BACKGROUND 41 CHAPTER 5 TEXT W. Bruce Croft BACKGROUND Much of the information in digital library or digital information organization applications is in the form of text. Even when the application focuses on multimedia

More information

Information Extraction Techniques in Terrorism Surveillance

Information Extraction Techniques in Terrorism Surveillance Information Extraction Techniques in Terrorism Surveillance Roman Tekhov Abstract. The article gives a brief overview of what information extraction is and how it might be used for the purposes of counter-terrorism

More information

Tilburg University. Authoritative re-ranking of search results Bogers, A.M.; van den Bosch, A. Published in: Advances in Information Retrieval

Tilburg University. Authoritative re-ranking of search results Bogers, A.M.; van den Bosch, A. Published in: Advances in Information Retrieval Tilburg University Authoritative re-ranking of search results Bogers, A.M.; van den Bosch, A. Published in: Advances in Information Retrieval Publication date: 2006 Link to publication Citation for published

More information

Dmesure: a readability platform for French as a foreign language

Dmesure: a readability platform for French as a foreign language Dmesure: a readability platform for French as a foreign language Thomas François 1, 2 and Hubert Naets 2 (1) Aspirant F.N.R.S. (2) CENTAL, Université Catholique de Louvain Presentation at CLIN 21 February

More information

Digital Libraries: Language Technologies

Digital Libraries: Language Technologies Digital Libraries: Language Technologies RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Recall: Inverted Index..........................................

More information

Web-based Question Answering: Revisiting AskMSR

Web-based Question Answering: Revisiting AskMSR Web-based Question Answering: Revisiting AskMSR Chen-Tse Tsai University of Illinois Urbana, IL 61801, USA ctsai12@illinois.edu Wen-tau Yih Christopher J.C. Burges Microsoft Research Redmond, WA 98052,

More information

Taming Text. How to Find, Organize, and Manipulate It MANNING GRANT S. INGERSOLL THOMAS S. MORTON ANDREW L. KARRIS. Shelter Island

Taming Text. How to Find, Organize, and Manipulate It MANNING GRANT S. INGERSOLL THOMAS S. MORTON ANDREW L. KARRIS. Shelter Island Taming Text How to Find, Organize, and Manipulate It GRANT S. INGERSOLL THOMAS S. MORTON ANDREW L. KARRIS 11 MANNING Shelter Island contents foreword xiii preface xiv acknowledgments xvii about this book

More information

The University of Évora s Participation in

The University of Évora s Participation in The University of Évora s Participation in QA@CLEF-2007 José Saias and Paulo Quaresma Departamento de Informática Universidade de Évora, Portugal {jsaias,pq}@di.uevora.pt Abstract. The University of Évora

More information

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. Knowledge Retrieval Franz J. Kurfess Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. 1 Acknowledgements This lecture series has been sponsored by the European

More information

University of Santiago de Compostela at CLEF-IP09

University of Santiago de Compostela at CLEF-IP09 University of Santiago de Compostela at CLEF-IP9 José Carlos Toucedo, David E. Losada Grupo de Sistemas Inteligentes Dept. Electrónica y Computación Universidad de Santiago de Compostela, Spain {josecarlos.toucedo,david.losada}@usc.es

More information

Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha

Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha Physics Institute,

More information

GIR experiements with Forostar at GeoCLEF 2007

GIR experiements with Forostar at GeoCLEF 2007 GIR experiements with Forostar at GeoCLEF 2007 Simon Overell 1, João Magalhães 1 and Stefan Rüger 2,1 1 Multimedia & Information Systems Department of Computing, Imperial College London, SW7 2AZ, UK 2

More information

Progress Report STEVIN Projects

Progress Report STEVIN Projects Progress Report STEVIN Projects Project Name Large Scale Syntactic Annotation of Written Dutch Project Number STE05020 Reporting Period October 2009 - March 2010 Participants KU Leuven, University of Groningen

More information

Text Mining: A Burgeoning technology for knowledge extraction

Text Mining: A Burgeoning technology for knowledge extraction Text Mining: A Burgeoning technology for knowledge extraction 1 Anshika Singh, 2 Dr. Udayan Ghosh 1 HCL Technologies Ltd., Noida, 2 University School of Information &Communication Technology, Dwarka, Delhi.

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

ResPubliQA 2010

ResPubliQA 2010 SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

More information

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) CONTEXT SENSITIVE TEXT SUMMARIZATION USING HIERARCHICAL CLUSTERING ALGORITHM

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) CONTEXT SENSITIVE TEXT SUMMARIZATION USING HIERARCHICAL CLUSTERING ALGORITHM INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & 6367(Print), ISSN 0976 6375(Online) Volume 3, Issue 1, January- June (2012), TECHNOLOGY (IJCET) IAEME ISSN 0976 6367(Print) ISSN 0976 6375(Online) Volume

More information

The Muc7 T Corpus. 1 Introduction. 2 Creation of Muc7 T

The Muc7 T Corpus. 1 Introduction. 2 Creation of Muc7 T The Muc7 T Corpus Katrin Tomanek and Udo Hahn Jena University Language & Information Engineering (JULIE) Lab Friedrich-Schiller-Universität Jena, Germany {katrin.tomanek udo.hahn}@uni-jena.de 1 Introduction

More information

Answer Retrieval From Extracted Tables

Answer Retrieval From Extracted Tables Answer Retrieval From Extracted Tables Xing Wei, Bruce Croft, David Pinto Center for Intelligent Information Retrieval University of Massachusetts Amherst 140 Governors Drive Amherst, MA 01002 {xwei,croft,pinto}@cs.umass.edu

More information

Department of Electronic Engineering FINAL YEAR PROJECT REPORT

Department of Electronic Engineering FINAL YEAR PROJECT REPORT Department of Electronic Engineering FINAL YEAR PROJECT REPORT BEngCE-2007/08-HCS-HCS-03-BECE Natural Language Understanding for Query in Web Search 1 Student Name: Sit Wing Sum Student ID: Supervisor:

More information

of combining different Web search engines on Web question-answering

of combining different Web search engines on Web question-answering Effect of combining different Web search engines on Web question-answering Tatsunori Mori Akira Kanai Madoka Ishioroshi Mitsuru Sato Graduate School of Environment and Information Sciences Yokohama National

More information

IPL at ImageCLEF 2010

IPL at ImageCLEF 2010 IPL at ImageCLEF 2010 Alexandros Stougiannis, Anestis Gkanogiannis, and Theodore Kalamboukis Information Processing Laboratory Department of Informatics Athens University of Economics and Business 76 Patission

More information

Text Mining. Representation of Text Documents

Text Mining. Representation of Text Documents Data Mining is typically concerned with the detection of patterns in numeric data, but very often important (e.g., critical to business) information is stored in the form of text. Unlike numeric data,

More information

A Case Study on the Similarity Between Source Code and Bug Reports Vocabularies

A Case Study on the Similarity Between Source Code and Bug Reports Vocabularies A Case Study on the Similarity Between Source Code and Bug Reports Vocabularies Diego Cavalcanti 1, Dalton Guerrero 1, Jorge Figueiredo 1 1 Software Practices Laboratory (SPLab) Federal University of Campina

More information

A Formal Approach to Score Normalization for Meta-search

A Formal Approach to Score Normalization for Meta-search A Formal Approach to Score Normalization for Meta-search R. Manmatha and H. Sever Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts Amherst, MA 01003

More information

CS54701: Information Retrieval

CS54701: Information Retrieval CS54701: Information Retrieval Basic Concepts 19 January 2016 Prof. Chris Clifton 1 Text Representation: Process of Indexing Remove Stopword, Stemming, Phrase Extraction etc Document Parser Extract useful

More information

Large-Scale Syntactic Processing: Parsing the Web. JHU 2009 Summer Research Workshop

Large-Scale Syntactic Processing: Parsing the Web. JHU 2009 Summer Research Workshop Large-Scale Syntactic Processing: JHU 2009 Summer Research Workshop Intro CCG parser Tasks 2 The Team Stephen Clark (Cambridge, UK) Ann Copestake (Cambridge, UK) James Curran (Sydney, Australia) Byung-Gyu

More information

An UIMA based Tool Suite for Semantic Text Processing

An UIMA based Tool Suite for Semantic Text Processing An UIMA based Tool Suite for Semantic Text Processing Katrin Tomanek, Ekaterina Buyko, Udo Hahn Jena University Language & Information Engineering Lab StemNet Knowledge Management for Immunology in life

More information