DOCUMENT RETRIEVAL USING A PROBABILISTIC KNOWLEDGE MODEL

Size: px
Start display at page:

Download "DOCUMENT RETRIEVAL USING A PROBABILISTIC KNOWLEDGE MODEL"

Transcription

1 DOCUMENT RETRIEVAL USING A PROBABILISTIC KNOWLEDGE MODEL Shuguang Wang Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, USA swang@cs.pitt.edu Shyam Visweswaran Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA shv3@pitt.edu Keywords: Abstract: Milos Hauskrecht Department of Computer Science, University of Pittsburgh, Pittsburgh, PA, USA milos@cs.pitt.edu Information Retrieval, Link Analysis, Domain Knowledge, Biomedical Documents, Probabilistic Model. We are interested in enhancing information retrieval methods by incorporating domain knowledge. In this paper, we present a new document retrieval framework that learns a probabilistic knowledge model and exploits this model to improve document retrieval. The knowledge model is represented by a network of associations among concepts defining key domain entities and is extracted from a corpus of documents or from a curated domain knowledge base. This knowledge model is then used to perform concept-related probabilistic inferences using link analysis methods and applied to the task of document retrieval. We evaluate this new framework on two biomedical datasets and show that this novel knowledge-based approach outperforms the state-of-art Lemur/Indri document retrieval method. 1 INTRODUCTION Due to the richness and complexity of scientific domains today, published research documents may feasibly mention only a fraction of knowledge of the domain. This is not a problem for human readers who are armed with a general knowledge of the domain and hence are able to overcome the missing link and connect the information in the article to the overall body of domain knowledge. Many existing search and information-retrieval systems that operate by analyzing and matching queries only to individual documents are very likely to miss these knowledge-based connections. Hence, many documents that are extremely relevant to the query may not be returned by the existing search systems. Our goal was to study ways of injecting knowledge into the information retrieval process in order to find better, more relevant documents that do not exactly match the search queries. We present a new probabilistic knowledge model learned from the association network relating pairs of domain concepts. In general, associations may stand for and abstract a variety of relations among domain concepts. We believe that association networks and patterns therein give clues about mutual relevance of domain concepts. Our hypothesis is that highly interconnected domain concepts define semantically relevant groups, and these patterns are useful in performing informationretrieval inferences, such as connecting hidden and explicitly mentioned domain concepts in the document. The analysis of network structures is typically done using link analysis methods. We adopt PHITS (Cohn and Chang, 2000) to analyze the mutual connectivity of domain concepts in association networks and derive a probabilistic model that reflects, if the above hypothesis is correct, the mutual relevance among domain concepts. Figure 1 illustrates the distinction between our model and existing related techniques. The top layer in Figure 1 consists of a set of documents, and the bottom layer corresponds to knowledge and relations among domain concepts (or terms). The two layers are interconnected; individual documents refer to multiple concepts and cite one another. Latent semantic models, such as LSI (Landauer et al., 1998) or PLSI (Hofmann, 1999), focus primarily on document-term relations. Link analysis such as PHITS is traditionally performed on the top (document) layer and studies interconnections/citations among documents to determine their dependencies. In this work we use link analysis to study intercon-

2 nectedness of knowledge concepts (knowledge layer) and their mutual influences. We believe that the structure and interconnectedness in the knowledge (and also on the document) layer has the potential to greatly enhance standard information retrieval inferences. doc1 document layer doc2 doc3 docn... knowledge layer Figure 1: Illustration of the Knowledge Layer. Red dotted arrows represent citations; blue dotted arrows link documents to the concepts mentioned in them; and green solid lines connect associated concepts. We experiment with and demonstrate the potential of our approach on biomedical research articles using search queries on protein and gene species referenced in these articles. Our results show that the addition of the knowledge layer inferences improves the retrieval of relevant articles and outperforms the state-of-the-art information retrieval systems, such as the Lemur/Indri 1. This paper is organized as follows. Section 2 describes the knowledge model derived from concept associations, and its construction from different sources. Section 3 describe the learning of the probabilistic knowledge model using PHITS and proposes two inference methods to support document retrieval. An extensive set of the experimental evaluations of these methods is presented in Section 4. Several recent related projects are reviewed in Section 5 and in the final section we provide conclusions and some directions for future work. 2 THE KNOWLEDGE MODEL The knowledge in any scientific domain can be represented as a rich network of relations among the domain concepts. Our information retrieval framework adopts this kind of knowledge model to aid in the information retrieval process. The knowledge model 1 is represented by a graph (network) structure, where nodes represent domain concepts and arcs between nodes represent the pairwise relations among domain concepts. In this paper, we focus on association relations, that abstract a variety of relations that may exist among domain concepts. Typically, a knowledge model is constructed by a human expert. An example of such an expert-built domain knowledge model is the gene and protein interaction network that is available from the Molecular Signatures Database (MSigDB) 2. Alternatively, a knowledge model can be extracted from existing document collections by mining and aggregating domain concepts and their relations across many documents. In the work described in this paper, we experiment with knowledge models from the biomedical domain with genes and proteins as domain concepts. We analyze the utility of knowledge models extracted from (1) the curated MSigDB database and (2) domain document collections. Knowledge extraction from MSigDB is done in a simple fashion. Concepts are already defined and the associations are based on the grouping of different domain concepts, i.e., concepts in the same group in the database are considered to be associated. Knowledge extraction from the documents is done in two steps. In the first step, key domain concepts are identified using a dictionary look up approach that will be presented in next section. In the second step, relations (associations) are identified by analyzing the co-occurrence of pairs of domain concepts in the documents. Domain concepts considered in our analysis consist of genes and proteins. There are several freely available resources that can be used to identify the names of genes and proteins in text. However, these resources are usually optimized with respect to F1 scores. To avoid introducing many false concepts and relations into the model, we developed a method that is optimized to achieve high precision. Briefly, our method performs the following steps: 1. Segments abstracts (with titles) into sentences. 2. Tags sentences with MedPost POS tagger from NCBI. 3. Parses tagged sentences with Collins full parser (Collins, 1999). 4. Matches the phrases with the concept names based on the GPSDB (Pillet et al., 2005) vocabulary. 5. Matches the synonyms of concepts and assigns unique id to distinct concepts. 2

3 This method archived over 90% precision at approximately 65% recall when applied to a 100-document test set to extract genes and proteins. Once the concepts in the documents are identified, we mine the associations among the concepts by analyzing co-occurrences of the concepts on the sentence level. Specifically, two concepts are associated and linked in the knowledge model if they co-occur in the same sentence. P(d) P(z) d P(z d) z z P(d z) P(c z) P(c z) d c c (a) Asymmetric parameterizations (b) Symmetric parameterizations Figure 2: Graphical representation of PHITS. 3 PROBABILISTIC KNOWLEDGE MODEL Our goal is to use the knowledge model represented as an association network model to support inferences on relevance among concepts. We propose to do so by analyzing the interconnectedness of concepts in the association network. More specifically, our hypothesis is that domain concepts are more likely to be relevant to each other if they belong to the same, well defined, and highly interconnected group of concepts. The intuition behind our approach is that concepts that are semantically interconnected in terms of their roles or functions should be considered more relevant to each other. And we expect these semantically distinct roles and functions to be embedded in the documents and hence picked up and reflected in our association network. To explore and understand the interconnectedness of domain concepts in the association network we employ link analysis and the PHITS model (Cohn and Chang, 2000). 3.1 Probabilistic HITS PHITS (Cohn and Chang, 2000) is a probabilistic link analysis model that was used to study graphs of cocitation networks or web hyperlink structures. However, in our work, we use it to study relations among domain concepts (terms) and not documents. This is an important distinction to stress, since the PHITS model on the document level has been used to improve search and information retrieval performance as well (Cohn and Chang, 2000). The novelty of our work is in the use PHITS to learn the relations among concepts and the development of a new approximate inference method to assess the mutual relevancy of domain concepts. Figure 2 shows a graphical representation of PHITS. Variable d represents documents, z is the latent factor, and c is a citation. There are two equivalent PHITS parameterizations. Typically, a symmetric parametrization is more efficient as the number of topics z is smaller than the number of documents d. Using the symmetric parametrization, the model defines P(d,c) as z P(z)P(c z)p(z d). The parameters of the PHITS model are learned from the link structure data using the Expectation- Maximization (EM) approach (Hofmann, 1999; Cohn and Chang, 2000). In the expectation step, it computes P(z d,c) and in the maximization step it reestimates P(z), P(d z), and P(c z). The PHITS model has two important features. First, PHITS (much like the PLSI model) is not a proper generative probabilistic model for documents. Latent Dirichlet Allocation (LDA) (Blei et al., 2003) fixes the problem by using a Dirichlet prior to define the latent factor distribution. Second, the PHITS model does not represent individual citations with multiple random variables, instead citations are linked to topics using a multinomial distribution over citations. Hence citations are treated as alternatives. In our work, we use PHITS to analyze relations among concepts. Hence both d (documents) and c (citations) are substituted with domain concepts. To make this difference clear we denote domain concepts by e. 3.2 Document Retrieval Inferences with PHITS Our information retrieval framework assumes that both documents and queries are represented by vectors of domain concepts. Since individual research articles usually refer only to a subset of domain concepts a perfect match between the queries and the documents may not exist. Our goal is to develop methods that use a knowledge model to infer absent but relevant domain concepts. We propose two approaches that use PHITS to improve document retrieval. The first approach works by expanding the document vector with relevant concepts first and by applying information retrieval techniques to retrieve documents afterwards. This approach can be easily incorporated into vector space retrieval models. The second approach expands the

4 original query vectors with relevant concepts before retrieving documents. This approach can be used in most of information retrieval systems. In the following sections, we first describe the basic inference supported by our model. After that we show how this inference is applied in two retrieval approaches PHITS Inference The basic inference task we support with the PHITS model is the calculation of the probability of seeing an absent (unobserved) concept e given a list of observed concepts o 1,o 2,...,o k. As noted earlier, PHITS treats concepts as alternatives and the conditional probability is defined by the following distribution: P(e = b 1 o 1,o 2,...,o k ) P(e = b 2 o 1,o 2,...,o k )... P(e = b n o 1,o 2,...,o k ), where e is a random variable and b 1,b 2,...,b n are its values that denote individual domain concepts. Intuitively, the conditional distribution defines the probability of seeing an absent concept next after we observe concepts in the document. To calculate the conditional probability of e, we use the following approximation: P(e o 1,o 2,...,o k,m phits ) = P(e z,m phits )P(z o 1,o 2,...,o k,,m phits ) z z P(e z,m phits ) k j=1 P(z o j,m phits ) (1) where o 1,o 2,...,o k are observed (known) concepts and M phits is the PHITS model. We take an approximation in this derivation due to the feature we discussed in Section 3.1 that it does not represent individual citations with multiple random variables Document Expansion Retrieval In information retrieval models, documents and queries are usually represented as term vectors. Documents are retrieved according to a certain similarity measure between documents and queries. In our work, we use binary vectors to represent documents. Explicitly mentioned concepts are represented as 1 s in these vectors, and the others are 0 s initially. However, concepts not explicitly mentioned in the document may still be relevant to the document. Our aim is to find a way to infer the probable relevance of those absent concepts to the documents. Essentially, these probable relevance values lets us transform the original term vector for the document into a new term vector, such that indicators of the terms in the document are kept intact and terms not explicitly mentioned in the document and their relevancies are inferred as P(e i = T d), where e i is an absent concept in document d. Figure 3 illustrates the approach and contrasts it with the typical information retrieval process. PLSA, LDA, BM25... term vector 1,0,0,1,0,0,0 IR Methods + Query Text DB 1,x,x,1,x,x,x PHITS KB 1,0.2,0.1,1,0.05,0,0 IR Methods + Query CuratedKB Figure 3: Exploitation of the domain knowledge model in information retrieval. The standard method in which the term vector is an indicator vector that reflects the occurrence of terms in the document and query is on the left. Here all unobserved terms are treated as zeros. In contrast, our approach uses the knowledge model to fill in the values of unobserved terms with their probabilities. PHITS and inferences in Equation 1 treat concepts as alternatives and the variable e ranges over all possible concepts. Hence in order to define P(e i d), where e i (different from e) is a boolean random variable that reflects the probability of a concept i given the document (or concepts explicitly observed in the document) we need an approximation. We define the probability P(e i = T d), with which a concept e i is expected or not expected to occur in the document d as: P(e i = T d) = min[α P(e = b i d),1] (2) where P(e = b i d) is calculated from the PHITS model using Equation 1 and α is a constant that scales P(e = b i d) to a new probability space. Constant α can be defined in various ways. In our experiments we assume α to be: α = 1/minP(e = b j d) j where j ranges over all concepts explicitly mentioned in the document d. An intuitive reason of this choice of α is that we do not want any absent concept to outweight any present concepts. We expect the above domain knowledge inferences to be applied before standard information retrieval methods are deployed. The row vector at the top of Figure 3 is an indicator-based term-vector. This term vector is either transformed with the help of

5 the knowledge model (our model) or retained without change (standard model). We use the knowledge model to expand the term vectors for all unobserved concepts with their probabilities inferred from the PHITS models. A variety of existing information retrieval methods (e.g. PLSI (Hofmann, 1999), and LDA (Wei and Croft, 2006)) can then be applied to these two vector-term options. This makes it possible to combine our knowledge model with many existing retrieval techniques easily Query Expansion Retrieval An alternative to expanding document vectors, is to expand the query vectors. Briefly, the aim is to select concepts that are not in the original query, but are likely to provide a relevant match. To implement this approach, we first calculate P(e = b i o 1,o 2,...,o k ), the probability of seeing an absent concept i given a list of observed concepts o 1,o 2,...,o k in a query, as defined in Equation 1. Then we sort the absent concepts according to their conditional probabilities, and choose the top m concepts to expand the query vector. We applied two different inferences to expand documents and queries because there is much less information or evidence in queries than that in documents. By choosing the top m relevant concepts we can control the amount of noise introduced in the expansion. Our evaluation results show this approach is effective to improve retrieval performance. 4 EXPERIMENTS To demonstrate the benefit of the knowledge model, we incorporate it in several document retrieval techniques and compare them to the state-of-art research search engine, Lemur/Indri. We learned the probabilistic PHITS knowledge model from two sources: the document corpus and the MSigDB database. Furthermore, we combine these two knowledge models using model averaging. We re-write Equation 1 as follows to combine the inferences from the two models for model averaging: P(d m)p(m) P(e d) = P(e d, m) (3) m m P(d m)p(m) where e is an absent concept in document d and m is a knowledge model. We assume uniform prior probability over the two models. Similarly, we apply model averaging in all inference steps to combine the two models. We use the standard retrieval evaluation metric, Mean Average Precision (MAP), to measure the retrieval performance in our experiments. 4.1 Document Expansion Retrieval In the first set of experiments, we evaluate the document expansion retrieval approach on a PubMed- Cancer literature database. It consists of over 6000 research articles on 10 common cancers. Our corpus contains both full documents and their abstracts. Since document expansion can be easily incorporated into vector space information retrieval methods, we combine it with two IR methods, namely, LDA and PLSI. We compare it to Lemur/Indri with default settings except that we use the Dirichlet smoothing (µ = 2500), which had the best performance in our experiments. The knowledge model was learned from abstracts and the complete text of the documents were used to assess the relevance of documents returned by the system. We randomly selected a subset of 20% of the articles as the test set. To learn the probabilistic PHITS knowledge model, we used i) 80% of the document corpus, and ii) the MSigDB database. The combination of the two models was done at the inference level. The relevance of a scientific document to the query, especially if partial matches are to be assessed, is best done by human experts. Unfortunately, this is a very time-consuming process. So, we adopted the following experimental setup: we perform all knowledge-model learning and retrieval analysis on abstracts only, and use exact matches of queries on full texts as surrogate measures of true relevance. Briefly, for a given query we retrieve a document based on its abstract, and its relevance is judged (automatically) by the query s match to the full document. We generated a set of 500 queries that consisted of pairs of two domain concepts (proteins or genes) such that 100 of these queries were generated by randomly pairing any two concepts identified in the training corpus, and 400 queries were generated using documents in the test corpus by the following process. To generate a query we first randomly picked a test document, and then randomly selected a pair of concepts that were associated with each other in the full text of this document. Thus, the generated query had a perfect match in the full text of at least one document. All 500 queries were run on abstracts, and the relevance of the retrieved document to the query was determined by analyzing the full text and the match of the query to the full text Document Retrieval Results We applied the 500 queries to compare (1) PLSI and LDA document retrievals with and without knowledge expansion, and (2) document retrievals with knowledge models mined from different sources to

6 the performance of the Lemur/Indri. Synonyms of each query concept term are identified and appended to the original query terms. To construct queries for the Lemur/Indri, we use boolean and to connect the pair of query terms. An example of a query is: #band( #syn(synonyms of 1st gene) #syn(synonyms of 2nd gene)). Thus, using (1) we investigate if probabilistic knowledge models help in improving retrieval performance and using (2) we compare the utility of several knowledge sources for constructing the knowledge model. Table 1 gives the MAP scores of obtained by the above mentioned methods utilizing knowledge models extracted from several sources. The different knowledge sources are denoted by subscripts. TextKM refers to methods that incorporate associations from the document corpus; MSigDBKM refers to methods that include associations from the MSigDB database; and Text-MSigDBKM refers to methods that include associations from both the corpus refer to the relative improvement of the corresponding method over the baseline Lemur/Indri method. Overall, Lemur/Indri that ran on abstracts performed significantly better than both the original method, i.e., LDA and PLSI. However, all method that incorporated a knowledge model performed better than Lemur/Indri. Furthermore, combined approaches performed significantly better than the corresponding original approaches. These results show that domain knowledge improves document retrieval and supports our hypothesis that relevance is (at least partly) determined by connectivity among domain concepts. And it also confirms that domain knowledge is helpful in finding relevant domain concepts. Knowledge models that are mined from different sources benefit the retrieval results differently. Models mined from MSigDB did not help retrieval as much as those mined from the document corpus. This is partially due to its relatively small size: the database contains about 3,000 unique genes in over 300 groups. This also explains why combining knowledge from the document corpus and the MSigDB database did not show a significant improvement over knowledge obtained solely from corpus. With a larger curated database, and thus a potentially better knowledge model, we expect a larger improvement in retrieval performance. 4.2 Query Expansion Retrieval In the second set of experiments, we compare the query expansion approach with the pseudo-relevance feedback expansion model in Lemur/Indri. In addition to the PubMed-Cancer dataset, these experiments Table 1: MAP scores on PubMed-Cancer data. Approaches MAP Lemur/Indri PLSI PLSI TextKM (+7%) PLSI MSigDBKM (+2%) PLSI Text+MSigDBKM (+7%) LDA LDA TextKM (+7%) LDA MSigDBKM (+5%) LDA Text MSigDBKM (+7%) included the TREC Genomic Track 2003 dataset. The Genomic Track dataset is significantly larger than the Pubmed-Cancer dataset and consists of over 520,000 abstracts from Medline. The dataset comes with a set of test queries used for the evaluation of retrieval methods. Evaluation queries consist of gene names and their aliases, with the specific task being derived from the definition of GeneRIF. The relevance of the documents to test queries for the TREC data were assessed by human experts. Although queries contain only gene names, the dataset itself consists of a wider range of biomedical topics. Again, we learn the knowledge models from two sources. To learn the knowledge model from the TREC dataset, we used only 1/3 of the dataset for the sake of efficiency. We expanded the original queries with 10 terms (m=10) from i) an internal query expansion module in Lemur/Indri that is based on a pseudo-relevance feedback model (Lavrenko and Croft, 2001), and ii) with the most relevant concepts inferred from our probabilistic knowledge models. The following query provides an example: original query #combine(o concept1 o concept2...) expanded query #weight(2.0 #combine( o concept1...) 1.0 #combine(e concept1...)) We use the higher (double) weights for the terms in the original query compared to the expanded terms. Table 2 gives the MAP scores of Lemur/Indri with various settings on the PubMed-Cancer and Genomic Track datasets. We compare (1) different expansion approaches, and (2) knowledge-driven expansion with three different PHITS models (same as those used for the experiments described in the previous section). The relative percents in brackets represent relative improvements of various methods over the baseline Lemur/Indri. The query expansion methods that use knowledge models (last three rows) show im-

7 proved retrieval performance. Furthermore, all these methods outperform the pseudo-relevance feedback model in Lemur/Indri. On comparing the effect of knowledge models mined from different sources, the models that include associations from the document corpus perform better than models extracted from the curated database. We believe that this is again due to the relatively small size of the database and the sparse association network it induces. On comparing the performance on two datasets, the relative improvement is smaller for all query expansion approaches on the PubMed-Cancer data. This retrieval task is more difficult since the queries that we constructed are extracted from the full text of the documents, and hence many the of query terms may not even appear in the abstracts. Thus, document expansion is a better choice when documents contain more evidence concepts than queries. Table 2: MAP scores on TREC Genomic Track 03 and PubMed-Cancer data. Approaches TREC PubMed-Cancer Lemur/Indri relevance (+3%) (+3%) +TextKM (+8%) (+4%) +MSigDBKM (+5%) (+3%) +Text-MSigDBKM (+8%) (+4%) 5 RELATED WORK In the context of information retrieval, domain knowledge has been used in (Zhou et al., 2007) and (Pickens and MacFarlane, 2006). (Pickens and MacFarlane, 2006) showed that the occurrences of terms can be better weighted using contextual knowledge of the terms. (Zhou et al., 2007) presented various ways of incorporating existing domain knowledge from MeSH and Entrez Gene and demonstrated improvement in information retrieval. (Büttcher et al., 2004) expanded the queries with synonyms of all biomedical terms extracted from external databases. In addition to synonyms, (Aronson and Rindflesch, 1997) mapped the terms in the queries to biomedical concepts using MetaMap and added these concepts to the original queries. (Lin and Demner-Fushman, 2006) showed that it is beneficial to use a knowledge model. In terms of knowledge extraction, (Lee et al., 2007) presented a method for finding important associations among GO and MeSH terms and for computing confidence and support scores for them. All these approaches differ from our approach in that we extract domain concepts and relations from a document corpus. Then we learn a probabilistic knowledge model from the association network automatically and exploit it to infer the missing knowledge in the individual documents. 6 CONCLUSION AND FUTURE WORK We have presented a new framework that extracts domain knowledge from multiple documents and a curated domain knowledge base and uses it to support document retrieval inferences. We showed that our method can improve the retrieval performance of documents when applied to the biomedical literature. To the best of our knowledge this is the first study that attempts to learn probabilistic relations among domain concepts using link analysis methods and performs inference in the knowledge model for document retrieval. The inference potential of our framework in retrieval of relevant documents was demonstrated in the experiments with document abstracts, in which full documents and relations therein were used only to assess quantitatively the relevance of the document to the query. This result was confirmed by further experiments on the Genomic track dataset. Our knowledge model was extracted using associations among domain-specific terms observed in a document corpus and from curated knowledge collected in a database. We did attempt to refine these associations and identify the specific relations they represent. However, we anticipate that the use of a more comprehensive knowledge model with a variety of explicitly represented relations among the domain concepts will further improve the information retrieval performance. At the same time, we are looking into other inference alternatives to avoid taking approximations. Our experiments show that knowledge models mined from the literature perform better in document retrieval. This is partially due to the difficulty in locating an appropriate knowledge base for each retrieval task. Finally, our model is robust and flexible enough to integrate knowledge from various sources and combine them with existing document retrieval methods. REFERENCES Aronson, A. R. and Rindflesch, T. C. (1997). Query expansion using the umls metathesaurus. In

8 TREC 04: Proceedings of the AMIA Annual Fall Symposium 97, JAMIA Suppl, pages AMIA. Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent dirichlet allocation. The Journal of Machine Learning Research, 3: Büttcher, S., Clarke, C. L. A., and Cormack, G. V. (2004). Domain-specific synonym expansion and validation for biomedical information retrieval. In TREC 04: Proceedings of the 13th Text REtrieval Conference. Cohn, D. and Chang, H. (2000). Learning to probabilistically identify authoritative documents. In Proc. 17th International Conf. on Machine Learning, pages Morgan Kaufmann, San Francisco, CA. Collins, M. (1999). Head-driven statistical models for natural language parsing. PhD Dissertation. Hofmann, T. (1999). Probabilistic latent semantic indexing. In SIGIR 99: Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval, pages ACM Press. Landauer, T. K., Foltz, P. W., and Laham, D. (1998). Introduction to latent semantic analysis. Discourse Processes, 25: Lavrenko, V. and Croft, B. W. (2001). Relevance based language models. In SIGIR 01: Proceedings of the 24th annual international ACM SI- GIR conference on Research and development in information retrieval, pages ACM. Lee, W.-J., Raschid, L., Srinivasan, P., Shah, N., Rubin, D., and Noy, N. (2007). Using annotations from controlled vocabularies to find meaningful associations. In DILS 07: In Fourth International Workshop on Data Integration in the Life Sciences, pages Lin, J. and Demner-Fushman, D. (2006). The role of knowledge in conceptual retrieval: a study in the domain of clinical medicine. In SIGIR 06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages ACM. Pickens, J. and MacFarlane, A. (2006). Term context models for information retrieval. In CIKM 06: Proceedings of the 15th ACM international conference on Information and knowledge management, pages ACM. Pillet, V., Zehnder, M., Seewald, A. K., Veuthey, A.- L., and Petra, J. (2005). Gpsdb: a new database for synonyms expansion of gene and protein names. Bioinformatics, 21(8): Wei, X. and Croft, W. B. (2006). Lda-based document models for ad-hoc retrieval. In SIGIR 06: Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval, pages ACM. Zhou, W., Yu, C., Smalheiser, N., Torvik, V., and Hong, J. (2007). Knowledge-intensive conceptual retrieval and passage extraction of biomedical literature. In SIGIR 07: Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval, pages ACM.

Effective Latent Space Graph-based Re-ranking Model with Global Consistency

Effective Latent Space Graph-based Re-ranking Model with Global Consistency Effective Latent Space Graph-based Re-ranking Model with Global Consistency Feb. 12, 2009 1 Outline Introduction Related work Methodology Graph-based re-ranking model Learning a latent space graph A case

More information

Document Retrieval using Predication Similarity

Document Retrieval using Predication Similarity Document Retrieval using Predication Similarity Kalpa Gunaratna 1 Kno.e.sis Center, Wright State University, Dayton, OH 45435 USA kalpa@knoesis.org Abstract. Document retrieval has been an important research

More information

Improving Difficult Queries by Leveraging Clusters in Term Graph

Improving Difficult Queries by Leveraging Clusters in Term Graph Improving Difficult Queries by Leveraging Clusters in Term Graph Rajul Anand and Alexander Kotov Department of Computer Science, Wayne State University, Detroit MI 48226, USA {rajulanand,kotov}@wayne.edu

More information

SciMiner User s Manual

SciMiner User s Manual SciMiner User s Manual Copyright 2008 Junguk Hur. All rights reserved. Bioinformatics Program University of Michigan Ann Arbor, MI 48109, USA Email: juhur@umich.edu Homepage: http://jdrf.neurology.med.umich.edu/sciminer/

More information

Robust Relevance-Based Language Models

Robust Relevance-Based Language Models Robust Relevance-Based Language Models Xiaoyan Li Department of Computer Science, Mount Holyoke College 50 College Street, South Hadley, MA 01075, USA Email: xli@mtholyoke.edu ABSTRACT We propose a new

More information

Retrieval of Highly Related Documents Containing Gene-Disease Association

Retrieval of Highly Related Documents Containing Gene-Disease Association Retrieval of Highly Related Documents Containing Gene-Disease Association K. Santhosh kumar 1, P. Sudhakar 2 Department of Computer Science & Engineering Annamalai University Annamalai Nagar, India. santhosh09539@gmail.com,

More information

Sheffield University and the TREC 2004 Genomics Track: Query Expansion Using Synonymous Terms

Sheffield University and the TREC 2004 Genomics Track: Query Expansion Using Synonymous Terms Sheffield University and the TREC 2004 Genomics Track: Query Expansion Using Synonymous Terms Yikun Guo, Henk Harkema, Rob Gaizauskas University of Sheffield, UK {guo, harkema, gaizauskas}@dcs.shef.ac.uk

More information

Using a Medical Thesaurus to Predict Query Difficulty

Using a Medical Thesaurus to Predict Query Difficulty Using a Medical Thesaurus to Predict Query Difficulty Florian Boudin, Jian-Yun Nie, Martin Dawes To cite this version: Florian Boudin, Jian-Yun Nie, Martin Dawes. Using a Medical Thesaurus to Predict Query

More information

Query Likelihood with Negative Query Generation

Query Likelihood with Negative Query Generation Query Likelihood with Negative Query Generation Yuanhua Lv Department of Computer Science University of Illinois at Urbana-Champaign Urbana, IL 61801 ylv2@uiuc.edu ChengXiang Zhai Department of Computer

More information

A Multiple-stage Approach to Re-ranking Clinical Documents

A Multiple-stage Approach to Re-ranking Clinical Documents A Multiple-stage Approach to Re-ranking Clinical Documents Heung-Seon Oh and Yuchul Jung Information Service Center Korea Institute of Science and Technology Information {ohs, jyc77}@kisti.re.kr Abstract.

More information

Humboldt-University of Berlin

Humboldt-University of Berlin Humboldt-University of Berlin Exploiting Link Structure to Discover Meaningful Associations between Controlled Vocabulary Terms exposé of diploma thesis of Andrej Masula 13th October 2008 supervisor: Louiqa

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

An Investigation of Basic Retrieval Models for the Dynamic Domain Task

An Investigation of Basic Retrieval Models for the Dynamic Domain Task An Investigation of Basic Retrieval Models for the Dynamic Domain Task Razieh Rahimi and Grace Hui Yang Department of Computer Science, Georgetown University rr1042@georgetown.edu, huiyang@cs.georgetown.edu

More information

Genescene: Biomedical Text and Data Mining

Genescene: Biomedical Text and Data Mining Claremont Colleges Scholarship @ Claremont CGU Faculty Publications and Research CGU Faculty Scholarship 5-1-2003 Genescene: Biomedical Text and Data Mining Gondy Leroy Claremont Graduate University Hsinchun

More information

Tilburg University. Authoritative re-ranking of search results Bogers, A.M.; van den Bosch, A. Published in: Advances in Information Retrieval

Tilburg University. Authoritative re-ranking of search results Bogers, A.M.; van den Bosch, A. Published in: Advances in Information Retrieval Tilburg University Authoritative re-ranking of search results Bogers, A.M.; van den Bosch, A. Published in: Advances in Information Retrieval Publication date: 2006 Link to publication Citation for published

More information

TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback

TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback RMIT @ TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback Ameer Albahem ameer.albahem@rmit.edu.au Lawrence Cavedon lawrence.cavedon@rmit.edu.au Damiano

More information

A ew Algorithm for Community Identification in Linked Data

A ew Algorithm for Community Identification in Linked Data A ew Algorithm for Community Identification in Linked Data Nacim Fateh Chikhi, Bernard Rothenburger, Nathalie Aussenac-Gilles Institut de Recherche en Informatique de Toulouse 118, route de Narbonne 31062

More information

Literature Databases

Literature Databases Literature Databases Introduction to Bioinformatics Dortmund, 16.-20.07.2007 Lectures: Sven Rahmann Exercises: Udo Feldkamp, Michael Wurst 1 Overview 1. Databases 2. Publications in Science 3. PubMed and

More information

Classification and retrieval of biomedical literatures: SNUMedinfo at CLEF QA track BioASQ 2014

Classification and retrieval of biomedical literatures: SNUMedinfo at CLEF QA track BioASQ 2014 Classification and retrieval of biomedical literatures: SNUMedinfo at CLEF QA track BioASQ 2014 Sungbin Choi, Jinwook Choi Medical Informatics Laboratory, Seoul National University, Seoul, Republic of

More information

Risk Minimization and Language Modeling in Text Retrieval Thesis Summary

Risk Minimization and Language Modeling in Text Retrieval Thesis Summary Risk Minimization and Language Modeling in Text Retrieval Thesis Summary ChengXiang Zhai Language Technologies Institute School of Computer Science Carnegie Mellon University July 21, 2002 Abstract This

More information

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 188 CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 6.1 INTRODUCTION Image representation schemes designed for image retrieval systems are categorized into two

More information

An Introduction to PubMed Searching: A Reference Guide

An Introduction to PubMed Searching: A Reference Guide An Introduction to PubMed Searching: A Reference Guide Created by the Ontario Public Health Libraries Association (OPHLA) ACCESSING PubMed PubMed, the National Library of Medicine s free version of MEDLINE,

More information

Ranking models in Information Retrieval: A Survey

Ranking models in Information Retrieval: A Survey Ranking models in Information Retrieval: A Survey R.Suganya Devi Research Scholar Department of Computer Science and Engineering College of Engineering, Guindy, Chennai, Tamilnadu, India Dr D Manjula Professor

More information

ResPubliQA 2010

ResPubliQA 2010 SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

More information

Towards Semantic Search and Inference in Electronic Medical Records: an approach using Concept- based Information Retrieval

Towards Semantic Search and Inference in Electronic Medical Records: an approach using Concept- based Information Retrieval Towards Semantic Search and Inference in Electronic Medical Records: an approach using Concept- based Information Retrieval Bevan Koopman 1,2 Peter Bruza 2 Laurianne Sitbon 2 Michael Lawley 1 1: Australian

More information

A Deep Relevance Matching Model for Ad-hoc Retrieval

A Deep Relevance Matching Model for Ad-hoc Retrieval A Deep Relevance Matching Model for Ad-hoc Retrieval Jiafeng Guo 1, Yixing Fan 1, Qingyao Ai 2, W. Bruce Croft 2 1 CAS Key Lab of Web Data Science and Technology, Institute of Computing Technology, Chinese

More information

Ontology based Model and Procedure Creation for Topic Analysis in Chinese Language

Ontology based Model and Procedure Creation for Topic Analysis in Chinese Language Ontology based Model and Procedure Creation for Topic Analysis in Chinese Language Dong Han and Kilian Stoffel Information Management Institute, University of Neuchâtel Pierre-à-Mazel 7, CH-2000 Neuchâtel,

More information

Using open access literature to guide full-text query formulation. Heather A. Piwowar and Wendy W. Chapman. Background

Using open access literature to guide full-text query formulation. Heather A. Piwowar and Wendy W. Chapman. Background Using open access literature to guide full-text query formulation Heather A. Piwowar and Wendy W. Chapman Background Much scientific knowledge is contained in the details of the full-text biomedical literature.

More information

efip online Help Document

efip online Help Document efip online Help Document University of Delaware Computer and Information Sciences & Center for Bioinformatics and Computational Biology Newark, DE, USA December 2013 K K S I K K Table of Contents INTRODUCTION...

More information

The NLM Medical Text Indexer System for Indexing Biomedical Literature

The NLM Medical Text Indexer System for Indexing Biomedical Literature The NLM Medical Text Indexer System for Indexing Biomedical Literature James G. Mork 1, Antonio J. Jimeno Yepes 2,1, Alan R. Aronson 1 1 National Library of Medicine, Bethesda, MD, USA {mork,alan}@nlm.nih.gov

More information

Text Document Clustering Using DPM with Concept and Feature Analysis

Text Document Clustering Using DPM with Concept and Feature Analysis Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 2, Issue. 10, October 2013,

More information

jldadmm: A Java package for the LDA and DMM topic models

jldadmm: A Java package for the LDA and DMM topic models jldadmm: A Java package for the LDA and DMM topic models Dat Quoc Nguyen School of Computing and Information Systems The University of Melbourne, Australia dqnguyen@unimelb.edu.au Abstract: In this technical

More information

Inter and Intra-Document Contexts Applied in Polyrepresentation

Inter and Intra-Document Contexts Applied in Polyrepresentation Inter and Intra-Document Contexts Applied in Polyrepresentation Mette Skov, Birger Larsen and Peter Ingwersen Department of Information Studies, Royal School of Library and Information Science Birketinget

More information

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information

More information

University of Delaware at Diversity Task of Web Track 2010

University of Delaware at Diversity Task of Web Track 2010 University of Delaware at Diversity Task of Web Track 2010 Wei Zheng 1, Xuanhui Wang 2, and Hui Fang 1 1 Department of ECE, University of Delaware 2 Yahoo! Abstract We report our systems and experiments

More information

Multimodal Medical Image Retrieval based on Latent Topic Modeling

Multimodal Medical Image Retrieval based on Latent Topic Modeling Multimodal Medical Image Retrieval based on Latent Topic Modeling Mandikal Vikram 15it217.vikram@nitk.edu.in Suhas BS 15it110.suhas@nitk.edu.in Aditya Anantharaman 15it201.aditya.a@nitk.edu.in Sowmya Kamath

More information

Retrieval and Feedback Models for Blog Distillation

Retrieval and Feedback Models for Blog Distillation Retrieval and Feedback Models for Blog Distillation Jonathan Elsas, Jaime Arguello, Jamie Callan, Jaime Carbonell Language Technologies Institute, School of Computer Science, Carnegie Mellon University

More information

Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching

Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching Wolfgang Tannebaum, Parvaz Madabi and Andreas Rauber Institute of Software Technology and Interactive Systems, Vienna

More information

This is the author s version of a work that was submitted/accepted for publication in the following source:

This is the author s version of a work that was submitted/accepted for publication in the following source: This is the author s version of a work that was submitted/accepted for publication in the following source: Koopman, Bevan, Bruza, Peter, Sitbon, Laurianne, & Lawley, Michael (2011) AEHRC & QUT at TREC

More information

DESIGN AND IMPLEMENTATION OF AN INTERACTIVE QUERY EXPANSION METHODOLOGY FOR INFORMATION RETRIEVAL

DESIGN AND IMPLEMENTATION OF AN INTERACTIVE QUERY EXPANSION METHODOLOGY FOR INFORMATION RETRIEVAL DESIGN AND IMPLEMENTATION OF AN INTERACTIVE QUERY EXPANSION METHODOLOGY FOR INFORMATION RETRIEVAL S. Ruban 1, Vanitha T 2. and S. Behin Sam 3 1 Department of Computer Science, Bharathiar University, Coimbatore,

More information

TREC-10 Web Track Experiments at MSRA

TREC-10 Web Track Experiments at MSRA TREC-10 Web Track Experiments at MSRA Jianfeng Gao*, Guihong Cao #, Hongzhao He #, Min Zhang ##, Jian-Yun Nie**, Stephen Walker*, Stephen Robertson* * Microsoft Research, {jfgao,sw,ser}@microsoft.com **

More information

Hierarchical Location and Topic Based Query Expansion

Hierarchical Location and Topic Based Query Expansion Hierarchical Location and Topic Based Query Expansion Shu Huang 1 Qiankun Zhao 2 Prasenjit Mitra 1 C. Lee Giles 1 Information Sciences and Technology 1 AOL Research Lab 2 Pennsylvania State University

More information

A Measurement Design for the Comparison of Expert Usability Evaluation and Mobile App User Reviews

A Measurement Design for the Comparison of Expert Usability Evaluation and Mobile App User Reviews A Measurement Design for the Comparison of Expert Usability Evaluation and Mobile App User Reviews Necmiye Genc-Nayebi and Alain Abran Department of Software Engineering and Information Technology, Ecole

More information

A Framework for BioCuration (part II)

A Framework for BioCuration (part II) A Framework for BioCuration (part II) Text Mining for the BioCuration Workflow Workshop, 3rd International Biocuration Conference Friday, April 17, 2009 (Berlin) Martin Krallinger Spanish National Cancer

More information

On Duplicate Results in a Search Session

On Duplicate Results in a Search Session On Duplicate Results in a Search Session Jiepu Jiang Daqing He Shuguang Han School of Information Sciences University of Pittsburgh jiepu.jiang@gmail.com dah44@pitt.edu shh69@pitt.edu ABSTRACT In this

More information

R 2 D 2 at NTCIR-4 Web Retrieval Task

R 2 D 2 at NTCIR-4 Web Retrieval Task R 2 D 2 at NTCIR-4 Web Retrieval Task Teruhito Kanazawa KYA group Corporation 5 29 7 Koishikawa, Bunkyo-ku, Tokyo 112 0002, Japan tkana@kyagroup.com Tomonari Masada University of Tokyo 7 3 1 Hongo, Bunkyo-ku,

More information

Effective Tweet Contextualization with Hashtags Performance Prediction and Multi-Document Summarization

Effective Tweet Contextualization with Hashtags Performance Prediction and Multi-Document Summarization Effective Tweet Contextualization with Hashtags Performance Prediction and Multi-Document Summarization Romain Deveaud 1 and Florian Boudin 2 1 LIA - University of Avignon romain.deveaud@univ-avignon.fr

More information

Scientific Literature Retrieval based on Terminological Paraphrases using Predicate Argument Tuple

Scientific Literature Retrieval based on Terminological Paraphrases using Predicate Argument Tuple Scientific Literature Retrieval based on Terminological Paraphrases using Predicate Argument Tuple Sung-Pil Choi 1, Sa-kwang Song 1, Hanmin Jung 1, Michaela Geierhos 2, Sung Hyon Myaeng 3 1 Korea Institute

More information

A Novel Model for Semantic Learning and Retrieval of Images

A Novel Model for Semantic Learning and Retrieval of Images A Novel Model for Semantic Learning and Retrieval of Images Zhixin Li, ZhiPing Shi 2, ZhengJun Tang, Weizhong Zhao 3 College of Computer Science and Information Technology, Guangxi Normal University, Guilin

More information

A Practical Passage-based Approach for Chinese Document Retrieval

A Practical Passage-based Approach for Chinese Document Retrieval A Practical Passage-based Approach for Chinese Document Retrieval Szu-Yuan Chi 1, Chung-Li Hsiao 1, Lee-Feng Chien 1,2 1. Department of Information Management, National Taiwan University 2. Institute of

More information

Supporting Discovery in Medicine by Association Rule Mining in Medline and UMLS

Supporting Discovery in Medicine by Association Rule Mining in Medline and UMLS Supporting Discovery in Medicine by Association Rule Mining in Medline and UMLS Dimitar Hristovski a, Janez Stare a, Borut Peterlin b, Saso Dzeroski c a IBMI, Medical Faculty, University of Ljubljana,

More information

A Semantic Model for Concept Based Clustering

A Semantic Model for Concept Based Clustering A Semantic Model for Concept Based Clustering S.Saranya 1, S.Logeswari 2 PG Scholar, Dept. of CSE, Bannari Amman Institute of Technology, Sathyamangalam, Tamilnadu, India 1 Associate Professor, Dept. of

More information

Promoting Ranking Diversity for Biomedical Information Retrieval based on LDA

Promoting Ranking Diversity for Biomedical Information Retrieval based on LDA Promoting Ranking Diversity for Biomedical Information Retrieval based on LDA Yan Chen, Xiaoshi Yin, Zhoujun Li, Xiaohua Hu and Jimmy Huang State Key Laboratory of Software Development Environment, Beihang

More information

CADIAL Search Engine at INEX

CADIAL Search Engine at INEX CADIAL Search Engine at INEX Jure Mijić 1, Marie-Francine Moens 2, and Bojana Dalbelo Bašić 1 1 Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Croatia {jure.mijic,bojana.dalbelo}@fer.hr

More information

Using Maximum Entropy for Automatic Image Annotation

Using Maximum Entropy for Automatic Image Annotation Using Maximum Entropy for Automatic Image Annotation Jiwoon Jeon and R. Manmatha Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts Amherst Amherst, MA-01003.

More information

Text Modeling with the Trace Norm

Text Modeling with the Trace Norm Text Modeling with the Trace Norm Jason D. M. Rennie jrennie@gmail.com April 14, 2006 1 Introduction We have two goals: (1) to find a low-dimensional representation of text that allows generalization to

More information

Measuring inter-annotator agreement in GO annotations

Measuring inter-annotator agreement in GO annotations Measuring inter-annotator agreement in GO annotations Camon EB, Barrell DG, Dimmer EC, Lee V, Magrane M, Maslen J, Binns ns D, Apweiler R. An evaluation of GO annotation retrieval for BioCreAtIvE and GOA.

More information

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015 University of Virginia Department of Computer Science CS 4501: Information Retrieval Fall 2015 5:00pm-6:15pm, Monday, October 26th Name: ComputingID: This is a closed book and closed notes exam. No electronic

More information

Evaluating Relevance Ranking Strategies for MEDLINE Retrieval

Evaluating Relevance Ranking Strategies for MEDLINE Retrieval 32 Lu et al., Evaluating Relevance Ranking Strategies Application of Information Technology Evaluating Relevance Ranking Strategies for MEDLINE Retrieval ZHIYONG LU, PHD, WON KIM, PHD, W. JOHN WILBUR,

More information

RLIMS-P Website Help Document

RLIMS-P Website Help Document RLIMS-P Website Help Document Table of Contents Introduction... 1 RLIMS-P architecture... 2 RLIMS-P interface... 2 Login...2 Input page...3 Results Page...4 Text Evidence/Curation Page...9 URL: http://annotation.dbi.udel.edu/text_mining/rlimsp2/

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

Exploring and Exploiting the Biological Maze. Presented By Vidyadhari Edupuganti Advisor Dr. Zoe Lacroix

Exploring and Exploiting the Biological Maze. Presented By Vidyadhari Edupuganti Advisor Dr. Zoe Lacroix Exploring and Exploiting the Biological Maze Presented By Vidyadhari Edupuganti Advisor Dr. Zoe Lacroix Motivation An abundance of biological data sources contain data about scientific entities, such as

More information

A Survey on Postive and Unlabelled Learning

A Survey on Postive and Unlabelled Learning A Survey on Postive and Unlabelled Learning Gang Li Computer & Information Sciences University of Delaware ligang@udel.edu Abstract In this paper we survey the main algorithms used in positive and unlabeled

More information

Reducing Redundancy with Anchor Text and Spam Priors

Reducing Redundancy with Anchor Text and Spam Priors Reducing Redundancy with Anchor Text and Spam Priors Marijn Koolen 1 Jaap Kamps 1,2 1 Archives and Information Studies, Faculty of Humanities, University of Amsterdam 2 ISLA, Informatics Institute, University

More information

Author: Yunqing Xia, Zhongda Xie, Qiuge Zhang, Huiyuan Zhao, Huan Zhao Presenter: Zhongda Xie

Author: Yunqing Xia, Zhongda Xie, Qiuge Zhang, Huiyuan Zhao, Huan Zhao Presenter: Zhongda Xie Author: Yunqing Xia, Zhongda Xie, Qiuge Zhang, Huiyuan Zhao, Huan Zhao Presenter: Zhongda Xie Outline 1.Introduction 2.Motivation 3.Methodology 4.Experiments 5.Conclusion 6.Future Work 2 1.Introduction(1/3)

More information

IPA: networks generation algorithm

IPA: networks generation algorithm IPA: networks generation algorithm Dr. Michael Shmoish Bioinformatics Knowledge Unit, Head The Lorry I. Lokey Interdisciplinary Center for Life Sciences and Engineering Technion Israel Institute of Technology

More information

Term Frequency Normalisation Tuning for BM25 and DFR Models

Term Frequency Normalisation Tuning for BM25 and DFR Models Term Frequency Normalisation Tuning for BM25 and DFR Models Ben He and Iadh Ounis Department of Computing Science University of Glasgow United Kingdom Abstract. The term frequency normalisation parameter

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

ENHANCEMENT OF METICULOUS IMAGE SEARCH BY MARKOVIAN SEMANTIC INDEXING MODEL

ENHANCEMENT OF METICULOUS IMAGE SEARCH BY MARKOVIAN SEMANTIC INDEXING MODEL ENHANCEMENT OF METICULOUS IMAGE SEARCH BY MARKOVIAN SEMANTIC INDEXING MODEL Shwetha S P 1 and Alok Ranjan 2 Visvesvaraya Technological University, Belgaum, Dept. of Computer Science and Engineering, Canara

More information

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS 1 WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS BRUCE CROFT NSF Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts,

More information

SNUMedinfo at TREC CDS track 2014: Medical case-based retrieval task

SNUMedinfo at TREC CDS track 2014: Medical case-based retrieval task SNUMedinfo at TREC CDS track 2014: Medical case-based retrieval task Sungbin Choi, Jinwook Choi Medical Informatics Laboratory, Seoul National University, Seoul, Republic of Korea wakeup06@empas.com, jinchoi@snu.ac.kr

More information

Focused Retrieval Using Topical Language and Structure

Focused Retrieval Using Topical Language and Structure Focused Retrieval Using Topical Language and Structure A.M. Kaptein Archives and Information Studies, University of Amsterdam Turfdraagsterpad 9, 1012 XT Amsterdam, The Netherlands a.m.kaptein@uva.nl Abstract

More information

Enriching Knowledge Domain Visualizations: Analysis of a Record Linkage and Information Fusion Approach to Citation Data

Enriching Knowledge Domain Visualizations: Analysis of a Record Linkage and Information Fusion Approach to Citation Data Enriching Knowledge Domain Visualizations: Analysis of a Record Linkage and Information Fusion Approach to Citation Data Marie B. Synnestvedt, MSEd 1, 2 1 Drexel University College of Information Science

More information

A Web Recommendation System Based on Maximum Entropy

A Web Recommendation System Based on Maximum Entropy A Web Recommendation System Based on Maximum Entropy Xin Jin, Bamshad Mobasher,Yanzan Zhou Center for Web Intelligence School of Computer Science, Telecommunication, and Information Systems DePaul University,

More information

Identification of Navigation Lead Candidates Using Citation and Co-Citation Analysis

Identification of Navigation Lead Candidates Using Citation and Co-Citation Analysis Identification of Navigation Lead Candidates Using Citation and Co-Citation Analysis Robert Moro, Mate Vangel, Maria Bielikova Slovak University of Technology in Bratislava Faculty of Informatics and Information

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets Arjumand Younus 1,2, Colm O Riordan 1, and Gabriella Pasi 2 1 Computational Intelligence Research Group,

More information

Ontology Based Prediction of Difficult Keyword Queries

Ontology Based Prediction of Difficult Keyword Queries Ontology Based Prediction of Difficult Keyword Queries Lubna.C*, Kasim K Pursuing M.Tech (CSE)*, Associate Professor (CSE) MEA Engineering College, Perinthalmanna Kerala, India lubna9990@gmail.com, kasim_mlp@gmail.com

More information

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches Masaki Eto Gakushuin Women s College Tokyo, Japan masaki.eto@gakushuin.ac.jp Abstract. To improve the search performance

More information

The Goal of this Document. Where to Start?

The Goal of this Document. Where to Start? A QUICK INTRODUCTION TO THE SEMILAR APPLICATION Mihai Lintean, Rajendra Banjade, and Vasile Rus vrus@memphis.edu linteam@gmail.com rbanjade@memphis.edu The Goal of this Document This document introduce

More information

Knowledge-based Word Sense Disambiguation using Topic Models Devendra Singh Chaplot

Knowledge-based Word Sense Disambiguation using Topic Models Devendra Singh Chaplot Knowledge-based Word Sense Disambiguation using Topic Models Devendra Singh Chaplot Ruslan Salakhutdinov Word Sense Disambiguation Word sense disambiguation (WSD) is defined as the problem of computationally

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

Probabilistic Abstraction Lattices: A Computationally Efficient Model for Conditional Probability Estimation

Probabilistic Abstraction Lattices: A Computationally Efficient Model for Conditional Probability Estimation Probabilistic Abstraction Lattices: A Computationally Efficient Model for Conditional Probability Estimation Daniel Lowd January 14, 2004 1 Introduction Probabilistic models have shown increasing popularity

More information

Supervised classification of law area in the legal domain

Supervised classification of law area in the legal domain AFSTUDEERPROJECT BSC KI Supervised classification of law area in the legal domain Author: Mees FRÖBERG (10559949) Supervisors: Evangelos KANOULAS Tjerk DE GREEF June 24, 2016 Abstract Search algorithms

More information

CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING

CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING 43 CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING 3.1 INTRODUCTION This chapter emphasizes the Information Retrieval based on Query Expansion (QE) and Latent Semantic

More information

Estimating Embedding Vectors for Queries

Estimating Embedding Vectors for Queries Estimating Embedding Vectors for Queries Hamed Zamani Center for Intelligent Information Retrieval College of Information and Computer Sciences University of Massachusetts Amherst Amherst, MA 01003 zamani@cs.umass.edu

More information

University of Amsterdam at INEX 2010: Ad hoc and Book Tracks

University of Amsterdam at INEX 2010: Ad hoc and Book Tracks University of Amsterdam at INEX 2010: Ad hoc and Book Tracks Jaap Kamps 1,2 and Marijn Koolen 1 1 Archives and Information Studies, Faculty of Humanities, University of Amsterdam 2 ISLA, Faculty of Science,

More information

Retrieval Evaluation. Hongning Wang

Retrieval Evaluation. Hongning Wang Retrieval Evaluation Hongning Wang CS@UVa What we have learned so far Indexed corpus Crawler Ranking procedure Research attention Doc Analyzer Doc Rep (Index) Query Rep Feedback (Query) Evaluation User

More information

Window Extraction for Information Retrieval

Window Extraction for Information Retrieval Window Extraction for Information Retrieval Samuel Huston Center for Intelligent Information Retrieval University of Massachusetts Amherst Amherst, MA, 01002, USA sjh@cs.umass.edu W. Bruce Croft Center

More information

Fall CS646: Information Retrieval. Lecture 2 - Introduction to Search Result Ranking. Jiepu Jiang University of Massachusetts Amherst 2016/09/12

Fall CS646: Information Retrieval. Lecture 2 - Introduction to Search Result Ranking. Jiepu Jiang University of Massachusetts Amherst 2016/09/12 Fall 2016 CS646: Information Retrieval Lecture 2 - Introduction to Search Result Ranking Jiepu Jiang University of Massachusetts Amherst 2016/09/12 More course information Programming Prerequisites Proficiency

More information

ECNU at 2017 ehealth Task 2: Technologically Assisted Reviews in Empirical Medicine

ECNU at 2017 ehealth Task 2: Technologically Assisted Reviews in Empirical Medicine ECNU at 2017 ehealth Task 2: Technologically Assisted Reviews in Empirical Medicine Jiayi Chen 1, Su Chen 1, Yang Song 1, Hongyu Liu 1, Yueyao Wang 1, Qinmin Hu 1, Liang He 1, and Yan Yang 1,2 Department

More information

e-scider: A tool to retrieve, prioritize and analyze the articles from PubMed database Sujit R. Tangadpalliwar 1, Rakesh Nimbalkar 2, Prabha Garg* 3

e-scider: A tool to retrieve, prioritize and analyze the articles from PubMed database Sujit R. Tangadpalliwar 1, Rakesh Nimbalkar 2, Prabha Garg* 3 e-scider: A tool to retrieve, prioritize and analyze the articles from PubMed database Sujit R. Tangadpalliwar 1, Rakesh Nimbalkar 2, Prabha Garg* 3 1 National Institute of Pharmaceutical Education and

More information

dr.ir. D. Hiemstra dr. P.E. van der Vet

dr.ir. D. Hiemstra dr. P.E. van der Vet dr.ir. D. Hiemstra dr. P.E. van der Vet Abstract Over the last 20 years genomics research has gained a lot of interest. Every year millions of articles are published and stored in databases. Researchers

More information

Information Retrieval

Information Retrieval Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,

More information

Natural Language Processing Pipelines to Annotate BioC Collections with an Application to the NCBI Disease Corpus

Natural Language Processing Pipelines to Annotate BioC Collections with an Application to the NCBI Disease Corpus Natural Language Processing Pipelines to Annotate BioC Collections with an Application to the NCBI Disease Corpus Donald C. Comeau *, Haibin Liu, Rezarta Islamaj Doğan and W. John Wilbur National Center

More information

Exploiting Index Pruning Methods for Clustering XML Collections

Exploiting Index Pruning Methods for Clustering XML Collections Exploiting Index Pruning Methods for Clustering XML Collections Ismail Sengor Altingovde, Duygu Atilgan and Özgür Ulusoy Department of Computer Engineering, Bilkent University, Ankara, Turkey {ismaila,

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy

Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy Martin Rajman, Pierre Andrews, María del Mar Pérez Almenta, and Florian Seydoux Artificial Intelligence

More information

MedKit: A Helper Toolkit for Automatic Mining of MEDLINE/PubMed Citations. Jing Ding. Daniel Berleant *

MedKit: A Helper Toolkit for Automatic Mining of MEDLINE/PubMed Citations. Jing Ding. Daniel Berleant * MedKit: A Helper Toolkit for Automatic Mining of MEDLINE/PubMed Citations Jing Ding Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA Daniel Berleant * Department

More information

Overview of BioCreative VI Precision Medicine Track

Overview of BioCreative VI Precision Medicine Track Overview of BioCreative VI Precision Medicine Track Mining scientific literature for protein interactions affected by mutations Organizers: Rezarta Islamaj Dogan (NCBI) Andrew Chatr-aryamontri (BioGrid)

More information