Scientific Literature Retrieval based on Terminological Paraphrases using Predicate Argument Tuple

Sung-Pil Choi¹, Sa-kwang Song¹, Hanmin Jung¹, Michaela Geierhos², Sung Hyon Myaeng³

¹ Korea Institute of Science and Technology Information, Daejeon, Korea
² Munich University, Munich, Germany
³ Korea Advanced Institute of Science and Technology, Daejeon, Korea

{spchoi, esmallj, jhm}@kisti.re.kr, michaela.geierhos@cis.unimuenchen.de, myaeng@kaist.ac.kr

Abstract. The conceptual condensability of technical terms permits us to use them as effective queries to search scientific databases. However, authors often use alternative expressions, that is, Terminological Paraphrases (TPs), to represent the meanings of specific terms in the literature. In this paper, we propose an effective way to retrieve de facto relevant documents, which contain only those TPs and cannot be found by conventional models in an environment with only controlled vocabularies, by adapting the Predicate Argument Tuple (PAT). The experiment confirms that PAT-based document retrieval is an effective and promising method for finding such documents and for improving terminology-based scientific information access models.

1 Introduction

Terminology is defined as a set of linguistic elements, each of which represents, designates, and defines a technical concept in a particular scientific field. InfoTerm [1], an international information center for terminology, specifies two important roles of terminology: to conceptually represent the expertise of a particular domain, and to serve as a tool for accessing domain-specific information and knowledge. Although much effort has been devoted to inventing effective ways of query formulation and processing, most of the world's major scientific databases adopt simple keyword-based strategies rather than more advanced but complicated approaches¹.

One reason is that scientific documents such as articles and patents include many technical terms that are discriminative and therefore highly informative. Accordingly, given that users and contents can share these technical terms, simple term-based methods can still achieve high levels of satisfaction.

¹ Google Scholar, PubMed, Microsoft Academic Search

Springer-Verlag Berlin Heidelberg 2012

The conceptual condensability of technical terms permits us to use them as effective queries to search scientific databases. However, authors often use alternative expressions to represent the meanings of specific terms in the literature. Normal keyword matching models can only find documents that contain the input query terms. In sum, with a single technical term, it is nontrivial to access documents that include only alternative expressions of the term, in other words, terminological paraphrases (TPs).

In this paper, we propose an effective way to retrieve documents that contain the alternative expressions denoting the concepts behind technical terms in the literature by adapting the Predicate Argument Tuple (PAT). A PAT consists of multiple arguments and a predicate that represents the semantic relation between them; it therefore expresses both the syntactic and the semantic interrelations between words in a sentence. We exploit PATs as indices for searching textual segments (TPs) that are similar to an input sentence defining a particular term. To achieve this, we construct a novel document retrieval system based on PATs and investigate the retrieval of the de facto relevant documents, which contain only those TPs and cannot be found by conventional models in an environment with only controlled vocabularies (namely, single terms).

2 Related Work

To enhance the search functions of PubMed, the largest biomedical literature database in the world, Lu et al. (2009) introduced the Automatic Term Mapping (ATM) method, which automatically maps user queries into MeSH descriptors and enables query expansion (QE) with various types of thesaurus information [2]. There have been many studies applying QE to improve the performance of biomedical information retrieval with controlled vocabularies such as MeSH and UMLS [3-7].

3 PAT-based Scientific Literature Retrieval System

This section describes a retrieval system that identifies the TPs of input query terms in scientific literature based on the definitions of the terms and thereby retrieves de facto relevant documents efficiently. We start by introducing the architecture of the proposed system.

Fig. 1. System Architecture and Process of PAT-based Retrieval System

Fig. 1 shows the architecture and procedure of our system. Given an input query term, the term definition finder obtains the definition of the term from various sources. Definitional PATs, which compose a term definition, are extracted from the definition by applying syntactic parsing, PAT extraction, and preprocessing. With a PAT query consisting of definitional PATs, the system searches and ranks relevant documents that contain sentences similar to the definition of the input term. To build the search database, our system extracts all the PATs, rather than words, from the original target texts as indices and constructs an inverted file based on them, as shown in Fig. 2.

Fig. 2. PAT-based Inverted File

Fig. 2 shows a small portion of the PAT-based inverted file. Although conventional information retrieval systems have very complex indexing structures, we construct a simple inverted file structure that contains only sentence identifiers as posting information.
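The inverted file described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `PAT` class, the function names, and the sentence identifiers such as "doc1:s1" are hypothetical, and the only property taken from the text is that each PAT maps to a posting list holding sentence identifiers and nothing else.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class PAT(NamedTuple):
    """Hypothetical PAT encoding: a predicate plus its ordered arguments."""
    predicate: str
    arguments: Tuple[str, ...]


def build_inverted_file(sentence_pats: Dict[str, Iterable[PAT]]) -> Dict[PAT, Set[str]]:
    """Map every PAT to the set of sentence identifiers that contain it."""
    index: Dict[PAT, Set[str]] = defaultdict(set)
    for sentence_id, pats in sentence_pats.items():
        for pat in pats:
            index[pat].add(sentence_id)
    return dict(index)


def candidate_sentences(index: Dict[PAT, Set[str]], query_pats: Iterable[PAT]) -> Set[str]:
    """Collect every sentence that shares at least one PAT with the query."""
    hits: Set[str] = set()
    for pat in query_pats:
        hits |= index.get(pat, set())
    return hits


# Toy corpus: sentence identifier -> PATs extracted from that sentence.
corpus = {
    "doc1:s1": [PAT("cause", ("virus", "disease")), PAT("chronic", ("disease",))],
    "doc2:s4": [PAT("cause", ("virus", "disease"))],
    "doc3:s2": [PAT("cover", ("structure", "portion"))],
}
index = build_inverted_file(corpus)
query = [PAT("cause", ("virus", "disease"))]
print(sorted(candidate_sentences(index, query)))
```

Keeping only sentence identifiers in the postings makes lookup a plain set union; any term weights or positions that a conventional index would store are deliberately left out of this sketch.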

3.1 Predicate Argument Tuple (PAT)

Predicate Argument Structure (PAS) is a graph structure that collectively denotes the syntactic and semantic relations between words in a sentence [8].

Figure 3 shows an example of the PAS generated from the results of the Enju Parser [8].

Fig. 3. Predicate Argument Structure and Predicate Argument Tuples in a Sentence

In the left side of the figure, the gray boxes represent predicates, the white boxes denote arguments, and the arrows express the syntactic relations between them. For example, although the predicate "covering" in the sentence has two arguments, "structure" and "portion", "sperm" carries only a single noun argument, "head". We can extract Predicate-Argument Tuples (PATs) from the PAS of a sentence as in Fig. 4. A PAT is an element of a PAS and can be classified into one of four types: connective, verbal, adjectival, and nominal.

3.2 Ranking by PAT

To compute the similarity between an input PAT query and a document and then rank the search results, we use a simple ranking scheme which measures how many PATs in a PAT query exist in a document.

\[ \mathrm{PMR}(Q, S) = \frac{\left|\{\, p \mid p \in Q \wedge p \in S \,\}\right|}{\left|\{\, p \mid p \in S \,\}\right|} \qquad (1) \]

where Q is a PAT query, p is a single PAT, and S is a set of PATs in a sentence. Although we use the PMR (PAT Match Ratio) as our main ranking scheme in this fundamental research, many additional schemes could be devised that may be more effective in retrieving documents containing TPs.
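As a rough illustration of the ranking step, the following Python sketch computes a PAT Match Ratio under the assumption that Equation (1) divides the number of PATs shared by the query and the sentence by the total number of PATs in the sentence. The tuple encoding of a PAT, the function names, and the toy data are hypothetical. The sketch ranks sentences only, because the text does not state how sentence scores are combined into a document score.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical PAT encoding: (predicate, argument_1, argument_2, ...).
Pat = Tuple[str, ...]


def pat_match_ratio(query: FrozenSet[Pat], sentence: FrozenSet[Pat]) -> float:
    """Share of the sentence's PATs that also occur in the PAT query."""
    if not sentence:
        return 0.0  # assumption: a sentence without PATs scores zero
    return len(query & sentence) / len(sentence)


def rank_sentences(query: FrozenSet[Pat],
                   sentences: Dict[str, FrozenSet[Pat]]) -> List[Tuple[str, float]]:
    """Score every sentence, drop non-matching ones, and sort by descending PMR."""
    scored = [(sid, pat_match_ratio(query, pats)) for sid, pats in sentences.items()]
    matched = [item for item in scored if item[1] > 0]
    # Ties are broken by sentence identifier so the order is deterministic.
    return sorted(matched, key=lambda item: (-item[1], item[0]))


query = frozenset({("cause", "virus", "disease"), ("chronic", "disease")})
sentences = {
    "s1": frozenset({("cause", "virus", "disease"), ("chronic", "disease"),
                     ("rare", "disorder")}),
    "s2": frozenset({("cause", "virus", "disease")}),
    "s3": frozenset({("cover", "structure", "portion")}),
}
for sentence_id, score in rank_sentences(query, sentences):
    print(sentence_id, round(score, 3))
```

Because the denominator is the size of the sentence's PAT set, a short sentence that matches a single query PAT can outrank a longer sentence that matches more of the query; this is a property of the ratio as sketched here.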

4 Experiments

In this section, we investigate how well documents that contain only TPs, that is, the de facto relevant documents, can be retrieved from scientific literature in an environment with only controlled vocabularies (namely, single terms).

4.1 Experimental Settings

We use a set of abstracts in the biomedical domain selected from the NDSL (National Discovery for Science Leaders)² database. Table 1 shows its statistics.

Table 1. Target Database used in the Experiment

Items                         Size
# of documents                615,125
# of sentences                6,061,366
# of PAT indices extracted    20,608,631

As for the experimental queries, the experiment uses 43 terms randomly selected from those MeSH thesaurus terms that appear frequently in the target database, as shown in Table 2.

Table 2. Sample Queries from 43 Terms

ID   MeSH Term                  Term Definition
D    Bronchitis, Chronic        A subcategory of chronic obstructive pulmonary disease.
D    Monilethrix                Rare autosomal dominant disorder of the hair shaft.
D    Femur Head Necrosis        Aseptic or avascular necrosis of the femoral head.
D    Kidney Failure, Chronic    The end-stage of chronic renal insufficiency.
D    Dermatitis, Seborrheic     A chronic inflammatory disease of the skin with unknown etiology.
D    Nervous System Disease     Diseases of the central and peripheral nervous system.
D    Hyperargininemia           A rare autosomal recessive disorder of the urea cycle.

We compare three retrieval models in this experiment: (1) the Pseudo-Relevance Feedback model (PRF), (2) the relevance model with term definitions (DEF), and (3) PAT-based document retrieval (PAT). For (1) and (2), we used the Indri system, which produces a ranking model based on a combination of language models [9] and an inference network [10]. In addition, its relevance feedback uses Lavrenko's relevance model [11].

Two experts manually judged the relevance of the top 10 documents retrieved by each system for the 43 query terms. We measured the agreement ratio for all judged documents. The results are shown in Table 3.

Table 3. Agreement Ratio in Relevance Judgments

Systems    Kappa Score [12]    Evaluation³
PRF                            Substantial Agreement
PAT                            Almost Perfect Agreement
DEF                            Substantial Agreement
Average                        Substantial Agreement

The two raters agreed almost perfectly on the results of the PAT-based search. As for the others, the scores were not significantly different. We selected and analyzed one of the two judgment results without resolving the conflicts.

4.2 Experimental Results and Discussion

Table 4 shows the comprehensive results of the experiment with the three document retrieval systems.

Table 4. Evaluation Results of the Three Retrieval Models (Top 10)

Items                                           PRF           PAT           DEF
Number of total query terms (S)                 43            43            43
# of terms retrieving more than 1 document      29 (67.4%)    43 (100%)     43 (100%)
# of terms retrieving more than 10 documents    16 (37.2%)    28 (65.1%)    43 (100%)
Total # of retrieved documents (A)
Total # of relevant documents (B)
# of retrieved documents per term (A/S)
# of relevant documents per term (B/S)
Average precision over terms
Total precision

First, we counted the number of input query terms that retrieved more than one document. Whereas PAT and DEF retrieved documents for all queries, only 29 queries retrieved more than one document with PRF. The numbers of queries retrieving more than 10 documents were 16 with PRF, 28 with PAT, and 43 with DEF. This shows the difficulty of retrieving documents that do not contain the query terms. PAT retrieved the largest number of relevant documents (226) and showed the highest average precision over terms (0.59). Total precision, which refers to the ratio of relevant documents to the total retrieved documents, was also highest in PAT. Although PRF showed low average precision over terms, its total precision was relatively competitive (0.57), considering that this model uses only statistical information to expand the initial query terms.

³ Fair (0.2 < κ ≤ 0.4), Moderate (0.4 < κ ≤ 0.6), Substantial (0.6 < κ ≤ 0.8), and Almost perfect (κ > 0.8)
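The evaluation measures used in this section can be sketched in Python as follows. This is a hedged illustration rather than the evaluation code behind the tables: it assumes binary relevance judgments and the unweighted form of Cohen's kappa, it assumes that terms retrieving no documents are excluded from the per-term average, and all function names and toy numbers are hypothetical. The agreement bands follow the thresholds listed in the footnote above.

```python
from typing import Dict, List, Tuple


def cohen_kappa(rater_a: List[int], rater_b: List[int]) -> float:
    """Unweighted Cohen's kappa for two raters judging the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def agreement_label(kappa: float) -> str:
    """Map a kappa score to the bands given in the footnote."""
    if kappa > 0.8:
        return "Almost perfect"
    if kappa > 0.6:
        return "Substantial"
    if kappa > 0.4:
        return "Moderate"
    if kappa > 0.2:
        return "Fair"
    return "Below fair"  # the footnote defines no band at or below 0.2


def average_precision_over_terms(counts: Dict[str, Tuple[int, int]]) -> float:
    """Mean of per-term precision; counts maps term -> (retrieved, relevant)."""
    per_term = [rel / ret for ret, rel in counts.values() if ret > 0]
    return sum(per_term) / len(per_term)


def total_precision(counts: Dict[str, Tuple[int, int]]) -> float:
    """All relevant documents divided by all retrieved documents (B / A)."""
    retrieved = sum(ret for ret, _ in counts.values())
    relevant = sum(rel for _, rel in counts.values())
    return relevant / retrieved


# Toy judgments (1 = relevant, 0 = not relevant) for ten documents.
judge_1 = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
judge_2 = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
kappa = cohen_kappa(judge_1, judge_2)
print(round(kappa, 3), agreement_label(kappa))

# Toy counts: one term with many hits and one with few.
toy_counts = {"term_a": (10, 8), "term_b": (2, 1)}
print(average_precision_over_terms(toy_counts), total_precision(toy_counts))
```

The toy counts show why the two precision figures can diverge: the per-term average weights every query term equally, whereas total precision is dominated by terms that retrieve many documents.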

5 Conclusion and Future Work

In this paper, we confirmed that PAT-based document retrieval is an effective and promising method for finding relevant documents that contain no explicit query terms, as well as for improving terminology-based scientific information access models. Moreover, we found that PAT-based retrieval could find hidden relevant documents that could not be retrieved by the PRF model. Therefore, our proposed model can be used as a supplementary model by combining it with other conventional retrieval models to improve search performance.

The most pressing issue for future studies will be to expand the PAT retrieval model to find more TPs in the literature. It is possible to generate synonymous PATs such as cause(virus, disease), cause(virus, disorder), and develop(host, disease) without much lexical ambiguity owing to the richness of their contextual information.

6 References

1. InfoTerm. Terminology Standardization. 2010; Available from:
2. Lu, Z., W. Kim, and W.J. Wilbur, Evaluation of query expansion using MeSH in PubMed. Inf. Retr., (1).
3. Abdou, S., P. Ruck, and J. Savoy, Evaluation of stemming, query expansion and manual indexing approaches for the genomic task. cell. 501.
4. Aronson, A.R., The effect of textual variation on concept based information retrieval, in Proceedings a conference of the American Medical Informatics Association.
5. Srinivasan, P., Query expansion and MEDLINE. Inf. Process. Manage., (4).
6. Choi, S.-P., S.-K. Song, and S.-H. Myaeng, Analysis of Sentential Paraphrase Patterns and Errors through Predicate-Argument Tuple-based Approximate Alignment. KIPS Journal, B(2).
7. Choi, S.-P. and S.-H. Myaeng, Simplicity is better: revisiting single kernel PPI extraction, in COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics.
8. Miyao, Y. and J. Tsujii, Feature Forest Models for Probabilistic HPSG Parsing. Computational Linguistics, (1).
9. Ponte, J.M. and W.B. Croft, A language modeling approach to information retrieval, in Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. 1998, ACM: Melbourne, Australia.
10. Turtle, H. and W.B. Croft, Evaluation of an inference network-based retrieval model. ACM Trans. Inf. Syst., (3).
11. Lavrenko, V. and W.B. Croft, Relevance based language models, in Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval. 2001, ACM: New Orleans, Louisiana, United States.
12. Cohen, J., Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, (4).
