Retrieval of Highly Related Documents Containing Gene-Disease Association


K. Santhosh Kumar 1, P. Sudhakar 2
Department of Computer Science & Engineering, Annamalai University, Annamalai Nagar, India.
santhosh09539@gmail.com, kar.sudha@gmail.com

Abstract - This paper presents a method for retrieving documents focused on a given gene-disease association. For a query containing a gene and a disease, the proposed algorithm selects and ranks rich, conclusive biomedical documents from a document pool containing information about the gene and disease mentioned in the query string. The documents retrieved by the proposed system are highly relevant to the gene and disease in the query, so retrieving them will help expert medical curators curate the association between the gene and the disease completely and in less time.

Keywords - biomedical documents, document retrieval system, gene-disease association, expert curators

I. INTRODUCTION

The basic objective of studying a gene-disease association is to identify the influence of a gene on a disease by analyzing people who carry the same gene, with and without the disease under consideration. A gene-disease association must be interpreted carefully for a proper understanding of the cause and cure of the respective disease: an association by itself does not establish that the genetic variant is an important factor in causing the disease. A gene is considered to be associated with a disease if some type of alteration of that gene is frequently seen in people who have the disease. To diagnose and cure diseases more efficiently, much research is concentrated on finding associations between genes and diseases. So far, a large volume of biomedical literature describing gene-disease associations has been published, and search engines have been developed to extract the literature relevant to a given gene-disease association.
To enhance knowledge sharing and advance the research further, the associations between genes and diseases mentioned in the literature must be curated. Many databases containing gene-disease associations have been constructed and made available over the internet; examples include GHR (Genetics Home Reference) and OMIM (Online Mendelian Inheritance in Man). Even though such databases exist, it is hard to curate them completely in a short span of time, mainly for the following reasons: a huge volume of biomedical literature must be analyzed to identify possible gene-disease associations, and curating even a single gene-disease association requires a careful survey of all the relevant literature. Curation also requires a large number of experts in the field of biomedicine, who must put in considerable effort and time to maintain and update the datasets properly. Thus, curating recent findings from the biomedical research literature takes a great deal of time [1-4].

For the experiments, a large set of journal articles from the PubMed Central (PMC) Open Access subset is used, as it is a free full-text archive of biomedical and life-sciences journal literature. The articles containing keywords such as cardiovascular and neural terms and gene names are extracted.

This paper also reviews prior work on keyword-based document retrieval. Based on the submitted query, such a system outputs a sorted list of documents. As a query is a logical combination of keywords occurring in a set of documents, the documents are retrieved using a search mechanism. These search engines follow different retrieval techniques, such as the vector space model [5] and the Boolean model [6]. The performance of a document retrieval system can be measured through parameters such as precision and recall.
II. PROPOSED METHODOLOGY

The document retrieval process begins with keyword detection in each document; the documents are then ranked using the TF-IDF scores of the detected keywords. The highest-ranked documents containing the keywords given in the query are selected and presented to the user. It is assumed that a phrase indicating a gene-disease association in a document contains at least one gene name and the target disease name. Keywords are words that increase the possibility that a sentence describes an interaction relationship, such as gene and disease names. A dictionary of gene names is constructed from the OMIM database [7], and a dictionary of disease names is built from the Genetic Association Database (GAD) [8]. In the document retrieval process, all gene and disease names are treated as keywords.

DOI 10.5013/IJSSST.a.17.32.08 8.1 ISSN: 1473-804x online, 1473-8031 print
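As a minimal sketch of this dictionary-based keyword detection step, the following Python fragment matches document tokens against gene and disease name sets. The file format, the tokenization, and the sample gene/disease names are illustrative assumptions, not taken from the paper:

```python
# Dictionary-based keyword detection: gene and disease names found in a
# document's text count as keywords. The dictionaries would be built from
# OMIM and GAD; the tiny in-line sets below are hypothetical stand-ins.

def load_dictionary(path):
    """Load one lowercase name per line into a set (assumed file format)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def detect_keywords(text, gene_names, disease_names):
    """Return the gene names and disease names occurring in the text."""
    tokens = {tok.strip(".,;:()").lower() for tok in text.split()}
    return tokens & gene_names, tokens & disease_names

gene_dict = {"brca1", "tp53", "apoe"}        # stand-in for the OMIM dictionary
disease_dict = {"alzheimer", "carcinoma"}    # stand-in for the GAD dictionary

genes, diseases = detect_keywords(
    "Mutations in BRCA1 are linked to breast carcinoma.",
    gene_dict, disease_dict)
print(sorted(genes), sorted(diseases))  # ['brca1'] ['carcinoma']
```

A production system would need fuzzier matching (gene synonyms, multi-word disease names), but the set-intersection idea is the same.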

A. Key Word Detection

In this work no external corpus is utilized in the document retrieval process. Two different strategies are considered to handle the keyword detection problem in the text of the documents: a clustering approach and an entropic approach. Both rely on the crucial coherence that exists between the spatial distribution of a word and its importance in the document: the most relevant words are inhomogeneously distributed in the document and are highly relevant to its important topics. Such words show large fluctuations in their frequency, occur in certain regions of the text, and tend to form clusters. The measures of word relevancy considered in this work are the entropic measure, E_nor, and the clustering measure, C; the equations for both are given below.

The entropic approach, a novel keyword detection technique, was presented in [9]. It uses Shannon's entropy, defined over the series of appearances of each word in the document text, and calculating the entropy of a word requires a partition of the text. A text of length N is divided into P parts, and for each word type w a probability measure over the partition, {p_i(w)}, is defined as

    p_i(w) = f_i(w) / Σ_{j=1}^{P} f_j(w),

where f_i(w) is the occurrence frequency of word type w in the i-th part. For this distribution, Shannon's entropy is

    S(w) = − Σ_{i=1}^{P} p_i(w) ln p_i(w).

The value of this entropy is highly influenced by two factors: 1) the number of parts P into which the text is divided, and 2) the method adopted for dividing the text. To eliminate the dependency of the definition on the occurrence frequency, Herrera and Pury [10] defined the normalized entropy E_nor(w) as S(w) divided by the expected entropy of a word with the same frequency distributed at random over the parts. If the value of E_nor(w) is large, the diversity of word type w's distribution is large and so is its relevancy; if a word is distributed randomly, its E_nor value is equal to 1.

To improve the method developed in [6], the clustering measure C was used. In a text of length N, the inter-occurrence distances of a word with occurrence frequency n, denoted {d_1, d_2, ..., d_{n−1}}, can be considered a measure of the relevance of the word through the statistic

    σ = s / ⟨d⟩,

where ⟨d⟩ is the average distance and s = (⟨d²⟩ − ⟨d⟩²)^{1/2} is the standard deviation of the distances. In this representation, a large value of σ means a larger relevancy. Even though keyword detection using σ appears to be a perfect technique, σ has some weaknesses that can lead to inappropriate results, and the measure C corrects some of them. Randomly placed words do not all have the same level of clustering σ, because σ depends on the probability p = n/N of the word in the text. To eliminate this, σ is normalized as

    σ_nor = σ / ⟨σ⟩_p,

where ⟨σ⟩_p is the expected value of σ for a randomly distributed word with probability p. With this normalization, values around σ_nor = 1 indicate that the word carries irrelevant information content for the given text, while large values of σ_nor indicate stronger clustering, which means higher relevance. On the other hand, the occurrence frequency n still influences the values of both σ and σ_nor. Simulations of random texts show that the n-dependence of the mean ⟨σ_nor⟩(n) and of the standard deviation sd(σ_nor)(n) is well fitted by simple functions of n. Taking these functions into account, the measure C is represented as

    C(w) = (σ_nor(w) − ⟨σ_nor⟩(n)) / sd(σ_nor)(n).

If the value of C is 0, the word is distributed in a random manner; if the value of C is high, the word is highly relevant; and if the value of C is negative, it represents a state of repulsion.
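The raw quantities behind the two measures can be sketched in Python as follows. Only the partition entropy S(w) and the clustering statistic σ = s/⟨d⟩ are shown; the normalized forms E_nor, σ_nor, and C additionally require the random-word baselines from [9]/[10], which are omitted here. The toy word positions are hypothetical:

```python
# Raw keyword-relevance quantities: partition entropy and clustering sigma.
import math

def partition_entropy(positions, text_len, parts):
    """Shannon entropy of a word's occurrences over `parts` equal segments."""
    seg = text_len / parts
    counts = [0] * parts
    for pos in positions:
        counts[min(int(pos // seg), parts - 1)] += 1
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

def clustering_sigma(positions):
    """sigma = std(d) / mean(d) over distances d between consecutive occurrences."""
    d = [b - a for a, b in zip(positions, positions[1:])]
    mean = sum(d) / len(d)
    var = sum((x - mean) ** 2 for x in d) / len(d)
    return math.sqrt(var) / mean

# A word clumped near the start of a 1000-token text vs. one spread evenly:
clumped = [10, 12, 14, 16, 900]
even = [100, 300, 500, 700, 900]
print(clustering_sigma(clumped) > clustering_sigma(even))  # True
```

The clumped word has highly fluctuating inter-occurrence distances and hence a larger σ, matching the intuition above that relevant words form clusters.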

B. Document Retrieval

The most relevant documents containing the detected keywords given in the query are selected based on keyword weight. The weight of a detected keyword is calculated using two parameters: the term frequency, TF, and the inverse document frequency, IDF. TF measures how frequently a term occurs in a document, while IDF measures how discriminative the keyword is across the collection. The term frequency can be given mathematically as

    tf(t, d) = f(t, d) / Σ_{t'} f(t', d),

where f(t, d) is the number of occurrences of term t in document d. The IDF is a scaling factor representing the significance of the keyword:

    idf(t) = log(N / n_t),

where N is the number of documents available in the corpus and n_t is the number of documents in which term t appears. The TF-IDF measure is their product:

    tf-idf(t, d) = tf(t, d) × idf(t).

This measure gives the overall keyword weight and is used while retrieving relevant documents.

Suppose the corpus contains K unique terms and a document contains k unique terms. In the vector space model, each document is represented as a vector in the K-dimensional vector space with at most k non-zero entries. Normally k is much smaller than K, since not all the terms present in the corpus appear in any one document. The entries of the vector are the tf-idf weights of the k terms present in the document, calculated using the formula above. The entries for the K − k terms not present in the document are 0, since their tf weight in the current document is 0. Note that their idf scores are not 0: idf is a collection-wide statistic and depends on the documents in the corpus, whereas tf is a document-wise statistic and depends only on the current document.
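The weighting and ranking pipeline of this section can be sketched as follows. The tf normalization (term count over document length) and the unsmoothed idf = log(N/n_t) are common choices assumed here rather than prescribed by the paper, and the toy tokenized corpus is hypothetical:

```python
# TF-IDF vector-space retrieval: weight terms, vectorize query and documents,
# rank documents by cosine similarity.
import math
from collections import Counter

def tfidf_vectors(docs):
    """One sparse {term: tf-idf} vector per tokenized document."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # n_t for each term
    vecs = []
    for doc in docs:
        counts = Counter(doc)
        vecs.append({t: (c / len(doc)) * math.log(n / df[t])
                     for t, c in counts.items()})
    return vecs

def query_vector(terms, docs):
    """Query tf-idf vector using the collection statistics of `docs`."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    counts = Counter(terms)
    return {t: (c / len(terms)) * math.log(n / df[t])
            for t, c in counts.items() if df[t]}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["brca1", "breast", "cancer"],      # illustrative tokenized documents
        ["tp53", "lung", "cancer"],
        ["gene", "expression", "atlas"]]
vecs = tfidf_vectors(docs)
query = query_vector(["brca1", "cancer"], docs)
ranked = sorted(range(len(docs)), key=lambda i: cosine(query, vecs[i]), reverse=True)
print(ranked)  # [0, 1, 2]: the BRCA1/cancer document ranks first
```

Dividing by the two norms inside `cosine` is equivalent to first normalizing both vectors to unit length, as described in the text.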
Therefore, if a term is not present in the current document, its tf score is 0, and when this is multiplied by idf, the tf-idf weight of each of the K − k absent terms is also 0. The result is a sparse vector with 0 in the majority of its entries. In the end, each document is represented as a vector in the vector space, with one entry per term holding its tf-idf score. In a similar fashion, the query text is represented as a vector in the K-dimensional vector space; since queries are much shorter than documents, their vectors have far fewer non-zero entries.

The similarity between the query vector and each document vector is then computed, the documents are sorted by this similarity value, and the documents with the highest similarity values are selected, as their relevancy to the query is highest. Prior to computing the similarity scores, both vectors are normalized to unit length. The angle between the vectors determines the similarity score: in the vector space, similar documents lie closer together and the angle between them is smaller. The angle is measured using the cosine similarity: the dot product of two unit vectors gives the cosine of the angle between them, so a small angle yields a large cosine value, and two similar vectors have a large cosine similarity. Once the cosine similarity between the query vector and all the document vectors has been computed, the documents are sorted in descending order of similarity, and those with the highest cosine similarity are selected. This gives an ordered list of documents relevant to the given query.
III. RESULTS AND DISCUSSION

The effectiveness and reliability of a document retrieval system can be evaluated using two measures, precision and recall. These are first defined for the simple case where a system retrieves a list of documents for a query. Precision (P) is the fraction of the retrieved documents that are relevant; recall (R) is the fraction of the relevant documents that are retrieved:

    P = |relevant ∩ retrieved| / |retrieved|,
    R = |relevant ∩ retrieved| / |relevant|.
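A worked example of these definitions, together with the per-query average precision that underlies a mean-average-precision (MAP) figure. The document IDs and relevance judgments are hypothetical:

```python
# Precision, recall, and average precision for one query.

def precision_recall(retrieved, relevant):
    """P and R for an unordered retrieved set against a relevant set."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k where a relevant doc appears."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)

ranked = ["d1", "d2", "d3", "d4"]   # system output, best first (hypothetical)
relevant = {"d2", "d4", "d7"}       # ground-truth judgments (hypothetical)
p, r = precision_recall(ranked, relevant)
ap = average_precision(ranked, relevant)
print(p, r)  # 0.5 0.6666666666666666
```

MAP is this average precision averaged over all test queries, which is how the 50-query figure reported below is obtained.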

The averaged 11-point precision/recall graph across 50 queries is shown in Fig. 1, and the mean average precision (MAP) for this system is 0.2553. Fig. 2 presents the results of retrieval with TF-IDF on our data. High values of w_d are concentrated at the beginning of the graph, so basing query retrieval on the top words here is likely to return relevant documents. The extra graph indicates the exponential fit of the resulting curve.

IV. CONCLUSION

This work shows that TF-IDF is an efficient and simple measure for matching the keywords in a user query to documents related to that query. The experiments conducted on the set of collected documents show that the TF-IDF score retrieves highly related documents for a given query: if the user searches for a certain topic, the TF-IDF score helps retrieve documents containing information relevant to the query. Moreover, TF-IDF is straightforward to implement, which makes it a good fit for query-based document retrieval systems. The TF-IDF score also has drawbacks. It is suitable only if the query contains an exact keyword that occurs frequently in the document: if the user wants to retrieve documents about the association between a gene and a disease, TF-IDF will not surface documents that are related to the query but lack an exact match for the query words. In this work we have therefore first identified keywords in each document; the documents containing the relevant keywords are then processed further to compute TF-IDF scores, and the highest-ranked documents are retrieved from this pre-filtered result, eliminating the less relevant documents.

REFERENCES

[1] Özgür A, Vu T, Erkan G, Radev DR. Identifying gene-disease associations using centrality on a literature mined gene-interaction network. Bioinformatics. 2008.
[2] Gonzalez G, Uribe JC, Tari L, Brophy C, Baral C. Mining gene-disease relationships from biomedical literature: weighting protein-protein interactions and connectivity measures. Proc. of the Pacific Symposium on Biocomputing, Volume 12, 2007, pp. 28-39.
[3] Baral C, Gonzalez G, Gitter A, Teegarden C, Zeigler A. CBioC: beyond a prototype for collaborative annotation of molecular interactions from the literature. Proc. of the 6th International Conference on Computational System Bioinformatics, 2007, pp. 381-384.
[4] Hu Y, Hines LM, Weng H, Zuo D, Rivera M, Richardson A, LaBaer J. Analysis of genomic and proteomic data using advanced literature mining. J Proteome Res. 2003;2:405-412.
[5] Xu, Hongjiao, et al. "Exploring similarity between academic paper and patent based on Latent Semantic Analysis and Vector Space Model." Fuzzy Systems and Knowledge Discovery (FSKD), 2015 12th International Conference on. IEEE, 2015.
[6] Maitah, Wafa, Mamoun Al-Rababaa, and Ghasan Kannan. "Improving the effectiveness of information retrieval system using adaptive genetic algorithm." International Journal of Computer Science & Information Technology 5.5 (2013): 91.
[7] Online Mendelian Inheritance in Man. Baltimore, MD: McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University. 2015. Retrieved 23 July 2015.
[8] A. Bauer-Mehren, "Gene-Disease Network Analysis Reveals Functional Modules in Mendelian, Complex and Environmental Diseases," PLOS One, pp. 1-3, 2011.
[9] J.P. Herrera, P.A. Pury, "Statistical keyword detection in literary corpora," Eur. Phys. J. B 63 (2008) 135.

Fig. 1. Precision-recall curve for the retrieved documents.

Fig. 2. Results of retrieval with TF-IDF on our data.