CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING


3.1 INTRODUCTION

This chapter presents Information Retrieval based on Query Expansion (QE) and Latent Semantic Indexing (LSI). Initially, the user's query is semantically expanded by query processing with the WordNet and MeSH (Medical Subject Headings) ontologies. Query expansion reformulates the original query so that the information the user actually seeks can be retrieved. After query reformulation, a term-document matrix is constructed for the Latent Semantic Indexing process. Once the term-document matrix is constructed, local and global weighting functions can be applied to it to condition the data. Rank-reduced Singular Value Decomposition (SVD) is then performed on the matrix to determine patterns in the relationships between the terms and concepts contained in the text. The SVD yields new document vector coordinates and new query vector coordinates. Finally, the documents are ranked using cosine similarity.

3.1.1 Query Expansion

The Semantic Web is a machine-understandable web in which information carries well-defined meaning, enabling systems and people to understand information from different sources and to work together more effectively (Berners-Lee et al 2001). The introduction of the Semantic Web is a great leap from the existing Web 2.0, in which the user not only interacts with the web but also has the capability to generate more

meaningful information. The complete information is represented with the help of ontologies. Owing to the decentralized nature of the Semantic Web, it is inevitable that different communities of users or software developers will use their own ontologies to describe semantic data sources (Yingjie Li et al 2010). Furthermore, the increasing popularity of the Internet and of digital libraries has made information retrieval techniques crucial for finding relevant documents. Document retrieval (also known as information retrieval) is the computerized process of retrieving the documents related to a user's request by comparing the request with an automatically generated index of the textual content of the documents in the system. The objective of the retrieval activity is to retrieve the most useful documents, not merely a huge number of documents. In this work, an efficient ontology-mapping query expansion model is proposed for providing a dynamic information retrieval service. The ontologies adapt to the document space within multi-disciplinary domains where differing terminology is used. The objective is to enhance the user experience by improving the search result quality of large-scale search systems (Stein Tomassen 2006). One application of ontologies in information retrieval is query expansion, which involves searching the ontologies for new terms. These new terms are related to the original query terms and are used as part of the query. Ontologies are useful for disambiguation in natural language. Well-designed ontologies give the basis for knowledge representation in a common or a specific sense. Ontology-based knowledge representation provides two applications in information retrieval. Domain-specific ontologies help identify the semantic categories involved in understanding discussion in that domain; for this purpose, the ontologies work as a concept dictionary.
Domain-independent ontology is a general-purpose ontology and has been used for language understanding (Chandrasekaran et al 1999). Query expansion is one of the methods used to improve the performance of a retrieval system. The basic process is to select new terms based on the initial query and then combine both to form a new query. Query expansion aims to express an information need through multiple terms. Ontology-based query refinement suggests newly selected terms drawn from the conceptualized knowledge. Several retrieval methods exist for IR systems; this work focuses on ontology-based query expansion for effective document retrieval, expanding the original query terms using WordNet and MeSH. Ontology allows knowledge to be represented as a set of concepts, properties and the relations between them (Uschold & Gruninger 1996). In information retrieval, users in most cases do not search with the exact terms found in the documents. Hence, relevant documents are not fetched by keyword-based information retrieval; the Semantic Web makes information retrieval user driven rather than keyword driven, and thereby helps retrieve more relevant documents. Many researchers have widely used Natural Language Processing (NLP) to understand the meaning of the user's query in ontology-based information retrieval. The Semantic Web makes use of various types of ontologies for understanding the user's query. Linguistic ontologies such as WordNet, VerbNet, FrameNet and ConceptNet are widely used to understand the concepts of a user query. Other ontologies, such as application- or domain-specific ontologies, contextual ontologies and user-history-based generative ontologies, are also used in information retrieval depending on the user's requirements.

The major problem of current IR systems is that their performance is affected by language ambiguity. When several terms express the same concept, an IR system may retrieve only the documents that represent the concept by the very term used in the query. When a term expresses multiple concepts, that term might lead the retrieval to non-relevant documents. WordNet is a domain-independent ontology; it is applied here to expand the common vocabulary in the query terms and to match the same concept when it is represented by different terms in documents. It is an online lexical reference system in which English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. MeSH is a domain-specific ontology, used to expand biomedical terms and to identify term variants. It is the National Library of Medicine's vocabulary thesaurus, containing a collection of words representing descriptors in a hierarchical structure. The method of expanding query terms and weighting them is among the most important factors affecting retrieval performance.

3.1.2 Latent Semantic Indexing

Latent Semantic Indexing (LSI) provides the ability to find a broader range of relevant documents, because it looks at semantic relationships between groups of keywords and uses a high-dimensional representation. The LSI algorithm also represents both terms and text objects in the same space, allowing all relevant information to be processed together as a collection rather than as unrelated documents. This representational feature also allows objects to be retrieved directly from the query terms. Latent Semantic Indexing is generally regarded as an improvement over keyword-matching search engines; it also admits a variety of other applications.

The standard search engine only requires users to enter a small number of keywords to create a query. The larger this supplied list of keywords, the more irrelevant documents are returned. LSI contrasts with this approach by making use of a larger set of related keywords to improve recall. Relevance feedback improves the query supplied by the user by making use of the terms within relevant documents. This is achieved without increasing the computational requirements needed to perform the query, and it allows both the recall and the precision of the returned results to be improved. As Latent Semantic Indexing is able to correlate related keywords, one possible use is information filtering, where certain types of words are removed, or documents containing certain words are removed from the retrieved set. With an appropriate implementation of LSI, information filtering could be used to remove all documents that follow a generic structure. It could also be used to manage spam, content expressed within chat rooms, news groups and bulletin boards, and family-suitable search engine results. LSI could further be used to determine the semantic relationship between parts of a document, and it could aid in fully automating academic integrity checking: with a broad range of documents in its collection, a system could check a submitted document to determine whether any of its content is copied directly from other sources.

3.1.3 Singular Value Decomposition

Initially, the Latent Semantic Indexing methodology uses the input query and the target set of documents to construct a term-document matrix. The latent semantic structure model is obtained by analyzing this matrix with Singular Value Decomposition (SVD). Singular value decomposition is closely related to a number of mathematical and statistical techniques in other fields, including eigenvector decomposition, spectral analysis and factor analysis. The terminology of factor analysis is used here, since that approach has some precedence in the information retrieval literature. The model starts with a matrix of associations between all pairs of one type of object, e.g., documents. This matrix is then decomposed into the product of matrices of a very special form by a process of eigen analysis; the decomposed matrices consist of eigenvectors and eigenvalues. These special matrices reveal a decomposition of the original data into linearly independent components or factors. In general, many of these components are very small and may be ignored, resulting in an approximate model that contains many fewer factors. The behavior of each of the original documents is then approximated by its values on this smaller number of factors. Finally, the documents are ranked according to their estimated similarity using a distance measure, namely cosine similarity. Hence, for information retrieval, SVD is a suitable technique for finding a set of uncorrelated indexing variables or factors. Each term and document is represented by its vector of factor values in this methodology. Notably, by virtue of the dimension reduction, it is possible for documents with fairly different profiles of term usage to be mapped onto the same vector of factor values; this property accomplishes the smoothing of unreliable data.

3.2 PROPOSED SYSTEM ARCHITECTURE

The proposed system architecture comprises the four major processes mentioned below.

1. Query expansion using the MeSH and WordNet ontologies
2. Term-document matrix construction using Term Frequency (TF) and Inverse Document Frequency (IDF)
3. Indexing of documents by Latent Semantic Indexing (LSI) and Singular Value Decomposition (SVD)
4. Ranking of documents by calculating cosine similarity

Figure 3.1 Proposed System Architecture

Finally, the relevant documents are retrieved from the document repository using the similarity value. The query expansion process, TF-IDF calculation, LSI & SVD procedure and cosine similarity method for effective information retrieval are explained in detail in the following sections. The proposed system architecture for the retrieval of documents is shown in Figure 3.1.

3.2.1 Ontology-based Query Expansion

Query expansion in information retrieval is one of the applications of ontologies. Query expansion expands each term in the original query with synonyms or related terms by searching for new terms in the ontologies. These terms, related to the original query terms, are added as part of the query. Well-designed ontologies give the basis for knowledge representation in a common or a specific sense. Two kinds of ontology are applied in information retrieval: domain-specific and domain-independent. Domain-specific ontologies are mainly used to discover the semantic categories that concern a particular domain. A domain-independent ontology is a general-purpose ontology, mostly used for language understanding.

3.2.1.1 Term-based similarity calculation using MeSH ontology

MeSH is a biomedical controlled vocabulary produced by the U.S. National Library of Medicine (NLM). MeSH consists of descriptors, qualifiers and supplementary concepts. Descriptors are the core elements of the vocabulary. Qualifiers are assigned to descriptors inside the MeSH fields to express a special aspect of a concept. Both descriptors and qualifiers are arranged in several hierarchies. In the proposed approach, the Jaccard similarity measure is used to map input query terms to MeSH terms.
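The Jaccard mapping step can be sketched as follows. This is an illustration, not the thesis implementation: how a query term and a MeSH descriptor are turned into comparable sets is not specified in the chapter, so token sets are assumed here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JaccardSimilarity {

    // J(A, B) = |A intersect B| / |A union B|  (Equation 3.1)
    static <T> double jaccard(Set<T> a, Set<T> b) {
        Set<T> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<T> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // Hypothetical token sets for a query term and a MeSH descriptor
        Set<String> query = new HashSet<>(Arrays.asList("bronchial", "asthma"));
        Set<String> mesh  = new HashSet<>(Arrays.asList("asthma", "bronchial", "spasm"));
        double j = jaccard(query, mesh);
        System.out.println(j);        // prints 0.6666666666666666 (2 shared of 3 distinct)
        // Threshold rule from the text: add the MeSH term when J >= 0.5
        System.out.println(j >= 0.5); // prints true
    }
}
```

The generic method works for any set elements, so the same helper can compare character n-gram sets instead of word sets if a finer-grained match is wanted.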

The Jaccard similarity is the cardinality of the intersection of two sets divided by the cardinality of their union. The value of J(A, B) is 1 if A and B are exactly the same, and it decreases as A and B become increasingly different. The Jaccard similarity between two terms is calculated using Equation (3.1):

J(A, B) = |A ∩ B| / |A ∪ B|     (3.1)

where A ∩ B is the intersection of the two sets A and B, and A ∪ B is their union. If the similarity value is greater than or equal to the threshold value 0.5, the corresponding MeSH terms are added to the query.

3.2.1.2 Synonyms-based similarity calculation using WordNet ontology

WordNet is an electronic lexical database of English, developed and maintained by the Cognitive Science Laboratory of Princeton University. In WordNet, a concept represents a meaning of a term. Terms that have the same meaning are grouped into a synset. Each synset has its gloss (definition) and is linked with synsets higher or lower in the hierarchy by different types of semantic relations. Synonym-based similarity calculation is carried out using WordNet. Starting from the expanded query produced by the previous module, this step generates words equivalent to the input query words by substituting the head noun of each word with its synonyms. Hence, the query expanded in the previous stage is expanded once again by adding keywords from the synsets. The introduction of synonyms into the input query helps resolve the ambiguity problem.
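The synonym-based expansion can be sketched with a hypothetical in-memory synonym table standing in for real WordNet synset lookups; the entries below are illustrative, not actual WordNet data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class QueryExpansion {

    // Keep the original terms and append any synonyms not already present.
    static List<String> expand(List<String> query, Map<String, List<String>> synonyms) {
        List<String> expanded = new ArrayList<>(query);
        for (String term : query)
            for (String syn : synonyms.getOrDefault(term, Collections.emptyList()))
                if (!expanded.contains(syn)) expanded.add(syn);
        return expanded;
    }

    public static void main(String[] args) {
        // Hypothetical synonym table; a real system would query WordNet/MeSH instead.
        Map<String, List<String>> synonyms = new HashMap<>();
        synonyms.put("asthma", Arrays.asList("bronchial asthma", "wheeze"));

        List<String> query = Arrays.asList("asthma", "treatment");
        System.out.println(expand(query, synonyms));
        // prints [asthma, treatment, bronchial asthma, wheeze]
    }
}
```

The expanded list then feeds the TF-IDF module of the next section unchanged, which is why the original terms are retained rather than replaced.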

3.2.2 Term Extraction using TF-IDF

The expanded query is fed into the TF-IDF module. Using the expanded query words, the Term Frequency and Inverse Document Frequency are calculated, and a term matrix is constructed for further processing. Equation (3.2) for the Term Frequency (TF_t) can be written as

TF_t = n_t / N     (3.2)

where n_t is the number of occurrences of the considered term t in a document, and N denotes the number of occurrences of all terms in that document. Equation (3.3) for the Inverse Document Frequency (IDF_t) can be written as

IDF_t = (1/2) (1 + log2(D / d_t))     (3.3)

where D is the number of documents and d_t denotes the number of documents containing the term t. Terms with very low discrimination value, i.e., terms that are not useful for differentiating among documents, are replaced: low-frequency terms by broader thesaurus terms, and high-frequency terms by phrases. If the documents move closer together after a term is assigned, the term is a poor (low-valued) discriminator. The highest IDF value indicates that the term occurs in only one document, and the lowest value indicates that the term occurs in many documents. Thus the correlation between the term and the document is identified.
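The weighting above can be sketched directly. Note the IDF form used here, IDF_t = (1/2)(1 + log2(D/d_t)), is one reading of Equation (3.3); the chapter does not spell out the formula beyond its printed fragments, so treat it as an assumption.

```java
public class TfIdf {

    // TF_t = n_t / N  (Equation 3.2)
    static double tf(int termCount, int totalTerms) {
        return (double) termCount / totalTerms;
    }

    // IDF_t = (1/2) * (1 + log2(D / d_t))  (assumed reading of Equation 3.3)
    static double idf(int totalDocs, int docsWithTerm) {
        return 0.5 * (1.0 + Math.log((double) totalDocs / docsWithTerm) / Math.log(2.0));
    }

    // TF-IDF score = TF_t * IDF_t  (Equation 3.4)
    static double tfIdf(int termCount, int totalTerms, int totalDocs, int docsWithTerm) {
        return tf(termCount, totalTerms) * idf(totalDocs, docsWithTerm);
    }

    public static void main(String[] args) {
        // "asthma" appears once in a 7-token document and in all 3 example documents:
        // TF = 1/7, IDF = 0.5 * (1 + log2(3/3)) = 0.5, score = 1/14
        System.out.println(tfIdf(1, 7, 3, 3));
    }
}
```

With this IDF a term occurring in every document still gets a small positive weight (0.5) rather than zero, which keeps the term-document matrix from zeroing out ubiquitous terms entirely.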

Now that both TF and IDF have been defined, they can be combined to produce the score of a feature in a document. The TF-IDF score is calculated using Equation (3.4):

TF-IDF score = TF_t × IDF_t     (3.4)

A high TF-IDF score is achieved by a high TF and a low DF (document frequency) of the considered term in the collection of documents. The scores make it possible to filter out unwanted common terms. Hence the extraction of terms and the construction of the term matrix using TF-IDF are completed.

3.2.3 Latent Semantic Indexing

Latent Semantic Indexing (LSI) (Scott Deerwester et al) tries to resolve the problems of lexical matching. LSI uses conceptual indices instead of individual words for retrieval. It is an extension of the Vector Space Model, developed to retrieve results related to a specified keyword even if the keyword is not present in the document. Using the term-document matrix constructed in the previous stage, LSI is implemented through the application of Singular Value Decomposition (SVD). SVD is an orthogonal decomposition and is used to reduce the number of dimensions used to represent the documents. It splits a rectangular matrix S of size t × d into three components X, Y and Z. Equation (3.5) for the SVD can be written as

S = X Y Z^T     (3.5)

where S is the t × d term-document matrix, X is a t × d orthogonal matrix, Z is a d × d orthogonal matrix and Y is a d × d diagonal matrix. The LSI algorithm is

depicted in Figure 3.2. The document vector coordinates and the query vector coordinates are derived by applying the LSI algorithm.

Figure 3.2 Latent Semantic Indexing Algorithm

3.2.3.1 Cosine similarity

This metric is frequently used to determine the similarity between two documents. Since two documents may have many words in common, the other methods of calculating similarity (namely the Euclidean distance and the Pearson correlation coefficient) are not viable here. In this similarity metric, the attributes (the words, in the case of documents) form a vector, and the normalized dot product of the two vectors is computed. The cosine similarity is calculated using Equation (3.6):

Sim(q, d) = q · d / (|q| |d|)     (3.6)

where q and d are vectors of attributes and q · d denotes the dot product of q and d. By substituting the document vector coordinates and the query vector coordinates into Equation (3.6), the similarity value is found. If the similarity value is greater than or equal to the threshold value 0.60, the documents are retrieved from the document repositories.

3.2.3.2 LSI & SVD implementation: an example

The Latent Semantic Indexing (LSI) pseudo code is illustrated with a bronchial asthma (respiratory disease) dataset as follows. The target document collection consists of the following documents.

d1: Asthma is characterized by airway inflammation.
d2: Wood dust is a common cause of Asthma.
d3: Bronchial Asthma is a respiratory disease.

Step 1: Initially the term weights are calculated, and the term-document matrix A and the query matrix q are constructed from them.

Term                  d1  d2  d3   q
a                      0   1   1   0
airway                 1   0   0   0
asthma                 1   1   1   1
bronchial              0   0   1   1
by                     1   0   0   0
cause                  0   1   0   0
common                 0   1   0   0
characterized          1   0   0   0
dust                   0   1   0   0
inflammation           1   0   0   0
is                     1   1   1   0
of                     0   1   0   0
respiratory_disease    0   0   1   1
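Step 1 can be reproduced mechanically. The sketch below (an illustration, not the thesis implementation) rebuilds the binary term-document matrix A from the three example documents using the chapter's term list (note that "wood" is not indexed in Step 1) and ranks the documents against the query with the cosine measure of Equation (3.6) in the raw term space; the values therefore differ from the rank-2 similarities computed in Step 6 below.

```java
import java.util.Arrays;

public class LsiExample {
    static final String[] TERMS = {"a", "airway", "asthma", "bronchial", "by",
            "cause", "common", "characterized", "dust", "inflammation", "is",
            "of", "respiratory disease"};
    static final String[] DOCS = {
            "asthma is characterized by airway inflammation",  // d1
            "wood dust is a common cause of asthma",           // d2
            "bronchial asthma is a respiratory disease"        // d3
    };

    // whole-word (or whole-phrase) match, so "a" does not match inside "asthma"
    static boolean containsTerm(String doc, String term) {
        return (" " + doc + " ").contains(" " + term + " ");
    }

    // binary term-document matrix A (rows = terms, columns = documents), as in Step 1
    static int[][] buildMatrix() {
        int[][] a = new int[TERMS.length][DOCS.length];
        for (int i = 0; i < TERMS.length; i++)
            for (int j = 0; j < DOCS.length; j++)
                a[i][j] = containsTerm(DOCS[j], TERMS[i]) ? 1 : 0;
        return a;
    }

    // Sim(q, d) = q·d / (|q| |d|), Equation (3.6)
    static double cosine(double[] q, double[] d) {
        double dot = 0, nq = 0, nd = 0;
        for (int i = 0; i < q.length; i++) {
            dot += q[i] * d[i];
            nq += q[i] * q[i];
            nd += d[i] * d[i];
        }
        return dot / (Math.sqrt(nq) * Math.sqrt(nd));
    }

    public static void main(String[] args) {
        int[][] a = buildMatrix();
        // query vector from Step 1: asthma, bronchial, respiratory disease
        double[] q = new double[TERMS.length];
        q[2] = q[3] = q[12] = 1;
        for (int j = 0; j < DOCS.length; j++) {
            double[] d = new double[TERMS.length];
            for (int i = 0; i < TERMS.length; i++) d[i] = a[i][j];
            System.out.printf("sim(q, d%d) = %.3f%n", j + 1, cosine(q, d));
        }
    }
}
```

Running this prints sim(q, d1) = 0.236, sim(q, d2) = 0.218, sim(q, d3) = 0.775, so even in the raw term space d3 ranks first, in agreement with the direction of the rank-2 LSI result in Step 6.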

Step 2: The matrices U (orthogonal), S (diagonal) and V (orthogonal) are found by decomposing the term-document matrix A, where A = U S V^T.

U =
 0.368  0.303  0.212
 0.154  0.396  0.130
 0.522  0.092  0.082
 0.161  0.048  0.510
 0.154  0.396  0.130
 0.208  0.255  0.298
 0.208  0.255  0.298
 0.154  0.396  0.130
 0.208  0.255  0.298
 0.154  0.396  0.130
 0.522  0.092  0.082
 0.208  0.255  0.298
 0.161  0.048  0.510

S =
 3.286  0.000  0.000
 0.000  2.113  0.000
 0.000  0.000  1.655

V =
 0.505  0.836  0.215
 0.683  0.539  0.493
 0.528  0.102  0.843

V^T =
 0.505  0.683  0.528
 0.836  0.539  0.102
 0.215  0.493  0.843

Step 3: A rank-2 approximation is formed by keeping the first two columns of U and V and the first two columns and rows of S (k = 2).

U_k =
 0.368  0.303
 0.154  0.396
 0.522  0.092
 0.161  0.048
 0.154  0.396
 0.208  0.255
 0.208  0.255
 0.154  0.396
 0.208  0.255
 0.154  0.396
 0.522  0.092
 0.208  0.255
 0.161  0.048

S_k =
 3.286  0.000
 0.000  2.113

V_k =
 0.505  0.836
 0.683  0.539
 0.528  0.102

V_k^T =
 0.505  0.683  0.528
 0.836  0.539  0.102

Step 4: In this step the new document vector coordinates are calculated. The coordinates of the individual document vectors are d1 (-0.4945, 0.6492), d2 (-0.6458, -0.7194) and d3 (-0.5817, 0.2469).

Step 5: In this step the new query vector coordinates are calculated using the equation below:

q = q^T U_k S_k^(-1)     (3.7)

In Equation (3.7), q is the query matrix, U_k is the orthogonal matrix and S_k is the diagonal matrix. The result gives the new coordinates of the query vector in two dimensions; note how this differs from the original query matrix q given in Step 1. With k = 2:

q = (0 0 1 1 0 0 0 0 0 0 0 0 1) ×
 0.368  0.303
 0.154  0.396
 0.522  0.092
 0.161  0.048
 0.154  0.396
 0.208  0.255
 0.208  0.255
 0.154  0.396
 0.208  0.255
 0.154  0.396
 0.522  0.092
 0.208  0.255
 0.161  0.048
 ×
 0.304  0.000
 0.000  0.473

q = (0.257, 0.002)

Step 6: Using the cosine similarity measure, the documents are ranked in decreasing order:

Sim(q, d) = q · d / (|q| |d|)

sim(q, d1) = 0.51, sim(q, d2) = 0.55, sim(q, d3) = 0.80

The ranking of the documents in descending order is therefore d3 > d2 > d1. The pseudo code for Latent Semantic Indexing is given in Figure 3.3.

Figure 3.3 Latent Semantic Indexing - Pseudo Code

3.3 EXPERIMENTAL ENVIRONMENT AND DATASET DESCRIPTION

The proposed LSI algorithm combined with the query expansion strategy is tested using documents obtained from the PubMed database. PubMed is the National Library of Medicine's search service that provides access to over 11 million citations in MEDLINE. MEDLINE is the premier bibliographic database covering the fields of medicine, nursing, dentistry, veterinary medicine, the health care system and the preclinical sciences. It contains more than 11 million references and abstracts from over 4000 biomedical journals. From this database, 180 medical journal documents for 6 different medical keywords were chosen. The chosen sample keywords are shown in Table 3.1. The proposed information retrieval system is implemented using Java (JDK 1.7) and NetBeans.

Table 3.1 Input Query Words

Sl.No  Input Query
1      Rotavirus
2      Anemia
3      Asthma
4      Neoplasm
5      Hyperthyroidism
6      Diabetes

3.4 RESULTS AND DISCUSSION

The experimental results of the proposed method are presented and analyzed in this section, along with the evaluation metrics used in the performance analysis.

3.4.1 Evaluation Metrics

The following evaluation metrics are used to evaluate the effectiveness of document retrieval systems and to justify theoretical and practical developments of these systems. They form a set of measures that follow a common underlying evaluation methodology. The metrics chosen for evaluation are Recall, Precision and the F-measure; the corresponding Equations (3.8), (3.9) and (3.10) are as follows.

Precision, P = |related documents ∩ extracted documents| / |extracted documents|     (3.8)

Recall, R = |related documents ∩ extracted documents| / |related documents|     (3.9)

F-Measure, F = 2PR / (P + R)     (3.10)

As the above equations suggest, in document retrieval Precision is the fraction of the retrieved documents that are relevant to the search, Recall is the fraction of the documents relevant to the query that are successfully retrieved, and the F-measure, which combines Precision and Recall, is their harmonic mean.

3.4.2 Performance Analysis

The information retrieval metrics Precision, Recall and F-measure are calculated and tabulated in Table 3.2, together with their average values. The performance of the proposed retrieval system is compared with the existing system of Kogilavani & Balasubramanie (2009). The Precision, Recall and F-measure of the proposed system increase by 19, 7 and 13 percentage points respectively when compared with the existing system. The performance metrics (Precision, Recall & F-Measure) comparison is given in Table 3.3. The comparison graph for the average Precision, Recall and F-measure of the existing and proposed systems is depicted in Figure 3.4. Even though the document clustering technique in the existing system is improved by the concept-hierarchy knowledge of an ontology, the proposed LSI-based retrieval system outperforms it. The existing system uses only the domain-specific ontology MeSH, whereas the proposed system also uses the general-purpose ontology WordNet in the query expansion process.
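Equations (3.8)-(3.10) can be computed directly from retrieval counts; a minimal sketch with illustrative numbers (not from the experiments):

```java
public class RetrievalMetrics {

    // P = |relevant retrieved| / |retrieved|  (Equation 3.8)
    static double precision(int relevantRetrieved, int retrieved) {
        return (double) relevantRetrieved / retrieved;
    }

    // R = |relevant retrieved| / |relevant|  (Equation 3.9)
    static double recall(int relevantRetrieved, int relevant) {
        return (double) relevantRetrieved / relevant;
    }

    // F = 2PR / (P + R), the harmonic mean of P and R  (Equation 3.10)
    static double fMeasure(double p, double r) {
        return (p + r == 0) ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // e.g. 30 of 40 retrieved documents are relevant, out of 50 relevant overall
        double p = precision(30, 40); // 0.75
        double r = recall(30, 50);    // 0.60
        System.out.println(fMeasure(p, r));
    }
}
```

The harmonic mean penalizes imbalance: a system with P = 0.9 and R = 0.1 scores far below one with P = R = 0.5, which is why F is preferred over a plain average when comparing retrieval systems.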

Table 3.2 Performance Metrics of the Proposed System

Sl.No  Input Query        Precision  Recall  F-Measure
1      Rotavirus          0.76       0.69    0.72
2      Anemia             0.86       0.75    0.80
3      Asthma             0.73       0.68    0.70
4      Neoplasm           0.78       0.71    0.74
5      Hyperthyroidism    0.65       0.58    0.61
6      Diabetes Mellitus  0.67       0.62    0.64
       Average            0.74       0.67    0.70

Table 3.3 Performance Metrics Comparison (Existing Vs Proposed)

Methods   Precision  Recall  F-Measure
Existing  0.55       0.60    0.57
Proposed  0.74       0.67    0.70

Figure 3.4 Comparison Graph for the Precision, Recall and F-measure Values with the Existing Work

The performance of the proposed system is also compared with the existing system of Aswani Kumar et al (2006), as shown in Table 3.4 and Table 3.5. The Precision of the proposed method is higher by 29% and 15% when compared to VSM and LSI with the traditional approach respectively, and higher by 13% and 14% when compared to LSI and VSM with the intelligent approach respectively. The average Precision value of the proposed method is increased due to the ontology based (MeSH & WordNet) query expansion module. The basic process is to select new terms based on the initial query and then combine them with the original terms to form a new query. Query expansion aims to express an information need by multiple terms; the ontology-based query refinement suggests the newly selected terms from the conceptualized knowledge.

Table 3.4 Precision Value Comparison (Traditional Approach Vs Proposed)

Methods                                   Precision
VSM with Traditional Approach (Existing)  0.45
LSI with Traditional Approach (Existing)  0.59
Proposed Method                           0.74

Table 3.5 Precision Value Comparison (Intelligent Approach Vs Proposed)

Methods                                                   Precision
VSM with Intelligent Approach, Aswani Kumar et al (2006)  0.60
LSI with Intelligent Approach, Aswani Kumar et al (2006)  0.61
Proposed Method                                           0.74
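The expansion step described above, selecting new ontology terms for the initial query and merging them with the original terms, can be sketched as follows. The tiny synonym dictionary is a purely illustrative stand-in for the WordNet/MeSH lookup used in the proposed system:

```python
# Hypothetical stand-in for the ontology lookup: in the actual system,
# these synonyms would come from the WordNet and MeSH ontologies.
ONTOLOGY_SYNONYMS = {
    "anemia": ["anaemia", "iron deficiency"],
    "neoplasm": ["tumor", "tumour"],
}

def expand_query(query_terms, ontology=ONTOLOGY_SYNONYMS):
    """Select new terms related to the initial query from the ontology
    and combine them with the original terms to form the new query."""
    expanded = list(query_terms)
    for term in query_terms:
        for synonym in ontology.get(term.lower(), []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded

print(expand_query(["Anemia"]))  # ['Anemia', 'anaemia', 'iron deficiency']
```

Terms without an ontology entry pass through unchanged, so the expanded query always contains the user's original keywords.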

The performance metrics comparison of the traditional approach with the proposed system is presented in Table 3.4, and that of the intelligent approach with the proposed system in Table 3.5. They are also plotted as charts and depicted in Figure 3.5 and Figure 3.6.

Figure 3.5 Precision Value Comparison (Traditional Approach Vs Proposed)

Figure 3.6 Precision Value Comparison (Intelligent Approach Vs Proposed)

3.5 SUMMARY

The performance of the proposed system is improved by using the Query Expansion (QE) approach and the Latent Semantic Indexing (LSI) methodology. The proposed method is tested on the PubMed database. In traditional Information Retrieval systems, user queries are formed by a few keywords, and term mismatch is a serious issue that leads to poor retrieval performance. When the multiple domain-specific keywords given by a user exactly express his or her information need, the retrieval process naturally returns the relevant information. This is achieved by the ontology based query expansion method, which is implemented dynamically with the WordNet and MeSH ontologies. Since most search engines do not expand the query with synonyms, the recall measure of IR performance is substantially reduced by the synonymy problem. Moreover, automatic query expansion may add terms whose meaning differs from the one intended by the user, which leads to the polysemy problem. This problem is resolved by LSI using word usage patterns that are built upon word co-occurrences. Terms and documents are represented within the semantic space created by LSI, and SVD is used in the proposed approach to implement dimensionality reduction. The patterns of word usage are analyzed, and the similarities between terms and documents are exposed within the reduced space. Finally, the retrieved documents are ordered by the cosine similarity measure. The proposed system is evaluated with the standard evaluation metrics Precision, Recall and F-measure, and the experimental results show that the retrieval performance of the proposed system increases significantly when compared to the existing system.
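The LSI ranking step summarized above can be sketched as follows, assuming a NumPy implementation; the toy term-document matrix, the query vector and the function name are illustrative only. Documents and the query are projected into the k-dimensional space obtained from the truncated SVD, and documents are ordered by cosine similarity to the projected query:

```python
import numpy as np

def lsi_rank(term_doc, query_vec, k=2):
    """Rank documents against a query in a k-dimensional LSI space.
    term_doc: terms x documents matrix; query_vec: term-frequency vector."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
    doc_vecs = (np.diag(sk) @ Vtk).T        # documents projected: rows = Uk^T d_j
    q = query_vec @ Uk                      # query folded into the LSI space
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1)
                           * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims)                # document indices, best match first

# Toy example: 4 terms x 3 documents; the query uses the first two terms,
# which dominate document 0, so document 0 should rank first.
A = np.array([[2, 0, 1],
              [1, 0, 0],
              [0, 3, 1],
              [0, 2, 0]], dtype=float)
query = np.array([1, 1, 0, 0], dtype=float)
print(lsi_rank(A, query, k=2))
```

In the reduced space, documents that never share a literal term with the query can still receive a non-zero similarity through co-occurrence patterns, which is how LSI mitigates the synonymy and polysemy problems discussed above.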