Concept Based Search Using LSI and Automatic Keyphrase Extraction


Concept Based Search Using LSI and Automatic Keyphrase Extraction

Ravina Rodrigues, Kavita Asnani
Department of Information Technology (M.E.)
Padre Conceição College of Engineering, Verna, India

Abstract - The classic information retrieval model can lead to poor retrieval: unrelated documents may be included in the answer set, and relevant documents that contain none of the index terms may be missed. Retrieval based on index terms is vague and noisy, and the user's information need relates more to concepts and ideas than to index terms. Latent Semantic Indexing (LSI) is a concept-based retrieval method that overcomes many of the problems evident in today's popular word-based retrieval systems. Most retrieval systems match words in the user's query against words in the text of the documents in the corpus, whereas the LSI model performs the match on concepts. To perform this concept mapping, Singular Value Decomposition (SVD) is used. Keyphrases, in turn, are an important means of document summarization, clustering and topic search: they give a high-level description of a document's contents that makes it easy for prospective readers to decide whether or not the document is relevant to them. In this paper, we first develop an automatic keyphrase extraction model for extracting keyphrases from documents, and then use these keyphrases as the corpus on which conceptual search is performed using LSI.

Keywords - Latent Semantic Indexing; Keyphrases; Retrieval models; Singular Value Decomposition

1. INTRODUCTION

We describe here a new approach to automatic indexing and retrieval. It is designed to overcome a fundamental problem that plagues existing retrieval techniques that try to match the words of queries with the words of documents: users want to retrieve on the basis of conceptual content, and individual words provide unreliable evidence about the conceptual topic or meaning of a document.
There are usually many ways to express a given concept, so the literal terms in a user's query may not match those of a relevant document. In addition, most words have multiple meanings, so terms in a user's query will literally match terms in documents that are not of interest to the user. The proposed approach tries to overcome the deficiencies of term-matching retrieval by treating the unreliability of observed term-document association data as a statistical problem. We assume there is some underlying latent semantic structure in the data that is partially obscured by the randomness of word choice with respect to retrieval, and we use statistical techniques to estimate this latent structure and remove the obscuring noise. Latent Semantic Indexing (LSI) is one such statistical information retrieval method. It uses an algebraic model of document retrieval based on a matrix technique known as Singular Value Decomposition (SVD). The basic idea of LSI is that if two document vectors represent the same topic, they will share many words associated with a keyword, and they will have very close semantic structures after dimension reduction via the truncated SVD ([1], [2]). For a large-scale corpus, however, the computing and storage costs associated with the truncated SVD representation may be prohibitive. Recent studies also indicate that the retrieval accuracy of the truncated SVD technique can deteriorate if the document sets are large ([3], [4]). Several strategies have been proposed to deal with LSI on large corpora ([5], [6]). One is to calculate the SVD on a document's keywords, called keyphrases, rather than on the whole document. Keyphrases provide a brief summary of a document's contents: a list of terms, each made up of one or more words, that cover the main topics of the document with which they are associated. They can be used in information retrieval systems as descriptions of the documents returned by a query, as the basis for search indexes, as a way of browsing a collection, and as a document clustering technique.
Despite their importance, especially in digital libraries, the majority of currently available publications do not contain author-assigned keyphrases. To annotate existing publications with keyphrases, either human indexers are needed (which can be very costly and time consuming), or a mechanism is required whereby keyphrases can be extracted from a document automatically. Automatic keyphrase extraction is thus defined as the process of extracting from a document the keyphrases that a human author would be likely to assign. In this paper, we present a concept based search using LSI on a corpus of automatically extracted keyphrases. For automatic keyphrase extraction we use a machine learning algorithm, Kea: Practical Automatic Keyphrase Extraction [7]. The Kea algorithm is simple and effective, and performs at the current state of the art [7]. It uses the Naïve Bayes machine learning method for training and keyphrase extraction.
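As a rough illustration of the Kea idea (a sketch, not the authors' implementation [7]), the snippet below scores a candidate phrase with two features in the spirit of Kea, TF-IDF and relative position of first occurrence, and combines them with a discretized Naïve Bayes model; the splits and likelihood values are invented for illustration rather than learned from real training documents.

```python
import math

def candidate_features(phrase, tokens, df, n_docs):
    """TF-IDF of a candidate phrase and the relative position of its
    first occurrence -- the two features fed to Naive Bayes."""
    tf = tokens.count(phrase) / len(tokens)
    idf = math.log((n_docs + 1) / (df.get(phrase, 0) + 1))  # smoothed IDF
    return tf * idf, tokens.index(phrase) / len(tokens)

def nb_score(tfidf, first, model):
    """Posterior probability of 'keyphrase' under a two-feature Naive
    Bayes model with coarsely discretized (binned) likelihoods."""
    t_bin = 0 if tfidf < model["t_split"] else 1
    f_bin = 0 if first < model["f_split"] else 1
    yes = model["p_yes"] * model["t_yes"][t_bin] * model["f_yes"][f_bin]
    no = (1 - model["p_yes"]) * model["t_no"][t_bin] * model["f_no"][f_bin]
    return yes / (yes + no)

# Hand-set "trained" model: true keyphrases tend to score high on TF-IDF
# and to appear early in the document.
model = {"p_yes": 0.1, "t_split": 0.05, "f_split": 0.3,
         "t_yes": [0.2, 0.8], "t_no": [0.8, 0.2],
         "f_yes": [0.7, 0.3], "f_no": [0.3, 0.7]}

tokens = "latent semantic indexing maps terms and documents to concepts".split()
df = {"latent": 2}                 # document frequencies over a 100-doc corpus
score = nb_score(*candidate_features("latent", tokens, df, n_docs=100), model)
# "latent" has high TF-IDF and occurs at position 0, so it scores above 0.5.
```

In Kea proper the feature likelihoods come from the training stage, where candidates in training documents are labeled positive when they match an author-assigned keyphrase.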

2. BUILDING THE CORPUS: AUTOMATIC KEYPHRASE EXTRACTION

The corpus consists of keyphrases automatically generated from the documents. For the process of automatic keyphrase extraction, we make use of the well-known Kea algorithm. Kea's extraction algorithm has two stages:

i. Training: create a model for identifying keyphrases, using training documents for which the author's keyphrases are known.
ii. Extraction: choose the keyphrases from a new document, using the above model.

Both stages choose a set of candidate phrases from their input documents and then calculate the values of certain attributes (called features) for each candidate [7]. The Kea algorithm as in [7] comprises the following steps:

i. Candidate phrase generation
ii. Feature calculation
iii. Discretization
iv. Training: building the model
v. Extraction of new keyphrases

3. VECTOR SPACE METHOD (VSM)

A vector-based information retrieval method represents both documents and queries with high-dimensional vectors and computes their similarities by the vector inner product. When the vectors are normalized to unit length, the inner product measures the cosine of the angle between the two vectors in the vector space [8]. In the VSM, each document is represented by a weight vector d_j = (w_1j, w_2j, ..., w_tj)^T, where w_zj is the weight or importance of the z-th term in the representation of document d_j, and t is the size of the indexing term set. A collection of n documents is then represented by a term-by-document matrix with t rows and n columns. Two main components of the term weight are used to fill the elements of this matrix: the Term Frequency (TF) of each word in a particular document, and the Inverse Document Frequency (IDF) of each word, which varies inversely with the number of documents to which the word is assigned. The weight of term i in document j is then given by:

w_ij = tf_ij * idf_i     (1)

where tf_ij is the frequency of the i-th term in the j-th document, and idf_i = log(N / df_i), with df_i the number of documents in which term i appears at least once and N the number of documents in the collection. This scheme assigns the highest weight to terms that appear frequently in a small number of documents in the document set. Queries are given the same vector representation, q_i = (q_1i, q_2i, ..., q_ti)^T, where q_zi is the weight of the z-th term in the representation of query q_i. We measure the similarity between a document and a query, both normalized to unit length, in the underlying vector space [8]. The cosine similarity formula is:

sim(q, d) = (q · d) / (|q| * |d|)     (2)

where q · d is the dot product between q and d, |q| is the magnitude of the query vector, and |d| is the magnitude of the document vector. The advantages of this approach are adaptability, robustness and minimal user intervention. The VSM can be further optimized with a first level of data reduction using Kea, representing the document vectors using keyphrases. We refer to this method as VSM based on automatically extracted keyphrases (KVSM).

4. SINGULAR VALUE DECOMPOSITION

Singular Value Decomposition (SVD) is a very powerful set of techniques for dealing with sets of equations or matrices that are either singular or numerically very close to singular. The SVD allows one to diagnose the problems in a given matrix and provides a numerical answer as well. According to [9], the SVD can be looked at from three mutually compatible points of view. On the one hand, we can see it as a method for transforming correlated variables into a set of uncorrelated ones that better expose the various relationships among the original data items. At the same time, the SVD is a method for identifying and ordering the dimensions along which the data points exhibit the most variation.
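Returning briefly to the VSM of Section 3, equations (1) and (2) can be sketched with NumPy as follows; log(N/df) is assumed for the IDF component, and the toy term-by-document matrix is invented for illustration.

```python
import numpy as np

def tfidf_matrix(tf):
    """tf: t x n matrix of raw term frequencies -> TF-IDF weight matrix."""
    n_docs = tf.shape[1]
    df = np.count_nonzero(tf, axis=1)       # df_i: docs containing term i
    idf = np.log(n_docs / df)               # idf_i = log(N / df_i)
    return tf * idf[:, None]                # w_ij = tf_ij * idf_i  (eq. 1)

def cosine_rank(A, q):
    """Rank documents (columns of A) by cosine similarity to query q (eq. 2)."""
    sims = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))
    return np.argsort(-sims), sims

# Toy 3-term x 3-document collection.
tf = np.array([[2., 0., 1.],    # term "human"
               [0., 3., 0.],    # term "graph"
               [1., 0., 2.]])   # term "interface"
A = tfidf_matrix(tf)
order, sims = cosine_rank(A, np.array([1., 0., 1.]))  # query: human interface
# The graph-only document (column 1) ranks last: it shares no query terms.
```

Note that in a pure term-matching scheme like this one, a document sharing no terms with the query always gets similarity zero; this is exactly the limitation LSI addresses below.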
The third way of looking at the SVD follows from this: it makes it possible to find the best approximation of the original data points using fewer dimensions, and can thus be seen as a method for data reduction. Given an m x n matrix A with rank r, the SVD decomposes it into three matrices:

A = U * D * V^T     (3)

where U is an m x r matrix whose orthonormal columns are the eigenvectors of AA^T, D is an r x r diagonal matrix of singular values, which are the positive square roots of the eigenvalues of A^T A, and V is an n x r matrix whose orthonormal columns are the eigenvectors of A^T A.

5. LATENT SEMANTIC INDEXING ALGORITHM

LSI is a variant of the VSM that maps a high-dimensional space into a low-dimensional one. It replaces the original matrix by another matrix whose column space is only a subspace of the column space of the original. In the VSM, the document matrix is usually high-dimensional and sparse, since every word does not appear in each document. High-dimensional, sparse matrices are susceptible to noise and have difficulty capturing the underlying semantic structure; additionally, the storage and processing of such data place great demands on computing resources. Reducing the model dimensionality is one way to address this problem [10]. The SVD takes advantage of the implicit higher-order structure in the association of terms with documents through the largest singular vectors. The vectors representing the documents and queries are projected onto a new, low-dimensional space obtained by the truncated SVD. The dimensionality reduction is accomplished by approximating the original term-by-document matrix A with a new matrix A_k. In the SVD, a large term-by-document matrix is decomposed into a set of orthogonal factors from which the original matrix can be approximated by a linear combination; vectors of factor weights represent the documents. The SVD of a matrix A is written as

A = U * Σ * V^T     (4)

If the term-by-document matrix A is t x d, then U is a t x d orthogonal matrix, V is a d x d orthogonal matrix, and Σ is a d x d diagonal matrix whose diagonal entries are called the singular values. The singular values can be sorted by magnitude and the top k selected as a means of developing a latent semantic representation of the original matrix.
By setting all but the top k rows of Σ to zero, a low-rank approximation to A, called A_k, can be created through the truncated SVD as

A_k = U_k * Σ_k * V_k^T     (5)

where U_k is the t x k term-by-concept matrix, Σ_k is the k x k concept-by-concept matrix, and V_k^T is the k x d concept-by-document matrix ([11], [4], [2]). Only the first k columns are kept in U_k and only the first k rows are kept in V_k^T. Each row of U_k is a k-dimensional vector representing a term in the original collection. Associated with each of the k reduced dimensions is a latent concept which may not have any explicit semantic content, yet helps to discriminate documents.
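The truncation in equation (5) is a one-liner with NumPy's SVD; this minimal sketch (random toy matrix, k = 2) also checks the well-known Eckart-Young property that the Frobenius error of A_k equals the energy of the dropped singular values.

```python
import numpy as np

def truncated_svd(A, k):
    """Return U_k, sigma_k, V_k^T: the top-k factors of the SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.random((8, 5))                       # toy 8-term x 5-document matrix
U_k, s_k, Vt_k = truncated_svd(A, k=2)
A_k = U_k @ np.diag(s_k) @ Vt_k              # rank-2 approximation (eq. 5)

# A_k is the best rank-2 approximation: its Frobenius-norm error equals
# the root-sum-square of the singular values that were set to zero.
s_all = np.linalg.svd(A, compute_uv=False)
assert np.isclose(np.linalg.norm(A - A_k), np.sqrt(np.sum(s_all[2:] ** 2)))
```
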

A query can be considered as just another document. Queries are formed into pseudo-documents that specify the location of the query in the reduced term-document space [12]. Given q, a vector whose non-zero elements contain the weighted term-frequency counts of the terms that appear in the query, the pseudo-document q_hat can be represented by

q_hat = q^T * U_k * Σ_k^-1     (6)

Thus, the pseudo-document consists of the sum of the term vectors (q^T U_k) corresponding to the terms specified in the query, scaled by the inverse of the singular values (Σ_k^-1). The singular values are used to weight each dimension of the term-document space individually. Once the query is projected onto the reduced term-document space, one of several similarity measures can be applied to compare the position of the pseudo-document with those of the documents. Documents are ranked according to the results of this similarity measure, and the highest ranked documents are returned to the user ([13], [14], [15]). Hence LSI based on automatically extracted keyphrases (KLSI) can be summarized as follows:

i. Compute the term-by-document matrix A.
ii. Compute the SVD of A.
iii. Choose the top k values of Σ to form Σ_k as a means of developing a latent semantic representation of the matrix A.
iv. Set the remaining singular values to 0; keep only the first k columns of U_k and only the first k rows of V_k^T.
v. Compute the query (pseudo-document) vector using q_hat = q^T U_k Σ_k^-1.
vi. Map each document vector into the concept space using d_hat = d^T U_k Σ_k^-1.
vii. Calculate the cosine similarity coefficients between the coordinates of the query vector and the documents.
viii. Rank the documents based on their similarity measures.

A concrete example, shown in Table 1, makes the LSI algorithm and its advantages clearer.

TABLE 1.

In this case, the document set consists of the titles of 9 Bellcore technical memoranda. Keyphrases from the documents were selected for indexing; they are italicized. Note that there are two classes of titles: five about human-computer interaction (labeled c1-c5) and four about graph theory (labeled m1-m4). The entries in the term-by-document matrix are simply the frequencies with which each term actually occurred in each document. Such a matrix is used directly as the initial input of the SVD analysis. For this example we carefully chose documents and terms so that SVD would produce a satisfactory solution using just two dimensions. We use a simple query, "human computer interaction", to find the relevant documents. Simple term-matching techniques would return documents c1, c2 and c4, since they share one or more terms with the query. However, two other documents which are also relevant (c3 and c5) are missed by this method, since they have no terms in common with the query. From Table 2, however, we can observe that by using KLSI, documents c1-c5 (but not m1-m4) lie nearby. In particular, c3 and c5, which share no index terms at all with the query, are retrieved. This is the strength of using LSI: all relevant documents get retrieved.
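The KLSI steps above can be sketched end to end. Like the Bellcore example, the toy collection below (matrix, query and k are invented for illustration) has two topical clusters; a document that shares no terms with the query is still ranked with its cluster because it lies in the same latent concept.

```python
import numpy as np

def klsi_rank(A, q, k):
    """Steps i-viii: SVD of the term-by-document matrix A, projection of
    query and documents into the k-dimensional concept space (eq. 6),
    then cosine ranking of the documents."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k = U[:, :k], s[:k]
    docs = (A.T @ U_k) / s_k       # d_hat = d^T U_k S_k^-1, one row per doc
    q_hat = (q @ U_k) / s_k        # pseudo-document for the query
    sims = docs @ q_hat / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat))
    return np.argsort(-sims)

# Documents 0-2 use only terms 0-2; documents 3-4 use only terms 3-4.
A = np.array([[1., 1., 0., 0., 0.],
              [1., 0., 1., 0., 0.],
              [0., 1., 1., 0., 0.],
              [0., 0., 0., 1., 1.],
              [0., 0., 0., 1., 1.]])
order = klsi_rank(A, q=np.array([1., 0., 0., 0., 0.]), k=2)
# Document 2 shares no term with the query yet outranks documents 3 and 4,
# because it lies in the same latent concept as documents 0 and 1.
```

This mirrors the behavior claimed for c3 and c5: term matching would give document 2 a similarity of zero, while the concept-space ranking groups it with its topic.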

TABLE 2.

6. EXPERIMENTAL RESULTS

In this section, we present the details of the experiments conducted on a corpus of documents drawn from the Computer Science Technical Reports (CSTR) collection.

A. Recall and Precision

Retrieval quality for an information retrieval system can be expressed in a variety of ways. In the current work, we primarily use precision and recall. Precision is defined as the number of relevant documents returned divided by the total number of documents returned. Recall is the number of relevant documents returned divided by the total number of relevant documents. The CSTR document collection contains a total of 100 documents indexed by 483 terms, forming a term-by-document matrix of size 483 x 100. The retrieval quality of LSI depends heavily on the number of dimensions: we need an optimal rank that captures the underlying semantic nature of the data. Truncating below the optimal rank loses important factors, while keeping a higher rank models unnecessary noise and leads to poor performance by regenerating the original data. For this corpus we choose k = 6 as the optimal rank.
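The two measures can be stated directly in code; the returned and relevant document-ID sets here are illustrative only.

```python
def precision_recall(returned, relevant):
    """Precision: fraction of returned documents that are relevant.
    Recall: fraction of all relevant documents that were returned."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    return hits / len(returned), hits / len(relevant)

p, r = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 5])
# p = 2/4 = 0.5 (two of the four returned documents are relevant)
# r = 2/3       (two of the three relevant documents were returned)
```
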

In this experiment, recall vs. precision was calculated using 10 different queries. Figure 1 gives the recall and precision for each query.

Figure 1. Recall vs. Precision

B. Comparison between classic VSM, KVSM and KLSI

To compare the effectiveness of the system, we conducted experiments on 20 queries and calculated the average recall and precision for each model of the system framework: classic VSM, KVSM and KLSI. The performance evaluation is depicted in Table 3.

TABLE 3.
System Model    Average Recall    Average Precision
Classic VSM     41.25%            50.37%
KVSM            43.75%            50.81%

KLSI            68.23%            57.63%

Hence, we can see that KLSI clearly exhibits its superiority over both classic VSM and KVSM.

7. CONCLUSION

In this paper, we analyzed how conceptual search over keyphrases automatically extracted from documents improves the efficiency of information retrieval. We conducted experiments on the document collection and observed an improvement in retrieval results with LSI based on automatically extracted keyphrases compared with the vector space method based on automatically extracted keyphrases. Hence we conclude that the performance of LSI is superior to that of the traditional vector space method.

ACKNOWLEDGMENT

We would like to thank Peter Turney for kindly sharing his corpus and discoveries, and Eibe Frank and Michael Berry for their experience and suggestions.

REFERENCES

[1] Deerwester S., "Indexing by latent semantic analysis", J. Amer. Soc. Inf. Sci., Vol. 41, No. 6, 1990.
[2] Landauer T.K., Foltz P.W. and Laham D., "Introduction to latent semantic analysis", Discourse Processes, Vol. 25, 1998.
[3] Balinski J. and Danilowicz C., "Ranking method based on inter-document distances", Inf. Process. Manag., Vol. 41, No. 4.
[4] Berry M.W. and Shakhina A.P., "Computing sparse reduced-rank approximation to sparse matrices", ACM Trans. Math. Software, Vol. 31, No. 2, 2005.
[5] Cherukuri Aswani Kumar and Suripeddi Srinivas, "Latent semantic indexing using eigenvalue analysis for efficient information retrieval", Int. J. Appl. Math. Comput. Sci., Vol. 16, No. 4.
[6] Li Li and Wu Chou, "Improving latent semantic indexing based classifier with information gain", Seventh International Conference on Spoken Language Processing.
[7] Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin and Craig G. Nevill-Manning, "KEA: Practical Automatic Keyphrase Extraction", Proceedings of the 1st ACM/IEEE-CS Joint Conference on Digital Libraries, 2001.

[8] Yates R.B. and Neto B.R., Modern Information Retrieval, New Delhi: Pearson Education.
[9] Kirk Baker, "Singular Value Decomposition Tutorial", unpublished.
[10] Park H. and Elden L., "Matrix rank reduction for data analysis and feature extraction", Tech. Rep., Dept. of Computer Science and Engineering, University of Minnesota.
[11] Aswani Kumar Ch., Gupta A., Batool M. and Trehan S., "An information retrieval model based on latent semantic indexing with intelligent preprocessing", J. Inf. Knowl. Manag., Vol. 4, No. 4, pp. 1-7.
[12] Bast H. and Weber I., "Insights from viewing ranked retrieval as rank aggregation", Proc. Workshop on Challenges in Web Information Retrieval and Integration (WIRI'05), Tokyo, Japan.
[13] Berry M.W., Drmac Z. and Jessup E.R., "Matrices, vector spaces, and information retrieval", SIAM Rev., Vol. 41, No. 2.
[14] Husbands P., Simon H. and Ding C., "On the use of singular value decomposition for text retrieval", SIAM Comput. Inf. Retrieval.
[15] Ye Y.Q., "Comparing matrix methods in text-based information retrieval", Tech. Rep., School of Mathematical Sciences, Peking University, 2000.


More information

Keyword Extraction by KNN considering Similarity among Features

Keyword Extraction by KNN considering Similarity among Features 64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,

More information

ENHANCEMENT OF METICULOUS IMAGE SEARCH BY MARKOVIAN SEMANTIC INDEXING MODEL

ENHANCEMENT OF METICULOUS IMAGE SEARCH BY MARKOVIAN SEMANTIC INDEXING MODEL ENHANCEMENT OF METICULOUS IMAGE SEARCH BY MARKOVIAN SEMANTIC INDEXING MODEL Shwetha S P 1 and Alok Ranjan 2 Visvesvaraya Technological University, Belgaum, Dept. of Computer Science and Engineering, Canara

More information

Collaborative Ranking between Supervised and Unsupervised Approaches for Keyphrase Extraction

Collaborative Ranking between Supervised and Unsupervised Approaches for Keyphrase Extraction The 2014 Conference on Computational Linguistics and Speech Processing ROCLING 2014, pp. 110-124 The Association for Computational Linguistics and Chinese Language Processing Collaborative Ranking between

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) CSE 6242 / CX 4242 Text Analytics (Text Mining) Concepts, Algorithms, LSI/SVD Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John

More information

Feature selection. LING 572 Fei Xia

Feature selection. LING 572 Fei Xia Feature selection LING 572 Fei Xia 1 Creating attribute-value table x 1 x 2 f 1 f 2 f K y Choose features: Define feature templates Instantiate the feature templates Dimensionality reduction: feature selection

More information

General Instructions. Questions

General Instructions. Questions CS246: Mining Massive Data Sets Winter 2018 Problem Set 2 Due 11:59pm February 8, 2018 Only one late period is allowed for this homework (11:59pm 2/13). General Instructions Submission instructions: These

More information

Image Compression with Singular Value Decomposition & Correlation: a Graphical Analysis

Image Compression with Singular Value Decomposition & Correlation: a Graphical Analysis ISSN -7X Volume, Issue June 7 Image Compression with Singular Value Decomposition & Correlation: a Graphical Analysis Tamojay Deb, Anjan K Ghosh, Anjan Mukherjee Tripura University (A Central University),

More information

Using Singular Value Decomposition to Improve a Genetic Algorithm s Performance

Using Singular Value Decomposition to Improve a Genetic Algorithm s Performance Using Singular Value Decomposition to Improve a Genetic Algorithm s Performance Jacob G. Martin Computer Science University of Georgia Athens, GA 30602 martin@cs.uga.edu Khaled Rasheed Computer Science

More information

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Lecture 06 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Boolean Retrieval vs. Ranked Retrieval Many users (professionals) prefer

More information

International Journal of Advancements in Research & Technology, Volume 2, Issue 8, August ISSN

International Journal of Advancements in Research & Technology, Volume 2, Issue 8, August ISSN International Journal of Advancements in Research & Technology, Volume 2, Issue 8, August-2013 244 Image Compression using Singular Value Decomposition Miss Samruddhi Kahu Ms. Reena Rahate Associate Engineer

More information

Self-organization of very large document collections

Self-organization of very large document collections Chapter 10 Self-organization of very large document collections Teuvo Kohonen, Samuel Kaski, Krista Lagus, Jarkko Salojärvi, Jukka Honkela, Vesa Paatero, Antti Saarela Text mining systems are developed

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval Mohsen Kamyar چهارمین کارگاه ساالنه آزمایشگاه فناوری و وب بهمن ماه 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation

More information

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval 1 Naïve Implementation Convert all documents in collection D to tf-idf weighted vectors, d j, for keyword vocabulary V. Convert

More information

Text Modeling with the Trace Norm

Text Modeling with the Trace Norm Text Modeling with the Trace Norm Jason D. M. Rennie jrennie@gmail.com April 14, 2006 1 Introduction We have two goals: (1) to find a low-dimensional representation of text that allows generalization to

More information

Boolean Model. Hongning Wang

Boolean Model. Hongning Wang Boolean Model Hongning Wang CS@UVa Abstraction of search engine architecture Indexed corpus Crawler Ranking procedure Doc Analyzer Doc Representation Query Rep Feedback (Query) Evaluation User Indexer

More information

Content-based Dimensionality Reduction for Recommender Systems

Content-based Dimensionality Reduction for Recommender Systems Content-based Dimensionality Reduction for Recommender Systems Panagiotis Symeonidis Aristotle University, Department of Informatics, Thessaloniki 54124, Greece symeon@csd.auth.gr Abstract. Recommender

More information

Feature Selection Using Modified-MCA Based Scoring Metric for Classification

Feature Selection Using Modified-MCA Based Scoring Metric for Classification 2011 International Conference on Information Communication and Management IPCSIT vol.16 (2011) (2011) IACSIT Press, Singapore Feature Selection Using Modified-MCA Based Scoring Metric for Classification

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue:

More information

International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Volume 1, Issue 2, July 2014.

International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Volume 1, Issue 2, July 2014. A B S T R A C T International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Information Retrieval Models and Searching Methodologies: Survey Balwinder Saini*,Vikram Singh,Satish

More information

A Metadatabase System for Semantic Image Search by a Mathematical Model of Meaning

A Metadatabase System for Semantic Image Search by a Mathematical Model of Meaning A Metadatabase System for Semantic Image Search by a Mathematical Model of Meaning Yasushi Kiyoki, Takashi Kitagawa and Takanari Hayama Institute of Information Sciences and Electronics University of Tsukuba

More information

Automatic Summarization

Automatic Summarization Automatic Summarization CS 769 Guest Lecture Andrew B. Goldberg goldberg@cs.wisc.edu Department of Computer Sciences University of Wisconsin, Madison February 22, 2008 Andrew B. Goldberg (CS Dept) Summarization

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information

June 15, Abstract. 2. Methodology and Considerations. 1. Introduction

June 15, Abstract. 2. Methodology and Considerations. 1. Introduction Organizing Internet Bookmarks using Latent Semantic Analysis and Intelligent Icons Note: This file is a homework produced by two students for UCR CS235, Spring 06. In order to fully appreacate it, it may

More information

Document Clustering in Reduced Dimension Vector Space

Document Clustering in Reduced Dimension Vector Space Document Clustering in Reduced Dimension Vector Space Kristina Lerman USC Information Sciences Institute 4676 Admiralty Way Marina del Rey, CA 90292 Email: lerman@isi.edu Abstract Document clustering is

More information

Essential Dimensions of Latent Semantic Indexing (LSI)

Essential Dimensions of Latent Semantic Indexing (LSI) Essential Dimensions of Latent Semantic Indexing (LSI) April Kontostathis Department of Mathematics and Computer Science Ursinus College Collegeville, PA 19426 Email: akontostathis@ursinus.edu Abstract

More information

Behavioral Data Mining. Lecture 18 Clustering

Behavioral Data Mining. Lecture 18 Clustering Behavioral Data Mining Lecture 18 Clustering Outline Why? Cluster quality K-means Spectral clustering Generative Models Rationale Given a set {X i } for i = 1,,n, a clustering is a partition of the X i

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

News-Oriented Keyword Indexing with Maximum Entropy Principle.

News-Oriented Keyword Indexing with Maximum Entropy Principle. News-Oriented Keyword Indexing with Maximum Entropy Principle. Li Sujian' Wang Houfeng' Yu Shiwen' Xin Chengsheng2 'Institute of Computational Linguistics, Peking University, 100871, Beijing, China Ilisujian,

More information

Refining Search Queries from Examples Using Boolean Expressions and Latent Semantic Analysis

Refining Search Queries from Examples Using Boolean Expressions and Latent Semantic Analysis Refining Search Queries from Examples Using Boolean Expressions and Latent Semantic Analysis David Johnson, Vishv Malhotra, Peter Vamplew and Sunanda Patro School of Computing, University of Tasmania Private

More information

CorePhrase: Keyphrase Extraction for Document Clustering

CorePhrase: Keyphrase Extraction for Document Clustering CorePhrase: Keyphrase Extraction for Document Clustering Khaled M. Hammouda 1, Diego N. Matute 2, and Mohamed S. Kamel 3 1 Department of Systems Design Engineering 2 School of Computer Science 3 Department

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Text Analytics (Text Mining) Concepts, Algorithms, LSI/SVD Duen Horng (Polo) Chau Assistant Professor Associate Director, MS

More information

A Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS)

A Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS) A Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS) Eman Abdu eha90@aol.com Graduate Center The City University of New York Douglas Salane dsalane@jjay.cuny.edu Center

More information

A Modified Hierarchical Clustering Algorithm for Document Clustering

A Modified Hierarchical Clustering Algorithm for Document Clustering A Modified Hierarchical Algorithm for Document Merin Paul, P Thangam Abstract is the division of data into groups called as clusters. Document clustering is done to analyse the large number of documents

More information

Module 9 : Numerical Relaying II : DSP Perspective

Module 9 : Numerical Relaying II : DSP Perspective Module 9 : Numerical Relaying II : DSP Perspective Lecture 36 : Fast Fourier Transform Objectives In this lecture, We will introduce Fast Fourier Transform (FFT). We will show equivalence between FFT and

More information

Retrieval of Highly Related Documents Containing Gene-Disease Association

Retrieval of Highly Related Documents Containing Gene-Disease Association Retrieval of Highly Related Documents Containing Gene-Disease Association K. Santhosh kumar 1, P. Sudhakar 2 Department of Computer Science & Engineering Annamalai University Annamalai Nagar, India. santhosh09539@gmail.com,

More information

Multimodal Information Spaces for Content-based Image Retrieval

Multimodal Information Spaces for Content-based Image Retrieval Research Proposal Multimodal Information Spaces for Content-based Image Retrieval Abstract Currently, image retrieval by content is a research problem of great interest in academia and the industry, due

More information

Image Contrast Enhancement in Wavelet Domain

Image Contrast Enhancement in Wavelet Domain Advances in Computational Sciences and Technology ISSN 0973-6107 Volume 10, Number 6 (2017) pp. 1915-1922 Research India Publications http://www.ripublication.com Image Contrast Enhancement in Wavelet

More information

A Patent Retrieval Method Using a Hierarchy of Clusters at TUT

A Patent Retrieval Method Using a Hierarchy of Clusters at TUT A Patent Retrieval Method Using a Hierarchy of Clusters at TUT Hironori Doi Yohei Seki Masaki Aono Toyohashi University of Technology 1-1 Hibarigaoka, Tenpaku-cho, Toyohashi-shi, Aichi 441-8580, Japan

More information

CS 6320 Natural Language Processing

CS 6320 Natural Language Processing CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic

More information

Detecting Burnscar from Hyperspectral Imagery via Sparse Representation with Low-Rank Interference

Detecting Burnscar from Hyperspectral Imagery via Sparse Representation with Low-Rank Interference Detecting Burnscar from Hyperspectral Imagery via Sparse Representation with Low-Rank Interference Minh Dao 1, Xiang Xiang 1, Bulent Ayhan 2, Chiman Kwan 2, Trac D. Tran 1 Johns Hopkins Univeristy, 3400

More information

Lecture Video Indexing and Retrieval Using Topic Keywords

Lecture Video Indexing and Retrieval Using Topic Keywords Lecture Video Indexing and Retrieval Using Topic Keywords B. J. Sandesh, Saurabha Jirgi, S. Vidya, Prakash Eljer, Gowri Srinivasa International Science Index, Computer and Information Engineering waset.org/publication/10007915

More information

Semantic Search in s

Semantic Search in  s Semantic Search in Emails Navneet Kapur, Mustafa Safdari, Rahul Sharma December 10, 2010 Abstract Web search technology is abound with techniques to tap into the semantics of information. For email search,

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 3 Modeling Part I: Classic Models Introduction to IR Models Basic Concepts The Boolean Model Term Weighting The Vector Model Probabilistic Model Chap 03: Modeling,

More information

Tag-based Social Interest Discovery

Tag-based Social Interest Discovery Tag-based Social Interest Discovery Xin Li / Lei Guo / Yihong (Eric) Zhao Yahoo!Inc 2008 Presented by: Tuan Anh Le (aletuan@vub.ac.be) 1 Outline Introduction Data set collection & Pre-processing Architecture

More information

Concept-Based Document Similarity Based on Suffix Tree Document

Concept-Based Document Similarity Based on Suffix Tree Document Concept-Based Document Similarity Based on Suffix Tree Document *P.Perumal Sri Ramakrishna Engineering College Associate Professor Department of CSE, Coimbatore perumalsrec@gmail.com R. Nedunchezhian Sri

More information

KeaKAT An Online Automatic Keyphrase Assignment Tool

KeaKAT An Online Automatic Keyphrase Assignment Tool 2012 10th International Conference on Frontiers of Information Technology KeaKAT An Online Automatic Keyphrase Assignment Tool Rabia Irfan, Sharifullah Khan, Irfan Ali Khan, Muhammad Asif Ali School of

More information

FEATURE EXTRACTION TECHNIQUES FOR IMAGE RETRIEVAL USING HAAR AND GLCM

FEATURE EXTRACTION TECHNIQUES FOR IMAGE RETRIEVAL USING HAAR AND GLCM FEATURE EXTRACTION TECHNIQUES FOR IMAGE RETRIEVAL USING HAAR AND GLCM Neha 1, Tanvi Jain 2 1,2 Senior Research Fellow (SRF), SAM-C, Defence R & D Organization, (India) ABSTRACT Content Based Image Retrieval

More information

modern database systems lecture 4 : information retrieval

modern database systems lecture 4 : information retrieval modern database systems lecture 4 : information retrieval Aristides Gionis Michael Mathioudakis spring 2016 in perspective structured data relational data RDBMS MySQL semi-structured data data-graph representation

More information

CSC 411: Lecture 14: Principal Components Analysis & Autoencoders

CSC 411: Lecture 14: Principal Components Analysis & Autoencoders CSC 411: Lecture 14: Principal Components Analysis & Autoencoders Raquel Urtasun & Rich Zemel University of Toronto Nov 4, 2015 Urtasun & Zemel (UofT) CSC 411: 14-PCA & Autoencoders Nov 4, 2015 1 / 18

More information

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.25-30 Enhancing Clustering Results In Hierarchical Approach

More information

WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE

WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE Ms.S.Muthukakshmi 1, R. Surya 2, M. Umira Taj 3 Assistant Professor, Department of Information Technology, Sri Krishna College of Technology, Kovaipudur,

More information

Favorites-Based Search Result Ordering

Favorites-Based Search Result Ordering Favorites-Based Search Result Ordering Ben Flamm and Georey Schiebinger CS 229 Fall 2009 1 Introduction Search engine rankings can often benet from knowledge of users' interests. The query jaguar, for

More information

CSC 411: Lecture 14: Principal Components Analysis & Autoencoders

CSC 411: Lecture 14: Principal Components Analysis & Autoencoders CSC 411: Lecture 14: Principal Components Analysis & Autoencoders Richard Zemel, Raquel Urtasun and Sanja Fidler University of Toronto Zemel, Urtasun, Fidler (UofT) CSC 411: 14-PCA & Autoencoders 1 / 18

More information

Sprinkled Latent Semantic Indexing for Text Classification with Background Knowledge

Sprinkled Latent Semantic Indexing for Text Classification with Background Knowledge Sprinkled Latent Semantic Indexing for Text Classification with Background Knowledge Haiqin Yang and Irwin King Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin,

More information

Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

Algebraic Techniques for Analysis of Large Discrete-Valued Datasets Algebraic Techniques for Analysis of Large Discrete-Valued Datasets Mehmet Koyutürk 1,AnanthGrama 1, and Naren Ramakrishnan 2 1 Dept. of Computer Sciences, Purdue University W. Lafayette, IN, 47907, USA

More information

Ranking models in Information Retrieval: A Survey

Ranking models in Information Retrieval: A Survey Ranking models in Information Retrieval: A Survey R.Suganya Devi Research Scholar Department of Computer Science and Engineering College of Engineering, Guindy, Chennai, Tamilnadu, India Dr D Manjula Professor

More information

Compression, Clustering and Pattern Discovery in Very High Dimensional Discrete-Attribute Datasets

Compression, Clustering and Pattern Discovery in Very High Dimensional Discrete-Attribute Datasets IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 1 Compression, Clustering and Pattern Discovery in Very High Dimensional Discrete-Attribute Datasets Mehmet Koyutürk, Ananth Grama, and Naren Ramakrishnan

More information

New user profile learning for extremely sparse data sets

New user profile learning for extremely sparse data sets New user profile learning for extremely sparse data sets Tomasz Hoffmann, Tadeusz Janasiewicz, and Andrzej Szwabe Institute of Control and Information Engineering, Poznan University of Technology, pl.

More information

Latent Semantic Analysis and Fiedler Embeddings

Latent Semantic Analysis and Fiedler Embeddings Latent Semantic Analysis and Fiedler Embeddings Bruce Hendricson Abstract Latent semantic analysis (LSA) is a method for information retrieval and processing which is based upon the singular value decomposition.

More information

Graph drawing in spectral layout

Graph drawing in spectral layout Graph drawing in spectral layout Maureen Gallagher Colleen Tygh John Urschel Ludmil Zikatanov Beginning: July 8, 203; Today is: October 2, 203 Introduction Our research focuses on the use of spectral graph

More information