Concept Based Search Using LSI and Automatic Keyphrase Extraction


Concept Based Search Using LSI and Automatic Keyphrase Extraction

Ravina Rodrigues, Kavita Asnani
Department of Information Technology (M.E.)
Padre Conceição College of Engineering, Verna, India

Abstract - The classic information retrieval model can lead to poor retrieval: unrelated documents may be included in the answer set, and relevant documents that contain none of the index terms may be missed. Retrieval based on index terms is vague and noisy, and the user's information need relates more to concepts and ideas than to index terms. Latent Semantic Indexing (LSI) is a concept-based retrieval method that overcomes many of the problems evident in today's popular word-based retrieval systems. Most retrieval systems match words in the user's query against words in the text of the documents in the corpus, whereas the LSI model performs the match on concepts. To perform this concept mapping, Singular Value Decomposition (SVD) is used. Keyphrases, in turn, are an important means of document summarization, clustering and topic search: they give a high-level description of a document's contents that makes it easy for prospective readers to decide whether or not the document is relevant to them. In this paper, we first develop an automatic keyphrase extraction model for extracting keyphrases from documents, and then use these keyphrases as the corpus on which conceptual search is performed using LSI.

Keywords - Latent Semantic Indexing; Keyphrases; Retrieval models; Singular Value Decomposition

1. INTRODUCTION

We describe here a new approach to automatic indexing and retrieval. It is designed to overcome a fundamental problem that plagues existing retrieval techniques that try to match the words of queries with the words of documents: users want to retrieve on the basis of conceptual content, and individual words provide unreliable evidence about the conceptual topic or meaning of a document.
There are usually many ways to express a given concept, so the literal terms in a user's query may not match those of a relevant document. In addition, most words have multiple meanings, so terms in a user's query will literally match terms in documents that are not of interest to the user. The proposed approach tries to overcome the deficiencies of term-matching retrieval by treating the unreliability of observed term-document association data as a statistical problem. We assume there is some underlying latent semantic structure in the data that is partially obscured by the randomness of word choice with respect to retrieval, and we use statistical techniques to estimate this latent structure and remove the obscuring noise. Latent Semantic Indexing (LSI) is one such statistical information retrieval method. It uses an algebraic model of document retrieval based on a matrix technique known as Singular Value Decomposition (SVD). The basic idea of LSI is that if two document vectors represent the same topic, they will share many words associated with a keyword, and they will have very close semantic structures after dimension reduction via the truncated SVD ([1], [2]). For a large-scale corpus, however, the computing and storage costs associated with the truncated SVD representation may be prohibitive. Recent studies also indicate that the retrieval accuracy of the truncated SVD technique can deteriorate if the document sets are large ([3], [4]). Several strategies have been proposed to deal with LSI on large corpora ([5], [6]). One is to calculate the SVD on a document's keywords, called keyphrases, rather than on the whole document. Keyphrases provide a brief summary of a document's contents: a list of terms, each made up of one or more words, that cover the main topics of the document with which they are associated. They can be used in information retrieval systems as descriptions of the documents returned by a query, as the basis for search indexes, as a way of browsing a collection, and as a document clustering technique.
Despite their importance, especially in digital libraries, the majority of currently available publications do not contain author-assigned keyphrases. To annotate existing publications with keyphrases, either human indexers are needed (which can be very costly and time consuming), or a mechanism is required whereby keyphrases can be extracted from a document automatically. Automatic keyphrase extraction is thus defined as the process of extracting from a document the keyphrases that a human author would be likely to assign. In this paper, we present a concept based search using LSI on a corpus of automatically extracted keyphrases. For automatic keyphrase extraction we use a machine learning algorithm, Kea: Practical Automatic Keyphrase Extraction [7]. The Kea algorithm is simple and effective, and performs at the current state of the art [7]. It uses the Naïve Bayes machine learning method for training and keyphrase extraction.
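As a rough illustration of the Kea idea (a sketch, not the authors' implementation [7]), the snippet below scores a candidate phrase with two features in the spirit of Kea, TF-IDF and relative position of first occurrence, and combines them with a discretized Naïve Bayes model; the splits and likelihood values are invented for illustration rather than learned from real training documents.

```python
import math

def candidate_features(phrase, tokens, df, n_docs):
    """TF-IDF of a candidate phrase and the relative position of its
    first occurrence -- the two features fed to Naive Bayes."""
    tf = tokens.count(phrase) / len(tokens)
    idf = math.log((n_docs + 1) / (df.get(phrase, 0) + 1))  # smoothed IDF
    return tf * idf, tokens.index(phrase) / len(tokens)

def nb_score(tfidf, first, model):
    """Posterior probability of 'keyphrase' under a two-feature Naive
    Bayes model with coarsely discretized (binned) likelihoods."""
    t_bin = 0 if tfidf < model["t_split"] else 1
    f_bin = 0 if first < model["f_split"] else 1
    yes = model["p_yes"] * model["t_yes"][t_bin] * model["f_yes"][f_bin]
    no = (1 - model["p_yes"]) * model["t_no"][t_bin] * model["f_no"][f_bin]
    return yes / (yes + no)

# Hand-set "trained" model: true keyphrases tend to score high on TF-IDF
# and to appear early in the document.
model = {"p_yes": 0.1, "t_split": 0.05, "f_split": 0.3,
         "t_yes": [0.2, 0.8], "t_no": [0.8, 0.2],
         "f_yes": [0.7, 0.3], "f_no": [0.3, 0.7]}

tokens = "latent semantic indexing maps terms and documents to concepts".split()
df = {"latent": 2}                 # document frequencies over a 100-doc corpus
score = nb_score(*candidate_features("latent", tokens, df, n_docs=100), model)
# "latent" has high TF-IDF and occurs at position 0, so it scores above 0.5.
```

In Kea proper the feature likelihoods come from the training stage, where candidates in training documents are labeled positive when they match an author-assigned keyphrase.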

2. BUILDING THE CORPUS: AUTOMATIC KEYPHRASE EXTRACTION

The corpus consists of keyphrases automatically generated from the documents. For the process of automatic keyphrase extraction, we make use of the well-known Kea algorithm. Kea's extraction algorithm has two stages:

i. Training: create a model for identifying keyphrases, using training documents for which the author's keyphrases are known.
ii. Extraction: choose the keyphrases from a new document, using the above model.

Both stages choose a set of candidate phrases from their input documents and then calculate the values of certain attributes (called features) for each candidate [7]. The Kea algorithm as in [7] comprises the following steps:

i. Candidate phrase generation
ii. Feature calculation
iii. Discretization
iv. Training: building the model
v. Extraction of new keyphrases

3. VECTOR SPACE METHOD (VSM)

A vector-based information retrieval method represents both documents and queries with high-dimensional vectors and computes their similarities by the vector inner product. When the vectors are normalized to unit length, the inner product measures the cosine of the angle between the two vectors in the vector space [8]. In the VSM, each document is represented by a weight vector d_j = (w_1j, w_2j, ..., w_tj)^T, where w_zj is the weight or importance of the z-th term in the representation of document d_j, and t is the size of the indexing term set. A collection of n documents is then represented by a term-by-document matrix with t rows and n columns. Two main components of the term weight are used to fill the elements of this matrix: the Term Frequency (TF) of each word in a particular document, and the Inverse Document Frequency (IDF) of each word, which varies inversely with the number of documents to which the word is assigned. The weight of term i in document j is then given by:

w_ij = tf_ij * idf_i     (1)

where tf_ij is the frequency of the i-th term in the j-th document, and idf_i = log(N / df_i), with df_i the number of documents in which term i appears at least once and N the number of documents in the collection. This scheme assigns the highest weight to terms that appear frequently in a small number of documents in the document set. Queries are given the same vector representation, q_i = (q_1i, q_2i, ..., q_ti)^T, where q_zi is the weight of the z-th term in the representation of query q_i. We measure the similarity between a document and a query, both normalized to unit length, in the underlying vector space [8]. The cosine similarity formula is:

sim(q, d) = (q · d) / (|q| * |d|)     (2)

where q · d is the dot product between q and d, |q| is the magnitude of the query vector, and |d| is the magnitude of the document vector. The advantages of this approach are adaptability, robustness and minimal user intervention. The VSM can be further optimized with a first level of data reduction using Kea, representing the document vectors using keyphrases. We refer to this method as VSM based on automatically extracted keyphrases (KVSM).

4. SINGULAR VALUE DECOMPOSITION

Singular Value Decomposition (SVD) is a very powerful set of techniques for dealing with sets of equations or matrices that are either singular or numerically very close to singular. The SVD allows one to diagnose the problems in a given matrix and provides a numerical answer as well. According to [9], the SVD can be looked at from three mutually compatible points of view. On the one hand, we can see it as a method for transforming correlated variables into a set of uncorrelated ones that better expose the various relationships among the original data items. At the same time, the SVD is a method for identifying and ordering the dimensions along which the data points exhibit the most variation.
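Returning briefly to the VSM of Section 3, equations (1) and (2) can be sketched with NumPy as follows; log(N/df) is assumed for the IDF component, and the toy term-by-document matrix is invented for illustration.

```python
import numpy as np

def tfidf_matrix(tf):
    """tf: t x n matrix of raw term frequencies -> TF-IDF weight matrix."""
    n_docs = tf.shape[1]
    df = np.count_nonzero(tf, axis=1)       # df_i: docs containing term i
    idf = np.log(n_docs / df)               # idf_i = log(N / df_i)
    return tf * idf[:, None]                # w_ij = tf_ij * idf_i  (eq. 1)

def cosine_rank(A, q):
    """Rank documents (columns of A) by cosine similarity to query q (eq. 2)."""
    sims = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))
    return np.argsort(-sims), sims

# Toy 3-term x 3-document collection.
tf = np.array([[2., 0., 1.],    # term "human"
               [0., 3., 0.],    # term "graph"
               [1., 0., 2.]])   # term "interface"
A = tfidf_matrix(tf)
order, sims = cosine_rank(A, np.array([1., 0., 1.]))  # query: human interface
# The graph-only document (column 1) ranks last: it shares no query terms.
```

Note that in a pure term-matching scheme like this one, a document sharing no terms with the query always gets similarity zero; this is exactly the limitation LSI addresses below.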
The third way of looking at the SVD follows from this: it makes it possible to find the best approximation of the original data points using fewer dimensions, and can thus be seen as a method for data reduction. Given an m x n matrix A with rank r, the SVD decomposes it into three matrices:

A = U * D * V^T     (3)

where U is an m x r matrix whose orthonormal columns are the eigenvectors of AA^T, D is an r x r diagonal matrix of singular values, which are the positive square roots of the eigenvalues of A^T A, and V is an n x r matrix whose orthonormal columns are the eigenvectors of A^T A.

5. LATENT SEMANTIC INDEXING ALGORITHM

LSI is a variant of the VSM that maps a high-dimensional space into a low-dimensional one. It replaces the original matrix by another matrix whose column space is only a subspace of the column space of the original. In the VSM, the document matrix is usually high-dimensional and sparse, since every word does not appear in each document. High-dimensional, sparse matrices are susceptible to noise and have difficulty capturing the underlying semantic structure; additionally, the storage and processing of such data place great demands on computing resources. Reducing the model dimensionality is one way to address this problem [10]. The SVD takes advantage of the implicit higher-order structure in the association of terms with documents through the largest singular vectors. The vectors representing the documents and queries are projected onto a new, low-dimensional space obtained by the truncated SVD. The dimensionality reduction is accomplished by approximating the original term-by-document matrix A with a new matrix A_k. In the SVD, a large term-by-document matrix is decomposed into a set of orthogonal factors from which the original matrix can be approximated by a linear combination; vectors of factor weights represent the documents. The SVD of a matrix A is written as

A = U * Σ * V^T     (4)

If the term-by-document matrix A is t x d, then U is a t x d orthogonal matrix, V is a d x d orthogonal matrix, and Σ is a d x d diagonal matrix whose diagonal entries are called the singular values. The singular values can be sorted by magnitude and the top k selected as a means of developing a latent semantic representation of the original matrix.
By setting all but the top k rows of Σ to zero, a low-rank approximation to A, called A_k, can be created through the truncated SVD as

A_k = U_k * Σ_k * V_k^T     (5)

where U_k is the t x k term-by-concept matrix, Σ_k is the k x k concept-by-concept matrix, and V_k^T is the k x d concept-by-document matrix ([11], [4], [2]). Only the first k columns are kept in U_k and only the first k rows are kept in V_k^T. Each row of U_k is a k-dimensional vector representing a term in the original collection. Associated with each of the k reduced dimensions is a latent concept which may not have any explicit semantic content, yet helps to discriminate documents.
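The truncation in equation (5) is a one-liner with NumPy's SVD; this minimal sketch (random toy matrix, k = 2) also checks the well-known Eckart-Young property that the Frobenius error of A_k equals the energy of the dropped singular values.

```python
import numpy as np

def truncated_svd(A, k):
    """Return U_k, sigma_k, V_k^T: the top-k factors of the SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.random((8, 5))                       # toy 8-term x 5-document matrix
U_k, s_k, Vt_k = truncated_svd(A, k=2)
A_k = U_k @ np.diag(s_k) @ Vt_k              # rank-2 approximation (eq. 5)

# A_k is the best rank-2 approximation: its Frobenius-norm error equals
# the root-sum-square of the singular values that were set to zero.
s_all = np.linalg.svd(A, compute_uv=False)
assert np.isclose(np.linalg.norm(A - A_k), np.sqrt(np.sum(s_all[2:] ** 2)))
```
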

A query can be considered as just another document. Queries are formed into pseudo-documents that specify the location of the query in the reduced term-document space [12]. Given q, a vector whose non-zero elements contain the weighted term-frequency counts of the terms that appear in the query, the pseudo-document q_hat can be represented by

q_hat = q^T * U_k * Σ_k^-1     (6)

Thus, the pseudo-document consists of the sum of the term vectors (q^T U_k) corresponding to the terms specified in the query, scaled by the inverse of the singular values (Σ_k^-1). The singular values are used to weight each dimension of the term-document space individually. Once the query is projected onto the reduced term-document space, one of several similarity measures can be applied to compare the position of the pseudo-document with those of the documents. Documents are ranked according to the results of this similarity measure, and the highest ranked documents are returned to the user ([13], [14], [15]). Hence LSI based on automatically extracted keyphrases (KLSI) can be summarized as follows:

i. Compute the term-by-document matrix A.
ii. Compute the SVD of A.
iii. Choose the top k values of Σ to form Σ_k as a means of developing a latent semantic representation of the matrix A.
iv. Set the remaining singular values to 0; keep only the first k columns of U_k and only the first k rows of V_k^T.
v. Compute the query (pseudo-document) vector using q_hat = q^T U_k Σ_k^-1.
vi. Map each document vector into the concept space using d_hat = d^T U_k Σ_k^-1.
vii. Calculate the cosine similarity coefficients between the coordinates of the query vector and the documents.
viii. Rank the documents based on their similarity measures.

A concrete example, shown in Table 1, makes the LSI algorithm and its advantages clearer.

TABLE 1.

In this case, the document set consists of the titles of 9 Bellcore technical memoranda. Keyphrases from the documents were selected for indexing; they are italicized. Note that there are two classes of titles: five about human-computer interaction (labeled c1-c5) and four about graph theory (labeled m1-m4). The entries in the term-by-document matrix are simply the frequencies with which each term actually occurred in each document. Such a matrix is used directly as the initial input of the SVD analysis. For this example we carefully chose documents and terms so that SVD would produce a satisfactory solution using just two dimensions. We use a simple query, "human computer interaction", to find the relevant documents. Simple term-matching techniques would return documents c1, c2 and c4, since they share one or more terms with the query. However, two other documents which are also relevant (c3 and c5) are missed by this method, since they have no terms in common with the query. From Table 2, however, we can observe that by using KLSI, documents c1-c5 (but not m1-m4) lie nearby. In particular, c3 and c5, which share no index terms at all with the query, are retrieved. This is the strength of using LSI: all relevant documents get retrieved.
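The KLSI steps above can be sketched end to end. Like the Bellcore example, the toy collection below (matrix, query and k are invented for illustration) has two topical clusters; a document that shares no terms with the query is still ranked with its cluster because it lies in the same latent concept.

```python
import numpy as np

def klsi_rank(A, q, k):
    """Steps i-viii: SVD of the term-by-document matrix A, projection of
    query and documents into the k-dimensional concept space (eq. 6),
    then cosine ranking of the documents."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k = U[:, :k], s[:k]
    docs = (A.T @ U_k) / s_k       # d_hat = d^T U_k S_k^-1, one row per doc
    q_hat = (q @ U_k) / s_k        # pseudo-document for the query
    sims = docs @ q_hat / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat))
    return np.argsort(-sims)

# Documents 0-2 use only terms 0-2; documents 3-4 use only terms 3-4.
A = np.array([[1., 1., 0., 0., 0.],
              [1., 0., 1., 0., 0.],
              [0., 1., 1., 0., 0.],
              [0., 0., 0., 1., 1.],
              [0., 0., 0., 1., 1.]])
order = klsi_rank(A, q=np.array([1., 0., 0., 0., 0.]), k=2)
# Document 2 shares no term with the query yet outranks documents 3 and 4,
# because it lies in the same latent concept as documents 0 and 1.
```

This mirrors the behavior claimed for c3 and c5: term matching would give document 2 a similarity of zero, while the concept-space ranking groups it with its topic.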

TABLE 2.

6. EXPERIMENTAL RESULTS

In this section, we present the details of the experiments conducted on a corpus of documents drawn from the Computer Science Technical Reports (CSTR) collection.

A. Recall and Precision

Retrieval quality for an information retrieval system can be expressed in a variety of ways. In the current work, we primarily use precision and recall. Precision is defined as the number of relevant documents returned divided by the total number of documents returned. Recall is the number of relevant documents returned divided by the total number of relevant documents. The CSTR document collection contains a total of 100 documents indexed by 483 terms, forming a term-by-document matrix of size 483 x 100. The retrieval quality of LSI depends heavily on the number of dimensions: we need an optimal rank that captures the underlying semantic nature of the data. Truncating below the optimal rank loses important factors, while keeping a higher rank models unnecessary noise and leads to poor performance by regenerating the original data. For this corpus we choose k = 6 as the optimal rank.
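The two measures can be stated directly in code; the returned and relevant document-ID sets here are illustrative only.

```python
def precision_recall(returned, relevant):
    """Precision: fraction of returned documents that are relevant.
    Recall: fraction of all relevant documents that were returned."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    return hits / len(returned), hits / len(relevant)

p, r = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 5])
# p = 2/4 = 0.5 (two of the four returned documents are relevant)
# r = 2/3       (two of the three relevant documents were returned)
```
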

In this experiment, recall vs. precision was calculated using 10 different queries. Figure 1 gives the recall and precision for each query.

Figure 1. Recall vs. Precision

B. Comparison between classic VSM, KVSM and KLSI

To compare the effectiveness of the system, we conducted experiments on 20 queries and calculated the average recall and precision for each model of the system framework: classic VSM, KVSM and KLSI. The performance evaluation is depicted in Table 3.

TABLE 3.
System Model    Average Recall    Average Precision
Classic VSM     41.25%            50.37%
KVSM            43.75%            50.81%

KLSI            68.23%            57.63%

Hence, we can see that KLSI clearly exhibits its superiority over both classic VSM and KVSM.

7. CONCLUSION

In this paper, we analyzed how conceptual search over keyphrases automatically extracted from documents improves the efficiency of information retrieval. We conducted experiments on the document collection and observed an improvement in retrieval results with LSI based on automatically extracted keyphrases compared with the vector space method based on automatically extracted keyphrases. Hence we conclude that the performance of LSI is superior to that of the traditional vector space method.

ACKNOWLEDGMENT

We would like to thank Peter Turney for kindly sharing his corpus and discoveries, and Eibe Frank and Michael Berry for their experience and suggestions.

REFERENCES

[1] Deerwester S., "Indexing by latent semantic analysis", J. Amer. Soc. Inf. Sci., Vol. 41, No. 6, 1990.
[2] Landauer T.K., Foltz P.W. and Laham D., "Introduction to latent semantic analysis", Discourse Processes, Vol. 25, 1998.
[3] Balinski J. and Danilowicz C., "Ranking method based on inter-document distances", Inf. Process. Manag., Vol. 41, No. 4.
[4] Berry M.W. and Shakhina A.P., "Computing sparse reduced-rank approximation to sparse matrices", ACM Trans. Math. Software, Vol. 31, No. 2, 2005.
[5] Cherukuri Aswani Kumar and Suripeddi Srinivas, "Latent semantic indexing using eigenvalue analysis for efficient information retrieval", Int. J. Appl. Math. Comput. Sci., Vol. 16, No. 4.
[6] Li Li and Wu Chou, "Improving latent semantic indexing based classifier with information gain", Seventh International Conference on Spoken Language Processing.
[7] Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin and Craig G. Nevill-Manning, "KEA: Practical Automatic Keyphrase Extraction", Proceedings of the 1st ACM/IEEE-CS Joint Conference on Digital Libraries, 2001.

[8] Yates R.B. and Neto B.R., Modern Information Retrieval, New Delhi: Pearson Education.
[9] Kirk Baker, "Singular Value Decomposition Tutorial", unpublished.
[10] Park H. and Elden L., "Matrix rank reduction for data analysis and feature extraction", Tech. Rep., Dept. of Computer Science and Engineering, University of Minnesota.
[11] Aswani Kumar Ch., Gupta A., Batool M. and Trehan S., "An information retrieval model based on latent semantic indexing with intelligent preprocessing", J. Inf. Knowl. Manag., Vol. 4, No. 4, pp. 1-7.
[12] Bast H. and Weber I., "Insights from viewing ranked retrieval as rank aggregation", Proc. Workshop on Challenges in Web Information Retrieval and Integration (WIRI'05), Tokyo, Japan.
[13] Berry M.W., Drmac Z. and Jessup E.R., "Matrices, vector spaces, and information retrieval", SIAM Rev., Vol. 41, No. 2.
[14] Husbands P., Simon H. and Ding C., "On the use of singular value decomposition for text retrieval", SIAM Comput. Inf. Retrieval.
[15] Ye Y.Q., "Comparing matrix methods in text-based information retrieval", Tech. Rep., School of Mathematical Sciences, Peking University, 2000.


More information

Keyword Extraction by KNN considering Similarity among Features

Keyword Extraction by KNN considering Similarity among Features 64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,

More information

ENHANCEMENT OF METICULOUS IMAGE SEARCH BY MARKOVIAN SEMANTIC INDEXING MODEL

ENHANCEMENT OF METICULOUS IMAGE SEARCH BY MARKOVIAN SEMANTIC INDEXING MODEL ENHANCEMENT OF METICULOUS IMAGE SEARCH BY MARKOVIAN SEMANTIC INDEXING MODEL Shwetha S P 1 and Alok Ranjan 2 Visvesvaraya Technological University, Belgaum, Dept. of Computer Science and Engineering, Canara

More information

Collaborative Ranking between Supervised and Unsupervised Approaches for Keyphrase Extraction

Collaborative Ranking between Supervised and Unsupervised Approaches for Keyphrase Extraction The 2014 Conference on Computational Linguistics and Speech Processing ROCLING 2014, pp. 110-124 The Association for Computational Linguistics and Chinese Language Processing Collaborative Ranking between

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) CSE 6242 / CX 4242 Text Analytics (Text Mining) Concepts, Algorithms, LSI/SVD Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John

More information

Feature selection. LING 572 Fei Xia

Feature selection. LING 572 Fei Xia Feature selection LING 572 Fei Xia 1 Creating attribute-value table x 1 x 2 f 1 f 2 f K y Choose features: Define feature templates Instantiate the feature templates Dimensionality reduction: feature selection

More information

General Instructions. Questions

General Instructions. Questions CS246: Mining Massive Data Sets Winter 2018 Problem Set 2 Due 11:59pm February 8, 2018 Only one late period is allowed for this homework (11:59pm 2/13). General Instructions Submission instructions: These

More information

Image Compression with Singular Value Decomposition & Correlation: a Graphical Analysis

Image Compression with Singular Value Decomposition & Correlation: a Graphical Analysis ISSN -7X Volume, Issue June 7 Image Compression with Singular Value Decomposition & Correlation: a Graphical Analysis Tamojay Deb, Anjan K Ghosh, Anjan Mukherjee Tripura University (A Central University),

More information

Using Singular Value Decomposition to Improve a Genetic Algorithm s Performance

Using Singular Value Decomposition to Improve a Genetic Algorithm s Performance Using Singular Value Decomposition to Improve a Genetic Algorithm s Performance Jacob G. Martin Computer Science University of Georgia Athens, GA 30602 martin@cs.uga.edu Khaled Rasheed Computer Science

More information

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Lecture 06 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Boolean Retrieval vs. Ranked Retrieval Many users (professionals) prefer

More information

International Journal of Advancements in Research & Technology, Volume 2, Issue 8, August ISSN

International Journal of Advancements in Research & Technology, Volume 2, Issue 8, August ISSN International Journal of Advancements in Research & Technology, Volume 2, Issue 8, August-2013 244 Image Compression using Singular Value Decomposition Miss Samruddhi Kahu Ms. Reena Rahate Associate Engineer

More information

Self-organization of very large document collections

Self-organization of very large document collections Chapter 10 Self-organization of very large document collections Teuvo Kohonen, Samuel Kaski, Krista Lagus, Jarkko Salojärvi, Jukka Honkela, Vesa Paatero, Antti Saarela Text mining systems are developed

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval Mohsen Kamyar چهارمین کارگاه ساالنه آزمایشگاه فناوری و وب بهمن ماه 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation

More information

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval 1 Naïve Implementation Convert all documents in collection D to tf-idf weighted vectors, d j, for keyword vocabulary V. Convert

More information

Text Modeling with the Trace Norm

Text Modeling with the Trace Norm Text Modeling with the Trace Norm Jason D. M. Rennie jrennie@gmail.com April 14, 2006 1 Introduction We have two goals: (1) to find a low-dimensional representation of text that allows generalization to

More information

Boolean Model. Hongning Wang

Boolean Model. Hongning Wang Boolean Model Hongning Wang CS@UVa Abstraction of search engine architecture Indexed corpus Crawler Ranking procedure Doc Analyzer Doc Representation Query Rep Feedback (Query) Evaluation User Indexer

More information

Content-based Dimensionality Reduction for Recommender Systems

Content-based Dimensionality Reduction for Recommender Systems Content-based Dimensionality Reduction for Recommender Systems Panagiotis Symeonidis Aristotle University, Department of Informatics, Thessaloniki 54124, Greece symeon@csd.auth.gr Abstract. Recommender

More information

Feature Selection Using Modified-MCA Based Scoring Metric for Classification

Feature Selection Using Modified-MCA Based Scoring Metric for Classification 2011 International Conference on Information Communication and Management IPCSIT vol.16 (2011) (2011) IACSIT Press, Singapore Feature Selection Using Modified-MCA Based Scoring Metric for Classification

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue:

More information

International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Volume 1, Issue 2, July 2014.

International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Volume 1, Issue 2, July 2014. A B S T R A C T International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Information Retrieval Models and Searching Methodologies: Survey Balwinder Saini*,Vikram Singh,Satish

More information

A Metadatabase System for Semantic Image Search by a Mathematical Model of Meaning

A Metadatabase System for Semantic Image Search by a Mathematical Model of Meaning A Metadatabase System for Semantic Image Search by a Mathematical Model of Meaning Yasushi Kiyoki, Takashi Kitagawa and Takanari Hayama Institute of Information Sciences and Electronics University of Tsukuba

More information

Automatic Summarization

Automatic Summarization Automatic Summarization CS 769 Guest Lecture Andrew B. Goldberg goldberg@cs.wisc.edu Department of Computer Sciences University of Wisconsin, Madison February 22, 2008 Andrew B. Goldberg (CS Dept) Summarization

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information

June 15, Abstract. 2. Methodology and Considerations. 1. Introduction

June 15, Abstract. 2. Methodology and Considerations. 1. Introduction Organizing Internet Bookmarks using Latent Semantic Analysis and Intelligent Icons Note: This file is a homework produced by two students for UCR CS235, Spring 06. In order to fully appreacate it, it may

More information

Document Clustering in Reduced Dimension Vector Space

Document Clustering in Reduced Dimension Vector Space Document Clustering in Reduced Dimension Vector Space Kristina Lerman USC Information Sciences Institute 4676 Admiralty Way Marina del Rey, CA 90292 Email: lerman@isi.edu Abstract Document clustering is

More information

Essential Dimensions of Latent Semantic Indexing (LSI)

Essential Dimensions of Latent Semantic Indexing (LSI) Essential Dimensions of Latent Semantic Indexing (LSI) April Kontostathis Department of Mathematics and Computer Science Ursinus College Collegeville, PA 19426 Email: akontostathis@ursinus.edu Abstract

More information

Behavioral Data Mining. Lecture 18 Clustering

Behavioral Data Mining. Lecture 18 Clustering Behavioral Data Mining Lecture 18 Clustering Outline Why? Cluster quality K-means Spectral clustering Generative Models Rationale Given a set {X i } for i = 1,,n, a clustering is a partition of the X i

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

News-Oriented Keyword Indexing with Maximum Entropy Principle.

News-Oriented Keyword Indexing with Maximum Entropy Principle. News-Oriented Keyword Indexing with Maximum Entropy Principle. Li Sujian' Wang Houfeng' Yu Shiwen' Xin Chengsheng2 'Institute of Computational Linguistics, Peking University, 100871, Beijing, China Ilisujian,

More information

Refining Search Queries from Examples Using Boolean Expressions and Latent Semantic Analysis

Refining Search Queries from Examples Using Boolean Expressions and Latent Semantic Analysis Refining Search Queries from Examples Using Boolean Expressions and Latent Semantic Analysis David Johnson, Vishv Malhotra, Peter Vamplew and Sunanda Patro School of Computing, University of Tasmania Private

More information

CorePhrase: Keyphrase Extraction for Document Clustering

CorePhrase: Keyphrase Extraction for Document Clustering CorePhrase: Keyphrase Extraction for Document Clustering Khaled M. Hammouda 1, Diego N. Matute 2, and Mohamed S. Kamel 3 1 Department of Systems Design Engineering 2 School of Computer Science 3 Department

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Text Analytics (Text Mining) Concepts, Algorithms, LSI/SVD Duen Horng (Polo) Chau Assistant Professor Associate Director, MS

More information

A Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS)

A Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS) A Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS) Eman Abdu eha90@aol.com Graduate Center The City University of New York Douglas Salane dsalane@jjay.cuny.edu Center

More information

A Modified Hierarchical Clustering Algorithm for Document Clustering

A Modified Hierarchical Clustering Algorithm for Document Clustering A Modified Hierarchical Algorithm for Document Merin Paul, P Thangam Abstract is the division of data into groups called as clusters. Document clustering is done to analyse the large number of documents

More information

Module 9 : Numerical Relaying II : DSP Perspective

Module 9 : Numerical Relaying II : DSP Perspective Module 9 : Numerical Relaying II : DSP Perspective Lecture 36 : Fast Fourier Transform Objectives In this lecture, We will introduce Fast Fourier Transform (FFT). We will show equivalence between FFT and

More information

Retrieval of Highly Related Documents Containing Gene-Disease Association

Retrieval of Highly Related Documents Containing Gene-Disease Association Retrieval of Highly Related Documents Containing Gene-Disease Association K. Santhosh kumar 1, P. Sudhakar 2 Department of Computer Science & Engineering Annamalai University Annamalai Nagar, India. santhosh09539@gmail.com,

More information

Multimodal Information Spaces for Content-based Image Retrieval

Multimodal Information Spaces for Content-based Image Retrieval Research Proposal Multimodal Information Spaces for Content-based Image Retrieval Abstract Currently, image retrieval by content is a research problem of great interest in academia and the industry, due

More information

Image Contrast Enhancement in Wavelet Domain

Image Contrast Enhancement in Wavelet Domain Advances in Computational Sciences and Technology ISSN 0973-6107 Volume 10, Number 6 (2017) pp. 1915-1922 Research India Publications http://www.ripublication.com Image Contrast Enhancement in Wavelet

More information

A Patent Retrieval Method Using a Hierarchy of Clusters at TUT

A Patent Retrieval Method Using a Hierarchy of Clusters at TUT A Patent Retrieval Method Using a Hierarchy of Clusters at TUT Hironori Doi Yohei Seki Masaki Aono Toyohashi University of Technology 1-1 Hibarigaoka, Tenpaku-cho, Toyohashi-shi, Aichi 441-8580, Japan

More information

CS 6320 Natural Language Processing

CS 6320 Natural Language Processing CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic

More information

Detecting Burnscar from Hyperspectral Imagery via Sparse Representation with Low-Rank Interference

Detecting Burnscar from Hyperspectral Imagery via Sparse Representation with Low-Rank Interference Detecting Burnscar from Hyperspectral Imagery via Sparse Representation with Low-Rank Interference Minh Dao 1, Xiang Xiang 1, Bulent Ayhan 2, Chiman Kwan 2, Trac D. Tran 1 Johns Hopkins Univeristy, 3400

More information

Lecture Video Indexing and Retrieval Using Topic Keywords

Lecture Video Indexing and Retrieval Using Topic Keywords Lecture Video Indexing and Retrieval Using Topic Keywords B. J. Sandesh, Saurabha Jirgi, S. Vidya, Prakash Eljer, Gowri Srinivasa International Science Index, Computer and Information Engineering waset.org/publication/10007915

More information

Semantic Search in s

Semantic Search in  s Semantic Search in Emails Navneet Kapur, Mustafa Safdari, Rahul Sharma December 10, 2010 Abstract Web search technology is abound with techniques to tap into the semantics of information. For email search,

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 3 Modeling Part I: Classic Models Introduction to IR Models Basic Concepts The Boolean Model Term Weighting The Vector Model Probabilistic Model Chap 03: Modeling,

More information

Tag-based Social Interest Discovery

Tag-based Social Interest Discovery Tag-based Social Interest Discovery Xin Li / Lei Guo / Yihong (Eric) Zhao Yahoo!Inc 2008 Presented by: Tuan Anh Le (aletuan@vub.ac.be) 1 Outline Introduction Data set collection & Pre-processing Architecture

More information

Concept-Based Document Similarity Based on Suffix Tree Document

Concept-Based Document Similarity Based on Suffix Tree Document Concept-Based Document Similarity Based on Suffix Tree Document *P.Perumal Sri Ramakrishna Engineering College Associate Professor Department of CSE, Coimbatore perumalsrec@gmail.com R. Nedunchezhian Sri

More information

KeaKAT An Online Automatic Keyphrase Assignment Tool

KeaKAT An Online Automatic Keyphrase Assignment Tool 2012 10th International Conference on Frontiers of Information Technology KeaKAT An Online Automatic Keyphrase Assignment Tool Rabia Irfan, Sharifullah Khan, Irfan Ali Khan, Muhammad Asif Ali School of

More information

FEATURE EXTRACTION TECHNIQUES FOR IMAGE RETRIEVAL USING HAAR AND GLCM

FEATURE EXTRACTION TECHNIQUES FOR IMAGE RETRIEVAL USING HAAR AND GLCM FEATURE EXTRACTION TECHNIQUES FOR IMAGE RETRIEVAL USING HAAR AND GLCM Neha 1, Tanvi Jain 2 1,2 Senior Research Fellow (SRF), SAM-C, Defence R & D Organization, (India) ABSTRACT Content Based Image Retrieval

More information

modern database systems lecture 4 : information retrieval

modern database systems lecture 4 : information retrieval modern database systems lecture 4 : information retrieval Aristides Gionis Michael Mathioudakis spring 2016 in perspective structured data relational data RDBMS MySQL semi-structured data data-graph representation

More information

CSC 411: Lecture 14: Principal Components Analysis & Autoencoders

CSC 411: Lecture 14: Principal Components Analysis & Autoencoders CSC 411: Lecture 14: Principal Components Analysis & Autoencoders Raquel Urtasun & Rich Zemel University of Toronto Nov 4, 2015 Urtasun & Zemel (UofT) CSC 411: 14-PCA & Autoencoders Nov 4, 2015 1 / 18

More information

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.25-30 Enhancing Clustering Results In Hierarchical Approach

More information

WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE

WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE Ms.S.Muthukakshmi 1, R. Surya 2, M. Umira Taj 3 Assistant Professor, Department of Information Technology, Sri Krishna College of Technology, Kovaipudur,

More information

Favorites-Based Search Result Ordering

Favorites-Based Search Result Ordering Favorites-Based Search Result Ordering Ben Flamm and Georey Schiebinger CS 229 Fall 2009 1 Introduction Search engine rankings can often benet from knowledge of users' interests. The query jaguar, for

More information

CSC 411: Lecture 14: Principal Components Analysis & Autoencoders

CSC 411: Lecture 14: Principal Components Analysis & Autoencoders CSC 411: Lecture 14: Principal Components Analysis & Autoencoders Richard Zemel, Raquel Urtasun and Sanja Fidler University of Toronto Zemel, Urtasun, Fidler (UofT) CSC 411: 14-PCA & Autoencoders 1 / 18

More information

Sprinkled Latent Semantic Indexing for Text Classification with Background Knowledge

Sprinkled Latent Semantic Indexing for Text Classification with Background Knowledge Sprinkled Latent Semantic Indexing for Text Classification with Background Knowledge Haiqin Yang and Irwin King Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin,

More information

Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

Algebraic Techniques for Analysis of Large Discrete-Valued Datasets Algebraic Techniques for Analysis of Large Discrete-Valued Datasets Mehmet Koyutürk 1,AnanthGrama 1, and Naren Ramakrishnan 2 1 Dept. of Computer Sciences, Purdue University W. Lafayette, IN, 47907, USA

More information

Ranking models in Information Retrieval: A Survey

Ranking models in Information Retrieval: A Survey Ranking models in Information Retrieval: A Survey R.Suganya Devi Research Scholar Department of Computer Science and Engineering College of Engineering, Guindy, Chennai, Tamilnadu, India Dr D Manjula Professor

More information

Compression, Clustering and Pattern Discovery in Very High Dimensional Discrete-Attribute Datasets

Compression, Clustering and Pattern Discovery in Very High Dimensional Discrete-Attribute Datasets IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 1 Compression, Clustering and Pattern Discovery in Very High Dimensional Discrete-Attribute Datasets Mehmet Koyutürk, Ananth Grama, and Naren Ramakrishnan

More information

New user profile learning for extremely sparse data sets

New user profile learning for extremely sparse data sets New user profile learning for extremely sparse data sets Tomasz Hoffmann, Tadeusz Janasiewicz, and Andrzej Szwabe Institute of Control and Information Engineering, Poznan University of Technology, pl.

More information

Latent Semantic Analysis and Fiedler Embeddings

Latent Semantic Analysis and Fiedler Embeddings Latent Semantic Analysis and Fiedler Embeddings Bruce Hendricson Abstract Latent semantic analysis (LSA) is a method for information retrieval and processing which is based upon the singular value decomposition.

More information

Graph drawing in spectral layout

Graph drawing in spectral layout Graph drawing in spectral layout Maureen Gallagher Colleen Tygh John Urschel Ludmil Zikatanov Beginning: July 8, 203; Today is: October 2, 203 Introduction Our research focuses on the use of spectral graph

More information