Parallel Rare Term Vector Replacement: Fast and Effective Dimensionality Reduction for Text (Abridged)


Tobias Berka, Marian Vajteršic

Department of Computer Sciences, University of Salzburg, Austria (tberka@cosy.sbg.ac.at, marian@cosy.sbg.ac.at)
Department of Informatics, Mathematical Institute, Slovak Academy of Sciences, Bratislava, Slovakia

Abstract. Dimensionality reduction is an established area in text mining and information retrieval. These methods convert the highly sparse corpus matrices into a dense matrix format while preserving or improving the classification accuracy or retrieval performance. In this paper, we describe a novel approach to dimensionality reduction for text, along with a parallel algorithm suitable for private memory parallel computer systems. According to Zipf's law, the majority of indexing terms occurs only in a small number of documents. Our algorithm replaces rare terms by computing a vector which expresses their semantics in terms of common terms. This process produces a projection matrix, which can be applied to a corpus matrix and to individual document and query vectors. We give an accurate mathematical and algorithmic description of our algorithms and present an experimental evaluation on two benchmark corpora. These experiments indicate that our algorithm can deliver a substantial reduction in the number of features, from 47,236 to 392 features on the Reuters corpus, with a clear improvement in the retrieval performance. We have evaluated our parallel implementation using the Message Passing Interface (MPI) with up to 32 processes on a Nehalem Xeon cluster, computing the projection matrix for the dimensionality reduction for over 800,000 documents in just under 100 seconds.

This is a strongly abridged version of the article Tobias Berka and Marian Vajteršic: Parallel Rare Term Vector Replacement: Fast and Effective Dimensionality Reduction for Text. Journal of Parallel and Distributed Computing, 2012 (to appear).

1 Introduction

Dimensionality reduction techniques reduce the number of components of a data set by representing the original data as accurately as possible with fewer features and/or instances. These techniques are put to work in a broad range of content-based retrieval applications, such as text and multimedia mining, web search, etc. The goal is to produce a more compact representation of the data with only limited loss of information, in order to reduce the storage and runtime requirements. In some applications, the reduction is used to discover and account for latent dependencies between features which are not reflected in the original data [1]. Here, the lower dimensionality not only reduces the computational costs, but also improves the retrieval performance. In the area of text retrieval, low-rank matrix approximations have long been utilized to deal with polysemy and synonymy: words with multiple meanings (e.g. "bank" or "light") and different words with the same meaning (e.g. "drowsy" and "sleepy"). But the problem with information retrieval applications is that these systems operate on tens of thousands of features and millions of documents or more. The key issues that warrant the investigation of new dimensionality reduction techniques are performance and scalability. While traditional methods focus on theoretical properties of the underlying mathematical method, the ultimate measure of success for any dimensionality reduction technique is its impact on the retrieval performance. The development of new methods for dimensionality reduction beyond the traditional algebraic or statistical methods is therefore fair game, as long as the retrieval performance is equal to or better than on the unprocessed raw data.
Rare term vector replacement is a new dimensionality reduction technique for text retrieval [2]. It is based on the fact that, according to Zipf's law, most terms occur only in a very limited number of documents, which makes them interesting candidates for any form of dimensionality reduction or compression.

Deletion of such rare terms is out of the question, because they play an important role in the search process. In our approach, we attempt to replace rare terms with a signature feature vector, which preserves the semantics of the rare term by expressing it in more frequent terms. This vector is then scaled by the frequency of the rare term in the document at hand.

The rest of this paper is structured as follows. We describe the serial approach in Section 3 and present a parallelization in Section 4. We evaluate the performance and scalability in Section 5. Lastly, we summarize our findings in Section 6. But first, we briefly examine the state of the art in dimensionality reduction for text retrieval.

2 Related Work

There are several well-known dimensionality reduction techniques in the fields of numerical linear algebra and statistics. The two foremost are the singular value decomposition (SVD), see e.g. [1], and the principal component analysis (PCA), see e.g. [3]. Non-negative matrix factorizations (NMF), see e.g. [4], are a more recent development motivated by factor analysis, where non-negativity may be necessary to interpret the factors. In information retrieval, latent semantic indexing (LSI), see [5], is the straightforward application of the SVD to the task at hand. The PCA has also been applied in the COV approach [6]. Factor analysis applied to automated indexing has been reported as probabilistic latent semantic analysis (PLSA) in [7]. NMF methods are often used in various text classification tasks [8, 9]. The reduction by projection onto the representative vectors of a feature clustering has in fact been developed specifically for text retrieval applications [10]. The use of kernel methods can lead to a quadratic increase in the number of features and is therefore unsuitable for sparse, high-dimensional text data. There are many different parallelizations and algorithmic approaches for all of the above, and a full review of parallel SVD, PCA and NMF methods is clearly beyond the scope of this paper. However, for the purposes of text retrieval, the thin SVD computed with the Lanczos algorithm is of primary importance [11, 12].

Our own method for reducing the dimensionality of a text index is based on an entirely new approach. It is based on an intuition about text indices and has no counterpart in matrix factorizations or descriptive statistics. We therefore need to examine the mathematical formulation and the impact on the retrieval quality.

3 Rare Term Vector Replacement

We have a set $\mathcal{D}$ of $|\mathcal{D}| = n$ documents and a set $\mathcal{F}$ of $|\mathcal{F}| = m$ features. Every document $d \in \mathcal{D}$ is represented as a feature vector $d \in \mathbb{R}^m$, to which we assign a column index $\mathrm{col}(d) \in \{1, \dots, n\}$. As a convenient notation, we use $d_j$ to denote $\mathrm{col}(d) = j$. Analogously, every feature $f \in \mathcal{F}$ has a row index $\mathrm{row}(f) \in \{1, \dots, m\}$, with the short form $f_i$ for $\mathrm{row}(f) = i$. We can thus form a corpus matrix $C \in \mathbb{R}^{m \times n}$, which contains the documents' feature vectors as column vectors:

$C = [\, d_1 \; d_2 \; \cdots \; d_n \,] \in \mathbb{R}^{m \times n}$. (1)

Since we are interested in the occurrence of features within documents, we define a function $\mathcal{D} : \mathcal{F} \to 2^{\mathcal{D}}$ to determine which documents contain a particular feature $f_i$, and dually a function $\mathcal{F} : \mathcal{D} \to 2^{\mathcal{F}}$ that determines which features occur in a given document, formally

$\mathcal{D}(f_i) := \{ d_j \in \mathcal{D} \mid C_{i,j} \neq 0 \}$, (2)

$\mathcal{F}(d_j) := \{ f_i \in \mathcal{F} \mid C_{i,j} \neq 0 \}$. (3)

We select the rare features through a function $\mathcal{N} : \mathbb{N} \to 2^{\mathcal{F}}$ that determines the set of features occurring in at most $t$ documents,

$\mathcal{N}(t) := \{ f \in \mathcal{F} \mid |\mathcal{D}(f)| \leq t \}$. (4)
After choosing an elimination threshold $t$, which is 1% of all documents in our experiments, we can now define the set of elimination features $E \subseteq \mathcal{F}$ as $E := \mathcal{N}(t)$, which will ultimately lead to a reduced-dimensional feature space consisting of $k$ common terms, where $k := m - |E|$.
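As a concrete illustration of these definitions, the following Python sketch computes the occurrence counts $|\mathcal{D}(f_i)|$, the elimination set $E = \mathcal{N}(t)$ and the reduced dimension $k$ from a sparse corpus matrix. It is our own minimal rendering under the definitions above (function and variable names are ours), not the authors' C/C++ code:

```python
import numpy as np
import scipy.sparse as sp

def elimination_features(C, fraction=0.01):
    """Select rare features: E = N(t) and k = m - |E| for an m x n
    corpus matrix C with features as rows, following equation (4)."""
    m, n = C.shape
    t = int(fraction * n)                   # elimination threshold t: 1% of all documents
    occ = np.diff(sp.csr_matrix(C).indptr)  # occ[i] = |D(f_i)|, non-zeros in row i
    E = np.flatnonzero(occ <= t)            # features occurring in at most t documents
    return E, m - len(E)
```

With fraction = 0.01, this reproduces the elimination threshold used in the experiments below.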

To eliminate an index $i$ of a vector $v$, we define a truncation operator $\tau$ for the elimination of a set of indices using the recursive definition

$\tau_{\{i\}}(v) := (v_1 \; \cdots \; v_{i-1} \; v_{i+1} \; \cdots \; v_m)^T$, (5)

$\tau_{A \cup \{b\}}(v) := \tau_{\{b\}}(\tau_A(v))$, $\quad \tau_{\emptyset}(v) := v$. (6)

We use a linear combination of common features to replace rare features $f_i \in E$. Formally, we initially determine the set of documents that contain the feature, $\mathcal{D}(f_i)$. We truncate all document vectors $d_j \in \mathcal{D}(f_i)$, eliminating all rare features by taking $\tau_E(d_j)$, and scale them by the quantification $C_{i,j}$ of feature $f_i$ in document $d_j$. We compute the sum of these vectors and apply a scaling factor $\lambda$ to normalize the length of the resulting replacement vector. Formally, we define the function $\rho_E$ to compute the replacement vector:

$\rho_E(f_i) := \frac{1}{\lambda_E(f_i)} \sum_{d_j \in \mathcal{D}(f_i)} C_{i,j} \, \tau_E(C_{\cdot,j})$, where $\lambda_E(f_i) := \sum_{d_j \in \mathcal{D}(f_i)} C_{i,j}$. (7)

If we use the notation $e_1, \dots, e_m \in \mathbb{R}^m$ to denote the canonical base vectors, we can define replacement vectors $r_E$, which either preserve features we do not wish to eliminate, or perform the vector replacement on features $f_i \in E$:

$r_E(f_i) := \begin{cases} \rho_E(f_i) & \text{if } f_i \in E \\ \tau_E(e_i) & \text{if } f_i \notin E. \end{cases}$ (8)

We now assemble these vectors column-wise to form a replacement matrix $R_E$ for the elimination features $E$, formally

$R_E := [\, r_E(f_1) \; r_E(f_2) \; \cdots \; r_E(f_m) \,] \in \mathbb{R}^{k \times m}$. (9)

For any vector $d_j \in \mathbb{R}^m$, we can obtain an improved, dimensionality-reduced vector $d'_j \in \mathbb{R}^k$ by taking $d'_j := R_E \, d_j$. To reduce an entire collection of documents, we can map our original corpus matrix $C$ into the reduced-dimensional space by taking $C' := R_E \, C$. But we also have to consider the on-line query processing. For any incoming query $q \in \mathbb{R}^m$, we compute $q' := R_E \, q$ and evaluate all similarities on the augmented corpus $C'$ using the cosine similarity:

$\forall j \in \{1, \dots, n\}: \; \mathrm{sim}(q, d_j) := \cos(q', d'_j)$. (10)

In Algorithm 1, we give a definition of this process in pseudo-code. The algorithmic complexity of this algorithm depends heavily on the distribution of non-zero components in the corpus matrix, but in general, the upper bound for the complexity is

$O(kmn)$. (11)

In this serial form, our method does not necessarily provide an improvement in the algorithmic complexity, but since the replacement vectors can be computed independently of each other, we have a great potential for parallel scalability.

Since the vector replacement method is brand new, we have used the Reuters Corpus Volume I in its corrected second version (RCV1-v2) [13] to conduct an evaluation of the retrieval performance. All 23,149 training documents were used as sample queries, and we used the available class labels of the topic categorization to discriminate between relevant and irrelevant search results. We used the pre-vectorized TF-IDF version with 47,236 features in a corpus matrix which is 0.16% dense, i.e. less than one percent of all components are non-zero. Replacing all features which occur in less than 1% of all documents produced 1,430 features and a corpus matrix which is 99.87% dense. In addition, we applied a subsequent spectral rank reduction creating 700 features as a third representation for our evaluation. Figure 1 illustrates our results, which provide a clear indication that our approach succeeds in reducing the dimensionality and improving the retrieval performance. In light of these results, we may accept the replacement vector method as a suitable means of dimensionality reduction for information retrieval.
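To make the on-line use of $R_E$ in equation (10) concrete, here is a minimal NumPy sketch of query projection and cosine ranking; the reduced corpus $C' := R_E C$ would of course be computed once, off-line. The function is our own hypothetical illustration, not part of the paper:

```python
import numpy as np

def rank_documents(R, C_reduced, q):
    """Rank documents against a query per equation (10): project the
    query with R_E and compare in the reduced space by cosine."""
    q_red = R @ q                                     # q' := R_E q
    norms = np.linalg.norm(C_reduced, axis=0) * np.linalg.norm(q_red)
    sims = (C_reduced.T @ q_red) / np.where(norms > 0.0, norms, 1.0)
    return np.argsort(-sims)                          # best matches first

# C_reduced = R @ C is the pre-computed reduced corpus C' := R_E C.
```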
We can therefore proceed to transform the mathematical formulation into one that is more suitable for parallel processing and derive a parallel algorithm.

Input: The corpus matrix C ∈ R^{m×n}, the set of documents D, the set of features F and the maximum occurrence count for rare features t ∈ N.
Data: The occurrence count vector N ∈ N^m for all features, the elimination features E ⊆ F, the elimination features present in a document G ⊆ F, the permutation π : N → N, the feature f_i ∈ F, the document d_j ∈ D and an integer k ∈ N.
Output: The replacement matrix R ∈ R^{k×m}.

for d_j ∈ D do
    for f_i ∈ F(d_j) do
        N(i) += 1;
k := 1;
for f_i ∈ F do
    if N(i) ≤ t then E := E ∪ {f_i};
    else π(i) := k, k += 1;
k -= 1;
for d_j ∈ D do
    G := E ∩ F(d_j);
    for f_i ∈ G do
        R(1:k, i) += C(i, j) · τ_E(C(1:m, j));
        l(i) += C(i, j);
for f_i ∈ F do
    if f_i ∈ E then R(1:k, i) := (l(i))⁻¹ · R(1:k, i);
    else R(1:k, i) := e_{π(i)};

Algorithm 1: Serial Replacement Vector Construction

Figure 1: Evaluation of the precision-at-k retrieval performance (mean precision over the hit list rank) on the Reuters Corpus Volume I Version 2 using the topic categorization, comparing sparse TF-IDF (47,236 features), dense rare vector replacement (1,430 features) and rank-reduced vector replacement (700 features).
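Algorithm 1 translates almost line for line into the following Python sketch. It reflects our reading of the pseudo-code, with SciPy's compressed sparse column format standing in for the (unspecified) sparse storage of the original implementation, so treat it as illustrative rather than as the authors' implementation:

```python
import numpy as np
import scipy.sparse as sp

def build_replacement(C, t):
    """Serial replacement vector construction following Algorithm 1."""
    m, n = C.shape
    C = sp.csc_matrix(C)
    N = np.diff(C.tocsr().indptr)           # pass 1: occurrence counts N(i)
    rare = N <= t                           # pass 2: E and the permutation pi
    pi = np.cumsum(~rare) - 1               # pi(i): target row of common feature i
    k = int((~rare).sum())
    R = np.zeros((k, m))
    l = np.zeros(m)
    for j in range(n):                      # pass 3: accumulate per document
        col = C.getcol(j)
        rows, vals = col.indices, col.data  # F(d_j) and the entries C(i, j)
        is_common = ~rare[rows]
        tau = np.zeros(k)                   # tau_E(C(1:m, j)), truncated column
        tau[pi[rows[is_common]]] = vals[is_common]
        for i, c in zip(rows[~is_common], vals[~is_common]):  # G = E intersect F(d_j)
            R[:, i] += c * tau
            l[i] += c
    for i in range(m):                      # pass 4: normalize rare columns,
        if rare[i]:                         # unit vectors for common ones
            if l[i] != 0.0:
                R[:, i] /= l[i]
        else:
            R[pi[i], i] = 1.0
    return R
```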

4 Parallel Construction of Replacement Vectors

The obvious approach to the parallel construction of replacement vectors is a task-parallel strategy. First, we construct a partitioning of the elimination features $E$ and derive a matrix partitioning for the result data in $R_E$. The parallel computation follows naturally from these partitionings. We therefore partition the set of elimination features $E$ into $Q$ pairwise disjoint sets $E[1], \dots, E[Q]$. Correspondingly, we partition the set of features $\mathcal{F}$ into partitions $\mathcal{F}[1], \dots, \mathcal{F}[Q]$, so that each feature partition $\mathcal{F}[q]$ contains both the elimination features $E[q]$ and the common features surrounding them, formally $\forall q \in \{1, \dots, Q\}: E[q] \subseteq \mathcal{F}[q]$. This gives us a column-partitioning of the projection matrix $R_E$ into $R_E[1], \dots, R_E[Q]$ according to $\mathcal{F}[1], \dots, \mathcal{F}[Q]$. Now we can use $Q$ processors to compute these column partitions in parallel. The computation consists of three phases:

1. Determine the set of elimination features $E$ and partition $E$ into $E[1], \dots, E[Q]$ and $\mathcal{F}$ into $\mathcal{F}[1], \dots, \mathcal{F}[Q]$.
2. Compute the local column partitions $R_E[1], \dots, R_E[Q]$ of the replacement matrix $R_E$.
3. Gather the partitions to form the replacement matrix $R_E = [\, R_E[1] \; \cdots \; R_E[Q] \,]$.

Since every block $R_E[q]$ must be computed from the entire document collection, we have to replicate the corpus matrix $C$ to all participating hosts $q \in \{1, \dots, Q\}$. This is why we refer to this approach as a data-replicated, task-parallel strategy.

But if the replacement vectors must be computed for an extremely large document collection, we can parallelize the computation of the replacement vectors for a data-partitioned scenario. Mathematically, we partition our corpus into $P$ pairwise disjoint sets of documents $\mathcal{D}[1], \dots, \mathcal{D}[P]$, where $|\mathcal{D}[p]| = n_p$, and the corpus matrix into blocks $C[p] \in \mathbb{R}^{m \times n_p}$, such that

$C = [\, C[1] \; \cdots \; C[P] \,]$. (12)

To determine the set of elimination features $E$, we compute the occurrence count for all features as a vector $N$ across all document partitions,

$N := \sum_{p=1}^{P} N[p] \in \mathbb{Z}^m$, where $N[p](i) := |\{ j \in \{1, \dots, n_p\} \mid C[p]_{i,j} \neq 0 \}|$, (13)

and compute $E = \{ i \in \{1, \dots, m\} \mid N(i) \leq t \}$. We restructure the formula for the replacement vectors $\rho_E$ according to our partitioning,

$\rho_E(f_i) = \frac{1}{\lambda_E(f_i)} \sum_{p=1}^{P} \rho_E[p](f_i)$, where $\rho_E[p](f_i) := \sum_{j=1}^{n_p} C[p]_{i,j} \, \tau_E(C[p]_{\cdot,j})$, (14)

with the scaling factor $\lambda_E$ rewritten as

$\lambda_E(f_i) := \sum_{p=1}^{P} \lambda_E[p](f_i)$, where $\lambda_E[p](f_i) := \sum_{j=1}^{n_p} C[p]_{i,j}$. (15)

We compute this data-partitioned strategy in three steps:

1. Compute the local occurrence counts $N[1], \dots, N[P]$ and form the global occurrence count $N$ and elimination features $E$.
2. Compute the local contributions $\rho_E[1], \dots, \rho_E[P]$ and $\lambda_E[1], \dots, \lambda_E[P]$.
3. Compute the global replacement vectors $\rho_E$ and scalars $\lambda_E$, and normalize.
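The decomposition in equations (13)–(15) above can be sanity-checked directly: the local counts, numerators and scaling factors sum to the serial quantities. A small single-process NumPy sketch with arbitrary toy data (not the paper's corpus):

```python
import numpy as np

def local_contribution(Cp, i, common):
    """rho_E[p](f_i) (unnormalized) and lambda_E[p](f_i) for one corpus
    block C[p], per equations (14) and (15). Zero entries contribute
    nothing, so we may sum over all columns j instead of only D(f_i)."""
    return Cp[common, :] @ Cp[i, :], Cp[i, :].sum()

rng = np.random.default_rng(1)
C = rng.random((8, 12))                     # toy m x n corpus matrix
blocks = np.array_split(C, 3, axis=1)       # C = [ C[1] C[2] C[3] ], i.e. P = 3
i = 2                                       # treat f_2 as a rare feature
common = np.array([0, 1, 3, 4, 5, 6, 7])    # indices kept by tau_E

parts = [local_contribution(Cp, i, common) for Cp in blocks]
rho = sum(r for r, _ in parts) / sum(l for _, l in parts)   # equation (14)

r_serial, l_serial = local_contribution(C, i, common)
assert np.allclose(rho, r_serial / l_serial)    # partitioned == serial
```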

Hosts: P × Q processors in a 2D mesh with a switched network.
Input: Hosts (1:P, 1) hold the local corpus matrix C[p] ∈ R^{m×n_p}, the local document partition D[p] ⊆ D, the global set of features F and the maximum occurrence count for rare features t ∈ N.
Data: The local occurrence count vector N[p] ∈ N^m, the global occurrence count vector N ∈ N^m, the global set of elimination features E ⊆ F, the local set of elimination features E[q] ⊆ F, the local feature partition F[q] ⊆ F, the set of elimination features present in a document G ⊆ F, the index permutation π : N → N, the feature variable f_i ∈ F, the document variable d_j ∈ D, an integer variable k ∈ N, the document partition index p ∈ {1, ..., P} and the replication index q ∈ {1, ..., Q}.
Output: Host (1, 1) returns the replacement matrix R ∈ R^{k×m}.

parallel for p ∈ {1, ..., P} do
    parallel for q ∈ {1, ..., Q} do
        if q = 1 then
            for d_j ∈ D[p] do
                for f_i ∈ F_p(d_j) do
                    N[p](i) += 1;
            all-to-all reduce (1:P, 1) → (1:P, 1): N := Σ_{p=1}^{P} N[p];
            k := 1;
            for f_i ∈ F do
                if N(i) ≤ t then E := E ∪ {f_i};
                else π(i) := k, k += 1;
            k -= 1;
            broadcast (p, 1) → (p, 1:Q): D[p], C[p], E, π, k;
        partition E into E[1:Q] and F into F[1:Q];
        R[p, q] := 0, l[p, q] := 0;
        for d_j ∈ D[p] do
            G := E[q] ∩ F_p(d_j);
            for f_i ∈ G do
                R[p, q](1:k, i) += C[p](i, j) · τ_E(C[p](1:m, j));
                l[p, q](i) += C[p](i, j);
        reduce (1:P, q) → (1, q): R[q] := Σ_{p=1}^{P} R[p, q], l[q] := Σ_{p=1}^{P} l[p, q];
        if p = 1 then
            for f_i ∈ F[q] do
                if f_i ∈ E then R[q](1:k, i) := (l[q](i))⁻¹ · R[q](1:k, i);
                else R[q](1:k, i) := e_{π(i)};
            gather (1, 1:Q) → (1, 1): R := [R[1] ⋯ R[Q]];
            if q = 1 then return R;

Algorithm 2: Parallel Replacement Vector Construction
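For illustration, the communication skeleton of Algorithm 2 can be sketched with mpi4py. To keep it readable, the sketch covers only the data-parallel dimension (a P × 1 mesh, i.e. Q = 1) and generates random stand-in blocks C[p]; the authors' implementation is in C/C++ with hand-written collectives, so this is a structural sketch only:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
P, p = comm.Get_size(), comm.Get_rank()

m, t = 100, 2                                # features, elimination threshold
rng = np.random.default_rng(p)               # random stand-in for the block C[p]
Cp = rng.random((m, 50)) * (rng.random((m, 50)) < 0.05)

# Phase 1: local occurrence counts N[p], then N := sum_p N[p] on all hosts.
N_local = (Cp != 0).sum(axis=1).astype(np.int64)
N = np.empty_like(N_local)
comm.Allreduce(N_local, N, op=MPI.SUM)
rare = N <= t
common = np.flatnonzero(~rare)
k = len(common)

# Phase 2: local contributions rho_E[p] and lambda_E[p].
R_local = np.zeros((k, m))
l_local = np.zeros(m)
for i in np.flatnonzero(rare):
    R_local[:, i] = Cp[common, :] @ Cp[i, :]    # unnormalized rho_E[p](f_i)
    l_local[i] = Cp[i, :].sum()                 # lambda_E[p](f_i)

# Phase 3: reduce onto host 0, which normalizes and fills the unit vectors.
R = np.zeros_like(R_local) if p == 0 else None
l = np.zeros_like(l_local) if p == 0 else None
comm.Reduce(R_local, R, op=MPI.SUM, root=0)
comm.Reduce(l_local, l, op=MPI.SUM, root=0)
if p == 0:
    for i in np.flatnonzero(rare):
        if l[i] != 0.0:
            R[:, i] /= l[i]
    R[np.arange(k), common] = 1.0               # columns of the common features
    print("replacement matrix:", R.shape)
```

Run with, e.g., mpiexec -n 4 python sketch.py; each rank then plays the role of one host (p, 1).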

Once we have computed the replacement matrix $R$, we can use the same processes to compute the projection into the reduced-dimensional space as a parallel matrix product, using a suitable parallel matrix multiplication algorithm.

To achieve scalability in both the number of documents and the number of features, we can combine these two strategies. Algorithm 2 gives the pseudo-code for a combined task- and data-parallel replacement matrix computation on a 2D mesh of $P \times Q$ processors. To discuss the theoretical computational complexity without communication costs, we split the algorithm into three phases:

1. We determine the number of occurrences for every feature, the set of elimination features $E$ and the index permutation $\pi$ on $P$ processes in a complexity of $O\left(\frac{mn}{P} \log P\right)$.
2. We can use all $P \cdot Q$ processes to compute the local parts of the replacement matrix in $O\left(\frac{kmn}{PQ}\right)$ and accumulate their sum in $O\left(\frac{km}{Q} \log P\right)$.
3. Lastly, we use $Q$ processes to normalize the replacement vectors in $O\left(\frac{km}{Q}\right)$ steps and send all parts to the first host.

The overall complexity can thus be written as

$O\left(\frac{mnQ + kmn + kmP \log P + kmP}{PQ}\right)$. (16)

As we can see, the hybrid task and data parallelization incurs some overhead. The first phase only operates across the document partitioning on $P$ processes, and the third phase can only use the task partitioning with its $Q$ processes. Only the second phase can make effective use of the $P \times Q$ process mesh. It is therefore imperative that we evaluate the actual performance to see whether the data partitioning, the task partitioning and/or the hybrid parallelization deliver a convincing performance improvement in practice.

5 Evaluation

Before we turn to an empirical evaluation, let us briefly consider the theoretical speed-up in terms of algorithmic complexity without communication overheads. Combining the complexity of the parallel algorithm in Equation (16) with the serial complexity given in Equation (11), we get a theoretical speed-up $S$ of

$S(k, n, P, Q) = O\left(\frac{knPQ}{kn + kP + kP \log P + nQ}\right)$, (17)

with a theoretical parallel efficiency $E$ of

$E(k, n, P, Q) = O\left(\frac{kn}{kn + kP + kP \log P + nQ}\right)$. (18)

However, the speed-up and efficiency observed in practice differ from this theoretical result because of the highly sparse nature of our computation. To give an estimate of the parallel performance under real-world conditions, we have implemented our algorithm in C/C++ using MPI as communication middleware. We have conducted execution time measurements on a cluster consisting of four nodes, each equipped with dual Intel Xeon E5520 CPUs clocked at 2.27 GHz with 4 cores and 48 GiB of RAM, connected by an InfiniBand network and running 64-bit CentOS 5.2 with an el5 Linux kernel, OpenMPI and the GNU Compiler Collection. We executed 8 MPI processes on every node, one per physically available core.

In our experiments, the sparsity had two distinct effects. Firstly, the sparsity of the corpus matrix $C$ leads to an unequal distribution of the actual workload. This problem is difficult to alleviate, because the short execution time of our algorithm makes it very difficult to apply more advanced partitionings and load balancing techniques.

Secondly, the sparse data structures holding the data make the communication code less efficient. The data must always be packed and unpacked to and from sequential memory segments of varying length, which made message passing with MPI rather inefficient. Furthermore, it forced us to implement all collective operations by hand, using point-to-point communications to realize the parallel summation of sparse vectors and matrices with a flat-tree topology; a minimal sketch of such a reduction is given below.

To measure the performance on a realistic data set, we have used the training and test set of the Reuters Corpus Volume I. The test set consists of three parts, which have been accumulated to give us three evaluation sets: the training set and the first part of the test set (421,816 documents), the training set and the first two parts of the test set (621,392 documents), and the training set together with the entire test set (804,414 documents). Figure 2 depicts our measurements for the response time, speed-up and efficiency. In Table 1, we present the most efficient mesh configurations for a range of problem sizes and numbers of processes, i.e. the number of document partitions $P$ and task partitions $Q$ that delivered the highest parallel efficiency $E$.

Table 1: Experimentally determined optimal mesh sizes: for various numbers of documents n and numbers of processors P · Q, we give the most efficient number of document partitions P and elimination feature partitions Q, as well as the resulting execution time t (in seconds), speed-up S and efficiency E.

Since $Q > P$ across the entire range of problem sizes and process counts, we have a clear indication that the task parallelization is the preferred strategy. The measured efficiency $E$ indicates that the complications caused by the sparsity of the data are indeed manageable. But we also see that the data partitioning becomes increasingly important as the number of documents grows, because the value of $P$ increases for larger data sets and processor counts.
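The hand-written flat-tree reduction mentioned above can be illustrated as follows; this is our own minimal mpi4py sketch of summing sparse vectors over point-to-point messages, not the authors' C/C++ code:

```python
from mpi4py import MPI

def flat_tree_sparse_sum(local, comm):
    """Sum sparse vectors, stored as {index: value} dicts, on rank 0.
    Every other rank packs its non-zeros and sends them straight to the
    root (a flat tree), which unpacks and merges them pair by pair."""
    if comm.Get_rank() != 0:
        comm.send(sorted(local.items()), dest=0, tag=7)
        return None
    total = dict(local)
    for src in range(1, comm.Get_size()):
        for i, v in comm.recv(source=src, tag=7):
            total[i] = total.get(i, 0.0) + v
    return total

comm = MPI.COMM_WORLD
part = {comm.Get_rank(): 1.0, 0: 0.5}        # toy sparse vector per rank
print(comm.Get_rank(), flat_tree_sparse_sum(part, comm))
```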

Figure 2: Empirical performance evaluation showing execution time (in seconds), speed-up and efficiency over the number of processes for the Reuters Corpus Volume I Version 2, for the evaluation sets of 421,816, 621,392 and 804,414 documents.

6 Summary & Conclusions

In this paper, we have introduced a new method of dimensionality reduction for text retrieval and presented the serial algorithm along with an evaluation of the quality of the search results on a standard benchmark data set. We mathematically derived a parallelization based on a hybrid task and data partitioning and presented a parallel algorithm for private-memory parallel computer systems. Although the sparse nature of our method presents some difficulties, we analyzed the complexity of both the serial and the parallel algorithm and presented the theoretical speed-up without communication costs. To account for these problems and to give a realistic estimate of the performance of our parallelization, we reported the details of an empirical performance evaluation.

We believe that we have delivered a solid proof-of-concept implementation and evaluation of our parallel algorithm for the dimensionality reduction of text indices in the vector space model. The algorithm delivers a very high quality of search results, and the underlying projection matrix can be computed very rapidly. The performance evaluation was conducted for up to 800,000 documents, and there is a good indication that the algorithm scales well beyond this problem size.

References

[1] Berry, M.W., Drmac, Z., Jessup, E.R.: Matrices, Vector Spaces, and Information Retrieval. SIAM Review 41(2) (1999)
[2] Berka, T., Vajteršic, M.: Dimensionality Reduction for Information Retrieval using Vector Replacement of Rare Terms. In: Proc. TM (2011)
[3] Jolliffe, I.T.: Principal Component Analysis. 2nd edn. Springer (2002)
[4] Paatero, P., Tapper, U.: Positive Matrix Factorization: A Non-negative Factor Model with Optimal Utilization of Error Estimates of Data Values. Environmetrics 5(2) (1994)
[5] Deerwester, S.C., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A.: Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science 41(6) (1990)
[6] Kobayashi, M., Aono, M., Takeuchi, H., Samukawa, H.: Matrix Computations for Information Retrieval and Major and Outlier Cluster Detection. Journal of Computational and Applied Mathematics 149(1) (2002)
[7] Hofmann, T.: Probabilistic Latent Semantic Indexing. In: Proc. SIGIR. ACM, New York, NY, USA (1999)
[8] Janecek, A.G., Gansterer, W.N.: Utilizing Nonnegative Matrix Factorization for Classification Problems. In: Berry, M.W., Kogan, J. (eds.): Survey of Text Mining III: Application and Theory. Wiley (2010)
[9] Langville, A.N., Berry, M.W.: Nonnegative Matrix Factorization for Document Classification. In: Next Generation of Data Mining. Chapman & Hall/CRC (2008)
[10] Dhillon, I.S., Modha, D.S.: Concept Decompositions for Large Sparse Text Data using Clustering. Machine Learning 42(1/2) (2001)
[11] Berry, M.W., Do, T., Ogielski, A.: Massively-Parallel Implementations of Lanczos Algorithms for Computing the SVD of Large Sparse Matrices. In: Proc. PP. SIAM (1993)
[12] Berry, M.W., Mezher, D., Philippe, B., Sameh, A.: Parallel Algorithms for the Singular Value Decomposition. In: Handbook on Parallel Computing and Statistics. Statistics: A Series of Textbooks and Monographs, vol. 184. Chapman & Hall, Boca Raton, USA (2006)
[13] Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research 5 (2004)
