Parallel Rare Term Vector Replacement: Fast and Effective Dimensionality Reduction for Text (Abridged)
|
|
- Amie Newton
- 5 years ago
- Views:
Transcription
1 Parallel Rare Term Vector Replacement: Fast and Effective Dimensionality Reduction for Text (Abridged) Tobias Berka Marian Vajteršic Abstract Dimensionality reduction is an established area in text mining and information retrieval. These methods convert the highly sparse corpus matrices into dense matrix format while preserving or improving the classification accuracy or retrieval performance. In this paper, we describe a novel approach to dimensionality reduction for text, along with a parallel algorithm suitable for private memory parallel computer systems. According to Zipf s law, the majority of indexing terms occurs only in a small number of documents. Our algorithm replaces rare terms by computing a vector which expresses their semantics in terms of common terms. This process produces a projection matrix, which can be applied to a corpus matrix and individual document and query vectors. We give an accurate mathematical and algorithmic description of our algorithms and present an experimental evaluation on two benchmark corpora. These experiments indicate that our algorithm can deliver a substantial reduction in the number of features, from 47,236 to 392 features on the Reuters corpus with a clear improvement in the retrieval performance. We have evaluated our parallel implementation using the message passing interface with up to 32 processes on a Nehalem Xeon cluster, computing the projection matrix for the dimensionality reduction for over 800,000 documents in just under 100 seconds. This is a strongly abridged version of the article Tobias Berka and Marian Vajteršic: Parallel Rare Term Vector Replacement: Fast and Effective Dimensionality Reduction for Text. Journal of Parallel and Distributed Computing, 2012 (to appear). 1 Introduction Dimensionality reduction techniques reduce the number of components of a data set by representing the original data as accurately as possible with fewer features and/or instances. These techniques are put to work in a broad range of content-based retrieval applications, such as text and multimedia mining, web search, etc. The goal is to produce a more compact representation of the data with only limited loss of information in order to reduce the storage and runtime requirements. In some applications the reduction is used to discover and account for latent dependencies between features which are not reflected in the original data [1]. Here, the lower dimensionality not only reduces the computational costs, but also improves the retrieval performance. In the area of text retrieval, low-rank matrix approximations have long been utilized in order to deal with polysemy and synonymy words with multiple meanings (e.g. bank or light) or different words with the same meaning (e.g. drowsy and sleepy). But the problem with information retrieval applications is that these systems operate on tens of thousands of features and millions of documents or more. The key issues that warrant the investigation of new dimensionality reduction techniques are performance and scalability. While traditional methods focus on theoretic properties of the underlying mathematical method, the ultimate measure of success for any dimensionality reduction technique is the impact on the retrieval performance. The development of new methods for dimensionality reduction beyond the traditional, algebraic or statistical methods is therefore fair game, as long as the retrieval performance is equal or better than on the unprocessed raw data. Rare term vector replacement is a new dimensionality reduction technique for text retrieval [2]. It is based on the fact that according to Zipf s law, most terms occur only in a very limited number of documents, which makes them interesting candidates for any form of dimensionality reduction or compression. Deletion Department of Computer Sciences, University of Salzburg, Austria, tberka@cosy.sbg.ac.at, marian@cosy.sbg.ac.at. Department of Informatics, Mathematical Institute, Slovak Academy of Sciences, Bratislava, Slovakia 1
2 of such rare terms is out of the question, because they play an important role in the search process. In our approach, we attempt to replace rare terms with a signature feature vector, which preserves the semantics of the rare term by expressing it in more frequent terms. This vector is then scaled by the frequency of the rare term in this document. The rest of this paper is structured as follows. We describe the serial approach in Section 3 and present a parallelization in Section 4. We evaluate the performance and scalability in Section 5. Lastly, we summarize our findings in Section 6. But first, we are going to briefly examine the state-of-the-art in dimensionality reduction for text retrieval. 2 Related Work There are several well known dimensionality reduction techniques in the fields of numerical linear algebra and statistics. The two foremost are the singular value decomposition (SVD), see e.g. [1], and the principal component analysis (PCA), see e.g. [3]. Non-negative matrix factorizations (NMF), see e.g. [4], are a more recent development that is motivated by factor analysis, where non-negativity may be necessary to interpret the factors. In information retrieval, latent semantic indexing (LSI), see [5], is the straightforward application of the SVD to the task at hand. The PCA has also been applied in the COV approach [6]. Factor analysis based on the SVD applied to automated indexing has been reported as probabilistic latent semantic analysis (PLSA) in [7]. NMF methods are often used in various text classification tasks [8] [9]. The reduction by projection onto the representative vectors of a feature clustering has in fact been developed specifically for text retrieval applications [10]. The use of kernel methods can lead to a square increase in the number of features and is therefore unsuitable for sparse, high-dimensional text data. There are many different parallelizations and algorithmic approaches for all of the above and a full review of parallel SVD, PCA and NMF methods is clearly beyond the scope of this paper. However, for the purposes of text retrieval, the thin SVD computed with the Lanczos algorithm is of primary importance [11] [12]. Our own method for reducing the dimensionality of a text index is based on an entirely new approach. It is based on an intuition about text indices and has no counterpart in matrix factorizations or descriptive statistics. We therefore need to examine the mathematical formulation and the impact on the retrieval quality. 3 Rare Term Vector Replacement We have a set D of D = n documents and a set F of F = m features. Every document d D is represented as a feature vector d R m to which we assign a column index col(d) {1,..., n}. As a convenient notation, we use d j to denote col(d) = j. Analogously, every feature f F has a row index row(f) {1,..., m}, using the short form f i : row(f) = i. We can thus form a corpus matrix C R m n, which contains the documents feature vectors as column vectors C = [ d 1 d 2... d n ] R m n. (1) Since we are interested in the occurrence of features within documents we define a function D : F D to determine which documents contain any particular feature f i, and dually a function F : D F that determines which features occur in a given document, formally D(f i ) := {d j D C i,j 0}, (2) F(d j ) := {f i F C i,j 0}. (3) We select the rare features through a function N : N F that determines the set of features occurring in at most t documents, N (t) := {f F D(f) t}. (4) After choosing an elimination threshold t, which is 1% of all documents for our experiments, we can now define the set of elimination features E F as E := N (t), which will ultimately lead to a reduced-dimensional feature space consisting of k common terms, where k := m E. 2
3 To eliminate an index i of a vector v, we define a truncation operator τ for the elimination of a set of indices using the recursive definition τ {i} (v) := (v 1 v i 1 v i+1 v m ) T, (5) τ {A,b} (v) := τ {b} (τ A (v)), τ (v) := v. (6) We use a linear combination of common features to replace rare features f i E. Formally, we will initially determine the set of documents that contain the feature D(f i ). We will truncate all document vectors d j D(f i ), eliminate all rare features by taking τ E (d i ) and scale them by the quantification C i,j of feature f i in document d j. We compute the sum of these vectors and apply a scaling factor λ to normalize the length of the resulting replacement vector. Formally, we define the function ρ E to compute the replacement vector: 1 ρ E (f i ) := C i,j τ E (C,j ), where λ E (f i ) := C i,j. (7) λ E (f i ) d j D(f i) d j D(f i) If we use the notation e 1,..., e m R m to denote the canonical base vectors, we can define replacement vectors r E, which either preserve features we do not wish to eliminate, or perform the vector replacement on features f E, { ρ E (f i )... f i E r E (f i ) := (8) τ E (e i )... f i E. We now assemble these vectors column-wise to form a replacement matrix R E for the elimination features E, formally R E := [ r E (f 1 ) r E (f 2 ) r E (f m ) ] R k m. (9) For any vector d j R m, we can obtain an improved, dimensionality-reduced vector d j Rk by taking d j := R E d j. To reduce an entire collection of documents, we can map our original corpus matrix C into the reduced dimensional space by taking C := R E C. But we also have to consider the on-line query processing. For any incoming query q R m, we compute q := R E q and evaluate all similarities on the augmented corpus C using cosine similarity: j {1,..., n} : sim(q, d j ) := cos(q, d j). (10) In Algorithm 1 we give a definition of this process in pseudo-code. The algorithmic complexity of this algorithm depends heavily on the distribution of non-zero components in the corpus matrix. But in general, the upper bound for the complexity is O(kmn). (11) In this serial form, our method does not necessarily provide an improvement in the algorithmic complexity, but since the replacement vectors can be computed independently of each other, we have a great potential for parallel scalability. Since the vector replacement method is brand new, we have used the Reuters Corpus Volume I in its corrected second version (RCV1-v2) [13] to conduct an evaluation of the retrieval performance. All 23,149 training documents were used as sample queries and we used the available class labels for the topic categorization to discriminate between relevant and irrelevant search results. We used the pre-vectorized TF-IDF version with 47,236 features in a corpus matrix which is 0.16% dense, i.e. less than one percent of all components are non-zero. Replacing all features which occur in less than 1% of all documents produced 1,430 features and a corpus matrix which is 99.87% dense. In addition, we applied a subsequent spectral rankreduction creating 700 features as a third representation for our evaluation. Figure 1 illustrates our results, which provide a clear indication that our approach succeeds in reducing the dimensionality and improving the retrieval performance. In light of these results, we may accept the replacement vector method as a suitable means for dimensionality reduction for information retrieval. We can therefore proceed to transform the mathematical formulation into one that is more suitable for parallel processing and derive a parallel algorithm. 3
4 Input: The corpus matrix C R m n, the set of documents D, the set of features F and the maximum occurrence count for rare features t N. Data: The occurrence count vector N N m for all features, the elimination features E F, the elimination features present in a document G F, the permutation π : N N, the feature f i F, the document d j D and an integer k N. Output: The replacement matrix R R k m. for d j D do for f i F(d j ) do N(i) += 1; k := 1; for f i F do if N(i) t then E := E {f i }; else π(i) := k, k += 1; k -= 1; for d j D do G := E F(d j ); for f i G do R(1 : k, i) += C(i, j) τ E (C(1 : m, j)), for f i F do if f i E then R(1 : k, i) := (l(i)) 1 R(1 : k, i); else R(1 : k, i) := e π(i) ; l i += C(i, j) ; Algorithm 1: Serial Replacement Vector Construction sparse TD-IDF (47,236) dense rare vector replacement (1,430) rank-reduced vector replacement (700) mean precision hit list rank Figure 1: Evaluation of the precision-at-k retrieval performance on the Reuters Corpus Volume I Version 2 using the topic categorization. 4
5 4 Parallel Construction of Replacement Vectors The obvious approach to parallel construction of replacement vectors is a task-parallel strategy. First, we are going to construct a partitioning of the elimination features E and derive a matrix partitioning for the result data in R E. The parallel computation follows naturally from these partitionings. We therefore partition the set of elimination features E into Q pair-wise disjoint sets E[1],..., E[Q]. Correspondingly, we partition the set of features F into partitions F [1],..., F [Q], so that each feature partition F [q] contains both the elimination features E[q] as well as the common features surrounding them, formally q {1,..., Q} : E[q] F [q]. This gives us a column-partitioning of the projection matrix R E into R E [1],..., R E [Q] according to F [1],..., F [Q]. Now, we can use Q processors to compute these column-partitions in parallel. The computation consists of three phases: 1. Determine the set of local elimination features E and partition E into E[1],..., E[Q] and F into F [1],..., F [Q]. 2. Compute the local column partitions R E [1],..., R E [Q] of the replacement matrix R E. 3. Gather the partitions to form the replacement matrix R E = [R E [1] R E [Q]]. Since every block R E [q] must be computed from the entire document collection, we have to replicate the corpus matrix C to all participating hosts q Q. This is the reason why we refer to this approach as a data-replicated task-parallel strategy. But if the replacement vectors must be computed for an extremely large document collection, we can parallelize the computation of the replacement vectors for a data-partitioned scenario. Mathematically, we partition our corpus into P pairwise disjoint sets of documents D[1],..., D[P ], where D[p] = n p, and the corpus matrix into blocks C[p] R m np, such that C = [ C[1] C[P ] ]. (12) To determine the set of elimination features E we compute the occurrence count for all features as a vector N across all document partitions, N := P N[p] Z m, where N[p](i) := {j {1,..., n p } C[p] i,j 0}, (13) p=1 and compute E = {i {1,..., m} N(i) t}. We restructure the formula for the replacement vectors ρ E according to our partitioning, ρ E (f i ) = 1 λ E (f i ) n P p ρ E [p](f i ), where ρ E [p](f i ) := C[p] i,j τ E (C[p],j ), (14) p=1 with the scaling factor λ E rewritten as λ E (f i ) := j=1 n P p λ E [p](f i ), where λ E [p](f i ) := C[p] i,j. (15) p=1 We compute this data-partitioned strategy in three steps: 1. Compute the local occurrence counts N[1],..., N[P ] and form the global occurrence count N and elimination features E. 2. Compute the local contributions ρ E [1],..., ρ E [P ] and λ E [1],..., λ E [P ]. 3. Compute the global replacement vectors ρ E, scalars λ E and normalize. j=1 5
6 Hosts: P Q processors in a 2D mesh with a switched network. Input: Hosts (1 : P, 1) hold the local corpus matrix C[p] R m np, the local document partition D[p] D, the global set of features F and the maximum occurrence count for rare features t N. Data: The local occurrence count vector N[p] N m, the global occurrence count vector N N m, the global set of elimination features E F, the local set of elimination features E[q] F, the local feature partition F [q] F, the set of elimination features present in a document G F, the index permutation π : N N, the feature variable f i F, the document variable d j D, an integer variable k N, the document partition index p {1,..., P } and the replication index q {1,..., Q}. Output: Host (1, 1) returns the replacement matrix R R k m. parallel for p {1,..., P } do parallel for q {1,..., Q} do if q = 1 then for d j D[p] do for f i F p (d j ) do N[p](i) += 1; all-to-all reduce (1 : P, 1) (1 : P, 1) : k := 1; for f i F do if N(i) t then E := E {f i }; else π(i) := k, k += 1; k -= 1; broadcast (p, 1) (p, 1 : Q) : D[p], C[p], E, π, k; partition E into E[1 : Q] and F into F [1 : Q]; R[p, q] := 0, l[p, q] := 0; for d j D[p] do G := E[q] F p (d j ); for f i G do R[p, q](1 : k, i) += C[p](i, j) τ E (C[p](1 : m, j)); l[p, q](i) += C[p](i, j) ; N := P p=1 N[p]; reduce (1 : P, q) (1, q) : R[q] := P p=1 R[p, q], l[q] := P p=1 l[p, q]; if p = 1 then for f i F [q] do if f i E then R[q](1 : k, i) := (l[q](i)) 1 R[q](1 : k, i); else R[q](1 : k, i) := e π(i) ; gather (1, 1 : Q) (1, 1) : if q = 1 then return R; R := [R[1] R[Q]]; Algorithm 2: Parallel Replacement Vector Construction 6
7 Once we have compute the replacement matrix R, we can use the same processes to compute the projection to the reduced-dimensional space by computing a parallel matrix product using a suitable parallel matrix multiplication algorithm. To achieve scalability in both the number of documents and the number of features, we can combine these two strategies. Algorithm 2 gives the pseudo-code for a combined task and data parallel replacement matrix computation on a 2D mesh of P Q processors. To discuss the theoretic computational complexity without communication costs, we split the algorithm into three phases: 1. We determine the number of occurrences for every feature, the set of elimination features E and the index permutation π on P processes in a complexity of O ( mn P log P ). ( 2. We can use all P Q processes to compute the local parts of the replacement matrix is O ) accumulate their sum in O. ( km log P Q ) kmn P Q and ( ) km 3. Lastly, we use Q processes to normalize the replacement vectors in O Q steps and send all parts to the first hosts. The overall complexity can thus be written as ( ) mnq + kmn + kmp log P + kmp O P Q. (16) As we can see, the hybrid task and data parallelization incurs some overhead. The first phase only operates across the document partitioning on P processes and the third phase can only use the task partitioning with its Q processes. Only the second phase can make effective use of the P Q process mesh. It is therefore imperative that we evaluate the actual performance to see if the data partitioning, the task partitioning and/or the hybrid parallelization deliver a convincing performance improvement in practice. 5 Evaluation Before we turn to an empiric evaluation, let us briefly consider the theoretic speed-up in terms of algorithmic complexity without communication overheads. Combining the complexity of the parallel algorithm in Equation 16 with the serial complexity given in Equation 11, we get a theoretic speed-up S of ( ) knp Q S(k, n, P, Q) = O, (17) k + kp + kp log P + nq with a theoretic parallel efficiency E of ( ) kn E(k, n, P, Q) = O k + kp + kp log P + nq. (18) However, speed-up and efficiency observed in practice differ from this theoretic result, because of the highly sparse nature of our computation. To give an estimate of the parallel performance under real-world conditions, we have implemented our algorithm in C/C++ using MPI as communication middleware. We have conducted execution time measurements on a cluster consisting of four nodes equipped with dual Intel Xeon E5520 CPUs clocked at 2.27 GHz with 4 cores and 48 GiB of RAM on an InfiniBand network running CentOS 64 bit version 5.2, Linux kernel version el5 using OpenMPI version and the GNU Compiler Collection version We executed 8 MPI processes on every node one per physically available core. In our experiments, the sparsity had two distinct effects: Firstly, the sparsity of the corpus matrix C leads to an unequal distribution of the actual workload. This problem is difficult to alleviate, because the short execution time of our algorithm makes it very difficult to apply more advanced partitionings and load balancing techniques. 7
8 n P Q P Q t [s] S E 421, , , , , , , , n P Q P Q t [s] S E 621, , , , , , , , n P Q P Q t [s] S E 804, , , , , , , , Table 1: Experimentally determined optimal mesh sizes: for various numbers of documents n and numbers of processors P Q, we give the most efficient number of document partitions P and elimination feature partitions Q, as well as the resulting execution time t (in seconds), speed-up S and efficiency E. Secondly, the sparse data structures for holding the data make the communication code less efficient. The data must always be packed and unpacked to and from sequential memory segments of varying length. This made message passing with MPI rather inefficient. Furthermore, it forced us to implement all collective operations by hand using point-to-point communications to implement the parallel summation of sparse vectors and matrices with a flat-tree topology. To measure the performance on a realistic data set, we have used the training and test set for the Reuters Corpus Volume I. The test set consists of three parts, which have been accumulated to give us three evaluation sets: the training set and the first part of the test set (421,816 documents), the training set and the first two parts of the test set (621,392 documents) and the the training set together with the entire test set (804,414 documents). Figure 2 depicts our measurements for the response time, speed-up and efficiency. In Table 1, we present the most efficient mesh configurations for a range of problem sizes and numbers of processes, i.e. the number of document partitions P and task partitions Q that delivered the highest parallel efficiency E. Since Q > P across the entire range of problem sizes and process counts, we have a clear indication that the task parallelization is the preferred strategy. The measured efficiency E indicates that the complications caused by the sparsity of the data are indeed manageable. But we also see that the data partitioning becomes increasingly important as the number of documents grows, because the value of P increases for larger data sets and processor counts. 8
9 time [s] processes 421,816 documents 621,392 documents 804,414 documents speed-up processes 421,816 documents 621,392 documents 804,414 documents efficiency processes 421,816 documents 621,392 documents 804,414 documents Figure 2: Empirical performance evaluation showing execution time, speed-up and efficiency for the Reuters Corpus Volume I Version 2. 9
10 6 Summary & Conclusions In this paper, we have introduced a new method of dimensionality reduction for text retrieval and presented the serial algorithm along with an evaluation of the quality of the search results on a standard benchmark data set. We mathematically derived a parallelization based on a hybrid task and data partitioning and presented a parallel algorithm for private-memory parallel computer systems. Despite the fact that the sparse nature of our method presents some difficulties, we analyzed the complexity of both the serial and the parallel algorithm and presented the theoretic speed-up without communication costs. To account for these problems and to give a realistic estimate of the performance of our parallelization, we report the details of an empirical performance evaluation. We believe that we have delivered a solid proof-of-concept implementation and evaluation for our parallel algorithm for dimensionality reduction of text indices in the vector space model. The algorithm delivers a very high quality of search results and the underlying projection matrix can be computed very rapidly. The performance evaluation was conducted for up to 800,000 documents and there is a good indication that the algorithm scales well beyond this problem size. References [1] Berry, M.W., Drmac, Z., Jessup, E.R.: Matrices, Vector Spaces, and Information Retrieval. SIAM Review 41(2) (1999) [2] Berka, T., Vajteršic, M.: Dimensionality Reduction for Information Retrieval using Vector Replacement of Rare Terms. In: Proc. TM. (2011) [3] Jolliffe, I.T.: Principal Component Analysis. 2nd edn. Springer (2002) [4] Paatero, P., Tapper, U.: Positive Matrix Factorization: A Non-negative Factor Model With Optimal Utilization of Error Estimates of Data Values. Environmetrics 5(2) (1994) [5] Deerwester, S.C., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A.: Indexing by Latent Semantic Analysis. Journal of the Society for Information Science 41(6) (1990) [6] Kobayashi, M., Aono, M., Takeuchi, H., Samukawa, H.: Matrix Computations for Information Retrieval and Major and Outlier Cluster Detection. Journal of Computational and Applied Mathematics 149(1) (2002) [7] Hofmann, T.: Probabilistic Latent Semantic Indexing. In: Proc. SIGIR, New York, NY, USA, ACM (1999) [8] Janecek, A.G., Gansterer, W.N.: Utilizing Nonnegative Matrix Factorization for Classification Problems. In Berry, M.W., Kogan, J., eds.: Survey of Text Mining III: Application and Theory. Wiley (2010) [9] Langville, A.N., Berry, M.W.: Nonnegative Matrix Factorization for Document Classification. In: Next Generation of Data Mining. Chapman & Hall/CRC (2008) [10] Dhillon, I.S., Modha, D.S.: Concept Decompositions for Large Sparse Text Data using Clustering. Machine Learning 42(1/2) (2001) [11] Berry, M.W., Do, T., Ogielski, A.: Massively-Parallel Implementations of Lanczos Algorithms for Computing the SVD of Large Sparse Matrices. In: Proc. PP, SIAM (1993) [12] Berry, M.W., Mezher, D., Philippe, B., Sameh, A.: Parallel Algorithms for the Singular Value Decomposition. In: Handbook on Parallel Computing and Statistics. Volume 184 of Statistics: A Series of Textbooks and Monographs. Chapman & Hall, Boca Raton, USA (2006) [13] Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: A New Benchmark Collection for Text Categorization Research. The Journal of Machine Learning Research 5 (December 2004)
Dimensionality Reduction for Information Retrieval using Vector Replacement of Rare Terms
Dimensionality Reduction for Information Retrieval using Vector Replacement of Rare Terms Tobias Berka Marian Vajteršic Abstract Dimensionality reduction by algebraic methods is an established technique
More informationMinoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University
Information Retrieval System Using Concept Projection Based on PDDP algorithm Minoru SASAKI and Kenji KITA Department of Information Science & Intelligent Systems Faculty of Engineering, Tokushima University
More informationA Content Vector Model for Text Classification
A Content Vector Model for Text Classification Eric Jiang Abstract As a popular rank-reduced vector space approach, Latent Semantic Indexing (LSI) has been used in information retrieval and other applications.
More informationLRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier
LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier Wang Ding, Songnian Yu, Shanqing Yu, Wei Wei, and Qianfeng Wang School of Computer Engineering and Science, Shanghai University, 200072
More informationClustered SVD strategies in latent semantic indexing q
Information Processing and Management 41 (5) 151 163 www.elsevier.com/locate/infoproman Clustered SVD strategies in latent semantic indexing q Jing Gao, Jun Zhang * Laboratory for High Performance Scientific
More informationFeature selection. LING 572 Fei Xia
Feature selection LING 572 Fei Xia 1 Creating attribute-value table x 1 x 2 f 1 f 2 f K y Choose features: Define feature templates Instantiate the feature templates Dimensionality reduction: feature selection
More informationVector Space Models: Theory and Applications
Vector Space Models: Theory and Applications Alexander Panchenko Centre de traitement automatique du langage (CENTAL) Université catholique de Louvain FLTR 2620 Introduction au traitement automatique du
More informationDistributed Information Retrieval using LSI. Markus Watzl and Rade Kutil
Distributed Information Retrieval using LSI Markus Watzl and Rade Kutil Abstract. Latent semantic indexing (LSI) is a recently developed method for information retrieval (IR). It is a modification of the
More informationA ew Algorithm for Community Identification in Linked Data
A ew Algorithm for Community Identification in Linked Data Nacim Fateh Chikhi, Bernard Rothenburger, Nathalie Aussenac-Gilles Institut de Recherche en Informatique de Toulouse 118, route de Narbonne 31062
More informationA Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition
A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es
More informationA Patent Retrieval Method Using a Hierarchy of Clusters at TUT
A Patent Retrieval Method Using a Hierarchy of Clusters at TUT Hironori Doi Yohei Seki Masaki Aono Toyohashi University of Technology 1-1 Hibarigaoka, Tenpaku-cho, Toyohashi-shi, Aichi 441-8580, Japan
More informationA Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition
A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es
More informationText Modeling with the Trace Norm
Text Modeling with the Trace Norm Jason D. M. Rennie jrennie@gmail.com April 14, 2006 1 Introduction We have two goals: (1) to find a low-dimensional representation of text that allows generalization to
More informationJune 15, Abstract. 2. Methodology and Considerations. 1. Introduction
Organizing Internet Bookmarks using Latent Semantic Analysis and Intelligent Icons Note: This file is a homework produced by two students for UCR CS235, Spring 06. In order to fully appreacate it, it may
More informationExplore Co-clustering on Job Applications. Qingyun Wan SUNet ID:qywan
Explore Co-clustering on Job Applications Qingyun Wan SUNet ID:qywan 1 Introduction In the job marketplace, the supply side represents the job postings posted by job posters and the demand side presents
More informationInformation Retrieval. hussein suleman uct cs
Information Management Information Retrieval hussein suleman uct cs 303 2004 Introduction Information retrieval is the process of locating the most relevant information to satisfy a specific information
More informationCompression, Clustering and Pattern Discovery in Very High Dimensional Discrete-Attribute Datasets
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 1 Compression, Clustering and Pattern Discovery in Very High Dimensional Discrete-Attribute Datasets Mehmet Koyutürk, Ananth Grama, and Naren Ramakrishnan
More informationTwo-Dimensional Visualization for Internet Resource Discovery. Shih-Hao Li and Peter B. Danzig. University of Southern California
Two-Dimensional Visualization for Internet Resource Discovery Shih-Hao Li and Peter B. Danzig Computer Science Department University of Southern California Los Angeles, California 90089-0781 fshli, danzigg@cs.usc.edu
More informationBlock Lanczos-Montgomery Method over Large Prime Fields with GPU Accelerated Dense Operations
Block Lanczos-Montgomery Method over Large Prime Fields with GPU Accelerated Dense Operations D. Zheltkov, N. Zamarashkin INM RAS September 24, 2018 Scalability of Lanczos method Notations Matrix order
More informationString Vector based KNN for Text Categorization
458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research
More informationhighest cosine coecient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate
Searching Information Servers Based on Customized Proles Technical Report USC-CS-96-636 Shih-Hao Li and Peter B. Danzig Computer Science Department University of Southern California Los Angeles, California
More informationConcept Based Search Using LSI and Automatic Keyphrase Extraction
Concept Based Search Using LSI and Automatic Keyphrase Extraction Ravina Rodrigues, Kavita Asnani Department of Information Technology (M.E.) Padre Conceição College of Engineering Verna, India {ravinarodrigues
More informationMulti-Stage Rocchio Classification for Large-scale Multilabeled
Multi-Stage Rocchio Classification for Large-scale Multilabeled Text data Dong-Hyun Lee Nangman Computing, 117D Garden five Tools, Munjeong-dong Songpa-gu, Seoul, Korea dhlee347@gmail.com Abstract. Large-scale
More informationCollaborative Filtering based on User Trends
Collaborative Filtering based on User Trends Panagiotis Symeonidis, Alexandros Nanopoulos, Apostolos Papadopoulos, and Yannis Manolopoulos Aristotle University, Department of Informatics, Thessalonii 54124,
More informationOn Finding Power Method in Spreading Activation Search
On Finding Power Method in Spreading Activation Search Ján Suchal Slovak University of Technology Faculty of Informatics and Information Technologies Institute of Informatics and Software Engineering Ilkovičova
More informationBlock Lanczos-Montgomery method over large prime fields with GPU accelerated dense operations
Block Lanczos-Montgomery method over large prime fields with GPU accelerated dense operations Nikolai Zamarashkin and Dmitry Zheltkov INM RAS, Gubkina 8, Moscow, Russia {nikolai.zamarashkin,dmitry.zheltkov}@gmail.com
More informationUsing Singular Value Decomposition to Improve a Genetic Algorithm s Performance
Using Singular Value Decomposition to Improve a Genetic Algorithm s Performance Jacob G. Martin Computer Science University of Georgia Athens, GA 30602 martin@cs.uga.edu Khaled Rasheed Computer Science
More informationSemantic text features from small world graphs
Semantic text features from small world graphs Jurij Leskovec 1 and John Shawe-Taylor 2 1 Carnegie Mellon University, USA. Jozef Stefan Institute, Slovenia. jure@cs.cmu.edu 2 University of Southampton,UK
More informationADAPTIVE LOW RANK AND SPARSE DECOMPOSITION OF VIDEO USING COMPRESSIVE SENSING
ADAPTIVE LOW RANK AND SPARSE DECOMPOSITION OF VIDEO USING COMPRESSIVE SENSING Fei Yang 1 Hong Jiang 2 Zuowei Shen 3 Wei Deng 4 Dimitris Metaxas 1 1 Rutgers University 2 Bell Labs 3 National University
More informationData Mining. Jeff M. Phillips. January 7, 2019 CS 5140 / CS 6140
Data Mining CS 5140 / CS 6140 Jeff M. Phillips January 7, 2019 What is Data Mining? What is Data Mining? Finding structure in data? Machine learning on large data? Unsupervised learning? Large scale computational
More informationAn Approach to Identify the Number of Clusters
An Approach to Identify the Number of Clusters Katelyn Gao Heather Hardeman Edward Lim Cristian Potter Carl Meyer Ralph Abbey July 11, 212 Abstract In this technological age, vast amounts of data are generated.
More informationComparing and Combining Dimension Reduction Techniques for Efficient Text Clustering
Comparing and Combining Dimension Reduction Techniques for Efficient Text Clustering Bin Tang Michael Shepherd Evangelos Milios Malcolm I. Heywood Faculty of Computer Science, Dalhousie University, Halifax,
More informationBalance of Processing and Communication using Sparse Networks
Balance of essing and Communication using Sparse Networks Ville Leppänen and Martti Penttonen Department of Computer Science University of Turku Lemminkäisenkatu 14a, 20520 Turku, Finland and Department
More informationThe Semantic Conference Organizer
34 The Semantic Conference Organizer Kevin Heinrich, Michael W. Berry, Jack J. Dongarra, Sathish Vadhiyar University of Tennessee, Knoxville, USA CONTENTS 34.1 Background... 571 34.2 Latent Semantic Indexing...
More informationSprinkled Latent Semantic Indexing for Text Classification with Background Knowledge
Sprinkled Latent Semantic Indexing for Text Classification with Background Knowledge Haiqin Yang and Irwin King Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin,
More informationKeyword Extraction by KNN considering Similarity among Features
64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,
More information2.3 Algorithms Using Map-Reduce
28 CHAPTER 2. MAP-REDUCE AND THE NEW SOFTWARE STACK one becomes available. The Master must also inform each Reduce task that the location of its input from that Map task has changed. Dealing with a failure
More informationIn = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most
In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100
More informationEssential Dimensions of Latent Semantic Indexing (LSI)
Essential Dimensions of Latent Semantic Indexing (LSI) April Kontostathis Department of Mathematics and Computer Science Ursinus College Collegeville, PA 19426 Email: akontostathis@ursinus.edu Abstract
More informationMathematics. 2.1 Introduction: Graphs as Matrices Adjacency Matrix: Undirected Graphs, Directed Graphs, Weighted Graphs CHAPTER 2
CHAPTER Mathematics 8 9 10 11 1 1 1 1 1 1 18 19 0 1.1 Introduction: Graphs as Matrices This chapter describes the mathematics in the GraphBLAS standard. The GraphBLAS define a narrow set of mathematical
More informationAn efficient algorithm for sparse PCA
An efficient algorithm for sparse PCA Yunlong He Georgia Institute of Technology School of Mathematics heyunlong@gatech.edu Renato D.C. Monteiro Georgia Institute of Technology School of Industrial & System
More informationAnalysis and Latent Semantic Indexing
18 Principal Component Analysis and Latent Semantic Indexing Understand the basics of principal component analysis and latent semantic index- Lab Objective: ing. Principal Component Analysis Understanding
More informationImproving Probabilistic Latent Semantic Analysis with Principal Component Analysis
Improving Probabilistic Latent Semantic Analysis with Principal Component Analysis Ayman Farahat Palo Alto Research Center 3333 Coyote Hill Road Palo Alto, CA 94304 ayman.farahat@gmail.com Francine Chen
More informationBipartite Graph Partitioning and Content-based Image Clustering
Bipartite Graph Partitioning and Content-based Image Clustering Guoping Qiu School of Computer Science The University of Nottingham qiu @ cs.nott.ac.uk Abstract This paper presents a method to model the
More informationPredictive Indexing for Fast Search
Predictive Indexing for Fast Search Sharad Goel, John Langford and Alex Strehl Yahoo! Research, New York Modern Massive Data Sets (MMDS) June 25, 2008 Goel, Langford & Strehl (Yahoo! Research) Predictive
More informationComparing and Combining Dimension Reduction Techniques for Efficient Text Clustering
Comparing and Combining Dimension Reduction Techniques for Efficient Text Clustering Bin Tang, Michael Shepherd, Evangelos Milios, Malcolm I. Heywood {btang, shepherd, eem, mheywood}@cs.dal.ca Faculty
More informationDesigning and Building an Automatic Information Retrieval System for Handling the Arabic Data
American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far
More informationA Connection between Network Coding and. Convolutional Codes
A Connection between Network Coding and 1 Convolutional Codes Christina Fragouli, Emina Soljanin christina.fragouli@epfl.ch, emina@lucent.com Abstract The min-cut, max-flow theorem states that a source
More informationClustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search
Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2
More informationFeature Selection Using Modified-MCA Based Scoring Metric for Classification
2011 International Conference on Information Communication and Management IPCSIT vol.16 (2011) (2011) IACSIT Press, Singapore Feature Selection Using Modified-MCA Based Scoring Metric for Classification
More informationImpact of Term Weighting Schemes on Document Clustering A Review
Volume 118 No. 23 2018, 467-475 ISSN: 1314-3395 (on-line version) url: http://acadpubl.eu/hub ijpam.eu Impact of Term Weighting Schemes on Document Clustering A Review G. Hannah Grace and Kalyani Desikan
More informationENHANCEMENT OF METICULOUS IMAGE SEARCH BY MARKOVIAN SEMANTIC INDEXING MODEL
ENHANCEMENT OF METICULOUS IMAGE SEARCH BY MARKOVIAN SEMANTIC INDEXING MODEL Shwetha S P 1 and Alok Ranjan 2 Visvesvaraya Technological University, Belgaum, Dept. of Computer Science and Engineering, Canara
More informationA NEW SIMPLEX TYPE ALGORITHM FOR THE MINIMUM COST NETWORK FLOW PROBLEM
A NEW SIMPLEX TYPE ALGORITHM FOR THE MINIMUM COST NETWORK FLOW PROBLEM KARAGIANNIS PANAGIOTIS PAPARRIZOS KONSTANTINOS SAMARAS NIKOLAOS SIFALERAS ANGELO * Department of Applied Informatics, University of
More informationMore Efficient Classification of Web Content Using Graph Sampling
More Efficient Classification of Web Content Using Graph Sampling Chris Bennett Department of Computer Science University of Georgia Athens, Georgia, USA 30602 bennett@cs.uga.edu Abstract In mining information
More informationClassification Based on NMF
E-Mail Classification Based on NMF Andreas G. K. Janecek* Wilfried N. Gansterer Abstract The utilization of nonnegative matrix factorization (NMF) in the context of e-mail classification problems is investigated.
More informationNew user profile learning for extremely sparse data sets
New user profile learning for extremely sparse data sets Tomasz Hoffmann, Tadeusz Janasiewicz, and Andrzej Szwabe Institute of Control and Information Engineering, Poznan University of Technology, pl.
More informationMultimodal Information Spaces for Content-based Image Retrieval
Research Proposal Multimodal Information Spaces for Content-based Image Retrieval Abstract Currently, image retrieval by content is a research problem of great interest in academia and the industry, due
More informationGraphBLAS Mathematics - Provisional Release 1.0 -
GraphBLAS Mathematics - Provisional Release 1.0 - Jeremy Kepner Generated on April 26, 2017 Contents 1 Introduction: Graphs as Matrices........................... 1 1.1 Adjacency Matrix: Undirected Graphs,
More informationJoint Entity Resolution
Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute
More informationMining Web Data. Lijun Zhang
Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems
More informationKnowledge Discovery and Data Mining 1 (VO) ( )
Knowledge Discovery and Data Mining 1 (VO) (707.003) Data Matrices and Vector Space Model Denis Helic KTI, TU Graz Nov 6, 2014 Denis Helic (KTI, TU Graz) KDDM1 Nov 6, 2014 1 / 55 Big picture: KDDM Probability
More informationLecture 27: Fast Laplacian Solvers
Lecture 27: Fast Laplacian Solvers Scribed by Eric Lee, Eston Schweickart, Chengrun Yang November 21, 2017 1 How Fast Laplacian Solvers Work We want to solve Lx = b with L being a Laplacian matrix. Recall
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval (Supplementary Material) Zhou Shuigeng March 23, 2007 Advanced Distributed Computing 1 Text Databases and IR Text databases (document databases) Large collections
More informationVisualization of Text Document Corpus
Informatica 29 (2005) 497 502 497 Visualization of Text Document Corpus Blaž Fortuna, Marko Grobelnik and Dunja Mladenić Jozef Stefan Institute Jamova 39, 1000 Ljubljana, Slovenia E-mail: {blaz.fortuna,
More informationSome Advanced Topics in Linear Programming
Some Advanced Topics in Linear Programming Matthew J. Saltzman July 2, 995 Connections with Algebra and Geometry In this section, we will explore how some of the ideas in linear programming, duality theory,
More informationChapter 1 AN INTRODUCTION TO TEXT MINING. 1. Introduction. Charu C. Aggarwal. ChengXiang Zhai
Chapter 1 AN INTRODUCTION TO TEXT MINING Charu C. Aggarwal IBM T. J. Watson Research Center Yorktown Heights, NY charu@us.ibm.com ChengXiang Zhai University of Illinois at Urbana-Champaign Urbana, IL czhai@cs.uiuc.edu
More informationContent-based Dimensionality Reduction for Recommender Systems
Content-based Dimensionality Reduction for Recommender Systems Panagiotis Symeonidis Aristotle University, Department of Informatics, Thessaloniki 54124, Greece symeon@csd.auth.gr Abstract. Recommender
More informationA novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems
A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University of Economics
More informationDense Matrix Algorithms
Dense Matrix Algorithms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text Introduction to Parallel Computing, Addison Wesley, 2003. Topic Overview Matrix-Vector Multiplication
More informationUAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA
UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA METANAT HOOSHSADAT, SAMANEH BAYAT, PARISA NAEIMI, MAHDIEH S. MIRIAN, OSMAR R. ZAÏANE Computing Science Department, University
More informationEncoding Words into String Vectors for Word Categorization
Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,
More informationMatrices. Chapter Matrix A Mathematical Definition Matrix Dimensions and Notation
Chapter 7 Introduction to Matrices This chapter introduces the theory and application of matrices. It is divided into two main sections. Section 7.1 discusses some of the basic properties and operations
More informationA Parallel Algorithm for Exact Structure Learning of Bayesian Networks
A Parallel Algorithm for Exact Structure Learning of Bayesian Networks Olga Nikolova, Jaroslaw Zola, and Srinivas Aluru Department of Computer Engineering Iowa State University Ames, IA 0010 {olia,zola,aluru}@iastate.edu
More informationWeb page recommendation using a stochastic process model
Data Mining VII: Data, Text and Web Mining and their Business Applications 233 Web page recommendation using a stochastic process model B. J. Park 1, W. Choi 1 & S. H. Noh 2 1 Computer Science Department,
More informationReviewer Profiling Using Sparse Matrix Regression
Reviewer Profiling Using Sparse Matrix Regression Evangelos E. Papalexakis, Nicholas D. Sidiropoulos, Minos N. Garofalakis Technical University of Crete, ECE department 14 December 2010, OEDM 2010, Sydney,
More informationA New Technique to Optimize User s Browsing Session using Data Mining
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,
More informationPublished in A R DIGITECH
IMAGE RETRIEVAL USING LATENT SEMANTIC INDEXING Rachana C Patil*1, Imran R. Shaikh*2 *1 (M.E Student S.N.D.C.O.E.R.C, Yeola) *2(Professor, S.N.D.C.O.E.R.C, Yeola) rachanap4@gmail.com*1, imran.shaikh22@gmail.com*2
More informationA Metadatabase System for Semantic Image Search by a Mathematical Model of Meaning
A Metadatabase System for Semantic Image Search by a Mathematical Model of Meaning Yasushi Kiyoki, Takashi Kitagawa and Takanari Hayama Institute of Information Sciences and Electronics University of Tsukuba
More informationInformation Retrieval. (M&S Ch 15)
Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion
More informationParallel Hybrid Monte Carlo Algorithms for Matrix Computations
Parallel Hybrid Monte Carlo Algorithms for Matrix Computations V. Alexandrov 1, E. Atanassov 2, I. Dimov 2, S.Branford 1, A. Thandavan 1 and C. Weihrauch 1 1 Department of Computer Science, University
More informationLecture 9: Group Communication Operations. Shantanu Dutt ECE Dept. UIC
Lecture 9: Group Communication Operations Shantanu Dutt ECE Dept. UIC Acknowledgement Adapted from Chapter 4 slides of the text, by A. Grama w/ a few changes, augmentations and corrections Topic Overview
More informationTDT- An Efficient Clustering Algorithm for Large Database Ms. Kritika Maheshwari, Mr. M.Rajsekaran
TDT- An Efficient Clustering Algorithm for Large Database Ms. Kritika Maheshwari, Mr. M.Rajsekaran M-Tech Scholar, Department of Computer Science and Engineering, SRM University, India Assistant Professor,
More informationMining Social Network Graphs
Mining Social Network Graphs Analysis of Large Graphs: Community Detection Rafael Ferreira da Silva rafsilva@isi.edu http://rafaelsilva.com Note to other teachers and users of these slides: We would be
More informationInstructor: Stefan Savev
LECTURE 2 What is indexing? Indexing is the process of extracting features (such as word counts) from the documents (in other words: preprocessing the documents). The process ends with putting the information
More informationLecture 5: Information Retrieval using the Vector Space Model
Lecture 5: Information Retrieval using the Vector Space Model Trevor Cohn (tcohn@unimelb.edu.au) Slide credits: William Webber COMP90042, 2015, Semester 1 What we ll learn today How to take a user query
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra)
AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 1: Course Overview; Matrix Multiplication Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical Analysis I 1 / 21 Outline 1 Course
More informationJames Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!
James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation
More informationAccelerating Implicit LS-DYNA with GPU
Accelerating Implicit LS-DYNA with GPU Yih-Yih Lin Hewlett-Packard Company Abstract A major hindrance to the widespread use of Implicit LS-DYNA is its high compute cost. This paper will show modern GPU,
More informationA Method for Construction of Orthogonal Arrays 1
Eighth International Workshop on Optimal Codes and Related Topics July 10-14, 2017, Sofia, Bulgaria pp. 49-54 A Method for Construction of Orthogonal Arrays 1 Iliya Bouyukliev iliyab@math.bas.bg Institute
More informationGeneralized trace ratio optimization and applications
Generalized trace ratio optimization and applications Mohammed Bellalij, Saïd Hanafi, Rita Macedo and Raca Todosijevic University of Valenciennes, France PGMO Days, 2-4 October 2013 ENSTA ParisTech PGMO
More informationInternational Journal of Advanced Research in Computer Science and Software Engineering
Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue:
More informationDimension reduction : PCA and Clustering
Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental
More informationText Document Clustering Using DPM with Concept and Feature Analysis
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 2, Issue. 10, October 2013,
More informationResults and Discussions on Transaction Splitting Technique for Mining Differential Private Frequent Itemsets
Results and Discussions on Transaction Splitting Technique for Mining Differential Private Frequent Itemsets Sheetal K. Labade Computer Engineering Dept., JSCOE, Hadapsar Pune, India Srinivasa Narasimha
More informationCluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1
Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods
More informationClustering Web Documents using Hierarchical Method for Efficient Cluster Formation
Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation I.Ceema *1, M.Kavitha *2, G.Renukadevi *3, G.sripriya *4, S. RajeshKumar #5 * Assistant Professor, Bon Secourse College
More informationESANN'2001 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2001, D-Facto public., ISBN ,
An Integrated Neural IR System. Victoria J. Hodge Dept. of Computer Science, University ofyork, UK vicky@cs.york.ac.uk Jim Austin Dept. of Computer Science, University ofyork, UK austin@cs.york.ac.uk Abstract.
More informationA Geometric Analysis of Subspace Clustering with Outliers
A Geometric Analysis of Subspace Clustering with Outliers Mahdi Soltanolkotabi and Emmanuel Candés Stanford University Fundamental Tool in Data Mining : PCA Fundamental Tool in Data Mining : PCA Subspace
More information6. Lecture notes on matroid intersection
Massachusetts Institute of Technology 18.453: Combinatorial Optimization Michel X. Goemans May 2, 2017 6. Lecture notes on matroid intersection One nice feature about matroids is that a simple greedy algorithm
More informationFast Efficient Clustering Algorithm for Balanced Data
Vol. 5, No. 6, 214 Fast Efficient Clustering Algorithm for Balanced Data Adel A. Sewisy Faculty of Computer and Information, Assiut University M. H. Marghny Faculty of Computer and Information, Assiut
More information