Parallel Rare Term Vector Replacement: Fast and Effective Dimensionality Reduction for Text (Abridged)


Tobias Berka, Marian Vajteršic

Department of Computer Sciences, University of Salzburg, Austria (tberka@cosy.sbg.ac.at, marian@cosy.sbg.ac.at)
Department of Informatics, Mathematical Institute, Slovak Academy of Sciences, Bratislava, Slovakia

Abstract. Dimensionality reduction is an established area in text mining and information retrieval. These methods convert the highly sparse corpus matrices into a dense matrix format while preserving or improving the classification accuracy or retrieval performance. In this paper, we describe a novel approach to dimensionality reduction for text, along with a parallel algorithm suitable for private memory parallel computer systems. According to Zipf's law, the majority of indexing terms occurs only in a small number of documents. Our algorithm replaces rare terms by computing a vector which expresses their semantics in terms of common terms. This process produces a projection matrix, which can be applied to a corpus matrix and to individual document and query vectors. We give an accurate mathematical and algorithmic description of our algorithms and present an experimental evaluation on two benchmark corpora. These experiments indicate that our algorithm can deliver a substantial reduction in the number of features, from 47,236 to 392 features on the Reuters corpus, with a clear improvement in the retrieval performance. We have evaluated our parallel implementation using the Message Passing Interface (MPI) with up to 32 processes on a Nehalem Xeon cluster, computing the projection matrix for the dimensionality reduction for over 800,000 documents in just under 100 seconds.

This is a strongly abridged version of the article Tobias Berka and Marian Vajteršic: Parallel Rare Term Vector Replacement: Fast and Effective Dimensionality Reduction for Text. Journal of Parallel and Distributed Computing, 2012 (to appear).

1 Introduction

Dimensionality reduction techniques reduce the number of components of a data set by representing the original data as accurately as possible with fewer features and/or instances. These techniques are put to work in a broad range of content-based retrieval applications, such as text and multimedia mining, web search, etc. The goal is to produce a more compact representation of the data with only limited loss of information, in order to reduce the storage and runtime requirements. In some applications, the reduction is used to discover and account for latent dependencies between features which are not reflected in the original data [1]. Here, the lower dimensionality not only reduces the computational costs, but also improves the retrieval performance. In the area of text retrieval, low-rank matrix approximations have long been utilized to deal with polysemy and synonymy: words with multiple meanings (e.g. "bank" or "light") and different words with the same meaning (e.g. "drowsy" and "sleepy"). But the problem with information retrieval applications is that these systems operate on tens of thousands of features and millions of documents or more. The key issues that warrant the investigation of new dimensionality reduction techniques are performance and scalability. While traditional methods focus on theoretical properties of the underlying mathematical method, the ultimate measure of success for any dimensionality reduction technique is its impact on the retrieval performance. The development of new methods for dimensionality reduction beyond the traditional algebraic or statistical methods is therefore fair game, as long as the retrieval performance is equal to or better than on the unprocessed raw data.
Rare term vector replacement is a new dimensionality reduction technique for text retrieval [2]. It is based on the fact that, according to Zipf's law, most terms occur only in a very limited number of documents, which makes them interesting candidates for any form of dimensionality reduction or compression.

Deletion of such rare terms is out of the question, because they play an important role in the search process. In our approach, we attempt to replace rare terms with a signature feature vector, which preserves the semantics of the rare term by expressing it in more frequent terms. This vector is then scaled by the frequency of the rare term in the document at hand.

The rest of this paper is structured as follows. We describe the serial approach in Section 3 and present a parallelization in Section 4. We evaluate the performance and scalability in Section 5. Lastly, we summarize our findings in Section 6. But first, we briefly examine the state of the art in dimensionality reduction for text retrieval.

2 Related Work

There are several well-known dimensionality reduction techniques in the fields of numerical linear algebra and statistics. The two foremost are the singular value decomposition (SVD), see e.g. [1], and the principal component analysis (PCA), see e.g. [3]. Non-negative matrix factorizations (NMF), see e.g. [4], are a more recent development motivated by factor analysis, where non-negativity may be necessary to interpret the factors. In information retrieval, latent semantic indexing (LSI), see [5], is the straightforward application of the SVD to the task at hand. The PCA has also been applied in the COV approach [6]. Factor analysis applied to automated indexing has been reported as probabilistic latent semantic analysis (PLSA) in [7]. NMF methods are often used in various text classification tasks [8, 9]. The reduction by projection onto the representative vectors of a feature clustering has in fact been developed specifically for text retrieval applications [10]. The use of kernel methods can lead to a quadratic increase in the number of features and is therefore unsuitable for sparse, high-dimensional text data. There are many different parallelizations and algorithmic approaches for all of the above, and a full review of parallel SVD, PCA and NMF methods is clearly beyond the scope of this paper. However, for the purposes of text retrieval, the thin SVD computed with the Lanczos algorithm is of primary importance [11, 12].

Our own method for reducing the dimensionality of a text index is based on an entirely new approach. It is based on an intuition about text indices and has no counterpart in matrix factorizations or descriptive statistics. We therefore need to examine the mathematical formulation and the impact on the retrieval quality.

3 Rare Term Vector Replacement

We have a set $\mathcal{D}$ of $|\mathcal{D}| = n$ documents and a set $\mathcal{F}$ of $|\mathcal{F}| = m$ features. Every document $d \in \mathcal{D}$ is represented as a feature vector $d \in \mathbb{R}^m$, to which we assign a column index $\mathrm{col}(d) \in \{1, \dots, n\}$. As a convenient notation, we use $d_j$ to denote $\mathrm{col}(d) = j$. Analogously, every feature $f \in \mathcal{F}$ has a row index $\mathrm{row}(f) \in \{1, \dots, m\}$, with the short form $f_i$ for $\mathrm{row}(f) = i$. We can thus form a corpus matrix $C \in \mathbb{R}^{m \times n}$, which contains the documents' feature vectors as column vectors:

$C = [\, d_1 \; d_2 \; \cdots \; d_n \,] \in \mathbb{R}^{m \times n}$. (1)

Since we are interested in the occurrence of features within documents, we define a function $\mathcal{D} : \mathcal{F} \to 2^{\mathcal{D}}$ to determine which documents contain a particular feature $f_i$, and dually a function $\mathcal{F} : \mathcal{D} \to 2^{\mathcal{F}}$ that determines which features occur in a given document, formally

$\mathcal{D}(f_i) := \{ d_j \in \mathcal{D} \mid C_{i,j} \neq 0 \}$, (2)

$\mathcal{F}(d_j) := \{ f_i \in \mathcal{F} \mid C_{i,j} \neq 0 \}$. (3)

We select the rare features through a function $\mathcal{N} : \mathbb{N} \to 2^{\mathcal{F}}$ that determines the set of features occurring in at most $t$ documents,

$\mathcal{N}(t) := \{ f \in \mathcal{F} \mid |\mathcal{D}(f)| \leq t \}$. (4)
After choosing an elimination threshold $t$, which is 1% of all documents in our experiments, we can now define the set of elimination features $E \subseteq \mathcal{F}$ as $E := \mathcal{N}(t)$, which will ultimately lead to a reduced-dimensional feature space consisting of $k$ common terms, where $k := m - |E|$.
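As a concrete illustration of these definitions, the following Python sketch computes the occurrence counts $|\mathcal{D}(f_i)|$, the elimination set $E = \mathcal{N}(t)$ and the reduced dimension $k$ from a sparse corpus matrix. It is our own minimal rendering under the definitions above (function and variable names are ours), not the authors' C/C++ code:

```python
import numpy as np
import scipy.sparse as sp

def elimination_features(C, fraction=0.01):
    """Select rare features: E = N(t) and k = m - |E| for an m x n
    corpus matrix C with features as rows, following equation (4)."""
    m, n = C.shape
    t = int(fraction * n)                   # elimination threshold t: 1% of all documents
    occ = np.diff(sp.csr_matrix(C).indptr)  # occ[i] = |D(f_i)|, non-zeros in row i
    E = np.flatnonzero(occ <= t)            # features occurring in at most t documents
    return E, m - len(E)
```

With fraction = 0.01, this reproduces the elimination threshold used in the experiments below.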

To eliminate an index $i$ of a vector $v$, we define a truncation operator $\tau$ for the elimination of a set of indices using the recursive definition

$\tau_{\{i\}}(v) := (v_1 \; \cdots \; v_{i-1} \; v_{i+1} \; \cdots \; v_m)^T$, (5)

$\tau_{A \cup \{b\}}(v) := \tau_{\{b\}}(\tau_A(v))$, $\quad \tau_{\emptyset}(v) := v$. (6)

We use a linear combination of common features to replace rare features $f_i \in E$. Formally, we initially determine the set of documents that contain the feature, $\mathcal{D}(f_i)$. We truncate all document vectors $d_j \in \mathcal{D}(f_i)$, eliminating all rare features by taking $\tau_E(d_j)$, and scale them by the quantification $C_{i,j}$ of feature $f_i$ in document $d_j$. We compute the sum of these vectors and apply a scaling factor $\lambda$ to normalize the length of the resulting replacement vector. Formally, we define the function $\rho_E$ to compute the replacement vector:

$\rho_E(f_i) := \frac{1}{\lambda_E(f_i)} \sum_{d_j \in \mathcal{D}(f_i)} C_{i,j} \, \tau_E(C_{\cdot,j})$, where $\lambda_E(f_i) := \sum_{d_j \in \mathcal{D}(f_i)} C_{i,j}$. (7)

If we use the notation $e_1, \dots, e_m \in \mathbb{R}^m$ to denote the canonical base vectors, we can define replacement vectors $r_E$, which either preserve features we do not wish to eliminate, or perform the vector replacement on features $f_i \in E$:

$r_E(f_i) := \begin{cases} \rho_E(f_i) & \text{if } f_i \in E \\ \tau_E(e_i) & \text{if } f_i \notin E. \end{cases}$ (8)

We now assemble these vectors column-wise to form a replacement matrix $R_E$ for the elimination features $E$, formally

$R_E := [\, r_E(f_1) \; r_E(f_2) \; \cdots \; r_E(f_m) \,] \in \mathbb{R}^{k \times m}$. (9)

For any vector $d_j \in \mathbb{R}^m$, we can obtain an improved, dimensionality-reduced vector $d'_j \in \mathbb{R}^k$ by taking $d'_j := R_E \, d_j$. To reduce an entire collection of documents, we can map our original corpus matrix $C$ into the reduced-dimensional space by taking $C' := R_E \, C$. But we also have to consider the on-line query processing. For any incoming query $q \in \mathbb{R}^m$, we compute $q' := R_E \, q$ and evaluate all similarities on the augmented corpus $C'$ using the cosine similarity:

$\forall j \in \{1, \dots, n\}: \; \mathrm{sim}(q, d_j) := \cos(q', d'_j)$. (10)

In Algorithm 1, we give a definition of this process in pseudo-code. The algorithmic complexity of this algorithm depends heavily on the distribution of non-zero components in the corpus matrix, but in general, the upper bound for the complexity is

$O(kmn)$. (11)

In this serial form, our method does not necessarily provide an improvement in the algorithmic complexity, but since the replacement vectors can be computed independently of each other, we have a great potential for parallel scalability.

Since the vector replacement method is brand new, we have used the Reuters Corpus Volume I in its corrected second version (RCV1-v2) [13] to conduct an evaluation of the retrieval performance. All 23,149 training documents were used as sample queries, and we used the available class labels of the topic categorization to discriminate between relevant and irrelevant search results. We used the pre-vectorized TF-IDF version with 47,236 features in a corpus matrix which is 0.16% dense, i.e. less than one percent of all components are non-zero. Replacing all features which occur in less than 1% of all documents produced 1,430 features and a corpus matrix which is 99.87% dense. In addition, we applied a subsequent spectral rank reduction creating 700 features as a third representation for our evaluation. Figure 1 illustrates our results, which provide a clear indication that our approach succeeds in reducing the dimensionality and improving the retrieval performance. In light of these results, we may accept the replacement vector method as a suitable means of dimensionality reduction for information retrieval.
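To make the on-line use of $R_E$ in equation (10) concrete, here is a minimal NumPy sketch of query projection and cosine ranking; the reduced corpus $C' := R_E C$ would of course be computed once, off-line. The function is our own hypothetical illustration, not part of the paper:

```python
import numpy as np

def rank_documents(R, C_reduced, q):
    """Rank documents against a query per equation (10): project the
    query with R_E and compare in the reduced space by cosine."""
    q_red = R @ q                                     # q' := R_E q
    norms = np.linalg.norm(C_reduced, axis=0) * np.linalg.norm(q_red)
    sims = (C_reduced.T @ q_red) / np.where(norms > 0.0, norms, 1.0)
    return np.argsort(-sims)                          # best matches first

# C_reduced = R @ C is the pre-computed reduced corpus C' := R_E C.
```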
We can therefore proceed to transform the mathematical formulation into one that is more suitable for parallel processing and derive a parallel algorithm.

Input: The corpus matrix C ∈ R^{m×n}, the set of documents D, the set of features F and the maximum occurrence count for rare features t ∈ N.
Data: The occurrence count vector N ∈ N^m for all features, the elimination features E ⊆ F, the elimination features present in a document G ⊆ F, the permutation π : N → N, the feature f_i ∈ F, the document d_j ∈ D and an integer k ∈ N.
Output: The replacement matrix R ∈ R^{k×m}.

for d_j ∈ D do
    for f_i ∈ F(d_j) do
        N(i) += 1;
k := 1;
for f_i ∈ F do
    if N(i) ≤ t then E := E ∪ {f_i};
    else π(i) := k, k += 1;
k -= 1;
for d_j ∈ D do
    G := E ∩ F(d_j);
    for f_i ∈ G do
        R(1:k, i) += C(i, j) · τ_E(C(1:m, j));
        l(i) += C(i, j);
for f_i ∈ F do
    if f_i ∈ E then R(1:k, i) := (l(i))⁻¹ · R(1:k, i);
    else R(1:k, i) := e_{π(i)};

Algorithm 1: Serial Replacement Vector Construction

Figure 1: Evaluation of the precision-at-k retrieval performance (mean precision over the hit list rank) on the Reuters Corpus Volume I Version 2 using the topic categorization, comparing sparse TF-IDF (47,236 features), dense rare vector replacement (1,430 features) and rank-reduced vector replacement (700 features).
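Algorithm 1 translates almost line for line into the following Python sketch. It reflects our reading of the pseudo-code, with SciPy's compressed sparse column format standing in for the (unspecified) sparse storage of the original implementation, so treat it as illustrative rather than as the authors' implementation:

```python
import numpy as np
import scipy.sparse as sp

def build_replacement(C, t):
    """Serial replacement vector construction following Algorithm 1."""
    m, n = C.shape
    C = sp.csc_matrix(C)
    N = np.diff(C.tocsr().indptr)           # pass 1: occurrence counts N(i)
    rare = N <= t                           # pass 2: E and the permutation pi
    pi = np.cumsum(~rare) - 1               # pi(i): target row of common feature i
    k = int((~rare).sum())
    R = np.zeros((k, m))
    l = np.zeros(m)
    for j in range(n):                      # pass 3: accumulate per document
        col = C.getcol(j)
        rows, vals = col.indices, col.data  # F(d_j) and the entries C(i, j)
        is_common = ~rare[rows]
        tau = np.zeros(k)                   # tau_E(C(1:m, j)), truncated column
        tau[pi[rows[is_common]]] = vals[is_common]
        for i, c in zip(rows[~is_common], vals[~is_common]):  # G = E intersect F(d_j)
            R[:, i] += c * tau
            l[i] += c
    for i in range(m):                      # pass 4: normalize rare columns,
        if rare[i]:                         # unit vectors for common ones
            if l[i] != 0.0:
                R[:, i] /= l[i]
        else:
            R[pi[i], i] = 1.0
    return R
```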

4 Parallel Construction of Replacement Vectors

The obvious approach to the parallel construction of replacement vectors is a task-parallel strategy. First, we construct a partitioning of the elimination features $E$ and derive a matrix partitioning for the result data in $R_E$. The parallel computation follows naturally from these partitionings. We therefore partition the set of elimination features $E$ into $Q$ pairwise disjoint sets $E[1], \dots, E[Q]$. Correspondingly, we partition the set of features $\mathcal{F}$ into partitions $\mathcal{F}[1], \dots, \mathcal{F}[Q]$, so that each feature partition $\mathcal{F}[q]$ contains both the elimination features $E[q]$ and the common features surrounding them, formally $\forall q \in \{1, \dots, Q\}: E[q] \subseteq \mathcal{F}[q]$. This gives us a column-partitioning of the projection matrix $R_E$ into $R_E[1], \dots, R_E[Q]$ according to $\mathcal{F}[1], \dots, \mathcal{F}[Q]$. Now we can use $Q$ processors to compute these column partitions in parallel. The computation consists of three phases:

1. Determine the set of elimination features $E$ and partition $E$ into $E[1], \dots, E[Q]$ and $\mathcal{F}$ into $\mathcal{F}[1], \dots, \mathcal{F}[Q]$.
2. Compute the local column partitions $R_E[1], \dots, R_E[Q]$ of the replacement matrix $R_E$.
3. Gather the partitions to form the replacement matrix $R_E = [\, R_E[1] \; \cdots \; R_E[Q] \,]$.

Since every block $R_E[q]$ must be computed from the entire document collection, we have to replicate the corpus matrix $C$ to all participating hosts $q \in \{1, \dots, Q\}$. This is why we refer to this approach as a data-replicated, task-parallel strategy.

But if the replacement vectors must be computed for an extremely large document collection, we can parallelize the computation of the replacement vectors for a data-partitioned scenario. Mathematically, we partition our corpus into $P$ pairwise disjoint sets of documents $\mathcal{D}[1], \dots, \mathcal{D}[P]$, where $|\mathcal{D}[p]| = n_p$, and the corpus matrix into blocks $C[p] \in \mathbb{R}^{m \times n_p}$, such that

$C = [\, C[1] \; \cdots \; C[P] \,]$. (12)

To determine the set of elimination features $E$, we compute the occurrence count for all features as a vector $N$ across all document partitions,

$N := \sum_{p=1}^{P} N[p] \in \mathbb{Z}^m$, where $N[p](i) := |\{ j \in \{1, \dots, n_p\} \mid C[p]_{i,j} \neq 0 \}|$, (13)

and compute $E = \{ i \in \{1, \dots, m\} \mid N(i) \leq t \}$. We restructure the formula for the replacement vectors $\rho_E$ according to our partitioning,

$\rho_E(f_i) = \frac{1}{\lambda_E(f_i)} \sum_{p=1}^{P} \rho_E[p](f_i)$, where $\rho_E[p](f_i) := \sum_{j=1}^{n_p} C[p]_{i,j} \, \tau_E(C[p]_{\cdot,j})$, (14)

with the scaling factor $\lambda_E$ rewritten as

$\lambda_E(f_i) := \sum_{p=1}^{P} \lambda_E[p](f_i)$, where $\lambda_E[p](f_i) := \sum_{j=1}^{n_p} C[p]_{i,j}$. (15)

We compute this data-partitioned strategy in three steps:

1. Compute the local occurrence counts $N[1], \dots, N[P]$ and form the global occurrence count $N$ and elimination features $E$.
2. Compute the local contributions $\rho_E[1], \dots, \rho_E[P]$ and $\lambda_E[1], \dots, \lambda_E[P]$.
3. Compute the global replacement vectors $\rho_E$ and scalars $\lambda_E$, and normalize.
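The decomposition in equations (13)–(15) above can be sanity-checked directly: the local counts, numerators and scaling factors sum to the serial quantities. A small single-process NumPy sketch with arbitrary toy data (not the paper's corpus):

```python
import numpy as np

def local_contribution(Cp, i, common):
    """rho_E[p](f_i) (unnormalized) and lambda_E[p](f_i) for one corpus
    block C[p], per equations (14) and (15). Zero entries contribute
    nothing, so we may sum over all columns j instead of only D(f_i)."""
    return Cp[common, :] @ Cp[i, :], Cp[i, :].sum()

rng = np.random.default_rng(1)
C = rng.random((8, 12))                     # toy m x n corpus matrix
blocks = np.array_split(C, 3, axis=1)       # C = [ C[1] C[2] C[3] ], i.e. P = 3
i = 2                                       # treat f_2 as a rare feature
common = np.array([0, 1, 3, 4, 5, 6, 7])    # indices kept by tau_E

parts = [local_contribution(Cp, i, common) for Cp in blocks]
rho = sum(r for r, _ in parts) / sum(l for _, l in parts)   # equation (14)

r_serial, l_serial = local_contribution(C, i, common)
assert np.allclose(rho, r_serial / l_serial)    # partitioned == serial
```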

Hosts: P × Q processors in a 2D mesh with a switched network.
Input: Hosts (1:P, 1) hold the local corpus matrix C[p] ∈ R^{m×n_p}, the local document partition D[p] ⊆ D, the global set of features F and the maximum occurrence count for rare features t ∈ N.
Data: The local occurrence count vector N[p] ∈ N^m, the global occurrence count vector N ∈ N^m, the global set of elimination features E ⊆ F, the local set of elimination features E[q] ⊆ F, the local feature partition F[q] ⊆ F, the set of elimination features present in a document G ⊆ F, the index permutation π : N → N, the feature variable f_i ∈ F, the document variable d_j ∈ D, an integer variable k ∈ N, the document partition index p ∈ {1, ..., P} and the replication index q ∈ {1, ..., Q}.
Output: Host (1, 1) returns the replacement matrix R ∈ R^{k×m}.

parallel for p ∈ {1, ..., P} do
    parallel for q ∈ {1, ..., Q} do
        if q = 1 then
            for d_j ∈ D[p] do
                for f_i ∈ F_p(d_j) do
                    N[p](i) += 1;
            all-to-all reduce (1:P, 1) → (1:P, 1): N := Σ_{p=1}^{P} N[p];
            k := 1;
            for f_i ∈ F do
                if N(i) ≤ t then E := E ∪ {f_i};
                else π(i) := k, k += 1;
            k -= 1;
            broadcast (p, 1) → (p, 1:Q): D[p], C[p], E, π, k;
        partition E into E[1:Q] and F into F[1:Q];
        R[p, q] := 0, l[p, q] := 0;
        for d_j ∈ D[p] do
            G := E[q] ∩ F_p(d_j);
            for f_i ∈ G do
                R[p, q](1:k, i) += C[p](i, j) · τ_E(C[p](1:m, j));
                l[p, q](i) += C[p](i, j);
        reduce (1:P, q) → (1, q): R[q] := Σ_{p=1}^{P} R[p, q], l[q] := Σ_{p=1}^{P} l[p, q];
        if p = 1 then
            for f_i ∈ F[q] do
                if f_i ∈ E then R[q](1:k, i) := (l[q](i))⁻¹ · R[q](1:k, i);
                else R[q](1:k, i) := e_{π(i)};
            gather (1, 1:Q) → (1, 1): R := [R[1] ⋯ R[Q]];
            if q = 1 then return R;

Algorithm 2: Parallel Replacement Vector Construction
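For illustration, the communication skeleton of Algorithm 2 can be sketched with mpi4py. To keep it readable, the sketch covers only the data-parallel dimension (a P × 1 mesh, i.e. Q = 1) and generates random stand-in blocks C[p]; the authors' implementation is in C/C++ with hand-written collectives, so this is a structural sketch only:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
P, p = comm.Get_size(), comm.Get_rank()

m, t = 100, 2                                # features, elimination threshold
rng = np.random.default_rng(p)               # random stand-in for the block C[p]
Cp = rng.random((m, 50)) * (rng.random((m, 50)) < 0.05)

# Phase 1: local occurrence counts N[p], then N := sum_p N[p] on all hosts.
N_local = (Cp != 0).sum(axis=1).astype(np.int64)
N = np.empty_like(N_local)
comm.Allreduce(N_local, N, op=MPI.SUM)
rare = N <= t
common = np.flatnonzero(~rare)
k = len(common)

# Phase 2: local contributions rho_E[p] and lambda_E[p].
R_local = np.zeros((k, m))
l_local = np.zeros(m)
for i in np.flatnonzero(rare):
    R_local[:, i] = Cp[common, :] @ Cp[i, :]    # unnormalized rho_E[p](f_i)
    l_local[i] = Cp[i, :].sum()                 # lambda_E[p](f_i)

# Phase 3: reduce onto host 0, which normalizes and fills the unit vectors.
R = np.zeros_like(R_local) if p == 0 else None
l = np.zeros_like(l_local) if p == 0 else None
comm.Reduce(R_local, R, op=MPI.SUM, root=0)
comm.Reduce(l_local, l, op=MPI.SUM, root=0)
if p == 0:
    for i in np.flatnonzero(rare):
        if l[i] != 0.0:
            R[:, i] /= l[i]
    R[np.arange(k), common] = 1.0               # columns of the common features
    print("replacement matrix:", R.shape)
```

Run with, e.g., mpiexec -n 4 python sketch.py; each rank then plays the role of one host (p, 1).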

Once we have computed the replacement matrix $R$, we can use the same processes to compute the projection into the reduced-dimensional space as a parallel matrix product, using a suitable parallel matrix multiplication algorithm.

To achieve scalability in both the number of documents and the number of features, we can combine these two strategies. Algorithm 2 gives the pseudo-code for a combined task- and data-parallel replacement matrix computation on a 2D mesh of $P \times Q$ processors. To discuss the theoretical computational complexity without communication costs, we split the algorithm into three phases:

1. We determine the number of occurrences for every feature, the set of elimination features $E$ and the index permutation $\pi$ on $P$ processes in a complexity of $O\left(\frac{mn}{P} \log P\right)$.
2. We can use all $P \cdot Q$ processes to compute the local parts of the replacement matrix in $O\left(\frac{kmn}{PQ}\right)$ and accumulate their sum in $O\left(\frac{km}{Q} \log P\right)$.
3. Lastly, we use $Q$ processes to normalize the replacement vectors in $O\left(\frac{km}{Q}\right)$ steps and send all parts to the first host.

The overall complexity can thus be written as

$O\left(\frac{mnQ + kmn + kmP \log P + kmP}{PQ}\right)$. (16)

As we can see, the hybrid task and data parallelization incurs some overhead. The first phase only operates across the document partitioning on $P$ processes, and the third phase can only use the task partitioning with its $Q$ processes. Only the second phase can make effective use of the $P \times Q$ process mesh. It is therefore imperative that we evaluate the actual performance to see whether the data partitioning, the task partitioning and/or the hybrid parallelization deliver a convincing performance improvement in practice.

5 Evaluation

Before we turn to an empirical evaluation, let us briefly consider the theoretical speed-up in terms of algorithmic complexity without communication overheads. Combining the complexity of the parallel algorithm in Equation (16) with the serial complexity given in Equation (11), we get a theoretical speed-up $S$ of

$S(k, n, P, Q) = O\left(\frac{knPQ}{kn + kP + kP \log P + nQ}\right)$, (17)

with a theoretical parallel efficiency $E$ of

$E(k, n, P, Q) = O\left(\frac{kn}{kn + kP + kP \log P + nQ}\right)$. (18)

However, the speed-up and efficiency observed in practice differ from this theoretical result because of the highly sparse nature of our computation. To give an estimate of the parallel performance under real-world conditions, we have implemented our algorithm in C/C++ using MPI as communication middleware. We have conducted execution time measurements on a cluster consisting of four nodes, each equipped with dual Intel Xeon E5520 CPUs clocked at 2.27 GHz with 4 cores and 48 GiB of RAM, connected by an InfiniBand network and running 64-bit CentOS 5.2 with an el5 Linux kernel, OpenMPI and the GNU Compiler Collection. We executed 8 MPI processes on every node, one per physically available core.

In our experiments, the sparsity had two distinct effects. Firstly, the sparsity of the corpus matrix $C$ leads to an unequal distribution of the actual workload. This problem is difficult to alleviate, because the short execution time of our algorithm makes it very difficult to apply more advanced partitionings and load balancing techniques.

Secondly, the sparse data structures holding the data make the communication code less efficient. The data must always be packed and unpacked to and from sequential memory segments of varying length, which made message passing with MPI rather inefficient. Furthermore, it forced us to implement all collective operations by hand, using point-to-point communications to realize the parallel summation of sparse vectors and matrices with a flat-tree topology; a minimal sketch of such a reduction is given below.

To measure the performance on a realistic data set, we have used the training and test set of the Reuters Corpus Volume I. The test set consists of three parts, which have been accumulated to give us three evaluation sets: the training set and the first part of the test set (421,816 documents), the training set and the first two parts of the test set (621,392 documents), and the training set together with the entire test set (804,414 documents). Figure 2 depicts our measurements for the response time, speed-up and efficiency. In Table 1, we present the most efficient mesh configurations for a range of problem sizes and numbers of processes, i.e. the number of document partitions $P$ and task partitions $Q$ that delivered the highest parallel efficiency $E$.

Table 1: Experimentally determined optimal mesh sizes: for various numbers of documents n and numbers of processors P · Q, we give the most efficient number of document partitions P and elimination feature partitions Q, as well as the resulting execution time t (in seconds), speed-up S and efficiency E.

Since $Q > P$ across the entire range of problem sizes and process counts, we have a clear indication that the task parallelization is the preferred strategy. The measured efficiency $E$ indicates that the complications caused by the sparsity of the data are indeed manageable. But we also see that the data partitioning becomes increasingly important as the number of documents grows, because the value of $P$ increases for larger data sets and processor counts.
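The hand-written flat-tree reduction mentioned above can be illustrated as follows; this is our own minimal mpi4py sketch of summing sparse vectors over point-to-point messages, not the authors' C/C++ code:

```python
from mpi4py import MPI

def flat_tree_sparse_sum(local, comm):
    """Sum sparse vectors, stored as {index: value} dicts, on rank 0.
    Every other rank packs its non-zeros and sends them straight to the
    root (a flat tree), which unpacks and merges them pair by pair."""
    if comm.Get_rank() != 0:
        comm.send(sorted(local.items()), dest=0, tag=7)
        return None
    total = dict(local)
    for src in range(1, comm.Get_size()):
        for i, v in comm.recv(source=src, tag=7):
            total[i] = total.get(i, 0.0) + v
    return total

comm = MPI.COMM_WORLD
part = {comm.Get_rank(): 1.0, 0: 0.5}        # toy sparse vector per rank
print(comm.Get_rank(), flat_tree_sparse_sum(part, comm))
```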

Figure 2: Empirical performance evaluation showing execution time (in seconds), speed-up and efficiency over the number of processes for the Reuters Corpus Volume I Version 2, for the evaluation sets of 421,816, 621,392 and 804,414 documents.

6 Summary & Conclusions

In this paper, we have introduced a new method of dimensionality reduction for text retrieval and presented the serial algorithm along with an evaluation of the quality of the search results on a standard benchmark data set. We mathematically derived a parallelization based on a hybrid task and data partitioning and presented a parallel algorithm for private-memory parallel computer systems. Although the sparse nature of our method presents some difficulties, we analyzed the complexity of both the serial and the parallel algorithm and presented the theoretical speed-up without communication costs. To account for these problems and to give a realistic estimate of the performance of our parallelization, we reported the details of an empirical performance evaluation.

We believe that we have delivered a solid proof-of-concept implementation and evaluation of our parallel algorithm for the dimensionality reduction of text indices in the vector space model. The algorithm delivers a very high quality of search results, and the underlying projection matrix can be computed very rapidly. The performance evaluation was conducted for up to 800,000 documents, and there is a good indication that the algorithm scales well beyond this problem size.

References

[1] Berry, M.W., Drmac, Z., Jessup, E.R.: Matrices, Vector Spaces, and Information Retrieval. SIAM Review 41(2) (1999)
[2] Berka, T., Vajteršic, M.: Dimensionality Reduction for Information Retrieval using Vector Replacement of Rare Terms. In: Proc. TM (2011)
[3] Jolliffe, I.T.: Principal Component Analysis. 2nd edn. Springer (2002)
[4] Paatero, P., Tapper, U.: Positive Matrix Factorization: A Non-negative Factor Model with Optimal Utilization of Error Estimates of Data Values. Environmetrics 5(2) (1994)
[5] Deerwester, S.C., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A.: Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science 41(6) (1990)
[6] Kobayashi, M., Aono, M., Takeuchi, H., Samukawa, H.: Matrix Computations for Information Retrieval and Major and Outlier Cluster Detection. Journal of Computational and Applied Mathematics 149(1) (2002)
[7] Hofmann, T.: Probabilistic Latent Semantic Indexing. In: Proc. SIGIR. ACM, New York, NY, USA (1999)
[8] Janecek, A.G., Gansterer, W.N.: Utilizing Nonnegative Matrix Factorization for Classification Problems. In: Berry, M.W., Kogan, J. (eds.): Survey of Text Mining III: Application and Theory. Wiley (2010)
[9] Langville, A.N., Berry, M.W.: Nonnegative Matrix Factorization for Document Classification. In: Next Generation of Data Mining. Chapman & Hall/CRC (2008)
[10] Dhillon, I.S., Modha, D.S.: Concept Decompositions for Large Sparse Text Data using Clustering. Machine Learning 42(1/2) (2001)
[11] Berry, M.W., Do, T., Ogielski, A.: Massively-Parallel Implementations of Lanczos Algorithms for Computing the SVD of Large Sparse Matrices. In: Proc. PP. SIAM (1993)
[12] Berry, M.W., Mezher, D., Philippe, B., Sameh, A.: Parallel Algorithms for the Singular Value Decomposition. In: Handbook on Parallel Computing and Statistics. Statistics: A Series of Textbooks and Monographs, vol. 184. Chapman & Hall, Boca Raton, USA (2006)
[13] Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research 5 (2004)
