Multilevel Linear Dimensionality Reduction using Hypergraphs for Data Analysis

Haw-ren Fang
Department of Computer Science and Engineering, University of Minnesota; Minneapolis, MN 55455
hrfang@cs.umn.edu

Yousef Saad
Department of Computer Science and Engineering, University of Minnesota; Minneapolis, MN 55455
saad@cs.umn.edu

ABSTRACT
Classical algorithms used for dimension reduction can be time-consuming when the data set is large. In this paper we consider a method based on hypergraph coarsening to find a smaller set of data representing a given data set, prior to performing the projection into the low-dimensional space. The cost of the dimensionality reduction process is reduced because of this hypergraph-based pre-processing step. In a hypergraph model, each data item is represented as a vertex and related data items are connected by a hyperedge, which is simply a subset of vertices. To coarsen the data, we use a method that merges pairs of vertices. In the multilevel framework, the coarsening is recursively repeated until a coarsened data set of a certain size is reached. Then we project the coarsened data into a lower-dimensional space, using a known linear dimensionality reduction method. The same linear mapping from the coarsened data is then applied to the original data set for projecting data into the low-dimensional space. As an application of this idea, we consider text mining. Experimental results indicate that the multilevel hypergraph technique proposed in this paper offers a very appealing cost-to-quality ratio.

Categories and Subject Descriptors
F.2.1 [Numerical Algorithms and Problems]: Computations on matrices; G.2.2 [Graph Theory]: Hypergraphs; H.3.3 [Information Search and Retrieval]: Retrieval models, Relevance feedback

General Terms
Algorithms, performance

Keywords
Multilevel hypergraph coarsening, dimensionality reduction, latent semantic indexing

This work was supported by NSF grants DMS-0510131 and DMS-0528492 and by the Minnesota Supercomputing Institute.

1. INTRODUCTION
Dimensionality reduction techniques appear in many fields, including data mining, machine learning, and computer vision. The goal of dimensionality reduction is to map high-dimensional samples into a lower-dimensional space so that certain properties are preserved. When the number of data samples is large, existing methods, such as those based on Principal Component Analysis (PCA), can be prohibitively expensive. A simple idea for reducing the cost of dimension reduction techniques is to select a smaller data set that is a good representative of the whole sample. Assume for a moment this can be done. This means we would, for example, replace the original data set X = [x_1, ..., x_n] ∈ R^(m×n) by a subset X̂ ∈ R^(m×k) of X, which without loss of generality we can assume to consist of the first k items of X, so X̂ = [x_1, ..., x_k]. Then, based on X̂, we would find a projector from m-dimensional space to d-dimensional space, with d ≪ m. The same projector can now be used to project any item in R^m to d-dimensional space. The only remaining question is how to find a good representative subset of X. An appealing method, when some adjacency graph of the data is available, is to perform a succession of coarsening steps [9, 10, 17]. When the graph is not available, a k-nearest-neighbor graph can be built, but this process is expensive. However, there are instances, such as in text mining, when the original data itself is sparse. For such cases, there is a hypergraph that is canonically associated with the data, and a coarsening step can be performed using this hypergraph. Hypergraphs are generalizations of graphs that allow edges, now called hyperedges, to connect more than two vertices.
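To make the idea concrete, here is a minimal sketch (our illustration, not code from the paper) of the basic scheme, assuming PCA is used as the projection method: a d-dimensional projector is computed from a representative subset X̂ and then applied to every column of X.

    import numpy as np

    def projector_from_subset(X_hat, d):
        """Compute a d-dimensional PCA projector from the subset X_hat (m x k)."""
        mu = X_hat.mean(axis=1, keepdims=True)               # center the subset
        U, _, _ = np.linalg.svd(X_hat - mu, full_matrices=False)
        return mu, U[:, :d]                                   # top-d principal directions

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 2000))                      # full data set, m = 100, n = 2000
    X_hat = X[:, :200]                                        # a representative subset (k = 200)
    mu, P = projector_from_subset(X_hat, d=10)
    Y = P.T @ (X - mu)                                        # every item projected to d = 10 dimensions
    print(Y.shape)                                            # (10, 2000)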
The hypergraph model, combined with a multilevel approach using coarsening among other tools, has had a remarkable success for partitioning meshes and, more generally, sparse data in scientific computing; see, e.g., [2, 3, 5, 8, 11, 16]. This technique has applications in many fields, including parallel sparse-matrix techniques (e.g., [2, 5, 16]) and VLSI design (e.g., [8, 11]). Motivated by this success, we explore dimensionality reduction techniques for data analysis that exploit the multilevel hypergraph framework.

Formally, a hypergraph H = (V, E) is defined as a set of vertices V and a set of hyperedges E, where each hyperedge is a subset of the vertex set V. The size of a hyperedge is the cardinality of this subset. (A hyperedge is also called a net.) A weighted hypergraph has non-negative numeric weights associated with each vertex, each hyperedge, or both. A hypergraph can be represented by a boolean matrix where each column represents a vertex, and each row represents a hyperedge which connects all vertices with a one in that row. When a data matrix is sparse, as is the case for a term-document matrix, the nonzero pattern defines a hypergraph in a canonical way. In this case hyperedges correspond to the rows and vertices correspond to the columns of the data matrix. In the particular example of a term-document matrix, a hyperedge represents a relationship between some documents. Thus, the hyperedge which represents row i is simply the subset of all documents containing term i in the data set under consideration. We need a coarsening process which will preserve these relationships as well as possible. The important underlying assumption here is that the information is very redundant and this redundancy should be exploited. In the simplest case, if two documents have the exact same subset of terms, then one is enough to represent both. If a document x has a set of terms which includes the union of the terms of two documents, then x will be enough to represent all 3 documents. As can be guessed from these examples, there is some form of dimension reduction taking place in the document space, one that is basic and considers only structure. We will recursively compute a coarsened version of the original hypergraph, i.e., one with fewer vertices, of the data set, using a method called maximal-weight matching to merge pairs of vertices (see, e.g., [5]). Then, we can apply any projective method of dimensionality reduction to the coarsened data at the lowest level. The resulting projector can be used to project all the original data or any new test data, such as a new query to be processed.

One might argue that a scheme of this type does not necessarily achieve the most important goal of dimensionality reduction, which is to remove noise and redundancies from the data. Recall that LSI determines a basis which represents the main features ("latent semantics") of a set of text documents and resolves common issues related to word usage, such as synonymy and polysemy. As was discussed above, hypergraph coarsening should achieve this goal partly, though it works on documents rather than terms, which are processed by the more powerful technique of LSI on the resulting subset of documents. The experiments confirm this. In some cases the multilevel scheme even gives slightly better results than LSI.

The rest of this paper is organized as follows. Section 2 gives some background on the hypergraph model. Section 3 presents the multilevel dimensionality reduction method based on hypergraph coarsening. Applications to text mining (information retrieval) are illustrated in Section 4. A conclusion is given in Section 5.

2. THE HYPERGRAPH MODEL
A hypergraph H = (V, E) consists of a set of vertices V and a set of hyperedges (nets) E. Each hyperedge is a non-empty subset of V; the size (cardinality) of this subset is called the degree of the hyperedge. Likewise, the degree of a vertex is the number of hyperedges which include it. Two vertices are called neighbors if there is a hyperedge connecting them, i.e., if they belong to at least one common hyperedge. Hypergraphs extend the classical notion of graphs: in a standard graph an edge connects two vertices, i.e., it is a set of two vertices, whereas a hyperedge may connect an arbitrary subset of vertices.

Figure 1: A sample hypergraph.
A hypergraph H = (V, E) can be canonically represented by a boolean matrix A, where the vertices in V and the hyperedges (nets) in E are represented by the columns and rows of A, respectively. Each hyperedge, a row of A, connects the vertices whose corresponding entries in that row are non-zero. An example is illustrated in Figure 1, where V = {1, ..., 9} and E = {a, ..., e} with a = {1, 2, 3, 4}, b = {3, 5, 6, 7}, c = {4, 7, 8, 9}, d = {6, 7, 8}, and e = {2, 9}. The boolean matrix representation of this hypergraph is

              1 2 3 4 5 6 7 8 9
          a [ 1 1 1 1           ]
          b [     1   1 1 1     ]
      A = c [       1     1 1 1 ]
          d [           1 1 1   ]
          e [   1             1 ]

For data sets represented by sparse matrices, such as, for example, term-document data sets, the adjacency matrix is precisely the matrix representation of the hypergraph representing the relation "term i is contained in document j". Thus, hyperedge i (represented by row i) consists of all documents (columns/vertices) containing term i. For applications where the data matrix is dense, such as a matrix of vectorized face images, techniques such as wavelet decomposition can be applied to sparsify the data before using multilevel coarsening. This approach is currently under investigation.
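As a quick illustration (a sketch of ours, not code from the paper), the hypergraph of Figure 1 can be stored as the sparse boolean matrix A above; inner products of columns then count the hyperedges shared by two vertices, which is exactly the pair weight used for matching in Section 2.1.

    import numpy as np
    from scipy.sparse import csc_matrix

    # Rows = hyperedges a..e, columns = vertices 1..9 (0-based indices below).
    hyperedges = {
        'a': [1, 2, 3, 4],
        'b': [3, 5, 6, 7],
        'c': [4, 7, 8, 9],
        'd': [6, 7, 8],
        'e': [2, 9],
    }
    rows, cols = [], []
    for r, verts in enumerate(hyperedges.values()):
        for v in verts:
            rows.append(r)
            cols.append(v - 1)
    A = csc_matrix((np.ones(len(rows)), (rows, cols)), shape=(5, 9))

    W = (A.T @ A).toarray()   # W[i, j] = number of hyperedges shared by vertices i+1 and j+1
    print(W[5, 6])            # pair (6, 7): 2 common hyperedges (b and d)
    print(W[4, 8])            # pair (5, 9): 0 common hyperedges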

2.1 Hypergraph coarsening
Consider a data set of n entries in R^m represented by a matrix X ∈ R^(m×n), and a hypergraph H = (V, E) with vertex set V corresponding to the columns of X. The hypergraph can be represented by a boolean matrix A, where the columns of A represent the vertices in V, and the rows of A represent the hyperedges in E. Coarsening a hypergraph H = (V, E) means finding a coarse approximation Ĥ = (V̂, Ê) to H with |V̂| < |V|, which is a reduced representation of the original hypergraph H, in that it retains as much of the structure of the original hypergraph as possible. By recursively coarsening we obtain a succession of smaller hypergraphs which approximate the original graph.

Several methods exist for coarsening hypergraphs; see [2, 11] for a discussion. The method used in this paper is based on merging pairs of vertices. In order to select which pairs of vertices to merge in a hypergraph, we consider the maximum-weight matching problem (e.g., [2, 5]). Pairing two vertices is termed matching. The edge weight between two vertices is the number of hyperedges connecting them. For example, in Figure 1, the weight of the pair (6, 7) is 2, because vertices 6 and 7 both belong to the hyperedges b and d and to no other common hyperedge. On the other hand, the pair (5, 9) has a weight of zero. In the hyperedge-vertex matrix representation (a boolean matrix), the weight of the vertex pair (i, j) is the inner product of columns i and j, as can be readily verified with the examples just given. This inner-product weight is adopted as a similarity metric in two software packages for hypergraph partitioning, hMETIS [8] and Mondriaan [16].

The maximum-weight matching problem consists of finding a matching that maximizes the sum of the edge weights of the vertex pairs. In practice, it is not necessary to find the optimal set of matching pairs (see, e.g., [5]), as sub-optimal greedy approaches yield satisfactory results. A greedy algorithm for maximum-weight matching will be used in the experiments. The vertices can be visited in a random order, or in the order in which the data items are listed. For each unmatched vertex v, all the unmatched neighbor vertices u are considered, and the inner product between v and each u is computed. The vertex with the highest non-zero inner product is matched with v, and the procedure is repeated until all vertices have been matched. The computed matching is a coarse representation of the hypergraph, with the coarse hyperedges inherited from the fine graph. More precisely, the coarse vertex set consists of the matched fine vertex pairs. A fine vertex pair is in a coarse hyperedge if either of the two vertices is in the corresponding fine hyperedge. It is convenient to present the hypergraph coarsening procedure in matrix form. The pseudo-code is given in Algorithm 1.

Algorithm 1: Hypergraph coarsening by maximum-weight matching
  {Coarsen a hypergraph represented by the sparsity pattern of a matrix X with n columns.}
  {The n vertices are indexed by 1, ..., n.}
  S ← {1, ..., n}                              {set of unmatched vertices}
  p ← 0                                        {number of vertex pairs}
  repeat
    p ← p + 1
    Randomly pick j ∈ S;  S ← S \ {j}
    Set ip[k] ← 0 for k = 1, ..., n            {inner products}
    for all i with a_ij ≠ 0 do
      for all k with a_ik ≠ 0 do
        ip[k] ← ip[k] + 1                      (*)
      end for
    end for
    i ← argmax{ip[k] : k ∈ S}
    if ip[i] = 0 then
      {vertex j is isolated from the unmatched vertices}
      X̂(:, p) ← X(:, j)                        (**)
    else
      {vertex i matches vertex j as its nearest unmatched neighbor}
      X̂(:, p) ← X(:, i) + X(:, j)
      S ← S \ {i}
    end if
  until S = ∅
  {The sparsity pattern of X̂ corresponds to the coarsened graph of X.}

Three remarks on Algorithm 1 must be made. First, if X is a boolean matrix, then the loop (*) results in ip[k] being the inner product of column j and column k of X. Second, it is possible that vertex j does not have any unmatched neighbor, in which case the algorithm branches to (**). However, this is rare in practice, since a hyperedge can connect multiple vertices and vertex j almost always ends up finding an unmatched neighbor, unless X is too sparse. Third, the columns of X̂ are the sums of the matched pairs. This property is particularly good for applications with sparse data matrices, which are directly associated with hypergraphs; in such cases we may simply input to Algorithm 1 the sparse data matrix itself. By recursively coarsening the graph we obtain a sequence of sparse matrices X_1, X_2, ..., X_r, where X_k corresponds to the coarse graph H_k of level k for k = 1, ..., r, and X_r represents the lowest-level graph H_r.

3. MULTILEVEL DIMENSIONALITY REDUCTION
The objective of dimensionality reduction is to map data in a high-dimensional space into a low-dimensional one such that certain properties are preserved. More precisely, given a matrix X ∈ R^(m×n) whose columns correspond to the vertices V and whose rows correspond to the hyperedges E of a hypergraph H = (V, E), produce Y ∈ R^(d×n) (d < m) such that Y preserves certain features of X. In the multilevel framework of hypergraph coarsening we apply a linear dimensionality reduction method to the coarsened data matrix X_r ∈ R^(m×n_r) at the lowest level (the r-th level) and obtain Y_r ∈ R^(d×n_r) (d < m), where n_r is the number of data items at coarse level r (n_r < n). The linear mapping, denoted by P, is then applied to the original data X ∈ R^(m×n) to obtain a reduced representation Y = PX ∈ R^(d×n) (d < m) of the original data set. The procedure is illustrated in Figure 2. The same linear mapping can also be applied to any out-of-sample test data. For example, in LSI, the same projector is applied to a query (a "pseudo-document") and to the document set before a comparison is made.
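The following sketch (an illustrative Python/SciPy rendering under our own simplifications, not the authors' Matlab code) implements the greedy matching of Algorithm 1 on a sparse matrix and the multilevel reduction of Section 3, using a truncated SVD as one possible choice of linear mapping.

    import numpy as np
    from scipy.sparse import csc_matrix, hstack
    from scipy.sparse.linalg import svds

    def coarsen(X, rng):
        """One coarsening level (Algorithm 1): greedily match vertex pairs by their
        inner-product weight and sum the matched columns of the sparse matrix X."""
        n = X.shape[1]
        unmatched = set(range(n))
        coarse_cols = []
        while unmatched:
            j = rng.choice(list(unmatched))
            unmatched.discard(int(j))
            ip = (X.T @ X[:, j]).toarray().ravel()   # inner products with column j
            mask = np.zeros(n, dtype=bool)
            mask[list(unmatched)] = True
            ip[~mask] = 0                            # only unmatched vertices are candidates
            i = int(np.argmax(ip))
            if ip[i] == 0:                           # (**) j has no unmatched neighbor
                coarse_cols.append(X[:, j])
            else:                                    # merge j with its heaviest unmatched neighbor
                coarse_cols.append(X[:, i] + X[:, j])
                unmatched.discard(i)
        return hstack(coarse_cols).tocsc()

    def multilevel_reduce(X, levels, d, rng):
        """Coarsen `levels` times, compute a rank-d truncated SVD of the coarsest matrix,
        and apply the resulting linear mapping P = Sigma_d^{-1} U_d^T to all of X."""
        Xr = X
        for _ in range(levels):
            Xr = coarsen(Xr, rng)
        U, s, _ = svds(Xr, k=d)                      # SVD of the small coarsened matrix
        P = np.diag(1.0 / s) @ U.T
        return P @ X.toarray(), P                    # reduced representation of all items

    rng = np.random.default_rng(0)
    X = csc_matrix((rng.random((50, 400)) < 0.05).astype(float))   # a random sparse data set
    Y, P = multilevel_reduce(X, levels=3, d=10, rng=rng)
    print(Y.shape)                                   # (10, 400)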

Figure 2: A sketch of the multilevel reduction: the data matrix X (with n columns) is coarsened level by level down to X_r (with n_r columns); the projection computed at the last level maps X to Y and X_r to Y_r.

The procedure just described provides a hypergraph which, it is hoped, is a good representation of the original graph. However, it does not tell us much about how the resulting reduced set X_r is related to the original data matrix X. In a coarsened hypergraph H_k, each vertex represents a subset of vertices from the original hypergraph H_1. A data item (a column in X_k) corresponding to a vertex in a coarsened hypergraph is the sum of all original data items in X (multiple columns of X) corresponding to the subset of vertices with which it is associated.

4. APPLICATION TO TEXT MINING
Latent Semantic Indexing (LSI) [4] is a well-established framework for conceptual information retrieval [1, 6]. In this section we compare the retrieval performance of LSI and multilevel-LSI; the latter is based on the algorithm just described in this paper. In the vector space model, a collection of n documents indexed by m terms is represented by a sparse term-document matrix X ∈ R^(m×n). The rows and columns of X are called term vectors and document vectors, respectively. The (i, j) entry of X, denoted by x_ij, is the number of occurrences of term i in document j, called the term frequency. This matrix is canonically associated with a hypergraph H = (V, E), where the vertices V correspond to the document vectors and the hyperedges E correspond to the term vectors.

A term-document matrix X ∈ R^(m×n) is usually scaled before its use. In the experiments we adopted TF-IDF (term frequency-inverse document frequency) scaling [15]. The inverse document frequency is defined by

    z_i = log(n / |{j : x_ij > 0}|),                                    (1)

where |{j : x_ij > 0}| is the number of documents in which term i occurs. A TF-IDF entry is the product of the TF entry and the IDF entry, x_ij z_i. The TF-IDF scaled matrix X̃ is obtained by normalizing the columns to be unit vectors. In other words, the (i, j) entry of X̃ is x̃_ij = x_ij z_i / sqrt(Σ_{k=1}^m (x_kj z_k)^2) for i = 1, ..., m and j = 1, ..., n.

Given a query q ∈ R^m (an array of term frequencies), query matching is the process of finding the relevant documents. A query is also called a pseudo-document vector. Before the matching process, a query is also TF-IDF scaled; the scaled vector is denoted by q̃ ∈ R^m. Note, however, that the inverse document frequencies (IDF) here, defined in (1), come from the term-document matrix X. The vector space model measures the similarity of two vectors by the cosine distance (i.e., the cosine of the acute angle between them).

The full vector space model potentially contains noise and redundancies that affect the retrieval performance. Instead, LSI approximates the given term-document matrix by its truncated SVD, X̃ = [x̃_1, x̃_2, ..., x̃_n] ≈ U_d Σ_d V_d^T, where d is a certain desired rank. With reduced noise, a lower-dimensional approximation of X̃ helps discover the underlying latent semantic structure of the data. The columns of V_d^T = [x̂_1, x̂_2, ..., x̂_n] ∈ R^(d×n) are used as reduced representations of the document vectors x̃_1, x̃_2, ..., x̃_n ∈ R^m. Likewise, the rows of U_d ∈ R^(m×d) are the reduced term vectors. Given a query q ∈ R^m, it is transformed into a reduced representation q̂ = Σ_d^{-1} U_d^T q̃ ∈ R^d in d-dimensional space. Document x_i is considered relevant to the query q if the cosine distance between their reduced representations, ⟨q̂, x̂_i⟩ / (‖q̂‖ ‖x̂_i‖), is larger than some pre-defined threshold.

When a relevance vector (a boolean string of size n) is provided, the precision and recall are defined by

    Precision = D_R / D_T,    Recall = D_R / N_R,                        (2)

where D_R, D_T, and N_R are the number of relevant documents retrieved by the process, the total number of documents retrieved, and the total number of relevant documents in the collection, respectively.
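Below is a small sketch (our illustration, not the authors' code) of TF-IDF scaling as in (1) and of SVD-based LSI query matching as just described.

    import numpy as np

    def tfidf_scale(X, idf=None):
        """TF-IDF scale a term-document matrix X (m x n) and normalize columns to unit length.
        If idf is given it is reused (e.g., IDF taken from another matrix), as in (1)."""
        if idf is None:
            idf = np.log(X.shape[1] / np.maximum((X > 0).sum(axis=1), 1))
        Xs = X * idf[:, None]
        norms = np.linalg.norm(Xs, axis=0)
        return Xs / np.maximum(norms, 1e-12), idf

    def lsi_fit(X_scaled, d):
        """Truncated SVD of the scaled term-document matrix: returns U_d, the singular values,
        and the reduced document vectors (columns of V_d^T)."""
        U, s, Vt = np.linalg.svd(X_scaled, full_matrices=False)
        return U[:, :d], s[:d], Vt[:d, :]

    def lsi_query_scores(q, idf, U_d, s_d, docs_reduced):
        """Reduce a scaled query with Sigma_d^{-1} U_d^T and return cosine similarities to all documents."""
        q_scaled = q * idf
        q_scaled = q_scaled / max(np.linalg.norm(q_scaled), 1e-12)
        q_hat = (U_d.T @ q_scaled) / s_d
        num = docs_reduced.T @ q_hat
        den = np.linalg.norm(docs_reduced, axis=0) * max(np.linalg.norm(q_hat), 1e-12)
        return num / np.maximum(den, 1e-12)

    # toy usage: 6 terms, 4 documents, rank-2 LSI
    X = np.array([[2, 0, 1, 0],
                  [1, 1, 0, 0],
                  [0, 3, 0, 1],
                  [0, 0, 2, 2],
                  [1, 0, 0, 1],
                  [0, 1, 1, 0]], dtype=float)
    X_scaled, idf = tfidf_scale(X)
    U_d, s_d, docs_reduced = lsi_fit(X_scaled, d=2)
    q = np.array([1, 0, 0, 1, 0, 0], dtype=float)    # query term frequencies
    print(lsi_query_scores(q, idf, U_d, s_d, docs_reduced))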
When the term-document matrix X is large, the computation of the SVD factorization can be expensive. A smaller set of document vectors can be obtained by the multilevel techniques described in Section 3; we denote it by X_r ∈ R^(m×n_r), which represents the original X ∈ R^(m×n) (n_r < n). The TF-IDF scaling is then applied to X_r, resulting in X̃_r. As in standard LSI, we compute the truncated SVD X̃_r ≈ U_d Σ_d V_d^T, where d is the rank. We apply the same mapping to X and obtain a reduced representation Σ_d^{-1} U_d^T X̃ = [x̂_1, x̂_2, ..., x̂_n] ∈ R^(d×n). Note that we have applied TF-IDF scaling to the term-document matrix X, but here the inverse document frequencies (IDF), defined in (1), use the coarsened matrix X_r. Each query q ∈ R^m is also scaled in the same way to give q̃, and then transformed to q̂ = Σ_d^{-1} U_d^T q̃ ∈ R^d. The similarity of q and x_i is measured by the cosine distance between q̂ and x̂_i for i = 1, ..., n. We call the resulting scheme multilevel-LSI.
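Putting the pieces together, here is a sketch of the multilevel-LSI mapping just described (again our illustration, not the authors' implementation; it reuses the hypothetical coarsen and tfidf_scale helpers from the earlier sketches). Note that the IDF weights and the SVD both come from the coarsened matrix X_r, while the resulting projector is applied to the full document set and to queries.

    import numpy as np
    from scipy.sparse import csc_matrix

    def multilevel_lsi(X, levels, d, rng):
        """Coarsen the term-document matrix, fit a rank-d truncated SVD on the coarsened,
        TF-IDF-scaled matrix, then map all n documents and any query with Sigma_d^{-1} U_d^T."""
        Xr = csc_matrix(X)
        for _ in range(levels - 1):                  # level #1 is the original data
            Xr = coarsen(Xr, rng)                    # coarsen() from the Algorithm 1 sketch
        Xr_scaled, idf_r = tfidf_scale(Xr.toarray()) # IDF taken from the coarsened matrix X_r
        U, s, _ = np.linalg.svd(Xr_scaled, full_matrices=False)
        U_d, s_d = U[:, :d], s[:d]
        X_scaled, _ = tfidf_scale(np.asarray(X, dtype=float), idf=idf_r)  # full matrix, coarse IDF
        docs_reduced = (U_d.T @ X_scaled) / s_d[:, None]   # reduced representations of all documents

        def reduce_query(q):
            q_scaled = q * idf_r
            q_scaled = q_scaled / max(np.linalg.norm(q_scaled), 1e-12)
            return (U_d.T @ q_scaled) / s_d          # q_hat = Sigma_d^{-1} U_d^T q_tilde

        return docs_reduced, reduce_query

Relevant documents for a query are then found by cosine similarity between reduce_query(q) and the columns of docs_reduced, exactly as in standard LSI.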

The precision and recall defined in (2) depend on the tolerance used for the similarity scores in the cosine distance measure, and the performance evaluation may differ when the tolerance differs. Therefore, we use the average precision [7] to assess the retrieval performance. Sorting the similarity scores of query q against the documents x_1, ..., x_n, we consider, for i = 1, ..., n, the first i documents with the highest scores and obtain the precision and recall

    P_i = R_i / i,    r_i = R_i / R_n,

where R_i is the number of relevant documents among the first i documents. The average precision is defined as the mean of the interpolated precision,

    P̄ = (1/n) Σ_{i=0}^{n-1} P̂(i / (n-1)),    where  P̂(x) = max{P_i : x ≤ r_i}.

Three public data sets were used in our experiments: Medline, Cran and NPL (available from ftp://ftp.cs.cornell.edu/pub/smart). The characteristics of these sets, such as the numbers of documents, terms, and queries, are listed in Table 1.

Table 1: Characteristics of the test sets
    Data set          Medline   Cran    NPL
    # documents       1033      1398    11429
    # terms           7014      3763    7491
    sparsity (%)      0.74%     1.41%   0.27%
    # queries         30        225     93
    avg # rel/query   23.2      8.2     22.4

The experiments were performed in sequential mode on a PC equipped with two Intel(R) Core(TM)2 processors @ 2.40 GHz, using our Matlab implementation. In all tests we coarsened the data down to four levels. Compared with LSI, multilevel-LSI requires additional work to perform the hypergraph coarsening. However, it saves time when computing the truncated SVD of the coarsened (smaller) term-document matrix. The CPU time used for coarsening the Medline, Cran, and NPL data sets is shown in the second column of Tables 2, 3, and 4, respectively. The savings on the SVD computation are much more significant.

Note that the average precision depends on the dimension d used. We call the dimension that maximizes the average precision optimal. The experimental results using the Medline data set are discussed first. Figure 3 is the resulting plot of average precisions using various dimensions for the SVD (ranks of the truncated SVD). The number of documents, the optimal dimensions, and the average precision at all levels are displayed in Table 2. Using the optimal dimensions we obtain the precision-recall plot in Figure 4. Figure 5 shows the savings in CPU time gained by multilevel-LSI for computing the truncated SVD.

Table 2: Statistics of the Medline data set
    Level   coarsening time   # doc   optimal dim   optimal avg precision
    #1      N/A               1033    30            71.6%
    #2      32                517     28            72.7%
    #3      13                259     30            71.5%
    #4      7                 130     27            67.5%

Figure 3: Average precision using the Medline set.
Figure 4: Precision-recall plot using the Medline set.
Figure 5: CPU time for the truncated SVD using the Medline data set.

Figure 6 is a plot showing the average precisions using various dimensions for the SVD (ranks of the truncated SVD) for the Cran data set. Table 3 lists the number of documents, optimal dimensions, and average precision at all levels. Using the optimal dimensions we obtain the precision-recall plot in Figure 7. The savings in CPU time gained by multilevel-LSI for computing the truncated SVD are shown in Figure 8.

Table 3: Statistics of the Cran data set
    Level   coarsening time   # doc   optimal dim   optimal avg precision
    #1      N/A               1398    95            39.8%
    #2      95                699     110           40.6%
    #3      34                350     92            40.6%
    #4      18                175     76            37.5%

Figure 6: Average precision using the Cran set.
Figure 7: Precision-recall plot using the Cran set.
Figure 8: CPU time for the truncated SVD using the Cran data set.

For the NPL data set, rather than V_d^T ∈ R^(d×n), we use the columns of Σ_d V_d^T ∈ R^(d×n) as the reduced document vectors. Recall that the truncated SVD of the term-document matrix is denoted by U_d Σ_d V_d^T ≈ X̃ ∈ R^(m×n), where d is the rank. Therefore, the reduced representation of a query q ∈ R^m is q̂ = U_d^T q̃ ∈ R^d. This adaptation significantly improves the results of both LSI and multilevel-LSI on the NPL data set. Figure 9 is the resulting plot of average precisions using various dimensions for the SVD (ranks of the truncated SVD) for the NPL set. In addition to the savings in computation time, multilevel-LSI usually outperformed LSI when 800 or fewer dimensions were used. The number of documents, optimal dimensions, and average precision at all levels are displayed in Table 4. Using the optimal dimensions we obtain the precision-recall plot in Figure 10. The savings in CPU time gained by multilevel-LSI for computing the truncated SVD are shown in Figure 11.

Table 4: Statistics of the NPL data set
    Level   coarsening time   # doc   optimal dim   optimal avg precision
    #1      N/A               11429   736           23.5%
    #2      368               5717    592           23.8%
    #3      219               2861    516           23.9%
    #4      15                1434    533           23.3%

Figure 9: Average precision using the NPL set.
Figure 10: Precision-recall plot using the NPL set.
Figure 11: CPU time for the truncated SVD using the NPL data set.
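The average precisions in Tables 2-4 follow the interpolated definition given before Table 1. A small sketch of that computation for a single query (our illustration, assuming a vector of similarity scores and a boolean relevance vector):

    import numpy as np

    def interpolated_average_precision(scores, relevant):
        """Average of the interpolated precision P_hat over n evenly spaced recall levels,
        with P_i = R_i / i and recall r_i = R_i / R_n (see the definition in the text)."""
        order = np.argsort(-scores)                  # rank documents by similarity score
        rel_sorted = np.asarray(relevant, dtype=float)[order]
        R_i = np.cumsum(rel_sorted)                  # relevant documents among the first i
        i = np.arange(1, len(scores) + 1)
        P_i = R_i / i
        r_i = R_i / max(R_i[-1], 1.0)
        n = len(scores)
        levels = np.arange(n) / (n - 1)              # recall levels i/(n-1), i = 0..n-1
        P_hat = np.array([P_i[r_i >= x].max() if np.any(r_i >= x) else 0.0 for x in levels])
        return P_hat.mean()

    scores = np.array([0.9, 0.2, 0.75, 0.4, 0.1])
    relevant = np.array([1, 0, 1, 0, 1])
    print(interpolated_average_precision(scores, relevant))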

The results are summarized as follows. Using the Cran data set, multilevel-LSI achieved performance similar to that of LSI, but reduced the SVD computation. For the Medline data set, multilevel-LSI slightly outperformed LSI, in addition to the CPU time savings on the SVD. For the NPL data set, the recall-precision plots of LSI and multilevel-LSI are comparable when the optimal dimensions are used; however, multilevel-LSI performed better than LSI when fewer dimensions were used. This issue is important for large data sets, since the SVD computation can be very expensive.

We also compared the hypergraph-based multilevel techniques presented in this paper with the knn-graph-based multilevel schemes of [14]. Using the Medline and Cran data sets and the same number of levels, the hypergraph-based method slightly outperformed the knn-graph-based one. In addition, a hypergraph is canonically associated with a sparse term-document matrix, whereas the knn-graph-based method requires additional computation to construct a knn graph. This issue is important for large data sets, since the knn graph construction can be prohibitively expensive. We conclude that the hypergraph-based multilevel techniques are more adequate than the knn-graph-based multilevel schemes for text information retrieval.

Relevance feedback is a common technique in text information retrieval. The assumption is that we know in advance that some document vectors are related to a query. The query is then augmented by the sum of these related documents, followed by the standard text mining procedure. More precisely, a query q is replaced by q + Xb, where b is the boolean column vector indicating which documents are known a priori to be related to query q. We tested relevance feedback in the multilevel framework. The experiments used the vector b defined above as the relevance vector, assuming that the exact information is available. The resulting plots for the Medline, Cran and NPL data sets are given in Figures 12, 13, and 14, respectively. They show that with relevance feedback, multilevel-LSI still worked nicely, but not as well as LSI.

Figure 12: Precision-recall plot with relevance feedback using the Medline data set.
Figure 13: Precision-recall plot with relevance feedback using the Cran data set.
Figure 14: Precision-recall plot with relevance feedback using the NPL data set.

Note that the LSI and multilevel-LSI used in our experiments are SVD-based. However, the multilevel hypergraph framework we have proposed does not rely on the SVD computation. Indeed, we can incorporate other matrix approximation methods, such as the semi-discrete decomposition (SDD) [12] and non-negative matrix factorization (NMF) [13], into the multilevel framework, resulting in multilevel SDD-based and NMF-based LSI for text information retrieval.

5. CONCLUSION
A multilevel framework was presented to perform dimensionality reduction in situations where the data sets are sparse. In applications with sparse matrix data sets, the hypergraph model can be applied directly, since the pattern of non-zero entries of the sparse matrix yields a hypergraph. To coarsen the data, we used a method, called maximal-weight matching, which merges pairs of vertices. Dimensionality reduction is performed on the data at the lowest (coarsest) level with a linear projection method, and the resulting projector is then applied to the original data. The method is illustrated with applications in text mining, generally showing a good quality of results at reduced cost.

Acknowledgments
We would like to thank Sofia Sakellaridi for a literature survey on multilevel hypergraph partitioning, and Jie Chen, who gathered and processed the data sets used in this paper.

6. REFERENCES
[1] M. W. Berry and M. Browne. Understanding Search Engines. SIAM, 1999.
[2] U. V. Catalyurek and C. Aykanat. Hypergraph-partitioning based decomposition for parallel sparse-matrix vector multiplication. IEEE Transactions on Parallel and Distributed Systems, 10(7):673–693, 1999.
[3] U. V. Catalyurek and C. Aykanat. PaToH: a multilevel hypergraph partitioning tool, version 3.0. Technical report, Bilkent University, Department of Computer Engineering, Ankara, Turkey, 1999. PaToH is available from http://bmi.osu.edu/~umit/software.html

[4] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. J. Soc. Inf. Sci., 41:391–407, 1990.
[5] K. Devine, E. G. Boman, R. Heaphy, R. Bisseling, and U. V. Catalyurek. Parallel hypergraph partitioning for scientific computing. In 20th International Parallel and Distributed Processing Symposium (IPDPS), page 10, 2006.
[6] L. Eldén. Matrix Methods in Data Mining and Pattern Recognition. SIAM, 2007.
[7] D. K. Harman, editor. The Third Text REtrieval Conference (TREC-3). NIST Special Publication 500-225, 1995. http://trec.nist.gov/
[8] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar. Multilevel hypergraph partitioning: application in VLSI domain. In 34th Design Automation Conference, pages 526–529, 1997.
[9] G. Karypis and V. Kumar. Analysis of multilevel graph partitioning. In Supercomputing '95: Proceedings of the 1995 ACM/IEEE Conference on Supercomputing (CDROM), New York, NY, USA, 1995. ACM Press. Article No. 29.
[10] G. Karypis and V. Kumar. Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput., 48(1):96–129, 1998.
[11] G. Karypis and V. Kumar. Multilevel k-way hypergraph partitioning. VLSI Design, 11(3):285–300, 2000.
[12] T. G. Kolda and D. P. O'Leary. A semidiscrete matrix decomposition for latent semantic indexing in information retrieval. ACM Trans. Information Systems, 16:322–346, 1998.
[13] C.-J. Lin. Projected gradient methods for non-negative matrix factorization. Neural Computation, 19:2756–2779, 2007.

[14] S. Sakellaridi, H.-r. Fang, and Y. Saad. Multilevel linear dimensionality reduction for data analysis using nearest-neighbor graphs. Submitted to the KDD 2008 conference.
[15] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Inf. Process. Manage., 24(5):513–523, 1988.
[16] B. Vastenhouw and R. H. Bisseling. A two-dimensional data distribution method for parallel sparse matrix-vector multiplication. SIAM Review, 47(1):67–95, 2005.
[17] F. Wang and C. Zhang. Fast multilevel transduction on graphs. In The 7th SIAM Conference on Data Mining (SDM), 2007.
