Weighted Feature Subset Non-Negative Matrix Factorization and its Applications to Document Understanding


2010 IEEE International Conference on Data Mining

Weighted Feature Subset Non-Negative Matrix Factorization and its Applications to Document Understanding

Dingding Wang, Tao Li
School of Computing and Information Sciences, Florida International University, Miami, FL, USA
Email: {dwang003,taoli}@cs.fiu.edu

Chris Ding
Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, TX, USA
Email: chqding@uta.edu

Abstract: Keyword (feature) selection enhances and improves many Information Retrieval (IR) tasks such as document categorization, automatic topic discovery, etc. The problem of keyword selection is usually solved using supervised algorithms. In this paper, we propose an unsupervised approach that combines keyword selection and document clustering (topic discovery). The proposed approach extends non-negative matrix factorization (NMF) by incorporating a weight matrix to indicate the importance of the keywords. The approach is further extended to a weighted version in which each document is also assigned a weight assessing its importance in the cluster. This work considers both theoretical and empirical weighted feature subset selection for NMF and draws the connection between unsupervised feature selection and data clustering. We apply our proposed approaches to various document understanding tasks including document clustering, summarization, and visualization. Experimental results demonstrate the effectiveness of our approach for these tasks.

Keywords: non-negative matrix factorization; feature selection; weighted feature subset non-negative matrix factorization.

I. INTRODUCTION

Recently, many research efforts have been reported on developing efficient and effective techniques for analyzing large document collections. Among these efforts, non-negative matrix factorization (NMF) has been shown to be useful for different document understanding problems, e.g., document clustering [40] and summarization [38]. The success of NMF is largely due to the newly discovered ability of NMF to solve challenging data mining and machine learning problems.
In particular, NMF with the sum-of-squared-error cost function is equivalent to relaxed K-means clustering, the most widely used unsupervised learning algorithm [8]. In addition, NMF with the I-divergence cost function is equivalent to probabilistic latent semantic indexing (PLSI) [22], another unsupervised learning method popularly used in text analysis [10], [14]. Furthermore, NMF is able to model widely varying data distributions and can perform both hard and soft clustering simultaneously. Several variants of NMF with different forms of factorization and regularization have also been developed and applied to many document analysis tasks [11], [18], [38], [39]. Although NMF and its variants have shown their effectiveness in these tasks, they usually perform data clustering over the full feature space. As we know, keyword (feature) selection can enhance and improve many document applications such as document categorization and automatic topic discovery. However, most existing keyword selection techniques are designed for supervised classification problems. In this paper, we extend NMF to solve a novel problem of clustering with double labeling of important features and data points: each data point is marked as belonging to one of the groups, and each feature and data point is also weighted to assess its importance. In particular, we first extend NMF to feature subset NMF, which combines keyword selection and document clustering (topic discovery). The proposed extension incorporates a weight matrix indicating the importance of the keywords. It considers feature subset selection for NMF both theoretically and empirically, and draws the connection between unsupervised feature selection and data clustering. The selected keywords are discriminative across different topics from a global perspective, unlike those obtained in co-clustering, which typically associate strongly with one cluster and are absent from the others.
Also, the selected keywords are not linear combinations of words like those obtained in Latent Semantic Indexing (LSI) [17]: our selected words provide clear semantic meanings of the key features, while LSI features combine different words together and are not easy to interpret. We further extend feature subset NMF into a weighted version which assumes documents (data points) contribute differently to the clustering process, i.e., some documents are tightly related to certain topics, while some can be considered outliers. Finally, we apply the proposed approaches to document understanding problems such as document clustering, summarization, and visualization. Comprehensive experiments demonstrate the effectiveness of our approaches.

The rest of the paper is organized as follows. Section II discusses related work on the NMF framework and various

document understanding tasks. In Section III, we derive a generic theorem on the NMF algorithm. Sections IV and V propose our (weighted) feature subset NMF. An illustrative example is shown in Section VI, and comprehensive experiments on document clustering, summarization, and visualization are conducted in Section VII. Section VIII concludes.

II. RELATED WORK

A. NMF Framework

NMF has been shown to be very useful for data clustering. Lee and Seung [24] proposed the NMF problem and showed that it could be solved by a multiplicative update algorithm. In general, the NMF algorithm attempts to find the subspaces in which the majority of the data points lie. Let the input data matrix X = (x_1, ..., x_n) contain the collection of n nonnegative data column vectors. NMF aims to factorize X into two nonnegative matrices, X ≈ FG^T, where X ∈ R_+^{p×n}, F ∈ R_+^{p×k}, and G ∈ R_+^{n×k}. There are other matrix factorizations which differ from standard NMF in the restrictions on the matrix factors and forms. We list them as follows:

Convex-NMF:       X ≈ X W G^T
Tri-Factorization: X ≈ F S G^T
WFS-NMF:          min ||X − FG^T||²_W

Note that WFS-NMF is our proposed algorithm, which extends NMF by incorporating a weight matrix to indicate the importance of the keywords and data points. The details of the algorithm will be discussed in the following sections. A preliminary study of feature subset NMF which only considers the importance of keywords was presented as a two-page poster [37]. The relations between NMF and some of the other matrix factorization and clustering algorithms have been studied in [25].
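As a concrete illustration of the standard factorization X ≈ FG^T discussed above, the following is a minimal NumPy sketch of the Lee-Seung multiplicative updates for the Frobenius objective (the matrix sizes, iteration count, and random initialization below are illustrative choices, not from the paper):

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for X ~= F @ G.T (Frobenius loss).

    X : (p, n) nonnegative data matrix; k : number of factors.
    Returns nonnegative F (p, k) and G (n, k).
    """
    rng = np.random.default_rng(seed)
    p, n = X.shape
    F = rng.random((p, k)) + eps
    G = rng.random((n, k)) + eps
    for _ in range(n_iter):
        # updates keep F, G nonnegative and monotonically reduce the loss
        G *= (X.T @ F) / (G @ (F.T @ F) + eps)
        F *= (X @ G) / (F @ (G.T @ G) + eps)
    return F, G

X = np.abs(np.random.default_rng(1).random((8, 6)))
F, G = nmf(X, k=2)
err = np.linalg.norm(X - F @ G.T)
```

A rank-2 approximation of a random 8x6 nonnegative matrix will not be exact, but the reconstruction error stays well below the norm of X itself.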
In general, (1) orthogonal NMF is equivalent to K-means clustering; (2) G-orthogonal NMF, semi-NMF, and convex-NMF are identical to relaxed K-means clustering; (3) tri-factorization with explicit orthogonality constraints can be transformed into 2-factor NMF; (4) PLSI [22] (which is further developed into the more comprehensive Latent Dirichlet Allocation (LDA) model [1]) solves the problem of NMF with Kullback-Leibler divergence; (5) our proposed WFS-NMF combines clustering with double labeling of important features and samples by assigning different weights to each row and column based on the weight matrix.

B. Document Understanding Applications

There exist various document understanding applications in the IR community. Here we briefly review some popular tasks, including document clustering, document summarization, and visualization. In this paper, we apply our proposed approaches to these three applications.

Document Clustering: The problem of document clustering has been extensively studied. Given a collection of documents, document clustering partitions them into different groups (called clusters) so that similar documents belong to the same group while documents in different clusters are dissimilar. Traditional clustering techniques such as hierarchical and partitioning methods have been used for clustering documents, e.g., hierarchical agglomerative clustering (HAC) [12] and K-means clustering [20]. Model-based clustering methods such as PLSI and the more comprehensive LDA have also been successfully applied to document clustering [22], [1]. Recently, matrix- and graph-based clustering algorithms have emerged as promising approaches [39]; two representative examples are spectral clustering [34] and non-negative matrix factorization (NMF) [24], [40]. Co-clustering algorithms have also been proposed which aim at clustering documents and terms simultaneously by making use of the dual relationship information [5], [7], [43].
Subspace clustering algorithms have also been developed for discovering low-dimensional clusters in high-dimensional document space [26], [23].

Multi-Document Summarization: Multi-document summarization aims to generate a short summary for a collection of documents reflecting the major or query-relevant information. Existing summarization methods usually rank the sentences in the documents according to their salience scores calculated from a set of predefined linguistic features, such as term frequency-inverse sentence frequency (TF-ISF) [28], sentence or term position [4], and number of keywords [4]. Gong et al. [16] propose a generic method using latent semantic analysis (LSA) to select highly ranked sentences for summarization. Goldstein et al. [15] propose a maximal marginal relevance (MMR) method that summarizes documents based on the cosine similarity between a query and a sentence, and between the sentence and previously selected sentences. Other approaches include NMF-based summarization [30], Conditional Random Field (CRF) based summarization [33], and a hidden Markov model (HMM) based method [4]. In addition, graph-ranking based approaches, similar in spirit to PageRank, have been proposed to summarize documents using the sentence relationships [13].

Document Visualization: Document visualization focuses on displaying document relationships using various presentation techniques, which helps users understand and navigate information easily. Techniques have been developed to map a document collection into a multivariate space. Typical systems for document visualization include the Galaxy of News [32], Jigsaw [35], and ThemeRiver [21].

In this paper, we extend the NMF model to allow unsupervised feature selection and data clustering and ranking to be conducted simultaneously. We apply the proposed approaches in three document understanding applications to demonstrate their effectiveness for improving document understanding.

III. A GENERIC THEOREM ON THE NMF ALGORITHM

In this paper, we derive several algorithms for NMF problems. Here we first provide a generic theorem on the NMF algorithm; we will use this result repeatedly later. For the following optimization problem

min_{H ≥ 0} J(H) = Tr[−2 R^T H + H^T P H Q],   (1)

where P and Q are constant matrices, the optimal solution for H is given by the following updating algorithm:

H_ik ← H_ik R_ik / (P H Q)_ik.   (2)

Theorem 1. If the algorithm converges, the converged solution satisfies the KKT condition.

Proof. We minimize the Lagrangian function

L(H) = Tr[−2 R^T H + H^T P H Q − 2 β H],

where β = (β_ik) are the Lagrangian multipliers enforcing H_ik ≥ 0. Setting

∂L/∂H_ik = (−2R + 2PHQ − 2β)_ik = 0,

the KKT complementary slackness condition β_ik H_ik = 0 becomes

(−R + PHQ)_ik H_ik = 0.   (3)

Now, when the iterative solution of H converges, it satisfies

H_ik = H_ik R_ik / (PHQ)_ik.   (4)

One can see that Eq.(4) is identical to Eq.(3), whether H_ik = 0 or not. This proves that the converged solution satisfies the KKT condition.

Theorem 2. The updating algorithm of Eq.(2) converges.

Proof. We use the auxiliary function approach [24]. A function Z(H, H') is called an auxiliary function of J(H) if it satisfies

Z(H, H') ≥ J(H),   Z(H, H) = J(H),   (5)

for any H, H'. Define

H^(t+1) = argmin_H Z(H, H^(t)),   (6)

where we note that we require the global minimum. By construction, we have J(H^(t)) = Z(H^(t), H^(t)) ≥ Z(H^(t+1), H^(t)) ≥ J(H^(t+1)). Thus J(H^(t)) is monotonically non-increasing. The key is to find (1) an appropriate Z(H, H') and (2) its global minimum. Using the matrix inequality

Tr(H^T P H Q) ≤ Σ_ik (P H' Q)_ik H_ik² / H'_ik,

where H, H', P, Q ≥ 0 and P = P^T, Q = Q^T, we can see that

Z(H, H') = −2 Σ_ik R_ik H_ik + Σ_ik (P H' Q)_ik H_ik² / H'_ik

is an auxiliary function of J(H) of Eq.(1). Now we solve Eq.(6) by identifying H^(t+1) = H and H^(t) = H'.
Setting

∂Z/∂H_ik = −2 R_ik + 2 (P H' Q)_ik H_ik / H'_ik = 0,   (7)

we obtain

H_ik = H'_ik R_ik / (P H' Q)_ik.   (8)

The second derivatives are

∂²Z / (∂H_ik ∂H_jl) = 2 (P H' Q)_ik / H'_ik · δ_ij δ_kl,

which form a positive semi-definite matrix, ensuring that the local optimum of Eq.(8) obtained from Eq.(7) is the global minimum of Eq.(6). Thus updating H using Eq.(8) decreases J(H). One can see that Eq.(8) is identical to Eq.(2).

IV. FEATURE SUBSET NMF (FS-NMF)

A. Objective

Let X = (x_1, ..., x_n) contain n documents with m keywords (features). In general, NMF factorizes the input nonnegative data matrix X into two nonnegative matrices, X ≈ FG^T, where G ∈ R_+^{n×k} is the cluster indicator matrix for clustering the columns of X, and F = (f_1, ..., f_k) ∈ R_+^{m×k} contains the k cluster centroids. In this paper, we propose a new objective to simultaneously factorize X and rank the features in X as follows:

min_{W≥0, F≥0, G≥0} ||X − FG^T||²_W,  s.t. Σ_j W_j^α = 1,   (9)

where W ∈ R_+^{m×m} is a diagonal matrix indicating the weights of the rows (keywords or features) of X, and α is a parameter (set to 0.7 empirically).

B. Optimization

Minimizing Eq.(9) with respect to W, F, and G jointly has no closed-form solution. We therefore optimize the objective with respect to one variable while fixing the other variables. This procedure repeats until convergence.
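Before deriving the specific update rules, the generic update of Eq.(2) and the monotone decrease guaranteed by Theorem 2 can be sanity-checked numerically. This is an illustrative sketch with random symmetric nonnegative P and Q (all sizes and seeds are ours, not from the paper):

```python
import numpy as np

def generic_update(H, R, P, Q, eps=1e-12):
    # One step of Eq.(2): H_ik <- H_ik * R_ik / (P H Q)_ik
    return H * R / (P @ H @ Q + eps)

def J(H, R, P, Q):
    # Objective of Eq.(1): Tr(-2 R^T H + H^T P H Q)
    return np.trace(-2.0 * R.T @ H + H.T @ P @ H @ Q)

rng = np.random.default_rng(0)
A = rng.random((6, 6)); P = A @ A.T    # symmetric, nonnegative
B = rng.random((3, 3)); Q = B @ B.T    # symmetric, nonnegative
R = rng.random((6, 3))
H = rng.random((6, 3))

values = [J(H, R, P, Q)]
for _ in range(50):
    H = generic_update(H, R, P, Q)
    values.append(J(H, R, P, Q))
```

The recorded objective values should be non-increasing, matching Theorem 2, and H stays nonnegative throughout.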

1) Computation of W: Optimizing Eq.(9) with respect to W is equivalent to optimizing

J_1 = Σ_j W_j u_j − λ (Σ_j W_j^α − 1),   u_j = Σ_i (X − FG^T)²_ij.

From the KKT condition (∂J_1/∂W_j) W_j = (u_j − λ α W_j^{α−1}) W_j = 0, we obtain the following updating formula:

W_j = u_j^{1/(α−1)} / [Σ_j u_j^{α/(α−1)}]^{1/α}.   (10)

2) Computation of G: Optimizing Eq.(9) with respect to G is equivalent to optimizing

J_2(G) = Tr(X^T W X − 2 G F^T W X + F^T W F G^T G).

Using the generic algorithm in Section III, we obtain the following updating formula:

G_jk ← G_jk (X^T W F)_jk / (G F^T W F)_jk.   (11)

3) Computation of F: Optimizing Eq.(9) with respect to F is equivalent to optimizing

J_3(F) = Tr[W X X^T − 2 W X G F^T + W F G^T G F^T].

Using the generic algorithm in Section III, we obtain the following updating formula:

F_ik ← F_ik (W X G)_ik / (W F G^T G)_ik.   (12)

4) Algorithm Procedure: The detailed procedure of FS-NMF is listed as Algorithm 1.

Algorithm 1 FS-NMF Algorithm Description
Input: X: word-document matrix; K: the number of clusters
Output: F: word cluster indicator matrix; G: document cluster indicator matrix; W: word weight matrix
1: Initialize W = I and initialize (F, G) as the output of standard NMF
2: repeat
3:   Update W by Eq.(10), with u_j = Σ_i (X − FG^T)²_ij
4:   Update G by Eq.(11)
5:   Update F by Eq.(12)
6: until convergence

V. WEIGHTED FEATURE SUBSET NMF (WFS-NMF)

In Section IV, different weights are assigned to the term features, indicating the importance of the keywords; however, all documents are treated equally. This assumption no longer holds when different documents carry different importance. Thus, we extend our algorithm to a weighted version in which each document is also assigned a weight. Similar to Eq.(9), the objective of weighted FS-NMF can be written as

min_{W≥0, F≥0, G≥0} ||X − FG^T||²_W,

where we set W_ij = a_i b_j. This becomes

min_{a,b≥0, F≥0, G≥0} Σ_ij (X − FG^T)²_ij a_i b_j,  s.t. Σ_i a_i^α = 1, Σ_j b_j^β = 1,   (13)

where α and β are two parameters with 0 < α < 1, 0 < β < 1.

A. Optimization

1) Computation of W: Since W_ij = a_i b_j, we optimize a = (a_1, ..., a_m) first. Optimizing Eq.(13) with respect to a is equivalent to optimizing

J_a = Σ_i u_i a_i − λ (Σ_i a_i^α − 1),   u_i = Σ_j (X − FG^T)²_ij b_j.
This optimization has been analyzed in Section IV-B. The optimal solution for a is given by

a_i = u_i^{1/(α−1)} / [Σ_i u_i^{α/(α−1)}]^{1/α}.   (14)

We now optimize the objective of Eq.(13) with respect to b = (b_1, ..., b_n), which is equivalent to optimizing

J_b = Σ_j v_j b_j − λ (Σ_j b_j^β − 1),   v_j = Σ_i (X − FG^T)²_ij a_i.

The optimal solution for b is given by

b_j = v_j^{1/(β−1)} / [Σ_j v_j^{β/(β−1)}]^{1/β}.   (15)

2) Computation of F: Let A = diag(a_1, a_2, ..., a_m) and B = diag(b_1, b_2, ..., b_n). Optimizing Eq.(13) with respect to F is equivalent to optimizing

J_4(F) = Σ_ij a_i (X − FG^T)²_ij b_j = ||A^{1/2} (X − FG^T) B^{1/2}||²
       = Tr(X^T A X B − 2 G^T B X^T A F + F^T A F G^T B G).   (16)

Using the generic algorithm of Section III, we obtain

F_ik ← F_ik (A X B G)_ik / (A F G^T B G)_ik.   (17)
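The closed-form weight updates of Eqs.(14) and (15) share one functional form; a small sketch (the function name and the toy residuals below are ours, not from the paper):

```python
import numpy as np

def weight_update(E2, other_w, alpha, eps=1e-12):
    """Closed-form weights of Eqs.(14)/(15).

    E2      : squared residuals (X - F G^T)**2, oriented so that rows are
              the entries being weighted (pass E2.T when updating b).
    other_w : current weights on the opposite axis (b when updating a).
    alpha   : exponent in the constraint sum(w**alpha) = 1, with 0 < alpha < 1.
    """
    u = E2 @ other_w + eps                     # u_i = sum_j E2_ij * w_j
    w = u ** (1.0 / (alpha - 1.0))             # w_i proportional to u_i^{1/(alpha-1)}
    return w / (w ** alpha).sum() ** (1.0 / alpha)   # enforce sum(w**alpha) = 1

# Toy residuals: row 0 fits poorly, row 2 fits well.
E2 = np.array([[4.0, 4.0], [1.0, 1.0], [0.1, 0.1]])
b = np.ones(2)
a = weight_update(E2, b, alpha=0.7)
```

When updating b, pass E2.T and a with β in place of α. Since 1/(α−1) is negative for 0 < α < 1, rows with a smaller weighted residual receive a larger weight, which is what makes well-modeled features rank highly.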

3) Computation of G: Using Eq.(16), the objective for G is

J_5(G) = Tr(X^T A X B − 2 G^T B X^T A F + G^T B G F^T A F).   (18)

Using the generic algorithm of Section III, we obtain

G_jk ← G_jk (B X^T A F)_jk / (B G F^T A F)_jk.   (19)

B. Algorithm Procedure

The detailed procedure of WFS-NMF is listed as Algorithm 2.

Algorithm 2 WFS-NMF Algorithm Description
Input: X: word-document matrix; K: the number of clusters
Output: F: word cluster indicator matrix; G: document cluster indicator matrix; W: word and document weight matrix
1: Initialize W = I and initialize (F, G) as the output of standard NMF
2: repeat
3:   Update W by W_ij = a_i b_j, where a is given by Eq.(14) with u_i = Σ_j (X − FG^T)²_ij b_j, and b is given by Eq.(15) with v_j = Σ_i (X − FG^T)²_ij a_i
4:   Update G by Eq.(19)
5:   Update F by Eq.(17)
6: until convergence

VI. AN ILLUSTRATIVE EXAMPLE

In this section, we use a simple example to illustrate the process of weighting the keywords and data points using the proposed WFS-NMF algorithm. An example dataset with six system log messages is presented in Table I; it is a subset of the Log data described in Section VII-A. The six sample messages belong to two different clusters: start and create.

S1: Start User profile application version 1.0 started successfully.
S2: Database application version 1.1 starts.
S3: Start application version 2.0 for temporary services.
S4: Can not create temporary services for the Oracle engine.
S5: Can not create temporary services on the files.
S6: Create application version 2.0 for temporary services.

Table I
AN EXAMPLE DATASET WITH TWO CLUSTERS.
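The alternating updates of Algorithm 2 can be sketched and run on a toy dataset like the one above. This is an illustrative implementation, not the authors' code: the 0/1 matrix entries are derived from which terms occur in which messages of Table I, the initialization uses uniform weights rather than a standard-NMF warm start, and all sizes, seeds, and iteration counts are ours.

```python
import numpy as np

def normalize(u, p):
    # closed form of Eqs.(14)/(15): w_i ~ u_i^{1/(p-1)}, normalized so sum(w**p) = 1
    w = u ** (1.0 / (p - 1.0))
    return w / (w ** p).sum() ** (1.0 / p)

def wfs_nmf(X, k=2, alpha=0.7, beta=0.7, n_iter=300, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F = rng.random((m, k)) + eps
    G = rng.random((n, k)) + eps
    a = np.full(m, (1.0 / m) ** (1.0 / alpha))   # uniform, sum(a**alpha) = 1
    b = np.full(n, (1.0 / n) ** (1.0 / beta))
    for _ in range(n_iter):
        E2 = (X - F @ G.T) ** 2
        a = normalize(E2 @ b + eps, alpha)       # Eq.(14)
        b = normalize(E2.T @ a + eps, beta)      # Eq.(15)
        Af = a[:, None] * F                      # diag(a) @ F
        # Eq.(19); the outer diag(b) factor cancels in the ratio
        G *= (X.T @ Af) / (G @ (Af.T @ F) + eps)
        Bg = b[:, None] * G                      # diag(b) @ G
        # Eq.(17); the outer diag(a) factor cancels in the ratio
        F *= (X @ Bg) / (F @ (G.T @ Bg) + eps)
    return F, G, a, b

terms = ["start", "application", "version", "create", "temporary", "service"]
X = np.array([
    [1, 1, 1, 0, 0, 0],   # start:       S1, S2, S3
    [1, 1, 1, 0, 0, 1],   # application: S1, S2, S3, S6
    [1, 1, 1, 0, 0, 1],   # version:     S1, S2, S3, S6
    [0, 0, 0, 1, 1, 1],   # create:      S4, S5, S6
    [0, 0, 1, 1, 1, 1],   # temporary:   S3, S4, S5, S6
    [0, 0, 1, 1, 1, 1],   # service:     S3, S4, S5, S6
], dtype=float)

F, G, a, b = wfs_nmf(X)
top_terms = [terms[i] for i in np.argsort(-a)[:2]]
```

The returned a and b satisfy the constraints of Eq.(13), and sorting the terms by a gives the keyword ranking discussed in the example.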
In the data pre-processing step, the stop words and the words which appear only once are removed, and stemming is performed. The resulting term-message matrix is

X (rows: terms; columns: S1-S6):
start        1 1 1 0 0 0
application  1 1 1 0 0 1
version      1 1 1 0 0 1
create       0 0 0 1 1 1
temporary    0 0 1 1 1 1
service      0 0 1 1 1 1

After the computation by WFS-NMF, the weights for the terms are

a = (start 0.84, application 0.58, version 0.58, create 0.84, temporary 0.65, service 0.65).

Thus the two most important keywords are "start" and "create", which is consistent with our intuition. Similarly, each message receives a weight; S3 and S6 obtain the lowest weights, so they are the least important messages for discriminating the two clusters. From the example, we clearly observe that the proposed approaches can discover key features and samples.

VII. EXPERIMENTS

A. Document Clustering

First of all, we examine the clustering performance of FS-NMF and WFS-NMF using four text datasets as described in Section VII-A1, and compare the results with several widely used document clustering methods as described in Section VII-A2.

Table II
DATASET DESCRIPTIONS (# samples, # dimensions, and # classes for the CSTR, Log, Reuters, and WebACE datasets).

1) Data Sets: Table II summarizes the characteristics of the datasets used in the experiments. Detailed descriptions of the data sets are as follows.

CSTR. This is the dataset of the abstracts of technical reports (TRs) published in the Department of Computer Science at the University of Rochester from 1991 to 2002. The dataset contains 476 abstracts, which are divided into four research areas: Natural

Language Processing (NLP), Robotics/Vision, Systems, and Theory.

Log. This dataset contains 1367 log text messages collected from several different machines with different operating systems at Florida International University, using logdump2td (an NT data collection tool). There are 9 categories of messages: configuration, connection, create, dependency, other, report, request, start, and stop.

Reuters. The Reuters-21578 Text Categorization Test collection contains documents collected from the Reuters newswire in 1987. It is a standard text categorization benchmark and contains 135 categories. In our experiments, we use a subset of the data collection which includes the 10 most frequent categories among the 135 topics; we call it Reuters-top 10.

WebACE. This dataset is from the WebACE project and has been used for document clustering [2], [19]. It contains 2340 documents consisting of news articles from the Reuters news service via the Web in October 1997. These documents are divided into 20 classes.

Newsgroups. The 20 Newsgroups dataset contains approximately 20,000 articles evenly divided among 20 Usenet newsgroups. The raw text size is 26MB.

To pre-process the datasets, we remove the stop words using a standard stop list; all HTML tags are skipped, and all header fields except subject and organization of the posted articles are ignored. In all our experiments, we first select the top 1000 words by mutual information with class labels. The feature selection is done with the rainbow package [29].

2) Implemented Baselines: We compare the clustering performance of FS-NMF and WFS-NMF with the following widely used document clustering methods.
(1) K-means: standard K-means algorithm; (2) PCA-Km: PCA is first applied to reduce the data dimension, followed by K-means clustering; (3) LDA-Km [9]: an adaptive subspace clustering algorithm integrating linear discriminant analysis (LDA) and K-means clustering into a coherent process; (4) ECC: Euclidean co-clustering [3]; (5) MSRC: minimum squared residue co-clustering [3]; (6) NMF: non-negative matrix factorization [40]; (7) TNMF: tri-factor non-negative matrix factorization [11]; (8) Ncut: spectral clustering with normalized cuts [42].

Among these baselines, (a) K-means is one of the most widely used standard clustering algorithms; (b) LDA-Km and PCA-Km are two subspace clustering algorithms which identify clusters existing in subspaces of the original data space; (c) spectral clustering with normalized cuts (Ncut) is included since it has been shown that weighted kernel K-means is equivalent to the normalized cut [6]; (d) both ECC and MSRC are document co-clustering algorithms that are able to find blocks in a rectangular document-term matrix. Co-clustering algorithms generally perform implicit dimension reduction during the clustering process. NMF has been shown to be effective in document clustering [40], and our methods are both based on the NMF framework.

3) Evaluation Measures: To measure the clustering performance, we use accuracy and normalized mutual information as our performance measures. Accuracy discovers the one-to-one relationship between clusters and classes and measures the extent to which each cluster contains data points from the corresponding class. It sums up the matching degree over all class-cluster pairs, and its value lies in [0, 1]. Accuracy can be represented as

ACC = max ( Σ_{C_i, L_j} T(C_i, L_j) ) / N,   (20)

where C_i denotes the i-th cluster, L_j is the j-th class, and T(C_i, L_j) is the number of entities belonging to class L_j that are assigned to cluster C_i. Accuracy computes the maximum sum of T(C_i, L_j) over all one-to-one pairings of clusters and classes; these pairs have no overlaps.
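The accuracy measure of Eq.(20) can be computed by searching over one-to-one cluster-to-class matchings. A minimal sketch, using brute force over permutations since the number of clusters here is small (the function name and toy label vectors are ours):

```python
import numpy as np
from itertools import permutations

def accuracy(labels_true, labels_pred):
    """Clustering accuracy of Eq.(20): the best one-to-one matching between
    clusters and classes (assumes #clusters <= #classes)."""
    classes = sorted(set(labels_true))
    clusters = sorted(set(labels_pred))
    # contingency counts T(C_i, L_j)
    T = np.zeros((len(clusters), len(classes)), dtype=int)
    for c, l in zip(labels_pred, labels_true):
        T[clusters.index(c), classes.index(l)] += 1
    best = max(sum(T[i, p[i]] for i in range(len(clusters)))
               for p in permutations(range(len(classes))))
    return best / len(labels_true)

# a perfect clustering up to relabeling scores 1.0
acc_perfect = accuracy([0, 0, 1, 1, 2, 2], [1, 1, 2, 2, 0, 0])
acc_partial = accuracy([0, 0, 1, 1, 2, 2], [1, 1, 2, 0, 0, 0])
```

For larger numbers of clusters, the same maximization is usually done with the Hungarian algorithm instead of brute force.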
Generally, greater accuracy means better clustering performance.

Normalized mutual information (NMI) is another widely used performance evaluation measure for determining the quality of clusters [36]. For two random variables X and Y, the NMI is defined as

NMI(X, Y) = I(X, Y) / sqrt(H(X) H(Y)),   (21)

where I(X, Y) is the mutual information between X and Y, and H(X) and H(Y) are the entropies of X and Y, respectively. Clearly, NMI(X, X) = 1, and this is the maximum possible value of NMI. Given a clustering result, the NMI in Eq.(21) is estimated as

NMI = [ Σ_{i=1}^k Σ_{j=1}^k n_{i,j} log( n · n_{i,j} / (n_i n̂_j) ) ] / sqrt( (Σ_{i=1}^k n_i log(n_i / n)) (Σ_{j=1}^k n̂_j log(n̂_j / n)) ),   (22)

where n_i denotes the number of data points contained in the cluster C_i (1 ≤ i ≤ k), n̂_j is the number of data points belonging to the j-th class (1 ≤ j ≤ k), and n_{i,j} denotes the number of data points in the intersection of cluster C_i and the j-th class. In general, the larger the NMI value, the better the clustering quality.

4) Clustering Results: Table III and Table IV show the accuracy and NMI evaluation results on the text datasets. From the experimental comparisons, we observe that: On most datasets, subspace clustering algorithms (especially LDA-Km) outperform the standard K-means algorithm due to the pre-processing by LDA or PCA. Co-clustering algorithms (ECC and MSRC) generally outperform K-means since they perform implicit dimension reduction during the clustering process. NMF outperforms K-means significantly since NMF can model widely varying data distributions due to the flexibility of matrix factorization, as compared to

Table III
CLUSTERING ACCURACY (rows: K-means, PCA-Km, LDA-Km, ECC, MSRC, NMF, TNMF, Ncut, FS-NMF, WFS-NMF; columns: WebACE, Log, Reuters, CSTR).

Table IV
CLUSTERING NMI RESULTS (same methods and datasets as Table III).

the rigid spherical clusters that the K-means clustering objective function attempts to capture [8]. TNMF provides a good framework for simultaneously clustering the rows and columns of the input documents; hence TNMF generally outperforms NMF. The results of spectral clustering (Ncut) are better than those of K-means. Note that spectral clustering can be viewed as a weighted version of kernel K-means and hence is able to discover arbitrarily shaped clusters. The experimental results of Ncut are similar to those of NMF; note that it has also been shown that NMF is equivalent to spectral clustering [8]. The proposed FS-NMF and WFS-NMF extend the NMF model and provide a good framework for weighting different terms and documents; hence both of them generally outperform NMF and TNMF on the datasets, and in the meanwhile, important term features can be discovered by our algorithms. Because WFS-NMF considers the importance of different documents instead of treating them equally, it achieves the best performance on most datasets.

B. Document Summarization

1) Data Sets: We use the DUC benchmark datasets (DUC2002 and DUC2004) for generic document summarization tasks. Table V gives a brief description of the data sets.

Table V
DESCRIPTION OF THE DATA SETS FOR MULTI-DOCUMENT SUMMARIZATION
                                 DUC2002     DUC2004
number of document collections   59          50
documents in each collection     10          10
data source                      TREC        TDT
summary length                   200 words   665 bytes

2) Implemented Systems: In this experiment, we compare our algorithms for summarization with several widely used document summarization methods as follows. (1) DUC Best: the method developed by the team achieving the highest scores in the DUC competition. (2) Random: selects sentences randomly for each document collection.
(3) Centroid: similar to the MEAD algorithm proposed in [31], using centroid value, positional value, and first-sentence overlap as features. (4) LexPageRank: a graph-based summarization method recommending sentences by the voting of their neighbors [13]. (5) LSA: conducts latent semantic analysis on the term-by-sentence matrix, as proposed in [16]. (6) NMF: performs NMF on the term-by-sentence matrix and ranks the sentences by their weighted scores [24].

In order to use FS-NMF or WFS-NMF for document summarization, we use the document-sentence matrix as the input data X, which can be generated from the document-term and sentence-term matrices; each feature (column) of X now represents a sentence. The sentences can then be ranked based on the sentence weights in W in both FS-NMF and WFS-NMF, and top-ranked sentences are included in the final summary. Since WFS-NMF weights both the samples and the features, an alternative solution for document summarization is to factorize the sentence-term matrix generated from the original documents; after the computation, the sentences are naturally ranked based on their assigned weights. Thus, we develop three new summarization methods as follows. (7) FS-NMF: performs FS-NMF on the document-sentence matrix, and selects the sentences associated with the highest weights to form summaries. (8) WFS-NMF-1: similar to FS-NMF, performs WFS-NMF on the document-sentence matrix to select the sentences with the highest weights. (9) WFS-NMF-2: performs WFS-NMF on the sentence-term matrix, and selects the sentences associated with the highest weights to form summaries.
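The weight-based sentence ranking behind methods (7)-(9) can be sketched as follows. This is an illustrative FS-NMF ranking following Eqs.(10)-(12), with the rows of X playing the role of sentences; the function name, the toy matrix, and all sizes and seeds are ours, and a real DUC pipeline would build X from the actual documents:

```python
import numpy as np

def fsnmf_sentence_ranking(X, k=2, alpha=0.7, n_iter=150, eps=1e-9, seed=0):
    """Rank the rows (sentences) of a nonnegative matrix by FS-NMF weights.

    Returns row indices sorted by decreasing weight W."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F = rng.random((m, k)) + eps
    G = rng.random((n, k)) + eps
    W = np.full(m, (1.0 / m) ** (1.0 / alpha))       # sum(W**alpha) = 1
    for _ in range(n_iter):
        u = ((X - F @ G.T) ** 2).sum(axis=1) + eps   # u_j of Eq.(10)
        W = u ** (1.0 / (alpha - 1.0))
        W /= (W ** alpha).sum() ** (1.0 / alpha)     # Eq.(10)
        WF = W[:, None] * F                          # diag(W) @ F
        G *= (X.T @ WF) / (G @ (WF.T @ F) + eps)     # Eq.(11)
        F *= (W[:, None] * (X @ G)) / (W[:, None] * (F @ (G.T @ G)) + eps)  # Eq.(12)
    return np.argsort(-W)

# toy sentence-term matrix: 5 "sentences" over 4 "terms"
X = np.array([[3, 0, 0, 1],
              [2, 1, 0, 0],
              [0, 0, 2, 2],
              [0, 1, 3, 0],
              [1, 1, 1, 1]], dtype=float)
ranks = fsnmf_sentence_ranking(X)
```

Taking the first few indices of `ranks` corresponds to selecting the top-weighted sentences for the summary.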

3) Evaluation Methods: We use the ROUGE toolkit [27] (version 1.5.5) to measure summarization performance; it is widely applied by DUC for performance evaluation. It measures the quality of a summary by counting the unit overlaps between the candidate summary and a set of reference summaries. Several automatic evaluation methods are implemented in ROUGE, such as ROUGE-N, ROUGE-W, and ROUGE-SU. ROUGE-N is an n-gram recall computed as follows:

ROUGE-N = Σ_{S ∈ ref} Σ_{gram_n ∈ S} Count_match(gram_n) / Σ_{S ∈ ref} Σ_{gram_n ∈ S} Count(gram_n),   (23)

where n is the length of the n-gram, and ref stands for the reference summaries. Count_match(gram_n) is the maximum number of n-grams co-occurring in a candidate summary and the reference summaries, and Count(gram_n) is the number of n-grams in the reference summaries. ROUGE-W is based on weighted LCS, and ROUGE-SU is based on skip-bigram plus unigram. Each of these evaluation methods in ROUGE can generate three scores (recall, precision, and F-measure). As we reach similar conclusions with any of the three scores, for simplicity we only report the average F-measure scores generated by ROUGE-1, ROUGE-2, ROUGE-W, and ROUGE-SU to compare the implemented systems.

Table VI
OVERALL PERFORMANCE COMPARISON ON DUC2002 DATA (ROUGE-1, ROUGE-2, ROUGE-W, and ROUGE-SU F-measures for DUC Best, Random, Centroid, LexPageRank, LSA, NMF, FS-NMF, WFS-NMF-1, and WFS-NMF-2).

Table VII
OVERALL PERFORMANCE COMPARISON ON DUC2004 DATA (same systems and measures as Table VI).

4) Summarization Evaluation: The experimental results are shown in Table VI and Table VII. From the results, we have the following observations. All three summarization methods developed from the FS-NMF and WFS-NMF algorithms outperform the state-of-the-art generic summarization methods; the good results benefit from the weighting schemes for sentence features (or sentence samples). Among these three methods, WFS-NMF-1 generally achieves the highest ROUGE scores.
This observation demonstrates that the sentence feature selection is effective and that the weights on the document side also help the sentence weighting process. Looking further at the selected sentences, we find that there is some overlap among the sentences selected by the three proposed summarization methods, which indicates the consistency and effectiveness of the weight assignments on both samples and features. The ROUGE scores of our methods are higher than those of the best team in DUC2004 and comparable to those of the best team from DUC2002. Note that the good results of the best teams come from the fact that they apply deeper natural language processing techniques to resolve pronouns and other anaphoric expressions, which we do not use in our data preprocessing.

C. Visualization

To evaluate the term features selected by our methods in simultaneous document clustering and keyword selection, in this set of experiments we calculate the pairwise document similarity using the top 20 word features selected by different methods. We use the CSTR dataset in this experiment, which contains four classes of text data. We compare the results of our FS-NMF and WFS-NMF algorithms with standard NMF and LSI, and Figure 1 shows the document similarity matrices visually. Note that in the CSTR dataset, we order the documents based on their class labels. From Figure 1, we have the following observations. Word features selected by FS-NMF and WFS-NMF effectively reflect the document distribution, because the keywords identified by FS-NMF discriminate different topics from a global perspective. NMF (Figure 1(b)) shows no obvious patterns at all; the failure of NMF comes from the fact that it tries to group the terms into the topics contained in the documents and uses the terms with the highest probabilities in each topic as keywords, which are not discriminative and are usually redundant. LSI can also find meaningful words; however, the first two clusters are not clearly discovered in Figure 1(a), which indicates that some small classes are hard to identify using the keywords selected by LSI.

VIII.
CONCLUSION

In this paper, we propose weighted feature subset non-negative matrix factorization, an unsupervised approach that simultaneously clusters data points and selects

important features; different data points are also assigned different weights indicating their importance. We apply our proposed approach to various document understanding tasks including document clustering, summarization, and visualization. Experimental results demonstrate the effectiveness of our approaches for these tasks.

ACKNOWLEDGEMENT

The work of D. Wang is supported by a Florida International University (FIU) Dissertation Fellowship. The work of T. Li is partially supported by NSF grants under the IIS, CCF, and DMS programs. The work of C. Ding is partially supported by NSF grants under the DMS and CCF programs.

REFERENCES

[1] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[2] D. Boley. Principal direction divisive partitioning. Data Mining and Knowledge Discovery, 2, 1997.
[3] H. Cho, I. Dhillon, Y. Guan, and S. Sra. Minimum sum-squared residue co-clustering of gene expression data. In Proceedings of SDM 2004.
[4] J. M. Conroy and D. P. O'Leary. Text summarization via hidden Markov models. In SIGIR '01: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001.
[5] I. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of SIGKDD 2001.
[6] I. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: spectral clustering and normalized cuts. In Proceedings of SIGKDD 2004.
[7] I. Dhillon, S. Mallela, and S. Modha. Information-theoretic co-clustering. In Proceedings of SIGKDD 2003.
[8] C. Ding, X. He, and H. Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. In Proceedings of SIAM Data Mining, 2005.
[9] C. Ding and T. Li. Adaptive dimension reduction using discriminant analysis and k-means clustering. In Proceedings of ICML 2007.
[10] C. Ding, T. Li, and W. Peng. NMF and PLSI: equivalence and a hybrid algorithm. In SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006.
[11] C. Ding, T. Li, W. Peng, and H. Park.
Orthogonal nonnegatve matrx tr-factorzatons for clusterng. In Proceedngs of SIGKDD 2006, [2] R. Duda, P. Hart, and D. Stork. Pattern Classfcaton. John Wley and Sons, Inc., 200. Fgure. Vsualzaton Results on CSTR Data. CSTR has 4 clusters. [3] G. Erkan and D. Radev. Lexpagerank: Prestge n multdocument text summarzaton. In Proceedngs of EMNLP

[14] E. Gaussier and C. Goutte. Relation between PLSA and NMF and implications. In SIGIR '05: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2005.
[15] J. Goldstein, M. Kantrowitz, V. Mittal, and J. Carbonell. Summarizing text documents: Sentence selection and evaluation metrics. In Research and Development in Information Retrieval, pages 121-128, 1999.
[16] Y. Gong and X. Liu. Generic text summarization using relevance measure and latent semantic analysis. In Proceedings of SIGIR 2001.
[17] A. Graesser, A. Karnavat, and V. Pomeroy. Latent semantic analysis captures causal, goal-oriented, and taxonomic structures. In Proceedings of CogSci.
[18] Q. Gu and J. Zhou. Local learning regularized nonnegative matrix factorization. In IJCAI 2009.
[19] E.-H. S. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. WebACE: A web agent for document categorization and exploration, 1998.
[20] J. A. Hartigan and M. A. Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1):100-108, 1979.
[21] S. Havre, E. Hetzler, P. Whitney, and L. Nowell. ThemeRiver: Visualizing thematic changes in large document collections. IEEE Transactions on Visualization and Computer Graphics, 8(1):9-20, 2002.
[22] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the Twenty-Second Annual International SIGIR Conference, 1999.
[23] L. Jing, M. K. Ng, and J. Z. Huang. An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data. IEEE Transactions on Knowledge and Data Engineering, 19(8):1026-1041, 2007.
[24] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, 2001.
[25] T. Li. The relationships among various nonnegative matrix factorization methods for clustering. In ICDM 2005.
[26] T. Li, S. Ma, and M. Ogihara. Document clustering via adaptive subspace iteration. In Proceedings of the Twenty-Seventh Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2004), 2004.
[27] C.-Y. Lin and E. Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of HLT-NAACL 2003.
[28] C.-Y. Lin and E. Hovy. From single to multi-document summarization: a prototype system and its evaluation. In ACL '02: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002.
[29] A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. mccallum/bow, 1996.
[30] S. Park, J.-H. Lee, D.-H. Kim, and C.-M. Ahn. Multi-document summarization based on cluster using non-negative matrix factorization. In SOFSEM '07: Proceedings of the 33rd Conference on Current Trends in Theory and Practice of Computer Science, 2007.
[31] D. Radev, H. Jing, M. Stys, and D. Tam. Centroid-based summarization of multiple documents. Information Processing and Management, 2004.
[32] E. Rennison. Galaxy of news: an approach to visualizing and understanding expansive news landscapes. In UIST '94, pages 3-12, 1994.
[33] D. Shen, J.-T. Sun, H. Li, Q. Yang, and Z. Chen. Document summarization using conditional random fields. In IJCAI '07: Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007.
[34] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 1997.
[35] J. Stasko, C. Görg, and Z. Liu. Jigsaw: supporting investigative analysis through interactive visualization. Information Visualization, 7(2):118-132, 2008.
[36] A. Strehl, J. Ghosh, and C. Cardie. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583-617, 2002.
[37] D. Wang, C. H. Q. Ding, and T. Li. Feature subset non-negative matrix factorization and its applications to document understanding. In SIGIR, 2010.
[38] D. Wang, T. Li, S. Zhu, and C. Ding. Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization. In Proceedings of SIGIR 2008.
[39] F. Wang, C. Zhang, and T. Li. Regularized clustering for documents. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 95-102, 2007.
[40] W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In Proceedings of SIGIR 2003.
[41] W.-t. Yih, J. Goodman, L. Vanderwende, and H. Suzuki. Multi-document summarization by maximizing informative content-words. In IJCAI '07: Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007.
[42] S. X. Yu and J. Shi. Multiclass spectral clustering. In ICCV '03.
[43] H. Zha, X. He, C. Ding, M. Gu, and H. Simon. Bipartite graph partitioning and data clustering. In Proceedings of the International Conference on Information and Knowledge Management (CIKM 2001).
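APPENDIX: ILLUSTRATIVE SKETCH

The conclusion summarizes the key idea: factorize a non-negative term-document matrix to obtain clusters while scoring keyword importance. The exact WFS-NMF update rules are not reproduced in this section, so the sketch below is only a minimal illustration of that idea, not the authors' algorithm: it runs the standard Lee and Seung multiplicative updates [24] for X ≈ FG and then derives a hypothetical keyword-importance score from the basis matrix F (the score and the function name are illustrative assumptions, not from the paper).

```python
import numpy as np

def nmf_with_feature_scores(X, k, n_iter=200, eps=1e-9, seed=0):
    """Illustrative stand-in, NOT the paper's WFS-NMF updates.

    Runs plain multiplicative-update NMF (Lee and Seung) on a
    non-negative terms x documents matrix X ~= F @ G, then reads a
    heuristic keyword score off the basis F.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F = rng.random((m, k)) + eps   # terms x topics (basis)
    G = rng.random((k, n)) + eps   # topics x documents (encoding)
    for _ in range(n_iter):
        # Multiplicative updates for the squared Frobenius objective
        G *= (F.T @ X) / (F.T @ F @ G + eps)
        F *= (X @ G.T) / (F @ G @ G.T + eps)
    scores = F.max(axis=1)         # hypothetical score: max topic loading per term
    labels = G.argmax(axis=0)      # hard cluster label per document
    return F, G, scores, labels
```

On a toy terms x documents matrix with two disjoint term blocks, the argmax of G recovers the two document groups, which is the clustering side of the approach; replacing the uniform treatment of terms and documents with learned weights is what distinguishes FS-NMF/WFS-NMF from this plain sketch.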


More information

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS ARPN Journal of Engneerng and Appled Scences 006-017 Asan Research Publshng Network (ARPN). All rghts reserved. NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS Igor Grgoryev, Svetlana

More information

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching A Fast Vsual Trackng Algorthm Based on Crcle Pxels Matchng Zhqang Hou hou_zhq@sohu.com Chongzhao Han czhan@mal.xjtu.edu.cn Ln Zheng Abstract: A fast vsual trackng algorthm based on crcle pxels matchng

More information

Robust Dictionary Learning with Capped l 1 -Norm

Robust Dictionary Learning with Capped l 1 -Norm Proceedngs of the Twenty-Fourth Internatonal Jont Conference on Artfcal Intellgence (IJCAI 205) Robust Dctonary Learnng wth Capped l -Norm Wenhao Jang, Fepng Ne, Heng Huang Unversty of Texas at Arlngton

More information

Hybridization of Expectation-Maximization and K-Means Algorithms for Better Clustering Performance

Hybridization of Expectation-Maximization and K-Means Algorithms for Better Clustering Performance BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 16, No 2 Sofa 2016 Prnt ISSN: 1311-9702; Onlne ISSN: 1314-4081 DOI: 10.1515/cat-2016-0017 Hybrdzaton of Expectaton-Maxmzaton

More information

Outline. Self-Organizing Maps (SOM) US Hebbian Learning, Cntd. The learning rule is Hebbian like:

Outline. Self-Organizing Maps (SOM) US Hebbian Learning, Cntd. The learning rule is Hebbian like: Self-Organzng Maps (SOM) Turgay İBRİKÇİ, PhD. Outlne Introducton Structures of SOM SOM Archtecture Neghborhoods SOM Algorthm Examples Summary 1 2 Unsupervsed Hebban Learnng US Hebban Learnng, Cntd 3 A

More information

Clustering Algorithm Combining CPSO with K-Means Chunqin Gu 1, a, Qian Tao 2, b

Clustering Algorithm Combining CPSO with K-Means Chunqin Gu 1, a, Qian Tao 2, b Internatonal Conference on Advances n Mechancal Engneerng and Industral Informatcs (AMEII 05) Clusterng Algorthm Combnng CPSO wth K-Means Chunqn Gu, a, Qan Tao, b Department of Informaton Scence, Zhongka

More information

Multi-Source Multi-View Clustering via Discrepancy Penalty

Multi-Source Multi-View Clustering via Discrepancy Penalty Mult-Source Mult-Vew Clusterng va Dscrepancy Penalty Wexang Shao, Jawe Zhang, Lfang He, Phlp S. Yu Unversty of Illnos at Chcago Emal: wshao4@uc.edu, jzhan9@uc.edu, psyu@uc.edu Shenzhen Unversty, Chna Emal:

More information

Classification / Regression Support Vector Machines

Classification / Regression Support Vector Machines Classfcaton / Regresson Support Vector Machnes Jeff Howbert Introducton to Machne Learnng Wnter 04 Topcs SVM classfers for lnearly separable classes SVM classfers for non-lnearly separable classes SVM

More information

Overlapping Clustering with Sparseness Constraints

Overlapping Clustering with Sparseness Constraints 2012 IEEE 12th Internatonal Conference on Data Mnng Workshops Overlappng Clusterng wth Sparseness Constrants Habng Lu OMIS, Santa Clara Unversty hlu@scu.edu Yuan Hong MSIS, Rutgers Unversty yhong@cmc.rutgers.edu

More information

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION SHI-LIANG SUN, HONG-LEI SHI Department of Computer Scence and Technology, East Chna Normal Unversty 500 Dongchuan Road, Shangha 200241, P. R. Chna E-MAIL: slsun@cs.ecnu.edu.cn,

More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS46: Mnng Massve Datasets Jure Leskovec, Stanford Unversty http://cs46.stanford.edu /19/013 Jure Leskovec, Stanford CS46: Mnng Massve Datasets, http://cs46.stanford.edu Perceptron: y = sgn( x Ho to fnd

More information

Web Document Classification Based on Fuzzy Association

Web Document Classification Based on Fuzzy Association Web Document Classfcaton Based on Fuzzy Assocaton Choochart Haruechayasa, Me-Lng Shyu Department of Electrcal and Computer Engneerng Unversty of Mam Coral Gables, FL 33124, USA charuech@mam.edu, shyu@mam.edu

More information

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 1. SSDH: Semi-supervised Deep Hashing for Large Scale Image Retrieval

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 1. SSDH: Semi-supervised Deep Hashing for Large Scale Image Retrieval IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY SSDH: Sem-supervsed Deep Hashng for Large Scale Image Retreval Jan Zhang, and Yuxn Peng arxv:607.08477v2 [cs.cv] 8 Jun 207 Abstract Hashng

More information

Semi-Supervised Discriminant Analysis Based On Data Structure

Semi-Supervised Discriminant Analysis Based On Data Structure IOSR Journal of Computer Engneerng (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 3, Ver. VII (May Jun. 2015), PP 39-46 www.osrournals.org Sem-Supervsed Dscrmnant Analyss Based On Data

More information

Virtual Machine Migration based on Trust Measurement of Computer Node

Virtual Machine Migration based on Trust Measurement of Computer Node Appled Mechancs and Materals Onlne: 2014-04-04 ISSN: 1662-7482, Vols. 536-537, pp 678-682 do:10.4028/www.scentfc.net/amm.536-537.678 2014 Trans Tech Publcatons, Swtzerland Vrtual Machne Mgraton based on

More information

An Improvement to Naive Bayes for Text Classification

An Improvement to Naive Bayes for Text Classification Avalable onlne at www.scencedrect.com Proceda Engneerng 15 (2011) 2160 2164 Advancen Control Engneerngand Informaton Scence An Improvement to Nave Bayes for Text Classfcaton We Zhang a, Feng Gao a, a*

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

5 The Primal-Dual Method

5 The Primal-Dual Method 5 The Prmal-Dual Method Orgnally desgned as a method for solvng lnear programs, where t reduces weghted optmzaton problems to smpler combnatoral ones, the prmal-dual method (PDM) has receved much attenton

More information