Learning Topic Structure in Text Documents using Generative Topic Models


Nitish Srivastava, CS 397 Report. Advisor: Dr Harish Karnick

Abstract— We present a method for estimating the topic structure of a document corpus by combining efficient clustering algorithms with generative topic models. Our method builds the topic structure as a DAG in a layer-wise manner by estimating the number of topics in each layer and constructing topic models for them. It discovers subsumption relations between topic nodes in consecutive layers by mapping the problem to finding maximum edge-weighted cliques on small, sparse graphs. We analyze the sparsity of the induced graphs and give bounds on the running time of our algorithm. Most existing methods for this problem use variants of hierarchical Dirichlet processes to provide a nonparametric prior on the number of topics and estimate that number as the topic model is built. While these models have been shown to perform well under certain metrics, the estimated number of topics is often quite large, limiting the utility of the resulting topic structures. Our approach is to build structures in which the number of topics is smaller and comparable to that in structures built by human experts. We evaluate our method using real-world text datasets.

Keywords— Data management; Data models; Hierarchical systems; Pattern clustering methods

I. INTRODUCTION

Topic models have been shown to be effective tools for the analysis of document corpora, and learning the topic structure is an important problem in this regard. Our method builds arbitrary DAG topic structures over a collection of documents, where nodes represent topics and edges point from a topic to a more specific sub-topic. Our key contributions are the estimation of the number of topics in each layer of the structure and the discovery of subsumption relations between topic nodes in adjacent layers, cast as a maximum-clique problem. Topic models have been proposed which automatically learn the number of topics using various forms of Hierarchical Dirichlet Processes (HDP, Teh et al. [1]), such as nonparametric Bayes PAM [2].
These models place a nonparametric prior on the number of topics. We explore a different strategy for topic structure extraction: we separate the process of learning the topic structure from that of learning the generative topic model. We build the topic structure using simple clustering techniques combined with single-layer topic models. These structures can then be used by the various sophisticated topic models which require the number of topics and the hierarchy structure to be specified; models such as hierarchical Latent Dirichlet Allocation (hLDA, [3]), the Correlated Topic Model (CTM, [4]), the Pachinko Allocation Model (PAM, [5]) and mixture-model extensions of these are some such methods.

Document clustering is a widely studied field; Zhao et al. [6] give an excellent comparison of several clustering paradigms. We use a recursive-bisection-based clustering algorithm [7] to k-cluster a dataset and evaluate the quality of each clustering, and use this to approximate the number of topics at different depths in the DAG. Successive depth levels in the DAG ("layers") represent more fine-grained topics. We learn single-layer topic models for each depth. The topics in consecutive layers are then linked by subsumption relations. We cast the problem of discovering subsumption relations as finding a maximum edge-weighted clique over a sparse topic graph. Besides finding subsumption relations, our formulation also merges very similar topics in the same layer, to further prune the topic structure and make up for errors made in the clustering phase.

In order to validate our results, we use the 20 newsgroups comp5 (ngcomp5) and the NIPS abstracts datasets. These datasets have single-level categories. In order to do a more meaningful evaluation, we collected a hierarchical dataset from a crawl of the webpages linked from the Open Directory Project (ODP). This allows us to compare against human-expert-determined topic structures.

II.
ESTIMATING THE NUMBER OF TOPICS

In this section, we describe a simple approach to obtaining a rough estimate of the number of topics in a document corpus. Nonparametric variants of topic models which use Hierarchical Dirichlet Processes (HDP) [1], such as the nonparametric Bayes Pachinko Allocation Model (NPB-PAM [2]), have been used to estimate the number of topics while building the corresponding topic model. These methods place a nonparametric prior on the number of topics and estimate that number as the model is built. They have been shown to work well in terms of evaluation metrics which involve likelihood estimates (empirical or otherwise) or classification accuracies on held-out data. Inferring the number of topics using HDPs gives more specific topics (as qualitatively shown for the case of NPB-PAM). However, the average number of topics estimated is often very high; for example, NPB-PAM discovers 179 sub-topics in the 20 newsgroups comp5 dataset, which contains 5 topics. Such models are good for application domains where the discovered topics themselves serve only as latent variables. It is not clear whether these topics would correspond to a human-determined ontology. Our goal is to generate a topic structure with typically fewer topic nodes and more resemblance to a topic hierarchy, applicable in domains that use the actual topic structure for organizing unstructured text information.

We begin by obtaining an approximate number of topics using a clustering algorithm based on repeated bisections. Zhao and Karypis [6] have shown that this method produces better clusterings than agglomerative or graph-based methods. A key characteristic of this approach is that it uses a global criterion function whose optimization drives the entire clustering process. The criterion function used is

maximize \sum_{i=1}^{k} \sqrt{ \sum_{u,v \in S_i} u \cdot v }

where each document u is represented as a vector of normalized tf values. We k-cluster the set of documents for each k \in {2, 3, ..., K}, where K is the maximum number of topics allowed in any layer of the topic DAG. Computing a K-clustering takes O(NNZ log K) time, where NNZ is the number of non-zero entries in the document similarity matrix. In the process of K-clustering the dataset, the algorithm also produces k-clusterings for all k < K, since the method performs repeated bisections. Hence, computing all clusterings from 2 to K also takes O(NNZ log K) time. Each clustering C_k corresponds to a set of clusters {C_k^1, C_k^2, ..., C_k^k}. The quality of each C_k is evaluated using the function

F(\mu_{int}, \mu_{ext}, \sigma_{int}) = \frac{ \sum_{i=1}^{k} |C_k^i| \, \mu_{ext}^i \sigma_{int}^i / \mu_{int}^i }{ \sum_{i=1}^{k} |C_k^i| }    (1)

Here \mu_{int} denotes the vector of average intra-cluster similarities, \mu_{ext} the vector of average inter-cluster similarities, and \sigma_{int} the vector of standard deviations of intra-cluster similarity. All similarities are cosine similarities. A lower value of the function indicates a better clustering.

The intuition behind the criterion function is to have low similarity with documents outside the cluster and high similarity, with low standard deviation, inside it. This is to ensure that the clusters are tight. The ratio is weighted by the size of the cluster. This function is computed for each k-clustering, giving a sequence of quality values q_1, q_2, ..., q_K corresponding to the sequence of topic numbers; a low value implies a good clustering. Let {m_1, m_2, ..., m_l} be the sequence of topic numbers corresponding to local minima in the sequence of quality values. These minima represent the regions where the numbers of topics are such that the clustering produced is locally optimal. These are potentially the numbers of topics at increasing depths in the topic structure. The intuition behind this is as follows. Suppose that a document corpus has c topics. If the corpus were clustered into a (c-1)- or (c+1)-clustering, some of the actual c clusters would have to redistribute their documents to fit. This would decrease the tightness of the clusters, which would be captured by the criterion function of Eq. (1). We expect the value of the function to be higher on either side of a good clustering. Figure 1 shows plots of q_1, q_2, ..., q_30 for different datasets.

[Figure 1. Plot of criterion function vs. number of topics: (a) ODP, (b) NIPS, (c) ODP/Computers.]

With this heuristic in place, the goal of the clustering step is to determine {m_1, m_2, ..., m_L}, the sequence of approximate numbers of topics in each layer. For example, in Figure 1(b) the sequence of minima is {6, 9, 14, 18, ...}. These will be chosen as the approximate numbers of topics
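The quality function of Eq. (1) and the extraction of local minima can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the paper: the function names, the dense cosine-similarity-matrix input, the use of the document count as the size normalizer, and the strict-inequality definition of a local minimum are our own assumptions.

```python
import numpy as np

def cluster_quality(sim, labels):
    """Criterion of Eq. (1): size-weighted mean of mu_ext * sigma_int / mu_int.

    sim    : (N, N) cosine-similarity matrix of the documents
    labels : length-N array of cluster ids for one k-clustering
    Lower values indicate tighter, better-separated clusters.
    """
    total, n = 0.0, len(labels)
    for c in np.unique(labels):
        inside = labels == c
        intra = sim[np.ix_(inside, inside)]
        # off-diagonal intra-cluster similarities (drop self-similarity)
        offdiag = intra[~np.eye(inside.sum(), dtype=bool)]
        mu_int, sigma_int = offdiag.mean(), offdiag.std()
        mu_ext = sim[np.ix_(inside, ~inside)].mean()
        total += inside.sum() * mu_ext * sigma_int / mu_int
    return total / n

def local_minima(q):
    """Topic numbers k at which q is a strict local minimum.

    q[0] is assumed to correspond to k = 2, so index i maps to k = i + 2.
    """
    return [i + 2 for i in range(1, len(q) - 1)
            if q[i] < q[i - 1] and q[i] < q[i + 1]]
```

For a quality sequence q_2, ..., q_6 = (5, 3, 4, 2, 6), `local_minima` returns the candidate layer sizes [3, 5].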

in the topic structure corresponding to the NIPS abstracts dataset.

The underlying assumption that mutually exclusive document clusters are representative of topics is not entirely accurate, since the actual topic structure may be more complex and documents may contain content best described as a mixture of different topics. Topics may exhibit significant correlations. However, the purpose of this clustering is not to discover this rich underlying structure but to get a rough estimate of the number of topics in each layer of the topic structure. This estimate is important because it gives a starting point for the structure-building algorithm. More accurate topic numbers, along with the subsumption relations between them, are discovered later. Table ?? shows the top few words in the clusters corresponding to the local minima of the criterion function. While the clusters may not be extremely accurate, they are good enough to justify their use as approximations. Section xxx compares the final topic numbers and structure with these approximations.

III. BUILDING THE DAG

A. Building single-layer topic models

The next step in building the topic structure is to find topic models for each of the topic numbers {m_1, m_2, ..., m_l} found in the previous section. We treat each layer independently and build single-layer topic models for them. Our method allows the use of any statistical topic model as long as it is possible to infer the topic proportions present in each document under that model. In this paper, we demonstrate our method using simple Latent Dirichlet Allocation [8], though it is possible to use more sophisticated models; the simplicity of the model lets us demonstrate the key idea behind the process of discovering subsumption relations more effectively. The framework of building independent models and then subsuming consecutive layers has previously been used by Zavitsanos et al. [9], who demonstrated its use in building document ontologies using LDA. Latent Dirichlet Allocation is a generative probabilistic model in which documents are represented as random mixtures over latent topics.
Topics are themselves modeled as an infinite mixture over an underlying set of topic probabilities. The generative process can be summarized as follows: for each document d, draw topic proportions \theta_d ~ Dirichlet(\alpha); then, for each word position in d, draw a topic z ~ Multinomial(\theta_d) and a word w ~ Multinomial(\beta_z). We use the variational methods described in Blei et al. [8] to infer the topic proportions in each document.

We have the sequence (m_i)_{i=1}^{L} of topic numbers estimated from the clustering step. For each element m_i of this sequence, we build an LDA topic model T_i. For each document d of the corpus, the variational inference method gives a posterior Dirichlet parameter \gamma_d, where the proportion of topic u_i is represented by the component \gamma_{d,u_i}. One way to interpret this is to define a probability function P over the set of topics {u_1, u_2, ..., u_m}, where P(u_i) denotes the probability of occurrence of topic u_i in a random document. Given the document corpus D, this can be estimated as

P(u_i) = \frac{1}{|D|} \sum_{d \in D} \bar{\gamma}_{d,u_i},  where  \bar{\gamma}_{d,u_i} = \frac{\gamma_{d,u_i}}{\sum_{l=1}^{m} \gamma_{d,u_l}}

The joint probability distribution P(u_i, u_j) can be estimated as

P(u_i, u_j) = \frac{1}{|D|} \sum_{d \in D} \bar{\gamma}_{d,u_i} \bar{\gamma}_{d,u_j}    (2)

The joint probabilities are a measure of the co-occurrence of different topics. This idea is exploited next for discovering subsumption relations.

B. Subsumption as a maximum weight clique problem

Now that we have the topic models for each layer, the next step is to find subsumption relations between consecutive layers. The problem of discovering subsumption relations in ontologies has been widely studied. One of the most relevant works in the context of building topic structures is by Zavitsanos et al. [9], who use the idea of conditional independence between sub-topics given a candidate parent topic to decide subsumption relations. In their method, if {u_1, u_2, ..., u_m} is a topic layer and {v_1, v_2, ..., v_n} is the layer of topics just below it, then each pair (v_i, v_j) is assigned a parent u_k if the occurrence of u_k in a document makes the occurrences of topics v_i and v_j conditionally independent. The intuition is that if the co-occurrence of topics v_i and v_j is nullified once we know the parent u_k occurs, then the parent captures what is common between the topics and is therefore a good choice to subsume v_i and v_j.
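These estimates translate directly into a few lines of array code. The sketch below assumes the per-document variational parameters are stacked into a |D| x m matrix; the function name and this layout are illustrative assumptions, not part of the original implementation.

```python
import numpy as np

def topic_probabilities(gamma):
    """Estimate marginal and joint topic-occurrence probabilities from the
    per-document variational Dirichlet parameters gamma (shape: |D| x m).

    Each row is normalized so gamma_bar[d, u] is the proportion of topic u
    in document d; averaging over documents gives P(u_i), and averaging the
    products of proportions gives P(u_i, u_j) as in Eq. (2).
    """
    gamma_bar = gamma / gamma.sum(axis=1, keepdims=True)
    p_marginal = gamma_bar.mean(axis=0)                  # P(u_i)
    p_joint = gamma_bar.T @ gamma_bar / gamma.shape[0]   # P(u_i, u_j)
    return p_marginal, p_joint
```

By construction the marginals sum to 1 and the joint matrix is symmetric, which is a useful sanity check on any implementation.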
Such a criterion for subsumption is reasonable, but it suffers from the constraint that it decides the subsumption relations based on pair-wise independence alone. In this paper, we use a different criterion and a more general graph framework that captures pair-wise relations and uses a clique-based method to build higher-order sub-topic sets before determining subsumption relations.

Let there be m parent topics and n child topics in some pair of consecutive layers. For each child topic v_i we find the u_k such that P(v_i | u_k) is maximum, where the probabilities are defined as in Eq. (2). The topic v_i is then directly subsumed under u_k. This process associates each child topic with exactly one parent, building a tree structure. Next, we discover pair-wise subsumption relations, i.e., for each pair of distinct child topics (v_i, v_j) we find the parent topic u_k which best subsumes them; we later use these pair-wise relations to determine global subsumption relations. We would like u_k to subsume v_i and v_j if they could be considered sub-topics of u_k. For them to be sub-topics, two conditions must hold: v_i and v_j should be associated with u_k, but at the same time they must be sufficiently separate so that they can be

considered separate sub-topics. These two conditions can be interpreted as demanding that P(v_i | u_k) and P(v_j | u_k) should both be high but P(v_i, v_j | u_k) must be low, i.e., the joint probability that v_i and v_j occur together given u_k must be low. In other words, each sub-topic individually must have a high conditional probability of occurrence given that the parent topic occurs (indicating that the sub-topics capture part of the parent topic's vocabulary), but at the same time the two sub-topics must not occur together very often given the parent (if they do, they are not separate enough to be considered individual sub-topics). In this sense, the occurrences of v_i and v_j must be negatively correlated given u_k. Hence for every (v_i, v_j), we can find a

u_k = argmax_{u} ( P(v_i | u) P(v_j | u) - P(v_i, v_j | u) )    (3)

subject to the objective value being at least th, where th is a positive threshold on the objective function to suppress very small values. If no such u_k exists, then (v_i, v_j) cannot be subsumed under any parent topic. A positive threshold ensures that the pair of sub-topics respects the two conditions above. A slightly different way to look at this objective function is to note that we want

P(v_j | v_i, u_k) \le P(v_j | u_k)
\Leftrightarrow P(v_i | u_k) P(v_j | v_i, u_k) \le P(v_i | u_k) P(v_j | u_k)
\Leftrightarrow P(v_i, v_j | u_k) \le P(v_i | u_k) P(v_j | u_k)

and the further the left side falls below the right, the better.

Solving the optimization problem in Eq. (3) gives the best parent topic under which a pair of child topics may be subsumed. We emphasize that neither of these child topics might actually be subsumed under this parent in the final hierarchy; this is just the best parent that could subsume this pair. Next, we construct a graph G_k for each parent topic u_k. G_k consists of those sub-topics which occur as part of a pair that is best subsumed under u_k, i.e.,

V(G_k) = { i : u_k subsumes (v_i, v_j) for some j }    (4)
E(G_k) = { (i, j) : u_k subsumes (v_i, v_j) }    (5)
wt(i, j) = P(v_i | u_k) P(v_j | u_k) - P(v_i, v_j | u_k)    (6)

Figure 2 shows some examples of such graphs for the Reuters dataset. The three graphs correspond to the three parent topics; the child layer consists of 8 topics. Note that any clique C in such a graph represents a set of topics which are mutually negatively correlated in the sense of Eq. (3). We want only the largest such set to be subsumed under the corresponding parent. For example, in the first graph in Figure 2, topics 8 and 5 are connected by an edge, meaning that they are negatively correlated; topics 8 and 4 are also connected. However, topics 4 and 5 are not connected, which means they are not sufficiently negatively correlated to be considered separate sub-topics of the parent topic. Hence we would not want both 4 and 5 to be subsumed. The cliques consisting of topics (8,1,4) or (8,2,4) are better choices, since each topic is then sufficiently different from the others; topic 5 would, in essence, be captured by topic 4. The weights associated with the edges denote the strengths of the corresponding negative correlations. All vertices in the maximum edge-weighted clique should be subsumed under the parent topic. Hence the problem of determining the best subsumption set for any parent topic k reduces to finding the maximum weight clique in the graph G_k.

Though this problem is hard to solve in general, we observed that the graphs G_k induced by the real datasets we experimented on are typically very sparse (no more than 1 connected component and no more than 6-7 vertices in any graph). Hence, even an exponential-time algorithm is acceptable, keeping in mind that this ontology-building exercise is done offline. The pair-wise independence criterion used by Zavitsanos et al. [9] can be seen as a special case of this method, in which only the maximum 2-clique is used for building subsumption relations. The next section explores this graph and the weight function in more detail and gives bounds on the sparsity of the induced graphs.

C. Sparsity bounds

Let there be m parent topics and n child topics in some pair of consecutive layers. An edge (i, j) can be in G_k for a unique value of k, since only the best candidate parent u_k subsumes (v_i, v_j). There are O(n^2) such edges distributed across m graphs. Hence, the expected number of edges in each graph is O(n^2 / m). Since m and n are the numbers of topics in adjacent layers, we can assume that m = O(n); therefore, the expected number of edges per graph is O(n), which means the maximum clique can be of size O(\sqrt{n}). Consider the function

f_k(i, j) = P(i | k) P(j | k) - P(i, j | k)

where i and j refer to child topics and k refers to the parent. If the occurrences of topics i and j are independent given k, then P(i, j | k) = P(i | k) P(j | k) and hence f_k(i, j) = 0. If their occurrences are negatively correlated, f_k(i, j) > 0.
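Since the induced graphs rarely exceed 6-7 vertices, exhaustively enumerating vertex subsets is affordable. The following is a sketch of such a brute-force maximum edge-weighted clique search; the representation of G_k as a dictionary of frozenset edge weights is our own choice, not the paper's.

```python
from itertools import combinations

def max_weight_clique(vertices, wt):
    """Exhaustively find the clique with maximum total edge weight.

    vertices : iterable of vertex ids of G_k
    wt       : dict mapping frozenset({i, j}) -> edge weight; pairs absent
               from wt are non-edges.  Weights are positive by construction
    (Eq. (3) keeps only pairs whose objective exceeds the threshold th),
    so starting from weight 0 and an empty clique is safe.  Exhaustive
    search is feasible because the graphs have at most 6-7 vertices.
    """
    best, best_w = (), 0.0
    vs = list(vertices)
    for r in range(2, len(vs) + 1):
        for subset in combinations(vs, r):
            pairs = [frozenset(p) for p in combinations(subset, 2)]
            if all(p in wt for p in pairs):          # subset is a clique
                w = sum(wt[p] for p in pairs)
                if w > best_w:
                    best, best_w = subset, w
    return best, best_w
```

On the Figure 2 example, with edges 8-5, 8-4, 8-1, 1-4, 8-2 and 2-4 (and no 4-5 edge), the search returns the clique {8, 1, 4} whenever its edges carry the larger weights, matching the discussion above.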

[Figure 2. Determining subsumption relations: (a) graphs G_k for Reuters 25 topics; (b) top few layers of the ontology.]

Also,

\sum_{i,j} f_k(i, j) = \sum_{i,j} ( P(i | k) P(j | k) - P(i, j | k) )
                     = \sum_i P(i | k) \sum_j P(j | k) - \sum_{i,j} P(i, j | k)
                     = 1 \cdot 1 - 1 = 0

D. Model Trimming

So far we have built the topic structure assuming that the numbers of topics chosen in the clustering step were correct. Clustering methods are, however, susceptible to errors and use a very primitive notion of similarity. The objective function of Eq. (3) can be put to further use for trimming the topic structure. This time, we look at it as

w_{i,j}(k) = P(v_i | u_k) P(v_j | u_k) - P(v_i, v_j | u_k)

Recall that a high value of w_{i,j}(k) means that topics v_i and v_j are separated enough from each other, and yet associated sufficiently with u_k, to be considered its sub-topics. A low value, on the other hand, means that the topics are quite similar and, as far as u_k is concerned, might be merged. This can be seen as a scheme where each parent topic u_k votes on whether (v_i, v_j) should be merged, the value of the vote being w_{i,j}(k). A high vote value means that the corresponding parent topic wants the pair to be kept separate; a low value indicates that they should be merged. These votes are then pooled together to get a final value:

W(i, j) = \sum_{k=1}^{m} w_{i,j}(k)

If W(i, j) is positive, the pair is left alone; otherwise it is merged. A possible improvement is to weigh the vote of each parent k by P(v_i | u_k) P(v_j | u_k). This factor ensures that the vote of a parent topic whose occurrence is more correlated with the occurrences of the child topics is taken more seriously, the idea being that such a parent carries statistically more weight than a parent topic which has nothing to do with this pair of child topics:

W(i, j) = \sum_{k=1}^{m} P(v_i | u_k) P(v_j | u_k) w_{i,j}(k)

Merge (v_i, v_j) if W(i, j) < 0. While the intuition behind the method seems sound, more experiments need to be done to evaluate it.

IV. EVALUATION

The aim of this study is to provide a method to build semantically meaningful topic structures which are similar to those built by human experts.
We compare the topic structures determined using our method with human-built ones.

A. Datasets

We chose a subset of the Open Directory Project (ODP) category structure and crawled a random selection of webpages linked under those categories. This data was collected using the RDF dump available from the ODP homepage, accessed on 10th April. Besides this, we use standard datasets such as the NIPS abstracts dataset, Reuters, and the 20 newsgroups dataset.

[Figure 3. Plot of criterion function vs. number of topics for different subsets of the Reuters dataset: (a) Reuters-13, (b) Reuters-25.]

[Figure 4. Top 3 layers of the ontology for the Reuters dataset. Black lines indicate the major parent; blue lines indicate a lesser degree of association.]

B. Quality of clustering

Reuters topic dataset. Minima of the criterion function occur at 3, 8, 12, 20 and 24 topics; see Figure 3. The top few words (shown as stems) for a 3-topic clustering using LDA:

Topic 1: ship, offic, union, strike, gulf
Topic 2: tonne, mln, wheat, sugar, export, grain
Topic 3: oil, price, mln, dlr, pct, produc

Top few words for an 8-topic clustering using LDA:

Topic 1: price, market, dlr, futur, exchange, trade
Topic 2: oil, mln, pct, dlr, gold
Topic 3: tonne, export, sugar, wheat, mln
Topic 4: ship, strike, compan, port, union
Topic 5: oil, opec, price, mln, bpd, saud
Topic 6: coffee, produc, export, quota, stock, cocoa
Topic 7: mln, crop, tonne, pct, product, grain, plant
Topic 8: propos, offic, farm, wheat, agricultur, grain

Figure 2 shows the graphs generated for deciding subsumption between the 8-topic layer and the 3-topic layer, and Figure 4 shows the first 3 layers of the generated ontology tree. The subsumption seems quite appropriate considering the topic keywords given above.

ODP/Comp. This data was collected by us using the RDF dump available from the Open Directory Project page, accessed on 10th April. A subset of the science directory URLs was crawled and converted to bags of words. Topic structure: Algorithms; Artificial Intelligence: Academic Departments, People, Conferences and Events, Machine Learning, Natural Language, Neural Networks, Vision; Hardware: Buses, Cables, Peripherals (Audio, Keyboards, Displays, Printers); Open Source; Systems; Internet: Organizations, Searching, Web design and development;

Security: Firewalls, Intrusion Detection Systems, Malicious software.

Figure 1(c) shows the plot of the cluster-quality criterion function for this dataset. A significant amount of data cleaning needs to be done before this set is usable for further experiments.

This presents a very qualitative analysis of the method; a more quantitative comparison needs to be done to fully validate it. We are currently working on implementing other methods, such as nonparametric LDA and nonparametric PAM, which also try to estimate the size and structure of a topic hierarchy. We plan to compare our method against these on metrics such as perplexity and deviation from human-defined structure.

V. CONCLUSIONS AND FUTURE WORK

Our method estimates the size and structure of the topic space underlying a set of documents. However, an important step in using this method for many practical applications is to develop a method for category naming, to label each of the discovered categories. Our method currently lacks a formal justification and a quantitative analysis; we plan to work further in these directions.

REFERENCES

[1] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, "Hierarchical Dirichlet processes," Journal of the American Statistical Association, vol. 101, no. 476, pp. 1566-1581, 2006.
[2] W. Li, D. Blei, and A. McCallum, "Nonparametric Bayes pachinko allocation," in UAI '07, 2007. [Online]. Available: mccallum/papers/npbpamua2007s.pdf
[3] D. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum, "Hierarchical topic models and the nested Chinese restaurant process," in Advances in Neural Information Processing Systems 16, 2003. [Online]. Available: gruffydd/papers/ncrp.pdf
[4] D. M. Blei and J. D. Lafferty, "Correlated topic models," in NIPS, 2005.
[5] W. Li and A. McCallum, "Pachinko allocation: DAG-structured mixture models of topic correlations," in ICML '06: Proceedings of the 23rd International Conference on Machine Learning. New York, NY, USA: ACM, 2006, pp. 577-584.
[6] Y. Zhao and G. Karypis, "Empirical and theoretical comparisons of selected criterion functions for document clustering," Machine Learning, vol. 55, no. 3, pp. 311-331, 2004.
[7] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in KDD Workshop on Text Mining, 2000.
[8] D. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[9] E. Zavitsanos, S. Petridis, G. Paliouras, and G. A. Vouros, "Determining automatically the size of learned ontologies," in ECAI 2008, ser. Frontiers in Artificial Intelligence and Applications, M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, Eds. IOS Press, 2008.


More information

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz Compler Desgn Sprng 2014 Regster Allocaton Sample Exercses and Solutons Prof. Pedro C. Dnz USC / Informaton Scences Insttute 4676 Admralty Way, Sute 1001 Marna del Rey, Calforna 90292 pedro@s.edu Regster

More information

Deep Classification in Large-scale Text Hierarchies

Deep Classification in Large-scale Text Hierarchies Deep Classfcaton n Large-scale Text Herarches Gu-Rong Xue Dkan Xng Qang Yang 2 Yong Yu Dept. of Computer Scence and Engneerng Shangha Jao-Tong Unversty {grxue, dkxng, yyu}@apex.sjtu.edu.cn 2 Hong Kong

More information

Query Clustering Using a Hybrid Query Similarity Measure

Query Clustering Using a Hybrid Query Similarity Measure Query clusterng usng a hybrd query smlarty measure Fu. L., Goh, D.H., & Foo, S. (2004). WSEAS Transacton on Computers, 3(3), 700-705. Query Clusterng Usng a Hybrd Query Smlarty Measure Ln Fu, Don Hoe-Lan

More information

Module Management Tool in Software Development Organizations

Module Management Tool in Software Development Organizations Journal of Computer Scence (5): 8-, 7 ISSN 59-66 7 Scence Publcatons Management Tool n Software Development Organzatons Ahmad A. Al-Rababah and Mohammad A. Al-Rababah Faculty of IT, Al-Ahlyyah Amman Unversty,

More information

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1 4/14/011 Outlne Dscrmnatve classfers for mage recognton Wednesday, Aprl 13 Krsten Grauman UT-Austn Last tme: wndow-based generc obect detecton basc ppelne face detecton wth boostng as case study Today:

More information

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems A Unfed Framework for Semantcs and Feature Based Relevance Feedback n Image Retreval Systems Ye Lu *, Chunhu Hu 2, Xngquan Zhu 3*, HongJang Zhang 2, Qang Yang * School of Computng Scence Smon Fraser Unversty

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervsed Learnng and Clusterng Why consder unlabeled samples?. Collectng and labelng large set of samples s costly Gettng recorded speech s free, labelng s tme consumng 2. Classfer could be desgned

More information

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour 6.854 Advanced Algorthms Petar Maymounkov Problem Set 11 (November 23, 2005) Wth: Benjamn Rossman, Oren Wemann, and Pouya Kheradpour Problem 1. We reduce vertex cover to MAX-SAT wth weghts, such that the

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1)

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1) Secton 1.2 Subsets and the Boolean operatons on sets If every element of the set A s an element of the set B, we say that A s a subset of B, or that A s contaned n B, or that B contans A, and we wrte A

More information

Harvard University CS 101 Fall 2005, Shimon Schocken. Assembler. Elements of Computing Systems 1 Assembler (Ch. 6)

Harvard University CS 101 Fall 2005, Shimon Schocken. Assembler. Elements of Computing Systems 1 Assembler (Ch. 6) Harvard Unversty CS 101 Fall 2005, Shmon Schocken Assembler Elements of Computng Systems 1 Assembler (Ch. 6) Why care about assemblers? Because Assemblers employ some nfty trcks Assemblers are the frst

More information

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr

More information

Concurrent Apriori Data Mining Algorithms

Concurrent Apriori Data Mining Algorithms Concurrent Apror Data Mnng Algorthms Vassl Halatchev Department of Electrcal Engneerng and Computer Scence York Unversty, Toronto October 8, 2015 Outlne Why t s mportant Introducton to Assocaton Rule Mnng

More information

y and the total sum of

y and the total sum of Lnear regresson Testng for non-lnearty In analytcal chemstry, lnear regresson s commonly used n the constructon of calbraton functons requred for analytcal technques such as gas chromatography, atomc absorpton

More information

Kent State University CS 4/ Design and Analysis of Algorithms. Dept. of Math & Computer Science LECT-16. Dynamic Programming

Kent State University CS 4/ Design and Analysis of Algorithms. Dept. of Math & Computer Science LECT-16. Dynamic Programming CS 4/560 Desgn and Analyss of Algorthms Kent State Unversty Dept. of Math & Computer Scence LECT-6 Dynamc Programmng 2 Dynamc Programmng Dynamc Programmng, lke the dvde-and-conquer method, solves problems

More information

Smoothing Spline ANOVA for variable screening

Smoothing Spline ANOVA for variable screening Smoothng Splne ANOVA for varable screenng a useful tool for metamodels tranng and mult-objectve optmzaton L. Rcco, E. Rgon, A. Turco Outlne RSM Introducton Possble couplng Test case MOO MOO wth Game Theory

More information

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION SHI-LIANG SUN, HONG-LEI SHI Department of Computer Scence and Technology, East Chna Normal Unversty 500 Dongchuan Road, Shangha 200241, P. R. Chna E-MAIL: slsun@cs.ecnu.edu.cn,

More information

Reducing Frame Rate for Object Tracking

Reducing Frame Rate for Object Tracking Reducng Frame Rate for Object Trackng Pavel Korshunov 1 and We Tsang Oo 2 1 Natonal Unversty of Sngapore, Sngapore 11977, pavelkor@comp.nus.edu.sg 2 Natonal Unversty of Sngapore, Sngapore 11977, oowt@comp.nus.edu.sg

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

Machine Learning 9. week

Machine Learning 9. week Machne Learnng 9. week Mappng Concept Radal Bass Functons (RBF) RBF Networks 1 Mappng It s probably the best scenaro for the classfcaton of two dataset s to separate them lnearly. As you see n the below

More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

Determining the Optimal Bandwidth Based on Multi-criterion Fusion Proceedngs of 01 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 5 (01) (01) IACSIT Press, Sngapore Determnng the Optmal Bandwdth Based on Mult-crteron Fuson Ha-L Lang 1+, Xan-Mn

More information

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification Introducton to Artfcal Intellgence V22.0472-001 Fall 2009 Lecture 24: Nearest-Neghbors & Support Vector Machnes Rob Fergus Dept of Computer Scence, Courant Insttute, NYU Sldes from Danel Yeung, John DeNero

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

BioTechnology. An Indian Journal FULL PAPER. Trade Science Inc.

BioTechnology. An Indian Journal FULL PAPER. Trade Science Inc. [Type text] [Type text] [Type text] ISSN : 0974-74 Volume 0 Issue BoTechnology 04 An Indan Journal FULL PAPER BTAIJ 0() 04 [684-689] Revew on Chna s sports ndustry fnancng market based on market -orented

More information

CSCI 5417 Information Retrieval Systems Jim Martin!

CSCI 5417 Information Retrieval Systems Jim Martin! CSCI 5417 Informaton Retreval Systems Jm Martn! Lecture 11 9/29/2011 Today 9/29 Classfcaton Naïve Bayes classfcaton Ungram LM 1 Where we are... Bascs of ad hoc retreval Indexng Term weghtng/scorng Cosne

More information

Wishing you all a Total Quality New Year!

Wishing you all a Total Quality New Year! Total Qualty Management and Sx Sgma Post Graduate Program 214-15 Sesson 4 Vnay Kumar Kalakband Assstant Professor Operatons & Systems Area 1 Wshng you all a Total Qualty New Year! Hope you acheve Sx sgma

More information

7/12/2016. GROUP ANALYSIS Martin M. Monti UCLA Psychology AGGREGATING MULTIPLE SUBJECTS VARIANCE AT THE GROUP LEVEL

7/12/2016. GROUP ANALYSIS Martin M. Monti UCLA Psychology AGGREGATING MULTIPLE SUBJECTS VARIANCE AT THE GROUP LEVEL GROUP ANALYSIS Martn M. Mont UCLA Psychology NITP AGGREGATING MULTIPLE SUBJECTS When we conduct mult-subject analyss we are tryng to understand whether an effect s sgnfcant across a group of people. Whether

More information

Simulation: Solving Dynamic Models ABE 5646 Week 11 Chapter 2, Spring 2010

Simulation: Solving Dynamic Models ABE 5646 Week 11 Chapter 2, Spring 2010 Smulaton: Solvng Dynamc Models ABE 5646 Week Chapter 2, Sprng 200 Week Descrpton Readng Materal Mar 5- Mar 9 Evaluatng [Crop] Models Comparng a model wth data - Graphcal, errors - Measures of agreement

More information

Graph-based Clustering

Graph-based Clustering Graphbased Clusterng Transform the data nto a graph representaton ertces are the data ponts to be clustered Edges are eghted based on smlarty beteen data ponts Graph parttonng Þ Each connected component

More information

EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science

EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science EECS 730 Introducton to Bonformatcs Sequence Algnment Luke Huan Electrcal Engneerng and Computer Scence http://people.eecs.ku.edu/~huan/ HMM Π s a set of states Transton Probabltes a kl Pr( l 1 k Probablty

More information

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS ARPN Journal of Engneerng and Appled Scences 006-017 Asan Research Publshng Network (ARPN). All rghts reserved. NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS Igor Grgoryev, Svetlana

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

A Post Randomization Framework for Privacy-Preserving Bayesian. Network Parameter Learning

A Post Randomization Framework for Privacy-Preserving Bayesian. Network Parameter Learning A Post Randomzaton Framework for Prvacy-Preservng Bayesan Network Parameter Learnng JIANJIE MA K.SIVAKUMAR School Electrcal Engneerng and Computer Scence, Washngton State Unversty Pullman, WA. 9964-75

More information

Intelligent Information Acquisition for Improved Clustering

Intelligent Information Acquisition for Improved Clustering Intellgent Informaton Acquston for Improved Clusterng Duy Vu Unversty of Texas at Austn duyvu@cs.utexas.edu Mkhal Blenko Mcrosoft Research mblenko@mcrosoft.com Prem Melvlle IBM T.J. Watson Research Center

More information

Adaptive Transfer Learning

Adaptive Transfer Learning Adaptve Transfer Learnng Bn Cao, Snno Jaln Pan, Yu Zhang, Dt-Yan Yeung, Qang Yang Hong Kong Unversty of Scence and Technology Clear Water Bay, Kowloon, Hong Kong {caobn,snnopan,zhangyu,dyyeung,qyang}@cse.ust.hk

More information

Three supervised learning methods on pen digits character recognition dataset

Three supervised learning methods on pen digits character recognition dataset Three supervsed learnng methods on pen dgts character recognton dataset Chrs Flezach Department of Computer Scence and Engneerng Unversty of Calforna, San Dego San Dego, CA 92093 cflezac@cs.ucsd.edu Satoru

More information

Clustering is a discovery process in data mining.

Clustering is a discovery process in data mining. Cover Feature Chameleon: Herarchcal Clusterng Usng Dynamc Modelng Many advanced algorthms have dffculty dealng wth hghly varable clusters that do not follow a preconceved model. By basng ts selectons on

More information

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto

More information

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION Paulo Quntlano 1 & Antono Santa-Rosa 1 Federal Polce Department, Brasla, Brazl. E-mals: quntlano.pqs@dpf.gov.br and

More information

Classifier Selection Based on Data Complexity Measures *

Classifier Selection Based on Data Complexity Measures * Classfer Selecton Based on Data Complexty Measures * Edth Hernández-Reyes, J.A. Carrasco-Ochoa, and J.Fco. Martínez-Trndad Natonal Insttute for Astrophyscs, Optcs and Electroncs, Lus Enrque Erro No.1 Sta.

More information

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance Tsnghua Unversty at TAC 2009: Summarzng Mult-documents by Informaton Dstance Chong Long, Mnle Huang, Xaoyan Zhu State Key Laboratory of Intellgent Technology and Systems, Tsnghua Natonal Laboratory for

More information

X- Chart Using ANOM Approach

X- Chart Using ANOM Approach ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are

More information

Backpropagation: In Search of Performance Parameters

Backpropagation: In Search of Performance Parameters Bacpropagaton: In Search of Performance Parameters ANIL KUMAR ENUMULAPALLY, LINGGUO BU, and KHOSROW KAIKHAH, Ph.D. Computer Scence Department Texas State Unversty-San Marcos San Marcos, TX-78666 USA ae049@txstate.edu,

More information

SHAPE RECOGNITION METHOD BASED ON THE k-nearest NEIGHBOR RULE

SHAPE RECOGNITION METHOD BASED ON THE k-nearest NEIGHBOR RULE SHAPE RECOGNITION METHOD BASED ON THE k-nearest NEIGHBOR RULE Dorna Purcaru Faculty of Automaton, Computers and Electroncs Unersty of Craoa 13 Al. I. Cuza Street, Craoa RO-1100 ROMANIA E-mal: dpurcaru@electroncs.uc.ro

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

Keyword-based Document Clustering

Keyword-based Document Clustering Keyword-based ocument lusterng Seung-Shk Kang School of omputer Scence Kookmn Unversty & AIrc hungnung-dong Songbuk-gu Seoul 36-72 Korea sskang@kookmn.ac.kr Abstract ocument clusterng s an aggregaton of

More information

SVM-based Learning for Multiple Model Estimation

SVM-based Learning for Multiple Model Estimation SVM-based Learnng for Multple Model Estmaton Vladmr Cherkassky and Yunqan Ma Department of Electrcal and Computer Engneerng Unversty of Mnnesota Mnneapols, MN 55455 {cherkass,myq}@ece.umn.edu Abstract:

More information

Support Vector Machines

Support Vector Machines Support Vector Machnes Decson surface s a hyperplane (lne n 2D) n feature space (smlar to the Perceptron) Arguably, the most mportant recent dscovery n machne learnng In a nutshell: map the data to a predetermned

More information

Assembler. Building a Modern Computer From First Principles.

Assembler. Building a Modern Computer From First Principles. Assembler Buldng a Modern Computer From Frst Prncples www.nand2tetrs.org Elements of Computng Systems, Nsan & Schocken, MIT Press, www.nand2tetrs.org, Chapter 6: Assembler slde Where we are at: Human Thought

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervsed Learnng and Clusterng Supervsed vs. Unsupervsed Learnng Up to now we consdered supervsed learnng scenaro, where we are gven 1. samples 1,, n 2. class labels for all samples 1,, n Ths s also

More information

Assembler. Shimon Schocken. Spring Elements of Computing Systems 1 Assembler (Ch. 6) Compiler. abstract interface.

Assembler. Shimon Schocken. Spring Elements of Computing Systems 1 Assembler (Ch. 6) Compiler. abstract interface. IDC Herzlya Shmon Schocken Assembler Shmon Schocken Sprng 2005 Elements of Computng Systems 1 Assembler (Ch. 6) Where we are at: Human Thought Abstract desgn Chapters 9, 12 abstract nterface H.L. Language

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

Private Information Retrieval (PIR)

Private Information Retrieval (PIR) 2 Levente Buttyán Problem formulaton Alce wants to obtan nformaton from a database, but she does not want the database to learn whch nformaton she wanted e.g., Alce s an nvestor queryng a stock-market

More information

TESTING AND IMPROVING LOCAL ADAPTIVE IMPORTANCE SAMPLING IN LJF LOCAL-JT IN MULTIPLY SECTIONED BAYESIAN NETWORKS

TESTING AND IMPROVING LOCAL ADAPTIVE IMPORTANCE SAMPLING IN LJF LOCAL-JT IN MULTIPLY SECTIONED BAYESIAN NETWORKS TESTING AND IMPROVING LOCAL ADAPTIVE IMPORTANCE SAMPLING IN LJF LOCAL-JT IN MULTIPLY SECTIONED BAYESIAN NETWORKS Dan Wu 1 and Sona Bhatt 2 1 School of Computer Scence Unversty of Wndsor, Wndsor, Ontaro

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

UB at GeoCLEF Department of Geography Abstract

UB at GeoCLEF Department of Geography   Abstract UB at GeoCLEF 2006 Mguel E. Ruz (1), Stuart Shapro (2), June Abbas (1), Slva B. Southwck (1) and Davd Mark (3) State Unversty of New York at Buffalo (1) Department of Lbrary and Informaton Studes (2) Department

More information

CE 221 Data Structures and Algorithms

CE 221 Data Structures and Algorithms CE 1 ata Structures and Algorthms Chapter 4: Trees BST Text: Read Wess, 4.3 Izmr Unversty of Economcs 1 The Search Tree AT Bnary Search Trees An mportant applcaton of bnary trees s n searchng. Let us assume

More information

Text Similarity Computing Based on LDA Topic Model and Word Co-occurrence

Text Similarity Computing Based on LDA Topic Model and Word Co-occurrence 2nd Internatonal Conference on Software Engneerng, Knowledge Engneerng and Informaton Engneerng (SEKEIE 204) Text Smlarty Computng Based on LDA Topc Model and Word Co-occurrence Mngla Shao School of Computer,

More information

Greedy Technique - Definition

Greedy Technique - Definition Greedy Technque Greedy Technque - Defnton The greedy method s a general algorthm desgn paradgm, bult on the follong elements: confguratons: dfferent choces, collectons, or values to fnd objectve functon:

More information

A Statistical Model Selection Strategy Applied to Neural Networks

A Statistical Model Selection Strategy Applied to Neural Networks A Statstcal Model Selecton Strategy Appled to Neural Networks Joaquín Pzarro Elsa Guerrero Pedro L. Galndo joaqun.pzarro@uca.es elsa.guerrero@uca.es pedro.galndo@uca.es Dpto Lenguajes y Sstemas Informátcos

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS46: Mnng Massve Datasets Jure Leskovec, Stanford Unversty http://cs46.stanford.edu /19/013 Jure Leskovec, Stanford CS46: Mnng Massve Datasets, http://cs46.stanford.edu Perceptron: y = sgn( x Ho to fnd

More information

Fuzzy Filtering Algorithms for Image Processing: Performance Evaluation of Various Approaches

Fuzzy Filtering Algorithms for Image Processing: Performance Evaluation of Various Approaches Proceedngs of the Internatonal Conference on Cognton and Recognton Fuzzy Flterng Algorthms for Image Processng: Performance Evaluaton of Varous Approaches Rajoo Pandey and Umesh Ghanekar Department of

More information

LECTURE : MANIFOLD LEARNING

LECTURE : MANIFOLD LEARNING LECTURE : MANIFOLD LEARNING Rta Osadchy Some sldes are due to L.Saul, V. C. Raykar, N. Verma Topcs PCA MDS IsoMap LLE EgenMaps Done! Dmensonalty Reducton Data representaton Inputs are real-valued vectors

More information

The Codesign Challenge

The Codesign Challenge ECE 4530 Codesgn Challenge Fall 2007 Hardware/Software Codesgn The Codesgn Challenge Objectves In the codesgn challenge, your task s to accelerate a gven software reference mplementaton as fast as possble.

More information

Learning an Image Manifold for Retrieval

Learning an Image Manifold for Retrieval Learnng an Image Manfold for Retreval Xaofe He*, We-Yng Ma, and Hong-Jang Zhang Mcrosoft Research Asa Bejng, Chna, 100080 {wyma,hjzhang}@mcrosoft.com *Department of Computer Scence, The Unversty of Chcago

More information

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching A Fast Vsual Trackng Algorthm Based on Crcle Pxels Matchng Zhqang Hou hou_zhq@sohu.com Chongzhao Han czhan@mal.xjtu.edu.cn Ln Zheng Abstract: A fast vsual trackng algorthm based on crcle pxels matchng

More information

Virtual Machine Migration based on Trust Measurement of Computer Node

Virtual Machine Migration based on Trust Measurement of Computer Node Appled Mechancs and Materals Onlne: 2014-04-04 ISSN: 1662-7482, Vols. 536-537, pp 678-682 do:10.4028/www.scentfc.net/amm.536-537.678 2014 Trans Tech Publcatons, Swtzerland Vrtual Machne Mgraton based on

More information

CHAPTER 2 DECOMPOSITION OF GRAPHS

CHAPTER 2 DECOMPOSITION OF GRAPHS CHAPTER DECOMPOSITION OF GRAPHS. INTRODUCTION A graph H s called a Supersubdvson of a graph G f H s obtaned from G by replacng every edge uv of G by a bpartte graph,m (m may vary for each edge by dentfyng

More information

GENETIC ALGORITHMS APPLIED FOR PATTERN GENERATION FOR DOWNHOLE DYNAMOMETER CARDS

GENETIC ALGORITHMS APPLIED FOR PATTERN GENERATION FOR DOWNHOLE DYNAMOMETER CARDS GENETIC ALGORITHMS APPLIED FOR PATTERN GENERATION FOR DOWNHOLE DYNAMOMETER CARDS L. Schntman 1 ; B.C.Brandao 1 ; H.Lepkson 1 ; J.A.M. Felppe de Souza 2 ; J.F.S.Correa 3 1 Unversdade Federal da Baha- Brazl

More information

Announcements. Supervised Learning

Announcements. Supervised Learning Announcements See Chapter 5 of Duda, Hart, and Stork. Tutoral by Burge lnked to on web page. Supervsed Learnng Classfcaton wth labeled eamples. Images vectors n hgh-d space. Supervsed Learnng Labeled eamples

More information