Text Similarity Computing Based on LDA Topic Model and Word Co-occurrence


2nd International Conference on Software Engineering, Knowledge Engineering and Information Engineering (SEKEIE 2014)

Minglai Shao
School of Computer, Electronics and Information, Guangxi University, Nanning, China
E-mail: shml14@sina.com

Liangxi Qin
School of Computer, Electronics and Information, Guangxi University, Nanning, China
E-mail: qin_lx@126.com

Abstract: The LDA (Latent Dirichlet Allocation) topic model has been widely applied to text clustering owing to its efficient dimension reduction. The prevalent method is to model the text set with the LDA topic model, to make inference by Gibbs sampling, and to calculate text similarity with the JS (Jensen-Shannon) distance. However, the JS distance cannot distinguish semantic associations among text topics. To remedy this defect, a new text similarity computing algorithm based on the hidden topic model and word co-occurrence analysis is introduced. Experiments are carried out to verify the clustering effect of the improved algorithm. Results show that this method effectively improves text similarity computation and text clustering accuracy.

Keywords: topic model; LDA (Latent Dirichlet Allocation); JS (Jensen-Shannon) distance; word co-occurrence; similarity

I. INTRODUCTION

With the rapid development of the Internet, the amount of information on the Internet increases exponentially. How to discover useful information efficiently from the massive text data (one of the main carriers of information) has become a pressing need. The vector space model (VSM), a classic model in the text mining area, represents documents as space vectors and computes the similarity among the vectors to measure the similarity among the documents. Among these, TF-IDF (term frequency-inverse document frequency) is the most widely applied weighting scheme for similarity measurement. In this method, a word's weight is determined by its frequency in a particular document and by its inverse frequency in the document set. However, this method ignores the semantic associations among words, which makes semantic factors difficult to handle. For example, there are no common words between "Steve Jobs left us." and "Will the price of Apple products drop?", yet there is a clear correlation between them. For another example, when the word "apple" appears in two articles, one describing a fruit and the other a cell phone brand, the two occurrences are considered correlated even though their meanings differ. This method also suffers from the high dimensionality and sparseness of the data space. To address these problems, modeling the text set with the LDA topic model and computing text similarity with the JS (Jensen-Shannon) distance has produced preferable clustering results. However, the JS distance cannot distinguish semantic associations among text topics, which may lead to incorrect clustering of texts that have similar topic probability distributions yet different topics. For this defect, we introduce the idea of word co-occurrence to analyze the semantic correlation of text topics, since co-occurring words embody the text topic better. This yields an improved text similarity measure based on the hidden topic model and word co-occurrence analysis.

II. RELATED WORKS

A. Text hidden topic model

Text topic mining has received wide attention and has been extensively applied to text clustering in recent years, since topic models reduce dimensions efficiently and are interpretable. Currently available hidden topic models include LSA (Latent Semantic Analysis) [1], PLSA (Probabilistic Latent Semantic Analysis) [2] and LDA, among others. LSA applies SVD (Singular Value Decomposition) and other mathematical methods to discover the hidden semantic structures of documents. Its limitation lies in its inability to distinguish polysemy in documents.
PLSA is a probabilistic model presented by Hofmann on the foundation of LSA. Based on a generative model and maximum likelihood estimation, it obtains its results with the EM (Expectation Maximization) algorithm, which makes PLSA superior to LSA in dealing with large-scale data sets. LDA introduces Dirichlet prior parameters at the word layer and the hidden topic layer, a groundbreaking extension of PLSA. It solves the overfitting problem caused by the linear growth of topic parameters with the number of training documents in the PLSI and LSI models, making it more suitable for large-scale corpus processing. Shi Jian-hong et al. [3] applied the LDA topic model to Chinese microblogs and carried out effective microblog topic discovery. Li Wen-bo et al. [4] proposed a Labeled-LDA topic model that adds text class information to the LDA topic model, calculates the distribution of hidden topics in each class, and raises the classification ability of the traditional LDA model. Phan et al. [5] adopted hidden topics learned from an external corpus to extend text features. Shi Jing et al. [6] achieved preferable extraction results by using Shannon information to extract keywords from LDA probability distributions. Quan et al. [7] used topics as an intermediary between words and further mined text similarity. Zhang Zhi-fei et al. [8] proposed a text classification method based on the LDA topic model and an overall consideration of context.

B. Word co-occurrence

Word co-occurrence analysis is a successful application of natural language processing in information retrieval. Its core idea is that the co-occurrence rate of words reflects, to some extent, their semantic correlation. Word co-occurrence analysis is being applied to text analysis increasingly often. Geng Huan-tong et al. [9] proposed a topic word extraction algorithm based on word co-occurrence, which expands the extraction scope of the original topic words by mining the co-occurring words of candidate words. Chang Peng [10] analyzed in depth the inner link between text topic representation and word co-occurrence, designed a new method of co-occurrence word extraction, and proposed a new document representation model on this basis. Yuan Li-chi [11] proposed measuring word similarity with mutual information, which effectively eliminates word indeterminacy.

III. APPLICATION OF LDA TOPIC MODEL IN TEXT REPRESENTATION

A. LDA topic model

The LDA (Latent Dirichlet Allocation) topic model is a three-layer Bayesian probability model composed of word, topic and document layers. Its basic idea is that every document is a mixture of several hidden topics and each hidden topic is a mixture of several words. The document-topic relation follows a Dirichlet prior distribution and the topic-word relation follows a multinomial distribution. The generative process of LDA is shown in Figure 1.

Figure 1. LDA generation probability diagram

Among the variables, M denotes the number of documents, K the number of hidden topics, and N the number of words in a document. α and β are the corpus-level parameters of LDA: α expresses the relative strength of the latent hidden topics in the document set, and β the probability distribution over the words of each hidden topic. θ denotes the topic probability distribution of a given document, and φ the word distribution of a given hidden topic. Rectangles denote repeated sampling, single circles denote hidden variables, and double circles denote observable variables. The probability model is given in formula (1):

p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)  (1)

The generative process of the LDA topic model is as follows:
1) For each hidden topic, draw its multinomial distribution φ over feature words from the Dirichlet distribution with parameter β;
2) Obtain the number of words N of the document from a Poisson distribution;
3) Draw the topic probability distribution θ for each text;
4) For each feature word position in each document of the document set:
   a) select a hidden topic z randomly from the topic probability distribution θ;
   b) select a feature word randomly from the multinomial distribution of topic z.

B. Gibbs sampling

Parameter estimation is needed in LDA modeling, and Gibbs sampling is used here: it is easy to understand, easy to implement, and can effectively extract topics from large-scale document collections. The main idea is that, for a certain feature word w_i, Gibbs sampling draws its hidden topic assignment z_i from an approximation of the posterior distribution p(z_i \mid z_{-i}, w), as shown in formula (2):

p(z_i = j \mid z_{-i}, w) \propto \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}  (2)

Among the variables, n^{(w_i)}_{-i,j} denotes the number of tokens of feature word w_i assigned to hidden topic j; n^{(\cdot)}_{-i,j} the total number of tokens assigned to hidden topic j; n^{(d_i)}_{-i,j} the number of feature words in document d_i assigned to hidden topic j; and n^{(d_i)}_{-i,\cdot} the total number of feature words in document d_i, in each case excluding the current token i. W denotes the size of the vocabulary and T the number of topics.
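For concreteness, here is a minimal Python sketch of one sweep of the collapsed Gibbs sampler in formula (2). The count-array names (n_wt, n_dt, n_t) and the use of NumPy are illustrative choices of ours, not from the paper; the document-length denominator of formula (2) is constant across topics, so it is dropped before normalization.

import numpy as np

def gibbs_sweep(docs, z, n_wt, n_dt, n_t, alpha, beta):
    # docs: list of documents, each a list of word ids
    # z:    parallel list of current topic assignments
    # n_wt: W x T counts of word w assigned to topic t
    # n_dt: D x T counts of tokens in document d assigned to topic t
    # n_t:  length-T totals of tokens per topic
    W, T = n_wt.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            # exclude the current token from all counts (the "-i" in formula (2))
            n_wt[w, t] -= 1; n_dt[d, t] -= 1; n_t[t] -= 1
            # unnormalized conditional distribution over the T topics
            p = (n_wt[w] + beta) / (n_t + W * beta) * (n_dt[d] + alpha)
            t = int(np.random.choice(T, p=p / p.sum()))
            # reinsert the token under the newly sampled topic
            n_wt[w, t] += 1; n_dt[d, t] += 1; n_t[t] += 1
            z[d][i] = t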
During the iterative sampling process, the parameters θ and φ are estimated according to formulas (3) and (4):

\hat{\theta}^{(d)}_j = \frac{n^{(d)}_j + \alpha}{n^{(d)}_{\cdot} + T\alpha}  (3)

\hat{\varphi}^{(w)}_j = \frac{n^{(w)}_j + \beta}{n^{(\cdot)}_j + W\beta}  (4)
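The point estimates of formulas (3) and (4) follow directly from the same counts; a sketch under the count-matrix layout assumed above (n_d holds the per-document token totals):

def estimate_theta_phi(n_wt, n_dt, n_t, n_d, alpha, beta):
    W, T = n_wt.shape
    # formula (3): D x T document-topic matrix, rows sum to 1
    theta = (n_dt + alpha) / (n_d[:, None] + T * alpha)
    # formula (4): T x W topic-word matrix, rows sum to 1
    phi = (n_wt + beta).T / (n_t[:, None] + W * beta)
    return theta, phi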

IV. IMPROVED TEXT SIMILARITY COMPUTING

A. JS distance

Since the topic distribution of a text is a simple mapping of the text space, the similarity of two texts can be measured by comparing their topic distributions. The KL (Kullback-Leibler) distance measures the difference between two probability distributions, and some researchers have used it as the criterion for similarity computing. Let p(x) and q(x) be two probability density functions; the KL distance between them is defined as shown in formula (5):

D_{KL}(p, q) = \sum_{j=1}^{T} p_j \ln \frac{p_j}{q_j}  (5)

However, D_{KL}(p, q) \neq D_{KL}(q, p); the KL distance is asymmetric. So its symmetric version is used here, as shown in formula (6):

D_{\lambda}(p, q) = \lambda D_{KL}(p, \lambda p + (1-\lambda) q) + (1-\lambda) D_{KL}(q, \lambda p + (1-\lambda) q)  (6)

When λ = 1/2, the above formula turns into the JS distance, whose value lies in [0, 1], as shown in formula (7):

D_{JS}(p, q) = \frac{1}{2} \left[ D_{KL}\left(p, \frac{p+q}{2}\right) + D_{KL}\left(q, \frac{p+q}{2}\right) \right]  (7)

B. Improved text similarity computing based on word co-occurrence

The JS distance cannot distinguish the semantic relation between topics when it is used for similarity computing. For this defect, an improved similarity computing method is proposed, which analyzes the semantic correlation between topics from a word co-occurrence angle and adds a semantic correlation computation over topic feature words to the original JS measure. Details are as follows.

Assume T_i is a topic of text D and the word set W = {w_1, w_2, ..., w_n} holds the feature words of topic T_i. According to the co-occurrence formula (8), the co-occurrence probabilities of the feature words are p_1, p_2, p_3, ..., p_n:

p(w_m, w_n) = p(w_m \mid T_i)\, p(w_n \mid T_i)  (8)

Having computed the co-occurrence probabilities of the topic feature words, we turn to the semantic correlation between feature words from topics T_i and T_j. If the probability of feature word w_m in topic T_i is p_m and the co-occurrence probability of feature words w_m and w_n is p_mn (obtained from formula (8)), then the similarity of w_m and w_n is computed as shown in formula (9):

correlation(w_m, w_n) = \frac{p_{mn}}{p_m + p_n - p_{mn}}  (9)

According to formula (9), when p_mn = 0 we have correlation(w_m, w_n) = 0, which means feature words w_m and w_n are uncorrelated; when correlation(w_m, w_n) > 0, they are correlated.

Considering both the probability distributions of the hidden topics and the co-occurrence of their feature words, four cases arise. When the hidden topic distributions are highly similar and the topic feature words are correlated, the text similarity is highest and the documents should be placed in one category. When the hidden topic distributions have low similarity and the topic feature words are uncorrelated, the text similarity is lowest and the documents should not be placed in one category. When the hidden topic distributions are highly similar but the correlation of the topic feature words is low, the similarity between the texts should be reduced. When the hidden topic distributions have low similarity but the correlation of the topic feature words is high, the similarity between the texts should be enhanced.

To sum up, a new text similarity computing method is proposed, as shown in formula (10):

Similarity(d_i, d_j) = \lambda D_{JS}(d_i, d_j) + (1-\lambda) \left( 1 - \frac{\sum_{m,n} correlation(w_m, w_n)}{V(V-1)} \right)  (10)

Among the variables, d_i and d_j denote arbitrary texts from the document set, w_m and w_n denote feature words of d_i and d_j respectively, V denotes the number of feature words of the selected documents, and λ ∈ [0, 1] is a weighting coefficient. The smaller the value of Similarity(d_i, d_j), the more similar the two texts d_i and d_j are.
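As a concrete illustration of formulas (5) and (7), a minimal Python sketch follows. It assumes smoothed, strictly positive distributions, and uses base-2 logarithms so that the JS value falls in [0, 1] as stated above (with natural logarithms the bound would be ln 2):

import numpy as np

def kl(p, q):
    # formula (5): Kullback-Leibler distance between distributions p and q
    return float(np.sum(p * np.log2(p / q)))

def js(p, q):
    # formula (7): symmetric Jensen-Shannon distance
    m = (p + q) / 2.0
    return 0.5 * (kl(p, m) + kl(q, m))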
The detailed steps of the improved computing method are as follows:

Computing method: improved text similarity computing.
Input: arbitrary texts d_i and d_j; probability distributions φ and θ.
Output: the similarity between d_i and d_j, Similarity(d_i, d_j).
Step 1: extract the words with the highest probabilities as the feature words of texts d_i and d_j, based on the word distributions in φ and θ;
Step 2: calculate the co-occurrence probabilities of the text feature words, based on formula (8) and Step 1;
Step 3: calculate the correlation between arbitrary feature words, based on formula (9);
Step 4: calculate the similarity Similarity(d_i, d_j) between d_i and d_j by combining the JS distance of formula (7) with the Step 3 results according to formula (10).
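These steps might be sketched as follows, reusing js() from the sketch in Section IV.A. The paper does not pin down which topic each probability in formulas (8) and (9) is taken from; as one plausible reading, this sketch scores each text's feature words against that text's dominant topic. All names here (top_words_i, phi, lam) are illustrative assumptions.

def improved_similarity(theta_i, theta_j, top_words_i, top_words_j, phi, lam):
    # theta_i, theta_j: topic distributions of the two texts
    # top_words_i/j:    ids of the highest-probability feature words (Step 1)
    # phi:              T x W topic-word matrix from LDA; lam: lambda in [0, 1]
    d_js = js(theta_i, theta_j)                       # formula (7)
    t_i, t_j = int(np.argmax(theta_i)), int(np.argmax(theta_j))
    V = len(top_words_i) + len(top_words_j)
    corr_sum = 0.0
    for wm in top_words_i:                            # Steps 2 and 3
        for wn in top_words_j:
            p_m, p_n = phi[t_i, wm], phi[t_j, wn]
            p_mn = phi[t_i, wm] * phi[t_i, wn]        # formula (8)
            corr_sum += p_mn / (p_m + p_n - p_mn)     # formula (9)
    # formula (10): smaller value means more similar texts (Step 4)
    return lam * d_js + (1 - lam) * (1 - corr_sum / (V * (V - 1)))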

V. EXPERIMENTAL DESIGN AND RESULT ANALYSIS

A. Evaluation criterion

This paper measures text similarity and clustering effect through a cluster analysis of the texts, adopting the precision ratio, recall ratio and F-measure. The F-measure is a balanced index for information retrieval that combines precision and recall. The precision ratio P(i, j), recall ratio R(i, j) and F-measure F(i, j) are defined in formulas (11), (12) and (13) respectively:

P(i, j) = \frac{n_{ij}}{n_j}  (11)

R(i, j) = \frac{n_{ij}}{n_i}  (12)

F(i, j) = \frac{2\, P(i, j)\, R(i, j)}{P(i, j) + R(i, j)}  (13)

Among the variables, n_{ij} denotes the number of texts from category i in cluster j, n_i the number of texts in category i, and n_j the number of texts in cluster j.

B. Corpus choice

The method is tested on the Chinese corpus of Fudan University. In the experiment, three subsets were extracted, named C3-Art, C7-History and C19-Computer. From each subset 400 texts were extracted, 1200 in total.

C. Experimental procedure and main parameter selection

1) Document preprocessing: mainly word segmentation and the elimination of stop words. Word segmentation is carried out with the ICTCLAS system developed by the Institute of Computing Technology, Chinese Academy of Sciences.

2) Document modeling: model the documents with the LDA topic model and solve the model with the Gibbs sampling algorithm. In the experiment, α and β are assigned according to [12]: α = 50/K, β = 0.01 and T = 100 generate the best effect. The sampler is iterated 1000 times to obtain the document-topic probability distribution matrix θ and the topic-word probability distribution matrix φ.

3) Document similarity computing: measure document similarity with the method of Section IV.B. Candidate values of λ (0.1, 0.2, and so on) were compared, and λ was assigned the value that gave the best result in repeated comparative tests.

4) Document clustering: carry out text clustering with a hierarchical clustering algorithm and analyze the clustering result to evaluate the accuracy of the similarity computation.

D. Analysis of experimental results

This paper compares the original LDA+JS computing method with the LDA+JS+Word co-occurrence method proposed here. The experimental results are shown in TABLE I and in Figures 2, 3 and 4. They show that the precision ratio and recall ratio of the proposed LDA+JS+Word co-occurrence method are higher. This owes to the fact that analyzing co-occurring words as a whole better represents the text topic. By adding topic correlation analysis based on word co-occurrence to the JS-distance measurement of text similarity, problems concerning polysemy, synonymy and context dependency are effectively addressed, text similarity is better represented, and mis-clustering of texts that have similar topic probability distributions yet different topics is effectively reduced. The test results show that the proposed similarity computing method is feasible.

TABLE I. EXPERIMENTAL RESULTS

                       LDA+JS                            LDA+JS+Word co-occurrence
Category    Precision   Recall   F-measure    Precision   Recall   F-measure
Art         63          975      789          835         050      94
Computer    298         225      252          500         537      353
History     88          525      669          928         750      734
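As a reference for the evaluation criterion of Section V.A, formulas (11)-(13) reduce to a few lines of Python; n_ij, n_i and n_j are the counts defined there:

def cluster_f_measure(n_ij, n_i, n_j):
    p = n_ij / n_j                                    # formula (11)
    r = n_ij / n_i                                    # formula (12)
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0   # formula (13)
    return p, r, f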

Figure 2. Precision ratio of the LDA+JS and LDA+JS+Word co-occurrence methods on the Art, Computer and History subsets

Figure 3. Recall ratio of the two methods on the three subsets

Figure 4. F-measure of the two methods on the three subsets

VI. CONCLUSION

In this paper we study text similarity computing from two angles: differences between the hidden topic probability distributions of texts, and the semantic correlation of text feature words. Modeling documents with the LDA hidden topic model greatly reduces text dimensionality and improves computing efficiency. Analyzing the semantic correlation of text feature words from a word co-occurrence angle on the basis of the LDA model enhances the use of text topic information and effectively improves the text clustering result. Since the LDA topic model is highly extensible, follow-up work will center on new text modeling and text similarity computing methods; one idea is to model text by replacing the single words in the LDA model with co-occurring word combinations. This topic-based processing method is of much significance for data mining and other disciplines.

REFERENCES

[1] Deerwester S, Dumais S, Landauer T. Indexing by latent semantic analysis [J]. Journal of the American Society for Information Science, 1990, 41(6): 391-407.
[2] Hofmann T. Probabilistic latent semantic indexing [C] // Proc of the 22nd Annual Int'l ACM SIGIR Conf on Research and Development in Information Retrieval. New York: ACM, 1999: 50-57.
[3] Shi Jian-hong, Chen Xing-shu, Wang Wen-xian. Discovering topics from microblogs based on hidden topic analysis [J]. Application Research of Computers, 2014, 31(3): 700-704.
[4] Li Wen-bo, Sun Le, Zhang Da-kun. Text classification based on Labeled-LDA model [J]. Chinese Journal of Computers, 2008, 31(4): 620-627.
[5] Phan X H, Nguyen L M, Horiguchi S. Learning to classify short and sparse text & web with hidden topics from large-scale data collections [C] // Proceedings of the 17th International Conference on World Wide Web (WWW '08). New York: ACM, 2008: 91-100.
[6] Shi Jing, Li Wan-long. Topic words extraction method based on LDA model [J]. Computer Engineering, 2010, 19(36): 81-83.
[7] Quan X J, Liu G, Lu Z. Short text similarity based on probabilistic topics [J]. Knowledge and Information Systems, 2010, 25(3): 473-491.
[8] Zhang Zhi-fei, Miao Duo-qian, Gao Can. Short text classification using latent Dirichlet allocation [J]. Journal of Computer Applications, 2013, 33(6): 1587-1590.
[9] Geng Huan-tong, Cai Qing-sheng, Yu Kun, Zhao Peng. An automatic text key phrase extraction method based on word co-occurrence [J]. Journal of Nanjing University (Natural Sciences), 2006, 42(2): 156-162.
[10] Chang Peng. Research on term co-occurrence based models and algorithms for text mining [D]. Tianjin: Tianjin University, 2009: 30-37.
[11] Yuan Li-chi. A word clustering method based on mutual information [J]. Systems Engineering, 2008, 26(5): 120-122.
[12] Huang Bo. Research on microblog topic detection based on VSM model and LDA model [D]. Chengdu: Southwest Jiaotong University, 2012: 36-40.