Available online at www.sciencedirect.com

Procedia Engineering 15 (2011) 1642-1646
www.elsevier.com/locate/procedia

Advanced in Control Engineering and Information Science

The Clustering Algorithm of Query Result based on Maximal Frequent

WEI Yu-wei *

Faculty of Electromechanical Engineering, Guangdong University of Technology, Guangzhou 510006, China

Abstract

Most existing web page clustering algorithms are based on short and uneven snippets of web pages, which often leads to poor clustering performance. On the other hand, the classical clustering algorithms for full-text web pages are too complex to provide good cluster labels, and they are incapable of on-line clustering. To address these problems, this article presents an on-line web page clustering algorithm based on maximal frequent item sets (MFIC). First, the maximal frequent item sets are mined, and the web pages are clustered based on shared frequent item sets. Second, clusters are labelled based on the frequent items. Experimental results show that MFIC can effectively reduce clustering time, improve clustering accuracy by 15%, and generate understandable labels.

© 2010 Published by Elsevier Ltd. Open access under CC BY-NC-ND license. Selection and/or peer-review under responsibility of [name organizer]

Keywords: Search Engine; Frequent Itemsets; Page Clustering

* Corresponding author. Tel.: +8613602446699. E-mail address: weiyuwei@gdut.edu.cn
1877-7058 © 2011 Published by Elsevier Ltd. doi:10.1016/j.proeng.2011.08.306

1. Introduction

With the growth of information on the internet, the SE (Search Engine) has become an indispensable tool. Most general-purpose SEs sort web pages by their degree of correlation with the user's query and return the results to the user as a list. Users then have to judge, page by page, whether the results satisfy their needs. Research shows that most users submit short and ambiguous search strings, yet 85% of users view only the first page of results and 78% of users never change their search terms. In addition, because users have different backgrounds, the results they desire are different. Therefore, to meet users' increasing requirements on query quality, the usability of query results needs to be improved.

To solve this problem, this article puts forward an on-line clustering algorithm for query results based on the maximal frequent itemsets of web pages. By improving the mining algorithm for maximal frequent itemsets, it can be used for on-line clustering of SE query results. The new algorithm clusters by means of the sharing relation between web page sets and frequent itemsets, and at the same time produces a clear and definite label for every category. The experimental results show that clustering based on maximal frequent itemsets reduces the clustering time of full-text clustering, while the clustering accuracy improves by about 15%.

2. Maximal Frequent Itemset Mining

2.1. The Basic Concept of Frequent Itemsets

To adopt maximal frequent itemsets as the basic feature of an on-line clustering algorithm over full-text web pages, this section first introduces the basic concepts of frequent itemsets; a detailed introduction to frequent itemsets can be found in the literature.

Definition 1: Let $I = \{I_1, I_2, \ldots, I_n\}$ be a set of $n$ different items. A set $X$ with $X \subseteq I$ and $X \neq \emptyset$ is called an itemset. The length of $X$ is the number of items it contains.

Definition 2: Let $D = \{T_1, T_2, \ldots, T_m\}$ be a set of $m$ different transactions, where each $T_i \subseteq I$. For the given transaction set $D$, the support of $X$ is the number of transactions in which $X$ occurs, denoted $Sup(X)$. The user may define a minimal support count, min_supp, which is either an absolute count or a relative count.
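The support of Definition 2 can be illustrated with a minimal Python sketch. The page contents are invented toy data, and the names `support`, `relative_support` and `PAGES` are illustrative choices rather than anything taken from the article. Each transaction is modelled as the set of terms of one web page.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> int:
    """Absolute support: number of transactions containing every item of `itemset`."""
    x = frozenset(itemset)
    return sum(1 for t in transactions if x <= t)


def relative_support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Relative support: fraction of transactions containing `itemset`."""
    return support(itemset, transactions) / len(transactions)


# Hypothetical toy data (not from the article): one term set per web page.
PAGES = [
    frozenset({"car", "geely", "trademark"}),
    frozenset({"car", "geely", "trademark", "price"}),
    frozenset({"car", "price"}),
    frozenset({"geely", "news"}),
]

print(support({"car", "geely"}, PAGES))
print(relative_support({"car"}, PAGES))
```

With an absolute min_supp the first function is compared against a count; with a relative min_supp the second is compared against a fraction.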
Definition 3: Given the transaction set $D$ and the minimal support count min_supp, an itemset $X \subseteq I$ is called a maximal frequent itemset in the transaction set $D$ if $Sup(X) \geq \mathit{min\_supp}$ and, for every $Y$ with $Y \subseteq I$ and $X \subset Y$, $Sup(Y) < \mathit{min\_supp}$.

In this article, the transaction set is the set of web pages in the query results, and every web page is a transaction. An itemset is a set of terms contained in a web page, and each term of a web page is an item of the transaction.

2.2. Maximal Frequent Itemsets Mining Algorithm

The common algorithm for frequent itemset mining is the FP-Growth algorithm. It first constructs an FP-Tree, a threaded tree structure that stores the transactions of the set [3]. The construction of the FP-Tree first counts the support of all items; the items whose support reaches the minimal support count are arranged in the Header Table of the FP-Tree in decreasing order of support. Then the transactions are read in one at a time, and each is mapped to a path in the FP-Tree. Fig. 1 is an example of an FP-Tree (its minimal support is 2), where (a) shows the transaction set and (b) is the constructed FP-Tree. In this figure, a solid line represents the path onto which a transaction is mapped in the tree, and a dotted line points from the Header Table to a location in the tree. The count of a node is the support of the itemset formed by the path from the root node to the current node; for example, the node "trademark: 2" indicates that the support of the itemset {car, Geely, trademark} is 2.
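The two-pass FP-Tree construction described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the transactions are invented toy data (not the data of Fig. 1), the names `FPNode`, `build_fp_tree` and `TRANSACTIONS` are hypothetical, an item is treated as frequent when its support is at least min_supp, and ties in support are broken alphabetically.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


class FPNode:
    """One node of the FP-Tree: an item, its count, and links to parent/children."""

    def __init__(self, item: Optional[str], parent: Optional["FPNode"]):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children: Dict[str, "FPNode"] = {}


def build_fp_tree(transactions: List[List[str]], min_supp: int) -> Tuple[FPNode, Dict[str, List[FPNode]]]:
    # Pass 1: count the support of every item.
    counts = Counter(item for t in transactions for item in set(t))
    # Assumption: an item is frequent when its support is >= min_supp.
    frequent = {item: c for item, c in counts.items() if c >= min_supp}
    # Header Table order: decreasing support, ties broken alphabetically.
    order = sorted(frequent, key=lambda item: (-frequent[item], item))
    rank = {item: r for r, item in enumerate(order)}

    root = FPNode(None, None)
    # The header maps each item to its nodes in the tree (the dotted links).
    header: Dict[str, List[FPNode]] = {item: [] for item in order}

    # Pass 2: read one transaction at a time and map it onto a path.
    for t in transactions:
        path = sorted((i for i in set(t) if i in rank), key=rank.__getitem__)
        node = root
        for item in path:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header


# Hypothetical toy transactions (not the data of Fig. 1).
TRANSACTIONS = [
    ["car", "geely", "trademark"],
    ["car", "geely", "trademark"],
    ["car", "price"],
    ["geely", "news"],
]
root, header = build_fp_tree(TRANSACTIONS, min_supp=2)
trademark = root.children["car"].children["geely"].children["trademark"]
print(trademark.item, trademark.count)
```

In this toy tree the node reached through car, geely and trademark carries the support of the itemset formed by that path, which is the reading of node counts given in the text.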

Fig. 1. An example of FP-Tree (min_supp = 2)

3. Query Result Cluster Algorithm Based on Maximal Frequent Itemsets

After the frequent itemsets have been mined, there are two ways to cluster. The first uses the words of the frequent itemsets to create a feature vector for each web page and applies a traditional clustering algorithm based on the vector space model. The second clusters by means of the relation between frequent itemsets and the web page sets they cover [4]. The former has been shown to have a time complexity that cannot satisfy the demands of on-line clustering. The clustering algorithm introduced in this article clusters by means of the relation of web pages sharing a maximal frequent itemset.

For the following discussion, some definitions are given. $D = \{T_1, T_2, \ldots, T_m\}$ is the set of all transactions; in this article it is the set of web pages returned for a query. $I = \{I_1, I_2, \ldots, I_n\}$ is the set of all items; it is the set of terms contained in the web page set. $S_M = \{M_1, M_2, \ldots, M_n\}$ is the set of all maximal frequent itemsets mined from the web pages. The set of web pages covered by a maximal frequent itemset $M$ is denoted $P$, with $P \subseteq D$. Clustering divides the set of web pages into clusters; the set of clusters is denoted $C = \{C_1, C_2, \ldots, C_l\}$. The set of web pages contained in a cluster $C$ is denoted $CP$, with $CP \subseteq D$; the set of maximal frequent itemsets it contains is denoted $CM$, with $CM \subseteq S_M$; and the set of frequent items it contains is denoted $CI$, with $CI \subseteq I$. $D_c = \{T_1, T_2, \ldots\}$ is the set of web pages already covered by clusters.

The core steps of the clustering algorithm are introduced below.

Step 1: The generation of clusters. The longer a frequent itemset is, the more terms it contains and the better it expresses a detailed topic, so clusters generated by long frequent itemsets are given priority. The frequent itemsets in $S_M$ are sorted by length, and the longest remaining frequent itemset $M$ is selected in turn to generate a cluster $C$. $CP$, the web page set contained in $C$, is the web page set $P$ covered by $M$. The web page set covered by clusters is then updated, $D_c = D_c \cup P$. To speed up cluster generation and to reduce the transmission effects in the subsequent merging procedure, the frequent itemsets of $S_M$ are further filtered: if a frequent itemset $M$ covers a web page set $P \subseteq D_c$, all web pages of $P$ have already been covered by clusters, and no cluster $C$ corresponding to $M$ is generated.

Step 2: The merging of clusters. The clusters generated initially are numerous and overlap heavily, so they need to be merged to produce the final clusters. Merging means that clusters with high similarity are merged into one cluster; usually the similarity of clusters is judged by the similarity of the web page sets they contain. For a clustering algorithm based on frequent itemsets, the frequent itemsets contained in a cluster are an important feature of the cluster, and the similarity of clusters can also be computed from the similarity of the frequent itemsets they contain [5]. To improve accuracy, formula (1) is adopted to compute the similarity of clusters. The similarity of clusters $C_i$ and $C_j$ is denoted $Sim(C_i, C_j)$; the similarity of the web pages they contain is denoted $SimP$ (the first term), and the similarity of the frequent items they contain is denoted $SimI$ (the second term).

$$Sim(C_i, C_j) = \frac{|CP_i \cap CP_j|}{\min(|CP_i|, |CP_j|)} + \frac{|CI_i \cap CI_j|}{\min(|CI_i|, |CI_j|)} \qquad (1)$$

The larger $Sim(C_i, C_j)$ is, the more similar the clusters $C_i$ and $C_j$ are, and the more they tend to be merged.

Step 3: The cluster purification. Clusterings are subdivided into hard clusters and soft clusters. The former requires a web page to belong to only one category; the latter allows a web page to belong to multiple categories. So the hard clusters can reflect reality well. Because of the transmission effects of cluster merging, the clusters sometimes include non-correlated web pages. A vital problem is how to recognize whether a web page in a cluster is a non-correlated web page or a multiple-category web page. In this article, non-correlated web pages are recognized by the support of the web page relative to the cluster. The support of a web page $P$ relative to a cluster $C$ is defined as follows:

$$Supp(P, C) = \sum_{M \in CM} |M| \, f(P, M) \qquad (3)$$

According to the experiments, an empirical value can be set. When $Supp(P, C)$ is less than this value, the web page is non-correlated and is deleted from the cluster.

4. Experimental Results and Analysis

4.1. Experimental Condition and Experimental Data

The experimental data are the data sets corresponding to 8 ambiguous query terms. For each query term, the results returned by the SE are collected and their union set is taken. The web pages are then segmented into words and tagged with parts of speech, an index is constructed, and the segmentation results are kept for the later algorithm. The above work is completed off-line and prepares the data for on-line query clustering.
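As a concrete illustration of the three steps of Section 3, the following minimal Python sketch generates, merges and purifies clusters on invented toy pages. It is a sketch under stated assumptions rather than the article's implementation: the function names, the toy data, the greedy merge order, the merge `threshold` and the purification cut-off `min_page_supp` are all hypothetical, and $f(P, M)$ is assumed to be 1 when page $P$ contains every term of $M$ and 0 otherwise, since its definition is not given in the text.

```python
from itertools import combinations


def covered(pages, itemset):
    """Ids of the pages whose term set contains every term of `itemset`."""
    return {pid for pid, terms in pages.items() if itemset <= terms}


def generate_clusters(pages, maximal_itemsets):
    """Step 1: longest itemsets first; skip itemsets whose pages are all covered."""
    clusters, d_c = [], set()
    for m in sorted(maximal_itemsets, key=lambda s: (-len(s), sorted(s))):
        p = covered(pages, m)
        if p <= d_c:  # P is a subset of D_c: no new cluster for M
            continue
        clusters.append({"CP": set(p), "CM": [m], "CI": set(m)})
        d_c |= p
    return clusters


def similarity(a, b):
    """Formula (1): page-set overlap plus item-set overlap."""
    sim_p = len(a["CP"] & b["CP"]) / min(len(a["CP"]), len(b["CP"]))
    sim_i = len(a["CI"] & b["CI"]) / min(len(a["CI"]), len(b["CI"]))
    return sim_p + sim_i


def merge_clusters(clusters, threshold):
    """Step 2 (assumed greedy): merge the most similar pair while it reaches `threshold`."""
    clusters = [{"CP": set(c["CP"]), "CM": list(c["CM"]), "CI": set(c["CI"])} for c in clusters]
    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: similarity(clusters[ij[0]], clusters[ij[1]]))
        if similarity(clusters[i], clusters[j]) < threshold:
            break
        b = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i]["CP"] |= b["CP"]
        clusters[i]["CM"] += b["CM"]
        clusters[i]["CI"] |= b["CI"]
    return clusters


def page_support(terms, cluster):
    """Formula (3), with f(P, M) assumed to be 1 if the page contains M, else 0."""
    return sum(len(m) for m in cluster["CM"] if m <= terms)


def purify(pages, clusters, min_page_supp):
    """Step 3: drop pages whose support relative to the cluster is below the cut-off."""
    for c in clusters:
        c["CP"] = {pid for pid in c["CP"] if page_support(pages[pid], c) >= min_page_supp}
    return clusters


# Hypothetical toy data: page id -> term set, and its maximal frequent itemsets
# for a minimal support count of 2.
PAGES = {
    1: {"car", "geely", "trademark"},
    2: {"car", "geely", "trademark", "price"},
    3: {"car", "geely", "price"},
    4: {"apple", "fruit"},
    5: {"apple", "fruit", "juice"},
}
ITEMSETS = [
    frozenset({"car", "geely", "trademark"}),
    frozenset({"car", "geely", "price"}),
    frozenset({"apple", "fruit"}),
]

initial = generate_clusters(PAGES, ITEMSETS)
final = purify(PAGES, merge_clusters(initial, threshold=1.0), min_page_supp=2)
for c in final:
    print(sorted(c["CP"]), sorted(c["CI"]))
```

The two car-related clusters share one page and two items, so their similarity under formula (1) exceeds the assumed threshold and they are merged, while the fruit-related cluster shares nothing with them and stays separate.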
The categories of the web page sets are marked manually; the web page set of every query term is marked with several categories. Because the K-Means algorithm requires the value of K to be set in advance, 4 values (5, 6, 7, 8) are tried separately, and for every query the highest F value among the 4 experimental results is taken as the final result. STC, Lingo and MFIC can automatically generate an arbitrary number of categories, and some of the clusters include only 2~3 web pages. In practical applications, however, usually only the clusters containing the most web pages are shown, so in the actual experiment the categories with fewer than 5 web pages are named the "other" category. By changing the parameters, the number of categories of the 3 algorithms ranges from 5 to 10.

4.2. Experimental Condition and Experimental Data

In this article, the experiment compares MFIC with K-Means, both based on full text, and at the same time compares the clustering time with STC based on abstracts [6]. For the full text of web pages, the clustering time is too long to be applied to on-line clustering; the experimental data show that the time is more than 10 seconds.

In addition, Lingo is run from its open Java implementation while the other algorithms are implemented in C++, which should be kept in mind when they are compared with it. The graph suggests that the MFIC clustering time is superior to that of K-Means. Because MFIC clusters on the full text of web pages, it is a foregone conclusion that its clustering time is longer than that of STC, which is based on abstracts. The experimental results show that MFIC, with a clustering time of about 2 seconds, can satisfy the demands of on-line clustering. To improve system response in practical applications, the clustering results can be cached to reduce the user's waiting time.

Fig. 3. The cluster algorithm time comparison

Conclusion

This article proposes a clustering algorithm for the results returned by an SE, based on full-text maximal frequent itemsets. First, it studies the mining algorithm for frequent itemsets and improves maximal frequent itemset mining in combination with the FPMax algorithm, which also increases the mining speed. Then it proposes the MFIC algorithm based on maximal frequent itemsets. The MFIC algorithm mainly includes three steps. First, clusters are generated from the mined maximal frequent itemsets. Second, the clusters are merged and judged by combining the similarity of the frequent itemsets with the similarity of the document sets contained in the clusters. Finally, a label generation algorithm is proposed that combines frequent itemsets with term sequences.

References

[1] Liping Jing, Michael K. Ng. An Entropy Weighting k-Means Algorithm for Subspace Clustering of High-Dimensional Sparse Data. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(8), p.1026-1040.
[2] Wei Song, Soon Cheol Park. Genetic Algorithm based text clustering technique: Automatic evolution of cluster with high efficiency. 7th International Conference on Web-Age Information Management Workshops. Hong Kong, 2006, p.17-18.
[3] Daniel Crabtree, Xiaoying Gao. Improving Web Clustering by Cluster Selection. The 2005 IEEE/WIC/ACM International Conference on Web Intelligence. 2005, p.172-178.
[4] Hung Chim, Xiaotie Deng. A New Suffix Tree Similarity Measure for Document Clustering. World Wide Web Conference Committee. 2007, p.121-129.
[5] Daniel Crabtree, Peter Andreae. Query Directed Web Page Clustering. Proceeding of the IEEE/WIC/ACM International Conference on Web Intelligence. 2006, p.202-210.