Personalized Concept-Based Clustering of Search Engine Queries


IEEE TRANSACTIONS ON JOURNAL NAME, MANUSCRIPT ID

Kenneth Wai-Ting Leung, Wilfred Ng, and Dik Lun Lee

Abstract: The exponential growth of information on the Web has introduced new challenges for building effective search engines. A major problem of web search is that search queries are usually short and ambiguous, and thus are insufficient for specifying the precise user needs. To alleviate this problem, some search engines suggest terms that are semantically related to the submitted queries so that users can choose from the suggestions the ones that reflect their information needs. In this paper, we introduce an effective approach that captures the user's conceptual preferences in order to provide personalized query suggestions. We achieve this goal with two new strategies. First, we develop online techniques that extract concepts from the web-snippets of the search results returned from a query and use the concepts to identify related queries for that query. Second, we propose a new two-phase personalized agglomerative clustering algorithm that is able to generate personalized query clusters. To the best of the authors' knowledge, no previous work has addressed personalization for query suggestions. To evaluate the effectiveness of our technique, a Google middleware was developed for collecting clickthrough data to conduct experimental evaluation. Experimental results show that our approach has better precision and recall than the existing query clustering methods.

Index Terms: Clickthrough, concept-based clustering, personalization, query clustering, search engine.

1 INTRODUCTION

The amount of information available on the web is growing rapidly. Google [4] reported that its index size was over 8 billion pages in 2004, and by one estimate it later reached 20 billion pages. As the web keeps expanding, the number of pages indexed in a search engine increases correspondingly.
With such a large volume of data, finding relevant information satisfying user needs based on simple search queries becomes an increasingly difficult task. Queries submitted by search engine users tend to be short and ambiguous. A study by M. Jansen [20] found that the average query length on a popular search engine was only 2.35 terms. Such short queries are unlikely to express precisely what the user really needs. As a result, many of the retrieved pages may be irrelevant to the user's needs. On the other hand, users may not want to reformulate their queries using more search terms, since doing so imposes an additional burden on them during searching.

To improve the user's search experience, most major commercial search engines provide query suggestions to help users formulate more effective queries. When a user submits a query, a list of terms that are semantically related to the submitted query is provided to help the user identify the terms that he/she really wants, hence improving the retrieval effectiveness. Yahoo's "Also Try" [6] and Google's "Searches related to" features provide related queries for narrowing search, while Ask Jeeves [2] suggests both more specific and more general queries to the user, as shown in Fig. 2. Unfortunately, these systems provide the same suggestions for the same query without considering users' specific interests.

In this paper, we propose a method that provides personalized query suggestions based on a personalized concept-based clustering technique. In contrast to existing methods, which provide the same suggestions to all users, our approach uses clickthrough data to estimate the user's conceptual preferences and then provides personalized query suggestions for each individual user according to his/her conceptual needs.

K.W. Leung, W. Ng, and D.L. Lee are with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong. E-mail: {kwtleung, wilfred, dlee}@cse.ust.hk.
The motivation of our research is that queries submitted to a search engine may have multiple meanings. For example, depending on the user, the query "apple" may refer to a fruit, the company Apple Computer, or the name of a person. Thus, providing personalized query suggestions (e.g., users interested in "apple" as a fruit get suggestions about fruit, while users interested in "apple" as a company get suggestions about the company's products) helps users to formulate more effective queries according to their needs.

The underlying idea of our proposed technique is based on concepts and their relations extracted from the submitted user queries, the web-snippets¹ and the clickthrough data. Clickthrough data is exploited in the personalized clustering process to identify user preferences: a user clicks on a search result mainly because the web-snippet contains a relevant topic that the user is interested in. Moreover, clickthrough data can be collected easily without imposing an extra burden on users, thus providing a low-cost means to capture the user's interests.

¹ A web-snippet denotes the title, summary and URL of a Web page returned by search engines.

TABLE 1
THE CLICKTHROUGH DATA FOR THE QUERY "APPLE"

Link    Web-snippet of the search result
l1      Apple Hong Kong
l2      Apple Hong Kong - iPod + iTunes
l3      apple daily
l4      Apple
l5      Apple - iPod + iTunes
l6      Apple Inc. - Wikipedia, the free encyclopedia
l7      Apple II series - Wikipedia, the free encyclopedia
l8      Apple.Mac
l9      The Apple Store (US)
l10     Apple - Support

Our approach consists of the following four major steps. First, when a user submits a query, concepts (i.e., important terms or phrases in web-snippets) and their relations are mined online from web-snippets to build a concept relationship graph. Second, clickthroughs are collected to predict the user's conceptual preferences. Third, the concept relationship graph together with the user's conceptual preferences is used as input to a concept-based clustering algorithm that finds conceptually close queries. Finally, the most similar queries are suggested to the user for search refinement. Fig. 1 shows the general process of our approach.

To evaluate the performance of our approach, we developed a Google middleware for clickthrough data collection². Users were invited to use our middleware to search 200 test queries selected from a spectrum of topical categories. When a user submits a query, concepts related to the query are mined and stored in our databases. If the user clicks on one of the search results, the user's clickthroughs together with his/her concept preference profile for the query are updated. The clustering results on the 200 test queries are compared against the predefined clusters prepared by human editors. We evaluate the performance of our approach using the standard recall-precision measures. Beeferman and Berger's agglomerative clustering algorithm [11] (simply called BB's algorithm in this paper) is used as the baseline for comparison with our concept-based approach. Our experimental results show that the average precision of our approach at any recall level is better than that of the baseline method.
The main contributions of this paper are summarized below:

1. Most of the previous approaches to query clustering consider two different queries to be semantically similar if they lead to common clicks on the same pages. However, the chance of different queries leading to common clicks on the same URLs is small in web search engines (see Section 2 for more discussion). Based on this important observation, we propose to use concepts, not pages, as the common ground for relating semantically similar queries. That is, two queries are considered related if they lead to clicks on pages that share some common concepts, which are mined from the web-snippets in the search results.

2. To our knowledge, there is no previous study on the personalization of query suggestions. We propose a two-phase clustering method to cluster queries first within the scope of each user and then for the community.

3. We conduct experiments to evaluate different methods and show that our concept-based, two-phase clustering method yields the best precision and recall.

² The middleware approach is for facilitating experimentation. The techniques developed in this paper can be directly integrated into any search engine to provide personalized query suggestions.

Fig. 1. The general process of concept-based clustering.

The rest of this paper is organized as follows. In Section 2, we compare our method with other similar approaches. We also discuss some works related to concept mining. In Section 3, we review BB's algorithm, which is also an effective technique in personalized query clustering. In Section 4, our concept mining method for extracting concepts from web-snippets is presented. In Section 5, we adapt BB's algorithm to our concept-based approach. We further extend the concept-based BB's algorithm to a personalized clustering algorithm by utilizing the user concept preference profiles. Experimental results comparing BB's algorithm with our methods are presented in Section 6. Section 7 concludes the paper.

2 RELATED WORK

Query clustering techniques have been developed in diversified ways.
The very first query clustering technique comes from information retrieval studies [26]. Similarity between queries was measured based on overlapping keywords or phrases in the queries. Each query is represented as a keyword vector. Similarity functions such as cosine similarity or Jaccard similarity [26] were used to measure the distance between two queries. One major limitation of the approach is that common keywords also exist in unrelated queries. For example, the queries "apple iPod" (an mp3 player) and "apple pie" (a dessert) are

very similar since they both contain the keyword "apple". However, the queries actually express two different search needs.

Fig. 2. Part of the search result page generated by Ask.com in response to the query "apple". A list of query suggestions is provided, showing many possible choices for query refinement.

Chuang and Chien [14] proposed to cluster and organize users' queries into a hierarchical structure of topic classes. A Hierarchical Agglomerative Clustering (HAC) [25] algorithm is first employed to construct a binary-tree cluster hierarchy. The binary-tree hierarchy is then partitioned in order to create sub-hierarchies forming a multiway-tree cluster hierarchy like the hierarchical organization of Yahoo [6] and DMOZ [3].

Baeza-Yates et al. [10] proposed a query clustering method that groups similar queries according to their semantics. The method creates a vector representation Q for a query q, where the vector Q is composed of terms from the clicked documents of q. Cosine similarity is applied to the query vectors to discover similar queries. More recently, Zhang and Nasraoui [33] presented a method that discovers similar queries by analyzing users' sequential search behavior. The method assumes that consecutive queries submitted by a user are related to each other. The sequential search behavior is combined with a traditional content-based similarity method to compensate for the high sparsity of real query log data.

Recently, Beitzel et al. [12] proposed a query classification method that combines multiple classifiers. The method combines techniques from machine learning and computational linguistics. Their results were compared to those from the 2005 KDD Cup [5], showing that their combined approach produced higher recall and smoother tradeoffs between recall and precision than any of the component approaches.

On web search engines, clickthrough data is a kind of implicit feedback from users.
Table 1 is an example of clickthrough data for the query "apple", which shows the URLs returned from the search engine for the query and the URLs clicked on by the user. Clearly, it is a valuable resource for capturing the user's interests for building personalized web search systems [7], [8], [17], [18], [21], [22], [24], [27], [28], [29]. Joachims [21] proposed a method that employs preference mining and machine learning to rerank search results according to the user's personal preferences. Later on, Smyth et al. [27] suggested that user search behavior is repetitive and regular. They proposed to rerank search results such that the results that have been previously selected for a given query are promoted ahead of other search results. More recently, Deng et al. [17] proposed an algorithm that combines a spying technique with a novel voting procedure to determine user preferences from the clickthrough data. Dou et al. [18] also performed a large-scale evaluation of different personalized search strategies, including clickthrough-based and profile-based personalization. They suggested that click-based personalization strategies perform consistently and considerably well when compared to profile-based methods.

To resolve the disadvantage of keyword-based clustering methods, clickthrough data has been used to cluster queries based on common clicks on URLs. Beeferman and Berger [11] proposed an agglomerative clustering algorithm (i.e., BB's algorithm) to exploit query-document relationships from clickthrough data. Given a search engine log, BB's algorithm first constructs a bipartite graph with one set of vertices corresponding to queries, and another corresponding to documents. If a user clicks on a document, a link between the corresponding query and document is created on the bipartite graph. After the bipartite graph is obtained, an agglomerative clustering algorithm is used to obtain the clusters.
The algorithm is content-independent in the sense that it exploits only the query-document links on the bipartite graph to discover similar queries and similar documents without examining the keywords in the queries or the documents. The details of the algorithm will be described in Section 3.

Wen et al. [31] proposed a clustering algorithm combining both query contents and URL clicks. They suggested that two queries should be clustered together if they contain the same or similar terms and lead to the selection of the same documents. However, since web search queries are usually short and common clicks on documents are rare (see discussion below), Wen et al.'s method may not be effective for disambiguating web queries. In contrast, our approach relates the queries to a set of extracted concepts in order to identify the precise semantics of the search queries.

One major problem with clickthrough-based methods is that the number of common clicks on URLs for different queries is limited. This is because different queries will likely retrieve very different result sets in very different ranking orders. Thus, the chance for the users to see the same results would be small, let alone to click on them. It was reported that in a large clickthrough dataset from a commercial search engine the chance for two random queries to have a common click is merely $6.38 \times 10^{-5}$ [11]. The small number of common clicks leads to low recall. To alleviate this problem, we introduce the notion of concept-based graphs by considering concepts extracted from web-snippets and adapt BB's method to this new

context. In contrast to the existing methods, our approach provides an effective personalization effect by using the concept preference profiles that are built upon the extracted concepts and clickthroughs. The use of concepts helps to reduce the size of the resulting profiles, while retaining the accuracy and capability to capture the user's interests.

TABLE 2
FREQUENTLY USED SYMBOLS

Symbol            Description
G                 A bipartite graph
m                 The number of iterations (i.e., merges) required for agglomerative clustering
n_b               The number of black vertices in G
n_w               The number of white vertices in G
N_max             The maximum number of neighbors of any vertex in G
sim(x, y)         Similarity between vertices x and y in G
sim_R(t_i, t_j)   Similarity between concepts t_i and t_j
sf(t_i)           Snippet frequency of the keyword/phrase t_i
support(t_i)      Interestingness of a particular keyword/phrase t_i with respect to the returned web-snippets arising from a query
|t_i|             The number of terms in the keyword/phrase t_i
upper bound       The upper bound for the number of operations required for agglomerative clustering

Along the line of concept extraction from web-snippets, Koester [23] combined web mining techniques and formal concept analysis to extract concepts from web-snippets and build a concept lattice capturing the user's conceptual needs. However, it was not concerned with personalization. Xu et al. [32] proposed a method to extract concepts from users' browsed documents to create hierarchical concept profiles for personalized search in a privacy-enhanced environment. Their method assumes that the system knows the documents that the user is interested in, instead of using clickthroughs. Thus, their method is quite different from ours.

Another technique to discover related queries is query expansion. The aim of query expansion is to improve retrieval effectiveness by expanding the query with words or phrases to match additional documents. Cui et al. [15] proposed a query expansion method based on user interactions recorded in the clickthrough data.
The method focuses on mining correlations between query terms and document terms by analyzing users' clickthroughs. Document terms that are strongly related to the input query are used together to narrow down the search.

3 BB'S GRAPH-BASED CLUSTERING ALGORITHM

In BB's graph-based clustering [11], a query-page bipartite graph is first constructed, in which one set of nodes corresponds to the set of submitted queries and the other corresponds to the set of clicked pages. If a user clicks on a page, a link between the query and the page is created on the bipartite graph. After obtaining the bipartite graph, an agglomerative clustering algorithm is used to discover similar queries and similar pages. During the clustering process, the algorithm iteratively combines the two most similar queries into one query node, then the two most similar pages into one page node, and this alternating combination of queries and pages is repeated until a termination condition is satisfied. The main reason for not clustering all the queries first and then all the pages is that two queries may seem unrelated prior to page clustering because they link to two different pages, but they may become similar to each other if the two pages are similar enough to be merged later. The example in Fig. 3 helps illustrate this scenario.

Fig. 3. (a) Queries $q_1$ and $q_3$ seem unrelated before document clustering. (b) After document clustering, queries $q_1$ and $q_3$ are related to each other because they are both linked to the document cluster $\{d_1, d_2\}$.

Fig. 4. (a) A bipartite graph without noise. (b) A bipartite graph with a noise link, where the solid edges represent real links and the dashed edge represents a noise edge.

To compute the similarity between queries or documents on a bipartite graph, the algorithm considers the overlap of their neighboring vertices as defined in the following equation:

$$
sim(x, y) =
\begin{cases}
\dfrac{|N(x) \cap N(y)|}{|N(x) \cup N(y)|} & \text{if } |N(x) \cup N(y)| > 0 \\
0 & \text{otherwise}
\end{cases}
\tag{1}
$$

where $N(x)$ is the set of neighboring vertices of $x$, and $N(y)$ is the set of neighboring vertices of $y$.
Intuitively, the similarity function formalizes the idea that $x$ and $y$ are similar if their respective neighboring vertices largely overlap, and vice versa. As discussed in Section 2, one problem with BB's method is its low recall, since the number of common clicks on the URLs is rather small. Another problem with the similarity function proposed by BB is that it cannot identify noise links in the clustering process. Consider the example shown in Fig. 4, where the number attached to a link is the total number of clicks on the document. In Fig. 4(a), $q_2$ is a hot query that generates 1000 clicks for each of the documents $d_2$ and $d_3$, while $q_1$ is a cold query that generates only 10 clicks for each of the documents $d_1$ and $d_2$. Even though the click distributions for $q_1$ and $q_2$ are different, we can see that $d_1$ and $d_2$ are both relevant to $q_1$ because the number of clicks on $d_1$ and the number of clicks on $d_2$ are roughly the same for $q_1$ (i.e., 10 clicks).

Similarly, we can see that $d_2$ and $d_3$ are both relevant to $q_2$ because the number of clicks on $d_2$ and the number of clicks on $d_3$ are roughly the same for $q_2$ (i.e., 1000 clicks). Thus, we conclude that $q_1$ and $q_2$ are similar queries because they share the common relevant document $d_2$. However, in Fig. 4(b), $d_2$ cannot be considered relevant to $q_1$ because only a small fraction of the clicks (10 out of 1010) supports that conclusion. Consequently, we cannot conclude that $q_1$ and $q_2$ are similar queries. BB's similarity function does not detect the noise link shown in Fig. 4(b): it gives the same similarity score of 1/3 in both cases. To solve the problem, the following similarity function was proposed in our earlier work [13]:

$$
sim(x, y) =
\begin{cases}
\dfrac{|L(x, y)|}{|L(x) \cup L(y)|} & \text{if } |L(x) \cup L(y)| > 0 \\
0 & \text{otherwise}
\end{cases}
\tag{2}
$$

where $L(x, y)$ is the set of links connecting $x$ and $y$ to the same vertices, $L(x)$ and $L(y)$ are all the links connecting to $x$ and $y$, respectively, and $|L(\cdot)|$ is the cardinality of $L(\cdot)$. Applying the similarity function, we get a similarity score of 1010/2020 = 1/2 for $sim(q_1, q_2)$ in Fig. 4(a), and a similarity score of 1010/3010 ≈ 1/3 for $sim(q_1, q_2)$ in Fig. 4(b). Note that the score for $sim(q_1, q_2)$ in Fig. 4(a) is higher than that in Fig. 4(b), because most people select document $d_1$ in Fig. 4(b), and the links between $q_1$ and $d_2$ can be considered as noise. Therefore, it is reasonable to assign a lower score to $sim(q_1, q_2)$ in Fig. 4(b).

Using the noise-tolerant similarity function, the similarity between two vertices always lies in [0,1]. The similarity of two vertices is 0 if they share no common neighbor, and 1 if they have exactly the same neighbor vertices. Noise elimination by itself is a difficult problem, since it requires complex inference rules to distinguish the informative clicks from the erroneous ones. Since the noise-tolerant version has been shown to be superior to the original version [13] and we are not aware of any better methods, in the rest of this paper BB's algorithm refers to this improved version of the similarity function.
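The contrast between Equations (1) and (2) can be checked numerically on the Fig. 4 example. The following is a minimal sketch, assuming each query is stored as a dictionary from clicked documents to click counts; the function names and the data layout are illustrative choices, not the authors' implementation. The click counts follow the description of Fig. 4 in the text.

```python
from typing import Dict

# Assumed layout: query -> {document: number of clicks on that link}.
Clicks = Dict[str, Dict[str, int]]

def jaccard_sim(graph: Clicks, x: str, y: str) -> float:
    """Equation (1): overlap of neighbor sets, ignoring click counts."""
    nx, ny = set(graph[x]), set(graph[y])
    union = nx | ny
    return len(nx & ny) / len(union) if union else 0.0

def noise_tolerant_sim(graph: Clicks, x: str, y: str) -> float:
    """Equation (2): clicks on shared neighbors over all clicks of x and y."""
    shared = set(graph[x]) & set(graph[y])
    common = sum(graph[x][v] + graph[y][v] for v in shared)
    total = sum(graph[x].values()) + sum(graph[y].values())
    return common / total if total else 0.0

# Fig. 4(a): no noise. Fig. 4(b): the q1-d2 link carries only 10 of q1's 1010 clicks.
fig4a = {"q1": {"d1": 10, "d2": 10}, "q2": {"d2": 1000, "d3": 1000}}
fig4b = {"q1": {"d1": 1000, "d2": 10}, "q2": {"d2": 1000, "d3": 1000}}

for name, g in (("4(a)", fig4a), ("4(b)", fig4b)):
    print(name, jaccard_sim(g, "q1", "q2"), noise_tolerant_sim(g, "q1", "q2"))
```

Equation (1) returns 1/3 for both graphs, whereas Equation (2) returns 1010/2020 for Fig. 4(a) and 1010/3010 for Fig. 4(b), matching the scores quoted above.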
4 CONCEPT EXTRACTION

Before explaining our concept-based clustering method, we first describe our concept extraction method, which is composed of the following three basic steps: 1) extracting concepts using the web-snippets returned from the search engine, 2) mining concept relations, and 3) creating a user concept preference profile using the extracted concepts, concept relations and the user's clickthroughs.

4.1 Concept Extraction Using Web-Snippets

Our concept extraction method is inspired by the well-known problem of finding frequent item sets in data mining [9], [19]. When a user submits a query to the search engine, a set of web-snippets is returned to the user for identifying the relevant items. We assume that if a keyword or a phrase appears frequently in the web-snippets of a particular query, it represents an important concept related to the query because it co-exists in close proximity with the query in the top documents.

TABLE 3
EXTRACTED CONCEPTS FOR THE QUERY "APPLE"

Concept t_i              support(t_i)    Concept t_i       support(t_i)
mac                      0.1             macintosh         0.05
ipod                     0.1             tour              0.05
iphone                   0.1             slashdot apple    0.04
hardware                 0.09            picture           0.04
software                 0.09            apple             0.04
big apple                0.08            apple variety     0.04
apple store              0.06            music             0.04
mac os                   0.06            farm market       0.04
apple orchard            0.06            apple grower      0.04
apple valley             0.06            gift shop         0.04
apple and macintosh      0.06            apple farm        0.04
apple blossom festival   0.06

We use the following support formula for measuring the interestingness of a particular keyword/phrase $t_i$ with respect to the returned web-snippets arising from a query $q$:

$$
support(t_i) = \frac{sf(t_i)}{n} \cdot |t_i|
\tag{4}
$$

where $n$ is the total number of web-snippets returned, $sf(t_i)$ is the snippet frequency of the keyword/phrase $t_i$ (i.e., the number of web-snippets containing $t_i$) and $|t_i|$ is the number of terms in the keyword/phrase $t_i$. For simplicity, we omit $q$ in the above expression if no ambiguity arises. To extract concepts for a query $q$, we first extract all the keywords and phrases from the web-snippets returned by the query.
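As a rough illustration of Equation (4), the sketch below computes the support of a candidate keyword/phrase over a handful of web-snippets. The whitespace tokenization, the helper names and the example snippets are assumptions made for this sketch; the paper does not specify how phrases are tokenized or matched.

```python
from typing import List

def contains_phrase(tokens: List[str], phrase_tokens: List[str]) -> bool:
    """True if phrase_tokens occurs as a contiguous run inside tokens."""
    k = len(phrase_tokens)
    return any(tokens[i:i + k] == phrase_tokens for i in range(len(tokens) - k + 1))

def support(phrase: str, snippets: List[str]) -> float:
    """Equation (4): support(t) = (sf(t) / n) * |t|."""
    n = len(snippets)
    if n == 0:
        return 0.0
    phrase_tokens = phrase.lower().split()
    # sf(t): number of snippets that contain the phrase at least once.
    sf = sum(1 for s in snippets if contains_phrase(s.lower().split(), phrase_tokens))
    return sf / n * len(phrase_tokens)

# Hypothetical snippets, invented for illustration only.
snippets = [
    "apple store hong kong",
    "apple ipod and itunes",
    "apple store online",
    "apple pie recipe",
]
print(support("apple store", snippets))  # two of four snippets, two terms
print(support("ipod", snippets))         # one of four snippets, one term
```

Because the snippet frequency is multiplied by the phrase length, a longer phrase is favored over an equally frequent single keyword, so the support value is not bounded by 1.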
After obtaining a set of keywords/phrases ($t_i$), we compute the support for all $t_i$ ($support(t_i)$). If the support of a keyword/phrase $t_i$ is bigger than the threshold $s$ ($support(t_i) > s$), we treat $t_i$ as a concept for the query $q$. Table 3 illustrates the extracted concepts for the query $q$ = "apple".

4.2 Mining Concept Relations

To find relations between concepts, we apply a well-known signal-to-noise ratio formula from data mining [16] to establish the similarity between terms $t_1$ and $t_2$. The similarity value of Church and Hanks' formula always lies in [0,1], and thus can be used directly in Step 3.

$$
sim(t_1, t_2) = \frac{\log \dfrac{n \cdot df(t_1 \cup t_2)}{df(t_1) \cdot df(t_2)}}{\log n}
\tag{5}
$$

where $n$ is the number of documents in the corpus, $df(t_1 \cup t_2)$ is the joint document frequency of $t_1$ and $t_2$, and $df(t)$ is the document frequency of the term $t$. In our context, two concepts $t_i$, $t_j$ could co-exist in a web-snippet in the following situations: 1) $t_i$ and $t_j$ co-exist in the title, 2) $t_i$ and $t_j$ co-exist in the summary, or 3) $t_i$ exists in the title, while $t_j$ exists in the summary (or vice

versa). Therefore, we modify Church and Hanks' formula for the three different cases in our context as follows.

$$
sim_R(t_i, t_j) = sim_{R,title}(t_i, t_j) + sim_{R,summary}(t_i, t_j) + sim_{R,other}(t_i, t_j)
\tag{6}
$$

where $sim_R(t_i, t_j)$ is the similarity between concepts $t_i$ and $t_j$, which is composed of $sim_{R,title}(t_i, t_j)$, $sim_{R,summary}(t_i, t_j)$ and $sim_{R,other}(t_i, t_j)$ as follows.

$$
sim_{R,title}(t_i, t_j) = \frac{\log \dfrac{n \cdot sf_{title}(t_i \cup t_j)}{sf_{title}(t_i) \cdot sf_{title}(t_j)}}{\log n}
\tag{7}
$$

$$
sim_{R,summary}(t_i, t_j) = \frac{\log \dfrac{n \cdot sf_{summary}(t_i \cup t_j)}{sf_{summary}(t_i) \cdot sf_{summary}(t_j)}}{\log n}
\tag{8}
$$

$$
sim_{R,other}(t_i, t_j) = \frac{\log \dfrac{n \cdot sf_{other}(t_i \cup t_j)}{sf_{other}(t_i) \cdot sf_{other}(t_j)}}{\log n}
\tag{9}
$$

where $n$ is the total number of web-snippets returned, $sf_{title}(t_i \cup t_j)$ is the joint snippet frequency of concepts $t_i$ and $t_j$ in document titles, $sf_{title}(t)$ is the snippet frequency of concept $t$ in document titles, $sf_{summary}(t_i \cup t_j)$ is the joint snippet frequency of $t_i$ and $t_j$ in document summaries, $sf_{summary}(t)$ is the snippet frequency of concept $t$ in document summaries, $sf_{other}(t_i \cup t_j)$ is the joint snippet frequency of concept $t_i$ in a document title and $t_j$ in the document's summary (or vice versa), and $sf_{other}(t)$ is the snippet frequency of concept $t$ in either document summaries or document titles.

Using the extracted concepts and concept relations, we can create a concept relationship graph with the extracted concepts as nodes and the mined concept relations as links. Fig. 5(a) shows a concept relationship graph for the query $q$ = "apple". A link is created between concepts $t_i$ and $t_j$ if their similarity, $sim_R(t_i, t_j)$, is greater than zero. The strength of each link is determined by $sim_R(t_i, t_j)$.

Fig. 5. (a) A concept relationship graph for the query "apple" derived without incorporating user clickthroughs. (b) A concept preference profile constructed using the user clickthroughs and the concept relationship graph in (a). $w_{t_i}$ is the interestingness of the concept $t_i$ to the user. More clicks on a concept gradually increase the interestingness $w_{t_i}$ of the concept.
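The log-ratio form shared by Equations (5) and (7)-(9) can be sketched as a single helper that is applied to the title, summary and title/summary counts and then summed as in Equation (6). This is a sketch under stated assumptions: the function names are hypothetical, and returning 0 when a frequency is zero (where the logarithm is undefined) is a choice made here, not something the text specifies.

```python
import math
from typing import Tuple

Counts = Tuple[int, int, int]  # (sf(t_i), sf(t_j), joint sf of t_i and t_j)

def log_ratio_sim(n: int, sf_i: int, sf_j: int, sf_joint: int) -> float:
    """log(n * sf_joint / (sf_i * sf_j)) / log(n), as in Equations (5), (7)-(9)."""
    if n <= 1 or sf_i == 0 or sf_j == 0 or sf_joint == 0:
        return 0.0  # assumption: undefined cases contribute no relation
    return math.log(n * sf_joint / (sf_i * sf_j)) / math.log(n)

def sim_r(n: int, title: Counts, summary: Counts, other: Counts) -> float:
    """Equation (6): sum of the title, summary and title/summary components."""
    return sum(log_ratio_sim(n, *counts) for counts in (title, summary, other))

# Hypothetical counts over n = 100 snippets.
print(log_ratio_sim(100, 10, 10, 10))  # the two concepts always co-occur
print(log_ratio_sim(100, 10, 10, 1))   # co-occurrence at chance level
print(sim_r(100, (10, 10, 10), (10, 10, 1), (5, 5, 0)))
```

A pair that co-occurs no more often than chance gets a score of 0 and therefore no link in the concept relationship graph, since links are created only for similarities greater than zero.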
4.3 Creating User Concept Preference Profile

The concept relationship graph is first derived without taking user clickthroughs into account. Intuitively, the graph shows the possible concept space arising from the user's queries. The concept space, in general, covers more than what the user actually wants. For example, when the user searches for the query "apple", the concept space derived from the web-snippets contains concepts such as "ipod", "iphone" and "recipe". If the user is indeed interested in the concept "recipe" and clicks on pages containing the concept "recipe", the clickthroughs should gradually favor the concept "recipe" and its neighborhood (by assigning higher weights to the nodes), but the weights of the unrelated concepts such as "iphone", "ipod" and their neighborhood should remain zero. Therefore, we propose the following formulas to capture the user's interestingness $w_{t_i}$ on the extracted concepts $t_i$ when a clicked web-snippet $s_j$, denoted by $click(s_j)$, is found:

$$
click(s_j) \Rightarrow \forall t_i \in s_j,\; w_{t_i} = w_{t_i} + 1
\tag{10}
$$

$$
click(s_j) \Rightarrow \forall t_i \in s_j,\; w_{t_j} = w_{t_j} + sim_R(t_i, t_j) \quad \text{if } sim_R(t_i, t_j) > 0
\tag{11}
$$

where $s_j$ is a web-snippet, $w_{t_i}$ is the interestingness weight of the concept $t_i$ and $t_j$ is a neighborhood concept of $t_i$. When a user clicks on $s_j$, the weight of each concept $t_i$ appearing in $s_j$ is incremented by 1 to reflect the user's interest in the concepts embedded in the clicked page. Other concepts that are related to the clicked concepts on the concept relationship graph are incremented according to the similarity score given in Equation (6), which is normalized to the range [0,1]. Therefore, if a concept is closely related to the clicked concept, it is incremented by a higher value (which could be as close to 1 as that of the clicked concepts). Otherwise, it is only incremented by a small fraction (close to 0).

By imposing the user's interestingness on the concepts, a concept preference profile with respect to the input query is created. Fig. 5(b) shows an example of a concept preference profile in which the user is interested in information about "apple macintosh". $w_{t_i}$ in Fig. 5(b) represents the interestingness of the concepts to the user.
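The update rules in Equations (10) and (11) can be sketched as a single function applied once per clicked web-snippet. The dictionary-based representation of the concept relationship graph, the function name and the similarity values are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable

def update_profile(
    weights: Dict[str, float],
    clicked_concepts: Iterable[str],
    sim_r: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Apply Equations (10) and (11) for one clicked web-snippet."""
    for t_i in clicked_concepts:
        weights[t_i] += 1.0                    # Equation (10): concept in the clicked snippet
        for t_j, s in sim_r.get(t_i, {}).items():
            if s > 0:
                weights[t_j] += s              # Equation (11): related neighbor concept
    return weights

# Hypothetical similarity values on the concept relationship graph.
sim_r = {"macintosh": {"mac os": 0.6, "software": 0.3}}
weights = update_profile(defaultdict(float), ["macintosh"], sim_r)
print(dict(weights))
```

In this example the clicked concept "macintosh" gains a full point, its neighbors gain a fraction proportional to their similarity, and an unrelated concept such as "recipe" keeps a weight of zero.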
The values of $w_{t_i}$ for "macintosh" and "mac" are the highest because the user has shown interest in them (their values are incremented using Equation (10)). Indirectly, the values of $w_{t_i}$ for "mac os", "software", "apple store", "ipod", "iphone", and "hardware" are increased because they are related to "apple macintosh" and are thus incremented using Equation (11). Finally, the weights of the concepts about apple as a fruit are not

changed. As a result, the concepts form two clusters representing the user concept preference profile.

5 CONCEPT-BASED CLUSTERING

Using the concepts extracted from web-snippets, we propose two concept-based clustering methods. We first extend BB's algorithm to a concept-based algorithm in Section 5.1. In Section 5.2, the concept-based algorithm is further enhanced to achieve effective personalized clustering.

5.1 Clustering on Query-Concept Bipartite Graph

We now describe our concept-based algorithm (i.e., BB's algorithm using a query-concept bipartite graph) for clustering similar queries. Similar to BB's algorithm, our technique is composed of two steps: 1) bipartite graph construction using the extracted concepts, and 2) agglomerative clustering using the bipartite graph constructed in Step 1.

Using the extracted concepts and clickthrough data, the first step of our method is to construct a query-concept bipartite graph, in which one side of the vertices corresponds to unique queries and the other corresponds to unique concepts. If a user clicks on a search result, concepts appearing in the web-snippet of the search result are linked to the corresponding query on the bipartite graph. Algorithm 1 shows the first step of our method.

After the bipartite graph is constructed, the agglomerative clustering algorithm is applied to obtain clusters of similar queries and similar concepts. The noise-tolerant similarity function (recall Equation (2)) is used for finding similar vertices on the bipartite graph G. The agglomerative clustering algorithm iteratively merges the most similar pair of white vertices, then the most similar pair of black vertices, and so on. We present the details in Algorithm 2.
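The two steps described above (bipartite graph construction followed by alternating agglomerative merging) can be sketched as follows. This is a minimal sketch rather than the authors' implementation: the tuple-based cluster representation, the function names, the stopping rule (merge only while the best pair exceeds a threshold) and the example data are all assumptions. The query threshold of 0.18 is the value reported in this section; reusing it for concepts is done here only for the sake of the example.

```python
from collections import defaultdict
from itertools import combinations

def build_graph(clicks):
    """Step 1: link each query to the concepts of its clicked web-snippets."""
    g = defaultdict(lambda: defaultdict(int))
    for query, concepts in clicks:
        for c in concepts:
            g[(query,)][(c,)] += 1          # vertices are tuples so clusters can grow
    return {q: dict(cs) for q, cs in g.items()}

def transpose(g):
    """Swap the two sides of the bipartite graph."""
    t = defaultdict(dict)
    for a, nbrs in g.items():
        for b, w in nbrs.items():
            t[b][a] = w
    return dict(t)

def sim(g, x, y):
    """Noise-tolerant similarity of Equation (2)."""
    shared = set(g[x]) & set(g[y])
    common = sum(g[x][v] + g[y][v] for v in shared)
    total = sum(g[x].values()) + sum(g[y].values())
    return common / total if total else 0.0

def merge_best(g, threshold):
    """Merge the most similar pair on one side if it exceeds the threshold."""
    pairs = list(combinations(sorted(g), 2))
    if not pairs:
        return g, False
    x, y = max(pairs, key=lambda p: sim(g, *p))
    if sim(g, x, y) <= threshold:
        return g, False
    merged = defaultdict(int)
    for v in (x, y):
        for nbr, w in g[v].items():
            merged[nbr] += w                # the merged vertex inherits all links
    g = {k: v for k, v in g.items() if k not in (x, y)}
    g[tuple(sorted(x + y))] = dict(merged)
    return g, True

def cluster(g, q_threshold, c_threshold):
    """Step 2: alternate query merges and concept merges until neither applies."""
    while True:
        g, merged_q = merge_best(g, q_threshold)
        ct, merged_c = merge_best(transpose(g), c_threshold)
        g = transpose(ct)
        if not (merged_q or merged_c):
            return g

# Hypothetical clickthrough data: (query, concepts in the clicked snippet).
clicks = [
    ("apple ipod", ["ipod", "music"]),
    ("ipod nano", ["ipod", "music"]),
    ("apple pie", ["recipe", "dessert"]),
]
result = cluster(build_graph(clicks), 0.18, 0.18)
print(sorted(result))
```

On this toy input the two iPod queries share all their concepts and end up in one cluster, while "apple pie" stays separate even though it shares the keyword "apple" with one of them.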
Algorithm 1: Bipartite Graph Construction
Input: Clickthrough data CT, extracted concepts E
Output: A query-concept bipartite graph G
1: Obtain the set of unique queries $Q = \{q_1, q_2, q_3, \ldots\}$ from CT.
2: Obtain the set of unique concepts $C = \{c_1, c_2, c_3, \ldots\}$ from E.
3: $Nodes(G) = Q \cup C$, where $Q$ and $C$ are the two sides of G.
4: If the web-snippet $s$ retrieved using $q_i \in Q$ is clicked by a user, create an edge $e = (q_i, c_j)$ in G, where $c_j$ is a concept appearing in $s$.

Algorithm 2: Agglomerative Clustering
Input: A query-concept bipartite graph G
Output: A clustered query-concept bipartite graph $G_c$
1: Obtain the similarity scores for all possible pairs of queries in G using the noise-tolerant similarity function given in Equation (2).
2: Merge the pair of queries $(q_i, q_j)$ that has the highest similarity score.
3: Obtain the similarity scores for all possible pairs of concepts in G using the noise-tolerant similarity function given in Equation (2).
4: Merge the pair of concepts $(c_i, c_j)$ that has the highest similarity score.
5: Unless termination is reached, repeat Steps 1-4.

The terminating condition for BB's algorithm is reached when all connected components in $G_c$ satisfy the following conditions:

$$
\max_{q_i, q_j \in Q} sim(q_i, q_j) = 0 \quad \text{and} \quad \max_{c_i, c_j \in C} sim(c_i, c_j) = 0.
$$

However, this terminating condition possibly generates a single big cluster of queries and a single big cluster of concepts, because setting the similarity threshold to zero means that two queries (concepts) would be assigned to the same cluster even if they have only a tiny fraction of overlapping concepts (queries). To resolve this problem, we apply higher similarity thresholds, which have been observed in our experiments to yield high precision and recall: $\max_{q_i, q_j \in Q} sim(q_i, q_j) = 0.18$ for queries, together with a corresponding nonzero threshold on $\max_{c_i, c_j \in C} sim(c_i, c_j)$ for concepts.

5.2 Personalized Concept-Based Clustering

We now explain the essential idea of our personalized concept-based clustering algorithm, with which ambiguous queries can be clustered into different query clusters. The personalization effect is achieved by manipulating the user concept preference profiles in the clustering process.
In contrast to BB's agglomerative clustering algorithm, which represents the same query submitted by different users with one query node, we need to consider the same query submitted by different users separately to achieve the personalization effect. In other words, if two given queries, whether they are identical or not, mean different things to two different users, they should not be merged together because they refer to two different sets of concepts for the two users. Therefore, we treat each individual query submitted by each user as an individual vertex in the bipartite graph by labeling each query with a user identifier. Moreover, concepts appearing in the web-snippet of the search result with interestingness weights greater than zero in the concept preference profile are linked to the corresponding query on the bipartite graph.

An example is shown in Fig. 6(a). We can see that the query "apple" submitted by users User1 and User3 becomes two vertices, "apple (User1)" and "apple (User3)". If User1 is interested in the concept "apple store", as recorded in the concept preference profile, a link between the concept "apple store" and the query "apple (User1)" is created. On the other hand, if User3 is interested in the concept "fruit", a link between the concept "fruit" and "apple (User3)" is created.

After the personalized bipartite graph is created, our initial experiments revealed that if we apply BB's algorithm directly on the bipartite graph, the query clusters generated will quickly merge queries from different users together and thus lose the personalization effect. We

found that identical queries, though issued by different users and having different meanings, tend to have some generic concept nodes such as "information" in common, e.g., "apple (User1)" and "apple (User3)" both connect to the "information" concept node in Fig. 6(a). Thus, these query nodes are likely to be merged in the first few iterations, causing more queries from different users to be merged together in subsequent iterations. Considering Fig. 6(a) again, if "apple (User1)" and "apple (User3)" are merged, the next iteration will merge the concept nodes "apple store", "fruit" and "information". As the clustering algorithm proceeds, queries across users are clustered together further. In the end, the resulting query clusters have no personalization effect at all.

Fig. 6. Performing the personalized concept-based clustering algorithm on a small set of clickthrough data. Starting from top left: (a) The original bipartite graph. (b), (c) Initial Clustering. (d), (e) Community Merging.

To resolve the problem, we divide clustering into two steps. In the initial clustering step, an algorithm similar to BB's algorithm is employed to cluster all the queries, but it does not merge identical queries from different users. After obtaining all the clusters from the initial clustering step, the community merging step is employed to merge query clusters containing identical queries from different users. We can see from Fig. 6(d) that "apple (User1)" and "apple (User3)" belong, correctly, to different clusters. We will see further in Section 6.3 that the initial clustering step is able to generate a high precision rate because it preserves the preference of each user, while the community merging step is able to improve the recall rate because of the collaborative filtering effect.

Algorithm 3 shows the details of the personalized clustering algorithm. Similar to BB's algorithm, a query-concept bipartite graph is created as input for the clustering algorithm.
The bipartite graph construction algorithm is similar to Algorithm 1, except that each individual query submitted by each user is treated as an individual vertex in the bipartite graph. Initial clustering (i.e., Steps 1-5 of Algorithm 3) is similar to BB's agglomerative algorithm as already discussed in Section 5.1. However, queries from different users are not allowed to be merged in initial clustering. Fig. 6(b) and 6(c) show examples of query and concept merging, respectively. Fig. 6(d) illustrates the result of initial clustering.

In community merging (i.e., Steps 6-8 of Algorithm 3), query clusters containing identical queries from different users are compared for merging. Fig. 6(d) and 6(e) show an example of query cluster merging. The query clusters {apple computer (User2), apple (User1)} and {apple (User2), apple mac (User1)} both contain the query "apple", and both lead to the same concept "apple store". Therefore, they are merged in community merging into one big cluster.

Good timing for starting community merging is important for the success of the algorithm. If we stop initial clustering too early (i.e., not all clusters are well formed), community merging merges all the identical queries from different users first, and thus generates a single big cluster without much personalization effect. However, if we stop initial clustering too late (i.e., clusters are overly merged), the low precision rate generated by initial clustering would not be improved by community merging. To obtain the optimal results in our experiments, we use the following terminating conditions for initial clustering (i-clustering) and community merging (c-merging) in Algorithm 3. These parameters were investigated empirically in our experiments. We will further justify our choice using Table 10 in Section 6.3.

\[ \max^{\text{i-clustering}}_{q_i, q_j \in Q} \mathrm{sim}(q_i, q_j) = 0.29 \quad \text{and} \quad \max^{\text{i-clustering}}_{c_i, c_j \in C} \mathrm{sim}(c_i, c_j) = [\ldots] \]

\[ \max^{\text{c-merging}}_{q_i, q_j \in Q} \mathrm{sim}(q_i, q_j) = 0.39 \quad \text{and} \quad \max^{\text{c-merging}}_{c_i, c_j \in C} \mathrm{sim}(c_i, c_j) = 0.39. \]
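The two-phase procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: vertices are hypothetical `(query, user)` tuples, Jaccard overlap again stands in for Equation (2), the helper names are invented, and the concept threshold used during initial clustering (`c_init`) is a placeholder because its value is not given above.

```python
from itertools import combinations

def sim(a, b):
    """Jaccard overlap of neighbour sets (assumed stand-in for Equation (2))."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cross_user(a, b):
    """True if the two query clusters hold the same query from different users."""
    return any(q1 == q2 and u1 != u2 for q1, u1 in a for q2, u2 in b)

def merge_best(side, other, threshold, allowed=lambda a, b: True):
    """Merge the most similar allowed pair on `side` if it exceeds threshold."""
    pairs = [p for p in combinations(sorted(side, key=sorted), 2) if allowed(*p)]
    best = max(pairs, key=lambda p: sim(side[p[0]], side[p[1]]), default=None)
    if best is None or sim(side[best[0]], side[best[1]]) <= threshold:
        return False
    a, b = best
    nbrs = side.pop(a) | side.pop(b)
    side[a | b] = nbrs
    for n in nbrs:  # keep the opposite side consistent with the merge
        other[n] = (other[n] - {a, b}) | {a | b}
    return True

def personalized_cluster(links, q_init, c_init, q_merge):
    q_side = {frozenset([v]): {frozenset([c]) for c in cs} for v, cs in links.items()}
    c_side = {}
    for v, cs in q_side.items():
        for c in cs:
            c_side.setdefault(c, set()).add(v)
    # Initial clustering: never merge identical queries from different users.
    while True:
        merged_q = merge_best(q_side, c_side, q_init, lambda a, b: not cross_user(a, b))
        merged_c = merge_best(c_side, q_side, c_init)
        if not (merged_q or merged_c):
            break
    # Community merging: only merge clusters sharing a query across users.
    while merge_best(q_side, c_side, q_merge, cross_user):
        pass
    return sorted(sorted(c) for c in q_side)

# Hypothetical preference-profile links: (query, user) -> concepts of interest
links = {
    ("apple", "U1"): {"apple store", "mac"},
    ("apple", "U2"): {"apple store", "ipod"},
    ("apple computer", "U2"): {"apple store", "ipod"},
    ("apple", "U3"): {"fruit", "juice"},
    ("apple juice", "U3"): {"fruit", "juice"},
}
# c_init=0.29 is a placeholder value, not taken from the paper.
clusters = personalized_cluster(links, q_init=0.29, c_init=0.29, q_merge=0.39)
print(clusters)
```

On this toy input, initial clustering keeps the three "apple" vertices apart, and community merging then joins only the two users whose clusters lead to the same concepts, leaving the fruit-related cluster separate.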

TABLE 4
CATEGORIES OF THE TEST QUERIES
Category  Description            Category  Description
1         Cooking                6         Computer Programming
2         Dining                 7         Computer Gaming
3         Internet Shopping      8         Music
4         Traveling              9         Computer Science Research
5         Automobile Repairing   10        Computer Hardware

TABLE 5
STATISTICS OF THE CLICKTHROUGH DATA COLLECTED IN THE 1ST EXPERIMENT
Statistics                                           Value
Number of users                                      30
Number of queries assigned to each user              5
Number of test queries                               150
Number of unique queries                             150
Maximum number of retrieved URLs for a query         100
Maximum number of extracted concepts for a query     217
Maximum number of extracted words for a query        1,093
Number of URLs retrieved                             14,880
Number of unique URLs retrieved                      12,430
Number of concepts retrieved                         13,321
Number of unique concepts retrieved                  6,008
Number of words retrieved                            117,924
Number of unique words retrieved                     21,920

The query clusters output by the algorithm are shown in Fig. 6(e). We assume in this example that the links between the generic concept node "information" and the two query clusters are weak, and that the terminating similarity is able to prevent the merging of the query clusters about "apple computer" and "apple juice". We can see in the resulting clusters that User1 and User2 both submit the query "apple" in order to seek information about "apple computer", while User3 submits the query "apple" to look for information about "apple juice". In this example, even though the query "apple" submitted by User1, User2 and User3 appears to be the same, the algorithm can successfully differentiate the three submissions and achieve the personalization effect according to individual users' conceptual preferences. Finally, we can see that queries about "apple computer" (e.g., "apple mac", "apple computer") are suggested to User1 and User2, while queries about "apple juice" (e.g., "apple juice") are suggested to User3.
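One simple way to read suggestions off the resulting clusters is sketched below. The text does not spell out the exact lookup, so this is an assumption: the function `suggest` and the `(query, user)` cluster layout are hypothetical, and the clusters are written by hand to mirror the description of Fig. 6(e).

```python
def suggest(clusters, query, user):
    """Return the other query strings found in the cluster holding (query, user)."""
    for cluster in clusters:
        if (query, user) in cluster:
            return sorted({q for q, _ in cluster if q != query})
    return []

# Clusters mirroring the description of Fig. 6(e) (hand-written, illustrative)
clusters = [
    {("apple computer", "User2"), ("apple", "User1"),
     ("apple", "User2"), ("apple mac", "User1")},
    {("apple", "User3"), ("apple juice", "User3")},
]

for_user1 = suggest(clusters, "apple", "User1")
for_user3 = suggest(clusters, "apple", "User3")
print(for_user1, for_user3)
```

The same query string yields different suggestions depending on which user's vertex is looked up, which is the personalization effect described above.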
Algorithm 3 Personalized Agglomerative Clustering
Input: A Query-Concept Bipartite Graph G
Output: A Personalized Clustered Query-Concept Bipartite Graph G_p
// Initial Clustering
1: Obtain the similarity scores in G for all possible pairs of queries using the noise-tolerant similarity function given in Equation (2).
2: Merge the pair of most similar queries (q_i, q_j) that does not contain the same queries from different users.
3: Obtain the similarity scores in G for all possible pairs of concepts using the noise-tolerant similarity function given in Equation (2).
4: Merge the pair of concepts (c_i, c_j) having the highest similarity score.
5: Unless termination is reached, repeat Steps 1-4.
// Community Merging
6: Obtain the similarity scores in G for all possible pairs of queries using the noise-tolerant similarity function given in Equation (2).
7: Merge the pair of most similar queries (q_i, q_j) that contains the same queries from different users.
8: Unless termination is reached, repeat Steps 6-7.

TABLE 6
USER'S INFORMATION NEEDS FOR THE 2ND EXPERIMENT
User Group  Information Needs
1           Purchase of digital cameras
2           Purchase of printers
3           Information on camera films
4           Information on dessert cooking recipes
5           Purchase of clothes
6           Download of Mac software
7           Purchase of Macintosh
8           Purchase of iPod

TABLE 7
STATISTICS OF THE CLICKTHROUGH DATA COLLECTED FOR 2ND PART OF THE EXPERIMENTATION
Statistics                                           Value
Number of users                                      10
Number of queries assigned to each user              5
Number of test queries                               50
Number of unique queries                             38
Maximum number of retrieved URLs for a query         100
Maximum number of extracted concepts for a query     168
Maximum number of extracted words for a query        938
Number of URLs retrieved                             4,962
Number of unique URLs retrieved                      3,239
Number of concepts retrieved                         4,130
Number of unique concepts retrieved                  1,971
Number of words retrieved                            38,831
Number of unique words retrieved                     8,891

6 EXPERIMENTAL RESULTS
In this section, we evaluate the performance of the proposed clustering methods for obtaining related queries using user clickthroughs.
In Section 6.1, we first describe the experimental setup for collecting the required clickthrough data. In Section 6.2, we compare the performance of BB's algorithm using query-URL, query-word, and query-concept bipartite graphs (simply called the QU, QW and QC methods). In Section 6.3, we evaluate the effectiveness of our proposed personalized concept-based clustering (simply called the P-QC method). In Section 6.4, we discuss the algorithmic complexities based on the related parameters.

6.1 Experimental Setup
To collect the clickthrough data to evaluate our proposed

methods, we implemented a Google middleware to track user clicks. Google³ was chosen as a common basis for comparing the performance of the methods under evaluation. We invited 40 students from our department to use the middleware to search 200 given test queries, which are accessible at [1]. To avoid any bias, the test queries are randomly selected from ten different categories and submitted to Google without any modification by the middleware. Table 4 shows the topical categories from which the test queries were chosen.

³ Google is one of the most popular commercial search engines. If a different search engine is used, we expect the absolute performances of the methods under evaluation to be different but their relative performances to remain the same.

When a query is submitted to the middleware, the top 100 search results from Google are retrieved, and the web-snippets of the search results are displayed to the users. Since most users would examine only the top 10 results, our concept extraction method, digging deep into the first 100 results, will discover concepts related to the query that would otherwise be missed by the users. The extracted concept relationship graph is then stored in our database. If a user clicks on one of the web-snippets of the returned results, the user's clickthrough together with his/her concept preference profile are updated as discussed in Section 4.3. The threshold s for concept mining was set to 0.03, and the threshold for establishing concept relations (as specified in Eqn 11) was set to zero. We chose these small thresholds so that as many concepts as possible are mined. The quality of the query suggestions then relies more on the clustering algorithms, which are the main focus of this paper.

In the first experiment (described in Section 6.2), 30 students were asked to search the 150 test queries, all of which have unambiguous meanings (e.g., "apple pie" and "cheese cake"). The 150 test queries are separated into 10 predefined clusters (e.g.,
the queries "apple pie", "cheese cake" and "brownies" belong to the cluster about dessert recipes). The users were asked to click on the web-snippets of the returned results that are relevant to the queries. The clickthrough data collected are used to measure the performance of the concept-based clustering method as discussed in Section 5.1. Table 5 shows the statistics of our collected clickthrough data for this experiment.

In the second experiment (described in Section 6.3), 10 students were asked to search using another 50 test queries. Some of the test queries are intentionally designed to have ambiguous meanings (e.g., the query "Canon" could mean a digital camera or a printer). The 50 test queries are separated into 8 predefined clusters. Some of the queries could exist in more than one cluster (e.g., the query "Canon" could belong to the cluster about digital cameras or the cluster about printers). Each user is assigned one of the information seeking tasks shown in Table 6. The users are then asked to click on the web-snippets of the returned results that are relevant to both the queries and their information needs. The clickthrough data collected are used to measure the performance of the personalized concept-based clustering method as discussed in Section 5.2. Table 7 shows the statistics of our collected clickthrough data for this experiment.

6.2 Comparing QU, QW and QC methods
We now discuss the results of the first experiment, which compares the performance of the QU, QW and QC methods. The QU method is the original input of BB's algorithm and serves as a baseline for comparison. The QW method uses a query-word bipartite graph, which is similar to the query-concept bipartite graph in that both are constructed using Algorithm 1. The difference is that the former contains all words (excluding stopwords) from the web-snippets, whereas the latter contains the extracted concepts. Comparing the QW and QC methods is necessary, since it allows us to study the benefits of concept extraction. The three methods are also employed to cluster the collected data.
The results are compared to our predefined clusters for precision and recall. Given a query q and its corresponding query cluster {q_1, q_2, q_3, ...} generated by a clustering algorithm, the precision and recall are computed using the following formulas:

\[ \mathrm{precision}(q) = \frac{|Q\_relevant \cap Q\_retrieved|}{|Q\_retrieved|} \qquad (12) \]

\[ \mathrm{recall}(q) = \frac{|Q\_relevant \cap Q\_retrieved|}{|Q\_relevant|} \qquad (13) \]

where Q_relevant is the set of queries that exist in the predefined cluster for q, and Q_retrieved is the set of related queries {q_1, q_2, q_3, ...} generated by the algorithm. The precision and recall values from all queries are averaged for plotting the precision-recall figures.

The performance of the three methods is compared using precision-recall figures and best F-measure values. Fig. 7 shows the precision-recall figures for the QU, QW and QC methods. We observe that the QC method yields a better recall rate than the QU method (i.e., the original BB's algorithm), while preserving high precision rates. This can be attributed to the fact that the average number of overlapping URLs between queries is only 16.3 according to the statistics in Table 5, whereas the average number of overlapping concepts between the queries is 48.8, which is much higher than the URL overlap rate. As a result, related queries that cannot be discovered by URL overlap can be brought together by our QC method, which improves the recall rate. The effect of the high concept overlap rate is also apparent in Fig. 7, which shows that the recall of the QU method can only go up to around 0.8, while the QW and QC methods can go beyond 0.9. Note that the QU method can yield a high precision rate because of the valuable URL overlaps between queries. However, the QC method improves both precision and recall compared to the QU method, showing that the use of extracted concepts is much better for finding similar queries. We also observe that the QW method performs the worst among the three methods because common non-stop words such as "discussion", "information" and "news" bring unrelated queries together, and thus lower both

the precision and recall rates. The main difference between the QW and QC methods is the availability of concept extraction. Intuitively, the QC method outperforms the QW method because the concept extraction process can successfully eliminate unrelated common words within web-snippets.

Fig. 7. Precision vs. recall when performing QU, QW and QC methods.

Fig. 8. Change of precision when performing QU, QW and QC methods.

Fig. 9. Change of recall when performing QU, QW and QC methods.

Fig. 8 and 9 show the change of precision and recall, respectively, for the three clustering methods. In Fig. 8, when the cutoff similarity score is around 0.3, the precision obtained using the QU method is very close to that of the QC method, which is much better than the precision obtained using the QW method. In Fig. 9, at the same cutoff similarity score, the recall obtained using the QU method is close to zero, which is much lower than the recalls obtained using the QW and QC methods. We can easily see from Fig. 8 and 9 that the QC method is able to generate good recall while achieving a precision comparable to that of the QU method.

We observe that the three methods achieve their optimal precision/recall at different cutoff similarity scores. To obtain and compare the best F-measures [30] (i.e., evenly weighted harmonic means of precisions and recalls) for the three different methods, the following three terminating strategies are used:

\[ \max^{\text{URL}}_{q_i, q_j \in Q} \mathrm{sim}(q_i, q_j) = [\ldots] \quad \text{and} \quad \max^{\text{URL}}_{c_i, c_j \in C} \mathrm{sim}(c_i, c_j) = [\ldots] \]

\[ \max^{\text{word}}_{q_i, q_j \in Q} \mathrm{sim}(q_i, q_j) = 0.39 \quad \text{and} \quad \max^{\text{word}}_{c_i, c_j \in C} \mathrm{sim}(c_i, c_j) = [\ldots] \]

\[ \max^{\text{concept}}_{q_i, q_j \in Q} \mathrm{sim}(q_i, q_j) = 0.18 \quad \text{and} \quad \max^{\text{concept}}_{c_i, c_j \in C} \mathrm{sim}(c_i, c_j) = [\ldots] \]

The F-measure, F, is defined by the following formula:

\[ F = \frac{2\,(\mathrm{precision} \times \mathrm{recall})}{\mathrm{precision} + \mathrm{recall}} \qquad (14) \]

TABLE 8
BEST F-MEASURE VALUES OF QU, QW AND QC METHODS FOR THE 1ST EXPERIMENT
Method      Precision  Recall  F-measure
QU method   […]        […]     […]
QW method   […]        […]     […]
QC method   […]        […]     […]

Table 8 shows the best F-measure values for the QU, QW, and QC methods.
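As a small worked illustration of Equations (12) to (14), the sketch below computes precision, recall and F-measure for one query. The function name and the sample query sets are hypothetical and are not taken from the experiments.

```python
def precision_recall_f(relevant, retrieved):
    """Precision (12), recall (13) and F-measure (14) for one query's cluster."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Hypothetical predefined cluster and algorithm output for one query
relevant = {"cheese cake", "brownies", "apple tart", "tiramisu"}
retrieved = {"cheese cake", "brownies"}
p, r, f = precision_recall_f(relevant, retrieved)
print(p, r, f)
```

Here both retrieved queries are relevant, so precision is 1.0, while only two of the four relevant queries are found, so recall is 0.5. The F-measure then works out to 2(1.0 × 0.5)/(1.0 + 0.5) = 2/3.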
From the results, we can conclude that the query clusters obtained using the QC method are much more accurate than those obtained from the QU and QW methods.

6.3 Personalized Concept-Based Clustering
In the second experiment, the QU, QW, QC and P-QC methods are employed to cluster queries which are intentionally designed to have ambiguous meanings. Again, the results are compared to our predefined clusters in terms of precision and recall. We analyze the performance of the P-QC method using precision-recall figures and best F-measure values. Fig. 10 shows the precision-recall figures of the P-QC method. The solid line is the precision-recall graph if only initial clustering is performed. We can observe that recall maxes out at […]. The other three lines illustrate how community merging can further improve recall be-
