IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 15, NO. 4, JULY/AUGUST 2003

Query Expansion by Mining User Logs

Hang Cui, Ji-Rong Wen, Jian-Yun Nie, and Wei-Ying Ma, Member, IEEE

Abstract: Queries to search engines on the Web are usually short. They do not provide sufficient information for an effective selection of relevant documents. Previous research has proposed the utilization of query expansion to deal with this problem. However, expansion terms are usually determined from term co-occurrences within documents. In this study, we propose a new method for query expansion based on user interactions recorded in user logs. The central idea is to extract correlations between query terms and document terms by analyzing user logs. These correlations are then used to select high-quality expansion terms for new queries. Compared to previous query expansion methods, ours takes advantage of the user judgments implied in user logs. The experimental results show that the log-based query expansion method can produce much better results than both the classical search method and the other query expansion methods.

Index Terms: Query expansion, user log, probabilistic model, information retrieval, search engine.

1 INTRODUCTION

In recent years, we have been witnessing the explosive growth of information on the World Wide Web. People are relying more and more on the Web for their diverse information needs. However, the Web is an information hotpot where innumerable authors have created and are creating their Web sites independently. The vocabularies of the authors vary greatly. There is an acute need for search engine technology to help users exploit such an extremely valuable resource. Despite the fact that keywords are not always good descriptors of contents, most existing search engines still rely solely on keyword matching to determine the answers. Users usually describe their information needs by a few keywords in their queries, which are likely to be different from the index terms of the documents on the Web.
This problem is general in Information Retrieval (IR) systems and was documented before the popularization of the Web: New or intermittent users often use the wrong words and fail to get the actions or information they want [15]. As a consequence, in many cases, the documents returned by search engines are not relevant to the user's information need. This raises a fundamental problem of term mismatch in information retrieval, which is also one of the key factors that affect the precision of search engines. Very short queries submitted to search engines on the Web amplify this problem: Many important words or terms may be missing from the queries. To solve this problem, researchers have investigated the utilization of query expansion techniques to help users formulate better queries.

• H. Cui is with the Department of Computer Science, School of Computing, National University of Singapore, Singapore. E-mail: cuihang@comp.nus.edu.sg.
• J.-R. Wen and W.-Y. Ma are with Microsoft Research Asia, 3F Sigma Building, No. 49, Zhichun Rd., Haidian District, Beijing, China. E-mail: {jrwen, wyma}@microsoft.com.
• J.-Y. Nie is with the Département d'informatique et Recherche Opérationnelle, Université de Montréal, C.P. 6128, succursale Centre-ville, Montreal, Quebec H3C 3J7, Canada. E-mail: nie@iro.umontreal.ca.

Manuscript received 15 July 2002; revised 15 Dec. 2002; accepted 6 Jan. For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number.

Query expansion involves supplementing the original query with additional words and phrases. There are two key aspects in any query expansion technique: the source from which expansion terms are selected and the method used to weight and integrate expansion terms. Manual query expansion has been studied by many researchers, for example, [1] and [17]. Manual query expansion demands user intervention. It also requires that the user be familiar with the online search system, the indexing mechanism, and the domain knowledge, which is generally not true for users on the Web.
In this paper, we will focus on automatic query expansion. Current automatic query expansion techniques can be generally categorized into global analysis and local analysis. A query expansion method based on global analysis usually builds a thesaurus to assist users in reformulating their queries. A thesaurus can be automatically established by analyzing relationships among documents and statistics of term co-occurrences in the documents. From the thesaurus constructed in this way, one can obtain synonyms or related terms given a user query. These related terms can then be used to supplement users' original queries.

Another group of techniques for query expansion is local analysis, which extracts expansion terms from a subset of the initial retrieval results. This subset may be determined directly by the user according to relevance judgments, or by the system (i.e., the top-ranked documents). Terms selected from them are added to a new query, or their weights in the latter are increased [31]. Compared to the thesaurus-based expansion technique, local analysis is more query-oriented. Previous experiments have shown a significant impact of local analysis on retrieval effectiveness. However, if the subset of documents is selected by the user, then we put a heavy burden on the user. If it is selected by the system, then it is questionable whether the documents are indeed relevant to the query; thus, the improvement in retrieval effectiveness is uncertain.

In this paper, we propose a new query expansion method based on user logs which record user interactions with the search systems. User logs are exploited so as to extract the implicit relevance judgments they encode. In this approach, we assume that the documents that the user chose to read are relevant documents. The log-based query expansion overcomes several difficulties of local analysis because we can extract a large number of user judgments from user logs, while eliminating the step of collecting feedback from users for ad hoc queries. Probabilistic correlations between terms in the user queries and the documents can then be established through user logs. With these term-term correlations, relevant expansion terms can be selected from the documents for a query. Our experiments show that mining user logs is extremely useful for improving retrieval effectiveness, especially for very short queries on the Web.

© 2003 IEEE. Published by the IEEE Computer Society.

In this paper, we carry out a series of experiments to investigate the effects of our query expansion method on queries of different lengths. The experimental results on both long and short queries are presented in this article. As we will see, query expansion produces more significant improvements on short queries than on long queries.

The remainder of this paper is organized as follows: Section 2 describes the problem of inconsistency between query terms and document terms, which motivates our approach. Our experimental result suggests a large difference between the terms used in queries and those in documents and, therefore, the need to develop appropriate query expansion techniques for Web search. Section 3 reviews previous work on query expansion. Our log-based query expansion technique is described in detail in Section 4. Sections 5 and 6 describe the experiments comparing our method with local context analysis. Section 7 draws some conclusions.

2 MOTIVATION

The problems of under-specification and inappropriate term usage in user queries are two motivations for studying query expansion.
They are due to two facts: queries are often short and thus contain an insufficient number of terms; and query terms are often inconsistent with (different from) those in the documents. In this section, we will examine these two facts with respect to a search engine on the Web.

It is generally observed that users on the Web typically submit very short queries to search engines and that the average length of Web queries is less than two words [34]. A similar conclusion was drawn in [9]. We deduce that the very small overlap between the query terms and the document terms in the desired documents negatively affects the performance of Web searching. In [15], it was observed that people use a surprisingly great variety of words when referring to the same thing and, thus, terms in user queries often fail to match the index terms contained in the relevant documents. The situation is even worse when the query is very short, as on the Web. In this case, the chance of mismatching is much larger than for a long query.

In fact, we can view the term usages in the documents as forming a term space, which we call the document space. The term usages in the queries form another term space, the query space. The mismatching problem we just described comes from the inconsistency between the two spaces. This fact has often been hypothesized. However, no previous study has tried to measure the difference between the two spaces quantitatively. This measurement is difficult because the number of relevance judgments is always limited. With a large amount of user logs, which we consider to encode relevance judgments, this becomes possible.

In order to confirm the large difference between the two term spaces, we will measure the similarity between them. It is to be noted, however, that the resulting measure of similarity is an approximation. A true measure of similarity is only possible with real relevance judgments. Our measurement is conducted with two-month user logs (about 22 GB) from the Encarta search engine (encarta.msn.com), as well as the 41,942 documents in the Encarta Web site.
The user logs contain 4,839,704 user query sessions. Each query session consists of the query itself and its corresponding document clickthroughs (the documents on which the user clicked, see Section 4). Below is an excerpt of query sessions.

We represent each document as a document vector $\{W_1^{(d)}, W_2^{(d)}, \ldots, W_N^{(d)}\}$ in the document space, where $W_i^{(d)}$ is the weight of the $i$th term in a document and is defined by the traditional TF*IDF measure:

$$W_i^{(d)} = \frac{\ln\left(1 + tf_i^{(d)}\right) \times idf_i^{(d)}}{\sqrt{\sum_i \ln^2\left(1 + tf_i^{(d)}\right) \times \sum_i \left(idf_i^{(d)}\right)^2}}, \tag{1}$$

$$idf_i^{(d)} = \ln \frac{N}{n_i}, \tag{2}$$

where $tf_i^{(d)}$ is the frequency of the $i$th term in the document $D$, $N$ the total number of documents in the collection, and $n_i$ the number of documents containing the $i$th term.

For each document, we can construct a corresponding virtual document in the query space by collecting and counting all the terms, excluding stopwords, in the queries for which the document has been selected and clicked on by the user. A virtual document is represented as a query vector $\{W_1^{(q)}, W_2^{(q)}, \ldots, W_N^{(q)}\}$, where $W_i^{(q)}$ is the weight of the $i$th term in the virtual document and is also defined by the TF*IDF measure. The similarity between the two vectors is calculated and is assumed to reflect the similarity between the query space and the document space. Specifically, the similarity of each pair of vectors is calculated using the following cosine formula:

$$\mathit{Similarity} = \frac{\sum_{i=1}^{N} W_i^{(q)} W_i^{(d)}}{\sqrt{\sum_{i=1}^{N} \left(W_i^{(q)}\right)^2}\,\sqrt{\sum_{i=1}^{N} \left(W_i^{(d)}\right)^2}}. \tag{3}$$

Fig. 1. Similarity between the query terms and the document terms.

We notice that many terms in the document space never or seldom appear in the users' queries. Thus, the query vector created is much shorter (with fewer nonzero terms) than a document vector. This artifact would dramatically decrease the similarity between the two vectors if all the terms were used in the measurement. To obtain a fairer measure, we only use the n highest ranking words in each document vector for the similarity calculation, where n is the number of terms in the corresponding query vector.

Fig. 1 illustrates the final results of similarity values on the whole document collection. This figure shows that, in most cases, the similarity values of term usages between user queries and documents are between 0.1 and 0.4. Only very few documents have similarity values above 0.8. The average similarity value across the whole document collection is 0.28, which corresponds to an average angle of arccos(0.28), about 74 degrees, between the query vector and the document vector.

This result suggests that there is indeed a large gap between the query space and the document space. It is thus very difficult to retrieve the desired documents with a direct keyword matching approach. It is important to find ways to narrow the gap, or to bridge the two spaces, in order to improve retrieval effectiveness.

3 REVIEW OF PREVIOUS WORK ON AUTOMATIC QUERY EXPANSION

In this section, we review some previous approaches to query expansion. The existing state-of-the-art query expansion approaches can be classified mainly into two categories: techniques based on global analysis, which obtain expansion terms from the statistics of terms in the whole corpus, and techniques based on local analysis, which extract expansion terms from a subset of the search results.
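The similarity measurement of Section 2 can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: vectors are plain dictionaries from term to weight, the function names are hypothetical, and the normalizing denominator of (1) is omitted because it is a per-document constant that cancels in the cosine of (3).

```python
import math


def tfidf_weights(term_counts, doc_freq, num_docs):
    """Unnormalized ln(1 + tf) * ln(N / n) weights for one document.

    The denominator of (1) is left out on purpose: it is constant within a
    document, so it does not change the cosine similarity computed below.
    """
    return {
        term: math.log(1 + tf) * math.log(num_docs / doc_freq[term])
        for term, tf in term_counts.items()
        if doc_freq.get(term, 0) > 0
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors, as in (3)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def truncated_similarity(query_vec, doc_vec):
    """Compare the query vector with the n heaviest document terms only,
    where n is the number of nonzero terms in the query vector."""
    n = sum(1 for w in query_vec.values() if w != 0)
    top_terms = sorted(doc_vec, key=doc_vec.get, reverse=True)[:n]
    return cosine(query_vec, {t: doc_vec[t] for t in top_terms})
```

Truncating the document vector before taking the cosine is what keeps long documents from being penalized merely for containing many terms that never occur in queries.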
3.1 Global Analysis

In this section, we only review the approaches that exploit term co-occurrences in documents. We do not analyze the approaches that use a manual thesaurus (e.g., WordNet [22]). One can refer to [33] for some examples of the utilization of such a resource for query expansion.

Global analysis is one of the first techniques to produce consistent and effective improvements through query expansion. The basic idea of global analysis is to use the context of a term to determine its similarity with other terms. Global analysis selects expansion terms on the basis of information from the whole document set. It builds a set of statistical term relationships which are then used to expand queries.

One of the earliest global analysis techniques is term clustering [20], [32]. Queries are simply expanded by adding similar terms that are grouped into the same cluster according to term co-occurrences in documents. Qiu and Frei [24] presented a query expansion model using a global similarity thesaurus. Another work based on a global statistical thesaurus is [10], which first clusters documents and then selects low-frequency terms to represent each cluster. PhraseFinder [19] is a component of the INQUERY system that creates an association thesaurus. The phrases selected by PhraseFinder are used in query expansion. Latent Semantic Indexing [12] can also be viewed as a kind of query expansion. In its reduced dimensional space, implicit correlations among terms can be discovered and employed in expanding original queries.

Generally, global analysis requires corpus-wide statistics, such as statistics of co-occurrences of pairs of terms, resulting in a matrix of similarities between terms or a global association thesaurus. Although the global analysis techniques are relatively robust, the corpus-wide statistical analysis consumes a considerable amount of computing resources. Moreover, since it focuses only on the document side and does not take into account the query side, global analysis only provides a partial solution to the term mismatching problem.
3.2 Local Analysis

Unlike global analysis, local analysis uses only a subset of the documents returned for the given query. The result is thus more focused on the given query than that of global analysis. Local analysis techniques are grouped into two categories: approaches based on user feedback information and approaches based on information derived from a subset of the returned documents.

3.2.1 Relevance Feedback

Relevance feedback is a straightforward strategy for reformulating queries. In a relevance feedback cycle, the user is presented with a list of initial results. After examining them, the user marks those documents he or she considers relevant. The original query is expanded according to these relevant documents. The expected result is that the next round of retrieval will move toward the relevant documents and away from nonrelevant documents.

Early experiments with the Smart system [30] and later experimental results using a probabilistic model [25] indicate improvements in effectiveness with relevance feedback for small collections. Rocchio performed query reformulation using the vector space model and obtained significantly positive results [27]. Salton and Buckley [31] did experiments on six test collections to compare various relevance feedback methods.

Their work mainly consisted of term reweighting and query expansion. Typically, expansion terms are extracted from the relevant documents judged by the user. Relevance feedback can achieve very good performance if the user provides sufficient and correct relevance judgments. Unfortunately, in a real search context, users are usually reluctant to provide such relevance feedback.

3.2.2 Local Feedback

To overcome the difficulty due to the lack of sufficient relevance judgments, local feedback, also known as blind feedback or pseudofeedback, is commonly used in IR. Local feedback mimics relevance feedback by assuming that the top-ranked documents are relevant [4]. Expansion terms are extracted from the top-ranked documents to formulate a new query for a second-cycle retrieval.

Local feedback has been proven effective in previous TREC experiments. In some cases, it outperforms global analysis [6], [13], [14], [26]. Nevertheless, this method can hardly overcome its inherent drawback: If a large fraction of the top-ranked documents are actually irrelevant, then the words added to the query (drawn from these documents) are likely to be unrelated to the topic and, as a result, the quality of the retrieval using the expanded query is degraded. Therefore, the effect of pseudofeedback strongly depends on the quality of the initial retrieval.

In recent years, many improvements for local feedback have been proposed. Mitra et al. [23] suggested improving query expansion by refining the set of documents used in feedback with Boolean filters and proximity constraints. Clustering the top-ranked documents and removing the singleton clusters are techniques used in [21] in order to concentrate on large groups of relevant documents for query expansion. Buckley et al. [5] employed clustering to identify concepts. More recently, Carpineto et al. [7] presented a method of weighting and selecting expansion terms using Information Theory.
To enhance the reliability of pseudorelevance feedback (PRF), Flexible PRF was proposed in [29], which varies the number of expansion terms according to the number of documents retrieved.

Recently, Xu and Croft [37], [38] proposed a local context analysis method, which applies the measure of global analysis to the selection of query terms in local feedback. From the top-ranked documents, noun groups are selected according to their co-occurrences with the query terms. In this way, the local context analysis method can solve the problem of insufficient statistical data of local analysis to some extent. However, local context analysis is based on the hypothesis that a frequent term from the top-ranked relevant documents will tend to co-occur with all query terms within the top-ranked documents. This is a reasonable hypothesis, but it is not always true, as shown by our examination of the gap between the document and query spaces. This is precisely the problem we will address by exploiting user logs for query expansion.

4 LOG-BASED QUERY EXPANSION

To deal with the mismatching problem at its source, i.e., the inconsistency between the terms used in the documents and those used in the queries, a possible way is to create relationships between the two sets of terms. User logs provide a resource exploitable to this end.

4.1 Principle of Using User Logs

We observe that many search engines have accumulated a large amount of user logs, from which we can know what the query is and which documents the users have selected to read. These user logs provide valuable indications for understanding the kinds of documents the users intend to retrieve by formulating a query with a set of particular terms.

There has been some work on mining user logs to enhance Web searching. Beeferman and Berger [2] exploited clickthrough data in clustering URLs and queries using a graph-based iterative clustering technique. Wen et al. [34] used a similar method to cluster queries according to user logs in order to find Frequently Asked Questions (FAQs).
These FAQs are then used to improve the effectiveness of question answering.

In this study, we further extend the previous utilizations of user logs by trying to extract relationships between query terms and document terms. These relationships are then used for query expansion. Thus, our work may be viewed as an attempt to construct a live thesaurus that bridges the document and the query spaces. The general principle is: If queries containing one term often lead to the selection of documents containing another term, then we consider that there is a strong relationship between the two terms.

This principle is an extension of that exploiting term co-occurrences. In previous approaches, term co-occurrences are observed within documents. The term relationships extracted from them are those between the terms used by the same authors. Therefore, we can see them as relationships within the document space. As we explained earlier, an important factor of the mismatching problem is the lack of relationships between the document space and the query space. There is an acute need to create a bridge between them. The idea of exploiting user logs aims precisely to create such a bridge between the two spaces.

To exploit this principle, our first task is to extract query sessions from a large set of noisy log data. The query sessions we extract are defined as follows:

session := <query text> [clicked document]*

Each session contains one query and a set of documents which the user clicked on (which we will call clicked documents). The central idea of our method is that, if a set of documents is often selected for the same queries, then the terms in these documents are strongly related to the terms of the queries. Thus, some probabilistic correlations between query terms and document terms can be established based on the user logs.

One important assumption behind this method is that the clicked documents are relevant to the query. At first glance, this assumption may appear too strong.
However, although the clicking information is not as accurate as explicit relevance judgment in traditional relevance feedback, the user's choice does suggest a certain degree of relevance. Typically, upon getting a list of documents, users do not select resulting documents randomly. They have a rough idea of what the documents are about from their titles and snippets. In most cases, they click and read those documents which are the most similar to what they have in mind. Therefore, these clicked documents do have some relationship with the queries they submit. Of course there are exceptions, such as an erroneous click or a sudden shift in the user's intention. But, in the long run, with a large amount of log data, the click-through records allow us to find strong correlations among terms from a statistical point of view. A similar observation has been made in [34]. On the whole, user logs can be viewed as a very reliable resource containing abundant implicit relevance judgments.

Fig. 2. Establishing correlations between query terms and document terms via query sessions.

4.2 Characteristics of Log-Based Query Expansion

In a more general sense, the log-based query expansion method may be viewed as a special case of local analysis because its expansion terms are derived from a subset of the documents. However, it is enhanced by human judgments: Not only are the clicked documents usually top-ranked documents, but they have also been selected by the users. This method thus has several advantages over relevance feedback and pseudorelevance feedback.

Recall that the factor which limits the application of relevance feedback is the unavailability of user relevance judgments in a multiple-query process. Users tend to mark only a few, if any, documents when presented with a list of resulting documents. In addition, this feedback information can be exploited only once. Once the query is changed, the same feedback process has to be started again. Log-based query expansion collects and analyzes all users' historical relevance judgments as a whole, without intervention of users. We benefit from abundant records of voted documents, while the biases or errors in a single round of feedback can be minimized.
Thus, we can overcome the problem of lacking sufficient relevance judgments in previous local feedback techniques.

On the other hand, compared to pseudorelevance feedback, our method has an obvious advantage: Not only are the clicked documents part of the top-ranked documents, but there is also a further selection by the user. Because document clicks are more reliable indications than the top-ranked documents used in pseudorelevance feedback, log-based query expansion is expected to be more robust and accurate than the latter.

The log-based query expansion method has three other important properties. First, since the term correlations can be precomputed offline, the initial retrieval phase of pseudorelevance feedback is not needed anymore. Second, since user logs contain query sessions from different users, the term correlations can reflect the preference of the majority of the users. For example, if the majority of the users use "windows" to search for information about the Microsoft Windows product, the term "windows" will have stronger correlations with terms such as "Microsoft," "OS," and "software" than with terms such as "decorate," "door," and "house." Thus, the expanded query will result in a higher ranking for the documents about Microsoft Windows, which corresponds to the intentions of most users. A similar idea has been used in several existing search engines, such as Direct Hit. Our query expansion approach can produce the same results. Third, the term correlations may evolve along with the accumulation of user logs. Hence, the query expansion process can reflect updated user interests at a specific time.

4.3 Correlations between Query Terms and Document Terms

Query sessions in the user logs provide a possible way to bridge the gap between the query space and the document space. As illustrated in Fig. 2, weighted links can be created between the query space (all the query terms) and the query sessions, as well as between the document space (all the document terms) and the sessions.
In general, we assume that the terms in a query are correlated to the terms in the documents that the user clicked on. If there is at least one path between one query term and one document term, a link is created between them. Thus, the correlations between the query terms and document terms can be measured by investigating the weights of the links constituting the path between them. By analyzing a large number of such links, we can obtain a new matrix storing probabilistic correlations between the terms in these two spaces (the right part of Fig. 2). This is similar, in principle, to building a term-term similarity thesaurus in global analysis as in [36]. However, it benefits from the additional user judgments.

Let us now discuss how to determine the degrees of correlation between terms. We define these degrees as the conditional probabilities between terms, i.e., $P(w_j^{(d)} \mid w_i^{(q)})$ for any document term $w_j^{(d)}$ and any query term $w_i^{(q)}$. The probability $P(w_j^{(d)} \mid w_i^{(q)})$ can be determined as follows (where $S$ is the set of clicked documents for queries containing the query term $w_i^{(q)}$):

$$P(w_j^{(d)} \mid w_i^{(q)}) = \frac{P(w_j^{(d)}, w_i^{(q)})}{P(w_i^{(q)})} = \frac{\sum_{\forall D_k \in S} P(w_j^{(d)}, w_i^{(q)}, D_k)}{P(w_i^{(q)})} = \frac{\sum_{\forall D_k \in S} P(w_j^{(d)} \mid w_i^{(q)}, D_k) \times P(w_i^{(q)}, D_k)}{P(w_i^{(q)})}.$$

We assume that $P(w_j^{(d)} \mid w_i^{(q)}, D_k) = P(w_j^{(d)} \mid D_k)$. This means that the document $D_k$ separates the query term $w_i^{(q)}$ from the document term $w_j^{(d)}$. Therefore,

$$P(w_j^{(d)} \mid w_i^{(q)}) = \frac{\sum_{\forall D_k \in S} P(w_j^{(d)} \mid D_k) \times P(D_k \mid w_i^{(q)}) \times P(w_i^{(q)})}{P(w_i^{(q)})} = \sum_{\forall D_k \in S} P(w_j^{(d)} \mid D_k) \times P(D_k \mid w_i^{(q)}). \tag{4}$$

$P(D_k \mid w_i^{(q)})$ is the conditional probability of the document $D_k$ being clicked when $w_i^{(q)}$ appears in the user query. $P(w_j^{(d)} \mid D_k)$ is the conditional probability of occurrence of $w_j^{(d)}$ if the document $D_k$ is selected. $P(D_k \mid w_i^{(q)})$ and $P(w_j^{(d)} \mid D_k)$ can be estimated, respectively, from the user logs and from the frequency of occurrences of terms in documents as follows:

$$P(D_k \mid w_i^{(q)}) = \frac{f_{ik}^{(q)}(w_i^{(q)}, D_k)}{f^{(q)}(w_i^{(q)})}, \tag{5}$$

$$P(w_j^{(d)} \mid D_k) = \frac{W_{jk}^{(d)}}{\sum_{\forall t \in D_k} W_{tk}^{(d)}}, \tag{6}$$

where

• $f_{ik}^{(q)}(w_i^{(q)}, D_k)$ is the number of query sessions in which the query term $w_i^{(q)}$ and the document $D_k$ appear together.
• $f^{(q)}(w_i^{(q)})$ is the number of query sessions that contain the term $w_i^{(q)}$.
• $P(w_j^{(d)} \mid D_k)$ is the normalized weight of the term $w_j^{(d)}$ in the document $D_k$, i.e., its weight divided by the sum of all term weights in the document $D_k$.

By combining (4), (5), and (6), we obtain the following formula for $P(w_j^{(d)} \mid w_i^{(q)})$:

$$P(w_j^{(d)} \mid w_i^{(q)}) = \sum_{\forall D_k \in S} \left( P(w_j^{(d)} \mid D_k) \times \frac{f_{ik}^{(q)}(w_i^{(q)}, D_k)}{f^{(q)}(w_i^{(q)})} \right). \tag{7}$$

4.4 Query Expansion Based on Term Correlations

Equation (7) describes how to calculate the probability of a document term being selected as an expansion term given a query term. We also need to determine the relationship of a document term to the whole query in order to rank it. For this, we use an idea similar to that of [24], i.e., we select expansion terms according to their relationship to the whole query. The relationship of a term to the whole query is measured by the following cohesion calculation:

$$\mathit{CoWeight}_Q(w_j^{(d)}) = \ln \left( \prod_{w_t^{(q)} \in Q} \left( P(w_j^{(d)} \mid w_t^{(q)}) + 1 \right) \right), \tag{8}$$

which combines the relationships of the term to all the query terms.

On the whole, log-based query expansion takes the following steps to expand a new query $Q$:

1. Extract all query terms (eliminating stopwords) from $Q$.
2. Find all documents related to any query term in query sessions.
3. For each document term in these documents, use (8) to calculate its evidence of being selected as an expansion term according to the whole query.
4. Select the n document terms with the highest cohesion weights and formulate the new query $Q'$ by adding these terms to $Q$.
5. Use $Q'$ to retrieve documents in a searching system.

5 EXPERIMENTAL DATA AND METHODOLOGY

Before presenting the experimental results, let us first describe the test data used.

5.1 Data

Due to the characteristics of our query expansion method, we cannot conduct experiments on standard test collections such as the TREC data since they do not contain the user logs that we need. To derive term-term correlations, we use the same two-month user logs from the Encarta Web site as described in Section 2, which contain 4,839,704 user query sessions. With respect to the document set, we collected 41,942 documents from the Encarta Web site to form the test corpus. Diverse topics are covered by these articles, with greatly varying lengths, from dozens of words to several thousand words. In the user logs, each document is associated with a large number of queries for which users have clicked on that document.
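The correlation estimate of (7) and the cohesion ranking of (8) in Section 4 can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation. It assumes that sessions are given as (query terms, clicked document ids) pairs with stopwords already removed, that document term weights are supplied by the caller, and that terms already present in the query are not proposed again; the function names are hypothetical.

```python
import math
from collections import defaultdict


def term_correlations(sessions, doc_term_weights):
    """Estimate P(w_d | w_q) as in (7).

    sessions: iterable of (query_terms, clicked_doc_ids) pairs.
    doc_term_weights: doc_id -> {term: weight}, e.g. TF*IDF weights.
    """
    sessions_with_term = defaultdict(int)                        # f(w_q)
    sessions_with_term_and_doc = defaultdict(lambda: defaultdict(int))  # f_ik(w_q, D_k)
    for query_terms, clicked in sessions:
        for q in set(query_terms):
            sessions_with_term[q] += 1
            for doc_id in set(clicked):
                sessions_with_term_and_doc[q][doc_id] += 1

    correlations = defaultdict(dict)
    for q, docs in sessions_with_term_and_doc.items():
        for doc_id, count in docs.items():
            weights = doc_term_weights[doc_id]
            total = sum(weights.values())
            if total == 0:
                continue
            p_doc_given_q = count / sessions_with_term[q]        # (5)
            for term, weight in weights.items():
                p_term_given_doc = weight / total                # (6)
                correlations[q][term] = (
                    correlations[q].get(term, 0.0)
                    + p_term_given_doc * p_doc_given_q
                )
    return correlations


def expand_query(query_terms, correlations, n):
    """Append the n document terms with the highest cohesion weight (8)."""
    cohesion = defaultdict(float)
    for q in query_terms:
        for term, p in correlations.get(q, {}).items():
            # ln of a product equals the sum of logs; a missing pair
            # contributes ln(0 + 1) = 0.
            cohesion[term] += math.log(p + 1.0)
    # Assumption of this sketch: do not re-add terms already in the query.
    candidates = [t for t in cohesion if t not in query_terms]
    candidates.sort(key=lambda t: (-cohesion[t], t))
    return list(query_terms) + candidates[:n]
```

Because the correlation table depends only on the logs and the documents, it can be computed once offline and reused for every incoming query, which matches the first property listed in Section 4.2.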
This ensures that we have sufficient click-through information to establish meaningful probabilistic correlations among terms in the two spaces. In addition, this data set can reflect the impact of our query expansion technique for searches on the Web since it is obtained from a real search engine.

We focus on using query expansion to counter the effect of short queries on the Web. Xu and Croft [38] conducted experiments on very short queries, in which the results showed that query expansion can produce even larger improvements on short queries than on long queries. We compiled two sets of queries in order to see how query expansion affects retrieval results on short queries and long queries. In order to test our method on a more general basis, some queries are extracted randomly from the user logs. Some others come from the TREC query set. Yet another subset of queries was added manually by us. Table 1 shows all the 30 queries in both short and long versions used in our experiments.

TABLE 1. List of Queries in Both the Long Query Set and the Short Query Set

The short queries in our experiments are very close to those employed by real Web users, and the average length of these queries is 2.0 words. The average length of the long queries is 4.8 keywords (excluding stopwords). Though this is still shorter than the average length of most TREC queries, we consider that it reflects the real situation on the Web since few users use more than five keywords to express their information needs.

We used three human assessors to build the relevance judgments. Relevant documents for each query were judged according to the human assessors' manual selections, and standard relevant document sets were prepared for all of the 30 queries. Assessors had no knowledge of the testing methods, but made decisions with the assistance of a basic searching system. When disagreements occurred, the assessors resolved them by discussion. All judgments from the assessors constituted a reference set. We run all experiments in batch mode against this relevance judgment set.

5.2 Word and Phrase Thesaurus

Encarta has well-organized manual indexes in addition to automatically extracted index terms. In order to test our technique in a general context, we use neither the manual indexes nor the existing Encarta search engine, which exploits them, for our evaluation. Instead, we implement a vector space model as the baseline method in our experiments.
We do not use traditional methods to extract phrases from documents because we are more interested in the phrases in the query space. Therefore, we extract all sequences of N-grams, where N is the number of nontrivial terms in a query, from the user logs with occurrences higher than 5. These N-grams are treated as candidate phrases. Then, we locate the candidate phrases in the document corpus and filter out those not appearing in the documents. In the end, we get a thesaurus containing over 13,000 phrases, which are used as additional indexes. When using phrases and single words together, our system always gives priority to phrases.

5.3 Evaluation Methodology

In order to evaluate our log-based query expansion method, we will compare its performance not only with that of the original queries, but also with that of local context analysis. We employ interpolated 11-point average precision as the main metric of performance. A statistical t-test [18] is used to indicate whether an improvement is statistically significant. A p-value less than 0.05 is deemed significant.
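The candidate-phrase extraction described in Section 5.2 can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the stopword list, the whitespace tokenization, the reading that each logged query contributes its full sequence of nontrivial terms as one N-gram, and the function names `candidate_phrases` and `phrase_thesaurus` are all hypothetical.

```python
from collections import Counter

# Illustrative stopword list (assumption; the paper does not give one).
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def nontrivial_terms(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def candidate_phrases(logged_queries, min_count=5):
    """Return N-grams whose occurrence count in the logs is higher than min_count.

    Assumption: the N-gram of a query is its whole sequence of nontrivial
    terms, so N varies from query to query.
    """
    counts = Counter()
    for query in logged_queries:
        terms = tuple(nontrivial_terms(query))
        if len(terms) >= 2:  # a single word is not a phrase
            counts[terms] += 1
    return {" ".join(terms) for terms, n in counts.items() if n > min_count}


def phrase_thesaurus(logged_queries, documents, min_count=5):
    """Keep only candidate phrases that also appear in the document corpus."""
    candidates = candidate_phrases(logged_queries, min_count)
    normalized_docs = [" ".join(nontrivial_terms(d)) for d in documents]
    return {
        phrase
        for phrase in candidates
        if any(f" {phrase} " in f" {doc} " for doc in normalized_docs)
    }
```

For example, a phrase logged six times that also occurs in some document is kept, while a phrase logged exactly five times, or one absent from the corpus, is dropped.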

TABLE 2 A Comparison of Retrieval Performance in Average Precision (%) for Long Queries between Baseline, Local Context Analysis (LC Exp), and Log-Based Query Expansion (On Log Exp)

TABLE 3 A Comparison of Retrieval Performance in Average Precision (%) for Short Queries between Baseline, Local Context Analysis (LC Exp), and Log-Based Query Expansion (On Log Exp)

Terms are weighted using the TF*IDF measure in our retrieval system. Both the original and the expanded queries are evaluated by the same retrieval system, making it possible to compare the effects of query expansion. For local context analysis, we use 30 expansion terms (including words and phrases) from 100 top-ranked documents for query expansion. The smoothing factor in local context analysis is set to 0.1, as suggested by [38]. For the log-based query expansion, we use 40 expansion terms.

We notice that the occurrences of phrases are far fewer than those of words. This creates an imbalance between the weights we assigned to word correlations and to phrase correlations. In order to create a better balance, the probability associated with a phrase correlation is multiplied by a factor S because phrases are less ambiguous than words (S is set to 10 in our experiments). The formula used to measure phrase correlations is modified from (7) to the following one:

$$P\big(T^{(d)} \mid T^{(q)}\big) = \sum_{\forall D_k \in S} \left( P\big(T^{(d)} \mid D_k\big) \times S \times \frac{f_k^{(q)}\big(T^{(q)}, D_k\big)}{f^{(q)}\big(T^{(q)}\big)} \right) \qquad (9)$$

where $T^{(d)}$ and $T^{(q)}$ are, respectively, a document phrase and a query phrase. In addition, the above results of (7) and (9) should be divided by the sum of all $P(w^{(d)} \mid w^{(q)})$ and $P(T^{(d)} \mid T^{(q)})$ in order to satisfy the requirement of the probabilistic framework.

6 EXPERIMENTAL RESULTS

6.1 Performance Comparison

We now present the experimental results of the local context analysis and the log-based query expansion method. Results with the original queries without expansion are used as the baseline. All the experiments are carried out with both words and phrases.
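To make the phrase weighting in (9) and the subsequent normalization concrete, here is a minimal sketch. It assumes, as our reading of the formula, that $f_k^{(q)}$ counts the sessions pairing the query term with a click on $D_k$ and that $f^{(q)}$ counts all sessions containing the query term; it also assumes the normalization divides by the combined sum of word and phrase scores. The function names, argument layout, and toy numbers are hypothetical.

```python
def correlations(p_term_given_doc, click_sessions, sessions_with_query_term, boost=1.0):
    """Unnormalized correlation of each document term with one query term.

    p_term_given_doc: {doc_id: {doc_term: P(doc_term | doc)}}
    click_sessions:   {doc_id: sessions pairing the query term with a click on doc_id}
    boost:            1.0 for words as in (7), the factor S for phrases as in (9)
    """
    scores = {}
    for doc_id, n_sessions in click_sessions.items():
        share = n_sessions / sessions_with_query_term
        for term, p in p_term_given_doc.get(doc_id, {}).items():
            scores[term] = scores.get(term, 0.0) + p * boost * share
    return scores


def normalize(word_scores, phrase_scores):
    """Divide every score by the sum of all word and phrase scores."""
    total = sum(word_scores.values()) + sum(phrase_scores.values())
    if total == 0:
        return {}, {}
    words = {t: s / total for t, s in word_scores.items()}
    phrases = {t: s / total for t, s in phrase_scores.items()}
    return words, phrases


# Toy data (invented for illustration only).
word_scores = correlations({"d1": {"israel": 0.5}}, {"d1": 2}, 4)
phrase_scores = correlations({"d1": {"six day war": 0.1}}, {"d1": 1}, 2, boost=10.0)
words, phrases = normalize(word_scores, phrase_scores)
```

With these toy numbers the phrase receives twice the normalized weight of the word, even though its raw probability in the document is five times smaller, which is the balancing effect the factor S is meant to provide.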
The results with the long queries and the short queries are presented, respectively, in Table 2 and Table 3. We see that our log-based query expansion performs very well on both query sets. On the long query set, the log-based query expansion method brings an average improvement of percent in precision (p-value = ) over the baseline, while local context analysis achieves an average improvement of 6.56 percent in precision (p-value = 0.33) over the baseline. The p-value suggests that the log-based query expansion gains a statistically significant improvement over the original queries. It is to be noted that the log-based query expansion also provides an average improvement of percent compared to local context analysis, which is also statistically significant (p-value = ). In general, we observe that log-based query expansion selects more accurate expansion terms than local context analysis due to the exploitation of user judgments. In contrast, local context analysis searches for expansion terms in the top-ranked retrieved documents and is more likely to add some irrelevant terms to the original query, thus introducing undesirable side effects on retrieval performance.

The results shown in Table 3 support our conjecture that our query expansion approach is even more useful for short queries than for long queries. There is a dramatic change in the performance of both local context analysis and the log-based query expansion method when short queries are expanded. The log-based query expansion offers an average improvement of percent (and maximum improvement of percent) in comparison with the original queries. The p-value for this improvement is which indicates its statistical significance. Local context analysis boosts the average precision to percent, which is percent better than the baseline (p-value = 0.018) (compared to only 6.56 percent improvement gained on the long query set). All these results suggest that query expansion is of extreme importance for short queries.
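The precision figures compared above are interpolated 11-point averages. A minimal sketch of that metric for a single query is given below, using the standard definition in which the interpolated precision at a recall level is the highest precision observed at any recall greater than or equal to that level. The function name and the toy document identifiers are hypothetical, and this is not the evaluation code used in the experiments.

```python
def eleven_point_average_precision(ranked_ids, relevant_ids):
    """Interpolated precision averaged over recall levels 0.0, 0.1, ..., 1.0."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0

    # Record (recall, precision) each time a relevant document is retrieved.
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))

    total = 0.0
    for level in range(11):
        recall_level = level / 10
        # Interpolation: best precision at any recall >= this level, else 0.
        reachable = [p for r, p in points if r >= recall_level]
        total += max(reachable, default=0.0)
    return total / 11


# Toy ranking: relevant documents at ranks 1 and 3.
score = eleven_point_average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```

Averaging this score over all queries in a set gives one average-precision figure per method, and the per-query scores are the paired observations that a t-test would compare.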
Given our observation that fewer than two words are used in most user queries in the Encarta logs, we conclude that query expansion may improve the effectiveness of search engines that deal with short queries. It is interesting to compare the results of the query expansion techniques on both query sets. With the local context analysis, the results with the long queries are slightly better than those with short queries, with an improvement of 1.33 percent. However, the results obtained by the log-based query expansion on the long queries are 3.64 percent worse than their counterparts for short queries.


TABLE 4 Comparison of Average Precision (%) obtained by Log-Based Query Expansion with Phrases (Phrase) and without Phrases (No Phrase) on the Long Query Set

TABLE 5 Comparison of Average Precision (%) obtained by Log-Based Query Expansion with Phrases (Phrase) and without Phrases (No Phrase) on the Short Query Set

This may suggest that our method can select expansion terms for short queries that are even better than those used in the long queries to describe the information needs. Globally, with query expansion, the performances for short and long queries are similar. This confirms that query expansion is an effective way to reduce the difference between short and long queries.

6.2 Impact of Phrases

Experiments on noun phrases in [38] showed that the local context analysis can achieve a small improvement with phrases. However, they only tested it with long queries. We believe that this impact can be even larger for short queries. In fact, even if a word-based representation is not precise, in a long query, this imprecision is compensated for by the large number of words in the query. The whole set of query words together may give a quite precise description of the information need. However, this is not the case for short queries. For short queries, the user's intention can be expressed more accurately with phrases because phrases are inherently less ambiguous than single words.

We conduct experiments on the log-based query expansion with and without phrases. The results are shown in Table 4 and Table 5. The results confirm the expectation just described. The improvement gained with phrases on the short queries is almost twice that obtained with phrases on the long queries. Similar to the retrieval process, query expansion is also affected by the ambiguity of the terms in the original queries. The use of phrases can help reduce the ambiguity of query terms, thus allowing query expansion to extract more relevant expansion terms.
For example, for the short version of query #8, Six Day War (see Table 1), each word is common and appears in many documents irrelevant to this query. If it is parsed as three single words, many irrelevant documents will be found. However, when it is presented as a phrase, the concept it represents becomes unambiguous and it matches fewer irrelevant documents, so the retrieval effectiveness can be improved. In comparison, given the long version of this query, if the three words are supplemented by the words Israel and Arab, then together they describe a more precise meaning, leading to more relevant documents. So, even though phrases are not recognized in a long query, the impact is less dramatic than for a short query.

Our other results (that are not listed here) show that, if we use phrases in the baseline method, the performance of the latter can also be improved by 2.35 percent and 8.95 percent, respectively, on the long and the short queries. Integrating phrases into the local context analysis can achieve improvements of 8.21 percent and percent for the long and short queries. These results are consistent with those of the log-based query expansion. In summary, phrases are very important for searching with short queries. In addition, our method of phrase extraction from user logs, although simple, proved to be effective.

6.3 Impact of Number of Expansion Terms

In general, the number of expansion terms should be within a reasonable range in order to produce consistently good performance. Too many expansion terms not only consume more time for the retrieval process, but also have side effects on the retrieval performance. We examine the performance of the log-based query expansion by using 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100 expansion terms on the two query sets. The results are shown in Fig. 3.

Fig. 3. Impact of number of expansion terms.

The best performances are obtained with around 30 expansion terms for both query sets. It is worth noting that the curve produced on the long query set is flatter than the other one. The curve of the long query set reaches its summit at 30 expansion terms and remains very flat after 30. In comparison, the curve of the short query set drops after adding more than 60 expansion terms. We attribute this to the fact that the short queries have fewer original terms, which, when expanded excessively without other terms to serve as context, may produce more side effects and generate more irrelevant terms. For long queries, as more terms act together in the selection of expansion terms, the chance of generating many irrelevant terms is much lower.

7 CONCLUSIONS

The proliferation of the World Wide Web prompts the wide application of search engines. However, short queries and the incompatibility between the terms in user queries and documents strongly affect the performance of the existing search engines. Many automatic query expansion techniques have been proposed, which can solve the short query and the term mismatching problem to some extent. However, they do not take advantage of the user logs available in various Web sites or use them as a means for query expansion. In this article, we presented a novel method for automatic query expansion based on user logs. This method aims first to establish correlations between query terms and document terms by exploiting the user logs. These relationships are then used for query expansion. We have shown that this is an effective way to narrow the gap between the query space and the document space.
For new queries, high-quality expansion terms can be selected from the document space on the basis of the extracted correlations. We tested this method on a data set that is similar to the real Web environment. A series of experiments conducted on both long queries and short queries showed that the log-based query expansion method can achieve substantial improvements in performance. It also outperforms local context analysis, which is one of the most effective query expansion methods in the past. Our experiments also show that query expansion is more effective for short queries than for long queries.

REFERENCES

[1] M.J. Bates, "Search Techniques," Ann. Rev. of Information Science and Technology, M.E. Williams, ed., pp. ,
[2] D. Beeferman and A. Berger, "Agglomerative Clustering of a Search Engine Query Log," Proc. SIGKDD, pp. ,
[3] G. Brajnik, S. Mizzaro, and C. Tasso, "Evaluating User Interfaces to Information Retrieval Systems: A Case Study on User Support," Proc. 19th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR 96), pp. , Aug.
[4] C. Buckley, G. Salton, and J. Allan, "Automatic Retrieval with Locality Information Using Smart," Proc. First Text Retrieval Conf. (TREC-1), pp. ,
[5] C. Buckley, M. Mitra, J. Walz, and C. Cardie, "Using Clustering and Superconcepts within Smart," Proc. Sixth Text Retrieval Conf. (TREC-6), E. Voorhees, ed., pp. ,
[6] C. Buckley, G. Salton, J. Allan, and A. Singhal, "Automatic Query Expansion Using SMART," Overview of the Third Retrieval Conf. (TREC-3), pp. , Nov.
[7] C. Carpineto, G. Romano, and B. Bigi, "An Information-Theoretic Approach to Automatic Query Expansion," ACM Trans. Information Systems, vol. 19, no. 1, pp. 1-27, Jan.
[8] J.W. Cooper and R.J. Byrd, "Lexical Navigation: Visually Prompted Query Expansion and Refinement," Proc. Second ACM Int'l Conf. Digital Libraries, pp. ,
[9] W.B. Croft, R. Cook, and D. Wilder, "Providing Government Information on the Internet: Experiences with THOMAS," Proc. Second Int'l Conf. Theory and Practice of Digital Libraries, pp. ,
[10] C.J. Crouch and B. Yang, "Experiments in Automatic Statistical Thesaurus Construction," Proc. ACM-SIGIR Conf. Research and Development in Information Retrieval, pp. ,
[11] H. Cui, J.-R. Wen, J.-Y. Nie, and W.-Y. Ma, "Probabilistic Query Expansion Using User Logs," Proc. 11th World Wide Web Conf., pp. ,
[12] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman, "Indexing by Latent Semantic Analysis," J. Am. Soc. Information Science and Technology, vol. 41, no. 6, pp. ,
[13] E. Efthimiadis and P. Biron, "UCLA-Okapi at TREC-2: Query Expansion Experiments," Proc. Second Text Retrieval Conf. (TREC-2), D.K. Harmon, ed.,
[14] D. Evans and R. Lefferts, "Design and Evaluation of the CLARIT-TREC-2 System," Proc. Second Text Retrieval Conf. (TREC-2),
[15] G.W. Furnas, T.K. Landauer, L.M. Gomez, and S.T. Dumais, "The Vocabulary Problem in Human-System Communication," Comm. ACM, vol. 30, no. 11, pp. ,
[16] G. Grefenstette, Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers,
[17] S.P. Harter, Online Information Retrieval: Concepts, Principles, and Techniques. Orlando, Fla.: Academic Press,
[18] D. Hull, "Using Statistical Testing in the Evaluation of Retrieval Experiments," Proc. ACM SIGIR, pp. , June
[19] Y. Jing and W.B. Croft, "An Association Thesaurus for Information Retrieval," Proc. RIAO, pp. ,
[20] M.E. Lesk, "Word-Word Associations in Document Retrieval Systems," Am. Documentation, vol. 20, no. 1, pp. ,
[21] A. Lu, M. Ayoub, and J. Dong, "Ad Hoc Experiments Using EUREKA," Proc. Text Retrieval Conf. (TREC-5), pp. ,
[22] G. Miller, "Wordnet: An Online Lexical Database," Int'l J. Lexicography, vol. 3, no. 4,
[23] M. Mitra, A. Singhal, and C. Buckley, "Improving Automatic Query Expansion," Proc. 21st Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. ,
[24] Y. Qiu and H. Frei, "Concept Based Query Expansion," Proc. 16th Int'l ACM SIGIR Conf. R & D in Information Retrieval, pp. ,
[25] S.E. Robertson and K. Sparck Jones, "Relevance Weighting of Search Terms," J. Am. Soc. for Information Sciences, vol. 27, no. 3, pp. ,
[26] S.E. Robertson, S. Walker, and M. Sparck Jones, et al., "Okapi at TREC-3," Proc. Second Text Retrieval Conf. (TREC-3),
[27] J. Rocchio, "Relevance Feedback in Information Retrieval," The Smart Retrieval System: Experiments in Automatic Document Processing, G. Salton, ed., pp. ,
[28] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. England: Pearson Education Limited,
[29] T. Sakai, S.E. Robertson, and S. Walker, "Flexible Pseudo-Relevance Feedback Via Direct Mapping and Categorization of Search Requests," Proc. BCS-IRSG ECIR, pp. 3-14,
[30] G. Salton, The SMART Retrieval System: Experiments in Automatic Document Processing. Englewood Cliffs, N.J.: Prentice Hall,
[31] G. Salton and C. Buckley, "Improving Retrieval Performance by Relevance Feedback," J. Am. Soc. for Information Science, vol. 41, no. 4, pp. ,
[32] K. Sparck Jones, Automatic Keyword Classification for Information Retrieval. London: Butterworths, 1971.

[33] E.M. Voorhees, "Query Expansion Using Lexical-Semantic Relations," Proc. 17th Int'l Conf. Research and Development in Information Retrieval, pp. ,
[34] J.-R. Wen, J.-Y. Nie, and H.-J. Zhang, "Query Clustering Using User Logs," ACM Trans. Information Systems, vol. 20, no. 1, pp. ,
[35] S.K. Wong and W. Ziarko et al., "On Modeling of Information Retrieval Concepts in Vector Spaces," ACM Trans. Database Systems, vol. 12, no. 2, pp. 299-321, June 1987.
[36] S.K.M. Wong and Y.Y. Yao, "A Probabilistic Method for Computing Term-by-Term Relationships," J. Am. Soc. for Information Science, vol. 44, no. 8, pp. ,
[37] J. Xu and W.B. Croft, "Query Expansion Using Local and Global Document Analysis," Proc. 19th Int'l Conf. Research and Development in Information Retrieval, pp. 4-11, 1996.
[38] J. Xu and W.B. Croft, "Improving the Effectiveness of Information Retrieval with Local Context Analysis," ACM Trans. Information Systems, vol. 18, no. 1, pp. , Jan.

Hang Cui received the BS and MS degrees in management information systems from Tianjin University, Tianjin, China, in 2000 and 2002, respectively. In July 2002, he was admitted into the National University of Singapore, where he is pursuing a PhD degree. In 2001 and 2002, he spent one year working as a visiting student at Microsoft Research Asia, Beijing, China. His research interests include text mining, intelligent information retrieval, machine learning, and Q & A systems.

Ji-Rong Wen received the BS and MS degrees in 1994 and 1996, both from the School of Information, Renmin University of China. He received the PhD degree in 1999 from the Institute of Computing Technology, the Chinese Academy of Science. He joined Microsoft Research Asia in July 1999 and is currently a researcher in the Media Management Group. His main research interests are data management, intelligent information retrieval, and Web mining.

Jian-Yun Nie received the PhD degree in 1990 from the Université Joseph Fourier of Grenoble, France. He is an associate professor in Département d'informatique et Recherche Opérationnelle, Université de Montréal.
His research interests are focused on information retrieval (IR), in particular, cross-language and multilingual IR, knowledge- and NLP-based IR, as well as theoretical aspects of IR such as logical models of IR. He is also interested in data mining.

Wei-Ying Ma received the BS degree in electrical engineering from the National Tsing Hua University in Taiwan in 1990, and the MS and PhD degrees in electrical and computer engineering from the University of California at Santa Barbara in 1994 and 1997, respectively. He joined Microsoft Research Asia in April 2001 as the research manager of the Media Management Group. Prior to joining Microsoft, he was with Hewlett-Packard Laboratories in Palo Alto, California, where he was a researcher in the Internet Mobile and Systems Lab. From 1994 to 1997, he was engaged in the Alexandria Digital Library (ADL) project at the University of California at Santa Barbara while completing his PhD degree. Dr. Ma serves as an associate editor for the Journal of Multimedia Tools and Applications published by Kluwer Academic Publishers. He has served on the organizing and program committees of many international conferences and has published four book chapters. His research interests include image and video analysis, content-based image and video search and retrieval, machine learning techniques, intelligent information systems, adaptive content delivery, content distribution and services networks, and media delivery and caching. He is a member of the IEEE.
