Empirical Exploitation of Click Data for Query-Type-Based Ranking

Anlei Dong, Yi Chang, Shihao Ji, Ciya Liao, Xin Li, Zhaohui Zheng
Yahoo! Labs, 701 First Avenue, Sunnyvale, CA
{anlei,yichang,shihao,ciyaliao,xinli,zhaohui}@yahoo-inc.com

Abstract

In machine-learning-based ranking, each category of queries can be applied with a specific ranking function, which is called query-type-based ranking. Such a divide-and-conquer strategy can potentially provide a better ranking function for each query category. A critical problem for query-type-based ranking is training data insufficiency, which may be solved by using the data extracted from click logs. This paper empirically studies how to appropriately exploit click data to improve rank function learning in query-type-based ranking. The main contributions are 1) the exploration of the utilities of two promising approaches for click pair extraction; 2) the analysis of the role played by the noise information which inevitably appears in click data extraction; 3) the appropriate strategy for combining training data and click data; 4) the comparison of click data which are consistent and inconsistent with the baseline function.

1 Introduction

Learning-to-rank approaches (Liu, 2008) have been widely applied in commercial search engines (Brin and Page, 1998), in which ranking models are learned using labeled documents. Significant efforts have been made in an attempt to learn a generic ranking model which can appropriately rank documents for all queries. However, web users' query intentions are extremely heterogeneous, which makes it difficult for a generic ranking model to achieve the best ranking results for all queries. For this reason, ranking problems are studied for some specific categories of queries. A query category can be defined by users' query intentions, such as navigational queries, news queries, product queries, etc.

Figure 1: Framework of incorporating click-through data with training data to improve the dedicated model for query-type-based ranking.

The motivation of this divide-and-conquer strategy is that each query category has its own data distribution and discriminant rank features; thus, a dedicated ranking model should give better ranking results for this query category compared with a generic ranking model. Such a dedicated model can be trained using the labeled data belonging to this query category (which is called dedicated training data). However, the amount of training data dedicated to one query category is usually insufficient because human labeling is expensive and time-consuming, not to mention that there are multiple categories that need to be taken care of. To deal with the training data insufficiency problem for query-type-based ranking, we propose to extract click-through data and incorporate it with dedicated training data to learn a dedicated model. In order to incorporate click information to improve the ranking for a dedicated query category, it is critical to fully exploit the click log. We empirically explore the related approaches for appropriate click data exploitation in query-type-based rank function learning.

Figure 1 illustrates the procedures and critical components to be studied. 1) Click data mining: the purpose is to extract informative and reliable user preference information from the click log. We employ two promising approaches: one is the heuristic rule approach, the other is the sequential supervised learning approach. 2) Sample selection and combination: with labeled training data and unlabeled click data, how do we select and combine them so that the samples have the best utility for learning? As the data distribution within a specific query category is different from the generic data distribution, it is natural to select those labeled training samples and unlabeled click preference pairs which belong to this query category, so that the data distributions of the training set and testing set are consistent for this category. On the other hand, we should keep in mind that: a) non-dedicated data, i.e., the data that does not belong to the specific category, might also have a distribution similar to the dedicated data. Such distribution similarity makes non-dedicated data also useful for query-type-based rank function learning, especially in the scenario where dedicated training samples are insufficient. b) The quality of dedicated click data may not be as reliable as human-labeled training data. In other words, some extracted click preference pairs are inconsistent with human labeling, while we regard human labeling as correct labeling. 3) Rank function learning algorithm: we use the GBrank (Zheng et al., 2007) algorithm for rank function learning, which has proved to be one of the most effective up-to-date learning-to-rank algorithms; furthermore, the GBrank algorithm also takes preference pairs as inputs, which will be illustrated in more detail in the paper.

2 Related work

Learning to rank has been a promising research area which continuously improves web search relevance (Burges et al., 2005) (Zha et al., 2006) (Cao et al., 2007) (Freund et al., 1998) (Friedman, 2001) (Joachims, 2002) (Wang and Zhai, 2007) (Zheng et al., 2007). The ranking problem is usually formulated as learning a ranking function from preference data. The basic idea is to minimize the number of contradicted pairs in the training data, and different algorithms cast the preference learning problem from different points of view; for example, RankSVM (Joachims, 2002) uses support vector machines; RankBoost (Freund et al., 1998) applies the idea of boosting from weak learners; GBrank (Zheng et al., 2007) uses gradient boosting with decision trees; RankNet (Burges et al., 2005) uses gradient boosting with neural networks. In (Zha et al., 2006), query difference is taken into consideration for learning an effective retrieval function, which leads to a multi-task learning problem using a risk minimization framework. There are a few related works that apply multiple ranking models for different query categories. However, none of them takes click-through information into consideration. In (Kang and Kim, 2003), queries are categorized into 3 types, informational, navigational and transactional, and different models are applied on each query category. Geng et al. (Geng et al., 2008) propose a KNN method to employ different ranking models to handle different types of queries. The KNN method is unsupervised, and it targets improving the overall ranking instead of the ranking for a certain query category. In addition, the KNN method requires all feature vectors to be the same. Quite a few research papers explore how to obtain useful information from click-through data, which can benefit search relevance (Carterette et al., 2008) (Fox et al., 2005) (Radlinski and Joachims, 2007) (Wang and Zhai, 2007).
The information can be expressed as pair-wise preferences (Chapelle and Zhang, 2009) (Ji et al., 2009) (Radlinski et al., 2008), or represented as rank features (Agichtein et al., 2006). Topical ranking relies on the accuracy of query classification. Query classification or query intention identification has been extensively studied in (Beitzel et al., 2007) (Lee et al., 2005) (Li et al., 2008) (Rose and Levinson, 2004). How to combine editorial data and click data is well discussed in (Chen et al., 2008) (Zheng et al., 2007). In addition, how to use click data to improve ranking has also been explored in personalized or preference-based search (Coyle and Smyth, 2007) (Glance, 2001) (R. Jin, 2008) (Smyth et al., 2003).

3 Technical approach

This section presents the related approaches in Figure 1. In Section 4, we make a deeper analysis based on experimental results.

3.1 Click data mining

We use two approaches for click data mining, whose outputs are preference pairs. A preference pair is defined as a tuple {<x_q, y_q> | x_q ≻ y_q}, which means that for the query q, the document x_q is more relevant than y_q. We need to extract informative and reliable preference pairs which can be used to improve rank function learning.

3.1.1 Heuristic rule approach

We use heuristic rules to extract skip-above pairs and skip-next pairs, which are similar to Strategy 1 (click > skip above) and Strategy 5 (click > no-click next) proposed in (Joachims et al., 2005). To reduce the misleading effect of an individual click behavior, click information from different query sessions is aggregated before applying heuristic rules. For a tuple (q, url_1, url_2, pos_1, pos_2), where q is the query, url_1 and url_2 are urls representing two documents, and pos_1 and pos_2 are ranking positions for the two documents with pos_1 < pos_2 meaning url_1 has a higher rank than url_2, the statistics for this tuple are listed in Table 1.

Table 1: Statistics of click occurrences for the heuristic rule approach.
imp: impression, the number of occurrences of the tuple
cc: the number of occurrences of the tuple where the two documents both get clicked
ncc: the number of occurrences of the tuple where url_1 is not clicked but url_2 is clicked
cnc: the number of occurrences of the tuple where url_1 is clicked but url_2 is not clicked
ncnc: the number of occurrences of the tuple where url_1 and url_2 are both not clicked

Skip-above pair extraction: if ncc is much larger than cnc, and cc/imp and ncnc/imp are much smaller than 1, that means that when url_1 is ranked higher than url_2 for query q, most users click url_2 but do not click url_1. In this case, we extract a skip-above pair, i.e., url_2 is more relevant than url_1. In order to have highly accurate skip-above pairs, a set of thresholds is applied so that we only extract the pairs that have high impression and whose ncc is sufficiently larger than cnc.

Skip-next pair extraction: if pos_1 = pos_2 - 1, cnc is much larger than ncc, and cc/imp and ncnc/imp are much smaller than 1, that means that in most cases when url_2 is ranked just below url_1 for query q, most users click url_1 but do not click url_2. In this case, we regard this tuple as a skip-next pair.
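As a concrete illustration of these rules, the following sketch extracts skip-above and skip-next pairs from the aggregated statistics of Table 1. The counts follow Table 1, but the specific threshold values (MIN_IMP, RATIO, MAX_FRACTION) and the function name extract_pairs are assumptions made for the example; the paper only states that a set of thresholds is applied.

```python
from collections import namedtuple

# Aggregated click statistics for one (query, url_1, url_2, pos_1, pos_2) tuple,
# following Table 1: imp, cc, ncc, cnc, ncnc.
Stats = namedtuple("Stats", "query url1 url2 pos1 pos2 imp cc ncc cnc ncnc")

# Illustrative thresholds (assumptions, not the values used in the paper).
MIN_IMP = 50          # require enough impressions before trusting the counts
RATIO = 3.0           # ncc (or cnc) must dominate its counterpart by this factor
MAX_FRACTION = 0.2    # cc/imp and ncnc/imp must stay well below 1

def extract_pairs(tuples):
    """Return (skip_above, skip_next) preference pairs from aggregated click stats."""
    skip_above, skip_next = [], []
    for t in tuples:
        if t.imp < MIN_IMP:
            continue
        low_cc = t.cc / t.imp < MAX_FRACTION and t.ncnc / t.imp < MAX_FRACTION
        # Skip-above: url_1 is ranked higher, but users click url_2 and skip url_1,
        # so url_2 is preferred over url_1 (contradicts the current ranking).
        if low_cc and t.ncc > RATIO * max(t.cnc, 1):
            skip_above.append((t.query, t.url2, t.url1))
        # Skip-next: url_2 is shown just below url_1, users click url_1 and skip url_2,
        # so url_1 is preferred over url_2 (consistent with the current ranking).
        elif low_cc and t.pos1 == t.pos2 - 1 and t.cnc > RATIO * max(t.ncc, 1):
            skip_next.append((t.query, t.url1, t.url2))
    return skip_above, skip_next
```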
To test the accuracy of the preference pairs, we ask editors to judge some randomly selected pairs from the skip-above pairs and the skip-next pairs. Editors label each query-url pair using five grades according to relevance: perfect, excellent, good, fair, bad.

Table 2: Skip-above pair counts vs. human judgements (e.g., the element in the third row and second column means we have 40 skip-above pairs with excellent url_1 and perfect url_2). P: perfect; E: excellent; G: good; F: fair; B: bad.

Table 3: Skip-next pair counts vs. human judgements (e.g., the element in the third row and second column means we have 10 skip-next pairs with excellent url_1 and perfect url_2). P: perfect; E: excellent; G: good; F: fair; B: bad.

Table 2 shows the skip-above pair distribution. The diagonal elements have high values, which correspond to pairs labeled as tied by editors but determined as skip-above pairs by the heuristic rules. Higher values appear in the left-bottom triangle than in the right-top triangle, because there are more skip-above preferences that agree with editors than disagree with editors. Summing up the tied, agreed and disagreed pairs, 44% of the skip-above preference judgments agree with editors, 18% disagree with editors, and 38% of the skip-above pairs are judged as tie pairs by editors. Table 3 shows the skip-next pair distribution. Summing up the tied, agreed and disagreed pairs, 70% of the skip-next preference judgments agree with editors, 4% disagree with editors, and 26% of the skip-next pairs are judged as tie pairs by editors.

Therefore, skip-next pairs have much higher accuracy than skip-above pairs. That is because, in a search engine that already has a good ranking function, it is much easier to find a correct skip-next pair which is consistent with the search engine than a correct skip-above pair which is contradictory to the search engine. Skip-above and skip-next preferences provide two kinds of user feedback which are complementary: skip-above preferences provide the feedback that the user's vote is contradictory to the current ranking, which implies the current relative ranking should be reversed; skip-next preferences show that the user's vote is consistent with the current ranking, which implies the current relative ranking should be maintained, with high confidence provided by users' votes.

3.1.2 Sequential supervised learning

Click modeling by sequential supervised learning (SSL) was proposed in (Ji et al., 2009), in which users' sequential click information is exploited to extract relevance information from click logs. This approach is reliable because 1) the sequential click information embedded in an aggregation of user clicks provides substantial relevance information about the documents displayed in the search results, and 2) the SSL is supervised learning (i.e., human judgments provide the relevance labels for training). The SSL is formulated in the framework of global ranking (Qin et al., 2008). Let x^(q) = {x_1^(q), x_2^(q), ..., x_n^(q)} represent the documents retrieved for a query q, and y^(q) = {y_1^(q), y_2^(q), ..., y_n^(q)} represent the relevance labels assigned to the documents. Here n is the number of documents retrieved for q. Without loss of generality, we assume that n is fixed and invariant with respect to different queries. The SSL aims to find a function F of the form y^(q) = F(x^(q)) that takes all the documents as its inputs, exploits both local and global information among the documents, and predicts the relevance labels of all the documents jointly. This is distinct from most learning-to-rank methods, which optimize a ranking model defined on a single document, i.e., of the form y_i^(q) = f(x_i^(q)), i = 1, 2, ..., n. This formulation of the SSL is important for extracting relevance information from user click data, since users' click decisions among different documents displayed in a search session tend to rely not only on the relevance judgment of a single document, but also on the relative relevance comparison among the documents displayed; and the global ranking framework is well formulated to exploit both local and global information from an aggregation of user clicks. The SSL aggregates all the user sessions for the same query into a tuple <query, n-document list, an aggregation of user clicks>. Figure 2 illustrates the process of feature extraction from an aggregated session, where x^(q) = {x_1^(q), x_2^(q), ..., x_n^(q)} denotes a sequence of feature vectors extracted from the aggregated session, with x_i^(q) representing the feature vector extracted for document i. Specifically, to form feature vector x_i^(q), first a feature vector x_{i,j}^(q) is extracted from each user j's click information, j ∈ {1, 2, ...}; then x_i^(q) is formed by averaging over x_{i,j}^(q), j ∈ {1, 2, ...}, i.e., x_i^(q) is actually an aggregated feature vector for document i. Table 4 lists all the features used in the SSL modeling. Note that some features are statistics independent of the temporal information of the clicks, such as Position and Frequency, while other features rely on the surrounding documents and the click sequences. We use 90,000 query-url pairs to train the SSL model, and 10,000 query-url pairs for best model selection.
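As a small illustration of the aggregation step described above, the sketch below averages per-user click feature vectors into the aggregated vector x_i^(q) for one document. The feature names follow Table 4; the dictionary representation, the function name aggregate_features and the numeric values are assumptions made up for the example.

```python
import numpy as np

# Click features per (document, user), following Table 4 of the paper.
FEATURES = ["Position", "ClickRank", "Frequency", "FrequencyRank",
            "IsNextClicked", "IsPreClicked", "IsAboveClicked",
            "IsBelowClicked", "ClickDuration"]

def aggregate_features(per_user_vectors):
    """per_user_vectors: list over users j of feature dicts for one document i.
    Returns the aggregated feature vector x_i^(q) obtained by averaging over users."""
    matrix = np.array([[v[name] for name in FEATURES] for v in per_user_vectors],
                      dtype=float)
    return matrix.mean(axis=0)

# Example: two users' click feature vectors for the same document in the same query.
user1 = {"Position": 2, "ClickRank": 1, "Frequency": 1, "FrequencyRank": 1,
         "IsNextClicked": 0, "IsPreClicked": 1, "IsAboveClicked": 1,
         "IsBelowClicked": 0, "ClickDuration": 35.0}
user2 = {"Position": 2, "ClickRank": 2, "Frequency": 1, "FrequencyRank": 2,
         "IsNextClicked": 1, "IsPreClicked": 0, "IsAboveClicked": 1,
         "IsBelowClicked": 1, "ClickDuration": 12.0}
x_i = aggregate_features([user1, user2])  # aggregated vector for document i
```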
With the sequential click modeling discussed above, several sequential supervised algorithms, including conditional random fields (CRF) (Lafferty et al., 2001), the sliding window method and the recurrent sliding window method (Dietterich, 2002), are explored to find a global ranking function F. We omit the details and refer the reader to (Ji et al., 2009). The emphasis here is on the importance of adapting these algorithms to the ranking problem. After training, the SSL model can be used to predict the relevance labels of all the documents in a new aggregated session, and thus pair-wise preference data can be extracted, with the score difference representing the confidence of the preference prediction. For convenience, we also call the preference pairs contradicting the production ranking skip-above pairs and those consistent with the production ranking skip-next pairs, so that we can analyze these two types of preference pairs separately.
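A minimal sketch of this pair extraction step is shown below: predicted scores are turned into preference pairs whose confidence is the score difference, and each pair is labeled skip-above or skip-next depending on whether it contradicts or agrees with the production ranking. The score-gap threshold and the function name are assumptions, not the exact procedure of (Ji et al., 2009).

```python
from itertools import combinations

def pairs_from_predictions(query, scored_docs, production_rank, min_gap=0.5):
    """scored_docs: dict doc -> SSL-predicted relevance score for one aggregated session.
    production_rank: dict doc -> position in the production ranking (1 = top).
    Returns (skip_above, skip_next) lists of (query, preferred_doc, other_doc, confidence),
    where confidence is the predicted score difference."""
    skip_above, skip_next = [], []
    for a, b in combinations(scored_docs, 2):
        gap = scored_docs[a] - scored_docs[b]
        if abs(gap) < min_gap:          # drop low-confidence pairs
            continue
        preferred, other = (a, b) if gap > 0 else (b, a)
        pair = (query, preferred, other, abs(gap))
        if production_rank[preferred] < production_rank[other]:
            skip_next.append(pair)      # agrees with the production ranking
        else:
            skip_above.append(pair)     # contradicts the production ranking
    return skip_above, skip_next
```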

Figure 2: An illustration of feature extraction for an aggregated session for the SSL approach. x^(q) denotes an extracted sequence of feature vectors, and y^(q) denotes the corresponding label sequence that is assigned by human judges for training.

Table 4: Click features used in the SSL model.
Position: position of the document in the result list
ClickRank: rank of the first click of the document in the click sequence
Frequency: average number of clicks for this document
FrequencyRank: rank in the list sorted by number of clicks
IsNextClicked: 1 if the next position is clicked, 0 otherwise
IsPreClicked: 1 if the previous position is clicked, 0 otherwise
IsAboveClicked: 1 if there is a click above, 0 otherwise
IsBelowClicked: 1 if there is a click below, 0 otherwise
ClickDuration: time spent on the document

3.2 Modeling algorithm

GBrank was proposed in (Zheng et al., 2007). For each preference pair <x_i, y_i> in the available preference set S = {<x_i, y_i> | x_i ≻ y_i, i = 1, 2, ..., N}, x_i should be ranked higher than y_i. In the GBrank algorithm, the problem of learning ranking functions is to compute a ranking function h so that h matches the set of preferences, i.e., h(x_i) ≥ h(y_i) if x_i ≻ y_i, i = 1, 2, ..., N, for as many pairs as possible. The following loss function is used to measure the risk of a given ranking function h:

R(h) = (1/2) Σ_{i=1}^{N} (max{0, h(y_i) - h(x_i) + τ})²,   (1)

where τ is the margin between the two documents in the pair. To minimize the loss function, h(x_i) has to be larger than h(y_i) by the margin τ, which can be chosen as a constant value, or as dynamic values varying with pairs. When pair-wise judgments are extracted from editors' labels with different grades, the pair-wise judgments can include the grade difference, which can further be used as the margin τ. The basic idea of GBrank is that if the ordering of a preference pair by the ranking function is contradictory to this preference, we need to modify the ranking function along the direction of swapping this preference pair. Preference pairs can be generated from editors' labels with absolute relevance grades or extracted from click data. The GBrank algorithm is illustrated in Algorithm 1, and two parameters need to be determined: the shrinkage factor η and the number of iterations.

Algorithm 1 GBrank algorithm.
Start with an initial guess h_0; for k = 1, 2, ...:
1. Using h_{k-1} as the current approximation of h, separate S into two disjoint sets
   S^+ = {<x_i, y_i> ∈ S | h_{k-1}(x_i) ≥ h_{k-1}(y_i) + τ} and
   S^- = {<x_i, y_i> ∈ S | h_{k-1}(x_i) < h_{k-1}(y_i) + τ}.
2. Fit a regression function g_k(x) using gradient boosting trees (Friedman, 2001) and the following training data:
   {(x_i, h_{k-1}(y_i) + τ), (y_i, h_{k-1}(x_i) - τ) | (x_i, y_i) ∈ S^-}.
3. Form (with normalization of the range of h_k)
   h_k(x) = (k · h_{k-1}(x) + η · g_k(x)) / (k + 1),
   where η is a shrinkage factor.
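To make Algorithm 1 concrete, here is a minimal sketch under stated assumptions: h_0 is taken to be 0, the margin τ is a constant, and scikit-learn's GradientBoostingRegressor stands in for the gradient boosting tree of step 2. The class name GBRankSketch and all hyperparameter values are illustrative choices, not the implementation used in the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

class GBRankSketch:
    """Minimal sketch of GBrank (Algorithm 1) on preference pairs (x_i preferred over y_i)."""

    def __init__(self, n_iter=10, eta=0.05, tau=1.0):
        self.n_iter, self.eta, self.tau = n_iter, eta, tau
        self.trees = []

    def _h(self, X, k):
        # h_k built incrementally with h_0 = 0 and h_k = (k*h_{k-1} + eta*g_k) / (k+1).
        h = np.zeros(len(X))
        for i, g in enumerate(self.trees[:k], start=1):
            h = (i * h + self.eta * g.predict(X)) / (i + 1)
        return h

    def fit(self, X_pref, Y_pref):
        """X_pref[i] should be ranked above Y_pref[i]; both are (N, d) arrays."""
        X_pref, Y_pref = np.asarray(X_pref, float), np.asarray(Y_pref, float)
        for k in range(1, self.n_iter + 1):
            hx, hy = self._h(X_pref, k - 1), self._h(Y_pref, k - 1)
            wrong = hx < hy + self.tau            # S^-: contradicted (or within-margin) pairs
            if not wrong.any():
                break
            # Regression targets push contradicted pairs apart by the margin tau.
            X_fit = np.vstack([X_pref[wrong], Y_pref[wrong]])
            y_fit = np.concatenate([hy[wrong] + self.tau, hx[wrong] - self.tau])
            g = GradientBoostingRegressor(n_estimators=30, max_depth=3)
            g.fit(X_fit, y_fit)
            self.trees.append(g)
        return self

    def predict(self, X):
        return self._h(np.asarray(X, float), len(self.trees))
```

A call such as GBRankSketch(n_iter=20).fit(X_preferred, X_other).predict(X_test) would then score unseen documents.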
3.3 Sample selection and combination

We use a straightforward approach to learn a ranking model from the combined data, which is illustrated in Algorithm 2.

Algorithm 2 Learn a ranking model by combining editorial data and click preference pairs.
Input: editorial absolute judgment data; preference pairs from click data.
1. Extract preference pairs from the labeled data with absolute judgments.
2. Select and combine preference pairs from click data and labeled data.
3. Learn a GBrank model from the combined preference pairs.

Absolute judgments on labeled data consist of (query, url) pairs with absolute grade values labeled by humans. In Step 1, for each query with n_q query-url pairs with corresponding grades, {<query, url_i, grade_i> | i = 1, 2, ..., n_q}, its preference pairs are extracted as {<query, url_i, url_j, grade_i - grade_j> | i, j = 1, 2, ..., n_q, i ≠ j}. When combining human-labeled pairs and click preference pairs, we can use different relative weights for these two data sources. The loss function becomes

R(h) = (w / N_l) Σ_{Labeled} (max{0, h(y_i) - h(x_i) + τ})² + ((1 - w) / N_c) Σ_{Click} (max{0, h(y_i) - h(x_i) + τ})²,   (2)

where w is used to control the relative weight between labeled training data and click data, N_l is the number of training data pairs, and N_c is the number of click pairs. The margin τ can be determined as the grade difference for editorial pairs, and set as a constant parameter for click pairs.
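The pair extraction of Step 1 and the per-source weighting of equation (2) can be sketched as follows, assuming simple tuple representations for graded labels and click pairs; the function names and the way a weight is attached to each individual pair are illustrative choices, not the exact implementation.

```python
from itertools import permutations

def editorial_pairs(graded):
    """graded: list of (query, url, grade) with absolute editorial grades.
    Returns preference pairs (query, preferred_url, other_url, margin), where the
    margin is the grade difference, as in Step 1 of Algorithm 2."""
    by_query = {}
    for query, url, grade in graded:
        by_query.setdefault(query, []).append((url, grade))
    pairs = []
    for query, docs in by_query.items():
        for (u_i, g_i), (u_j, g_j) in permutations(docs, 2):
            if g_i > g_j:
                pairs.append((query, u_i, u_j, g_i - g_j))
    return pairs

def combine(editorial, click, w, tau_click=1.0):
    """Attach per-pair weights and margins following equation (2): editorial pairs
    get weight w / N_l and a grade-difference margin, click pairs get weight
    (1 - w) / N_c and a constant margin tau_click. Both lists are assumed non-empty."""
    n_l, n_c = len(editorial), len(click)
    combined = [(q, a, b, margin, w / n_l) for (q, a, b, margin) in editorial]
    combined += [(q, a, b, tau_click, (1 - w) / n_c) for (q, a, b, _conf) in click]
    return combined
```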

Step 2 is critical for the efficacy of the approach. A few factors need to be considered: 1) Data distribution: for the application of query-type-based ranking, our purpose is to improve ranking for the queries belonging to this category. An important observation is that the relevance patterns for ranking within a specific category may have some unique characteristics, which are different from generic relevance ranking. Thus, it is reasonable to consider only using dedicated labeled training data and dedicated click preference data for training. The reality is that dedicated training data is usually insufficient, while it is possible that non-dedicated data can also help the learning. 2) Click pair quality: it is inevitable that some incorrect pairs exist among the click preference pairs. Such incorrect pairs may mislead the learning. So overall, can the click preference pairs still help the learning for query-type-based ranking? By our study, skip-above pairs usually contain more incorrect pairs compared with skip-next pairs. Does this mean skip-next pairs are always more helpful in improving learning than skip-above pairs? 3) Click pair utility: using labeled training data as the baseline, how much complementary information can click pairs bring? This is determined by the methodology of the click data mining approach. While it is possible to achieve some learning improvement for query-type-based ranking by using click pairs with a plausible method, we attempt to empirically explore the above interweaving factors for a deeper understanding, in order to use the most appropriate strategy to exploit click data.

4 Experiments

4.1 Data set

Query category: in the experiments, we use long query ranking as an example of query-type-based ranking, because it is commonly known that long query ranking has some unique relevance patterns compared with generic ranking. We define long queries as the queries containing at least three tokens. The techniques and analysis proposed in this paper can be applied to other query categories, such as product queries, news queries, blog queries, etc.

Labeled training data: we do experiments based on a data set for a commercial search engine, for which there are 16,797 query-url pairs (with 1,123 different queries) that have been labeled by editors. The proportion of long queries is about 35% of all queries. The data distribution of such long queries may be different from the general data distribution, as will be validated in the experiments below. The human-labeled data is randomly split into two sets: a training set (8,831 query-url pairs, 589 queries) and a testing set (7,966 query-url pairs, 534 queries). The training set will be combined with click preference pairs for rank function learning, and the testing set will be used to evaluate the efficacy of the ranking function. In the training set, there are 3,842 long query-url pairs (229 queries). At the testing stage, the learned rank functions are applied only to the long queries in the testing data, as our concern in this paper is how to improve query-type-based ranking, i.e., long query ranking in the experiment. In the testing data, there are 3,210 long query-url pairs (193 queries), which will be used to test the rank functions.
Click preference pairs: using the two approaches, the heuristic rule approach and the sequential supervised approach, we extract click preference pairs from the click log of the search engine. Each approach yields both skip-next and skip-above pairs, which are sorted in descending order of confidence.

4.2 Setup and measurements

We try different sample selection and combination strategies to train rank functions using the GBrank algorithm. For the labeled training data, we either use generic data or dedicated data. For the click preference pairs, we also try these two options. Furthermore, as more click preference pairs may bring more useful information to help the learning, while on the other hand more incorrect pairs may be introduced that mislead the learning, we try different amounts of these preference pairs: 5,000, 10,000, 30,000, 50,000, 70,000 and 100,000 pairs. We use NDCG to evaluate the ranking model, which is defined as

NDCG_n = Z_n Σ_{i=1}^{n} (2^{r(i)} - 1) / log(i + 1),

where i is the position in the document list, r(i) is the score of document i, and Z_n is a normalization factor, which is used to make the NDCG of the ideal list be 1.
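A minimal sketch of this metric, matching the definition above: the function names are my own, the natural logarithm is used because the definition does not specify a base, and the example grade values are made up.

```python
import math

def dcg(scores, n):
    """scores: relevance scores r(i) in ranked order (position 1 first)."""
    return sum((2 ** r - 1) / math.log(i + 1) for i, r in enumerate(scores[:n], start=1))

def ndcg(scores, n):
    """NDCG_n = Z_n * DCG_n, where Z_n normalizes by the DCG of the ideal ordering."""
    ideal = dcg(sorted(scores, reverse=True), n)
    return dcg(scores, n) / ideal if ideal > 0 else 0.0

# Example: grades for five ranked documents (perfect=4 ... bad=0, an assumed mapping).
print(ndcg([3, 4, 2, 0, 1], n=5))
```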

4.3 Results

Tables 5 and 6 show the NDCG_5 results using the heuristic rule approach and the SSL approach, respectively. We do not present NDCG_1 results due to space limitations, but the NDCG_1 results show similar trends to NDCG_5.

Table 5: Use of click data by the heuristic rule approach (Data selection: N: not use; D: use dedicated data; G: use generic data. Data source: T: training data; C: click data). Relative NDCG_5 improvements over the NC+GT baseline are shown in parentheses.
(a) skip-next pairs
      NT    DT        GT
NC    n/a
DC          (1.2%)    (2.4%)
GC          (1.2%)    (1.7%)
(b) skip-above pairs
      NT    DT        GT
NC    n/a
DC          (-1.6%)   (-0.8%)
GC          (-2.0%)   (2.2%)

Table 6: Use of click data by the SSL approach (Data selection: N: not use; D: use dedicated data; G: use generic data. Data source: T: training data; C: click data). Relative NDCG_5 improvements over the NC+GT baseline are shown in parentheses.
(a) skip-next pairs
      NT    DT        GT
NC    n/a
DC          (1.5%)    (1.5%)
GC          (0.4%)    (1.2%)
(b) skip-above pairs
      NT    DT        GT
NC    n/a
DC          (-2.2%)   (-0.3%)
GC          (-1.2%)   (-0.5%)

Figure 3: Incorporating different amounts of skip-next pairs by the heuristic rule approach with generic training data (NDCG_5 vs. number of click pairs ×10^4, for dedicated train + dedicated click, generic train + dedicated click, dedicated click only, and generic click only).

Figure 4: The effects of using different combining weights (NDCG_5 vs. training sample weight, for 5,000, 10,000 and 30,000 dedicated click pairs). Skip-next pairs by the heuristic rule approach are combined with generic training data.

Baseline by training data: there are two baseline functions trained on the labeled data: 1) using only dedicated training data (DT); 2) using generic training data (GT). It is reasonable that using generic training data is better than only using dedicated training data, because the distributions of non-dedicated data and dedicated data share some similarity. As the dedicated training data is insufficient, the adoption of the extra non-dedicated data helps the learning. We compare learning results with Baseline 2) (using generic training data, the slot NC+GT in the tables), which is the higher baseline.

Baseline by click data: we then study the utility of click preference pairs by using them alone for training, without using labeled training data. In Tables 5 and 6, each of the NDCG_5 results using click preference pairs is the highest NDCG_5 value over the cases of using different amounts of pairs (5,000, 10,000, 30,000, 50,000, 70,000 and 100,000 pairs). The results with respect to the amounts of pairs are illustrated in Figure 3, which will help us to analyze the results more deeply.

If we only use click preference pairs for training (the two table slots DC+NT and GC+NT, corresponding to using dedicated click preference pairs and generic click pairs respectively), the best case is using skip-next pairs extracted by the heuristic rule approach (Table 5 (a)). It is not surprising that skip-next pairs outperform skip-above pairs, because there is a significantly lower percentage of incorrect pairs in skip-next pairs compared with skip-above pairs. It is a little surprising that the case of DC+NT has no dominant advantage over GC+NT, contrary to what we expected. For example, in Table 5 (a), the NDCG_5 values of the two cases are very close to each other. However, in Figure 3, we find that with the same amount of pairs, when we use 30,000 or fewer pairs, using dedicated click pairs alone is always better than using generic click pairs alone. With more click pairs being used (> 30,000), the noise rates become higher in the pairs, which makes the distribution factor less important.

Combining training data and click data: we compare the four table slots, DC+DT, GC+DT, DC+GT and GC+GT, in Tables 5 and 6, and there are quite a few interesting observations:

1) Skip-next vs. skip-above: overall, incorporating skip-next pairs with training data is better than incorporating skip-above pairs, because there are more incorrect pairs among skip-above pairs, which may mislead the learning. The only exception is the slot GC+GT in Table 5 (b), whose NDCG_5 improvement is as high as 2.2%. We further track this result, and find that it is the case of using only 5,000 generic skip-above pairs. The noise rate of these 5,000 pairs is low because they have the highest pair extraction confidence values. At the same time, these 5,000 pairs may provide good complementary signals to the generic training data, so that the learning result is good. However, in general, skip-next pairs have better utility than skip-above pairs.

2) Dedicated training data vs. generic training data: using generic training data is generally better than only using dedicated training data. If training data is insufficient, the extra non-dedicated data provides useful information for relevance pattern learning, and the distribution dissimilarity between dedicated data and non-dedicated data is not the most important factor.

3) Dedicated click data vs. generic click data: using dedicated click data is more effective than using generic click data. From Figure 3, we observe that when 30,000 or fewer pairs are incorporated into the training data, using dedicated click pairs is always better than using generic click pairs.

Figure 5: The effects of using different margin values τ for click preference pairs (NDCG_5 vs. τ, for 5,000, 10,000 and 30,000 dedicated click pairs). Skip-next pairs by the heuristic rule approach are incorporated with generic training data.

4) Heuristic rule approach vs. SSL approach: the preference pairs extracted by the heuristic rule approach have better utility than those extracted by the SSL approach.

5) GBrank parameters for combining training data and click pairs: the relative weight w for combining training data and click pairs in (2) may also affect rank function learning. Figure 4 shows the effects of using different combining weights, for which skip-next pairs by the heuristic rule approach are combined with generic training data. We observe that neither over-weighting the training data nor over-weighting the click pairs yields good results, while the two data sources are best exploited at certain weight values where there is a good balance between them. Another concern is the appropriate margin value τ for the click pairs in (2).
Figure 5 shows that τ = 1 consistently yields good learning results, which suggests that a click pair provides good information at τ = 1.

4.4 Discussions

We have analyzed the factors in the related approaches for the task of exploiting click data to improve query-type-based rank learning. The utility of click preference pairs depends on the following factors:

1) Data distribution: if the click pairs have good quality, we should use dedicated click pairs instead of generic click pairs, so that the samples for training have a distribution similar to the task of query-type-based ranking.

2) The amount of dedicated training data: the more dedicated training data, the more reliable the query-type-based rank function is; thus, the less room there is for learning improvement using click data. For the case in the experiment where dedicated training data is insufficient, the non-dedicated training data can also help the learning, as non-dedicated training data shares relevance pattern similarity with the dedicated data distribution.

3) The quality of click pairs: if we can extract a large amount of high-quality click pairs, the learning improvement will be significant. For example, as shown in Figure 3, at the early stage with fewer click pairs (5,000 and 10,000 pairs) being combined with training data, the learning improvement is best. As more click pairs are used, the noise rate in the click pairs becomes higher, so that the learning-misleading factor becomes more important than the information-complementarity factor. Thus, it is important to improve the reliability of the click pairs.

4) The utility of click pairs: by our study, the quality of click pairs extracted by the SSL approach is comparable to that of pairs extracted by the heuristic rule approach. The possible reason that heuristic-rule-based click pairs can bring more benefit is that these pairs provide more complementary information compared with the SSL approach. As the methodologies of these two click data extraction approaches are totally different, in the future we will explore the concrete reason that causes such a utility difference.

5 Conclusions

By empirically exploring the related factors in utilizing click-through data to improve dedicated model learning for query-type-based ranking, we have better understood the principles of using click preference pairs appropriately, which is important for real-world applications in commercial search engines, as using click data can significantly save human labeling costs and make rank function learning more efficient. In the case that dedicated training data is limited, while non-dedicated training data is helpful, using dedicated skip-next pairs is the most effective way to further improve the learning. The heuristic rule approach provides more useful click pairs compared with the sequential supervised learning approach. The quality of click pairs is critical for the efficacy of the approach. Therefore, an interesting topic is how to further reduce the inconsistency between skip-above pairs and human labeling, so that such data may also be useful for query-type-based ranking.

References

E. Agichtein, E. Brill, and S. Dumais. 2006. Improving web search ranking by incorporating user behavior information. Proc. of ACM SIGIR Conference.
S. M. Beitzel, E. C. Jensen, A. Chowdhury, and O. Frieder. 2007. Varying approaches to topical web query classification. Proceedings of ACM SIGIR Conference.
S. Brin and L. Page. 1998. The anatomy of a large-scale hypertextual web search engine. Proceedings of International Conference on World Wide Web.
C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. 2005. Learning to rank using gradient descent. Proc. of Intl. Conf. on Machine Learning.
Z. Cao, T. Qin, T. Liu, M. Tsai, and H. Li. 2007. Learning to rank: from pairwise approach to listwise. Proceedings of ICML Conference.
B. Carterette, P. N. Bennett, D. M. Chickering, and S. T. Dumais. 2008. Here or there: preference judgments for relevance. Proc. of ECIR.
O. Chapelle and Y. Zhang. 2009. A dynamic bayesian network click model for web search ranking. Proceedings of the 18th International World Wide Web Conference.
K. Chen, Y. Zhang, Z. Zheng, H. Zha, and G. Sun. 2008. Adapting ranking functions to user preference. ICDE Workshops.
M. Coyle and B. Smyth. 2007. Supporting intelligent web search. ACM Transactions on Internet Technology, 7(4).
T. G. Dietterich. 2002. Machine learning for sequential data: a review. Lecture Notes in Computer Science, (2396).
S. Fox, K. Karnawat, M. Mydland, S. Dumais, and T. White. 2005. Evaluating implicit measures to improve web search. ACM Trans. on Information Systems, 23(2).
Y. Freund, R. D. Iyer, R. E. Schapire, and Y. Singer. 1998. An efficient boosting algorithm for combining preferences. Proceedings of International Conference on Machine Learning.
J. Friedman. 2001. Greedy function approximation: a gradient boosting machine. Ann. Statist., 29.
X. Geng, T. Liu, T. Qin, A. Arnold, H. Li, and H. Shum. 2008. Query dependent ranking with k nearest neighbor. Proceedings of ACM SIGIR Conference.
N. S. Glance. 2001. Community search assistant. Intelligent User Interfaces.

S. Ji, K. Zhou, C. Liao, Z. Zheng, G. Xue, O. Chapelle, G. Sun, and H. Zha. 2009. Global ranking by exploiting user clicks. In SIGIR '09, Boston, USA, July 2009.
T. Joachims, L. Granka, B. Pan, and G. Gay. 2005. Accurately interpreting clickthrough data as implicit feedback. Proc. of ACM SIGIR Conference.
T. Joachims. 2002. Optimizing search engines using clickthrough data. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD).
I. Kang and G. Kim. 2003. Query type classification for web document retrieval. Proceedings of ACM SIGIR Conference.
J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In ICML.
U. Lee, Z. Liu, and J. Cho. 2005. Automatic identification of user goals in web search. Proceedings of International Conference on World Wide Web.
X. Li, Y. Y. Wang, and A. Acero. 2008. Learning query intent from regularized click graphs. Proceedings of ACM SIGIR Conference.
T. Y. Liu. 2008. Learning to rank for information retrieval. SIGIR Tutorial.
T. Qin, T. Liu, X. Zhang, D. Wang, and H. Li. 2008. Global ranking using continuous conditional random fields. In NIPS.
R. Jin, H. Valizadegan, and H. Li. 2008. Ranking refinement and its application to information retrieval. Proceedings of International Conference on World Wide Web.
F. Radlinski and T. Joachims. 2007. Active exploration for learning rankings from clickthrough data. Proc. of ACM SIGKDD Conference.
F. Radlinski, M. Kurup, and T. Joachims. 2008. How does clickthrough data reflect retrieval quality? Proceedings of ACM CIKM Conference.
D. E. Rose and D. Levinson. 2004. Understanding user goals in web search. Proceedings of International Conference on World Wide Web.
B. Smyth, E. Balfe, P. Briggs, M. Coyle, and J. Freyne. 2003. Collaborative web search. Proceedings of IJCAI Conference.
X. Wang and C. Zhai. 2007. Learn from web search logs to organize search results. In Proceedings of the 30th ACM SIGIR.
H. Zha, Z. Zheng, H. Fu, and G. Sun. 2006. Incorporating query difference for learning retrieval functions in world wide web search. Proceedings of the 15th ACM Conference on Information and Knowledge Management.
Z. Zheng, K. Chen, G. Sun, and H. Zha. 2007. A regression framework for learning ranking functions using relative relevance judgments. Proc. of ACM SIGIR Conference.


A mathematical programming approach to the analysis, design and scheduling of offshore oilfields 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 A mathematcal programmng approach to the analyss, desgn and

More information

Machine Learning. Topic 6: Clustering

Machine Learning. Topic 6: Clustering Machne Learnng Topc 6: lusterng lusterng Groupng data nto (hopefully useful) sets. Thngs on the left Thngs on the rght Applcatons of lusterng Hypothess Generaton lusters mght suggest natural groups. Hypothess

More information

The Effect of Similarity Measures on The Quality of Query Clusters

The Effect of Similarity Measures on The Quality of Query Clusters The effect of smlarty measures on the qualty of query clusters. Fu. L., Goh, D.H., Foo, S., & Na, J.C. (2004). Journal of Informaton Scence, 30(5) 396-407 The Effect of Smlarty Measures on The Qualty of

More information

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15 CS434a/541a: Pattern Recognton Prof. Olga Veksler Lecture 15 Today New Topc: Unsupervsed Learnng Supervsed vs. unsupervsed learnng Unsupervsed learnng Net Tme: parametrc unsupervsed learnng Today: nonparametrc

More information

Load Balancing for Hex-Cell Interconnection Network

Load Balancing for Hex-Cell Interconnection Network Int. J. Communcatons, Network and System Scences,,, - Publshed Onlne Aprl n ScRes. http://www.scrp.org/journal/jcns http://dx.do.org/./jcns.. Load Balancng for Hex-Cell Interconnecton Network Saher Manaseer,

More information

Learning-Based Top-N Selection Query Evaluation over Relational Databases

Learning-Based Top-N Selection Query Evaluation over Relational Databases Learnng-Based Top-N Selecton Query Evaluaton over Relatonal Databases Lang Zhu *, Wey Meng ** * School of Mathematcs and Computer Scence, Hebe Unversty, Baodng, Hebe 071002, Chna, zhu@mal.hbu.edu.cn **

More information

Pruning Training Corpus to Speedup Text Classification 1

Pruning Training Corpus to Speedup Text Classification 1 Prunng Tranng Corpus to Speedup Text Classfcaton Jhong Guan and Shugeng Zhou School of Computer Scence, Wuhan Unversty, Wuhan, 430079, Chna hguan@wtusm.edu.cn State Key Lab of Software Engneerng, Wuhan

More information

Description of NTU Approach to NTCIR3 Multilingual Information Retrieval

Description of NTU Approach to NTCIR3 Multilingual Information Retrieval Proceedngs of the Thrd NTCIR Workshop Descrpton of NTU Approach to NTCIR3 Multlngual Informaton Retreval Wen-Cheng Ln and Hsn-Hs Chen Department of Computer Scence and Informaton Engneerng Natonal Tawan

More information

Available online at Available online at Advanced in Control Engineering and Information Science

Available online at   Available online at   Advanced in Control Engineering and Information Science Avalable onlne at wwwscencedrectcom Avalable onlne at wwwscencedrectcom Proceda Proceda Engneerng Engneerng 00 (2011) 15000 000 (2011) 1642 1646 Proceda Engneerng wwwelsevercom/locate/proceda Advanced

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

BioTechnology. An Indian Journal FULL PAPER. Trade Science Inc.

BioTechnology. An Indian Journal FULL PAPER. Trade Science Inc. [Type text] [Type text] [Type text] ISSN : 0974-74 Volume 0 Issue BoTechnology 04 An Indan Journal FULL PAPER BTAIJ 0() 04 [684-689] Revew on Chna s sports ndustry fnancng market based on market -orented

More information

LinkSelector: A Web Mining Approach to. Hyperlink Selection for Web Portals

LinkSelector: A Web Mining Approach to. Hyperlink Selection for Web Portals nkselector: A Web Mnng Approach to Hyperlnk Selecton for Web Portals Xao Fang and Olva R. u Sheng Department of Management Informaton Systems Unversty of Arzona, AZ 8572 {xfang,sheng}@bpa.arzona.edu Submtted

More information

Learning Semantics-Preserving Distance Metrics for Clustering Graphical Data

Learning Semantics-Preserving Distance Metrics for Clustering Graphical Data Learnng Semantcs-Preservng Dstance Metrcs for Clusterng Graphcal Data Aparna S. Varde, Elke A. Rundenstener Carolna Ruz Mohammed Manruzzaman,3 Rchard D. Ssson Jr.,3 Department of Computer Scence Center

More information

CHAPTER 2 DECOMPOSITION OF GRAPHS

CHAPTER 2 DECOMPOSITION OF GRAPHS CHAPTER DECOMPOSITION OF GRAPHS. INTRODUCTION A graph H s called a Supersubdvson of a graph G f H s obtaned from G by replacng every edge uv of G by a bpartte graph,m (m may vary for each edge by dentfyng

More information

Circuit Analysis I (ENGR 2405) Chapter 3 Method of Analysis Nodal(KCL) and Mesh(KVL)

Circuit Analysis I (ENGR 2405) Chapter 3 Method of Analysis Nodal(KCL) and Mesh(KVL) Crcut Analyss I (ENG 405) Chapter Method of Analyss Nodal(KCL) and Mesh(KVL) Nodal Analyss If nstead of focusng on the oltages of the crcut elements, one looks at the oltages at the nodes of the crcut,

More information

Object-Based Techniques for Image Retrieval

Object-Based Techniques for Image Retrieval 54 Zhang, Gao, & Luo Chapter VII Object-Based Technques for Image Retreval Y. J. Zhang, Tsnghua Unversty, Chna Y. Y. Gao, Tsnghua Unversty, Chna Y. Luo, Tsnghua Unversty, Chna ABSTRACT To overcome the

More information

A Statistical Model Selection Strategy Applied to Neural Networks

A Statistical Model Selection Strategy Applied to Neural Networks A Statstcal Model Selecton Strategy Appled to Neural Networks Joaquín Pzarro Elsa Guerrero Pedro L. Galndo joaqun.pzarro@uca.es elsa.guerrero@uca.es pedro.galndo@uca.es Dpto Lenguajes y Sstemas Informátcos

More information

11. HARMS How To: CSV Import

11. HARMS How To: CSV Import and Rsk System 11. How To: CSV Import Preparng the spreadsheet for CSV Import Refer to the spreadsheet template to ad algnng spreadsheet columns wth Data Felds. The spreadsheet s shown n the Appendx, an

More information

Classification of Face Images Based on Gender using Dimensionality Reduction Techniques and SVM

Classification of Face Images Based on Gender using Dimensionality Reduction Techniques and SVM Classfcaton of Face Images Based on Gender usng Dmensonalty Reducton Technques and SVM Fahm Mannan 260 266 294 School of Computer Scence McGll Unversty Abstract Ths report presents gender classfcaton based

More information

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK L-qng Qu, Yong-quan Lang 2, Jng-Chen 3, 2 College of Informaton Scence and Technology, Shandong Unversty of Scence and Technology,

More information

Enhancement of Infrequent Purchased Product Recommendation Using Data Mining Techniques

Enhancement of Infrequent Purchased Product Recommendation Using Data Mining Techniques Enhancement of Infrequent Purchased Product Recommendaton Usng Data Mnng Technques Noraswalza Abdullah, Yue Xu, Shlomo Geva, and Mark Loo Dscplne of Computer Scence Faculty of Scence and Technology Queensland

More information

Signature and Lexicon Pruning Techniques

Signature and Lexicon Pruning Techniques Sgnature and Lexcon Prunng Technques Srnvas Palla, Hansheng Le, Venu Govndaraju Centre for Unfed Bometrcs and Sensors Unversty at Buffalo {spalla2, hle, govnd}@cedar.buffalo.edu Abstract Handwrtten word

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique //00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy

More information

Application of Maximum Entropy Markov Models on the Protein Secondary Structure Predictions

Application of Maximum Entropy Markov Models on the Protein Secondary Structure Predictions Applcaton of Maxmum Entropy Markov Models on the Proten Secondary Structure Predctons Yohan Km Department of Chemstry and Bochemstry Unversty of Calforna, San Dego La Jolla, CA 92093 ykm@ucsd.edu Abstract

More information

Audio Content Classification Method Research Based on Two-step Strategy

Audio Content Classification Method Research Based on Two-step Strategy (IJACSA) Internatonal Journal of Advanced Computer Scence and Applcatons, Audo Content Classfcaton Method Research Based on Two-step Strategy Sume Lang Department of Computer Scence and Technology Chongqng

More information

Backpropagation: In Search of Performance Parameters

Backpropagation: In Search of Performance Parameters Bacpropagaton: In Search of Performance Parameters ANIL KUMAR ENUMULAPALLY, LINGGUO BU, and KHOSROW KAIKHAH, Ph.D. Computer Scence Department Texas State Unversty-San Marcos San Marcos, TX-78666 USA ae049@txstate.edu,

More information

A Compositional Exemplar-Based Model for Hair Segmentation

A Compositional Exemplar-Based Model for Hair Segmentation A Compostonal Exemplar-Based Model for Har Segmentaton Nan Wang 1, Hazhou A 1, and Shhong Lao 2 1 Computer Scence & Technology Department, Tsnghua Unversty, Bejng, Chna ahz@mal.tsnghua.edu.cn 2 Core Technology

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

High-Boost Mesh Filtering for 3-D Shape Enhancement

High-Boost Mesh Filtering for 3-D Shape Enhancement Hgh-Boost Mesh Flterng for 3-D Shape Enhancement Hrokazu Yagou Λ Alexander Belyaev y Damng We z Λ y z ; ; Shape Modelng Laboratory, Unversty of Azu, Azu-Wakamatsu 965-8580 Japan y Computer Graphcs Group,

More information

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms Course Introducton Course Topcs Exams, abs, Proects A quc loo at a few algorthms 1 Advanced Data Structures and Algorthms Descrpton: We are gong to dscuss algorthm complexty analyss, algorthm desgn technques

More information

Machine Learning: Algorithms and Applications

Machine Learning: Algorithms and Applications 14/05/1 Machne Learnng: Algorthms and Applcatons Florano Zn Free Unversty of Bozen-Bolzano Faculty of Computer Scence Academc Year 011-01 Lecture 10: 14 May 01 Unsupervsed Learnng cont Sldes courtesy of

More information

Behavioral Model Extraction of Search Engines Used in an Intelligent Meta Search Engine

Behavioral Model Extraction of Search Engines Used in an Intelligent Meta Search Engine Behavoral Model Extracton of Search Engnes Used n an Intellgent Meta Search Engne AVEH AVOUSI Computer Department, Azad Unversty, Garmsar Branch BEHZAD MOSHIRI Electrcal and Computer department, Faculty

More information