Generalized Team Draft Interleaving


Eugene Kharitonov 1,2, Craig Macdonald 2, Pavel Serdyukov 1, Iadh Ounis 2
1 Yandex, Russia; 2 University of Glasgow, UK
1 {kharitonov, pavser}@yandex-team.ru, 2 {craig.macdonald, iadh.ounis}@glasgow.ac.uk

ABSTRACT
Interleaving is an online evaluation method that compares two ranking functions by mixing their results and interpreting the users' click feedback. An important property of an interleaving method is its sensitivity, i.e. the ability to obtain reliable comparison outcomes with few user interactions. Several methods have been proposed so far to improve interleaving sensitivity, which can be roughly divided into two areas: (a) methods that optimize the credit assignment function (how the click feedback is interpreted), and (b) methods that achieve higher sensitivity by controlling the interleaving policy (how often a particular interleaved result page is shown). In this paper, we propose an interleaving framework that generalizes the previously studied interleaving methods in two aspects. First, it achieves a higher sensitivity by performing a joint data-driven optimization of the credit assignment function and the interleaving policy. Second, we formulate the framework to be general w.r.t. the search domain where the interleaving experiment is deployed, so that it can be applied in domains with grid-based presentation, such as image search. In order to simplify the optimization, we additionally introduce a stratified estimate of the experiment outcome. This stratification is also useful on its own, as it reduces the variance of the outcome and thus increases the interleaving sensitivity. We perform an extensive experimental study using large-scale document and image search datasets obtained from a commercial search engine. The experiments show that our proposed framework achieves marked improvements in sensitivity over effective baselines on both datasets.

Categories and Subject Descriptors: H.3.3 [Information Storage & Retrieval]: Information Search & Retrieval
Keywords: interleaving; online evaluation.
1. INTRODUCTION
Online evaluation approaches, such as A/B testing and interleaving, are crucial tools in modern search engine evaluation [2, 5, 7, 11, 12, 13]. These approaches leverage the implicit feedback of the real users to evaluate changes to the search engine and can be applied even when the offline evaluation approaches might be impractical [9]. Since the online evaluation approaches rely on the noisy user feedback, a considerable number of observations is needed before a statistically significant conclusion can be made [2, 12]. Usually, an A/B testing experiment is deployed for a week or two [12]. A typical length of an interleaving experiment used in the literature is up to five days [2]. Such a long duration of the online experiments considerably limits their usefulness, and bounds the rate of the search engine's evolution. Another concern is that a considerable fraction of the changes evaluated online turn out to actually degrade the user's search experience [12]. When evaluated in an online experiment during a period of a week, such a change increases the users' frustration with the results and might even force them to switch to another search engine. These observations support the need to increase the speed of online evaluation experiments. When comparing two web document search ranking functions, interleaving is faster to obtain the comparison outcome than an A/B test [2, Section 7].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org. CIKM'15, October 19-23, 2015, Melbourne, Australia. (c) 2015 ACM. ISBN 978-1-4503-3794-6/15/10 ...$15.00. DOI: http://dx.doi.org/10.1145/2806416.2806477
A variety of methods were proposed to further reduce the duration of interleaving experiments by improving the interleaving sensitivity. Roughly, this research can be divided into two areas: optimization of the credit assignment [2, 4, 20]; and optimization of the probability of showing of the interleaved result pages (the interleaving policy) [10, 15]. In both areas, only document search has been studied so far. Furthermore, the current research in the interleaving policy optimization explicitly relies on a user model that is specific to the list-based representation. In this paper we propose an interleaving framework that generalizes the existing research in two aspects. First, we consider both the interleaving policy and the credit assignment function as optimized parameters in our framework. As a result, our framework has a higher flexibility that can be used to achieve a higher sensitivity. Second, we formulate our framework to be general w.r.t. the actual presentation of the result pages, so that it can be applied in domains such as image search, where grid-based presentation is used. In order to simplify the parameter optimization procedure, we propose to use a stratified estimate of the interleaving experiment outcome, where the stratification is performed according to the teams of the results on the result pages shown. We demonstrate that our proposed stratification approach is also useful on its own, as in some cases it considerably increases the interleaving sensitivity. Overall, the contributions of our work are three-fold:

- We propose a principled, data-driven framework to develop sensitive interleaving that combines the stratification, the interleaving policy optimization, and the credit function learning in a single framework that can be applied in domains with the list-based and the grid-based result presentations;
- We propose sufficient conditions that the click feature representation and the interleaving policy need to satisfy so that the resulting interleaving method remains unbiased;
- We perform a large-scale evaluation study of the proposed framework, using two datasets that contain document and image search online experiments.

The remainder of this paper is organised as follows. In Section 2 we discuss the related work. In Section 3 we define our interleaving framework and discuss its details in Section 4. Our proposed stratification technique, and how to optimize the interleaving parameters, are discussed in Sections 5 and 6, respectively. In Section 8 we describe the instantiations of our framework for the web document search and for the image search domains. The datasets and the evaluation scenario we use are described in Sections 7 and 9, respectively. We discuss our obtained results in Section 10. We conclude this paper and discuss future work in Section 11.

2. RELATED WORK
Since the introduction of the first interleaving method, Balanced Interleaving [7, 8], several other interleaving methods were proposed, including Team Draft [16], Probabilistic Interleaving [6], and Optimized Interleaving [15]. An important characteristic of an interleaving method is its sensitivity, i.e. the ability to obtain a reliable experiment outcome with as few user interactions as possible. The problem of increasing the sensitivity of an interleaving method has attracted considerable attention from the research community, and below we review the most relevant work in this area. Yue et al. [20] proposed a method to learn a more sensitive credit assignment function for the Team Draft interleaving experiments. Later, this approach was also discussed by Chapelle et al. [2]. Informally, the core idea of [20] is to learn how to weight user clicks in the interleaving comparisons so that the confidence in the already performed experiments is maximized.
As a result, new interleaving experiments will achieve the required level of confidence in their outcomes with fewer user interactions, i.e. the interleaving method will have a higher sensitivity. Yue et al. refer to this learning problem as an inverse hypothesis test: given user interaction data for the comparisons with known outcomes, one learns a credit assignment function that maximizes the power of the test statistic in these comparisons. Our work is based on the ideas of Yue et al. [20], and aims to overcome some of the shortcomings of their approach. First, it is not straightforwardly clear what kinds of features and weighting functions are allowed so that no biases are introduced when learning the credit assignment function. It is possible to build an example of a click feature representation that makes the credit function learning process prone to biases (Section 4). In our work, we propose a formal unbiasedness requirement that ensures that a feature-based credit assignment function is not biased. Moreover, we propose a restricted family of click features that makes this requirement easy to operate in practice. In their work, Yue et al. assume that the interleaving policy (the probabilities of showing the different interleaved result pages) is fixed. We propose to optimize both the interleaving policy and the credit assignment function jointly, and this results in a higher interleaving sensitivity. Radlinski and Craswell [15] proposed the Optimized Interleaving framework, which specifies a set of requirements that an interleaving method has to meet so that (a) its results are not biased, (b) it is sensitive, and (c) the users are not too frustrated by the interleaved result pages. Our framework is based on Optimized Interleaving, but it also has significant differences from it. First, Optimized Interleaving is formulated specifically with a particular web document search user model in mind, and the interleaving policy optimization it performs is formulated with respect to a click model that is specific to the list-based result presentation.
This hinders extending the interleaving approaches to other domains. In contrast, we propose a generalized unbiasedness requirement that can be applied to grid-based result pages. Second, we perform a joint data-driven optimization of the interleaving policy and the credit assignment function. In contrast, in Optimized Interleaving, the credit assignment function is fixed, and the interleaving policy is optimized with respect to a randomly clicking user, in a data-free manner. Moreover, the interleaving policy optimization considered by Radlinski and Craswell is performed on a per-query basis. This makes the evaluation runtime system more sophisticated, as this optimization must be performed each time a long-tail query is submitted. Finally, due to using a large set of possible interleaved result pages that is different from the result pages generated by Team Draft and Balanced Interleaving, it is hard to perform a representative evaluation study of this method without actually implementing it and deploying a large set of real-life experiments. Unlike Optimized Interleaving, our proposed framework has the same interleaving policy for all queries and experiments, which is fixed once learned; thus it imposes little implementation cost over the standard Team Draft interleaving. Further, its performance can be evaluated using a set of historical experiments, available to each search engine that uses Team Draft-based interleaving experiments. While studying the different problem of document search multileaving, Schuth et al. [18] used the variance of the outcome as a proxy for the interleaving sensitivity. This approach can be considered as a hybrid between the approaches in [20] and [15]: the optimization is performed w.r.t. a randomly clicking user, as in [15]; the optimization objective is close to the z-score used in [20], as the latter favours a lower variance, too. We believe this approach can be suboptimal in comparison to the data-driven optimization used by Yue et al. [20] and our framework.
Indeed, in the case of a sufficiently large dataset, the parameters can be optimized based on the real-life data, without relying on a model of a randomly clicking user. Kharitonov et al. [10] addressed the problem of improving the interleaving sensitivity by predicting the future user clicks using the historical click data. Their approach explicitly relies on the document search click models, thus it is hard to generalize to other domains. Moreover, [10] requires the search engine to store per-query click models in runtime, and to use them when interleaving the result pages. It is not clear how practical this requirement can be in the presence of the long-tail queries that form a considerable part of the query stream. Further, as it can be applied only for the head queries with click data available, it is not clear if any additional biases can occur due to the increased sensitivity to the changes in the top queries. In contrast, our framework does not require significant changes in the search engine's experimentation infrastructure and can be applied for domains with the grid-based presentation of the results.

In their work [3], Chuklin et al. proposed an interleaving method that goes beyond the classic ten blue links web document presentation and deals with the vertical results (e.g. News, Images, Finance) incorporated in the main web search result page. However, the challenges that Chuklin et al. address (e.g., ensuring that in the interleaved result page vertical results are still grouped) are quite different from the problems faced when developing an interleaving mechanism for a new domain. In the latter case, one needs to decide how to specify the credit assignment function, how to select the interleaving policy, etc. To the best of our knowledge, our work is the first to address the problem of interleaving in a domain with the grid-based result presentation. One of the approaches to improve the interleaving sensitivity we discuss is stratification, a simple yet effective technique that has its roots in the Monte-Carlo stratified sampling methods [1, 17]. Previously, its application to online A/B tests was studied by Deng et al. [4], but it was never considered in the context of interleaving. Overall, our framework finds a solid foundation in the research discussed above, but it also addresses several shortcomings of the earlier approaches. In the next section we formally introduce it.

3. FRAMEWORK DEFINITION
First, we informally outline how interleaving experiments are performed. Suppose that we need to compare a changed system B to the production system A using an interleaving experiment. To do that, a random subset of the users is selected to take part in the experiment. When a query is submitted, both results from A and B are retrieved. Further, the interleaving policy is used to determine which of the possible mixed (interleaved) result pages to show to the user. Next, the user's clicks on the interleaved result page are observed, and the credit assignment function is used to infer the credits of the alternatives. After the experiment is stopped, the aggregated credits of the alternatives are compared.
If B has a statistically significantly higher credit, it is accepted that B outperformed A. The works of Yue et al. [20] and Radlinski and Craswell [15] lay the foundation for our framework. However, our framework has significant differences from [20] and [15]. Specifically, our proposed framework performs a joint optimization of the interleaving policy and the credit assignment function, while Yue et al. and Radlinski and Craswell optimize only one of these parameters. Further, our framework can be applied for search domains with grid-based result pages. Below, we provide a formal requirement that a feature-based credit assignment function, the click feature representation, and the interleaving policy have to meet for the interleaving to be unbiased. In contrast, Yue et al. do not discuss possible biases that can emerge due to feature-based learning, and Radlinski and Craswell only discuss simple, feature-less credit assignment rules. By addressing the above discussed gaps, we build a sensitive interleaving framework that generalizes the approaches proposed by Yue et al. [20] and Radlinski and Craswell [15]. In our framework, we consider the result pages that are obtained by applying the Team Draft mixing algorithm [16] to the lists of the results of the underlying rankers A and B, sorted according to their relevance. The exact mapping of the sorted result list into a result page is domain-specific (list-based for document search, or grid-based for image search). Assuming that under this mapping the results ranked higher in the ranked list are mapped into positions with higher examination probability, mixing the sorted result lists of the rankers A and B according to Team Draft will result in a result page that cannot be more frustrating for the users than both of the result pages generated from the outputs of A and B. Due to this assumption we avoid the necessity of specifying the mixing algorithm for each possible domain-specific presentation, and can work with the underlying ranker output, which is always list-wise in practice.
Apart from that, relying on the Team Draft-based result pages allows us to re-use a large-scale dataset of the experiments collected by a search engine for our evaluation study (Section 10). The Team Draft mixing algorithm builds the interleaved result list in steps. At each step, both teams contribute one result each to the combined list. Each team contributes the result that it ranks highest among those that are not yet in the combined list. However, the team that contributes first at each step is decided by a coin toss. For instance, as there are usually 10 results on a document search result page, 5 coin tosses are required to build it. Thus, there are exactly 2^5 = 32 different distributions of the result teams on a result page.¹ Now we can define the first component of our framework:

F1. The set {(L_i, T_i)}_{i=1}^{l} of the pairs of the interleaved result pages L_i ∈ L and their corresponding distributions of the result teams T_i ∈ T. The result pages L_i are obtained by applying the Team Draft [16] mixing algorithm to the sorted outputs of the rankers A and B, and a further domain-specific presentation of the ranked list. We define T_i(p) to be equal to 1 (-1) if the team of the result on position p of the interleaved list that produced L_i is B (A).

It is possible that some pairs (L_i, T_i) contain identical result pages L_i, despite that the team distributions T_i associated with them are different (e.g. if A and B produce identical result lists). We consider such pairs to be different. Further, following [15], we explicitly define the interleaving policy as a parameter of the framework:

F2. An interleaving policy π, π ∈ R^l, determines the probability of using a particular team distribution when building an interleaved result page: π_i = P(T_i).

Under our framework, the interleaving policy is the same for all queries and interleaving experiments. Informally, it can be considered as a distribution over the random seeds that can be used to initialize the coin used in Team Draft. From [20] we adopt the feature representation of the user's click φ(·) and the form of the credit assignment function S:

F3. A function φ(·) that maps a user click c on an interleaved result page to its feature representation φ(c) ∈ R^n. We also define an auxiliary indicator T(c) that equates to 1 (-1) if the team of the clicked result is B (A).

F4. A scoring rule S = S(q; w) = Σ_{c∈q} T(c) · w^T φ(c) that maps a sequence of clicks in the interaction q to the score of the alternative B. The vector w is a parameter, w ∈ R^n.

After running an experiment e, the score statistic Δ(e) can be calculated:

    Δ(e) = (1/|Q|) Σ_{q∈Q} S(q; w)    (1)

where Q is the set of the user interactions in the experiment e. If Δ(e) is statistically significantly above zero, it is concluded that B outperforms A in the experiment e. To ensure that the interleaving is unbiased, Radlinski and Craswell [15] suggested the following criterion for the document search scenario: a randomly clicking user should not create any preference between A and B. To formalize this idea, they considered a user who (a) samples the number of the considered top results k randomly and (b) clicks uniformly at random on η results from the top-k results. This formulation explicitly relies on a list-based presentation. Furthermore, in our case the formalization is even more challenging as the credit S(q; w) is a function itself, since some feature representations might be prone to biases (we discuss this further in Section 4). We propose the following generalization of the unbiasedness criterion from [15]:

R1. For any fixed sequence of clicks, the expectation of the total credit over all pairs (L_i, T_i) of the interleaved pages L_i and distributions of teams T_i should be zero. Denoting the length of the sequence as J, the positions clicked as p_1, p_2, ..., p_J, and their corresponding click features as φ_1, φ_2, ..., φ_J, we formalize this requirement as follows:

    ∀J, {(p_j, φ_j)}_{j=1}^{J}:  Σ_i π_i Σ_{j=1}^{J} T_i(p_j) · w^T φ_j = 0

Due to the linearity of the expectation, R1 is sufficient to guarantee the absence of preferences for any randomized combination of click sequences, too. Informally, this guarantees that a user who specifies an arbitrary interaction scenario that does not depend on the presented documents (e.g., click on the first position, sample the dwell time uniformly from [0, 30], click on the third result, ...) will not create any preference for A or B in expectation. Next, we require the policy π to be a valid distribution:

R2. π_i ≥ 0;  Σ_i π_i = 1

Among all of the possible combinations of {π, w} that satisfy R1 and R2, we want to select the combination that maximizes the interleaving sensitivity.

¹ They can be enumerated as ababababab, ababababba, abababbaab, ..., bababababa.
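To make the construction above concrete, here is a minimal Python sketch of the Team Draft mixing step (F1) and the scoring rule (F4). This is our own illustrative code, not the authors' implementation; the document identifiers, function names, and the click feature representation are placeholders.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, length=10, rng=random):
    """Team Draft mixing: at each step both teams contribute one result;
    a coin toss decides which team contributes first.  Returns the
    interleaved list L and the team vector T, with T[p] = +1 if the
    result at position p was contributed by B and -1 if by A.
    Assumes the rankings jointly contain at least `length` results."""
    interleaved, teams, used = [], [], set()

    def top_unused(ranking):
        # the highest-ranked result not yet placed on the page
        return next(d for d in ranking if d not in used)

    while len(interleaved) < length:
        b_first = rng.random() < 0.5  # one coin toss per step
        step = ([(ranking_b, 1), (ranking_a, -1)] if b_first
                else [(ranking_a, -1), (ranking_b, 1)])
        for ranking, team in step:
            if len(interleaved) < length:
                doc = top_unused(ranking)
                used.add(doc)
                interleaved.append(doc)
                teams.append(team)
    return interleaved, teams

def score(clicks, teams, w):
    """Scoring rule F4: S(q; w) = sum over clicks c of T(c) * w^T phi(c).
    Each click is a pair (position, feature vector phi(c))."""
    return sum(teams[p] * sum(wi * fi for wi, fi in zip(w, phi))
               for p, phi in clicks)
```

Under the uniform policy every coin-toss sequence is equally likely, so each position hosts a result of either team with probability 1/2; this is exactly the per-position balance condition on π that the Lemma in Section 4 requires.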
Based on [20], we use a dissimilarity measure D between the compared alternatives in a set of historical experiments E as a proxy for the sensitivity in future experiments. Indeed, the more dissimilar the alternatives are, the easier it is to differentiate them.

O1. The optimal combination of the parameters π and w should maximize the dissimilarity D over a set of experiments E:

    π̂, ŵ = argmax_{π,w} D(E, π, w)

This ends the framework description. In the next section, we discuss the requirement R1 in more detail.

4. UNBIASEDNESS REQUIREMENT
The motivation behind R1 is to ensure that a user who clicks according to a fixed pattern that does not depend on the results shown would not provide any preference for A or B. Clearly, if R1 is not satisfied, a certain bias towards one of the alternatives might appear. To illustrate how such a bias might arise, let us consider the following toy example. Let us assume that the feature representation vector φ(c) is a two-dimensional vector, with its first component φ_1(c) being equal to 1 if the clicked result is from A, and zero otherwise. Similarly, φ_2(c) is equal to 1 if click c is performed on a result from B. Suppose we fix the interleaving policy to be uniform, and learn the vector of weights w based on the dataset of experiments. It is possible that, as a result of the learning, the weights of the features will obtain different values, e.g. if the learning dataset has more experiments with A winning. This results in poor generalization capabilities and biased interleaving. By considering a user who always clicks on the first position, we notice that in our toy example R1 requires w_1 to be equal to w_2. In this work we simplify R1 by using a restricted family of features. Namely, we use click features that do not depend on the result page² L_i. By restricting the set of possible features, we achieve an intuitive symmetry property: after swapping A and B (i.e. renaming A to B, and B to A), the experiment outcome Δ(e) will only change its sign, but not its absolute value (which is violated in our toy example).
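The bias in this toy example can be checked numerically. The sketch below is our own construction: it averages the credit of a user who always clicks the first position over the 32 equally likely Team Draft coin-toss outcomes, using the two-dimensional toy feature encoding; the expected credit comes out to (w_2 - w_1)/2, which vanishes only when w_1 = w_2.

```python
from itertools import product

def expected_credit_first_click(w1, w2):
    """Average credit of a user who always clicks position 1, over the
    2^5 = 32 equally likely Team Draft coin-toss outcomes under the
    uniform policy.  In the toy encoding, a click on an A result has
    phi = (1, 0) and T(c) = -1; a click on a B result has phi = (0, 1)
    and T(c) = +1, so the credit T(c) * w^T phi(c) is -w1 or +w2."""
    total = 0.0
    for tosses in product([False, True], repeat=5):
        b_first = tosses[0]  # the first toss decides who fills position 1
        total += w2 if b_first else -w1
    return total / 32
```

With equal weights the expectation is exactly zero, while unequal weights leave a residual preference for one alternative; this is the failure mode that R1 rules out.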
Furthermore, the following Lemma shows conditions that are sufficient to satisfy R1 if we restrict the features used:

Lemma 1. For a feature representation φ and a policy π to satisfy R1, it is sufficient that:
- φ is independent from L_i;
- for each position p on the result page, the probability of observing a result from A is equal to the probability of observing a result from B: ∀p: Σ_i π_i T_i(p) = 0.

Proof. First, using the independence of φ from L_i, we re-write R1 as follows:

    Σ_j w^T φ_j Σ_i π_i T_i(p_j) = 0    (2)

An obvious way to satisfy Equation (2) is to select π such that the expectation of T_i is zero for every position:

    ∀p: Σ_i π_i T_i(p) = 0    (3)

Lemma 1 provides us with a convenient approach to satisfy R1 while optimizing the interleaving parameters. Indeed, once we use only features that are independent from the particular interleaved result pages shown, whether R1 is satisfied or not depends only on the interleaving policy. In that case, R1 reduces to the following equality constraint:

    Rπ = 0    (4)

where R ∈ R^{m×l} is a matrix with its element R_{ji} equal to the team T_i(j) (1 or -1) of the result shown on the j-th position of the interleaved result page L_i. Equation (4) gives an intuition of how the optimization of the interleaving policy can be performed: the number of independent³ equality constraints grows linearly as m/2 with the number of positions m, but the number of different team distributions T_i, and thus the dimensionality of the policy vector π, grows exponentially as 2^{m/2}. As a result, some degrees of freedom appear that can be used to find a sensitive yet unbiased policy. This intuition is similar to the one behind the optimization in Optimized Interleaving [15].

² The features cannot depend on the clicked result, its team, and its position in A and B. In contrast, the features can depend on the properties of the click itself (e.g. the position of the click, its dwell time) and the total number of clicks.
³ As discussed in F1, our framework relies on the Team Draft mixing algorithm. Due to its specifics, if Equation (3) holds for a position 2k and a policy π, it also holds for the position 2k + 1 and π.

5. STRATIFIED SCORING
In Section 3, the experiment outcome is calculated as a sample mean of the scores of the individual interactions Δ(e), Equation (1). This approach is similar to the one used previously [2, 8, 15, 16]. We propose to use a stratified estimate Δ_s(e), where the stratification is performed according to the distribution of the teams (ababababab, ...) on the result pages shown to the users. Further, by Q_i we denote the set of the user interactions where the distribution of the teams on the result page shown is T_i. Using this notation, our proposed stratified estimate can be calculated as follows:

    Δ_s(e) = Σ_i π_i (1/|Q_i|) Σ_{q∈Q_i} S(q; w)    (5)

Both the stratified estimate Δ_s(e) and the sample mean Δ(e) have the same expected value, but the variance of Δ_s(e) can be lower and, consequently, it has a higher sensitivity. Indeed, denoting the number of interactions in the experiment e as N, and the variance and the expectation of the interaction score S among the sessions in the i-th stratum as var_i[S] and E_i[S], and applying the law of total variance, we obtain:

    var[Δ(e)] = (1/N) (Σ_i π_i var_i[S] + Σ_i π_i (E_i[S] - Σ_k π_k E_k[S])²) ≥ (1/N) Σ_i π_i var_i[S] = var[Δ_s(e)]    (6)

Since the frequency of T_i is determined by π, the probability of each stratum is known and fixed before starting an interleaving experiment. As can be seen from Equation (6), the stratification reduces the variance only when the inner-strata means E_i[S] are different from the overall mean Σ_i π_i E_i[S]. In our proposed approach of Equation (5), the stratification is performed according to the teams of the results on a result page T_i. In the case of document search, T_i is a strong indicator of the outcome of a single comparison, as it specifies, for instance, if the click on the first result is counted in favour of A or B.
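The stratified estimate differs from the plain sample mean only in how interactions are aggregated: strata are weighted by the known policy probabilities rather than by their realized sample sizes. A minimal sketch (the helper names are ours, not the paper's):

```python
def sample_mean_outcome(scores):
    """Plain outcome: sample mean of the per-interaction scores S(q; w),
    as in Equation (1)."""
    return sum(scores) / len(scores)

def stratified_outcome(scores_by_stratum, policy):
    """Stratified outcome, Equation (5): per-stratum means weighted by
    the policy probabilities pi_i, which are known and fixed before the
    experiment starts.  `scores_by_stratum` maps a stratum id i (a team
    distribution T_i) to the list of scores observed in that stratum."""
    return sum(policy[i] * sum(scores) / len(scores)
               for i, scores in scores_by_stratum.items())
```

For instance, with two strata under the uniform policy (0.5, 0.5), the realized sample may over-represent one stratum; the stratified estimate reweights the per-stratum means back to the policy, which is exactly the between-strata variance term removed in Equation (6).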
The stratification alone can considerably improve the sensitivity of the interleaving experiments in some cases (Section 10). Moreover, as we discuss in Section 6, the use of the stratified outcome ŝ(e) considerably simplifies the optimization of the interleaving parameters.

6. OPTIMIZATION OF THE PARAMETERS

To specify an instantiation of our proposed interleaving framework, we need to specify the interleaving policy π, the feature representation φ(c), and the vector of weights w. The feature representation is domain-specific. However, our proposed approaches to determine the vector of weights w and the interleaving policy π are the same irrespective of the domain. We adopt a data-centric approach [20] to select π and w: we select them to maximize the sensitivity on previously collected data.

We assume that a dataset E of interleaving experiments is available, so that for each experiment in this dataset the user interactions are recorded, and the experiment outcome is known. Such a dataset can be obtained from interleaving experiments run by a search engine (e.g., Team Draft-based experiments) by selecting the experiments with a high confidence in the outcome [2, 20], or by deploying data collection experiments where B is obtained by manually degrading A, and all possible combinations of the result lists and the team distributions are shown to the users with the uniform policy. We discuss these two approaches in more detail in Section 8. To simplify the notation, without any loss of generality, we further assume that in all experiments e ∈ E the alternative B outperformed A, so that ŝ(e) is positive. If that is not the case in a particular experiment, A and B can be swapped for that experiment.

As stated in the sensitivity optimization objective O1, we want to find the values of the parameters π and w that maximize the dissimilarity between A and B over the available experiments and satisfy constraints R1 and R2. Since the sensitivity of the interleaving does not depend on the scaling of w, to make the optimization problem well-posed, we additionally constrain w to have the unit norm.
Overall, this results in a general optimization problem of the following form:

  π̂, ŵ = argmax_{π,w} D(E, π, w)  s.t. R1, R2, wᵀw = 1

Further, we discuss two ways to specify the idea of dissimilarity, proposed by Yue et al. [20]: the mean score and the z-score dissimilarities.

Mean score. We start with the simplest case, when the dissimilarity is calculated as the mean value of the stratified score:

  D_m(E, π, w) = (1/|E|) Σ_{e∈E} Σ_i (π_i / |Q_{e,i}|) Σ_{q∈Q_{e,i}} Σ_{c∈q} T(c) wᵀφ(c)    (7)

where Q_{e,i} is the set of user interactions with the team distribution T_i demonstrated. Further, we introduce a matrix X with its columns corresponding to the individual features and its rows corresponding to the strata, so that the element X_kr is equal to the mean value of the rth feature φ_r in the kth stratum:

  X_kr = (1/|E|) Σ_{e∈E} (1/|Q_{e,k}|) Σ_{q∈Q_{e,k}} Σ_{c∈q} T(c) φ_r(c)

Using the introduced notation, the optimization objective can be re-written as follows:

  D_m(E, π, w) = πᵀXw

Thus, we are looking for π, w that maximize this objective:

  π̂, ŵ = argmax_{π,w} [πᵀXw]  s.t. R1, R2, wᵀw = 1    (8)

Finally, we notice that if we set π to be the uniform policy, the solution of the optimization problem (8) becomes similar to the solution of the corresponding case in Yue et al. [20]: w lies on the unit sphere wᵀw = 1 and maximizes the dot product πᵀXw, so ŵ = Xᵀπ / ‖Xᵀπ‖₂. The difference is in the way X is calculated, as the scores are stratified in our case.

Z-score. The second way to specify the level of dissimilarity between A and B proposed by Yue et al. [20] is to

measure the z-score statistic. Informally, it measures how far the distance between A and B is from zero, in terms of the variance of this distance. Following Yue et al., we simplify the optimization by combining the set of experiments E into a single artificial experiment ē. In that case, the z-score can be calculated as follows:

  D_z(E, π, w) = ŝ(ē) / √(var[ŝ(ē)])    (9)

As earlier, we introduce a matrix X with its elements equal to the per-stratum means of the individual features:

  X_kr = (1/|Q_{ē,k}|) Σ_{q∈Q_{ē,k}} Σ_{c∈q} T(c) φ_r(c)

Again, the score ŝ(ē) can be found as πᵀXw. Due to the stratified representation of the score, the variance of ŝ(ē) breaks down into a weighted sum of the per-stratum variances:

  var[ŝ(ē)] = (1/N) Σ_i π_i var_i[S] = (1/N) Σ_i π_i wᵀZ_i w

where N is the number of interactions in ē, and Z_i is the covariance matrix of the interaction scores Σ_{c∈q} T(c) φ(c) for the ith stratum:

  Z_i = (1/|Q_{ē,i}|) Σ_{q∈Q_{ē,i}} (Σ_{c∈q} T(c)φ(c) − φ̄_i)(Σ_{c∈q} T(c)φ(c) − φ̄_i)ᵀ

and φ̄_i is the mean feature vector for the ith stratum:

  φ̄_i = (1/|Q_{ē,i}|) Σ_{q∈Q_{ē,i}} Σ_{c∈q} T(c) φ(c)

Overall, we obtain the following optimization problem:

  π̂, ŵ = argmax_{π,w} πᵀXw / √(wᵀ(Σ_i π_i Z_i)w)  s.t. R1, R2, wᵀw = 1    (10)

The use of stratification considerably simplifies the form of the optimization problem (10). Indeed, to calculate the variance of ŝ(e) in the denominator of Equation (9) we used the right part of the inequality (6). In the non-stratified case, the variance is represented by the left part of (6). The latter case is harder for the optimization due to additional mutual dependencies of the variables (e.g. the variance becomes a third-order polynomial w.r.t. π, while it is linear in the stratified case).

In contrast to the case considered by Yue et al., there is no closed-form solution to the problems (8) and (10) (due to the additional variable π and the requirements R1 and R2). Instead, we optimize (8) and (10) numerically⁴. As an initial approximation, we use the uniform policy and the solution of the corresponding problem in [20].

7.
DATASETS

In our evaluation study we use two datasets: a dataset of Team Draft-based document search online experiments performed by a commercial search engine, and a dataset of preliminary interleaving experiments performed on its image search service. We discuss them in more detail below.

⁴Using the SLSQP routine implemented in scipy [9].

Table 1: Datasets statistics.

Domain   | # exp. | B > A | mean/median # sessions | mean/median # days
Document | 67     | 30    | 840K/620K              | 9.8/8.0
Image    | 5      | 0     | 38K/34K                | 4.0/4.0

Document search. We build the dataset of the Team Draft-based online experiments as follows. First, we randomly sample a subset of interleaving experiments performed by the search engine in the period from January to November 2014. These experiments test changes in the search ranking algorithm that were developed as a part of the search engine's evolution. The experiments also differ by the country and geographical region they are deployed in. We select the experiments where the winner (A or B) is determined with a high level of confidence, p ≤ 0.005 (binomial sign test, deduped click weighting scheme [2]).

Image search. In contrast to the web document search case, a representative set of online interleaving experiments is not available to us. Instead, we take five data collection experiments. In each of these experiments, the evaluated ranker B is obtained by degrading A in a controlled manner. After that, the corresponding comparison of A and B is deployed. In these experiments the interleaved result pages are obtained by interleaving the ranked lists returned by A and B, as discussed in Section 3, and showing them with the uniform policy (i.e. applying Team Draft). The following modifications of the production ranker were used to generate the alternative system B: swapping the results ranked 1..15 with the results ranked 16..30; random permutation of the top-ranked results; promoting results with a low resolution; setting an important subset of the ranking features to zero; randomly ignoring some subsets of the search index.
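Footnote 4 points at the SLSQP routine in scipy. The toy sketch below sets up the z-score problem (10) from Section 6 for a 4-position Team Draft page with two features; the matrices X and the covariances Z_i are synthetic (not taken from the paper's data), π and w are packed into one variable vector, and R1 is expressed as the linear constraint Rπ = 0.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
l, r = 4, 2                                    # strata (team distributions) and features
X = rng.normal(0.2, 0.1, size=(l, r))          # synthetic per-stratum feature means
Z = [np.eye(r) * (0.5 + 0.1 * i) for i in range(l)]  # synthetic per-stratum covariances

# Team distributions T_i for a 4-position Team Draft page (+1 = B, -1 = A);
# every column sums to zero, so the uniform policy satisfies R pi = 0.
T = np.array([[ 1, -1,  1, -1],
              [ 1, -1, -1,  1],
              [-1,  1,  1, -1],
              [-1,  1, -1,  1]])
R = T.T                                        # (R @ pi)_j = sum_i pi_i T_i(j)

def unpack(x):
    return x[:l], x[l:]

def neg_obj(x):                                # negated z-score objective of (10)
    pi, w = unpack(x)
    var = sum(p * (w @ Zi @ w) for p, Zi in zip(pi, Z))
    return -(pi @ X @ w) / np.sqrt(max(var, 1e-12))

cons = [
    {"type": "eq", "fun": lambda x: R @ unpack(x)[0]},           # R1: unbiasedness
    {"type": "eq", "fun": lambda x: unpack(x)[0].sum() - 1.0},   # pi is a distribution
    {"type": "eq", "fun": lambda x: unpack(x)[1] @ unpack(x)[1] - 1.0},  # ||w||^2 = 1
]
x0 = np.concatenate([np.full(l, 1.0 / l),            # uniform policy and a
                     np.full(r, 1.0 / np.sqrt(r))])  # unit-norm weight vector
res = minimize(neg_obj, x0, method="SLSQP",
               bounds=[(0.0, 1.0)] * l + [(None, None)] * r,
               constraints=cons)
pi_hat, w_hat = unpack(res.x)
```

In the paper's setting the initial weight vector would instead be the closed-form solution ŵ = Xᵀπ/‖Xᵀπ‖₂ of the mean-score problem (8) under the uniform policy.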
As a result, we obtained a dataset of experiments which can be used to adjust the interleaving parameters w and π, as discussed in Section 6. Once the number of organic evaluation experiments has grown, the optimization procedure we propose can be repeated on a more representative dataset. We provide descriptive statistics of the datasets in Table 1.

8. INSTANTIATION

As discussed above, what changes between domains is the feature representation of the clicks, φ(c). Below we describe the features we use in our experimental study. All features we use are independent from the result page demonstrated, so they meet the requirements of Lemma 1.

Document search features. For each click in a user interaction, we calculate a set of 24 features, split into four families: Rank-based, Dwell time-based, Order-based, and Linear score-based features. We report these features along with their descriptions in Table 2.

Image search features. The click features we use for image search interleaving are similar to the features used for document search. We exclude some rank-based features, as they are not meaningful for the two-dimensional result presentation (e.g. feature #11 assumes that the users tend to examine results in a rank-wise order). The full list of features used for the image search click representation is provided in Table 3.

Stratification. In the document search scenario, we stratify the estimate of the experiment outcome according to the teams of the results on the first result page. This gives us 2^(10/2) = 32 strata. The same strata are used for the policy optimization: the policy specifies the probability of using a specific team distribution to generate the first interleaved

Table 2: Click features for document search.

Feature family | id | Description
Rank-based — transformations of the click's rank, normalized by the number of clicks:
  1-10: position indicators, f_i = I{rank = i}
  11: rank
  12: rank²
  13: log(rank)
  14: I{rank > 4}
  15: I{rank > d}, where d is the number of identical results in the tops of A and B
Dwell time-based — indicators of the dwell time (seconds), normalized by the number of clicks:
  16: I{dwell ≤ 30}
  17: I{dwell ∈ (30, 60]}
  18: I{dwell ∈ (60, 90]}
  19: I{dwell ∈ (90, 120]}
  20: I{dwell > 120}
Order-based — indicators of the click's position in the interaction:
  21: is the click first
  22: is the click last
Linear score-based — after applying the scoring rule F4, these features represent the (normalized) number of clicks the results from B received:
  23: f_23 = 1
  24: f_24 = 1/n, where n is the total number of clicks

result page. The remaining pages are generated using the standard Team Draft procedure, and it can be shown that the interleaving is unbiased in terms of R1 in that case.

In the case of image search, the stratification is less straightforward. Indeed, stratification according to the teams of the top 30 results on the first result page would yield 2^(30/2) = 32768 strata. On the one hand, according to Equation (6), using more fine-grained strata results in equal or lower variance. On the other hand, to run the optimization discussed in Section 6, we need to estimate per-stratum means and covariances of the features. This results in a trade-off between an increased sensitivity due to more fine-grained stratification and a higher error of the optimization with unreliable parameters. Thus, we performed the search for the optimal number of top results to be used in stratification as a part of the training process, as discussed in Section 9.3.

9. EVALUATION

In our evaluation study, we aim to answer the following research questions: (RQ1) is our framework more sensitive than the baselines on the document and image search data, and (RQ2) if yes, then which aspects of the sensitivity optimization (stratification, credit assignment, and policy optimization) contribute to the increased sensitivity?
To answer these questions, we first describe the baselines we use in Section 9.1. After that, we introduce the metric we use in Section 9.2. Finally, we describe the evaluation methodology in Section 9.3.

9.1 Baselines

In our study, we compare the sensitivity of our proposed framework to the Team Draft algorithm with varied credit assignment functions. We consider credit assignment functions of two types: the heuristic click weighting schemes that are applicable to Team Draft and considered in [2], and the learned scoring functions trained according to the approach of Yue et al. [20]. All these baselines are non-stratified.

Linear. In the simplest scoring scheme, we calculate the difference in the number of clicks on the results from A and B:

  S(q; w) = Σ_{c∈q} T(c)

Normalized Linear. In the Normalized Linear scheme, the score of B in a particular interaction is normalized by the number of clicks in this interaction:

  S(q; w) = (1/|q|) Σ_{c∈q} T(c)

Binary. Another approach to aggregate clicks in a single impression is to assign a unit credit to the alternative that received more clicks:

  S(q; w) = sgn(Σ_{c∈q} T(c))

Deduped Binary. In the web document search scenario, it is often assumed that the users examine result lists from top to bottom. In that case, if the top k documents are identical in both A and B, all the interleaved lists have the same top k results, too. Thus, clicks on these top k results add zero-mean additive noise to the difference between the number of clicks A and B receive. A useful trick is to ignore such clicks. We combine this approach with the binary aggregation scheme:

  S(q; w) = sgn(Σ_{c∈q} T_d(c))

where T_d(·) is a modified team indicator function, equal to zero if the click is performed on one of the top results identical for A and B, and equal to T(·) otherwise. The deduped binary scheme is one of the most sensitive schemes [2].

Learned-mean, Learned-z. In contrast to the above credit assignment functions that are based on intuitive considerations, Learned-mean and Learned-z are machine-learned credit assignment functions based on the approach of Yue et al. [20].
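The four heuristic schemes (Linear, Normalized Linear, Binary, and Deduped Binary) reduce to a few lines each, assuming every click is represented by its team T(c) ∈ {+1, −1} (+1 for B) and, for Deduped, by its rank:

```python
import numpy as np

def linear(teams):
    """Linear: difference in click counts between B and A."""
    return sum(teams)

def normalized_linear(teams):
    """Normalized Linear: per-interaction average team credit."""
    return sum(teams) / len(teams)

def binary(teams):
    """Binary: unit credit to the side that received more clicks."""
    return np.sign(sum(teams))

def deduped_binary(teams, ranks, k):
    """Deduped Binary: ignore clicks on the top-k results identical in A and B."""
    return np.sign(sum(t for t, r in zip(teams, ranks) if r > k))

teams = [+1, -1, -1]               # one click on B, two clicks on A
print(linear(teams))               # 1 - 2 = -1
print(normalized_linear(teams))    # -1/3
print(binary(teams))               # -1.0
print(deduped_binary(teams, ranks=[1, 2, 5], k=2))  # only the rank-5 click counts
```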
The Learned-mean and Learned-z baselines use the same feature representations as our proposed interleaving framework. However, the optimization of the interleaving policy is not performed, and it is fixed to be constant and uniform (as in Team Draft). Learned-mean selects the vector of weights w such that the differences between A and B are maximized, and Learned-z maximizes the z-score objective. These objectives are close to the objectives we use in Section 6, but they assume a non-stratified experiment outcome and the uniform policy.

It would be interesting to compare our framework to the Optimized Interleaving framework [5]. However, Optimized Interleaving relies on considerably larger sets of interleaved result pages, thus the datasets of Team Draft-based interleaving experiments cannot be re-used to evaluate its performance. An alternative approach is to leverage the natural variation of the search engine's rankings as a source of the result pages, as used in [5]. However, in this case, the evaluation is performed on a query level, and it is restricted to be based on the head queries only. Overall, this might lead to a less representative study.

9.2 Metric

In this work, we use the z-score metric that is used to measure the interleaving sensitivity on historical data [2, 10]. The z-score indicates the confidence of the evaluated method in the experiment outcome, thus it serves as a proxy to measure

Table 3: Click features for image search.

Feature family | id | Description
Rank-based:
  1-30: position indicators, f_i = I{rank = i}
  31: I{rank > d}, where d is the number of identical results in the tops of A and B
Dwell time-based — indicators of the dwell time (seconds), normalized by the number of clicks:
  32: I{dwell ≤ 30}
  33: I{dwell ∈ (30, 60]}
  34: I{dwell ∈ (60, 90]}
  35: I{dwell ∈ (90, 120]}
  36: I{dwell > 120}
Order-based — indicators of the click's position in the interaction:
  37: is the click first
  38: is the click last
Linear score-based — after applying the scoring rule F4, these features represent the (normalized) number of clicks the results from B received:
  39: f_39 = 1
  40: f_40 = 1/n, where n is the total number of clicks

the sensitivity of the method: a higher confidence indicates a higher sensitivity. Assuming that ŝ(e) is normally distributed⁵ and using the notation introduced above, we define the z-score statistic on the data of the experiment e as follows:

  Z = ŝ(e) / √(var[ŝ(e)]) = ŝ(e) / √((1/N) Σ_i π_i var_i[S])    (11)

To calculate the z-score statistic for an interleaving method with a non-uniform policy on data obtained from an experiment with the uniform policy, we use the per-stratum sample estimates of the expectations E_i[S] and the variances var_i[S] (Equation (6)), calculated on the experimental data, and the policy specified by the interleaving method. The value of (11) indicates how far the score ŝ(e) deviates from zero in the standard normal distribution. Thus it indicates the confidence level of the experiment outcome and can be mapped into a p-value (under the null hypothesis, the true value of ŝ(e) is 0). For instance, a Z of 1.96 (2.58) corresponds to a two-sided p-value of 0.05 (0.01). In the case of the non-stratified estimate s̄(e), the z-score is calculated similarly:

  Z = s̄(e) / √(var[s̄(e)]) = s̄(e) / √(var[S]/N)    (12)

For each interleaving experiment, we calculate the relative z-score by dividing the outcome's z-score by the z-score of the Team Draft method with the linear click weighting scheme.
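The mapping from the statistic of Equation (11) to a two-sided p-value only needs the standard normal CDF; the thresholds quoted in the text serve as a sanity check:

```python
import math

def z_score(s_hat, var_s_hat):
    """Eq. (11)/(12): the outcome estimate in units of its standard deviation."""
    return s_hat / math.sqrt(var_s_hat)

def two_sided_p(z):
    """Two-sided p-value of a standard normal statistic, via the error function."""
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

print(round(two_sided_p(1.96), 3))   # 0.05
print(round(two_sided_p(2.58), 3))   # 0.01
```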
The relative z-score z_e has an intuitive interpretation [2]: the corresponding interleaving method needs z_e² times fewer interactions in the same experiment e than the Team Draft algorithm with the linear weighting scheme to achieve the same level of confidence.

9.3 Procedure

In our evaluation on the document search dataset, we use 10-fold cross-validation: in each split, 90% of the interleaving experiments are used for optimization, and 10% are used to evaluate the resulting sensitivity. The same splits are used for all the approaches that run optimization (our proposed framework with two types of dissimilarity, and the Learned-mean and Learned-z baselines). In each split, we measure the relative z-scores of an interleaving method on the experiments in the test set. For each interleaving method, we report the overall mean and median relative z-scores collected across all folds. We use the paired t-test on the absolute values of the non-normalized z-scores when testing the statistical significance of the performance differences.

⁵This assumption holds when π_i·N is large enough for all i with π_i > 0, as ŝ(e) is a sum of approximately normally distributed per-stratum sample means, and thus it is normally distributed.

In the case of image search, due to the smaller dataset, we replace the 10-fold cross-validation with the leave-one-out procedure: one experiment is used for evaluation, while the others are used for training. Further, within a training step, we additionally run a nested 2-fold cross-validation procedure on the training set to find the optimal number of the result teams to be considered in stratification: for k in 3, ..., 15 we evaluate the performance of our proposed method when the teams of the top 2k results are used for the stratification. The search is stopped when the performance degrades. In most folds the optimal k is found to be equal to 3 (i.e., the top 6 results are used for the stratification).

10. RESULTS AND DISCUSSION

In this section, we use the following notation.
The Linear, Normalized Linear, Binary, and Deduped Binary weighting schemes correspond to Linear, NLinear, Binary, and Deduped, respectively. L_m and L_z indicate the Learned-mean and Learned-z baselines. The instantiations of our proposed framework are referred to as F_m and F_z, when the optimization is performed to maximize the mean difference (8) and the z-score (10) objectives, respectively. As we are interested in evaluating the effects of the stratification and the effects of the joint optimization individually, we additionally measure the performance of the baselines when the stratified outcome ŝ(e) is calculated. The stratified modifications of the interleaving methods L_m and L_z are denoted as L_m^s and L_z^s. L_m^s and L_z^s use the stratified objectives we proposed in Section 6, and correspond to our framework with the interleaving policy fixed to be uniform. In our experiments on both the document and image search datasets, all of the studied interleaving methods correctly determined the preference for A or B.

10.1 Document Search

In Table 4 we report the results of the evaluation procedure discussed in Section 9.3 applied to the web document search data. In the left part of Table 4 (Non-stratified columns), we report the mean and median relative z-scores for the baselines with no stratification applied. In the right part (Stratified columns), we report the performance of our proposed framework as well as of the baselines with the stratification applied.

On analysing the results of the non-stratified baselines, reported in the left part of Table 4, we notice that their relative performance is generally in line with the results reported in [2]. Indeed, the deduped binary scheme with its median relative z-score of 1.59 considerably outperforms the other considered heuristic schemes: Linear (1.00), Normalized Linear (0.93), and Binary (0.98); similarly, L_z outperforms L_m.
On comparing the relative z-scores of Linear, NLinear, Binary, and Deduped with and without the stratification applied (left vs right parts of Table 4), we observe that in some cases the stratification greatly increases the interleaving sensitivity. For instance, the mean and the median relative z-scores of Binary grow from 1.01 and 0.98 to 1.22 and

Table 4: Relative confidence levels of the interleaving outcomes, document search. The scores of the interleaving method with the highest sensitivity (p < 0.01) are highlighted.

       | Non-stratified                          | Stratified
       | Linear NLinear Binary Deduped L_m  L_z  | Linear NLinear Binary Deduped L_m^s L_z^s F_m  F_z
Mean   | 1.00   1.03    1.01   1.88    1.34 2.14 | 1.06   1.16    1.22   1.88    1.39  2.28  1.38 2.45
Median | 1.00   0.93    0.98   1.59    1.20 1.80 | 1.04   1.03    1.10   1.60    1.24  1.96  1.23 2.05

[Figure 1: The probability that an interleaving method disagrees with the true preference, depending on the size of the sample. Curves: F_z, L_z, Deduped, Binary, Linear, and stratified Linear; x-axis: interactions N (0 to 100,000); y-axis: error probability (0.00 to 0.30).]

1.10, respectively. Noticeable improvements are obtained for all of the considered baseline schemes, except for Deduped, where the improvement is small. Interestingly, a considerable improvement is also observed for L_z: its stratified modification L_z^s exhibits a median relative z-score of 1.96, while L_z has a median z-score of 1.80. This level of improvement is roughly comparable to the difference between the best heuristic baseline (Deduped, median relative z-score 1.59) and the best machine-learned baseline (L_z, median relative z-score 1.80) in the non-stratified case.

In all cases, the credit assignment functions that optimize the mean difference between A and B perform worse than the credit assignment functions learned to maximize the z-score. For instance, L_z^s demonstrates a considerably higher median relative confidence than L_m^s (1.96 vs 1.24). By additionally performing the interleaving policy optimization, F_z achieves a considerable sensitivity gain in comparison with the stratified L_z^s (F_z, 2.05 vs L_z^s, 1.96). This gain is roughly similar to the difference in performance obtained by performing stratification (L_z, 1.80 vs L_z^s, 1.96). F_z also achieves the highest overall sensitivity, with a median relative z-score of 2.05 and a mean relative z-score of 2.45.
This implies that an interleaving experiment using the method F_z requires 2.05² = 4.20 times fewer user interactions (in median) than the non-stratified Team Draft with the linear scoring to achieve the same level of confidence. In comparison with the best performing baseline, L_z, it requires (2.05/1.80)² = 1.30 times less data to achieve the same level of confidence (in median).

Visualization. We illustrate the relative performance of the studied interleaving methods on the document search dataset using the following procedure. We randomly select one experiment to be used as a test experiment, and use the remaining experiments to optimize the interleaving parameters. Further, we estimate the probability that an interleaving method disagrees with the ground truth preference in the test experiment by obtaining 10,000 samples of N user interactions. We varied N in (10³, ..., 10⁵). For the baseline methods, N interactions are obtained by sampling from the experiment's interactions with replacement (bootstrap sampling). For F_z, the sample is obtained by firstly allocating N interactions to the strata according to the multinomial distribution specified by the policy π, and further sampling from the individual strata (with replacement). The outcome is calculated using the stratified estimate ŝ(e). This sampling process simulates the case of the policy π being applied in a real-life scenario. The stratified modification of Linear differs from Linear only by using the stratified estimate of the outcome, ŝ(e), instead of the sample mean. A higher error probability indicates lower sensitivity, and it is related to the outcome's p-value under the bootstrap test.

In Figure 1, we report the obtained error probabilities. From Figure 1 we observe that the optimization-based methods (F_z and L_z) dramatically increase the interleaving sensitivity and outperform Linear, stratified Linear, and Deduped by a considerable margin. For instance, the error probability of 0.05 is achieved by F_z with less than 10,000 interactions, but Linear requires about 90,000 interactions to achieve the same level of error.
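The stratified resampling loop described above can be sketched as follows; the two strata of scores are synthetic (a true preference for B, i.e. a positive mean), so the absolute numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic strata of interaction scores with a true preference for B (> 0).
pi = np.array([0.25, 0.75])
strata = [rng.normal(0.10, 1.0, 5000), rng.normal(0.02, 1.0, 5000)]

def error_probability(N, samples=500):
    """Fraction of stratified bootstrap samples that disagree with the truth."""
    errors = 0
    for _ in range(samples):
        counts = rng.multinomial(N, pi)   # allocate N interactions to the strata
        est = sum(p * rng.choice(s, max(n, 1), replace=True).mean()
                  for p, s, n in zip(pi, strata, counts))  # max(n, 1) guards an empty stratum
        errors += est <= 0
    return errors / samples

# More interactions should drive the disagreement probability down.
print(error_probability(50), error_probability(2000))
```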
Among the methods that use optimization, F_z consistently demonstrates a lower probability of error than L_z. For instance, when 15,000 interactions are used, F_z has a probability of error below 0.06, while L_z makes an error in more than 0.09 of the samples. Further, we observe that the performance of Linear is noticeably improved by adding stratification. Overall, these observations are in line with the results reported in Table 4. However, this illustration is also important as it does not rely on the z-score statistic.

10.2 Image Search

In Table 5 we report the results of the evaluation for the case of image search. Generally, we observe results similar to the document search case. The machine-learned interleaving methods that optimize the z-score objective (L_z, L_z^s, and F_z) outperform both the methods with the heuristic credit assignment (Linear, NLinear, Binary, and Deduped) and the methods that optimize the mean difference. In contrast to the document search experiments, the sensitivity gains due to stratification are less noticeable on the image search data. A possible explanation is that the differences of the means of the strata are smaller than in the case of document search. Indeed, if the users tend to examine most of the results (which is easier for images than for document snippets) and click more, then the teams of the first results are not as strong an indicator of the total credit in an interaction as in the case of document search. Interestingly, Deduped is sensitive in image search, too.

The overall highest performance (p < 0.05) is achieved by our proposed framework with the z-score-based optimization objective (F_z, mean relative z-score 1.21, median 1.18). This value of the metric implies that our proposed framework requires 1.18² = 1.39 times less data (median) than the Linear baseline to achieve the same level of confidence. In comparison to the best-performing baseline, L_z, the corresponding de-