arXiv: v1 [cs.IR] 23 Nov 2017

A Deep Relevance Matching Model for Ad-hoc Retrieval

Jiafeng Guo, Yixing Fan, Qingyao Ai, W. Bruce Croft

CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Center for Intelligent Information Retrieval, University of Massachusetts Amherst, MA, USA

ABSTRACT

In recent years, deep neural networks have led to exciting breakthroughs in speech recognition, computer vision, and natural language processing (NLP) tasks. However, there have been few positive results of deep models on ad-hoc retrieval tasks. This is partially due to the fact that many important characteristics of the ad-hoc retrieval task have not been well addressed in deep models yet. Typically, the ad-hoc retrieval task is formalized as a matching problem between two pieces of text in existing work using deep models, and treated as equivalent to many NLP tasks such as paraphrase identification, question answering and automatic conversation. However, we argue that the ad-hoc retrieval task is mainly about relevance matching while most NLP matching tasks concern semantic matching, and there are some fundamental differences between these two matching tasks. Successful relevance matching requires proper handling of the exact matching signals, query term importance, and diverse matching requirements. In this paper, we propose a novel deep relevance matching model (DRMM) for ad-hoc retrieval. Specifically, our model employs a joint deep architecture at the query term level for relevance matching. By using matching histogram mapping, a feed forward matching network, and a term gating network, we can effectively deal with the three relevance matching factors mentioned above. Experimental results on two representative benchmark collections show that our model can significantly outperform some well-known retrieval models as well as state-of-the-art deep matching models.

Keywords

Relevance Matching, Semantic Matching, Neural Models, Ad-hoc Retrieval, Ranking Models

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. CIKM'16, October 24-28, 2016, Indianapolis, IN, USA. © 2016 ACM. ISBN /16/10... $15.00 DOI:

1. INTRODUCTION

Machine learning methods have been successfully applied to information retrieval (IR) in recent years. Typically, a ranking function which produces a relevance score given a query and document pair is learned based on a set of human defined features. However, handcrafting features can be time-consuming, incomplete and over-specified. On the other hand, deep neural networks, as a representation learning method, are able to discover from the training data the hidden structures and features at different levels of abstraction that are useful for the tasks. Recently, deep models have been applied to a variety of applications in computer vision [16], speech recognition [10] and NLP [25, 17], and have yielded significant performance improvements. Given the success of deep learning in these domains, it seems that deep learning should have a major impact on IR. However, there have been few positive results of deep models on IR tasks, especially ad-hoc retrieval tasks, until now.

Without loss of generality, when applying deep models to ad-hoc retrieval, the task is typically formalized as a matching problem between two pieces of text (i.e., the query and document). Such a matching problem formalization is often considered general in the sense that it can cover both ad-hoc retrieval tasks as well as many NLP tasks such as paraphrase identification, question answering (QA), and automatic conversation [17, 11].
A variety of deep matching models have been proposed to solve this matching problem, which can be categorized into two types according to their model architecture. One is the representation-focused model, which tries to build a good representation for a single text with a deep neural network, and then conducts matching between the compositional and abstract text representations. Examples include DSSM [12], C-DSSM [23, 8] and ARC-I [11]. The other is the interaction-focused model, which first builds local interactions (i.e., local matching signals) between two pieces of text, and then uses deep neural networks to learn hierarchical interaction patterns for matching. Examples include DeepMatch [17], ARC-II [11] and MatchPyramid [19].

However, in this work, we argue that the matching problems in many NLP tasks and the ad-hoc retrieval task are fundamentally different. Most NLP tasks concern semantic matching, i.e., identifying the semantic meaning and inferring the semantic relations between two pieces of text, while the ad-hoc retrieval task is mainly about relevance matching, i.e., identifying whether a document is relevant to a given query. We point out three major differences between these two matching problems which may lead to significantly different architecture design for the deep matching models. We also show that most existing deep matching models are designed for semantic matching rather than relevance matching. Based on these differences, we propose a deep relevance matching model (DRMM) for ad-hoc retrieval by explicitly modeling the three major factors in relevance matching.

Figure 1: Two types of deep matching models: (a) Representation-focused models employ a Siamese (symmetric) architecture over the text inputs; (b) Interaction-focused models employ a hierarchical deep architecture over the local interaction matrix.

Overall, our model is an interaction-focused model which employs a joint deep architecture at the query term¹ level for relevance matching. Specifically, we first build local interactions between each pair of terms from a query and a document based on term embeddings. For each query term, we map the variable-length local interactions into a fixed-length matching histogram. Based on this fixed-length matching histogram, we then employ a feed forward matching network to learn hierarchical matching patterns and produce a matching score. Finally, the overall matching score is generated by aggregating the scores from each query term with a term gating network computing the aggregation weights. We show how our major model designs, including matching histogram mapping, a feed forward matching network, and a term gating network, address the three key factors in relevance matching for ad-hoc retrieval.

We evaluate the effectiveness of the proposed DRMM based on two representative ad-hoc retrieval benchmark collections. For comparison, we take into account some well-known traditional retrieval models, as well as several state-of-the-art deep matching models either designed for the general matching problem or proposed specifically for the ad-hoc retrieval task. The empirical results show that the existing deep matching models cannot compete with the traditional retrieval models on these benchmark collections, while our model can outperform all the baseline models significantly in terms of all the evaluation metrics.

The major contributions of this paper include:

1. We point out three major differences between semantic matching and relevance matching, which may lead to significantly different architecture design of the deep matching models.
2. We propose a novel deep relevance matching model for ad-hoc retrieval by explicitly addressing the three key factors of relevance matching.
3. We conduct rigorous comparisons over state-of-the-art retrieval models on benchmark collections and analyze the deficiencies of existing deep matching models and advantages of the DRMM.

¹ Here we use "term" to denote the indexed units in search systems, which could be stemmed words or phrases.

2. AD-HOC RETRIEVAL AS A MATCHING PROBLEM

According to existing literature [12, 17], the core problem in ad-hoc retrieval, i.e., the computation of the relevance for a document given a particular query, can be formalized as a text matching problem as follows. Given two texts $T_1$ and $T_2$, the degree of matching is typically measured as a score produced by a scoring function based on the representation of each text:

$$\mathrm{match}(T_1, T_2) = F(\Phi(T_1), \Phi(T_2)),$$

where $\Phi$ is a function to map each text to a representation vector, and $F$ is the scoring function based on the interactions between them. Such a text matching problem is considered general since it also describes many NLP tasks, such as paraphrase identification, question answering, and automatic conversation [17, 11].

A variety of deep matching models have been proposed either for the specific ad-hoc retrieval task or for the general matching problem. Depending on how the two functions are chosen, existing deep matching models can be categorized into two types.

The first one, the representation-focused model, tries to build a good representation for a single text with a deep neural network, and then conducts matching between two compositional and abstract text representations. In this approach, $\Phi$ is a complex representation mapping function while $F$ is a relatively simple matching function. For example, in DSSM [12], $\Phi$ is a feed forward neural network, while $F$ is the cosine similarity function. In C-DSSM [23, 8], $\Phi$ is a convolutional neural network (CNN) [16], while $F$ is the cosine similarity function. In ARC-I [11], $\Phi$ is a CNN, while $F$ is a multi-layer perceptron (MLP).
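To make the formalization match(T_1, T_2) = F(Φ(T_1), Φ(T_2)) concrete, the following is a minimal sketch of a representation-focused matcher. It is not DSSM, C-DSSM or ARC-I: the function `phi_mean` (averaging term vectors) is a hypothetical stand-in for a learned Φ, the toy embeddings are made up, and only the cosine scoring function F corresponds to what the text describes for DSSM and C-DSSM.

```python
from functools import partial
import numpy as np

def phi_mean(tokens, embeddings):
    """Stand-in for Phi (an assumption): average the term vectors of one text."""
    return np.mean([embeddings[t] for t in tokens], axis=0)

def f_cosine(u, v):
    """A simple F: cosine similarity between two representation vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def match(t1, t2, phi, f):
    """match(T1, T2) = F(Phi(T1), Phi(T2)); the same phi encodes both texts."""
    return f(phi(t1), phi(t2))

# Toy 2-dimensional embeddings, made up for illustration only.
toy = {
    "car": np.array([1.0, 0.0]),
    "truck": np.array([0.8, 0.6]),
    "news": np.array([0.0, 1.0]),
}
phi = partial(phi_mean, embeddings=toy)
score = match(["car"], ["truck", "news"], phi, f_cosine)
print(round(score, 4))
```

Note that each text is compressed to a single vector before the two sides ever interact, so term-level matching signals are no longer visible to F.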
Without loss of generality, all the model architectures of representation-focused models can be viewed as a Siamese (symmetric) architecture over the text inputs, as shown in Figure 1(a).

The second one, the interaction-focused model, first builds the local interactions between two texts based on some basic representations, and then uses deep neural networks to learn the hierarchical interaction patterns for matching. In this approach, $\Phi$ is usually a simple mapping function while $F$ is a complex deep model. For example, in DeepMatch [17], $\Phi$ simply maps each text to a sequence of words, while $F$ is a feed forward neural network powered by a topic model over the word interaction matrix. In ARC-II [11] and MatchPyramid [19], $\Phi$ maps each text to a sequence of word vectors, while $F$ is a CNN over the interaction matrix between word vectors from the two texts. Without loss of generality, all the model architectures of interaction-focused models can be viewed as a hierarchical deep architecture over the local interaction matrix, as shown in Figure 1(b).

Although various deep matching models have been proposed under such a general matching problem formalization, most of them have only been demonstrated to be effective on a set of NLP tasks such as paraphrase identification and QA [11, 26]. There have been few positive results on the ad-hoc retrieval task. Even the deep models specially designed for Web search, e.g., DSSM and C-DSSM, were only evaluated on <query, doc title> pairs, which are not a typical ad-hoc retrieval setting. If we directly apply these deep matching models on some benchmark retrieval collections, e.g., TREC collections, we find relatively poor performance compared to traditional ranking models, such as the language model [31] and BM25 [22]. All these observations raise some questions such as: Is matching in ad-hoc retrieval really the same as that in NLP tasks? Are the existing deep matching models suitable for the ad-hoc retrieval task?

3. SEMANTIC MATCHING VS. RELEVANCE MATCHING

In this section, we discuss the differences between text matching in ad-hoc retrieval and other NLP tasks. The matching in many NLP tasks, such as paraphrase identification, question answering and automatic conversation, is mainly concerned with semantic matching, i.e., identifying the semantic meaning and inferring the semantic relations between two pieces of text. In these semantic matching tasks, the two texts are usually homogeneous and consist of a few natural language sentences, such as question/answer sentences, or dialogs. To infer the semantic relations between natural language sentences, semantic matching emphasizes the following three factors:

Similarity matching signals: It is important, or even critical, to capture the semantic similarity/relatedness between words, phrases and sentences, as compared with exact matching signals.
For example, in paraphrase identification, one needs to identify whether two sentences convey the same meaning with different expressions. In automatic conversation, one aims to find a proper response semantically related to the previous dialog, which may not share any common words or phrases between them.

Compositional meanings: Since texts in semantic matching usually consist of natural language sentences with grammatical structures, it is more beneficial to use the compositional meaning of the sentences based on such grammatical structures rather than treating them as a set/sequence of words [25]. For example, in question answering, most questions have clear grammatical structures which can help identify the compositional meaning that reflects what the question is about.

Global matching requirement: Semantic matching usually treats the two pieces of text as a whole to infer the semantic relations between them, leading to a global matching requirement. This is partially related to the fact that most texts in semantic matching have limited lengths and thus the topic scope is concentrated. For example, two sentences are considered as paraphrases if the whole meaning is the same, and a good answer fully answers the question.

The matching in ad-hoc retrieval, on the contrary, is mainly about relevance matching, i.e., identifying whether a document is relevant to a given query. In this task, the query is typically short and keyword based, while the document can vary considerably in length, from tens of words to thousands or even tens of thousands of words. To estimate the relevance between a query and a document, relevance matching is focused on the following three factors:

Exact matching signals: Although term mismatch is a critical problem in ad-hoc retrieval and has been tackled using different semantic similarity signals, the exact matching of terms in documents with those in queries is still the most important signal in ad-hoc retrieval due to the indexing and search paradigm in modern search engines.
For example, Fang and Zhai [7] proposed the semantic term matching constraint which states that matching an original query term exactly should always contribute no less to the relevance score than matching a semantically related term multiple times. This also explains why some traditional retrieval models, e.g., BM25, can work reasonably well purely based on exact matching signals.

Query term importance: Since queries are mainly short and keyword based without complex grammatical structures in ad-hoc retrieval, it is important to take into account term importance, while the compositional relation among the query terms is usually the simple "and" relation in operational search. For example, given the query "bitcoin news", a relevant document is expected to be about "bitcoin" and "news", where the term "bitcoin" is more important than "news" in the sense that a document describing other aspects of "bitcoin" would be more relevant than a document describing "news" of other things. In the literature, there have been many formal studies on retrieval models showing the importance of term discrimination [5, 6].

Diverse matching requirement: In ad-hoc retrieval, a relevant document can be very long and there have been different hypotheses concerning document length [22] in the literature, leading to a diverse matching requirement. Specifically, the Verbosity Hypothesis assumes that a long document is like a short document, covering a similar scope but with more words. In this case, the relevance matching might be global if we assume short documents have a concentrated topic. On the contrary, the Scope Hypothesis assumes a long document consists of a number of unrelated short documents concatenated together. In this way, the relevance matching could happen in any part of a relevant document, and we do not require the document as a whole to be relevant to a query.

As we can see, there are significant differences between relevance matching in ad-hoc retrieval and semantic matching in many NLP tasks.
These differences affect the design of deep model architectures and it may be difficult to find a one-size-fits-all solution to such different matching problems. If we revisit the existing deep matching models, we find that most of them concern semantic matching rather than relevance matching. For example, the representation-focused models such as DSSM, C-DSSM and ARC-I focus on the compositional meaning of the texts and fit the global matching requirement. In these models, detailed matching signals and, especially, exact matching signals are lost since they defer the interaction between two texts until their individual representations have been created [11]. Although the interaction-focused models such as DeepMatch, ARC-II and MatchPyramid preserve both exact and similarity matching signals, they do not differentiate these signals but treat them as equally important. These models focus on learning the composition of local interactions without addressing term importance. In particular, the convolutional structures in ARC-II and MatchPyramid are designed to learn positional regularities, which may work well under the global matching requirement but fail under the diverse matching requirement. (There is more discussion on this in Section 4.)

Figure 2: Architecture of the Deep Relevance Matching Model.

4. DEEP RELEVANCE MATCHING MODEL

Based on the above analysis, we propose a novel deep matching model specifically designed for relevance matching in ad-hoc retrieval by explicitly addressing the three factors described in Section 3. We refer to our model as a deep relevance matching model (DRMM). Overall, our model is similar to interaction-focused models rather than representation-focused models since the latter would inevitably lose the detailed matching signals which are critical for relevance matching in ad-hoc retrieval.

Specifically, our model employs a joint deep architecture at the query term level over the local interactions between query and document terms for relevance matching. We first build local interactions between each pair of terms from a query and a document based on term embeddings. For each query term, we then transform the variable-length local interactions into a fixed-length matching histogram. Based on the fixed-length matching histogram, we employ a feed forward matching network to learn hierarchical matching patterns and produce a matching score for each query term. Finally, the overall matching score is generated by aggregating the scores from each single query term with a term gating network computing the aggregation weights. The model architecture is depicted in Figure 2.
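The last two stages of the pipeline described above (a feed forward matching network shared across query terms, followed by softmax term gating and aggregation) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the random weights, the toy gating inputs, and the layer sizes are hypothetical, and the per-term matching histograms are taken as given inputs.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-d array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def drmm_score(hists, gate_inputs, weights, biases, w_g):
    """Score one (query, document) pair.

    hists:       (M, B) array, one fixed-length matching histogram per query term
    gate_inputs: (M, K) array, gating input per query term (e.g. K=1 for an IDF value)
    weights, biases: layer parameters shared by all query terms (row-vector convention)
    w_g:         (K,) gating weight vector
    """
    z = hists
    for W, b in zip(weights, biases):
        z = np.tanh(z @ W + b)          # same network applied to every query term
    term_scores = z[:, 0]               # final layer has one unit: one score per term
    g = softmax(gate_inputs @ w_g)      # aggregation weights, sum to 1 over query terms
    return float(g @ term_scores)       # weighted sum of per-term scores

# Hypothetical toy setup: 3 query terms, 30-bin histograms, layers 30 -> 5 -> 1.
rng = np.random.default_rng(0)
hists = rng.random((3, 30))
weights = [rng.normal(size=(30, 5)), rng.normal(size=(5, 1))]
biases = [np.zeros(5), np.zeros(1)]
idf = np.array([[2.3], [0.7], [1.1]])   # made-up IDF-like gating inputs
w_g = np.array([1.0])
score = drmm_score(hists, idf, weights, biases, w_g)
```

Because the gating weights form a convex combination of tanh outputs, the score in this sketch always lies in [-1, 1], and it does not depend on the order in which the query terms are listed.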
More formally, suppose both query and document are represented as a set of term vectors denoted by $q = \{w^{(q)}_1, \dots, w^{(q)}_M\}$ and $d = \{w^{(d)}_1, \dots, w^{(d)}_N\}$, where $w^{(q)}_i, i = 1, \dots, M$ and $w^{(d)}_j, j = 1, \dots, N$ denote a query term vector and a document term vector, respectively, and $s$ denotes the final relevance score. We have

$$z^{(0)}_i = h(w^{(q)}_i \otimes d), \quad i = 1, \dots, M$$

$$z^{(l)}_i = \tanh(W^{(l)} z^{(l-1)}_i + b^{(l)}), \quad i = 1, \dots, M, \; l = 1, \dots, L$$

$$s = \sum_{i=1}^{M} g_i z^{(L)}_i$$

where $\otimes$ denotes the interaction operator between a query term and the document terms, $h$ denotes the mapping function from local interactions to matching histogram, $z^{(l)}_i, l = 0, \dots, L$ denotes the intermediate hidden layers for the $i$-th query term, and $g_i, i = 1, \dots, M$ denotes the aggregation weight produced by the term gating network. $W^{(l)}$ denotes the $l$-th weight matrix and $b^{(l)}$ denotes the $l$-th bias term, which are shared across different query terms. Note that we adopt cosine similarity, a widely used measure for semantic closeness in neural embeddings [18, 20], as the interaction operator between each pair of term vectors from a query and a document.

In our work, we assume the term vectors are learned a priori using existing neural embedding models such as Word2Vec [18]. We do not learn term vectors in our deep relevance matching model for the following reasons: 1) Reliable term representations can be better acquired from large scale unlabeled text collections rather than from the limited ground truth data for ad-hoc retrieval; 2) By using the a priori learned term vectors, we can focus the learning of our model on relevance matching patterns and considerably reduce the model complexity. In the following, we will describe the major components of our model, including the matching histogram mapping, feed forward matching network, and term gating network in detail, and discuss how they address the three key factors of relevance matching in ad-hoc retrieval.

Matching Histogram Mapping: The input of our deep relevance matching model is the local interactions between each pair of terms from a query and a document.
A major problem is that the size of local interactions is not fixed due to the varied lengths of queries and documents. Previous interaction-based models view the local interactions as a matching matrix by preserving the sequential term orders in both queries and documents. Clearly the matching matrix is a position preserving representation, which is useful if the learning task is position related. However, according to the diverse matching requirement, relevance matching is not position related since it could happen in any position in a long document. Thus the matching matrix may not be a suitable representation for ad-hoc retrieval due to the potentially noisy positional signals in it.

In our work, we adopt a strength preserving representation, namely a matching histogram, which groups local interactions according to different levels of signal strengths rather than their positions. Specifically, since the local interaction (i.e., cosine similarity between term vectors) is within the interval $[-1, 1]$, we discretize the interval into a set of ordered bins and accumulate the count of local interactions in each bin. In this work, we consider fixed bin size and treat exact matching as a separate bin. Other discretization schemes could be explored in future work. For example, suppose the bin size is set as 0.5; we will obtain five bins $\{[-1, -0.5), [-0.5, 0), [0, 0.5), [0.5, 1), [1, 1]\}$ in an ascending order. Given a query term "car" and a document (car, rent, truck, bump, injunction, runway), and the corresponding local interactions based on cosine similarity are (1, 0.2, 0.7, 0.3, -0.1, 0.1), we will obtain a matching histogram as [0, 1, 3, 1, 1].

We explore three ways of the matching histogram mapping:

Count-based Histogram (CH): This is the simplest way of transformation as described above, which directly takes the count of local interactions in each bin as the histogram value.

Normalized Histogram (NH): We normalize the count value in each bin by the total count to focus on the relative rather than the absolute number of different levels of interactions.

LogCount-based Histogram (LCH): We apply logarithm over the count value in each bin, both to reduce the range, and to allow our model to more easily learn multiplicative relationships [1].

We compare our matching histogram representation with previous matching matrix representations to show the advantages.
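The histogram mapping and its three variants can be sketched as follows, using the "car" example from the text. This is an illustrative sketch rather than the authors' code. Two choices are assumptions: exact matches are detected here by a cosine similarity numerically equal to 1 (a real system would more likely compare term identities), and the LCH variant uses log(1 + count) so that empty bins stay finite, whereas the text only says that a logarithm is applied.

```python
import numpy as np

def matching_histogram(sims, bin_size=0.5, mode="CH"):
    """Map variable-length cosine similarities to a fixed-length histogram.

    Regular bins of width bin_size cover [-1, 1); one extra final bin holds
    exact matches (similarity == 1).
    """
    sims = np.asarray(sims, dtype=float)
    n_bins = int(round(2.0 / bin_size))
    exact = np.isclose(sims, 1.0)                      # assumption: exact match <=> sim of 1
    idx = np.floor((sims[~exact] + 1.0) / bin_size).astype(int)
    idx = np.clip(idx, 0, n_bins - 1)                  # guard against rounding at the edges
    counts = np.bincount(idx, minlength=n_bins).astype(float)
    hist = np.append(counts, float(exact.sum()))
    if mode == "NH":                                   # normalized histogram
        return hist / max(hist.sum(), 1.0)
    if mode == "LCH":                                  # log-count histogram (log1p is an assumption)
        return np.log1p(hist)
    return hist                                        # count-based histogram

# Query term "car" against (car, rent, truck, bump, injunction, runway).
sims = [1, 0.2, 0.7, 0.3, -0.1, 0.1]
ch = matching_histogram(sims)              # reproduces [0, 1, 3, 1, 1] from the text
nh = matching_histogram(sims, mode="NH")
lch = matching_histogram(sims, mode="LCH")
```

The output length depends only on the bin size, not on the document length, which is what allows a fixed-size feed forward network to consume it.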
The first advantage of the matching histogram over the matching matrix is that, by setting exact matching as a separate bin, it clearly distinguishes the exact matching signals from similarity matching signals, while in a matching matrix all the signals are mixed together. Secondly, to solve the problem of variable size in the matching matrix, a zero-padding scheme is often adopted in previous methods [11]. However, the zero-padding scheme introduces additional interaction signals which may be unfair for short documents. In contrast, we map the variable-size interactions into a fixed-length matching histogram without introducing any additional signals.

Feed forward Matching Network: Based on the matching histogram above, we employ a feed forward matching network to learn the hierarchical matching patterns and produce a matching score for each query term. Since our model follows the approach of interaction-focused models, we discuss the major differences between the learning of our feed forward matching network and that in previous interaction-focused models.

Existing interaction-focused models, e.g., ARC-II and MatchPyramid, employ a CNN to learn hierarchical matching patterns over the matching matrix. These models are basically position-aware, using convolutional units with a local receptive field and learning positional regularities in matching patterns. This may be suitable for the image recognition task, and work well on semantic matching problems due to the global matching requirement (i.e., all the positions are important). However, it may not be suitable for the ad-hoc retrieval task, since such positional regularity may not exist in relevance matching due to the diverse matching requirement discussed in Section 3. Besides, since CNN parameters are position related, these models will treat both exact matching and similarity matching signals equally. Our deep relevance matching model, on the contrary, aims to extract hierarchical matching patterns from different levels of interaction signals rather than different positions. The position-free and strength-focused property makes it better at handling the diverse matching requirement in ad-hoc retrieval.
Meanwhile, since the matching histogram directly distinguishes exact matching signals from the rest, our model can naturally learn the importance of exact matching signals.

There have been some interaction-focused models that employ special pooling strategies to turn the position-aware interactions into strength-based fixed-length representations. For example, MV-LSTM [26] used a K-max pooling strategy [13] to select the top K strongest interaction signals from the matching matrix as the input of an MLP. However, such a pooling strategy simply truncates the signals and thus will be strongly biased towards long documents, since it is more likely for long documents to contain more strong signals. The pooling strategy is applied over the entire matching matrix in MV-LSTM, making it possible that the top K strongest signals all come from the interactions between a single query term and the document terms. In contrast, our model does not rely on any pooling strategy to truncate the interactions so that we can avoid these problems.

Term Gating Network: One significant difference of our model from existing interaction-focused models is that we employ a joint deep architecture at the query term level. In this way, our model can explicitly model query term importance. This is achieved by using the term gating network, which produces an aggregation weight for each query term controlling how much the relevance score on that query term contributes to the final relevance score. Specifically, we employ the softmax function as the gating function:

$$g_i = \frac{\exp(w_g x^{(q)}_i)}{\sum_{j=1}^{M} \exp(w_g x^{(q)}_j)}, \quad i = 1, \dots, M,$$

where $w_g$ denotes the weight vector of the term gating network and $x^{(q)}_i, i = 1, \dots, M$ denotes the $i$-th query term input. We tried different inputs for the gating function as follows:

Term Vector (TV): Inspired by the work [32] where term embeddings can be leveraged to learn the term weights in queries, we use query term vectors as the input of the gating function. In this method, $x^{(q)}_i$ denotes the $i$-th query term vector, and $w_g$ is a weight vector with the same dimensionality as the term vectors.
Inverse Document Frequency (IDF): An important signal of term importance in ad-hoc retrieval is the inverse document frequency. We also tried this simple but powerful signal in the gating function. In this method, $x^{(q)}_i$ denotes the inverse document frequency of the $i$-th query term, and $w_g$ reduces to a single parameter.

4.1 Model Training

Since the ad-hoc retrieval task is fundamentally a ranking problem, we employ a pairwise ranking loss such as hinge loss to train our deep relevance matching model. Given a triple $(q, d^+, d^-)$, where document $d^+$ is ranked higher than document $d^-$ with respect to query $q$, the loss function is defined as:

$$L(q, d^+, d^-; \Theta) = \max(0, 1 - s(q, d^+) + s(q, d^-))$$

where $s(q, d)$ denotes the predicted matching score for $(q, d)$, and $\Theta$ includes the parameters for the feed forward matching network and those for the term gating network. The optimization is relatively straightforward with standard backpropagation [29]. We apply the stochastic gradient descent method Adagrad [4] with mini-batches (20 in size), which can be easily parallelized on a single machine with multiple cores. For regularization, we find that the early stopping [9] strategy works well for our model.

5. EXPERIMENTS

In this section, we conduct experiments to demonstrate the effectiveness of our proposed model.

5.1 Data Sets

To conduct experiments, we use two TREC collections, Robust04 and ClueWeb-09-Cat-B. The details of the two collections are provided in Table 1. As we can see, they represent different sizes and genres of heterogeneous text collections. Robust04 is a small news dataset. Its topics are collected from TREC Robust Track 2004. ClueWeb-09-Cat-B, on the other hand, is a large Web collection, whose topics are accumulated from TREC Web Tracks 2009, 2010, and 2011. Note that ClueWeb-09-Cat-B is filtered to the set of documents with spam scores in the 60th percentile, using the Waterloo Fusion spam scores [3]. For both datasets, we made use of both the title and the description of each TREC topic in our experiments.

Table 1: Statistics of the TREC collections used in this study. The ClueWeb-09-Cat-B collection has been filtered to the set of documents in the 60th percentile of spam scores.

                     Robust04    ClueWeb-09-Cat-B
Vocabulary           0.6M        38M
Document Count       0.5M        34M
Collection Length    252M        26B
Query Count
The retrieval experiments described in this section are implemented using the Galago Search Engine. During indexing and retrieval, both documents and query words are white-space tokenized, lowercased, and stemmed using the Krovetz stemmer [15]. Stopword removal is performed on query words during retrieval using the INQUERY stop list [2].

5.2 Baselines and Experimental Settings

We adopt three types of baseline methods for comparison, including traditional retrieval models, representation-focused deep matching models and interaction-focused deep matching models.

Traditional retrieval models include:

QL: The query likelihood model based on Dirichlet smoothing [31] is one of the best performing language models.

BM25: The BM25 formula [22] is another highly effective retrieval model that represents the classical probabilistic retrieval model.

Representation-focused deep matching models include:

DSSM_T/DSSM_D: DSSM [12] is a state-of-the-art deep matching model for Web search. In the original paper, the model was evaluated based on <query, doc title> pairs where the doc title is extracted from the title field. We denote this model as DSSM_T. Since other baseline models and our model are based on the full text of the documents, we also evaluated the DSSM model under the same setting, denoted by DSSM_D. Since DSSM needs large scale training data due to its huge parameter size, we directly used the released model (trained on a large click-through dataset) in our experiments.

C-DSSM_T/C-DSSM_D: C-DSSM [23, 8] is a similar deep matching model to DSSM for Web search, replacing the feed forward neural network with a convolutional neural network. For the same reason as DSSM, we also made use of the released model directly and adopt two versions of the C-DSSM model, one based on title fields of documents, denoted as C-DSSM_T, and the other based on the whole document, denoted as C-DSSM_D.

ARC-I: ARC-I [11] is a general representation-focused deep matching model that has been tested on a set of NLP tasks including sentence completion, response matching, and paraphrase identification.
We implemented the ARC-I model according to the original paper since there is no publicly available code.

Interaction-focused deep matching models are as follows:

ARC-II: ARC-II [11] was proposed by the authors of the model ARC-I, but focuses on learning hierarchical matching patterns from local interactions using a CNN. We also implemented ARC-II since there is no publicly available code.

MP: MatchPyramid [19] is another state-of-the-art interaction-focused deep matching model and has been tested on two NLP tasks including paraphrase identification and paper citation matching. There are three variants of the model based on different interaction operators, denoted as MP_IND, MP_COS, and MP_DOT. We obtained the original implementation of the model from the authors for comparison.

We refer to our proposed deep relevance matching model as DRMM. With different types of histogram mapping functions (i.e., CH, NH and LCH) and term gating functions (i.e., TV and IDF), we obtained six different variants of our proposed model. For example, by DRMM_CH×IDF we refer to DRMM with count-based histogram and term gating network using inverse document frequency.

Term Embeddings: For all the models based on term embedding inputs, including ARC-I, ARC-II, MatchPyramid and DRMM, we used 300-dimensional term vectors trained with the Continuous Bag-of-Words (CBOW) Model [18] on the Robust04 and ClueWeb-09-Cat-B collections, respectively. Specifically, we used 10 as the context window size and used 10 negative samples and a subsampling of frequent words with sampling threshold of 10^-4 as suggested by Word2Vec. Each corpus was pre-processed by removing HTML tags and stemming. We also discarded from the vocabulary all the terms that occur less than 10 times in the corpus, which resulted in a vocabulary of size 0.1M and 4.1M on the Robust04 and ClueWeb-09-Cat-B collections, respectively. To address the out-of-vocabulary (OOV) terms (i.e., some rare terms or numbers not trained by CBOW) in queries, we follow the practice in previous work [14] to only allow exact matching between such query terms and document terms.

Network Configurations: For network configurations (e.g., numbers of layers and hidden nodes), we tune the hyper parameters on a validation set (as part of the training set). For ARC-I, ARC-II and MatchPyramid, we tried both the default configurations in their original papers and other settings. We find that models with fewer layers and feature maps perform better, probably due to the limited training data in TREC collections. Specifically, for ARC-I and ARC-II, we use 3-word windows, 64 feature maps and 6 layers (two for convolutions, two for max-pooling and two full connection). For MatchPyramid, we use one convolutional layer, one dynamic pooling layer and two full connection layers. The number of feature maps is 8 and the kernel size is set to be 3 × 3. For DRMM, we also use a four-layer architecture throughout all experiments, i.e., one histogram input layer (30 nodes), two hidden layers in the feed forward matching network (5 nodes and 1 node respectively), and one output layer (1 node) with the term gating network for the final matching score.

5.3 Evaluation Methodology

Given the limited number of queries for each collection, we conduct 5-fold cross-validation to minimize over-fitting without reducing the number of learning instances. Topics for each collection are randomly divided into 5 folds. The parameters for each model are tuned on 4-of-5 folds. The final fold in each case is used to evaluate the optimal parameters. This process is repeated 5 times, once for each fold.
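
As an illustration, the 5-fold protocol described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the code used in the experiments: the function names, the fixed random seed, and the `fit` and `evaluate` callables are hypothetical placeholders for model training on four folds and evaluation on the held-out fold.

```python
import random


def five_fold_splits(topic_ids, n_folds=5, seed=0):
    """Yield (train_topics, test_topics) pairs, one per held-out fold.

    Assumption: topics are shuffled once with a fixed seed and dealt
    round-robin into folds; the source only says the split is random.
    """
    topics = list(topic_ids)
    random.Random(seed).shuffle(topics)
    folds = [topics[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [t for i, fold in enumerate(folds) if i != held_out for t in fold]
        yield train, test


def cross_validate(topic_ids, fit, evaluate, n_folds=5, seed=0):
    """Average the fold-level evaluation values over all folds.

    `fit(train_topics)` and `evaluate(model, test_topics)` are
    hypothetical callables supplied by the caller.
    """
    scores = []
    for train, test in five_fold_splits(topic_ids, n_folds, seed):
        model = fit(train)                    # tune parameters on 4 of 5 folds
        scores.append(evaluate(model, test))  # score on the remaining fold
    return sum(scores) / len(scores)
```

Every topic is held out exactly once, so the averaged value uses all topics for testing while each fold-level model never sees its own test topics during tuning.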
Mean average precision (MAP) is the optimized metric for all retrieval models. Throughout this paper each displayed evaluation statistic is the average of the five fold-level evaluation values. For evaluation, the top-ranked 1,000 documents are compared using the mean average precision (MAP), normalized discounted cumulative gain at rank 20 (nDCG@20), and precision at rank 20 (P@20). Statistical differences between models are computed using the Fisher randomization test [24] (α = 0.05).

Note that for all the deep matching models, we adopt a re-ranking strategy for efficient computation. An initial retrieval is performed using the QL model to obtain the top 2,000 ranked documents. We then use the deep matching models to re-rank these top results. The top-ranked 1,000 documents are then used for comparison.

5.4 Retrieval Performance and Analysis

This section presents the performance results of different retrieval models over the two benchmark datasets. A summary of results is displayed in Table 2.

As we can see, all the representation-focused models perform significantly worse than the traditional retrieval models, demonstrating the unsuitability of these models for relevance matching. Both DSSM_T and C-DSSM_T can work better than their counterparts on the whole document on ClueWeb-09-Cat-B, showing that models designed for the global matching requirement cannot handle the diverse matching requirement in long documents. Note that we do not report the performance of DSSM_T and C-DSSM_T on Robust04 since there is no title field in many subsets in this collection. The ARC-I model, although trained on the corresponding corpus, performs even worse than DSSM and C-DSSM. A possible reason is that ARC-I concatenates the query and document representation for computing the matching score, which may be less effective than the cosine function in DSSM and C-DSSM.

When we look at the interaction-focused models, we find that these baseline models cannot compete with the traditional retrieval models either.
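
For readers less familiar with the three measures named in Section 5.3, the following minimal sketch shows one common way to compute them for a ranked list. It is an illustration under assumptions rather than the evaluation code behind the reported numbers: the function names are hypothetical, relevance labels are assumed to be non-negative integers, and the nDCG variant shown (gain 2^rel - 1 with a log2 rank discount) is one of several formulations in use.

```python
import math


def precision_at_k(ranked_rels, k=20):
    """Fraction of the top-k ranked documents that are relevant (label > 0)."""
    return sum(1 for r in ranked_rels[:k] if r > 0) / k


def average_precision(ranked_rels, n_relevant, cutoff=1000):
    """Average precision for one query over the top `cutoff` documents.

    `n_relevant` is the total number of relevant documents for the query,
    including any that were not retrieved.
    """
    if n_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_rels[:cutoff], start=1):
        if rel > 0:
            hits += 1
            total += hits / rank  # precision at each relevant rank
    return total / n_relevant


def dcg_at_k(rels, k):
    """Discounted cumulative gain (assumed variant: 2^rel - 1 gain, log2 discount)."""
    return sum((2 ** r - 1) / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))


def ndcg_at_k(ranked_rels, all_judged_rels, k=20):
    """DCG of the ranking divided by the DCG of an ideal ordering of the judgments."""
    ideal = dcg_at_k(sorted(all_judged_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal > 0 else 0.0


def mean_average_precision(runs):
    """MAP over queries; `runs` is a list of (ranked_rels, n_relevant) pairs."""
    aps = [average_precision(rels, n) for rels, n in runs]
    return sum(aps) / len(aps)
```

Here `ranked_rels` is the list of relevance labels in ranked order, so a re-ranked list of 1,000 documents would be passed in as 1,000 labels with unjudged documents treated as 0.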
Among these models, ARC-II can outperform ARC-I by directly learning from local interactions, but performs worse than MatchPyramid models due to the indirect local interactions (i.e., local interaction is based on the weighted sum of query and document term vectors rather than cosine similarity or dot product), which is consistent with previous results in [11, 19]. Moreover, the best performing interaction-focused model, MP_COS, can consistently outperform all the representation-focused models on both test collections. When comparing the MatchPyramid models, we find that both MP_IND and MP_COS perform much better than MP_DOT. Note that MP_IND is purely based on exact matching signals, while MP_COS and MP_DOT involve both exact and similarity matching signals; exact matching signals are always stronger than similarity signals in MP_COS, but this may not be true in MP_DOT. The performance gap between MP_DOT and the other two MPs indicates the importance of the exact matching signals in relevance matching. In fact, when evaluated on the semantic matching tasks in [19], MP_DOT performed better than the other two MPs even though it cannot differentiate the exact matching signals from the rest, demonstrating the significant differences between semantic matching and relevance matching.

As for our proposed DRMMs, we have the following observations: (1) NH-based models perform significantly worse than CH-based models, while LCH-based models achieve the best performance on both collections. The low performance of NH-based models may be related to the loss of document length information after normalization, which is important in ad-hoc retrieval [6]. Meanwhile, the good performance of LCH-based models indicates that deep neural networks can benefit from input signals with reduced range and from a nonlinear transformation useful for learning multiplicative relationships [1]; (2) The term gating function based on inverse document frequency works better than that based on term vectors. There are two possible reasons for this result.
Firstly, term vectors do not contain sufficient information for the term importance. Secondly, the learning of the model might be dominated by the term gating network when we use term vectors as the input since there are more parameters (i.e., 300 parameters) in the gating network compared to the feed forward matching network (i.e., 155 parameters).

Finally, we can see that the best performing DRMM (i.e., DRMM_LCH×IDF) is significantly better than all the existing deep matching models as well as traditional retrieval models. For example, on ClueWeb-09-Cat-B topic titles, the relative improvement of our model over the best performing baseline (i.e., BM25) is about 11.9%, 14.7%, and 12% in terms of MAP, nDCG@20 and P@20, respectively. Another interesting finding is that on the Robust04 collection, the performance of DRMM_LCH×IDF on topic descriptions can be comparable to that on topic titles, which is seldom observed on previous models. This also demonstrates the potential of our model in handling long queries in ad-hoc retrieval.

Table 2: Comparison of different retrieval models over the Robust-04 and ClueWeb-09-Cat-B collections. Significant improvement or degradation with respect to QL is indicated (+/-) (p-value ≤ 0.05).
Columns: Model Type; Model Name; MAP, nDCG@20 and P@20 on topic titles; MAP, nDCG@20 and P@20 on topic descriptions.
Robust-04 collection
  Traditional Retrieval Baselines: QL, BM25
  Representation-Focused Matching Baselines: DSSM_D, C-DSSM_D, ARC-I
  Interaction-Focused Matching Baselines: ARC-II, MP_IND, MP_COS, MP_DOT
  Our Approach: DRMM_CH×TV, DRMM_NH×TV, DRMM_LCH×TV, DRMM_CH×IDF, DRMM_NH×IDF, DRMM_LCH×IDF
ClueWeb-09-Cat-B collection
  Traditional Retrieval Baselines: QL, BM25
  Representation-Focused Matching Baselines: DSSM_T, DSSM_D, C-DSSM_T, C-DSSM_D, ARC-I
  Interaction-Focused Matching Baselines: ARC-II, MP_IND, MP_COS, MP_DOT
  Our Approach: DRMM_CH×TV, DRMM_NH×TV, DRMM_LCH×TV, DRMM_CH×IDF, DRMM_NH×IDF, DRMM_LCH×IDF

5.5 Analysis on DRMM model

We conducted experiments to verify the effectiveness of different components in the DRMM and analyze the effect of term embedding dimensions. Through these experiments, we try to gain a better understanding of the DRMM.

5.5.1 Impact of Different Model Components

To study the effect of different model components, we compare the original DRMM_LCH×IDF with several simpler versions of the model. Firstly, we removed the term gating network and used a simple sum to aggregate the scores from all the query terms. Since the aggregation weight is uniform, we denote this model as DRMM_LCH×UNI. We also tried removing the histogram mapping layer but kept the rest unchanged.
To turn the variable-length local interactions into a fixed-length representation, we adopted two pooling strategies. One is dynamic pooling as in [25, 19] which keeps the position information, and the other is K-max pooling as in [26] which turns the positional signals into strength related signals. For a fair comparison, we require the size of the representation after pooling to be the same as the size of the matching histogram (i.e., 30). Note that although the matching network structure is the same, the learned model is significantly different due to the change of the input. The matching model based on dynamic pooling is a position-aware model, while the model based on K-max pooling is learned with respect to the top strong interaction signals. We denote the former model as DRMM_DYN×IDF and the latter as DRMM_KMAX×IDF.

Figure 3: Comparison of several simpler versions of DRMM over topic titles of the two test collections in terms of MAP. Legend: DRMM_LCH×IDF, DRMM_LCH×UNI, DRMM_DYN×IDF, DRMM_KMAX×IDF.

Table 3: Performance comparison of DRMM over different dimensionality of term embeddings trained by CBOW on the Robust04 collection.
Columns: Topic; Embedding; MAP.
  Titles: CBOW-50d, CBOW-100d, CBOW-300d, CBOW-500d
  Descriptions: CBOW-50d, CBOW-100d, CBOW-300d, CBOW-500d

The comparison results over the topic titles on the two test collections in terms of MAP are depicted in Figure 3. As we can see, without the term gating network, DRMM_LCH×UNI performs slightly worse than the original DRMM. Specifically, the relative MAP drop of DRMM_LCH×UNI compared with DRMM_LCH×IDF is about 6.8% and 3.5% on Robust04 and ClueWeb-09-Cat-B, respectively. The results demonstrate the effectiveness of the differentiation of query term importance in relevance matching. Besides, we find that DRMM_DYN×IDF, based on position-related signals, performs significantly worse than the other two models based on strength-related signals (i.e., DRMM_LCH×IDF and DRMM_KMAX×IDF). The results indicate that ad-hoc retrieval is more likely to be a strength-related task rather than a position-related task. When comparing DRMM_KMAX×IDF and the original DRMM_LCH×IDF, we find that DRMM_KMAX×IDF works quite well on Robust04 but fails on ClueWeb-09-Cat-B. The possible reason is that the document length variation on Web data (i.e., ClueWeb-09-Cat-B) is much larger than that on news data (i.e., Robust04), leading to the failure of the K-max pooling method which has potential bias towards very long documents.
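
To illustrate how a histogram-based input and K-max pooling respond differently to document length, the following minimal sketch maps a variable-length list of query-term/document-term cosine similarities to a fixed length of 30 in both ways. The bin layout (29 equal-width bins over [-1, 1) plus one bin reserved for exact matches), the use of log(1 + count), the zero padding, and the function names are assumptions made for illustration; they are not details taken from the experiments above.

```python
import math


def log_count_histogram(sims, n_bins=30):
    """Map cosine similarities to a fixed-length vector of log-counts.

    Assumed layout: n_bins - 1 equal-width bins over [-1, 1) and a final
    bin for exact matches (similarity >= 1). log1p avoids log(0).
    """
    counts = [0] * n_bins
    width = 2.0 / (n_bins - 1)
    for s in sims:
        if s >= 1.0:
            counts[-1] += 1  # exact-match bin
        else:
            idx = int((s + 1.0) / width)
            counts[max(0, min(idx, n_bins - 2))] += 1
    return [math.log1p(c) for c in counts]


def k_max_pooling(sims, k=30, pad=0.0):
    """Keep the k strongest similarities in descending order, zero-padded."""
    top = sorted(sims, reverse=True)[:k]
    return top + [pad] * (k - len(top))


if __name__ == "__main__":
    short_doc = [1.0, 0.2, -0.3]  # hypothetical similarities for one query term
    long_doc = short_doc * 100    # same signals repeated in a much longer document
    print(log_count_histogram(long_doc))
    print(k_max_pooling(long_doc))
```

In this toy case the K-max vector for the long document consists of thirty copies of the strongest similarity, so the weaker signals disappear, whereas the histogram still records how many matches fell into each similarity range. This is only meant to make the length-bias explanation above concrete.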
The weakness of K-max pooling on Web data further demonstrates the effectiveness of our matching histogram mapping and the corresponding histogram based feed forward matching network.

5.5.2 Impact of Term Embeddings

Since we leverage a priori learned term embeddings in our model, we further study the effect of embedding dimensionality on the retrieval performance. Here we report the performance results on the Robust04 collection using term embeddings trained by the CBOW model with 50, 100, 300, and 500 dimensions, respectively. As shown in Table 3, the performance first increases and then slightly drops with the increase of dimensionality. Term embeddings of different dimensionality provide different granularity of semantic similarity; they may also require different amounts of training data. With lower dimensionality, the similarity between term embeddings might be coarse and hurt the relevance matching performance. However, with larger dimensionality, one may need more data to train reliable term embeddings. Our results suggest that 300 dimensions is sufficient for learning term embeddings effective for relevance matching on the Robust04 collection.

6. RELATED WORK

By formalizing ad-hoc retrieval as a text matching problem, deep matching models can be applied to this task so that features can be automatically acquired in an end-to-end way. In recent years, a variety of deep matching models have been proposed for text matching problems. As mentioned before, we can categorize the existing deep matching models into two major types, namely representation-focused models and interaction-focused models. We have described several representative deep matching models in these two classes in previous sections, including DSSM, C-DSSM, ARC-I, ARC-II and MatchPyramid. Here we will discuss some other related work in this direction.

In the class of representation-focused models, Qiu et al. [21] proposed the Convolutional Neural Tensor Network (CNTN) for community-based question answering. The CNTN model is similar to ARC-I, using a CNN to build the representations for each piece of text.
The major difference between CNTN and ARC-I is that CNTN employs a tensor layer rather than an MLP on top of the two CNNs to compute the matching score between the two pieces of text. In [25], Socher et al. proposed an Unfolding Recursive Autoencoder (uRAE) for paraphrase identification. They first employed recursive autoencoders to build the hierarchical compositional text representations based on syntactic trees, and then conducted matching at different levels for the identification task. In [30], Yin et al. introduced MultiGranCNN, which employs a CNN to obtain hierarchical representations of texts, and then computes the matching score based on the interactions between these multigranular representations.

In the class of interaction-focused models, Wang et al. [28] proposed Deep Match Tree (DeepMatch_tree) for the short text matching problem. Different from DeepMatch [17], which builds local interactions between texts based on semantic topics, DeepMatch_tree defines interactions in the product space of dependency trees. A deep neural network is then leveraged for making a matching decision on the two short texts, on the basis of these local interactions. In [27], Wan et al. introduced Match-SRNN to model the recursive matching structure in the local interactions so that long-distance dependency between the interactions can be captured. The proposed model was evaluated on two tasks, including community based question answering and paper citation matching.

Most of these deep matching models are designed for the semantic matching problem, which is significantly different from the relevance matching problem in ad-hoc retrieval. In this work, we introduce a model specifically designed for the relevance matching problem.

7. CONCLUSIONS

In this paper, we point out that there are significant differences between semantic matching for many NLP tasks and relevance matching for the ad-hoc retrieval task. Many existing deep matching models designed for the semantic matching problem thus may not fit the ad-hoc retrieval task. Based on this analysis, we propose a novel deep relevance matching model for ad-hoc retrieval, by explicitly addressing the three factors in relevance matching. The proposed model contains three major components, i.e., matching histogram mapping, a feed forward matching network, and a term gating network. Experimental results on two representative benchmark datasets show that our model can significantly outperform traditional retrieval models as well as state-of-the-art deep matching models.

For future work, we would like to leverage larger training data, e.g., click-through logs, to train a deeper DRMM so that we can further explore the potential of the proposed model on ad-hoc retrieval. We may also include phrase embeddings so that phrases can be treated as a whole rather than as separate terms. In this way, we expect the local interactions can better reflect the meaning by using the proper semantic units in language, leading to better retrieval performance.

8. ACKNOWLEDGMENTS

This work was supported in part by the Center for Intelligent Information Retrieval, in part by the 973 Program of China under Grant No. 2014CB and 2013CB329606, in part by the National Natural Science Foundation of China, and in part by the Youth Innovation Promotion Association CAS.

9. REFERENCES

[1] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In ICML. ACM.
[2] J. P. Callan, W. B. Croft, and J. Broglio. Trec and tipster experiments with inquery. IPM, 31(3).
[3] G. V. Cormack, M. D. Smucker, and C. L. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. Information retrieval, 14(5).
[4] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12.
[5] H. Fang, T. Tao, and C. Zhai. A formal study of information retrieval heuristics. In SIGIR. ACM.
[6] H. Fang, T. Tao, and C. Zhai. Diagnostic evaluation of information retrieval models. TOIS, 29(2):7.
[7] H. Fang and C. Zhai. Semantic term matching in axiomatic approaches to information retrieval. In SIGIR. ACM.
[8] J. Gao, P. Pantel, M. Gamon, X. He, L. Deng, and Y. Shen. Modeling interestingness with deep neural networks. EMNLP, October.
[9] R. C. S. L. L. Giles. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In NIPS, volume 13, page 402. MIT Press.
[10] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, 29(6):82-97.
[11] B. Hu, Z. Lu, H. Li, and Q. Chen. Convolutional neural network architectures for matching natural language sentences. In NIPS.
[12] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM. ACM.
[13] N. Kalchbrenner, E. Grefenstette, and P. Blunsom. A convolutional neural network for modelling sentences. arXiv preprint.
[14] T. Kenter and M. de Rijke. Short text similarity with word embeddings. In CIKM. ACM.
[15] R. Krovetz. Viewing morphology as an inference process. In SIGIR. ACM.
[16] Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995.
[17] Z. Lu and H. Li. A deep architecture for matching short texts. In NIPS.
[18] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS.
[19] L. Pang, Y. Lan, J. Guo, J. Xu, S. Wan, and X. Cheng. Text matching as image recognition.
[20] J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In EMNLP.
[21] X. Qiu and X. Huang. Convolutional neural tensor network architecture for community-based question answering. In IJCAI.
[22] S. E. Robertson and S. Walker. Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In SIGIR. ACM.
[23] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil. Learning semantic representations using convolutional neural networks for web search. In WWW.
[24] M. D. Smucker, J. Allan, and B. Carterette. A comparison of statistical significance tests for information retrieval evaluation. In CIKM. ACM.
[25] R. Socher, E. H. Huang, J. Pennin, C. D. Manning, and A. Y. Ng. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS, 2011.
