Learning to Reweight Terms with Distributed Representations

Size: px

Start display at page:

Download "Learning to Reweight Terms with Distributed Representations"

Garry Lane
5 years ago
Views:

1 Learning to Reweight Terms with Distributed Representations School of Computer Science Carnegie Mellon University August 12, 215

2 Outline Goal: Assign weights to query terms for better retrieval results Motivation Method Evaluation Conclusions

3 Motivation Assigning weights to terms is not a new problem score(q, d) = t q s(t, q, C)f(t, d) (1) Brief summary of existing representative methods: Type Method Performance Features Model Constant bad - - Unsupervised IDF average - - Supervised LtR-methods manual, good non-convex (Rank measure) (WSD) external Supervised Zhao et al average manual requires SVD (Term recall) Can we have something that is directly term weight supervised, can lead to good retrieval performance without manual / expensive feature construction?

4 Probably yes? What might be a good choice of term weights? Term recall The probability of term t of query q appears in relevant documents R q, can be estimated as Rq,t R q if relevance judgments are available True term recall proves to be highly effective as shown by Zhao et al[3, 2] But which features are good to predict that?

5 Distributed word vectors Distributed word vectors learnt from a large corpora can be used to effectively measure semantic similarity vec(king) vec(man) vec(queen) vec(woman) Another important characteristic of word vector is the addition property, i.e., it s often semantically meaningful to construct word vectors for phrases from those of individual terms, such as representing vec(chinese river) vec(chinese) + vec(river)

6 Idea in this work Intuitively, term recall measures the relative importance of a term w.r.t the whole query in determining document-query relevance With the addition property of word vectors, we can construct feature vectors for terms from their word vector representations, which resembles the above intuition of term recall For a term t in query q, we construct the feature vector x(t) given the word vector representation w(t) as x(t) vec(t) vec(q) where vec(q) is the average word vector of all the terms in q. The above defined feature vector is essentially the offset vector of a term to the center of the entire query

7 Feature vector illustration Figure: Demonstration of transforming global 2-dimensional word vectors into feature vectors local to the query.

8 Term recall learning with feature vectors Model term recall as a function of the feature vectors defined Preprocessing: transform term recall r from [,1] to y R by a logit function y = logit(r) Models the transformed term recall y as a linear function of the feature vectors x Employ l 1 norm regularization in learning (l 2 norm also tried) LASSO optimization: 1 ˆβ = arg min β R p 2 M n i (y ij β x ij ) 2 + λ β 1 (2) i=1 j=1 Predict term recall for testing query term with feature vector x as ( ) P(t R) = sigmoid ˆβ x = exp{ ˆβ x } 1 + exp{ ˆβ x } (3)

9 Incorporating predicted weights to query models (I) Base query models: Bag-of-Words (BOW): apple pie recipe Sequential Dependency model (SD): #weight(.8 #combine( apple pie recipe ).1 #combine( #1(apple pie) #1(pie recipe) ).1 #combine( #uw8(apple pie) #uw8(pie recipe) ) )

10 Incorporating predicted weights to query models (II) Term recall enhanced query models: DeepTR-BOW: #weight( P(apple R) apple P(pie R) pie P(recipe R) recipe ) DeepTR-SD: #weight(.8 #weight( P(apple R) apple P(pie R) pie P(recipe R) recipe ).1 #combine(#1(apple pie) #1(pie recipe) ).1 #combine( #uw8(apple pie) #uw8(pie recipe) ) )

11 Evaluation Data sets Collection # Docs # Words TREC Topics ROBUST4 528K 253M , 61-7 WT1g 1,692K 1,76M GOV2 25,25K 24,7M ClueWeb9B 29,38K 23,89M 1-2 Baseline query models: Bag of Words (BOW), Sequential Dependency model (SD), Weighted Sequential Dependency model (WSD) [1] Retrieval models: Language model and BM25 Evaluation metrics: P@1, MAP for all collections, NDCG@1 and ERR@2 for collections with graded relevance judgments

12 Evaluation Word vector source: trained from all above collections with dimensions ranging from 1, 3, 5 pre-trained by Google trained from Google News (1 billion words) with dimension 3

13 Experimental results Retrieval results of DeepTR-BOW with language model (dimension=3) Query Model ROBUST4 WT1g GOV2 ClueWeb9B MAP MAP MAP MAP BOW SD WSD DeepTR-BOW.2591 b b.718 (Corpus) (+3.2/-1.9) (+8.2/+3.5) (+6.3/-1.6) (+2.2/-3.6) DeepTR-BOW.265 b.2111 b.2646 b.741 (GOV2) (+5.5/+.3) (+8.7/+3.9) (+6.3/-1.6) (+5.6/-.5) DeepTR-BOW.2657 b b.718 (ClueWeb9B) (+5.8/+.5) (+9.6/+4.8) (+7.9/-.1) (+2.2/-3.6) DeepTR-BOW.2673 b.2122 b.263 b.732 (Google) (+6.4/+1.2) (+9.3/+4.5) (+5.7/-2.2) (+4.2/-1.8) b : Statistically significant difference with BOW s : Statistically significant difference with SD Weighted BoW queries significantly outperforms unweigted BoW queries and are even comparable to SD queries.

14 Experimental results (cont.) Retrieval results of DeepTR-SD with language model (dimension=3) Query Model ROBUST4 WT1g GOV2 ClueWeb9B MAP MAP MAP MAP BOW SD WSD DeepTR-SD.2754 b s.2182 b s.2831 b s.748 (Corpus) (+9.6/+4.2) (+12.3/+7.4) (+13.8/+5.3) (+6.5/+.5) DeepTR-SD.2781 b s.2223 b s.2831 b s.86 b s (GOV2) (+1.7/+5.2) (+14.4/+9.4) (+13.8/+5.3) (+14.8/+8.2) DeepTR-SD.281 b s.2279 b s.289 b s.748 (ClueWeb9B) (+11.9/+6.3) (+17.3/+12.1) (+16.2/+7.5) (+6.5/+.5) DeepTR-SD.2842 b s.2256 b s.2814 b s.78 b s (Google) (+13.1/+7.5) (+16.1/+11.) (+13.1/+4.7) (+11./+4.7) b : Statistically significant difference with BOW s : Statistically significant difference with SD DeepTR-SD significantly outperforms both BoW and SD queries.

15 Experimental results (cont.) Retrieval results of DeepTR-BOW with BM25 (dimension=3) Query Model ROBUST4 WT1g GOV2 ClueWeb9B MAP MAP MAP MAP BOW SD DeepTR-BOW.2566 b.193 b.243 b.593 b (Corpus) (+4.3/+.8) (+6.7/+4.7) (+4.1/+.9) (+4.7/-1.2) DeepTR-BOW.2561 b.191 b.243 b.596 b (GOV2) (+4.1/+.6) (+7.1/+5.1) (+4.1/+.9) (+5.3/-.7) DeepTR-BOW.259 b.1932 b.2462 b.593 b (ClueWeb9B) (+5.3/+1.8) (+8.3/+6.3) (+5.5/+2.2) (+4.7/-1.2) DeepTR-BOW.26 b.1922 b.2449 b.584 (Google) (+5.7/+2.1) (+7.8/+5.7) (+4.9/+1.7) (+3.1/-2.7) b : Statistically significant difference with BOW s : Statistically significant difference with SD Similar results observed for BM25 retrieval.

16 Experimental results (cont.) Retrieval results of DeepTR-SD with BM25 (dimension=3) Query Model ROBUST4 WT1g GOV2 ClueWeb9B MAP MAP MAP MAP BOW SD DeepTR-SD.2595 b s.1944 b s.2478 b s.613 b s (Corpus) (+5.5/+1.9) (+9./+7.) (+6.1/+2.9) (+8.2/+2.1) DeepTR-SD.269 b s.1941 s.2478 b s.616 b s (GOV2) (+6.1/+2.5) (+8.8/+6.8) (+6.1/+2.9) (+8.8/+2.6) DeepTR-SD.263 b s.1951 b s.252 b s.613 b s (ClueWeb9B) (+6.9/+3.3) (+9.4/+7.3) (+7.2/+3.9) (+8.2/+2.1) DeepTR-SD.2627 b s.1938 s.2461 b.64 (Google) (+6.8/+3.2) (+8.7/+6.7) (+5.4/+2.2) (+6.8/+.7) b : Statistically significant difference with BOW s : Statistically significant difference with SD Similar results observed for BM25 retrieval.

17 Word vector dimensionality vectors trained from ClueWeb9B, Language Model retrieval Query Model ROBUST4 WT1g GOV2 ClueWeb9B MAP MAP MAP MAP BOW SD WSD DeepTR-BOW.2681 b.2239 b.2667 b.727 (1) (+6.7/+1.4) (+15.3/+1.2) (+7.2/-.8) (+3.6/-2.4) DeepTR-BOW.2657 b b.718 (3) (+5.8/+.5) (+9.6/+4.8) (+7.9/-.1) (+2.2/-3.6) DeepTR-BOW.2679 b.2179 b.2655 b.726 (5) (+6.6/+1.4) (+12.1/+7.2) (+6.7/-1.2) (+3.4/-2.5) DeepTR-SD.2851 b s.2373 b s.2848 b s.778 b s (1) (+13.5/+7.9) (+22.1/+16.8) (+14.4/+5.9) (+1.8/+4.5) DeepTR-SD.281 b s.2279 b s.289 b s.748 (3) (+11.9/+6.3) (+17.3/+12.1) (+16.2/+7.5) (+6.5/+.5) DeepTR-SD.2817 b s.236 b s.286 b s.781 b s (5) (+12.2/+6.6) (+18.7/+13.5) (+14.9/+6.4) (+11.3/+4.9) b : Statistically significant difference with BOW s : Statistically significant difference with SD d = 1 works slightly better

18 Experimental results (cont.) ROBUST BOW SD DeepTR SD (ROBUST4 3) DeepTR SD (WT1g 3) DeepTR SD (GOV2 3) DeepTR SD (ClueWeb9B 1) DeepTR SD (ClueWeb9B 3) DeepTR SD (ClueWeb9B 5) DeepTR SD (Google 3).25 MAP LM BM25

19 WT1g BOW SD DeepTR SD (ROBUST4 3) DeepTR SD (WT1g 3) DeepTR SD (GOV2 3) DeepTR SD (ClueWeb9B 1) DeepTR SD (ClueWeb9B 3) DeepTR SD (ClueWeb9B 5) DeepTR SD (Google 3) MAP LM BM25

20 GOV2 BOW SD DeepTR SD (ROBUST4 3) DeepTR SD (WT1g 3) DeepTR SD (GOV2 3) DeepTR SD (ClueWeb9B 1) DeepTR SD (ClueWeb9B 3) DeepTR SD (ClueWeb9B 5) DeepTR SD (Google 3) MAP LM BM25

21 ClueWeb9B.1 BOW SD DeepTR SD (ROBUST4 3) DeepTR SD (WT1g 3) DeepTR SD (GOV2 3) DeepTR SD (ClueWeb9B 1) DeepTR SD (ClueWeb9B 3) DeepTR SD (ClueWeb9B 5) DeepTR SD (Google 3) MAP.5 LM BM25

22 Robustness Number of queries ROBUST4 SD DeepTR SD (trec7) DeepTR SD (wt1g) DeepTR SD (gov2) DeepTR SD (clueweb1) DeepTR SD (clueweb3) DeepTR SD (clueweb5) DeepTR SD (google3) 2 1 < 1% [ 1%, 8%) [ 8%, 6%) [ 6%, 4%) [ 4%, 2%) [ 2%, ) (, 2%) [2%, 4%) [4%, 6%) [6%, 8%) [8%, 1%] >1% Relative Learning gain to Reweight Terms with Distributed Representations

23 Number of queries WT1g SD DeepTR SD (trec7) DeepTR SD (wt1g) DeepTR SD (gov2) DeepTR SD (clueweb1) DeepTR SD (clueweb3) DeepTR SD (clueweb5) DeepTR SD (google3) 1 5 < 1% [ 1%, 8%) [ 8%, 6%) [ 6%, 4%) [ 4%, 2%) [ 2%, ) Relative gain (, 2%) [2%, 4%) [4%, 6%) [6%, 8%) [8%, 1%] >1%

24 Number of queries GOV2 SD DeepTR SD (trec7) DeepTR SD (wt1g) DeepTR SD (gov2) DeepTR SD (clueweb1) DeepTR SD (clueweb3) DeepTR SD (clueweb5) DeepTR SD (google3) 1 < 1% [ 1%, 8%) [ 8%, 6%) [ 6%, 4%) [ 4%, 2%) [ 2%, ) Relative gain (, 2%) [2%, 4%) [4%, 6%) [6%, 8%) [8%, 1%] >1%

25 Number of queries ClueWeb9B SD DeepTR SD (trec7) DeepTR SD (wt1g) DeepTR SD (gov2) DeepTR SD (clueweb1) DeepTR SD (clueweb3) DeepTR SD (clueweb5) DeepTR SD (google3) 1 5 < 1% [ 1%, 8%) [ 8%, 6%) [ 6%, 4%) [ 4%, 2%) [ 2%, ) Relative gain (, 2%) [2%, 4%) [4%, 6%) [6%, 8%) [8%, 1%] >1%

26 Query length Relative improvement over LM Relative improvement over LM ROBUST4 SD DeepTR SD Len 3 (7%) Len 4 5 (27%) Len 6+ (66%) GOV2 SD DeepTR SD Relative improvement over LM Relative improvement over LM WT1g SD DeepTR SD Len 3 (26%) Len 4 5 (39%) Len 6+ (35%) ClueWeb9B SD DeepTR SD Len 3 (11%) Len 4 5 (41%) Len 6+ (48%) Len 3 (26%) Len 4 5 (4%) Len 6+ (34%)

27 Graded relevance Query Model Language Model Retrieval GOV2 ClueWeb9B BOW SD DeepTR-SD.4121 b.1626 b.1743 b.1312 s (GOV2) DeepTR-SD.4146 b.1614 b (ClueWeb9B) DeepTR-SD.4112 b (Google) Query Model BM25 Retrieval GOV2 ClueWeb9B NDCG@2 ERR@2 NDCG@2 ERR@2 BOW SD DeepTR-SD.48 b s.159 b s (GOV2) DeepTR-SD.433 b s.1587 b s (ClueWeb9B) DeepTR-SD.41 b s.1586 s (Google) b : Statistically significant difference with BOW s : Statistically significant difference with SD

28 Conclusions and future work Novel (simple) framework for learning term weightes from features directly constructed from distributed word vectors Extensive experiments with two retrieval models on four test collections demonstrates the effectiveness Completing the table: Type Method Performance Features Model Unsupervised Constant bad - - IDF average - - Supervised LtR-methods manual, good (Rank measure) (WSD) external non-convex Supervised (Term recall) Zhao et al average manual requires SVD Supervised Proposed work good automatic convex, cheap (Term recall) Future directions include predicting term recalls for bigrams and proximity terms, other possible ways to construct feature vectors from distributed vectors, using word vectors for query expansion, theoretical justifications, etc.

29 Acknowledgements and QA Reproducibility: codes, preprocessed queries, experimental setups are available Contact: Thanks to SIGIR for the student travel grant Questions?

A Deep Relevance Matching Model for Ad-hoc Retrieval

A Deep Relevance Matching Model for Ad-hoc Retrieval Jiafeng Guo 1, Yixing Fan 1, Qingyao Ai 2, W. Bruce Croft 2 1 CAS Key Lab of Web Data Science and Technology, Institute of Computing Technology, Chinese