Attentive Neural Architecture for Ad-hoc Structured Document Retrieval

Size: px

Start display at page:

Download "Attentive Neural Architecture for Ad-hoc Structured Document Retrieval"

Lynette Burns
5 years ago
Views:

Nikolaev 1,2 1 Textual Data Analytics Lab, Department of

1 Attentive Neural Architecture for Ad-hoc Structured Document Retrieval Saeid Balaneshin 1 Alexander Kotov 1 Fedor Nikolaev 1,2 1 Textual Data Analytics Lab, Department of Computer Science, Wayne State University 2 Kazan Federal University 1/25

2 Ad-hoc Structured (Multi-field) Document Retrieval IR research traditionally views documents as holistic and homogeneous units of text The task of retrieving structured (multi-field) documents arises in many information access scenarios: Entity retrieval from knowledge graph(s) Web document retrieval Product search in e-commerce 2/25

3 Entity Retrieval from Knowledge Graph(s) Names Attributes Categories Similar Entity Names Related Entity Names 3/25

4 Entity Retrieval from Knowledge Graph(s) Names Attributes Categories Similar Entity Names Related Entity Names 3/25

5 Entity Retrieval from Knowledge Graph(s) Names Attributes Categories Similar Entity Names Related Entity Names 3/25

6 Entity Retrieval from Knowledge Graph(s) Names Attributes Categories Similar Entity Names Related Entity Names 3/25

7 Product Search Title Description Attributes 4/25

8 Product Search Title Description Attributes 4/25

9 Product Search Title Description Attributes 4/25

10 Product Search Title Description Attributes 4/25

11 Web Search Title Texts in Large Font Contents Incoming Hyper-links Document Meta-data Alternative Texts for Images 5/25

12 Web Search Title Texts in Large Font Contents Incoming Hyper-links Document Meta-data Alternative Texts for Images 5/25

13 Web Search Title Texts in Large Font Contents Incoming Hyper-links Document Meta-data Alternative Texts for Images 5/25

14 Web Search Title Texts in Large Font Contents Incoming Hyper-links Document Meta-data Alternative Texts for Images 5/25

15 Document vs. Structured Document Retrieval Document Retrieval - relevance is quantified by aggregating heuristics calculated at the document or collection level (# of occurrences and proximity of query terms, IDF, document length) Structured Document Retrieval - requires strategies for aggregating heuristics calculated at the level of document fields into the matching score of an entire document - effective for retrieving documents with lexically similar, but semantically diverse fields 6/25

16 Importance of Document Fields Aggregation of field-level statistics of query terms in structured document retrieval is informed by a relative importance of document fields, which depends on: properties or semantics of document fields: e.g. a query term matched in a section of a Web page, which is in larger font, should have a different importance than a query term matched in other sections query intent: e.g. in the query attractive outdoor light with security features attractive refers to product description, outdoor light to product name and security features to product attributes 7/25

17 Mixture of Language Models (MLM) [Ogilvie and Callan, SIGIR 03] Document D with F fields is ranked w.r.t query Q according to: P(Q D) rank = q i Q P(q i θ D ) n(q i,q) where F P(q i θ D ) = w j P(q i θ j ) j=1 8/25

18 Fielded Sequential Dependence Model (FSDM) [Zhiltsov et al., SIGIR 15] Extends SDM to the case of structured document retrieval (i.e. accounts for both unigram and sequential bigram concepts in a query and document structure) Document D with F fields is ranked w.r.t query Q according to: P(D Q) rank = λ T q i Q f T (q i, D) + λ O f O (q i, q i+1, D)+ λ U f U (q i, q i+1, D) q i Q Potential function for query unigram q i : q i Q F f T (q i, D) = log w j P(q i θ j ) j=1 9/25

19 Challenges of Structured Document Retrieval Methods for structured document retrieval (SDR) face three major challenges: identifying the key concepts (words or phrases) in keyword queries semantic matching of the key query concepts in different fields of structured documents aggregating the scores of the matched query phrases into the overall score of a structured document Key limitation: all previously proposed SDR methods are based on direct matching of concepts in queries and document fields lexical gap 10/25

20 Proposed Neural Architecture Attention-based Neural Architecture for Ad-hoc Structured Document Retrieval (ANSR): Input: embeddings of words in a query and document fields Pooling layers: create compressed interaction matrices of the same dimensions between unigram- and bigram-based query and document field phrases Matching score aggregation layers: combine the matching scores of query phrases in different document fields into the overall document relevance score by taking into account relative importance of query phrases and document fields Document field attention layers: calculate relative importance of document fields Query phrase attention layers: calculate relative importance of query phrases 11/25

21 Pooling Layers (1) Step 1: create distributed representations of a query and each document field Query: automobile capital and the Detroit of Italy Document: Taurinum Turin is an important business and cultural center in northern Italy, capital city of the Piedmont region attributes located mainly on the left bank of the Po River Susa Valley Italy it is also dubbed la capitale Sabauda Savoyard capital related entity names Space Station Teatro Carignano Savoie List of political philosophers Haifa Parola, Carlo Residences of the Royal House of Savoy Eco, Duchy of Milan Mezzo-soprano Genoa Ginzburg Alessandro Pertini attributes related entity names Taurinum city space Parola Italy capital Milan Pertini 12/25

22 Pooling Layers (2) Step 2: create document fields interaction matrix for each query phrase distributed representations of query and document fields compressed interaction matrices for unigram-based query phrases automobile capital attributes automobile capital Taurinum Italy city capital Italy related entity names space Parola 0.19 Milan Pertini 0.22 attributes Taurinum city Italy capital attributes Taurinum city 0.35 Italy Italy capital 0.39 related entity names space Parola Milan Pertini related entity names space Milan Parola Pertini /25

23 Document Field Attention Layers Goal: compute the importance weights of document fields for aggregating the matching scores of query phrases Document: attributes Taurinum city Italy capital importance weights 0.21 attributes related entity names space Parola Milan Pertini softmax 0.18 related entity names 14/25

24 Query Phrase Attention Layers Goal: compute the importance weights of query phrases for aggregating the matching scores of query phrases of the same type Query: automobile capital and the Detroit of Italy query phrase importance weights automobile capital 0.24 automobile capital softmax query phrase Italy 0.19 Italy 15/25

Matching Score Aggregation Layers attributes related entity names attributes Interaction Matrices query phrase: automobile capital Taurinum city space Parola 0.19 Taurinum city 0.35 Italy capital 0.

25 Matching Score Aggregation Layers attributes related entity names attributes Interaction Matrices query phrase: automobile capital Taurinum city space Parola 0.19 Taurinum city 0.35 Italy capital Milan Pertini 0.22 query phrase: Italy Italy capital 0.39 matching score of 'automobile capital' in all document fields matching score of unigram based query phrases 0.30 aggregation of matching scores of all unigramand bigram-based query phrases related entity names space Parola Milan Pertini matching score of 'Italy' in all document fields aggregation of query phrase matching scores in document fields aggregation of matching scores of query phrases of the same type 16/25

26 Training ANSR is trained to minimize contrastive max-margin loss, given a collection of triplets <q, d n, d r > consisting of relevant d r and non-relevant d n documents for query q: ( min W <q,d n,d r > T max(0, ζ s(q, d r ) + s(q, d n )) + γ 2 W 2 2 ) 17/25

27 Experiments Language modeling and probabilistic baselines: PRMS (Probabilistic Retrieval Model for Semistructured Data) [Kim, Xue and Croft, ECIR 09] MLM (Mixture of Language Models) [Ogilvie and Callan, SIGIR 03] BM25F [Robertson, Zaragoza and Taylor, CIKM 04] FSDM (Fielded Sequential Dependence Model) [Zhiltsov, Kotov and Nikolaev, SIGIR 15] Neural baselines: DRMM (Deep Relevance Matching Model) [Guo, Fan, Ai and Croft, CIKM 16] DESM (Dual Embedding Space Model ) [Nalisnick, Mitra, Craswell and Caruanan, WWW 16] NRM-F (Neural Ranking Model with Multiple Document Fields) [ Zamani, Mitra, Song, Craswell and Tiwary, WSDM 18] 18/25

28 Performance of ANSR and the baselines GOV2 collection MAP PRMS (-39.49%) (-32.62%) (-30.16%) MLM (-10.41%) (-6.23%) (-4.21%) BM25F (-9.00%) (-9.05%) (-7.72%) FSDM (-7.21%) (-3.42%) (-3.00%) DESM (-8.56%) (-5.13%) (-7.33%) DRMM (4.10%) (-2.37%) (-4.35%) NRM-F (-54.07%) (-51.80%) (-56.82%) ANSR ANSR achieved 7.21% and 3% improvement over FSDM in terms of MAP and and 4.35% improvement over DRMM in terms of 19/25

29 Performance of ANSR and the baselines HomeDepot collection MAP PRMS (-19.64%) (-21.57%) (-17.57%) MLM (-13.00%) (-14.09%) (-9.71%) BM25F (-10.86%) (-12.78%) (-7.87%) FSDM (-8.96%) (-12.42%) (-5.62%) DESM (-17.46%) (-19.61%) (-8.15%) DRMM (-12.72%) (-17.86%) (-7.87%) NRM-F (-46.03%) (-47.49%) (-42.82%) ANSR ANSR achieved 8.96% and 5.62% improvement over FSDM as well as 12.72% and 7.87% improvement over DRMM in terms of MAP and 19/25

30 Performance of ANSR and the baselines DBPedia-v2 collection MAP PRMS (-26.50%) (-15.55%) (-14.26%) MLM (-13.15%) (-8.67%) (-9.29%) BM25F (-4.83%) (-4.21%) (-4.30%) FSDM (-7.84%) (-4.30%) (-5.99%) DESM (-11.75%) (-8.51%) (-5.92%) DRMM (-7.77%) (-5.73%) (-6.17%) NRM-F (-52.96%) (-50.85%) (-50.08%) ANSR ANSR achieved 4.83% and 4.30% improvement over BM25F as well as 7.77% and 6.17% improvement over DRMM in terms of MAP and 19/25

31 Topic-level difference in retrieval accuracy between ANSR and FSDM difference difference 0.0 difference k k k (a) GOV2 (b) HomeDepot (c) DBpedia-v2 ANSR has higher average precision than FSDM for 58.88% of the queries in HomeDepot collection In GOV2, the magnitude of improvements in average precision is 1.66 times greater than the magnitude of reductions Superior ability of ANSR to deal with long field documents, due to utilization of compressed representations and explicit correction of the pooling bias 20/25

32 The effect of the pooling size (k) on the performance of ANSR and ANSR-no-pooling ANSR-no-pooling: select the first k terms in each document field instead of pooling map ANSR ANSR no pooling map ANSR ANSR no pooling map ANSR ANSR no pooling k k k (a) GOV2 (b) HomeDepot (c) DBpedia-v2 ANSR has substantially better retrieval accuracy in terms of MAP than ANSR-no-pooling The optimal value of k depends on the collection and the retrieval task: ANSR has the best performance on GOV2, DBpedia-v2 and HomeDepot collections when k = 10, k = 6 and k = 6, respectively 21/25

33 Best performing queries in comparison to FSDM The best performing query is single lever hole bathroom sink faucet : only one relevant document with the title Belle Foret Single Hole 1-Handle High Arc Bathroom Vessel Faucet in Chrome with Metal Lever Handles in relevance judgments This document has longer fields than the average field length in this collection 22/25

34 Worst performing queries in comparison to FSDM The worst performing query is popular : only one relevant document with the title Bloomsz Most Popular Water Plant Collection (8-Pack) in relevance judgments ANSR ranked the document with the title South Shore Furniture Popular Twin Mates Bed in Mocha as the top-ranked document, since it has more words that are semantically similar to the query term popular This can be a consequence of using word embeddings by ANSR, which can cause topic drift for very short queries 23/25

35 Summary ANSR utilizes pooling to generate fixed-size interactions matrices between representations of phrases in a query and document fields and employs an attention mechanism to focus on the most important document fields and query phrases ANSR includes the layers to compute and aggregate the relevance score of a structured document at different levels of granularity ANSR outperforms state-of-the-art LM and neural baselines in different SDR tasks, such as Web search, product search and entity retrieval from a knowledge graph. 24/25

36 Thank you! Questions? 25/25

Attentive Neural Architecture for Ad-hoc Structured Document Retrieval

Attentive Neural Architecture for Ad-hoc Structured Document Retrieval Saeid Balaneshinordan saeid@wayne.edu Wayne State University Detroit, MI, USA Alexander Kotov otov@wayne.edu Wayne State University