Window Extraction for Information Retrieval


Samuel Huston, Center for Intelligent Information Retrieval, University of Massachusetts Amherst, Amherst, MA, 01002, USA, sjh@cs.umass.edu
W. Bruce Croft, Center for Intelligent Information Retrieval, University of Massachusetts Amherst, Amherst, MA, 01002, USA, croft@cs.umass.edu

ABSTRACT

Proximity-based term dependencies have been proposed and used in a variety of effective retrieval models. The execution of these dependency models is commonly supported through the use of positional inverted indexes. However, few of these models detail how instances of proximate terms should be extracted from the lists of positional data. In this study, we investigate three algorithms for the extraction of windows that span a range of assumptions about the reuse of terms in multiple window instances. We observe that computed collection statistics of unordered windows are significantly affected by the choice of algorithm. We also observe that retrieval efficiency and effectiveness of a state-of-the-art dependence model are not significantly affected by the selection of window extraction algorithm.

1. INTRODUCTION

Dependency retrieval models have been repeatedly shown to improve information retrieval performance over bag-of-words retrieval models [1; 3; 7; 9; 10; 11]. A common approach in these models is to assert that all adjacent pairs of query terms are dependent [1; 7; 9]. Windows of text that contain the identified dependent pairs or sets of terms are identified in each document. Features used in these retrieval models can then be aggregated from the set of windows identified in a particular document.

The execution of these dependency models is commonly supported by positional indexes; see Witten et al. [13] for a definition and discussion of this type of index. Positional indexes provide access to the location of each instance of each query term in each indexed document.
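To make this concrete, the following is a minimal sketch of a positional inverted index (our own illustration, not the implementation of any particular system) that maps each term to a per-document sorted list of positions:

```python
from collections import defaultdict

def build_positional_index(docs):
    """Build a toy positional index: term -> {doc_id: sorted positions}.

    `docs` maps doc_id -> list of tokens. A production index (as in
    Witten et al.) would compress these lists; this sketch only shows
    the access pattern dependency models rely on.
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term][doc_id].append(pos)  # enumerate yields sorted order
    return index

docs = {1: ["a", "b", "a", "b"], 2: ["b", "c", "a"]}
idx = build_positional_index(docs)
print(idx["a"][1])  # positions of "a" in document 1 -> [0, 2]
```

Window extraction algorithms consume exactly these per-term position lists, one per dependent query term.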
However, very few dependency retrieval models specify precisely how window instances should be identified using these term-position lists. Implicit in these underspecified dependency models is an assumption about how terms that could occur in multiple windows should be treated. In this study, we focus on the extraction of the ordered and unordered windows originally used by Metzler and Croft [7] in the Markov random field model framework. We note that similar windows are also used in other retrieval models, including BM25-TP [10], BM25-TP2 [12], and pdfr [9].

Figure 1: Example extracted sets of unordered windows for each assumption of term reuse.

We discuss three possible assumptions of term reuse and detail corresponding window extraction algorithms. We discuss the theoretical efficiency bounds of each algorithm, then experimentally test the impact of each algorithm. We observe that there is a significant difference between the collection frequencies of windows extracted by each of these algorithms. For the state-of-the-art sequential dependence model (SDM) [7], we observe that, despite this, there is no difference in retrieval effectiveness or retrieval efficiency between the algorithms.

2. ASSUMPTIONS OF TERM REUSE

As mentioned, there is a potentially important efficiency and effectiveness trade-off in the extraction of ordered and unordered windows.
Several different algorithms can be devised depending on whether or not specific term instances are reused in multiple window instances. We focus on three possible assumptions of term reuse: no-reuse, no-domination, and all-terms-reused. Figure 1 shows an example of the windows extracted under each of these assumptions. In general, the number of window instances extracted under the no-reuse assumption is less than or equal to the number extracted under the no-domination assumption. Similarly, the number of instances extracted under the no-domination assumption is less than or equal to the number extracted under the all-terms-reused assumption.

We now propose three unordered window extraction algorithms that extract sets of window instances matching each of the term reuse assumptions. An unordered window is defined over a set of dependent terms (often a pair) and a window width. Each window instance must include one instance of each term, and the distance from the first term to the last term in the window instance must be less than the width parameter, w.

First, the no-reuse assumption permits each term instance to be present in at most one window instance. Following this assumption, the Unordered-Windows-No-Reuse algorithm extracts all unordered windows that do not share any term instances:

Algorithm Unordered-Windows-No-Reuse
  while true do
    minpos = minpos(itrs); maxpos = maxpos(itrs)
    if (maxpos - minpos) < w then
      output.add(window(itrs))
      for each itr in itrs do
        itr.next(); if itr.done() then return output
      end for
    else
      for each itr in itrs do
        if itr.pos() == minpos then
          itr.next(); if itr.done() then return output
        end if
      end for
    end if
  end while

We observe that this algorithm processes each posting list only once, so we can bound its complexity at O(Σ_i |pos_i|) operations.

Second, the no-domination assumption allows a limited amount of reuse of term instances:

Algorithm Unordered-Windows-No-Domination
  while true do
    minpos = minpos(itrs); maxpos = maxpos(itrs)
    if (maxpos - minpos) < w then
      output.add(window(itrs))
    end if
    for each itr in itrs do
      if itr.pos() == minpos then
        itr.next(); if itr.done() then return output
      end if
    end for
  end while
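As a runnable rendering of the no-reuse pseudocode (our own Python sketch, operating on sorted position lists instead of posting-list iterators; the function name is our own):

```python
def unordered_windows_no_reuse(position_lists, w):
    """Extract unordered windows; each term instance is used at most once.

    position_lists: one sorted list of positions per dependent term.
    Returns the (min_pos, max_pos) span of each extracted window.
    """
    idx = [0] * len(position_lists)  # one cursor per term's position list
    out = []
    while all(i < len(p) for i, p in zip(idx, position_lists)):
        current = [p[i] for i, p in zip(idx, position_lists)]
        lo, hi = min(current), max(current)
        if hi - lo < w:
            out.append((lo, hi))
            idx = [i + 1 for i in idx]  # consume every matched term instance
        else:
            # advance only the cursors sitting at the minimum position
            idx = [i + 1 if p[i] == lo else i
                   for i, p in zip(idx, position_lists)]
    return out

# "A B A B": A at positions 0 and 2, B at 1 and 3, width 2
print(unordered_windows_no_reuse([[0, 2], [1, 3]], 2))  # [(0, 1), (2, 3)]
```

Each cursor moves forward monotonically, matching the O(Σ_i |pos_i|) bound stated above.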
Observing that each extracted window instance covers a specific region of the document, we can omit dominated windows. A window dominates another when both share a starting position and its ending position is smaller than that of the dominated window. The Unordered-Windows-No-Domination algorithm extracts all non-dominated windows. Again, this algorithm processes each posting list only once, so we can bound the complexity at O(Σ_i |pos_i|) operations. In practice, it is reasonable to expect that this algorithm will take slightly longer to execute than the Unordered-Windows-No-Reuse algorithm above, as the term iterators are not advanced as rapidly.

Finally, the all-terms-reused assumption permits the reuse of each term instance in all possible window instances. The Unordered-Windows-All-Terms-Reused algorithm recursively extracts all possible window instances in the document:

Algorithm Unordered-Windows-All-Terms-Reused
  itr0 = itrs.pop()
  while not itr0.done() do
    output.add(ExtractWindows(itrs, itr0.pos(), itr0.pos()))
    itr0.next()
  end while
  return output

  function ExtractWindows(itrs, begin, end)
    itri = itrs.pop()
    while not itri.done() do
      nbegin = min(begin, itri.pos())
      nend = max(end, itri.pos())
      if (nend - nbegin) < w then
        if itrs is empty then
          output.add(window(nbegin, nend))
        else
          output.add(ExtractWindows(itrs, nbegin, nend))
        end if
      end if
      itri.next()
    end while
    itri.reset()
    itrs.push(itri)
    return output
  end function

Note that this assumption requires each list of positions to be processed several times to ensure that all windows are extracted, so the worst-case complexity of this algorithm is bounded at O(Π_i |pos_i|) operations. For longer lists of positions, or larger values of n, this algorithm is likely to be considerably less efficient than both of the previous algorithms.
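The other two assumptions can be sketched in the same style (again our own illustrative Python over sorted position lists, not Galago's code; the function names are ours):

```python
def unordered_windows_no_domination(position_lists, w):
    """Extract non-dominated unordered windows: after every match (or
    mismatch) only the cursors at the minimum position advance, so a
    matched term instance may be reused by a later window."""
    idx = [0] * len(position_lists)
    out = []
    while all(i < len(p) for i, p in zip(idx, position_lists)):
        current = [p[i] for i, p in zip(idx, position_lists)]
        lo, hi = min(current), max(current)
        if hi - lo < w:
            out.append((lo, hi))
        idx = [i + 1 if p[i] == lo else i
               for i, p in zip(idx, position_lists)]
    return out

def unordered_windows_all_reused(position_lists, w):
    """Recursively enumerate every combination of term instances whose
    span is below w (the all-terms-reused assumption)."""
    out = []
    def extend(rest, lo, hi):
        if not rest:                       # one instance of each term chosen
            out.append((lo, hi))
            return
        for pos in rest[0]:
            nlo, nhi = min(lo, pos), max(hi, pos)
            if nhi - nlo < w:
                extend(rest[1:], nlo, nhi)
    for pos in position_lists[0]:
        extend(position_lists[1:], pos, pos)
    return out
```

On the toy positions of Figure 1 (term A at 0 and 2, term B at 1 and 3, w = 2), both functions return three windows, (0,1), (1,2), and (2,3), versus the two windows permitted under the no-reuse assumption; with wider windows or more repeated terms the all-terms-reused count grows much faster.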
3. EXPERIMENTS

We now empirically evaluate the impact these different assumptions of term reuse have on retrieval performance. We measure three aspects: first, the collection frequency of extracted windows; second, the retrieval efficiency of each algorithm; and finally, the retrieval effectiveness of each algorithm. We compare each of these algorithms in the context of the

sequential dependence model [7]. That is, all adjacent pairs of query terms are selected to be modeled as both an ordered window and an unordered window. The width of the ordered window is too small to permit any term reuse, so the focus of the study is on unordered windows. For all experiments, the width of the extracted unordered windows is set to 8.

For each of these experiments, we use 4 TREC collections: Robust-04, GOV2, ClueWeb-09-Cat-B, and ClueWeb-09-Cat-A. See Table 1 for the details of each collection.

Table 1: Summary of data sets used.

  Collection     # Doc.   # Terms   # Topics
  Robust-04      0.5 M    0.252 B   250
  GOV2           25 M     22 B      150
  ClueWeb-09-B   33 M     26 B      200
  ClueWeb-09-A   201 M    123 B     200

A total of 600 unique topics are available for these collections. Note that the topics for ClueWeb-09-B are the same as the topics available for ClueWeb-09-A. Each topic has two formulations: a short title query, consisting of 1-4 terms, and a longer description query of between 5 and 30 terms. In accordance with previous research on the ClueWeb-09 collection [4], we have removed all documents with a fusion spam score lower than 60 from these collections. We implement each of the above window extraction algorithms using the Galago retrieval framework (http://lemurproject.org/galago.php).

3.1 Results

We start by measuring the collection frequency of unordered windows for each collection and query set. Figure 2 shows the mean collection frequency of unordered windows of width 8, as extracted by each of the three algorithms, for each TREC collection.

Figure 2: Mean collection frequencies extracted using three window extraction algorithms, for 8 retrieval settings (2 types of queries for 4 TREC collections).

Figure 3: Mean average precision of SDM, using each of the three assumptions of term reuse, in 8 retrieval settings (2 types of queries for 4 TREC collections).
The adjacent query terms from each of the topics for each collection are used as the input dependent query terms. As expected, this data shows that the Unordered-Windows-No-Reuse algorithm identifies fewer window instances than the Unordered-Windows-No-Domination algorithm, and both identify fewer window instances than the Unordered-Windows-All-Terms-Reused algorithm.

The observed differences in collection frequency between the three algorithms are relatively small. The frequencies of no-reuse extracted windows are around 10% lower than the no-domination extracted windows, and the frequencies of all-terms-reused windows are around 7% higher than the no-domination extracted windows, across all collections and both types of queries.

Next, we investigate how retrieval effectiveness is affected by each algorithm. For this experiment, we optimized the parameters of SDM for each collection, using a coordinate ascent approach described by Metzler [8]. Figure 3 shows how each algorithm affects retrieval effectiveness, as measured using mean average precision. We can clearly see that there is almost no change in effectiveness for each window extraction algorithm. We observe similar results for other retrieval metrics, including nDCG@20, P@20, and ERR@20. For each metric, the Fisher randomization test was applied to all pairs of results, for each collection and query type. No significant differences were observed (α = 0.05).

Finally, we measure the retrieval efficiency of each algorithm. Given the complexity of each algorithm, we expect to observe that the no-reuse algorithm will execute faster than the no-domination algorithm, which will execute faster than the all-terms-reused algorithm.

Figure 4: Average time to retrieve the top 1000 documents, using SDM, with each of the three assumptions of term reuse, in 8 retrieval settings (2 types of queries for 4 TREC collections).

For this experiment, we implement similar algorithms for the extraction of ordered windows.
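For reference, a sketch of an ordered-window extractor for a dependent pair over sorted position lists (our own illustration with an assumed name; since SDM's ordered windows are too narrow to permit term reuse, all three assumptions yield the same output here):

```python
from bisect import bisect_right

def ordered_window_spans(pos_a, pos_b, w=1):
    """For a dependent pair (a, b), return spans where an instance of b
    follows an instance of a within w positions (w=1 means adjacent).
    pos_a and pos_b are sorted position lists from a positional index."""
    out = []
    for a in pos_a:
        i = bisect_right(pos_b, a)  # first b-position strictly after a
        while i < len(pos_b) and pos_b[i] <= a + w:
            out.append((a, pos_b[i]))
            i += 1
    return out

# "new york" as an adjacent pair: "new" at 0 and 4, "york" at 1 and 9
print(ordered_window_spans([0, 4], [1, 9], 1))  # [(0, 1)]
```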
Even though the output of the ordered window extraction algorithms remains identical in all cases for SDM, the algorithm may have some impact on retrieval efficiency. Through these modifications, we aim to maximize any observed time differences between the assumptions.

Figure 4 shows the retrieval efficiency of each algorithm, measured as the time to retrieve the top 1000 documents, averaged over 5 repeated runs of all queries, for each tested collection and type of query. We observe almost no difference in efficiency between the no-reuse and no-domination algorithms. A small difference (< 2%) is observed between the no-domination and all-terms-reused algorithms in some cases. The standard deviation of the reported mean processing times is computed to be less than 2% of the mean processing time for each collection.

4. RELATED WORK

While the experiments presented in this study focus on the unordered windows used in SDM, these results are applicable to several other effective dependency retrieval models. For example, BM25-TP [10] and pdfr [9] both use similar unordered window features, but do not explicitly assert how windows should be extracted from positional data, or whether extracted windows may or may not share term instances. However, as the widths of the unordered windows in these retrieval models are not larger than in SDM, we expect to observe little difference in retrieval performance between the three tested assumptions of term reuse.

There are a small number of dependency models that specify how to extract window instances from positional data. First, a variant of BM25-TP [3] specifies how unordered windows can be efficiently extracted from positional data. We note that this algorithm produces a set of windows similar to those produced under the no-reuse assumption. Second, BM25-Span [11] defines a text span as the output of the provided span extraction algorithm. Lv and Zhai [6] propose an alternative approach to incorporating term proximity into the evaluation of a document.
Their positional language model scores a document at a specific location in the document; term instances are propagated to the desired scoring position, so this model does not require windows to be identified in documents. An alternative to extracting windows from sets of positional data is to directly index the frequency or score of the required windows. Heuristic approaches to indexing term dependencies have been shown to improve retrieval efficiency [2; 5]. However, we note that these indexes also rely on some unstated assumptions of term reuse.

5. CONCLUSIONS

Retrieval models that use windows to measure term dependence have not specified exactly how term occurrence is counted in these windows. In this study, we have defined how window instances can be counted and investigated possible trade-offs between retrieval efficiency and retrieval effectiveness in the extraction of window instances for dependency models. We have identified three reasonable assumptions of term reuse that could allow for different retrieval performance, and we present three corresponding algorithms that extract the appropriate set of unordered windows from lists of positional data for a given document. We empirically tested the use of these algorithms in a retrieval system. From the observed data, there is no clear reason to prefer any of these window extraction algorithms. We observe that the collection frequency of the windows increases as more term reuse is permitted. However, we do not observe any significant changes in retrieval effectiveness or retrieval efficiency. It is reasonable to expect that larger window widths will increase the differences between the algorithms. In future work, we plan to investigate the impact these assumptions have on retrieval performance for retrieval models that use larger window widths.

Acknowledgments

This work was supported in part by the Center for Intelligent Information Retrieval and in part by NSF grant #CNS-0934322.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

References

[1] Michael Bendersky, Donald Metzler, and W. Bruce Croft. Learning concept importance using a weighted dependence model. In Proc. of the 3rd ACM WSDM, pages 31-40, 2010.
[2] Andreas Broschart and Ralf Schenkel. High-performance processing of text queries with tunable pruned term and term pair indexes. ACM Trans. Inf. Syst., 30(1):5:1-5:32, March 2012.
[3] Stefan Büttcher, Charles L. A. Clarke, and Brad Lushman. Term proximity scoring for ad-hoc retrieval on very large text collections. In Proc. of the 29th ACM SIGIR, pages 621-622, 2006.
[4] Gordon V. Cormack, Mark D. Smucker, and Charles L. A. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. Information Retrieval, 14(5):441-465, 2011.
[5] S. Huston, A. Moffat, and W. B. Croft. Efficient indexing of repeated n-grams. In Proc. of the 4th ACM WSDM, pages 127-136, 2011.
[6] Y. Lv and C. X. Zhai. Positional language models for information retrieval. In Proc. of the 32nd ACM SIGIR, pages 299-306, 2009.
[7] D. Metzler and W. B. Croft. A Markov random field model for term dependencies. In Proc. of the 28th ACM SIGIR, pages 472-479, 2005.
[8] Donald Metzler. Using gradient descent to optimize language modeling smoothing parameters. In Proc. of the 30th ACM SIGIR, pages 687-688, 2007.
[9] Jie Peng, Craig Macdonald, Ben He, Vassilis Plachouras, and Iadh Ounis. Incorporating term dependency in the DFR framework. In Proc. of the 30th ACM SIGIR, pages 843-844, 2007.
[10] Yves Rasolofo and Jacques Savoy. Term proximity scoring for keyword-based retrieval systems. In Proc. of the 25th ECIR, pages 207-218, 2003.
[11] Ruihua Song, Michael J. Taylor, Ji-Rong Wen, Hsiao-Wuen Hon, and Yong Yu. Viewing term proximity from a different perspective. In Proc. of the 30th ECIR, pages 346-357, 2008.

[12] Krysta M. Svore, Pallika H. Kanani, and Nazan Khan. How good is a span of terms? Exploiting proximity to improve web retrieval. In Proc. of the 33rd ACM SIGIR, pages 154-161, 2010.
[13] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, 1999.