A Topic-based Measure of Resource Description Quality for Distributed Information Retrieval

Mark Baillie (1), Mark J. Carman (2), and Fabio Crestani (2)

(1) CIS Dept., University of Strathclyde, Glasgow, UK. mb@cis.strath.ac.uk
(2) Faculty of Informatics, University of Lugano, Lugano, Switzerland. {mark.carman, fabio.crestani}@lu.unisi.ch

Abstract. The aim of query-based sampling is to obtain a sufficient, representative sample of an underlying (text) collection. Current measures for assessing sample quality are too coarse-grained to be informative. This paper outlines a finer-grained measure based on probabilistic topic models of text. The assumption we make is that a representative sample should capture the broad themes of the underlying text collection; if these themes are not captured, then resource selection will suffer in terms of performance, coverage and reliability. For example, resource selection algorithms that must extrapolate from a small sample of indexed documents to determine which collections are most likely to hold relevant documents may be misled by samples that do not reflect the topical density of a collection. To address this issue we propose to measure the relative entropy between the topics observed in a sample and those of the complete collection. Topics are both modelled from the collection and inferred in the sample using latent Dirichlet allocation. The paper presents an analysis and evaluation of this methodology across a number of collections and sampling algorithms.

1 Introduction

Distributed information retrieval (DIR) [1], also known as federated search or selective meta-searching [8], links multiple search engines into a single, virtual information retrieval system. DIR encompasses a body of research investigating alternative solutions for searching online content that cannot be readily accessed through standard means such as content crawling or harvesting. This content is often referred to as the deep or hidden web, and includes information that cannot be reached by crawling the hyperlink structure.

A DIR system integrates multiple searchable online resources into a single search service. (By resource we mean any online information repository that is searchable: a free-text or Boolean search system, a relational database, etc. We assume only that a site has a discoverable search text box.) For cooperative collections, content statistics can be accessed through an agreed protocol.

When cooperation from an information resource provider cannot be guaranteed, it is necessary to obtain an unbiased and accurate description of the underlying content, subject to a number of constraints including cost (computational and monetary), intellectual property considerations, legacy and uncooperative systems, and the differing indexing choices of resource providers [1, 2]. Query-based sampling (QBS) [3] and related approaches [4-6] can be used to sample documents from uncooperative resources in order to build such content descriptions. Despite these techniques, how to most efficiently obtain and accurately represent large document repositories remains an unresolved research problem in DIR, in particular for resource selection: the process of deciding which collections are likely to contain relevant information with respect to a user's request. Resource selection has also been referred to as database selection [5], collection selection [7] or server selection [8] in other studies.

In resource selection, the description of each information resource is used to choose between resources, so the quality of the acquired representation has a large impact on resource selection accuracy and, ultimately, on retrieval performance. It is critical that resource descriptions capture the broad themes, or topicality, of the database: if a description greatly underestimates the prevalence of a topic within a collection, then queries related to that topic will not be routed to it; similarly, if a topic's frequency is overestimated, the collection will receive queries for which it holds little content.

Different sampling approaches result in different representations. It is therefore important to be able to measure the quality of the resource descriptions obtained by different approaches in terms of their coverage of the range of topics present in a collection. This matters for resource selection algorithms such as CORI [1], which require that the sample and collection term distributions be similar. It is especially important for more recent algorithms that rely on a sampled centralised index [2, 8], as these must extrapolate over the topic distributions of the samples to determine the most appropriate collection for each query.

In this paper we revisit the problem of measuring resource description quality. We conjecture that currently adopted measures do not evaluate the quality of an obtained resource description at a sufficiently fine-grained level: current methods compare the vocabulary overlap or term distribution similarity between a collection and its description, whereas for algorithms that use a sampled centralised index the more important aspect is the coverage of topics in the resource description. In response, we outline an approach that exploits probabilistic topic models, in particular latent Dirichlet allocation (LDA) [9], to provide summary descriptive statistics of the quality of resource descriptions obtained under various sampling policies. The measures provide insight into the topical densities of the sample compared to the collection.

The remainder of this paper is structured as follows. We provide a brief outline of QBS and current measures of resource description quality (Section 2), then describe how LDA can be used to measure resource description quality (Section 3). We then report a series of experiments evaluating two sampling strategies with the topic-based measure (Section 4), before concluding (Section 5).

2 Motivations and related work

Query-based sampling (QBS) is a technique for obtaining an unbiased resource description for selective meta-searching over uncooperative distributed information resources [3]. The QBS algorithm submits queries to the resource, retrieves the top r documents, and updates the estimated resource description by extracting terms and their frequencies from those documents; this process continues until a stopping criterion is reached (a sketch of this loop is given below). The document cut-off threshold r may be a limit enforced by the resource itself or a parameter of the algorithm.

The criteria used to select singleton query terms for probing the resource vary. The most widely used approach is to select terms from the already sampled documents. Callan and Connell [3] originally assumed that selecting frequently occurring terms would be more effective at obtaining a random, unbiased sample of documents, and thereby a better resource estimate; a uniform selection of terms was, however, shown to produce comparable representations. Instead of submitting randomly generated queries to the resource, Craswell et al. [4] investigated using real multi-term queries taken from the title field of TREC topics; their focus was to measure the effectiveness of each resource's search service in terms of its ability to retrieve documents known to be relevant to the query. Gravano et al. [5] created a biased resource description using topically focused query probing. In their algorithm, single-term queries are chosen according to their association with a particular topic in a topic hierarchy; at each iteration, a query term is selected from a sub-category further down the hierarchy, so probing can zoom in on specific aspects of a topic. The result is both a biased (topic-specific) representation of the resource and a categorisation of the database within the topic hierarchy. This approach is primarily suited to scenarios in which resources hold topic-specific, homogeneous content, as is the case for vertical search engines (e.g. portals focused on health, sport, etc.).
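
As a concrete illustration of the basic QBS loop described above, here is a minimal sketch. The `search` function stands in for whatever query interface the uncooperative resource exposes; all names are our own and not part of the original algorithm's specification.

```python
import random
from collections import Counter

def query_based_sampling(search, seed_term, r=10, max_queries=500):
    """Sketch of basic QBS: probe the resource with single-term queries and
    accumulate term frequencies from the returned documents."""
    description = Counter()   # term -> frequency in the sampled documents
    sampled_ids = set()       # avoid counting the same document twice
    query = seed_term
    for _ in range(max_queries):          # stand-in stopping criterion
        # `search` is assumed to return (doc_id, text) pairs for the top r hits
        for doc_id, text in search(query, top_k=r):
            if doc_id not in sampled_ids:
                sampled_ids.add(doc_id)
                description.update(text.split())
        if not description:
            break  # the seed query matched nothing; a real run would re-seed
        # uniform selection of the next query term from the current description,
        # which Callan and Connell found comparable to frequency-based selection
        query = random.choice(list(description))
    return description, sampled_ids
```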

QBS is expensive in terms of time, computation and cost, so it is important to be as efficient as possible in generating a sufficiently representative resource description. Initial approaches to QBS terminated sampling using a simple heuristic stopping criterion, for instance when a fixed number of unique documents, terms or submitted queries had been reached. These fixed thresholds were typically set through empirical analysis, which estimated that approximately 300-500 documents needed to be sampled on average [3]. The estimates were generated using simplistic quality measures based on vocabulary overlap and the Spearman rank correlation between the vocabulary of the collection and that of the sample. Later studies showed that these fixed thresholds did not generalise to all collections, with large and heterogeneous collections not being well described by their representations [10, 7]. These findings were enabled by information-theoretic measures such as the Kullback-Leibler (KL) divergence [11], a measure of the relative entropy between the probability of a term occurring in the collection and in the sampled resource description. The KL divergence has also been used to quantify the difference between old and new resource descriptions, i.e. to measure the dynamicity of a resource [12, 13]. Given the dynamic nature of information, it is important to determine when updates to a resource description are required, and measures of resource description quality can be used to define such updating policies.

Adaptive termination procedures provide a stopping criterion based on the goodness of fit of the resource description estimate [10], thus avoiding the generalisation problems associated with heuristic criteria. In [10], the predictive likelihood of a resource description generating a set of typical queries submitted to a DIR system was used as a guide for determining when a representative sample had been acquired; how to identify a representative set of queries for measuring the predictive likelihood remains an open research problem. In [7], the rate at which significant terms (those with a high tf.idf score) were added to the resource description was used to determine when to stop.

Recently it has been shown that QBS provides a biased selection of documents, because the sampling of documents does not follow a binomial distribution, i.e. random selection [14]. Since documents are selected through a search engine, the null hypothesis of randomness is unlikely to hold: the probability of each document being drawn from the collection is not equal. The bias depends on the underlying search engine of the local resource; for example, longer, more content-rich documents [14] or documents with a larger proportion of in-links [15] tend to be favoured. Attempts to correct for this bias involve more complex sampling procedures such as Markov chain Monte Carlo (MCMC) sampling. Bar-Yossef and Gurevich [16] introduced a Random walk approach to sampling via queries using MCMC, specifically the Metropolis-Hastings algorithm, which generates a sequence of samples from a probability distribution that is difficult to sample from directly, in this case the binomial distribution. A query-document graph captures which documents the search engine returns for which queries; this graph is then used to accept or reject queries to be submitted to the search engine (a much-simplified sketch of the acceptance step is given at the end of this section). After a suitable burn-in period, the Random walk protocol provides a random sample of documents in comparison to QBS, at the expense of increased complexity [14].

Obtaining an unbiased estimate is particularly important for the task of estimating the size of a collection [16]. It is not clear, however, whether the increased complexity of MCMC sampling is warranted for the task of building resource descriptions, or whether more uniform sampling actually results in more representative coverage of the topics within a collection. Furthermore, current measures of description quality, such as the KL divergence between the term distribution of the sample and that of the entire collection, do not measure this topic coverage directly. Such measures make the implicit assumption that both the collection and the obtained resource description are 'big documents': document boundaries are considered unimportant and are ignored during measurement. This assumption is in accordance with resource selection approaches such as CORI [1].
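
Returning to the Random walk approach of Bar-Yossef and Gurevich described above: the following is a much-simplified sketch of its Metropolis-Hastings acceptance step, under the assumption that a document's degree in the query-document graph can be approximated by its number of distinct candidate query terms. `search` and `doc_terms` are hypothetical stand-ins for the resource's query interface and a document's term set; the full algorithm involves further machinery (e.g. restricting to queries with bounded result sizes) that is omitted here.

```python
import random

def mh_document_walk(search, doc_terms, start_doc, steps=1000, burn_in=200):
    """Much-simplified Metropolis-Hastings random walk on a query-document
    graph. Easily-reached (high-degree) documents are accepted less often,
    pushing the stationary distribution towards uniform over documents."""
    def degree(doc_id):
        # crude proxy for how many queries can retrieve this document
        return max(len(set(doc_terms(doc_id))), 1)

    current, sample = start_doc, []
    for step in range(steps):
        query = random.choice(sorted(set(doc_terms(current))))  # doc -> query edge
        results = search(query)                                 # query -> doc edges
        if results:
            candidate = random.choice(results)
            # accept with probability min(1, deg(current) / deg(candidate))
            if random.random() < degree(current) / degree(candidate):
                current = candidate
        if step >= burn_in:
            sample.append(current)
    return sample
```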

More recent resource selection approaches, however, retain the boundaries of the documents sampled from a resource. These algorithms attempt to infer the true topical distribution of a resource from the sample of documents in order to determine which collections a query should be forwarded to, and so make the implicit assumption that the sampled documents represent the topical distribution of the underlying collection. Measures such as the KL divergence between the sample and collection term distributions may therefore not be appropriate for assessing the quality, or goodness of fit, of the resource description. To answer such questions, we believe it is important to measure the coverage of the main topical themes of the underlying collection: if a sample covers these distributions, then it can be considered a sufficient representation. In the following sections we outline a new approach for evaluating resource description quality based on probabilistic topic modelling.

3 A Topic-based Measure of Sample Quality

We are interested in measuring how similar a sample of documents is to the collection it comes from, in terms of how well it covers the major themes of the collection. To measure this, we must first discover the important topics in the collection, and then estimate and compare the prevalence of these topics in the sample and in the collection as a whole. There are a number of ways the major themes of a collection could be estimated, including clustering approaches [18] and dimensionality reduction techniques [19]. In this paper we use a recent and theoretically elegant technique, latent Dirichlet allocation (LDA) [9], which has been shown to perform well on a number of IR test collections.

LDA is a probabilistic generative model for documents within a collection, in which each document is modelled as a mixture of topics and each topic is a distribution over terms. LDA has been applied to modelling topics in text corpora, including modelling and tracking the development of scientific topics [17], classification and collaborative filtering [9], and retrieval [20], amongst others. The LDA model specifies how a document may have been generated, the underlying assumption being that documents are mixtures of (sub-)topics. Representing concepts as probabilistic topics makes each topic interpretable and thereby presentable to the user.

For an accurate estimate of the coverage of topics in a sample with respect to the collection, a good LDA representation of the collection is required. As exact inference for LDA is intractable, we use the approximate inference approach of Griffiths and Steyvers [17], which uses Gibbs sampling to approximate the posterior distribution. For each collection of documents D, we first use LDA to estimate a set of K term distributions, each representing a major theme in the collection. The term distribution for a topic k ∈ {1, ..., K} is written p(t | k) and relates to the term distribution for a document, p(t | d), as follows (ignoring hyperparameters):

  p(t \mid d) = \sum_{k=1}^{K} p(t \mid k) \, p(k \mid d)
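
To make the mixture decomposition concrete, here is a minimal NumPy sketch; the matrices are randomly generated stand-ins for the topic-term distributions p(t | k) and the document-topic distribution p(k | d) that Gibbs-sampled LDA would actually estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 4, 1000                             # number of topics, vocabulary size
phi = rng.dirichlet(np.ones(V), size=K)    # phi[k, t] = p(t | k)
theta_d = rng.dirichlet(np.ones(K))        # theta_d[k] = p(k | d)

# p(t | d) = sum_k p(t | k) p(k | d): the document's term distribution
# is a convex combination of the K topic-term distributions
p_t_given_d = theta_d @ phi

assert np.isclose(p_t_given_d.sum(), 1.0)  # still a valid distribution
```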

To obtain a distribution over topics for the sample, we calculate the distribution over topics for each document, p(k | d), using a maximum a posteriori estimate. We then average over the documents in the collection D and in the sample D_θ ⊆ D to obtain topic distributions for the collection and the sample as a whole:

  p(k \mid \Theta) = \frac{1}{|D|} \sum_{d \in D} p(k \mid d)

  p(k \mid \hat{\theta}) = \frac{1}{|D_\theta|} \sum_{d \in D_\theta} p(k \mid d)

Here p(k | Θ) and p(k | θ̂) are the posterior topic distributions averaged over all documents in the collection and in the sample, respectively. To compare the topic distribution of the sample against that of the collection as a whole, we use the KL divergence:

  D_{KL}(\Theta \,\|\, \hat{\theta}) = \sum_{k=1}^{K} p(k \mid \Theta) \log \frac{p(k \mid \Theta)}{p(k \mid \hat{\theta})}

Note that the KL divergence is not symmetric, and that we calculate the divergence of the sample from the collection, D_KL(Θ ∥ θ̂), and not the other way around: we measure the quality of the sample in terms of its ability to describe the true distribution. We reiterate that, by calculating the divergence between the mean topic distributions rather than the mean term distributions, we measure the extent to which the sampled documents cover the major themes of the collection. It is quite possible for a sample of documents to have a term distribution very similar to that of the whole collection while still not covering all of its major themes; calculating the divergence over the topic distribution is intended to remedy this.
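
A minimal implementation sketch of this measure, assuming the per-document topic posteriors p(k | d) have already been inferred (e.g. by Gibbs-sampled LDA) and stacked row-wise into NumPy arrays; the epsilon smoothing is our addition, to guard against topics with zero estimated mass in the sample.

```python
import numpy as np

def topic_kl(theta_collection, theta_sample, eps=1e-12):
    """D_KL(Theta || theta_hat): divergence of the sample's mean topic
    distribution from the collection's. Both arguments are (n_docs, K)
    arrays of per-document p(k | d) estimates; the collection comes
    first because the measure is asymmetric."""
    p = np.clip(theta_collection.mean(axis=0), eps, None)   # p(k | Theta)
    q = np.clip(theta_sample.mean(axis=0), eps, None)       # p(k | theta_hat)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```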

4 Experiments

In this section we describe a series of experiments comparing the topical distributions of acquired resource descriptions across different sampling policies and protocols. For comparison, we also compute the KL divergence directly over term distributions [10]. We run experiments to measure three aspects of sampling: 1) convergence, 2) variation in sampling parameters, and 3) different sampling protocols. The first two experiments analyse resource descriptions obtained using QBS [3], selecting singleton terms at random from the current resource description for querying. In the third experiment, we compare QBS with the MCMC Random walk algorithm [16], using the same settings as in [14]. For all experiments, an implementation of LDA with Gibbs sampling was used to estimate topic models for the collections and to perform inference over the documents in the resource descriptions [17].

The document collections used in the experiments are shown in Table 1. They consist of transcribed radio news broadcasts from the TDT3 corpora, articles from the LA Times, a set of Reuters news wires, and a subset of the TREC WT10g collection. The collections were chosen to provide variation in collection size, document length, content and style. Collections were indexed using the Lemur toolkit (www.projectlemur.org/), with terms stemmed and stop-words removed. The experiments described in this paper use BM25 as each resource's local search engine; in further tests we varied the underlying search engine, with similar results and trends.

Table 1. Collections.

  Collection   No. of docs.   No. of topics (K)   Avg. doc. length   Style
  ASR sgm      37,467         100                 62                 Transcribed radio news
  LA Times     131,896        100                 232                News articles
  Reuters      13,755,685     160                 132                News wires
  WT10g        63,307         100                 341                Varied online content

To choose reasonable values for the number of topics K, we ran a number of initial empirical tests using cross-validation [17]: a held-out sample of each collection was used to compare the perplexity of the model as the value of K was increased in steps of 20. The selected values of K are reported in Table 1. Figure 1 also illustrates the effect of changing K on sample convergence over the ASR corpus, which we now discuss.

4.1 Convergence as sample size increases

This experiment examined the distribution of topics in the sample as further documents are retrieved and added to the resource description. As the number of documents increases, we would expect convergence towards the true collection topic distribution; at that point we can assume that the resource description provides a sufficient representation of the underlying collection. For this experiment we set the maximum number of retrieved documents per query, r, to 10. We submitted 500 queries for each run and performed 15 restarts per collection, changing the starting query seed each time. Measurements of resource description quality were taken at steps of 20 queries for each run (a sketch of this measurement loop is given below).
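
A minimal sketch of that measurement loop, under the same assumptions as the earlier snippets: `search` is the resource's query interface, `infer_doc_topics` is a hypothetical helper mapping sampled documents to their p(k | d) estimates under the collection's topic model, and `topic_kl` is the function sketched in Section 3.

```python
import random
from collections import Counter

def convergence_curve(search, infer_doc_topics, theta_collection,
                      seed_term, r=10, max_queries=500, step=20):
    """Run QBS and record the topic-based D_KL every `step` queries."""
    description = Counter()          # term statistics of the sample so far
    sampled = {}                     # doc_id -> document text
    query, curve = seed_term, []
    for q_num in range(1, max_queries + 1):
        for doc_id, text in search(query, top_k=r):
            if doc_id not in sampled:
                sampled[doc_id] = text
                description.update(text.split())
        if not description:
            break                    # seed matched nothing; a real run would re-seed
        query = random.choice(list(description))
        if q_num % step == 0:
            # infer p(k | d) for each sampled document under the collection's
            # topic model, then score the sample against the collection
            theta_sample = infer_doc_topics(list(sampled.values()))
            curve.append((q_num, topic_kl(theta_collection, theta_sample)))
    return curve
```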

!#!'" 678#$9:&.%#+2#!*# +,* +,%! +,%& +-"./012" %#$" $" 6,0%,-*# +,"-./01" +,"-23451"!#!&$" ("!#!&" #$%+,-.)*#!#!&"!#!%" %"!#$" #$%&'()*# #$%,-.*+# '" &"!#!%$"!#!%" #$%&'()*+# %"!#!!$" & ( %! %) %* && &( '! ') '* )& )( $! /0(1&'#+2#30&'-&)#)01(-4&5# & ) %! %( %* && &) '! '( '* (& () $! /0.1,-#&2#30,-(,*#*01.(4,5# "6#7(.,*# +,"-./01" +,"-23451" 6789:#*01*,%# +,"-./01" +,"-23451" '"!#!)" $"!#'$" #$%,-.*+# &" %" $"!#!("!#!'"!#!&"!#!%"!#!$" #$%&'()*+# #$%,-.*+# (" '" &" %"!#'"!#&$"!#&"!#%$"!#%"!#!$" #$%&'()*+# % ) $! $' $* %% %) &! &' &* '% ') (! /0.1,-#&2#30,-(,*#*01.(4,5# & ) %! %( %* && &) '! '( '* (& () $! /0.1,-#&2#30,-(,*#*01.(4,5# Fig. 1. Convergence of resource descriptions as more documents are sampled. As further documents are sampled the term and topic distributions of the resource description begin to represent that of the collection. For the ASR collection, we further illustrate the effect of changing K on convergence. further documents were sampled the divergence between description and collection decreases. This result provides evidence that for both vocabulary and also topicality the resource descriptions begin to represent the underlying collection. Also, convergence differs across collections demonstrating that a fixed stopping threshold is not generalisable to all collections. For the WT10G subset collection, D KL did not stabilise after 500 queries. Further inspection indicated that the rate of new unseen documents returned was substantially lower than for the other collections. Out of a potential 5,000 documents that could be retrieved after 500 queries, approximately 1,240 documents, were returned across the 15 runs. Figure 2 displays the relative frequency of topics in the WT10G collection and also in the resource description obtained through QBS for a single run. The figure illustrates that the prevalence of topics in the resource description did not correspond to that of the collection with some topics over represented and others under represented. In comparison, the topical distribution of resource descriptions obtained for ASR provided a closer representation to the collection (Figure 2). This result provided evidence that QBS was not as effective for sampling across a range of topics held in more heterogeneous collections. This is an advantage of using a topic-based measure as it possible to visualise which topics are under or over represented in the resource description, providing a more descriptive analysis of the weaknesses of sampling strategies under investigation.

!#!'" 789:;'4+64"0'<-1=2$)341,'1('0123-'5340)36+%1,4>' -.//012.3"" 456" 7839.:";8/<" #$%&"'()"*+",-.'!#!&$"!#!&"!#!%$"!#!%"!#!!$" %" % & ' ( $ ) * +, %! /'0123-4'1)5")"5'6.')"#$%&"'()"*+",-.'3,'-1##"-%1,'!&,$ 78!'9-1:2$)341,'1('0123-'5340)36+%1,4;' -.//012.3$ 456$ 7839.:$;8/<$!&+$ #$%&"'()"*+",-.'!&*$!&)$!&($!&'$!&&$!&$!!%$!!#$ &$ &!$ '!$ (!$ )!$ *!$ +!$,!$ #!$ %!$ &!!$ /'0123-4'1)5")"5'6.')"#$%&"'()"*+",-.'3,'-1##"-%1,' Fig. 2. The relative frequency of topics in the collection and resource descriptions obtained by QBS and the Random walk approaches for the WT10G and ASR collections. The closer a point is to the solid line of the collection the better the representation. 4.2 Changing the number of results per query In this experiment we were concerned with the effect of changing parameters for QBS. More specifically, whether increasing r provides a more representative sample of the collection. This in essence is testing the question of whether sampling few documents with many queries or many documents with a small number of queries obtains better representations. QBS was evaluated over a range of values: r = {10, 100, 1000}. To ensure fair comparison, we initially analysed each policy when 5,000 documents were sampled. We also continued sampling until 1,000 queries, to further analyse the impact on resource description quality and

[Fig. 3. Comparison of sampling policies in terms of D_KL for the ASR collection. In the left-hand plot sampling was stopped when 5,000 documents had been returned; in the right-hand plot, after 1,000 queries.]

As a case study we focus on the ASR collection. Figure 3 reports the mean and standard error of D_KL over 15 restarts when comparing term (left) and topic distributions (right). Stopping at 5,000 documents, the D_KL for term distributions is comparable for all settings of r. When comparing topic distributions, however, setting r to 10 or 100 provides better topic coverage than 1,000. This result indicates the topical bias introduced by setting r to 1,000 and stopping at 5,000 sampled documents: only a small number of queries have been submitted in comparison to the other policies, and each query ranks the documents in the collection by topical relevance to that query. As further queries are submitted (Figure 3, right), a larger proportion of documents is sampled for r = 1,000, which results in a closer representation of the topic distribution and a lower D_KL.

[Fig. 4. Comparison of two different sampling strategies: QBS with r=10, and Random walk. In both plots the Random walk method converges on the collection more quickly than QBS.]

4.3 Changing the sampling strategy

In this experiment we compared QBS with r = 10 against the Random walk approach, which is designed to obtain a random sample of documents via querying. For each interaction with the local search engine, 10 documents were retrieved and added to the resource description under each approach, and sampling was stopped after 500 interactions with the resource. Figure 2 compares the topic distributions of the two approaches, while Figure 4 presents the trend in D_KL over topic distributions for the ASR and WT10g collections.

The Random walk method provided a representation closer to the collection's distribution of topics than QBS. This is reflected in the closer proximity of the topics in the resource description to their true prevalence in the collection (Figure 2) and in the lower divergence of the resource descriptions (Figure 4). For the ASR collection, both sampling approaches retrieved a comparable number of unique documents, indicating that the coverage of documents was less biased for Random walk. For the WT10g collection, the Random walk method both retrieved a larger proportion of unique documents and covered topics more evenly. These results indicate that, in return for the increased sampling complexity, MCMC sampling obtains a more random, representative sample of documents.

5 Conclusions and future work

In this paper we investigated the use of LDA as a basis for measuring the topical distribution of document samples acquired via query sampling. Using LDA we generated a set of topics that characterise a collection as well as the resource descriptions sampled from it. This new topic-based measure was used to determine whether acquired resource descriptions were sufficiently representative of the collection in terms of topical coverage, i.e. we examined which topics were under- or over-represented in the sample. The analysis produced a number of important results.

Firstly, a small sample of 300-500 documents was not sufficient in terms of topical coverage for all collections; the number of documents required depended not only on collection size but also on the topical cohesiveness of the collection, i.e. whether the collection was topically heterogeneous or homogeneous. Secondly, changing r, the number of documents sampled per query, could increase or reduce topical bias in QBS: retrieving fewer documents per query provided a more representative sample of topics, as the larger number of queries probed more aspects of the collection, although over a larger number of queries this bias levelled off. Thirdly, the Random walk sampling approach provided a more random and representative sample than QBS, especially for a more heterogeneous collection such as a subset of general online web pages.

Finally, while this paper focused on measuring the quality of samples obtained for resource selection, the implications of this study may generalise to other tasks. Sampling resources via queries has been applied to a variety of problems, such as search engine diagnostics and index size estimation [16, 14], information extraction and database reachability [6], and the evaluation of information retrievability and bias [15]. Future work will investigate the applicability of topic-based measures to these problems.

Acknowledgments. This research was supported by EPSRC grant EP/F060475/1, "Personalised Federated Search of the Deep Web".

References

1. Callan, J.P.: Distributed information retrieval. In: Advances in Information Retrieval. Kluwer Academic Publishers (2000) 127-150
2. Si, L., Callan, J.P.: Relevant document distribution estimation method for resource selection. In: ACM SIGIR '03, Ontario, Canada (2003) 298-305
3. Callan, J.P., Connell, M.: Query-based sampling of text databases. ACM Transactions on Information Systems 19 (2001) 97-130
4. Craswell, N., Bailey, P., Hawking, D.: Server selection on the world wide web. In: DL '00: Proceedings of the Fifth ACM Conference on Digital Libraries (2000) 37-46
5. Gravano, L., Ipeirotis, P.G., Sahami, M.: QProber: A system for automatic classification of hidden-web databases. ACM Transactions on Information Systems 21 (2003) 1-41
6. Ipeirotis, P.G., Agichtein, E., Jain, P., Gravano, L.: To search or to crawl?: Towards a query optimizer for text-centric tasks. In: SIGMOD '06: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data (2006) 265-276
7. Shokouhi, M., Scholer, F., Zobel, J.: Sample sizes for query probing in uncooperative distributed information retrieval. In: APWeb 2006 (2006) 63-75
8. Hawking, D., Thomas, P.: Server selection methods in hybrid portal search. In: ACM SIGIR '05, Brazil (2005) 75-82
9. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003) 993-1022
10. Baillie, M., Azzopardi, L., Crestani, F.: Adaptive query-based sampling of distributed collections. In: SPIRE 2006, Glasgow, UK (2006)
11. Kullback, S., Leibler, R.A.: On information and sufficiency. Annals of Mathematical Statistics 22 (1951) 79-86
12. Ipeirotis, P.G., Ntoulas, A., Cho, J., Gravano, L.: Modeling and managing content changes in text databases. In: ICDE '05 (2005) 606-617
13. Shokouhi, M., Baillie, M., Azzopardi, L.: Updating collection representations for federated search. In: ACM SIGIR '07, Amsterdam, Netherlands (2007) 511-518
14. Thomas, P., Hawking, D.: Evaluating sampling methods for uncooperative collections. In: ACM SIGIR '07, Amsterdam, Netherlands (2007) 503-510
15. Azzopardi, L., Vinay, V.: Accessibility in information retrieval. In: ECIR 2008, Glasgow, UK (2008) 482-489
16. Bar-Yossef, Z., Gurevich, M.: Random sampling from a search engine's index. In: ACM WWW '06 (2006) 367-376
17. Griffiths, T.L., Steyvers, M.: Finding scientific topics. Proceedings of the National Academy of Sciences 101 (2004) 5228-5235
18. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press (2008)
19. Hofmann, T.: Probabilistic latent semantic indexing. In: ACM SIGIR '99, Berkeley, USA (1999) 50-57
20. Wei, X., Croft, W.B.: LDA-based document models for ad-hoc retrieval. In: ACM SIGIR '06, Seattle, USA (2006) 178-185