
Broccolo, Daniele, Macdonald, Craig, Orlando, Salvatore, Ounis, Iadh, Perego, Raffaele, Silvestri, Fabrizio, and Tonellotto, Nicola (2013) Load-sensitive selective pruning for distributed search. In: CIKM '13: 22nd ACM International Conference on Information and Knowledge Management, 27 Oct - 1 Nov 2013, San Francisco, CA, USA. Copyright 2013 ACM.

A copy can be downloaded for personal non-commercial research or study, without prior permission or charge. Content must not be changed in any way or reproduced in any format or medium without the formal permission of the copyright holder(s). When referring to this work, full bibliographic details must be given.

Deposited on: 13 May 2014

Enlighten - Research publications by members of the University of Glasgow

Load-Sensitive Selective Pruning for Distributed Search

Daniele Broccolo 1,3, Craig Macdonald 2, Salvatore Orlando 1,3, Iadh Ounis 2, Raffaele Perego 1, Fabrizio Silvestri 1,4, Nicola Tonellotto 1
1 National Research Council of Italy, 2 University of Glasgow, 3 Ca' Foscari University of Venice, 4 Yahoo! Research, Barcelona, Spain
{firstname.lastname}@isti.cnr.it 1, {craig.macdonald, iadh.ounis}@glasgow.ac.uk 2, silvestr@yahoo-inc.com 4

ABSTRACT

A search engine infrastructure must be able to provide the same quality of service to all queries received during a day. During normal operating conditions, the demand for resources is considerably lower than under peak conditions, yet an oversized infrastructure would result in an unnecessary waste of computing power. A possible solution adopted in this situation might consist of defining a maximum threshold processing time for each query, and dropping queries for which this threshold elapses, leading to disappointed users. In this paper, we propose and evaluate a different approach, where, given a set of query processing strategies with differing efficiency, each query is considered by a framework that sets a maximum query processing time and selects which processing strategy is the best for that query, such that the processing time for all queries is kept below the threshold. The processing time estimates used by the scheduler are learned from past queries. We experimentally validate our approach on 10,000 queries from a standard TREC dataset with over 50 million documents, and we compare it with several baselines. These experiments encompass testing the system under different query loads and different maximum tolerated query response times. Our results show that, at the cost of a marginal loss in terms of response quality, our search system is able to answer 90% of queries within half a second during times of high query volume.
Categories and Subject Descriptors: H.3.3 [Information Storage & Retrieval]: Information Search & Retrieval

Keywords: Query Efficiency Prediction, Scheduling

1. INTRODUCTION

Commercial Web search engines are expected to process user queries under tight response time constraints while being able to operate under heavy query traffic loads. Queries that cannot be processed within their time constraint experience degraded result quality [5]. Operating under these conditions requires building a very large infrastructure involving thousands of computers and making continuous investments to maintain this infrastructure [3]. Hence, optimising the efficiency of Web search engines is important to reduce their infrastructure costs. The user query volume typically received by a Web search engine is illustrated in Figure 1, showing how the rate of queries received can vary through the course of a day.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. CIKM '13, Oct. 27 - Nov. 1, 2013, San Francisco, CA, USA. Copyright 2013 ACM.

Figure 1: The query distribution over a 24-hour time span covering 1st May 2006, from the MSN query log (hour of the day vs. query traffic rate, in queries/sec, with the peak query load experienced by the search engine marked).
In order to guarantee that each query is processed with sub-second response times, the computing/communication infrastructure has to support the worst-case query volume, which reaches its maximum during the day time (about 13 queries per second from 10:00 to 14:00 in the workload depicted in Figure 1), typically around midday. Hence, the typical approach taken by Web search engines is to deploy a distributed search architecture [8]. In this architecture, the servicing of a search query uses many query servers, each in charge of a partition of the global index. When a query reaches one of these servers, it is processed immediately if the server is idle, otherwise it is placed in a queue waiting for processing. Hence, the completion time of a query at each server includes both a waiting time and a processing time, the latter in turn depending on the processing strategy. We argue that, in order to realise a Web search engine that can answer a query within a given deadline, there are three options for how the system can respond in the presence of a high query volume. Firstly, if the system is not able to reduce the processing time of each query and the scheduling of the queries cannot be modified, queries with long processing times might simply be dropped, resulting in error pages being returned to users, or interrupted, with a partial result list being presented. While this technique does guarantee high-quality results for a (possibly small) subset of queries, it is unfortunately the only possible choice when the query volume cannot be sustained by a given search engine infrastructure. The second option, discussed in recent works [1, 14], is based on query efficiency predictions, along with suitable scheduling algorithms, to re-order the

query queue. The aim of this technique is to reduce the overall queueing time of queries, thus increasing the query throughput. A third, alternative option, proposed in this paper, is to dynamically select a suitable query processing (retrieval) strategy to process the queries so as to satisfy the per-query deadlines, thus reducing the query processing time. In particular, when deciding between processing strategies, our proposed approach considers the time necessary to satisfy the per-query deadlines for all queries queued after the current query. In this way, it can allocate the available resources fairly across the waiting queries. On the other hand, a strategy that reduces the processing time of a query has an obvious drawback: we need to exploit approximate processing strategies, such as dynamic pruning [4, 16, 2], that can reduce the quality of the query results. However, pruning can be applied selectively, on a per-query basis [19], depending on the expected processing time of the query and the status of the search engine. In this work, we go further, as our novel scheduling methodology can selectively adopt dynamic pruning processing strategies only when the system is experiencing high workloads, thereby trading off some effectiveness to ensure efficiency. In summary, this paper argues that the effectiveness of search results can be maintained whilst meeting completion time constraints by choosing an appropriate pruning strategy for each query to be processed by a given server. In particular, our method can examine not just the current query, but also the other queries queued for processing.
Hence, the contributions of this paper are two-fold: We propose a load-sensitive selective pruning framework for bounding the permitted processing time of a query, which considers goals such as meeting a time threshold, effectiveness, and fairness to other queries waiting to be processed. Moreover, to support our proposed framework, we propose an accurate approach for query efficiency prediction of term-at-a-time dynamic pruning strategies. Our experiments show that the proposed framework is able to produce results of quality comparable to that of a search system that does not bound query processing time, while at high query workloads the system can still respond to queries within a predetermined time threshold. For instance, when 4 queries per second are arriving at the search engine, our framework is able to answer 90% of queries within 0.5 seconds, with a 5% drop in results quality compared to an effective processing strategy for which only a small fraction of queries meet the time threshold. The remainder of this paper is structured as follows: In Section 2, we introduce the necessary preliminaries by discussing the context of our work; Section 3 discusses related work in efficient retrieval; In Section 4, we propose our framework for load-sensitive selective pruning; Section 5 discusses the processing strategies we deploy, and how their response times can be accurately predicted; Section 6 defines the experimental setup for the evaluation that follows in Section 7; We provide concluding remarks in Section 8.

2. PRELIMINARIES

Web search engines have to manage huge quantities of documents while achieving the goal of effectively answering users' queries, and doing so efficiently, i.e., within a fraction of a second. To achieve this multi-objective goal despite the large size of the Web, the corpus of documents the search engine must manage is partitioned into sub-collections that are each manageable by a single machine.
This results in several query servers engaged in answering a user's query, each of them storing an index shard [3] built on a subset of the corpus. Without loss of generality, in this work we assume a distributed search engine where data are distributed according to a document partitioning strategy [2]. The index is thus partitioned into shards, each one relative to a particular partition of the documents. To increase query throughput, each index shard is typically replicated several times, and a query received by the search front-end is routed to one of the available replicas. In this work, we assume a multi-node search engine without replicas, because our experimental results are independent of the number of replicas, and hence can be applied directly to each replica independently [5]. Figure 2 depicts our reference architecture for a single replica.

Figure 2: Our reference architecture of a distributed search engine node (based on [5]).

New queries arrive at a front-end machine called the query broker, which broadcasts the query to the query servers of all shards, before collecting and merging the final result sets for presentation to the user. When a query reaches a query server, it is processed immediately if the server is idle. Each query server comprises a query processor, which is responsible for tokenising the query and ranking the documents of its index shard according to a scoring function (in our case, the BM25 scoring function [18]). Strategies such as dynamic pruning [4, 16, 2] can be used to process queries in an efficient manner on each query server. In this work, we consider document-sorted indices, as used by at least one major search engine [8]. Other efficient retrieval techniques such as frequency-sorted [2] or impact-sorted indices [1] are possible, which also support our objective of early termination of long running queries.
However, there is no evidence of such index layouts in common use within commercial search engines [15], perhaps, as suggested by Lester et al. [12], due to their practical disadvantages, such as the difficulty of use for Boolean and phrasal queries. As such, in this work, we focus on the realistic scenario of standard document-sorted index layouts. Finally, we use disjunctive semantics for queries, as supported by Craswell et al. [7], who highlighted that disjunctive semantics does not produce significantly different high-precision effectiveness compared to conjunctive retrieval. If the query server is already busy processing another query, each newly arrived query is placed in a queue, waiting to be selected by a query scheduler for processing. Hence, the time that a query spends at a query server, i.e. its completion time, can be split into two components: a waiting time, spent in the queue, and a processing time, spent being processed. While the latter depends on the particular retrieval strategy (which we call the processing strategy) and the shard's characteristics, the former depends on the specific scheduling algorithm implemented to manage the queue and on the number of queries in the queue itself.
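This decomposition of completion time into waiting plus processing time can be illustrated with a small FIFO simulation. The following is our own minimal sketch (the function name and toy numbers are illustrative, not taken from the paper):

```python
def fifo_completion_times(arrivals, service_times):
    """Simulate a single FIFO query server: each query's completion time
    splits into a waiting time (spent in the queue) plus a processing
    time (spent being served). `arrivals` and `service_times` are
    parallel lists, with arrivals sorted in non-decreasing order."""
    server_free_at = 0.0
    out = []
    for t_arr, t_proc in zip(arrivals, service_times):
        start = max(t_arr, server_free_at)   # wait while the server is busy
        waiting = start - t_arr
        server_free_at = start + t_proc
        out.append({"waiting": waiting,
                    "processing": t_proc,
                    "completion": waiting + t_proc})
    return out
```

For three queries arriving 0.1 s apart, each needing 0.3 s of processing, the waiting component grows with each query even though the processing component is constant, which is exactly why a scheduler must reason about the whole queue.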

Indeed, it has been observed that a query scheduler can make some gains in overall efficiency by re-ordering queries, thereby delaying the execution of expensive queries [14]. However, this approach only considers the cost of executing single queries, and hence cannot respond to surges in query traffic. Instead, in this work, we take a different approach, arguing that the time available to execute a query on a query server whilst meeting the completion time constraints is influenced by the other queries queued on that server. Hence, in this paper, we estimate the target completion time for a query on a server based on the predicted queueing and completion times of the queries scheduled after it in the queue. The utility of query scheduling is particularly evident when queries arrive at a higher rate than the maximum sustainable peak load of the system [11]. Indeed, in our proposed framework, we set the maximum query processing time to a carefully chosen value (see Section 4), such that the system load is kept under control, thereby enabling an optimal management of the peak load at the cost of slightly reduced results quality (see Section 7.2). Our proposed framework exploits novel machine learning models for estimating processing time under different processing strategies.

3. RELATED WORK

Having defined the architecture context of our work, in this section we discuss related work on which various components of our architecture rely, namely dynamic pruning (Section 3.1), query efficiency prediction (Section 3.2) and selective pruning (Section 3.3).

3.1 Dynamic Pruning

The strategies to match documents to a query fall into two categories [16]: in a term-at-a-time (TAAT) strategy, the posting lists of the query terms are processed and scored in sequence, while, in a document-at-a-time (DAAT) strategy, the query terms' posting lists are processed in parallel.
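The two matching strategies can be made concrete with a toy in-memory index. The sketch below is our own illustration (exhaustive scoring, no pruning, made-up postings); both traversals produce identical scores, differing only in the order in which postings are visited:

```python
from collections import defaultdict

# Toy postings: term -> list of (docid, term_score) pairs, docid-sorted.
POSTINGS = {
    "web":    [(1, 0.5), (3, 0.2), (7, 0.9)],
    "search": [(1, 0.4), (2, 0.8), (7, 0.1)],
}

def taat(query, postings):
    """Term-at-a-time: score one whole posting list at a time,
    summing partial scores into per-document accumulators."""
    acc = defaultdict(float)
    for term in query:
        for doc, s in postings.get(term, []):
            acc[doc] += s
    return dict(acc)

def daat(query, postings):
    """Document-at-a-time: advance all posting lists in parallel,
    fully scoring one document before moving to the next."""
    lists = [postings.get(t, []) for t in query]
    pos = [0] * len(lists)
    scores = {}
    while True:
        # The next document to score is the smallest current docid.
        heads = [pl[p][0] for pl, p in zip(lists, pos) if p < len(pl)]
        if not heads:
            return scores
        doc = min(heads)
        total = 0.0
        for i, pl in enumerate(lists):
            if pos[i] < len(pl) and pl[pos[i]][0] == doc:
                total += pl[pos[i]][1]
                pos[i] += 1
        scores[doc] = total
```

Note that DAAT finishes each document's score before moving on (enabling early termination per document), whereas TAAT must hold accumulators for all candidate documents until every list has been processed.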
To attain the typical sub-second response times of Web search engines, various techniques to enhance retrieval efficiency have been proposed (e.g. [4, 16, 2]). In particular, dynamic pruning aims to eliminate the scoring of documents that will not be present in the final list of top results. Most DAAT dynamic pruning strategies [4, 2] exhibit efficiency improvements without negatively impacting effectiveness, but some TAAT dynamic pruning techniques [12, 16], while they enhance efficiency, negatively impact retrieval effectiveness, because some relevant documents can be pruned. In this work, we consider the Continue TAAT dynamic pruning strategy [16], which we denote TAAT-CS. Our choice of the TAAT-CS strategy is motivated by the fact that its overall efficiency is directly proportional to the number of accumulators created in the first phase [16]. Indeed, fine tuning the number of accumulators gives us the flexibility to directly control the efficiency of the pruning strategy.

3.2 Query Efficiency Prediction

The query scheduler component must select the next query to be processed from the queue of waiting queries. To achieve this, it is fundamental to know in advance an estimate of the processing time of the query to be scheduled. Indeed, efficiency predictions estimate the response time of a search engine for a query [14]. Moffat et al. [15] stated that the response time of a query is related to the posting list lengths of its constituent query terms. However, in dynamic pruning strategies (e.g. Wand [4]), the response time of a query is more variable, as not every posting is scored, and many postings can be skipped [16], resulting in reduced retrieval time. As a result, for Wand, the length of the posting lists is insufficient to accurately predict the response time of a query [14]. Query efficiency predictors [14] have been proposed to address the problem of predicting the response time of Wand for an unseen query.
In particular, various term-level statistics are computed for each term offline. When a new query arrives, the term-level features are aggregated into query-level statistics, which are used as input to a learned regression model. In this work, arising from our focus on the TAAT-CS pruning strategy, we propose query efficiency predictions for TAAT-CS, by describing a set of features that can easily be used to estimate the efficiency of a query through a learned approach. These predictions represent our estimates of the query processing time, which we exploit to determine a maximum amount of processing time to allocate to each query.

3.3 Selective Pruning

Dynamic pruning strategies, such as Wand and TAAT-CS, can all be configured to be more aggressive. In doing so, the strategy becomes more efficient, but at a possible loss of effectiveness [4]. For instance, reducing the maximum number of accumulators in the TAAT-CS strategy results in fewer documents being examined before the second stage of the algorithm commences, when no new accumulators can be added. Hence, reducing the number of accumulators increases efficiency, but can result in relevant documents not being identified within the set of accumulators, thereby hindering effectiveness [16]. Typically, the aggressiveness is selected a priori, before any retrieval, independent of the query to be processed and its characteristics. However, in [19], Tonellotto et al. showed how the Wand pruning strategy could be configured to prune more or less aggressively, on a per-query basis, depending on the expected duration of the query. They call this approach selective pruning. Our work makes an important improvement to selective pruning compared to [19], by observing that the appropriate aggressiveness for a query should be determined not just by considering the current query.
Instead, our proposed load-sensitive selective pruning framework also accounts for the other queries waiting to be processed, and their predicted response times, together with their positions in the waiting queue. These are used to select the dynamic pruning aggressiveness, in order to process each query within a fixed time threshold, when possible, or to process it more efficiently when the time constraint cannot be respected.

4. LOAD-DRIVEN SELECTIVE PRUNING

One of the problems that must be addressed to build a large-scale Web search engine is how to provide the service when the received query volume is excessively high. In particular, when the entire system is overloaded, the response time of the queries increases, making it necessary to answer queries more rapidly. A common strategy is to drop queries that have been waiting or executing for a long time, returning an empty results list; alternatively, it is possible to set a time threshold and interrupt the retrieval whenever a query is going to take too much time. Both strategies are suboptimal and have the huge drawback of disappointing the users who submitted the queries that are dropped.

Figure 3: The components of the proposed load-sensitive selective pruning framework (bottom), along with a representation of the variables depicting the queries currently queued (top).

Typically, in search systems, critical situations arise when bursts of queries are submitted (almost) at the same time. See, for instance, the peak load around 12 PM in the query workload plotted in Figure 1. In this section, we discuss a novel load-sensitive framework, based on query efficiency predictors and taking into account other features such as the length of the list of queries waiting to be processed and the duration each query has been queued for. We aim to dynamically adapt the retrieval strategy, by reducing the processing time of queries when the system is heavily loaded. Indeed, during high query load, we propose to adopt aggressive pruning strategies, thus speeding up query processing, while possibly impacting negatively on the effectiveness of the returned results. Let us consider the search engine state depicted in Figure 3, which shows the system at time t. There are n queries q_1, ..., q_n waiting to be processed in the scheduler's queue. Let t_i be the arrival time of query q_i, where t_i ≤ t_j whenever i < j, i.e., t_1 ≤ ... ≤ t_n ≤ t. Query q_1 is the head of the queue, as it has been queued for the longest time. Until time t, the query processor was busy processing the previous queries (not shown in the figure), and at time t it becomes idle. Then, the query scheduler must select the next query to be processed. We assume that scheduling follows a first-in first-out discipline, that is, query q_1, which has been queued for the longest time, is selected for processing next. Furthermore, each query can be processed by several processing strategies σ_1, ..., σ_p, such as TAAT or DAAT with different levels of dynamic pruning aggressiveness.
We assume that strategy σ_1 is the search engine's full processing strategy, such as TAAT or DAAT, while subsequent strategies are increasingly more efficient, such that σ_p is the most efficient processing strategy. Moreover, we assume that, while σ_{k+1} is more efficient than σ_k, the effectiveness of σ_k is, in general, better than the effectiveness of σ_{k+1}. This assumption is well-founded, because efficient processing strategies typically have a negative impact on the corresponding retrieval effectiveness [13, 19, 21]. For query q_1, we associate with each strategy σ_k the processing time e_k(q_1), which the strategy is predicted to take to process query q_1. This means that, for example, e_1(q_1) represents the predicted processing time of query q_1 when the least efficient (but most effective) processing strategy is adopted, while e_p(q_1) represents the predicted processing time of the most efficient yet least effective strategy. A constant time threshold T represents the maximum time budget for the processing of any query: the completion time of any query must not be greater than T, such that its results can be presented to the user in a timely manner. This means that the time elapsed between the arrival of any query and the end of its processing must not exceed T. Note that, since the query has already spent some time in the queue, its available processing time, i.e., the maximum time it is allowed to spend in processing, is not, in general, equal to T, but is decreased by the time it has spent in the queue. Moreover, if there are other subsequent queries queued, it can be considered unfair for the query to take all the available time, while other queries are starved. Hence, we argue that the available processing time for each query is bounded by some time budget depending on various factors, such as the time the query has spent in the queue, and the number of queued queries. The definition of a suitable time budget is central to this paper.
Let f(q_i) be this time budget for query q_i, which has to ensure fairness in query processing: whenever the query workload is close to the maximum allowed, enqueued queries should be assigned reduced time budgets for their processing. Once f(q_i) has been computed, we have to select the processing strategies able to process the query within the time budget, i.e., any strategy σ_k(q_i) such that e_k(q_i) ≤ f(q_i). Finally, among all these strategies, we select the best strategy in terms of effectiveness, i.e., according to our assumptions, the strategy that takes the largest processing time among all admissible strategies. The definition of a suitable time budget function f(q_i) depends on various aspects: the position of the query in the queue, its arrival time, the current time, and the status of the queue. The outline of the proposed selective pruning framework is shown in Algorithm 1. For the queue of queries awaiting processing, q_1, ..., q_n, the expected processing times under all possible processing strategies are estimated. This allows the time budget f(q_1) to be calculated for the next query to be processed. Thereafter, we choose an appropriate query processing strategy, which aims to ensure that the query meets its completion time threshold T, while providing results that are as effective as possible.
Algorithm 1 Load-Sensitive Selective Pruning Framework
Input: The queries q_1, ..., q_n; the completion time threshold T
Output: The selected processing strategy σ for query q_1
1: for all processing strategies σ_k, k = 1, ..., p
2:     for all enqueued queries q_i, i = 1, ..., n
3:         expected processing time e_k(q_i) ← Predict(σ_k, q_i)
4: time budget f(q_1) ← Bound(T, σ_1(q_1), ..., σ_p(q_n))
5: processing strategy σ ← Select(f(q_1), e_1(q_1), ..., e_p(q_1))

In order to select the processing strategy σ, we must implement the following functions within our framework:

Predict(): Defines a mechanism for predicting the processing time of each query in the queue under each of the available dynamic pruning strategies. This mechanism is used to estimate the processing times e_k(q_i) of the available processing strategies, and to identify the pruning strategy that will most likely process the query within the desired time threshold T.

Bound(): Defines a method to compute the time budget f(q_1) for query q_1, depending on the global time threshold T and on the queries waiting to be processed. The time budget defines a bound on the processing time that query q_1 will be permitted.

Select(): Defines a mechanism to select the best processing strategy that is able to process query q_1 within the maximum processing time f(q_1) that q_1 is allowed to take, and that maximises the resulting query effectiveness.

Similar to previous work on selective pruning [19], the processing times of a query can be estimated through the use of query efficiency prediction [14], i.e. Predict(). However, as no such predictors have previously been defined for TAAT strategies such as TAAT-CS, in Section 5 we address query efficiency prediction for TAAT. In the remainder of this section, we propose mechanisms for Bound() (Section 4.1) and Select() (Section 4.2).

4.1 Bound()

We assume a list of queries q_1, q_2, ..., q_n that are currently (at time t) in the queue of the system. Each query q_i is associated with its arrival time t_i. Roughly speaking, the query processing time bound f(q_1) has the following goals:

1. Efficiency: q_1 (the least recently queued query) will have a completion time not greater than T, the global time threshold.
2. Effectiveness: The time available to process q_1 will be as large as possible, such that the most effective processing strategy can be deployed.
3. Fairness: Queries q_2, ..., q_n, received after q_1, are not starved of processing time, and hence are each able to meet T.

Clearly, these three goals can be at odds with each other. In the following, we describe four methods of defining f(q_1) that address some or all of the goals to varying extents.

Most effective. Query q_1 is processed as effectively as possible, i.e. using the least efficient processing strategy:

f(q_1) = max_k {e_k(q_1)} = e_1(q_1).

This method ignores the waiting time spent in the queue, and makes no attempt to prune queries aggressively such that the threshold T can be met, by this query or the other queries in the queue. In other words, it is a method that is neither fair nor efficient. For this reason, we use it as a baseline with maximal effectiveness.

Most efficient. Query q_1 is processed as fast as possible, by using the most efficient, aggressive pruning strategy for all queries:

f(q_1) = min_k {e_k(q_1)} = e_p(q_1).

In this method, we again ignore the waiting time that query q_1 has spent in the queue. Similarly to the previous method, it serves as a baseline that does not explicitly consider the fairness or effectiveness goals. However, in contrast, it consumes the least computing resources, and hence is the fairest method, even if the other queries do not exploit the unused resources.

Selfish. The query q_1, enqueued at time t_1, should be processed by time t_1 + T. Hence, at time t, the amount of remaining time Δ_1 to process the query such that threshold T is met has decreased by t − t_1 seconds, i.e.:

Δ_1 = (t_1 + T) − t

If Δ_1 > 0, the processing time bound is f(q_1) = Δ_1, and depends only on the time q_1 has spent in the queue, without consideration for the processing time needed by the other queued queries. If, instead, the time threshold T for this query has elapsed (Δ_1 ≤ 0), the query is processed as fast as possible, as in the most efficient case:

f(q_1) = Δ_1 if Δ_1 > 0, e_p(q_1) otherwise.

Altruistic. The previous method has the disadvantage that q_1's processing is bound by the maximum amount of time available (given the time spent in the queue), disregarding the queries that are still in the queue. This can penalise the queued queries q_2, ..., q_n that have not yet been processed. In contrast, Altruistic enforces fairness, by firstly computing how much time is left to empty the current queue. This is simply the time at which the last queued query q_n should be completed (t_n + T) minus the current time. Formally, Δ_n, the remaining time to finish processing up to query n, is:

Δ_n = (t_n + T) − t

Then, to compute the maximum time available for q_1, we have to subtract the minimum time necessary to process all the queued queries. This time is simply given by the sum of the estimates e_p(q_i) of the processing times needed by the fastest processing strategy σ_p. Hence, we define the available slack time S_n as:

S_n = Δ_n − Σ_{i=1}^{n} e_p(q_i).

If S_n > 0, we evenly distribute this extra slack time across the queued queries. In doing so, if some time is left beyond the minimum needed to process all enqueued queries, each one might receive a fair amount of extra processing time.¹ Hence, the processing bound for query q_1 becomes e_p(q_1) + S_n/n. However, this quantity can exceed Δ_1, which would result in too much extra budget being assigned to query q_1, beyond the time threshold T. In this case, the processing bound for query q_1 is simply Δ_1. Finally, if S_n ≤ 0, we process the query as fast as possible, as in the most efficient case, i.e.:

f(q_1) = min{Δ_1, e_p(q_1) + S_n/n} if S_n > 0, e_p(q_1) otherwise.

The Altruistic method of computing Bound() is a central contribution of our paper. Once the time budget f(q_1) has been computed, it is used by the query processor to select the most suitable processing strategy among those available to process the query. In the following, we describe Select(), the function used to take these decisions.

4.2 Select()

Given the time budget f(q_1) granted by Bound(), the role of the Select() function is to choose the most effective strategy σ = σ_k ∈ {σ_1, ..., σ_p} to resolve query q_1 within the assigned budget f(q_1). Primarily, the selection of an appropriate processing strategy is based on the estimated query processing times e_1(q_1), ..., e_p(q_1). Assuming the estimates are sorted in descending order of expected processing time, i.e., e_1(q_1) ≥ ... ≥ e_p(q_1), we identify the strategy σ_k where k, 1 ≤ k ≤ p, is the smallest index such that e_k(q_1) ≤ f(q_1). In other words, we select σ_k as the best strategy in terms of effectiveness whose expected completion time is not greater than the budget the query has been granted by Bound().

¹ This is true as long as no additional queries are received.
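The Selfish and Altruistic bounds, together with the Select() rule, can be sketched in a few lines. This is our own minimal sketch (the function names and the toy values in the test are illustrative; in the framework, the predicted times e_k come from the efficiency predictors discussed in Section 5):

```python
def bound_selfish(t_now, t1, T, e_fastest_q1):
    """Selfish: q1 may use all the time left before its own deadline t1 + T;
    if the deadline has already passed, fall back to the fastest strategy."""
    delta1 = (t1 + T) - t_now            # remaining time before q1's deadline
    return delta1 if delta1 > 0 else e_fastest_q1

def bound_altruistic(t_now, arrivals, T, e_fastest):
    """Altruistic: share the slack left after reserving the fastest
    strategy's predicted time for every queued query. arrivals[i] and
    e_fastest[i] refer to queued query q_{i+1}; q1 is at index 0."""
    n = len(arrivals)
    delta1 = (arrivals[0] + T) - t_now   # q1's remaining time
    delta_n = (arrivals[-1] + T) - t_now # time left to drain the whole queue
    slack = delta_n - sum(e_fastest)     # time beyond the bare minimum
    if slack > 0:
        # Evenly distributed slack, capped so q1 never exceeds its deadline.
        return min(delta1, e_fastest[0] + slack / n)
    return e_fastest[0]

def select(budget, est_times):
    """Pick the most effective strategy that fits the budget. est_times is
    sorted from slowest/most effective (sigma_1) to fastest (sigma_p);
    if no strategy fits, fall back to the most aggressive one."""
    for k, e in enumerate(est_times):
        if e <= budget:
            return k
    return len(est_times) - 1
```

For example, with three queued queries that arrived at 0.6, 0.8 and 0.9 s, a current time of 1.0 s, T = 0.5 s and a fastest-strategy estimate of 0.05 s per query, the altruistic budget for q_1 is capped by its own remaining deadline (0.1 s), and Select() picks the fastest strategy whose estimate fits that budget.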

Note that, in the case that no strategy is able to process query q_1 within the computed time budget, we always select the most aggressive processing strategy, i.e., σ_p. As a remark, when the aggressive baseline and Perfectionist methods are used, Select() always picks CS-1 (i.e., σ_p) and DAAT (i.e., σ_1), respectively. The descriptions of both Bound() and Select() have been given using the informal, and implicit, concept of an efficiency predictor. In the next section, we detail more precisely how, inspired by the work in [19], we predict the efficiency of a TAAT-CS strategy before processing commences.

5. PRUNING STRATEGIES & PREDICTORS

The framework described in the previous section relies on the concept of query efficiency predictors. In our definition, given a query and a set of query processing strategies, efficiency predictors return the estimated query processing time for each of the strategies considered. The load-sensitive selective pruning framework proposed in Section 4 is general with respect to the deployed retrieval strategy. However, in this work we focus on two particular strategies, namely DAAT and TAAT-CS. In particular, we adopt document-at-a-time (DAAT) for full processing. Full processing is chosen when, under normal load conditions, processing time is not constrained. On the other hand, when the system is experiencing a high workload, we resort to faster and less precise processing strategies, specifically based on the term-at-a-time Continue strategy (TAAT-CS) [16]. In the remainder of this section, we define the details of TAAT-CS (Section 5.1), before explaining how the processing time of both DAAT and TAAT-CS can be accurately predicted (Section 5.2).

5.1 TAAT-CS Dynamic Pruning

As defined in [16], TAAT-CS works as follows. Given a set of terms to process, sorted in decreasing order of posting list length, an OR phase processes the posting lists one by one until we have K accumulators.
From this point, no new accumulators are created, and an AND phase processes the remaining posting lists by intersecting them with the existing accumulators. The efficiency of the AND phase can benefit from skip pointers [16] within the posting lists, such that the postings of documents that are not in the top K accumulators are not decompressed, leading to I/O benefits. Therefore, smaller values of K correspond to more aggressive pruning, as the AND phase starts earlier and more skipping can occur during this phase. However, smaller K values are likely to lead to result lists with degraded effectiveness. Our implementation of the TAAT-CS dynamic pruning strategy adopts a further heuristic to optimise the initial phase in which new accumulators are created. Given that DAAT processing is faster than TAAT processing [9], we alter the accumulator creation phase as follows. We select the shortest l posting lists, such that the sum of their lengths is greater than or equal to the number of accumulators K. The posting lists for this initial set of terms are processed using a DAAT strategy, instead of TAAT. In doing so, the resulting number of accumulators will never be greater than the number of accumulators we would get after processing the first list with a classic TAAT-CS strategy. After this modified OR phase, the processing strategy proceeds with the AND phase as in TAAT-CS.
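The modified OR/AND phases can be sketched as follows, representing each posting list as a docid-to-score map and ignoring compression and skip pointers (a simplification of ours, not the actual implementation):

```python
def taat_cs(postings, K):
    # Sketch of the modified TAAT-CS ("Continue") strategy described above.
    # OR phase: the shortest lists whose combined length reaches K are merged
    # (a stand-in for DAAT OR processing), creating the accumulator set.
    # AND phase: the remaining lists only update existing accumulators.
    by_len = sorted(postings, key=len)
    or_lists, seen = [], 0
    while by_len and seen < K:
        or_lists.append(by_len.pop(0))
        seen += len(or_lists[-1])
    acc = {}
    for plist in or_lists:                 # OR phase: new accumulators allowed
        for doc, s in plist.items():
            acc[doc] = acc.get(doc, 0.0) + s
    for plist in by_len:                   # AND phase: no new accumulators
        for doc, s in plist.items():
            if doc in acc:
                acc[doc] += s
    return acc
```

With a small K, most of each long posting list contributes nothing and can be skipped in a real implementation, which is where the efficiency gain comes from.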
Using our refined strategy, we may end up with fewer accumulators than using the traditional TAAT-CS. However, in our initial experiments, we found that this happens only for 0.1% of the 10,000 queries used in this paper. Yet, on average, the response time of our DAAT/TAAT-CS strategy exhibits a 2x improvement over the classical TAAT-CS strategy. The adoption of the DAAT/TAAT-CS strategy also motivates the comparison of our selective pruning strategies with DAAT, instead of TAAT. Indeed, in terms of efficiency, out-performing DAAT as a baseline is, in general, more difficult than out-performing TAAT [9]. In the following, we refer to our DAAT/TAAT-CS with K accumulators as CS-K (e.g. CS-1 uses K = 1 accumulators), without further mention of the use of DAAT for the initial phase. As a side note, we are not aware of any previous work studying this small variation on TAAT-CS. Therefore, to the best of our knowledge, this is another new contribution presented by this work.

Query Efficiency Prediction Features

Method independent:
- total number of postings in the query's term lists
- number of terms in the query
- variance of the lengths of the posting lists
- mean of the lengths of the posting lists
- length of the shortest posting list
- length of the longest posting list

Method dependent (for CS):
- number of terms processed in the first phase of CS
- length of the posting lists processed in the first phase of CS
- number of terms processed in the second phase of CS
- length of the posting lists processed in the second phase of CS

Table 1: Features used for predicting processing time: the top features are method independent, the bottom features are method dependent, for CS.

5.2 Query Efficiency Prediction

In the preceding section, we defined the processing strategies used within this paper. In this section, we describe how we obtain query efficiency predictions for these processing strategies. In particular, we are inspired by the query efficiency predictors for DAAT previously defined by Macdonald et al. [14].
However, in this work we also use TAAT-CS for aggressive pruning. Hence, in the following we devise a method for predicting the processing time of CS-K, before retrieval commences, using a linear regression-based technique. First of all, we define a set of features to represent each query. In the case of DAAT, Macdonald et al. [14] show that there is a strong correlation between the distribution of postings among the query terms and the response time of the query itself. Therefore, to predict the response time of DAAT we use the features listed in the top part of Table 1. On the other hand, as discussed above, TAAT-CS strategies do not score all postings in the posting lists of the query terms. Hence, we do not expect that relying only on posting features can lead to good predictions. Instead, given the characteristics of our TAAT-CS strategies (a first phase where we fully evaluate a subset of terms using DAAT, and a second phase where we use the remaining terms to update the accumulators found in the first phase), we build a regression model using the features listed in the bottom part of Table 1, in addition to the method-independent features listed in the top part. It is of note that all of these query efficiency prediction features can be calculated using commonly available statistics, particularly the lengths of the query terms' corresponding posting lists, before retrieval commences, and hence query efficiency predictions can be made with very low overheads, as soon as a query arrives at a query server. In total, our prediction method models the problem using a feature space made up of 10 distinct features. As our reference architecture is a distributed one, each query server might have different response times for the same query. For this reason, we need to build a different model for each server. We adopt a linear regression model to estimate the running time e_j(q_i) of query q_i when scored using method j. In other words, we model e_j(q_i) as a linear combination of the features f_i, each weighted by a real value λ_f. Features and weights differ for each scoring method, thus we write f_{ji} and λ_{jf} to refer to the values for scoring method j. Formally:

e_j(q_i) = λ_{j0} f_{j0} + ... + λ_{j9} f_{j9}

Linear regression is then used to find the values of the various λ_{jf} with the goal of minimising the least square error of the processing time on a training set of queries [14]. In the next section, we define the experimental setup for our experiments. In particular, our experiments demonstrate the accuracy of the proposed efficiency predictors for TAAT-CS, before showing how the selective scheduling framework proposed in Section 4 can increase the ability of a search engine to effectively and efficiently handle different query traffic loads.

6. EXPERIMENTAL SETUP

In the following experiments, we deploy a widely used document collection created as part of TREC, namely the ClueWeb09 (cat. B) collection, which comprises around 50 million English Web documents, and is designed to represent the first-tier index of a commercial Web search engine. We index the document collection using the Terrier search engine [17], removing standard stopwords and applying Porter's English stemmer. The resulting index is document-partitioned into ten separate index shards, while maintaining the original ordering of the collection.
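As a concrete illustration of the fitting step, the following sketch fits the single-feature baseline predictor (time as a linear function of the total number of postings) by closed-form least squares; the paper's actual models combine ten features per strategy and are trained separately per server, but the fitting principle is the same. All names here are our own:

```python
def fit_baseline(total_postings, times):
    # Least-squares fit of time ~ a + b * total_postings, mirroring the
    # single-feature baseline predictor (sum of postings) used in Section 7.1.
    n = len(times)
    mx = sum(total_postings) / n
    my = sum(times) / n
    sxx = sum((x - mx) ** 2 for x in total_postings)
    sxy = sum((x - mx) * (y - my) for x, y in zip(total_postings, times))
    b = sxy / sxx          # per-posting cost
    a = my - b * mx        # fixed per-query overhead
    return a, b

def predict_baseline(a, b, total_postings):
    # Estimated processing time for an unseen query, before retrieval starts.
    return a + b * total_postings
```

Extending this to the full ten-feature model amounts to solving the same least-squares problem over a feature matrix instead of a single column.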
Each inverted index shard, which is stored on disk, also has skipping information embedded, to permit skipping [16] during the Continue phase of TAAT-CS. For the retrieval experiments, we use a distributed C++ search engine, accessing the index produced by Terrier. Our experiments are conducted on a cluster of twelve quad-core machines, where each machine has one Intel Xeon 2.4GHz X3223 CPU and 8GB of RAM, connected using Gigabit Ethernet. Only a single core on each query server is used to serve queries². Two additional nodes are used as follows: one as the query broker, and one as the client application that sends the queries to the system. Finally, each query server has a queue used to hold queries coming from the broker, while the query processor on each query server processes queries one at a time. As query processing strategies, we use DAAT, as well as TAAT-CS with different numbers of accumulators, i.e. CS-1, CS-2, CS-5 and CS-1. Documents are scored using BM25, with parameters at their default settings [18]. We use queries from the TREC Million Query Track 2009 [6], which contains 40,000 queries, some of which have relevance assessments. In our experiments, 30,000 of these queries are used as the training set for learning the λ values in our regression models, while the other 10,000 are used for testing the accuracy of the predictors, and for the retrieval experiments. For measuring the accuracy of our query efficiency predictors, we use the root mean square error (RMSE), while for retrieval effectiveness, we compute NDCG@1 using the 687 queries out of the 10,000 that have relevance assessments from TREC 2009. Efficiency is measured using the mean response time computed over 5 runs for each test.

² While increasing the number of cores on each query server obviously increases throughput, we prefer to use a single-threaded environment to reduce any resource contention that may reduce the reliability of the experimental results.
In our experiments, we do not use query caching, in order to better analyse the impact of our models on the processing performance. Moreover, adding a cache in front of our architecture would only reduce the query arrival rate, not the efficiency and effectiveness of our method.

7. EXPERIMENTS

In the following, we address these research questions:

RQ1. What is the accuracy of the linear regression-based approach for query efficiency prediction for TAAT-CS? (Section 7.1)

RQ2. Do the proposed methods achieve effective and efficient retrieval under different query loads? (Section 7.2)

RQ3. To what extent can an efficient rate of queries per second be serviced for different time thresholds? (Section 7.3)

7.1 Predictor Error Evaluation

Efficiency predictors, which aim to predict the processing time of a query before retrieval commences, are an important component of our work. In this first research question, we aim to ensure that our estimates, particularly for the TAAT-CS pruning strategies, are accurate. We compare the accuracy of the features listed in Table 1 when combined using linear regression. In particular, we compare the set that only includes the six method-independent features with the set that includes, in addition to the previous six, the four method-dependent features proposed for TAAT-CS. Table 2 reports the accuracy of the linear regression models combining the six and ten features, as well as a baseline predictor that uses only the total number of postings for the query terms as a feature. In the table, we report the mean, over the ten query servers, of the query processing time (QPT) for each strategy, as well as the root mean square error (RMSE), and the percentage of queries for which the prediction error is less than 1 ms. The best value in each row for each measure is highlighted.
On analysing Table 2, we note that for DAAT, using the six features improves over the baseline single-feature predictor by 42% in terms of RMSE, with 95% of the queries having a prediction error of less than 1 ms. On the other hand, using only the six features is insufficient for accurate processing time prediction for the CS-K strategies: for instance, for CS-1, only 65% of queries are accurately predicted within 1 ms. However, for the linear regression model that uses the additional 4 method-dependent features (10 features in all)³, the error is one order of magnitude lower, and for the vast majority of queries (95-99%) our linear model is able to predict the correct response time within a 1 ms error. Therefore, in answering research question RQ1, we find that the proposed linear regression model is accurate, with an error smaller than 1 ms in more than 95% of the cases.

³ The 4 method-dependent features do not apply to DAAT.
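The two accuracy measures reported in Table 2 can be computed as follows (a straightforward sketch with our own function names; times are in seconds):

```python
import math

def rmse(predicted, actual):
    # Root mean square error between predicted and observed processing times.
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(actual))

def within(predicted, actual, tol):
    # Fraction of queries whose absolute prediction error is below tol.
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) < tol)
    return hits / len(actual)
```

RMSE penalises large mispredictions quadratically, while the within-tolerance fraction directly reflects how often the scheduler receives a usable estimate.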

Strategy   QPT
DAAT       0.11 s
CS-1       0.25 s
CS-2       0.3 s
CS-5       0.37 s
CS-1       0.44 s

Table 2: Mean query processing time (QPT, in seconds), as well as prediction accuracy (RMSE and percentage of queries with an error below 1 ms) for the 1-feature (sum of postings), 6-feature (method independent) and 10-feature (including method dependent) predictors, for each processing strategy.

Figure 4: Average query response time in seconds for different methods, T = 0.5.

In particular, the best performing models for predicting the CS-K strategies are those obtained with the full set of ten features described in Section 5, while in the case of DAAT, the six features describing the lengths of the posting lists associated with the query terms perform very well. Therefore, in the following experiments, we use the six features for the prediction of the DAAT processing times and the full set of ten features for the prediction of the TAAT CS-K processing times.

7.2 Efficiency and Effectiveness Analysis

In this section, we experiment to address RQ2, comparing the efficiency and effectiveness of our proposed load-sensitive selective pruning framework. In particular, we compare our methods, Selfish and Altruistic, with three different baselines: Perfectionist and the aggressive baseline, as well as applying CS-1 for all queries. We remark that, by their respective definitions, Perfectionist corresponds to a pure DAAT full processing strategy and the aggressive baseline corresponds to using CS-1. Within this section, we use a maximum threshold time of T = 0.5 seconds, which mandates that the results for each query must be returned, including both queueing and processing, before this time elapses. Later, in Section 7.3, we analyse how T affects the performance of our methods. We analyse our methods in terms of query response time and effectiveness, stressing our search system with different rates of queries, measured in queries per second (q/s).
The query response time corresponds to the time the query spends within the queues plus the time it spends being processed, in other words, the time a user waits for the results to be returned. We evaluate effectiveness using NDCG@1, exploiting the 687 queries that have relevance assessments. Firstly, we experiment to determine the average response time of the various methods by varying the number of queries per second submitted to the search system. As the Million Query Track query set does not have query arrival times, queries are submitted at a uniform query rate, in other words, a submission rate of N q/s corresponds to submitting a query every 1/N seconds. This allows us to measure the behaviour of the various techniques under various load conditions, as shown in Figure 4. As expected, when using the Perfectionist method, the mean response time exceeds the threshold (T = 0.5) for all except very low workloads. CS-1 can sustain slightly higher loads than Perfectionist; however, for loads greater than 2 q/s the response times are well above the threshold.

Figure 5: Average query response time in seconds for different methods (enlargement of Figure 4).

Figure 5 enlarges the curves of Figure 4 for query response times up to the threshold T = 0.5. This allows us to better analyse the behaviour of the various methods for a workload of 4 q/s or less. Clearly, the aggressive baseline attains the smallest response times, as it aggressively prunes all queries. Both the Selfish and Altruistic methods are less efficient than the aggressive baseline, but still meet the threshold up to 4 q/s. To show how the various methods cope with queries of varying efficiency, Figure 6 plots the actual query response times for a subset (one hundred) of all the test queries, for a query workload of 4 q/s.

Figure 6: Query response time for 100 queries, arrival rate 4 q/s, T = 0.5.
In particular, the response times for Perfectionist, CS-1, Selfish, and Altruistic are shown. Spikes in the lines correspond to the effect of expensive queries on later queries. Indeed, expensive queries delay the queries submitted after them, as expected, though Selfish and Altruistic are more uniform than the others. In particular, in the case of Altruistic, the line is also close to the time threshold, indicating a better utilisation of the resources. To determine how the threshold is adhered to for different methods and workloads, Figure 7 shows the percentage of queries whose response times are within the threshold.
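The delaying effect of expensive queries visible in Figure 6 can be reproduced with a minimal single-server FIFO queue model under the uniform arrival scheme of Section 7.2 (our own sketch, not the experimental testbed):

```python
def simulate(process_times, rate):
    # Single-server FIFO queue: query i arrives at i/rate seconds and queries
    # are processed one at a time; response time = queueing wait + processing.
    responses, free_at = [], 0.0
    for i, service in enumerate(process_times):
        arrival = i / rate
        start = max(arrival, free_at)
        free_at = start + service
        responses.append(free_at - arrival)
    return responses
```

For instance, at 10 q/s, a single 0.3 s query followed by two 0.1 s queries pushes all three response times to 0.3 s, illustrating how one expensive query produces a spike affecting its successors.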


DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

Chapter 8 Memory Management

Chapter 8 Memory Management 1 Chapter 8 Memory Management The technique we will describe are: 1. Single continuous memory management 2. Partitioned memory management 3. Relocatable partitioned memory management 4. Paged memory management

More information

Chapter 6: CPU Scheduling. Operating System Concepts 9 th Edition

Chapter 6: CPU Scheduling. Operating System Concepts 9 th Edition Chapter 6: CPU Scheduling Silberschatz, Galvin and Gagne 2013 Chapter 6: CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Thread Scheduling Multiple-Processor Scheduling Real-Time

More information

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Stanley Bak Abstract Network algorithms are deployed on large networks, and proper algorithm evaluation is necessary to avoid

More information

Operating Systems Unit 6. Memory Management

Operating Systems Unit 6. Memory Management Unit 6 Memory Management Structure 6.1 Introduction Objectives 6.2 Logical versus Physical Address Space 6.3 Swapping 6.4 Contiguous Allocation Single partition Allocation Multiple Partition Allocation

More information

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc Fuxi Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc {jiamang.wang, yongjun.wyj, hua.caihua, zhipeng.tzp, zhiqiang.lv,

More information

Virtual Memory. Chapter 8

Virtual Memory. Chapter 8 Virtual Memory 1 Chapter 8 Characteristics of Paging and Segmentation Memory references are dynamically translated into physical addresses at run time E.g., process may be swapped in and out of main memory

More information

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 70 CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 3.1 INTRODUCTION In medical science, effective tools are essential to categorize and systematically

More information

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

An Oracle White Paper September Oracle Utilities Meter Data Management Demonstrates Extreme Performance on Oracle Exadata/Exalogic

An Oracle White Paper September Oracle Utilities Meter Data Management Demonstrates Extreme Performance on Oracle Exadata/Exalogic An Oracle White Paper September 2011 Oracle Utilities Meter Data Management 2.0.1 Demonstrates Extreme Performance on Oracle Exadata/Exalogic Introduction New utilities technologies are bringing with them

More information

Process- Concept &Process Scheduling OPERATING SYSTEMS

Process- Concept &Process Scheduling OPERATING SYSTEMS OPERATING SYSTEMS Prescribed Text Book Operating System Principles, Seventh Edition By Abraham Silberschatz, Peter Baer Galvin and Greg Gagne PROCESS MANAGEMENT Current day computer systems allow multiple

More information

A Combined Semi-Pipelined Query Processing Architecture For Distributed Full-Text Retrieval

A Combined Semi-Pipelined Query Processing Architecture For Distributed Full-Text Retrieval A Combined Semi-Pipelined Query Processing Architecture For Distributed Full-Text Retrieval Simon Jonassen and Svein Erik Bratsberg Department of Computer and Information Science Norwegian University of

More information

Asynchronous Method Calls White Paper VERSION Copyright 2014 Jade Software Corporation Limited. All rights reserved.

Asynchronous Method Calls White Paper VERSION Copyright 2014 Jade Software Corporation Limited. All rights reserved. VERSION 7.0.10 Copyright 2014 Jade Software Corporation Limited. All rights reserved. Jade Software Corporation Limited cannot accept any financial or other responsibilities that may be the result of your

More information

Modeling and Synthesizing Task Placement Constraints in Google Compute Clusters

Modeling and Synthesizing Task Placement Constraints in Google Compute Clusters Modeling and Synthesizing Task Placement s in Google s Bikash Sharma Pennsylvania State University University Park 1 bikash@cse.psu.edu Rasekh Rifaat Google Inc. Seattle 93 rasekh@google.com Victor Chudnovsky

More information

vsan 6.6 Performance Improvements First Published On: Last Updated On:

vsan 6.6 Performance Improvements First Published On: Last Updated On: vsan 6.6 Performance Improvements First Published On: 07-24-2017 Last Updated On: 07-28-2017 1 Table of Contents 1. Overview 1.1.Executive Summary 1.2.Introduction 2. vsan Testing Configuration and Conditions

More information

Leveraging Transitive Relations for Crowdsourced Joins*

Leveraging Transitive Relations for Crowdsourced Joins* Leveraging Transitive Relations for Crowdsourced Joins* Jiannan Wang #, Guoliang Li #, Tim Kraska, Michael J. Franklin, Jianhua Feng # # Department of Computer Science, Tsinghua University, Brown University,

More information

OPTIMAL MULTI-CHANNEL ASSIGNMENTS IN VEHICULAR AD-HOC NETWORKS

OPTIMAL MULTI-CHANNEL ASSIGNMENTS IN VEHICULAR AD-HOC NETWORKS Chapter 2 OPTIMAL MULTI-CHANNEL ASSIGNMENTS IN VEHICULAR AD-HOC NETWORKS Hanan Luss and Wai Chen Telcordia Technologies, Piscataway, New Jersey 08854 hluss@telcordia.com, wchen@research.telcordia.com Abstract:

More information

Information Retrieval II

Information Retrieval II Information Retrieval II David Hawking 30 Sep 2010 Machine Learning Summer School, ANU Session Outline Ranking documents in response to a query Measuring the quality of such rankings Case Study: Tuning

More information

Optimal Space-time Tradeoffs for Inverted Indexes

Optimal Space-time Tradeoffs for Inverted Indexes Optimal Space-time Tradeoffs for Inverted Indexes Giuseppe Ottaviano 1, Nicola Tonellotto 1, Rossano Venturini 1,2 1 National Research Council of Italy, Pisa, Italy 2 Department of Computer Science, University

More information

Adaptive Parallelism for Web Search

Adaptive Parallelism for Web Search Adaptive Parallelism for Web Search Myeongjae Jeon, Yuxiong He, Sameh Elnikety, Alan L. Cox, Scott Rixner Microsoft Research Rice University Redmond, WA, USA Houston, TX, USA Abstract A web search query

More information

McGill University - Faculty of Engineering Department of Electrical and Computer Engineering

McGill University - Faculty of Engineering Department of Electrical and Computer Engineering McGill University - Faculty of Engineering Department of Electrical and Computer Engineering ECSE 494 Telecommunication Networks Lab Prof. M. Coates Winter 2003 Experiment 5: LAN Operation, Multiple Access

More information

Modeling and Synthesizing Task Placement Constraints in Google Compute Clusters

Modeling and Synthesizing Task Placement Constraints in Google Compute Clusters Modeling and Synthesizing Task Placement s in Google s Bikash Sharma Pennsylvania State University University Park 1 bikash@cse.psu.edu Rasekh Rifaat Google Inc. Seattle 913 rasekh@google.com Victor Chudnovsky

More information

Achieving Distributed Buffering in Multi-path Routing using Fair Allocation

Achieving Distributed Buffering in Multi-path Routing using Fair Allocation Achieving Distributed Buffering in Multi-path Routing using Fair Allocation Ali Al-Dhaher, Tricha Anjali Department of Electrical and Computer Engineering Illinois Institute of Technology Chicago, Illinois

More information

ptop: A Process-level Power Profiling Tool

ptop: A Process-level Power Profiling Tool ptop: A Process-level Power Profiling Tool Thanh Do, Suhib Rawshdeh, and Weisong Shi Wayne State University {thanh, suhib, weisong}@wayne.edu ABSTRACT We solve the problem of estimating the amount of energy

More information

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

Hashing. Hashing Procedures

Hashing. Hashing Procedures Hashing Hashing Procedures Let us denote the set of all possible key values (i.e., the universe of keys) used in a dictionary application by U. Suppose an application requires a dictionary in which elements

More information

12 PEPA Case Study: Rap Genius on Heroku

12 PEPA Case Study: Rap Genius on Heroku 1 PEPA Case Study: Rap Genius on Heroku As an example of a realistic case study, we consider a Platform as a service (PaaS) system, Heroku, and study its behaviour under different policies for assigning

More information

ECE519 Advanced Operating Systems

ECE519 Advanced Operating Systems IT 540 Operating Systems ECE519 Advanced Operating Systems Prof. Dr. Hasan Hüseyin BALIK (10 th Week) (Advanced) Operating Systems 10. Multiprocessor, Multicore and Real-Time Scheduling 10. Outline Multiprocessor

More information

Quantitative Models for Performance Enhancement of Information Retrieval from Relational Databases

Quantitative Models for Performance Enhancement of Information Retrieval from Relational Databases Quantitative Models for Performance Enhancement of Information Retrieval from Relational Databases Jenna Estep Corvis Corporation, Columbia, MD 21046 Natarajan Gautam Harold and Inge Marcus Department

More information

Using Coherence-based Measures to Predict Query Difficulty

Using Coherence-based Measures to Predict Query Difficulty Using Coherence-based Measures to Predict Query Difficulty Jiyin He, Martha Larson, and Maarten de Rijke ISLA, University of Amsterdam {jiyinhe,larson,mdr}@science.uva.nl Abstract. We investigate the potential

More information

Worst-case Ethernet Network Latency for Shaped Sources

Worst-case Ethernet Network Latency for Shaped Sources Worst-case Ethernet Network Latency for Shaped Sources Max Azarov, SMSC 7th October 2005 Contents For 802.3 ResE study group 1 Worst-case latency theorem 1 1.1 Assumptions.............................

More information

Oracle Database 10g Resource Manager. An Oracle White Paper October 2005

Oracle Database 10g Resource Manager. An Oracle White Paper October 2005 Oracle Database 10g Resource Manager An Oracle White Paper October 2005 Oracle Database 10g Resource Manager INTRODUCTION... 3 SYSTEM AND RESOURCE MANAGEMENT... 3 ESTABLISHING RESOURCE PLANS AND POLICIES...

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models RCFile: A Fast and Space-efficient Data

More information

Course Syllabus. Operating Systems

Course Syllabus. Operating Systems Course Syllabus. Introduction - History; Views; Concepts; Structure 2. Process Management - Processes; State + Resources; Threads; Unix implementation of Processes 3. Scheduling Paradigms; Unix; Modeling

More information

A Disk Head Scheduling Simulator

A Disk Head Scheduling Simulator A Disk Head Scheduling Simulator Steven Robbins Department of Computer Science University of Texas at San Antonio srobbins@cs.utsa.edu Abstract Disk head scheduling is a standard topic in undergraduate

More information

CPU THREAD PRIORITIZATION USING A DYNAMIC QUANTUM TIME ROUND-ROBIN ALGORITHM

CPU THREAD PRIORITIZATION USING A DYNAMIC QUANTUM TIME ROUND-ROBIN ALGORITHM CPU THREAD PRIORITIZATION USING A DYNAMIC QUANTUM TIME ROUND-ROBIN ALGORITHM Maysoon A. Mohammed 1, 2, Mazlina Abdul Majid 1, Balsam A. Mustafa 1 and Rana Fareed Ghani 3 1 Faculty of Computer System &

More information

OASIS: Self-tuning Storage for Applications

OASIS: Self-tuning Storage for Applications OASIS: Self-tuning Storage for Applications Kostas Magoutis, Prasenjit Sarkar, Gauri Shah 14 th NASA Goddard- 23 rd IEEE Mass Storage Systems Technologies, College Park, MD, May 17, 2006 Outline Motivation

More information

CS Operating Systems

CS Operating Systems CS 4500 - Operating Systems Module 9: Memory Management - Part 1 Stanley Wileman Department of Computer Science University of Nebraska at Omaha Omaha, NE 68182-0500, USA June 9, 2017 In This Module...

More information

CS Operating Systems

CS Operating Systems CS 4500 - Operating Systems Module 9: Memory Management - Part 1 Stanley Wileman Department of Computer Science University of Nebraska at Omaha Omaha, NE 68182-0500, USA June 9, 2017 In This Module...

More information

Last Class: Processes

Last Class: Processes Last Class: Processes A process is the unit of execution. Processes are represented as Process Control Blocks in the OS PCBs contain process state, scheduling and memory management information, etc A process

More information

Efficient Lists Intersection by CPU- GPU Cooperative Computing

Efficient Lists Intersection by CPU- GPU Cooperative Computing Efficient Lists Intersection by CPU- GPU Cooperative Computing Di Wu, Fan Zhang, Naiyong Ao, Gang Wang, Xiaoguang Liu, Jing Liu Nankai-Baidu Joint Lab, Nankai University Outline Introduction Cooperative

More information

Performance of relational database management

Performance of relational database management Building a 3-D DRAM Architecture for Optimum Cost/Performance By Gene Bowles and Duke Lambert As systems increase in performance and power, magnetic disk storage speeds have lagged behind. But using solidstate

More information

Networking Acronym Smorgasbord: , DVMRP, CBT, WFQ

Networking Acronym Smorgasbord: , DVMRP, CBT, WFQ Networking Acronym Smorgasbord: 802.11, DVMRP, CBT, WFQ EE122 Fall 2011 Scott Shenker http://inst.eecs.berkeley.edu/~ee122/ Materials with thanks to Jennifer Rexford, Ion Stoica, Vern Paxson and other

More information

A Hybrid Recursive Multi-Way Number Partitioning Algorithm

A Hybrid Recursive Multi-Way Number Partitioning Algorithm Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence A Hybrid Recursive Multi-Way Number Partitioning Algorithm Richard E. Korf Computer Science Department University

More information

Query Evaluation Strategies

Query Evaluation Strategies Introduction to Search Engine Technology Term-at-a-Time and Document-at-a-Time Evaluation Ronny Lempel Yahoo! Labs (Many of the following slides are courtesy of Aya Soffer and David Carmel, IBM Haifa Research

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information

Design of Parallel Algorithms. Course Introduction

Design of Parallel Algorithms. Course Introduction + Design of Parallel Algorithms Course Introduction + CSE 4163/6163 Parallel Algorithm Analysis & Design! Course Web Site: http://www.cse.msstate.edu/~luke/courses/fl17/cse4163! Instructor: Ed Luke! Office:

More information

Mark Sandstrom ThroughPuter, Inc.

Mark Sandstrom ThroughPuter, Inc. Hardware Implemented Scheduler, Placer, Inter-Task Communications and IO System Functions for Many Processors Dynamically Shared among Multiple Applications Mark Sandstrom ThroughPuter, Inc mark@throughputercom

More information

Why You Should Consider a Hardware Based Protocol Analyzer?

Why You Should Consider a Hardware Based Protocol Analyzer? Why You Should Consider a Hardware Based Protocol Analyzer? Software-only protocol analyzers are limited to accessing network traffic through the utilization of mirroring. While this is the most convenient

More information

Active Adaptation in QoS Architecture Model

Active Adaptation in QoS Architecture Model Active Adaptation in QoS Architecture Model Drago agar and Snjeana Rimac -Drlje Faculty of Electrical Engineering University of Osijek Kneza Trpimira 2b, HR-31000 Osijek, CROATIA Abstract - A new complex

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung December 2003 ACM symposium on Operating systems principles Publisher: ACM Nov. 26, 2008 OUTLINE INTRODUCTION DESIGN OVERVIEW

More information

Inital Starting Point Analysis for K-Means Clustering: A Case Study

Inital Starting Point Analysis for K-Means Clustering: A Case Study lemson University TigerPrints Publications School of omputing 3-26 Inital Starting Point Analysis for K-Means lustering: A ase Study Amy Apon lemson University, aapon@clemson.edu Frank Robinson Vanderbilt

More information

Rapid Bottleneck Identification A Better Way to do Load Testing. An Oracle White Paper June 2008

Rapid Bottleneck Identification A Better Way to do Load Testing. An Oracle White Paper June 2008 Rapid Bottleneck Identification A Better Way to do Load Testing An Oracle White Paper June 2008 Rapid Bottleneck Identification A Better Way to do Load Testing. RBI combines a comprehensive understanding

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung SOSP 2003 presented by Kun Suo Outline GFS Background, Concepts and Key words Example of GFS Operations Some optimizations in

More information