Improving the Efficiency of Multi-site Web Search Engines

Size: px

Start display at page:

Download "Improving the Efficiency of Multi-site Web Search Engines"

Lynne Walsh
5 years ago
Views:

1 Improving the Efficiency of Multi-site Web Search Engines ABSTRACT A multi-site web search engine is composed of a number of search sites geographically distributed around the world. Each search site is typically responsible for crawling and indexing the web pages that are in its geographical neighborhood. A query is selectively processed on a subset of search sites that are predicted to return the best-matching results. The scalability and efficiency of multi-site web search engines have attracted a lot of research attention in the recent years. In particular, research has focused on replicating important web pages across sites, forwarding queries to relevant sites, and caching results of previous queries. Yet, these problems have only been studied in isolation, but no prior work has properly investigated the interplay between them. In this paper, we take this challenge up and conduct what we believe is the first comprehensive analysis of a full stack of techniques for efficient multi-site web search. Specifically, we propose a document replication technique that improves the query locality of the state-of-the-art approaches with various replication budget distribution strategies. We devise a machine learning approach to decide the query forwarding patterns, which achieves significantly lower false positive ratio than the state-of-the-art thresholding approach with little impact on search result quality. We propose three result caching strategies that reduce the number of forwarded queries and analyze their trade-off in storage and network overhead. Importantly, we show that the combination of the best-of-the-class techniques yields very promising search efficiency, rendering multi-site distributed web search engines an attractive alternative to centralized web search engines. Categories and Subject Descriptors H.3.3 [Information Storage Systems]: Information Retrieval Systems Keywords Distributed web search, query processing, document replication, query forwarding, result caching, efficiency. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 20XX ACM X-XXXXX-XX-X/XX/XX...$ INTRODUCTION As the vast amounts of data available in the Web and the number of users accessing it keep their steady growth, an emerging line of research has been focusing on the scalability problems faced by single-site centralized web search engines [3, 10, 13, 21]. An interesting alternative, multi-site distributed web search engines have been shown to reduce the resource consumption in query processing and thus the financial costs of search engines, while also decreasing query response time perceived by users [3]. In multi-site web search engines, the whole document collection is partitioned into a number of distributed indexes, usually based on the geographical proximity between web servers and users. The objective is to exploit the so-called query locality, i.e., the fact that a fraction of queries can get their best-matching documents in the index of the site to which they are issued, saving the need to process them on remote sites and thus reducing response latencies and query processing workloads. Multi-site web search engines raise three major research problems regarding query processing, the focus of our work. First, since each search site only indexes a subset of the documents, some queries may not be able to get all the best-matching results from the local index. A query forwarding strategy is thus necessary to determine, every time a query is received at a local site, which search sites may be indexing some best-matching documents, so that the query can be forwarded to those sites. In order to avoid unnecessary increases in query response time, it is important to forward queries only to the sites which indexed the bestmatching results. Second, forwarding queries increases the query response time due to the network latency among sites. If we replicate and index locally those documents that are frequently fetched from remote sites to serve queries, more queries will benefit from the faster query response time ensured by the query locality. The challenge here is to determine the right documents to replicate in each site in order to achieve high query locality with low storage overhead. Third, as commonly used in centralized search engines, a result cache is useful to reduce the query response time and the query processing workload. In addition to the problems faced by result caches in centralized search engines (e.g., admission, eviction, invalidation, etc.), it is also important for a multi-site search engine to determine in which sites a query should be cached. Contribution. We propose and evaluate new strategies for each of the outlined problems (document replication, query forwarding and multi-site result caching), and conduct what we believe is the first realistic simulation of the full stack of

2 strategies in a multi-site search engine: We propose a document replication technique that relies on per-document utility to select the documents to replicate on each site, improving the query locality of the state-of-the-art approaches by up to 5.88% with various replication budget distribution strategies. We propose a machine learning approach for the query forwarding problem that exploits the trade-off between result quality, on the one hand, and query response time and system workload, on the other hand. Without using any document replication or result caching techniques, our technique is able to increase the fraction of locally processed queries by up to 61% with respect to a state-of-the-art approach. We propose three multi-site result caching strategies that improve the query locality of the local cache approach by up to 3.4% and evaluate the trade-off in terms of the network and storage overheads. We conduct extensive experiments with real datasets to study the interplay among the proposed techniques and show that the combination of the best-of-the-class techniques improves the average query response time of the state-of-the-art approach by up to 24% for the same 0.99 NDCG result quality. Paper outline. The rest of the paper is organized as follows. We describe the general architecture of a multi-site web search engine in Section 2. Section 3 briefly describes the dataset and setup used in our simulations. Sections 4, 5, and 6 present the strategies we propose for document replication, query forwarding, and multi-site result caching, respectively. Section 7 combines and evaluates the best-ofthe-class techniques. Section 8 surveys the related work and Section 9 concludes the paper. 2. MULTI-SITE WEB SEARCH 2.1 Architecture A distributed web search engine is composed of a set S = {s 1, s 2,..., s m} of m of geographically distributed sites. The set D = {d 1, d 2,..., d n} of n documents is partitioned among the indexes of these m sites so that each site s i maintains a local index comprising a disjoint subset of the documents D i (D = 1 i m Di and Di Dj = ).1 We denote by size(d) the size of document d, measured in the number of postings contributed to the inverted index. 2 Additionally, each site s i serves a set Q i of queries that are issued to the site. We denote by f i(q) the number of times that a query q was issued to s i (in a given query set). Finally, we denote by R(q) D the global top-k result set for the query q (i.e., the result set returned by a centralized architecture). 3 Query forwarding. Upon the reception of a query q at some search site s i, a query forwarder is responsible for deciding which of the m sites might have documents that are among the best-matching documents for the query. In accordance with this decision, the query is then forwarded to a number of remote sites, processed on their indexes, and the 1 In this work, we assume that this partition of documents into sites is given by some external design factor. 2 This is the number of unique terms the document contains. For any set of documents D, we also define size(d) = d D size(d). 3 For the sake of simplicity, throughout the rest of this paper we assume a standard value of k = 10. different top-k result sets returned by them are then merged at the local site s i and returned to the user. 4 Replication. As the network latencies between sites might be significant, and given that the occurrence of documents in result sets usually follows a power law distribution, one might be interested in replicating a certain subset of the documents (typically, the most popular) into some of the sites, expecting that this might help reducing the amount of forwarding [3, 21]. Thus, each site s i has a replicated index that contains a certain subset R i D \D i of the documents that are not present in the local index of the site. Caching. Finally, some form of result caching might be used in order to improve response time and reduce the overall system workload. 2.2 Performance metrics Query locality. We define the query locality as the fraction of queries that are answered by the local site they are submitted to without any forwarding, given a combination of document replication, query forwarding, and result caching strategies. A higher query locality will permit a faster query response time and help decrease the overall system load. Query forwarder accuracy rates. Given a query q received at site s i, the query forwarder on that site needs to decide whether to process q on the index of each site s j (1 j m). Each of these m binary decisions might turn out to be a true positive (TP), a true negative (TP), a false positive (FP) or a false negative (FN). If the decision is not to process the query on s j, we have an FN if indeed s j indexes some best-matching document in R(q) (and that document is not locally replicated), and a TN otherwise. If, on the contrary, the decision is to process the query on s j, we have an FP if s j indexes no best-matching document, and a TP otherwise. An FN will degrade the quality of the search results, as the result set will be missing some important documents, but on the other hand, it might reduce the average response time and system workload. A FP might degrade average response time (because the user might have to wait for unnecessary query processing and round-trip time due to forwarding), as well as increment the network congestion and system workload, but it will not affect the quality of the results. We report the FP and FN rates for different forwarding strategies, as well as the overall accuracy (defined as 1 FP FN). Result quality. We aim at generating the same result set that a centralized search engine can provide. In order to measure the degradation of the result quality caused by the FN rate of the query forwarder, we use a number of standard metrics to compare the result sets returned by the distributed architecture to those of a centralized architecture. We are interested in measuring their values at different ranks p, i.e., when considering only the first p results, although for practical reasons we restrict ourselves to p {1, 5, 10}: (i) Exact match rate at rank p (EMR@p). We define the exact match rate as the fraction of all queries for which the retrieved top-k results exactly coincide with the centralized top-k results. (ii) Average overlap at rank p (overlap@p). For any query q, let D p(q) and R p(q) be the set containing the first p results returned by, respectively, a distributed and a centralized engine. We define the overlap at rank p for 4 In contrast with previous approaches, we allow the query forwarder not to process q on the local site s i, as described later.

3 that query as the fraction of results from R p(q) that are present in D p(q), i.e., D p(q) R p(q) / R p(q), 5 and we average this value over the whole query test set (see Section 3). (iii) Average normalized discounted cumulative gain at rank p (NDCG@p). Finally, we are also interested in measuring the average NDCG of the query set [19]. For the sake of simplicity, we consider that a document is relevant if and only if the document belongs to the top-k result set of the centralized architecture. 6 We normalize each query DCG value by dividing it by the maximum DCG attainable, which is that of the centralized architecture. Query response time. We approximate the distribution of query response time by simulating the processing of a full query set. System workload. We compute the workload induced by a query as the sum of the lengths of the inverted lists that need to be traversed in order to process a query. We average this measure over all the queries in a given query set. 3. EXPERIMENTAL SETUP Datasets. We simulate a distributed search engine composed of five sites, located in five different major continents. For each site, we sample consecutive queries from the corresponding search frontend of a commercial web search engine. Queries are normalized by case-folding, stop-word removal, term uniquing, and alphabetical ordering of query terms. The full query set is split roughly 75% 25% between a training set (5.25 million queries), which we use to compute the replication utility of the documents and to train our query forwarding models, and a test set (1.72 million queries), which we use to evaluate the performance metrics. The document collection contains about 200 million web pages obtained after various cleansing and filtering steps such as spam filtering. A proprietary classifier is used to assign the documents to a home country (some documents are not assigned to any of the five sites). This classifier uses features such as document language, IP address, and domain name, among others. Simulation parameters. For each search site, we build a separate index using the documents assigned to it. Each site is assumed to have computational resources proportional to its index size. The simulated response times include query processing times and network latencies. For the former, we assume a processing cost of 200 ns per posting and 20 ms of fixed preprocessing overhead per query. The top k query results are retrieved using the open-source search engine Terrier. For the latter, we considered the user-to-site and siteto-site network latencies, estimated based on the geodesic distances between users and sites. Baselines. We compare the performance of our query forwarder to the state-of-the-art forwarding model proposed in [13]. In order to facilitate the interpretation of our results, we often show the performance of an ORACLE forwarder that always knows, for any given query, in which search sites the top k best-matching documents are indexed. 5 Note that this is similar to the query recall if R p(q) were the only results considered relevant. 6 This approach is rather restrictive, in the sense that it attributes no usefulness to any result beyond the k-th one. (a) No replication Search site 1 (b) Identical replication Search site 1 (c) Individual replication - Global budget Search site 1 (d) Individual replication - budget Search site 1 Search site 2 Search site 2 Search site 3 Search site 2 Search site 2 Search site 3 Search site 3 Search site 3 Figure 1: Different document replication strategies. 4. DOCUMENT REPLICATION As previously mentioned, document replication helps increasing the query locality of a distributed search engine. Replicating too large a fraction of D, however, defeats the very purpose of distributed search engines, since the query processing times increase when the replicated indexes become too large. It is thus customary to place a replication budget b% upon the storage overhead of replicated documents. Given a budget, we aim to devise a replication strategy that selects a set of documents R i D \ D i for each site s i to build the replicated index of that site such that (i) The size of the replicated documents is below a given percentage b% of the global size of the document collection: 1 i m size(ri) b% size(d). (ii) The query locality, i.e. the fraction of queries that can be locally processed without any quality loss (on both the local and replicated indexes D i R i) is maximized. In this section, we first categorize different strategies for document replication (Section 4.1). We then propose a new replication strategy (Section 4.2), and we finally show (Section 4.3) that this strategy outperforms existing heuristics in terms of query locality if assuming the use of an ORACLE query forwarder. We leave the discussion on the joint effect of the three techniques for Section Document replication strategies Figure 1 illustrates how different document replication strategies distribute the replication budget b% among the different search sites. Figure 1(a) depicts the search engine before any document is replicated, with each site indexing a disjoint document collection, shown in a distinct color. Identical replication. A simple strategy is to replicate the same set R of documents in all the sites (Figure 1(b)). Since part of the selected documents may already exist in the local index D i of site s i (e.g., the regions inside dotted lines on the figure), only those documents in R i = R\D i are

4 used to build the replicated index. This avoids redundancies when scoring documents that already exist in the local index during query processing time. To bound the total size of replicated documents by b% of the local indexes, we just need to ensure that size(r) b % size(d). m Individual replication. Identical replication possesses two obvious drawbacks: (i) those documents best suited for replicating at site s i might not be the best for site s j, since the users at each site might have different needs, and (ii) the set R of replicated documents may contain a significant fraction of documents that already exist in the local index of a search site, thus unnecessarily consuming replication budget. A possible improvement is to let each site select its own, distinct set of documents to replicate, so that it is tailored to its particular query stream and explicitly excludes documents in its local index. We refer to this kind of strategy as individual replication, and study the two following alternatives for distributing the allowed budget: (i) Global replication budget. This option only places a global constraint to the size of all the replicated indexes, namely 1 i m size(ri) b% size(d). As shown in Figure 1(c), a site with a small local index may have a large replicated index (e.g., site 3), whereas a site with a large local index may have a small replicated index (e.g., site 1). (ii) replication budget. In a distributed engine, the storage and computation resources available in each search site are usually tailored to the size of its local index and to the amount of queries it receives. With a global replication budget, replicating a large index to a site with limited resources might force the redistribution of resources among sites or an investment in new resources. To alleviate such need, we can set an additional constraint to the global replication budget strategy, requiring that the size of the replicated index in any site s i does not surpass b% of the size of the local index (size(r i) b% size(d i), see Figure 1(d)). The choice of a particular replication type (identical, individual) and of a budgeting alternative might depend on external system design factors. We next present a generic heuristic that, given a particular type and distribution scheme, selects the documents to replicate at each site. 4.2 Document selection heuristics For a given query set Q, it has been shown that the computation of the replicated subsets R i that maximize the fraction of top-k relevant documents that are indexed locally can be reduced to the 0-1 knapsack problem [14, 21] 7, which is known to be NP-hard. We therefore propose a greedy heuristic 8 that consists in ranking the documents in descending order of their utility of replication, and selecting them progressively until the replication budget is exhausted. We next define how we estimate this replication utility. For a given site s i, let R i(q) = R(q) \ D i be the set of documents in the top-k result set of q that are not indexed in the local index of s i. We define the utility of replicating 7 The basic idea consists on assigning each document a weight equal to its size and a utility proportional to the number of top-k result sets where it appears. 8 Note that our focus is on defining a new heuristic to compute the utility of replication, not to devise a new algorithm for the 0-1 knapsack problem. The proposed heuristic can be easily applied to more complicated solutions to this problem [23, 25]. a document d D \ D i to site s i with respect to query q as { 1 if d R(q) u i,q(d) = R i (q) s(d) 0 otherwise Once we have that, we define the global utility of replicating a document d D \D i to site s i as the sum of the individual utility of each query q weighted by its frequency f i(q), i.e., u i(d) = q Q i f i(q) u i,q(d) The choice of u i(d) stems from the observations that (i) the more often a query q was issued to a site s i (i.e. the larger f i(q)), the more likely replicating a non-local document in R(q) will increase the query locality; (ii) the less non-local documents in R(q), the more likely replicating one of these non-local documents will allow q to be processed locally; and (iii) the smaller a document, the larger the utility per unit size, and the larger the overall utility for a given replication budget. With this definition of utility u i(d), we build the replicated document sets R i for each of the replication options discussed in Section 4.1 as follows: Identical replication. We compute the utility u(d) of identically replicating a document (Figure 1(b)) as the sum of the utilities of replicating it to each site, i.e., u(d) = 1 i m ui(d), and select documents in descending order of u(d) until the replication budget b % size(d) is exhausted. m Individual replication with global budget. Since a site can replicate any number of documents as long as the global replication budget is not exhausted (Figure 1(c)), we rank the pairs u i(d), s i in descending order of u i(d) and select document d with highest u i(d) to replicate in site s i until the global replication budget b% size(d) is exhausted. Individual replication with local budget. Each site requires the size of replicated documents not to surpass its local budget (Figure 1(d)), so we select the set of documents to replicate at each site separately. For every site s i we rank the documents in descending order of u i(d) and progressively select them until the local budget b% size(d i) is exhausted. 4.3 Experimental performance In this section we evaluate the performance of the proposed heuristic with respect to query locality, and in comparison with the document frequency-based heuristic in [13] for identical replication (Identical-OF) and with that in [21] for individual replication both with global (Individual-ORT- G) and local (Individual-ORT-L). In [13], the utility of replicating a document is computed as the total frequency of the queries having the document in their top-k results divided by the size of this document. In [21], the utility of replicating a document, for the objective of increasing the number of locally processed queries, is computed in the same way as in [13], but only non-local documents are considered for replication. Our heuristic in the three replication strategies are denoted as Identical-OU, Individual-OU-G and Individual- OU-L respectively, where OU stands for optimizing utility. Query locality. If no document is replicated, 44.4% queries (44.8% unique queries) can be entirely processed locally for all the five sites. Figure 2 compares our heuristic against the baseline approaches for the three replication strategies with varying replication budget b%. We observe that (i) replicating documents significantly improves the number of locally processed queries (e.g., replicating only 1% of documents

5 Fraction of locally processed queries Size of replicated index w.r.t. original index Individual-OU-L Individual-ORT-G Individual-ORT-L 0 Individual-OU-G Identical-OF Identical-OU Replication percentage 0.1 (a) All queries. Fraction of locally processed queries Individual-OU-L Individual-ORT-G Individual-ORT-L Individual-OU-G Identical-OF Identical-OU Replication percentage 32 (b) Unique queries. Figure 2: Effectiveness of document replication. Individual-OU-G Individual-ORT-G Individual-OU-L Individual-ORT-L Identical-OU Identical-OF s 1 s 2 s 3 s 4 s 5 (a) Storage overhead. Average query locality Individual-OU-G Individual-ORT-G Individual-OU-L Individual-ORT-L Identical-OU Identical-OF NoRep s 1 s 2 s 3 s 4 s 5 (b) ly processed queries. Figure 3: Load balancing (b = 8%). improves the query locality by 36% for all queries); (ii) individual replication outperforms identical replication; (iii) for individual replication, global budgeting outperforms local budgeting; (iv) our heuristic (*-OU-*) improves the query locality by 4% to 1.32% and the unique query locality by 0.73% to 2.22% compared to the two individual replication strategies. The improvement for identical replication is up to 5.88%. Interestingly, the improvement for unique queries is more significant than that for all the queries, which will be important once we add a cache layer (Section 6). Load balancing. We study the load balance between different sites by measuring the per-site storage overhead caused by the replication. This overhead is measured as the size of the replicated index divided by the size of the local index. We observe from Figure 3(a) that individual replication with global budget replicates more documents in sites having smaller local indexes (e.g., s 5), while individual replication with local budget results in a more balanced replication. 9 From Figure 3(b) we observe that sites with larger local indexes have higher query locality. This explains why individual replication with global budget results in larger query locality than individual replication with local budget. 5. QUERY FORWARDING We describe next our proposed solution for the query forwarding problem. Given m distributed search sites, we cast the problem as m 2 independent binary decision problems arising from every possible site pair. Thus, each local site s i has m available binary classifiers c i,1, c i,2,..., c i,m, where c i,j is in charge of deciding, for any query received at site s i, whether it is worth processing the query on the index 9 For easier visualization, Figure 3 shows only a fixed replication budget b = 8%, but other values of b exhibit a similar behavior. Type Category # pre-retrieval Term lengths 8 pre-retrieval Term IDFs 20 pre-retrieval Term scores 32 pre-retrieval Query language 1 pre-retrieval Query popularity 15 pre-retrieval Query performance 12 post-retrieval query score 8 post-retrieval forwarder decision 4 TOTAL 100 Table 1: Query feature categories. residing at site s j. We present the type of machine learning features we consider in Section 5.1, and we evaluate them in Section 5.2. We detail the forwarder architecture in Section 5.3, and report on its performance in Section 5.4. Interplay with document replication. Query forwarding in multi-site search engines typically determines where to forward a query according to whether a remote site is likely to index some of its best-matching documents [3]. Although in Section 4.3 we saw that individual replication outperforms identical replication, it may replicate any given document only to a strict subset of all sites. In that case, determining the optimal subset of sites to which the query should be forwarded becomes much more involved, and is beyond the classical problem we aim to solve. We therefore only report the performance of query forwarders coupled with an identical replication strategy from here on Feature extraction Given the model described above, each query q submitted to a site s i can be seen as the source of m different machine learning instances q i,1,..., q i,m, where the data instance q i,j will solely be used to train the binary classifier c i,j, and contains features about (i) the local site s i, (ii) the remote site s j, (iii) some third sites s k, where k / {i, j}, 11 and (iv) sensible combinations of the previous options. Finally, the label (i.e., our ground truth) of every query instantiation q i,j is a binary variable indicating whether site s j contains at least one document from R(q)\(D i R i), i.e., one document not indexed in any of the indexes of site s i that should be in the top-k result set returned to the user. Table 1 shows a summary of the different feature categories that we have considered. It is important to note that some of these features (pre-retrieval) can be extracted prior to the local processing of the query, whereas some others need the query to be processed (post-retrieval). We next give a brief description of these different categories. Term lengths, IDFs and scores. We extract some basic statistics about the query term count and about the lengths of the different terms in the query. We also use some perterm and per-site frequency statistics, as well as information on the maximum contribution of any term appearing in the 10 The above-mentioned problem does not exist for identical replication. Once a document is replicated, it is available in all sites and can be obtained from the local index without any query forwarding. 11 For instance, by encoding the maximum score that the lowestscoring term is able to achieve at a third site, we expect our query forwarder to be more informed than the previous thresholding models [3, 13], which by nature only take into account information from the local-remote site pair.

6 IG score IG score χ 2 score Term IDF Query scores Term scores Third site χ 2 score q Pre-retrieval classifier ML query forwarder < F, c > Yes c > C No query processor query score F Post-retrieval classifier F Figure 5: Architecture of the ML query forwarder. 0 Figure 4: Individual feature scores. site index to the score of a query containing the term. 12 We extract a number of features from that information, including, as previously mentioned, features that try to capture the possible contribution of third sites to the result set. Query language. Another feature which intuitively can be expected to capture the relevance of the indexes in different sites is the language of the query, especially given that documents are assigned to sites using language as a feature. We therefore try to predict the language of the query by using a character-based, n-gram-based predictor. Per-site query popularity. A simple set of features encoding information about whether the query belongs to the most frequent queries issued to the local site s i, for different percentile-based definitions of most frequent queries. Query performance. We also extract some features related to the query performance predictors γ 1 and γ 2 described in [18], which predict how good the results of a query will be on a given index. These predictors are based on the distribution of IDF values over the different terms of a query and can be easily computed in a pre-retrieval fashion. query score. This set contains a number features related to the scores obtained after processing the query against the local index. forwarder decision. Finally, we also extract some features based on the model decision from [13]. 5.2 Feature evaluation We investigate the importance of individual query features by computing both the Informational Gain (IG) and the chi-squared (χ 2 ) scores of each feature. IG measures the reduction of entropy when the feature value is known, whereas χ 2 measures the lack of independence between a feature and class values. Both metrics have been previously reported to perform well as feature selection methods [26]. Figure 4 shows IG and χ 2 scores averaged over the query instances arising from the different site pairs, and ordered by decreasing IG score types. The first thing to note is that the relative performance of individual features is roughly the same for both scores. As expected, the most informative features are those extracted from the formulation. A bit surprisingly, the (post-retrieval) query score features based on the local processing of the query are less informative than pre-retrieval features based on term scores and term IDFs. Note, however, that we are only measuring the performance of features considered in isolation, and therefore the informativeness of some feature combinations might not be properly captured. It is also worth noticing that features capturing information of term scores in sites other than the local-remote pair (labeled as third site features) do carry 12 As pointed out in [12], the storage overhead of keeping this site term score matrix is negligible. 0 a certain amount of information. Finally, let us note that the features showing the worst performance are those related to query popularity, on the one hand, and to query language, on the other hand. This last phenomenon is probably due to the difficulty of accurately predicting the language of the queries, given their usually short length. For performance reasons, we exclude the query language feature henceforth. 5.3 Machine-learned query forwarder The machine-learned query forwarder that we propose is depicted on Fig. 5. The forwarder is composed of two different classifiers. On one hand, we have a pre-retrieval classifier which only uses pre-retrieval features and thus can be run right upon the reception of the query. Since this classifier dismisses post-retrieval features, however, it will predictably be less accurate than the full post-retrieval classifier, which employs all the available features. In order to exploit the trade-off between result quality and efficiency, we choose to trust the set of forwarding decisions F made by the preretrieval classifier only if the confidence score c [, 1] of the classifier is high enough, i.e., higher than a given threshold C. 13 This ML architecture is thus parametrized by the confidence threshold C, and different values of C will result in query forwarders with different properties: lower thresholds will be able to trigger the forwarding process earlier, but at the cost of being less precise and thus producing both worse results and a higher chance of unnecessarily loading remote sites. Also, note that this architecture gives us the option not to process the query locally, as opposed to previous approaches. As a classification model, we choose to combine a set of 100 base decision trees through the bagging ensemble method [9]. We use the off-the-shelf implementation of both the bagging method and the fast REP classification tree provided by the WEKA package [17]. In our preliminary experiments, other machine learning approaches such as random forest classifiers showed slightly better accuracy rates, but their training time was significantly larger. Therefore, and given the large number of classifiers and the size of the datasets, we chose to use the bagging classifier. In order to assess the value of using large training sets, we analyze the accuracy of our pre-retrieval classifiers as the amount of training queries increases, in the no-replication scenario. We hold out a fraction of our training dataset for testing purposes and train the forwarders with subsets of varying size from the remaining fraction. The resulting accuracies (averaged over all different site pairs) do not attain any saturation point before reaching the maximum training set size. Increasing the number of training queries from 10K to 100K, for instance, raises the accuracy from 88.72% to 96%; using all the available queries results in a further 13 Note that this confidence score is actually the estimated probability of decision being right; hence, since we are dealing with a binary decision problem, the score always lies in the range [, 1].

7 Accuracy No replication 8% replication ML Average query locality 0.7 No replication 8% replication ML TN rate TP rate FN rate FP rate TN rate TP rate FN rate FP rate Confidence threshold (a) Accuracy Confidence threshold (b) Query locality. Figure 6: Performance of different query forwarders. Replication 0% 8% TP TN FP FN TP TN FP FN ORACLE ML ML ML Table 2: Confusion matrix values (%). boost up to 92.75% accuracy. We thus only report henceforth on query forwarders trained with the full training set. 5.4 Experimental performance We report now the accuracy rates and query locality induced by our classification strategy, and defer the discussion of other metrics to Section 7. Figure 6(a) compares the accuracy rate of the baseline and those of our ML forwarder for different confidence thresholds, in two replication scenarios (no replication and 8% replication). Table 2 further presents their true positive (TP), true negative (TN), false positive (FP) and false negative (FN) rates (ML C stands for confidence threshold C). In both replication scenarios, the accuracy of the ML forwarder significantly exceeds that of the forwarder (for any confidence threshold), due to the high FP rate of the model. Unexpectedly, in the replicated scenario, the accuracy decreases as the confidence threshold increases, i.e., as more queries are forwarded according to the the post-retrieval query forwarder. To investigate this, Figure 7 shows a graphical breakdown of the evolution of the confusion matrix values as the confidence threshold increases. The accuracy decrease in the replicated scenario turns out to be caused by an increase of the FP rate, although the FN rate actually decreases. This might be due to the higher imbalance induced by the document replication; one way of solving this is to weight false positives and false negatives non-uniformly, e.g., penalizing false negatives more than false negatives when training the classifiers. We explore this possibility and its results in Section 7. Figure 6(b), on the other hand, compares the query locality achieved by the different forwarding strategies. As expected, both the and ML forwarders are able to exploit the higher query locality allowed by the document replication. Compared to the forwarder, however, the lower FP rates of our approach ensure significant increase of query locality (10%-20% more queries are processed locally). Again, note that in the replicated scenario, the increase in the FP rate that we just mentioned has a negative impact on the query locality for high confidence thresholds Confidence score (a) No replication Confidence score (b) 8% global replication. Figure 7: Confusion matrix values. The inset details the lower section of the graph for easier visualization. 6. RESULT CACHING Result caching is an important technique for improving the performance of search engines, since the results of recurrent queries can be served from the cache with no processing on the index. A difference with centralized engines is that in multi-site engines queries are issued to the search front-ends in different locations. Each site typically maintains its own result cache built from the queries it receives. It is thus important to determine, among all the sites, where a query and its result should be cached, such that given the constraints (on freshness of results, cache memory consumption, etc.), the maximum number of successive occurrences of this query can be responded without (or with less) query forwarding. This in turn ensures fast query response time and low system workload. In this section, we present different result caching strategies for multi-site web search and report on their experimental performance. 6.1 Result caching strategies We focus on determining where to cache a query and its result given that each site keeps a different cache. We consider unlimited capacity caches, as in [1]. To keep the freshness of results upon index updates, we invalidate cache entries with a TTL-based mechanism. cache. The most straightforward result caching strategy, as considered in the existing multi-site search engines [13], is that each search site only keeps in its cache the queries issued to its front-end. When a site receives a cached query, the answer is served straight from the cache only if the age of the query in the cache is smaller than the predefined TTL value. Otherwise, the query is either processed locally or forwarded to remote sites according to the query forwarder. This strategy is simple to implement, but popular queries issued to multiple sites have to be processed independently at each site. Global cache. An effective way to reduce the number of queries to process is to replicate cache entries to all sites. Once a query enters the cache of a site, it is also sent to the rest of sites to be cached there 14. This way we ensure that every query is only processed once (until the TTL expires). Partial cache. One problem of global caching is that it may replicate cache entries in sites where they will never be used, inducing unnecessary storage and transmission costs. To 14 This can be achieved either on a per query basis or in batch mode, according to the desired trade-off between the hit rate and the transmission cost.

8 Hit rate Partial/Forward Global TTL (min) (a) Hit rate w.r.t. TTL. Network overhead (MB) Forward Partial Global s 1 s 2 s 3 s 4 s 5 Site (b) Network o/h, 2h TTL. Figure 8: Performance of different caching strategies. alleviate this, we can replicate cache entries only to the sites where the query is forwarded during the query processing. The intuition is that, since these sites are likely to index documents relevant to the query, which additionally have been selected according to their likelihood of being necessary on the site (by the document replication strategy), the query will be issued to them with higher probability. Forward cache. A variation to reduce network congestion is that, whenever a query q is forwarded from site s i to site s j during the query processing, s j caches a pointer to the site s i. If s j receives the same query later, it simply requests the results from the cache on site s i instead of processing the query. This type of cache entries are expired with the same TTL-based mechanism as before. Additionally, the queries that are processed locally are maintained in the cache as normal cache entries. 6.2 Experimental performance We evaluate in this section the performance of the different result caching strategies, especially regarding their ability to improve the query locality. Figure 8(a) shows the hit rate as the value of TTL increases. As expected, global cache consistently ensures the highest hit rate compared to other strategies, as more queries and results are maintained in the cache of each site. Taking a TTL of 2 hours as an example, the average hit rate of global cache is 3.4% higher than that of local cache, and 1.1% higher than those of partial cache and forward cache 15. Although we observe from Figure 8(b) that it incurs higher network (and storage) overhead to maintain the global cache, considering that the caches have unlimited capacity and that the transmission of cache entries between sites can be performed off-line, we choose to focus only on the performance of global cache for the rest of the paper. We further report the query locality when global caching is used along with the replication and forwarding strategies presented in Sections 4 and 5. We observe from Figure 9 that the query locality increases along with the TTL value, for all the replication amounts and the query forwarding strategies that we take into consideration. More importantly, we observe that our ML query forwarder outperforms the baselines for all TTL values. Besides, even if the difference between not using cache (i.e., TTL=0) and using cache (e.g., TTL=15 min) decreases when documents are replicated, using cache still improves the query locality by at least 9.3% for the different query forwarding strategies. 15 Query forwarding is made by the ORACLE forwarder. Average query locality 0.7 ML,c= ML,c=0.75 ML,c= TTL (min) (a) No replication. Average query locality 0.7 ML,c= ML,c=0.75 ML,c= TTL (min) (b) 8% replication. Figure 9: Impact of global cache on query locality. Replication 0% 8% c p O@p N@p E@p O@p N@p E@p Table 3: Quality metrics at different rank positions p. 7. GLOBAL RESULTS We report in this section the evaluation of result quality, query response time and system workload. We first report on these metrics independently, and then study the combined effect of our different strategies on them. 16 Regarding quality, Table 3 shows some quality metrics of our ML forwarder for three different confidence thresholds c {, 0.75, } at different rank positions p {1, 5, 10}. O@p, N@p and E@p stand for overlap@p, NDCG@p and EMR@p, respectively. The results are consistent with what we saw in the previous sections: as the confidence threshold increases, the quality improves (according to all metrics), since a higher threshold means that the post-retrieval forwarder, which has a smaller FN rate, is used more often. With respect to query response time, we analyze how it varies across different queries in Figure 10, which shows the cumulative percentage of queries that are answered below a certain time threshold. In the scenario with replication, for instance, 53.51% of the test queries are answered in less than 300ms when using the forwarder, whereas by using our ML strategy, this percentage is increased to a value in the range 62.09%-71.43%, depending on the confidence threshold. We now turn to the analysis of the trade-off between both metrics, this time accounting for the impact of the caching strategies, which is visualized in Figure 11(a). Solid markers denote a simulation without caching; hollow markers include the effect of a global cache strategy with a 2h TTL. Markers from left to right correspond to ML forwarders with increasing confidence threshold C [, 1]. As expected, the higher the quality we want to achieve, the higher the average response time. Without caching, and taking the forwarder 16 Note that caching has no impact on the result quality.

9 (Accumulated) Fraction of queries Average response time (ms) ML,c= ML,c= Response time (ms) (a) No replication. (Accumulated) Fraction of queries ML,c= ML,c= Response time (ms) (b) 8% global replication. Figure 10: Distribution of the query response time. No replication 8% replication ML Global, TTL=2h No cache Average NDCG (a) No training weights. Average response time (ms) w=5 w=0 w=2 ML w=-5 w= Average NDCG (b) Varying training weights, 8% replication, 2h TTL. Figure 11: Trade-off between result quality and time. as a baseline, our ML forwarder achieves a significant reduction of average response time in the range 20%-22% (no replication) or 15%-25% (8% replication), depending on the confidence threshold, in exchange for a small quality loss. If we focus on the scenario with replication and pick a confidence threshold of 0.7, our approach offers an improvement of 19.5% on the average response time, while the average NDCG remains as high as 0.978, and 91.65% of the queries obtain a result set identical to that of a centralized architecture. As both variables are strongly correlated, the relative improvements in system workload follow a similar pattern, which we do not show for lack of space. Regarding the impact of the cache layer, we observe that caching neutralizes the advantages of replication as far as response time is concerned, up to the point where the larger index sizes caused by the replication and the induced larger query processing time outweigh the benefits achieved by the replication itself. This might be partly due to the fact that the replication strategy is specifically tailored to the query frequencies of the original query stream, which are largely modified by the presence of the cache layer. To circumvent this problem, one should simulate a fixed caching strategy paired with a fixed TTL value and tailor the replication decisions to the resulting, post-cache query stream. We leave this open as a future line of research. Varying training weights. In Section 5.4 we hinted at an alternative, orthogonal way of exploring the quality-time trade-off that consisted on modifying the balance between false positives and false negatives by weighting them nonuniformly during the query forwarders training phase. We show the results of applying this strategy on Figure 11(b), where for easier visualization only a scenario with 8% replication, coupled with a global cache with a 2h TTL, is depicted. w = x stands for a ML forwarder such that during training phase an FP is considered x times as negative as an FN (e.g., the w = 2 stands for those forwarders where an FN is considered twice as harmful as an FP). It can be seen that this strategy further widens the flexibility that the ML forwarder offers to the system designer. In exchange for a larger quality loss, the average query response time can now be decreased to as little as 137ms (a 40% reduction with respect to the forwarder). Even more interestingly, on the other extreme we can reach a 0.99 NDCG quality and decrease the average response time to 176ms, which is almost as good as the ORACLE forwarder and means a significant 24% reduction with respect to the state-of-the-art forwarder RELATED WORK There is a rich literature on query forwarding, document replication and result caching. Herein, we provide a brief survey of related work on these topics, limiting the context to multi-site web search. A broader survey on scalability and efficiency of web search engines can be found in [10]. Query forwarding. In the multi-site web search framework, the query forwarding problem is first investigated by Baeza-Yates et al. [3]. The authors propose a simple technique for estimating the maximum score a query can achieve on each search site. The forwarding decisions are made based on the outcome of a comparison between the estimated maximum scores and the lowest score in the top k result set computed by the local search site. The authors show that, with the proposed technique, many queries can be processed solely in local search sites without sacrificing the result quality. The work of Cambazoglu et al. [10] further increases the rate of locally processed queries by a linear-programming technique. The technique exploits the co-occurrence of query terms in documents and yields tighter upper bound score estimations. In our work, we use a machine-learned query forwarding technique, which achieves further performance improvements over the linear programming model used in [13]. Finally, Baeza-Yates et al. [4] also investigate the use of machine learning techniques in a distributed search engine, although the scope of their simulation is much more limited and their focus is on two-tiered systems with high index size imbalance. Document replication. There are two prior studies focusing on document replication in multi-site web search engines [8, 21]. Brefeld et al. [8] employ machine learning techniques to select the data center(s) where a newly discovered document should be replicated. Since no past popularity information (i.e., number of views in web search results) is available for new documents, the features used by the learning model are mainly extracted from the document content (e.g., the language feature). Kayaaslan et al. [21] consider a scenario where past popularity information is available for every document. They propose three algorithmic optimizations aiming to reduce the response latency, total workload, or the loss in search quality. The replication heuristic proposed in our work achieves further improvements over the heuristic described in [21] in terms of the query locality. Besides [8] and [21], Cambazoglu et al. [13] also evaluate a sim- 17 Not shown in the figure, these response time decrease percentages are virtually the same even without the cache layer.

Improving the Efficiency of Multi-site Web Search Engines

Improving the Efficiency of Multi-site Web Search Engines ABSTRACT Guillem Francès Universitat Pompeu Fabra Barcelona, Spain guillem.frances@upf.edu B. Barla Cambazoglu Yahoo Labs Barcelona, Spain barla@yahoo-inc.com