Improving the Efficiency of Multi-site Web Search Engines

Size: px
Start display at page:

Download "Improving the Efficiency of Multi-site Web Search Engines"

Transcription

1 Improving the Efficiency of Multi-site Web Search Engines ABSTRACT A multi-site web search engine is composed of a number of search sites geographically distributed around the world. Each search site is typically responsible for crawling and indexing the web pages that are in its geographical neighborhood. A query is selectively processed on a subset of search sites that are predicted to return the best-matching results. The scalability and efficiency of multi-site web search engines have attracted a lot of research attention in the recent years. In particular, research has focused on replicating important web pages across sites, forwarding queries to relevant sites, and caching results of previous queries. Yet, these problems have only been studied in isolation, but no prior work has properly investigated the interplay between them. In this paper, we take this challenge up and conduct what we believe is the first comprehensive analysis of a full stack of techniques for efficient multi-site web search. Specifically, we propose a document replication technique that improves the query locality of the state-of-the-art approaches with various replication budget distribution strategies. We devise a machine learning approach to decide the query forwarding patterns, which achieves significantly lower false positive ratio than the state-of-the-art thresholding approach with little impact on search result quality. We propose three result caching strategies that reduce the number of forwarded queries and analyze their trade-off in storage and network overhead. Importantly, we show that the combination of the best-of-the-class techniques yields very promising search efficiency, rendering multi-site distributed web search engines an attractive alternative to centralized web search engines. Categories and Subject Descriptors H.3.3 [Information Storage Systems]: Information Retrieval Systems Keywords Distributed web search, query processing, document replication, query forwarding, result caching, efficiency. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 20XX ACM X-XXXXX-XX-X/XX/XX...$ INTRODUCTION As the vast amounts of data available in the Web and the number of users accessing it keep their steady growth, an emerging line of research has been focusing on the scalability problems faced by single-site centralized web search engines [3, 10, 13, 21]. An interesting alternative, multi-site distributed web search engines have been shown to reduce the resource consumption in query processing and thus the financial costs of search engines, while also decreasing query response time perceived by users [3]. In multi-site web search engines, the whole document collection is partitioned into a number of distributed indexes, usually based on the geographical proximity between web servers and users. The objective is to exploit the so-called query locality, i.e., the fact that a fraction of queries can get their best-matching documents in the index of the site to which they are issued, saving the need to process them on remote sites and thus reducing response latencies and query processing workloads. Multi-site web search engines raise three major research problems regarding query processing, the focus of our work. First, since each search site only indexes a subset of the documents, some queries may not be able to get all the best-matching results from the local index. A query forwarding strategy is thus necessary to determine, every time a query is received at a local site, which search sites may be indexing some best-matching documents, so that the query can be forwarded to those sites. In order to avoid unnecessary increases in query response time, it is important to forward queries only to the sites which indexed the bestmatching results. Second, forwarding queries increases the query response time due to the network latency among sites. If we replicate and index locally those documents that are frequently fetched from remote sites to serve queries, more queries will benefit from the faster query response time ensured by the query locality. The challenge here is to determine the right documents to replicate in each site in order to achieve high query locality with low storage overhead. Third, as commonly used in centralized search engines, a result cache is useful to reduce the query response time and the query processing workload. In addition to the problems faced by result caches in centralized search engines (e.g., admission, eviction, invalidation, etc.), it is also important for a multi-site search engine to determine in which sites a query should be cached. Contribution. We propose and evaluate new strategies for each of the outlined problems (document replication, query forwarding and multi-site result caching), and conduct what we believe is the first realistic simulation of the full stack of

2 strategies in a multi-site search engine: We propose a document replication technique that relies on per-document utility to select the documents to replicate on each site, improving the query locality of the state-of-the-art approaches by up to 5.88% with various replication budget distribution strategies. We propose a machine learning approach for the query forwarding problem that exploits the trade-off between result quality, on the one hand, and query response time and system workload, on the other hand. Without using any document replication or result caching techniques, our technique is able to increase the fraction of locally processed queries by up to 61% with respect to a state-of-the-art approach. We propose three multi-site result caching strategies that improve the query locality of the local cache approach by up to 3.4% and evaluate the trade-off in terms of the network and storage overheads. We conduct extensive experiments with real datasets to study the interplay among the proposed techniques and show that the combination of the best-of-the-class techniques improves the average query response time of the state-of-the-art approach by up to 24% for the same 0.99 NDCG result quality. Paper outline. The rest of the paper is organized as follows. We describe the general architecture of a multi-site web search engine in Section 2. Section 3 briefly describes the dataset and setup used in our simulations. Sections 4, 5, and 6 present the strategies we propose for document replication, query forwarding, and multi-site result caching, respectively. Section 7 combines and evaluates the best-ofthe-class techniques. Section 8 surveys the related work and Section 9 concludes the paper. 2. MULTI-SITE WEB SEARCH 2.1 Architecture A distributed web search engine is composed of a set S = {s 1, s 2,..., s m} of m of geographically distributed sites. The set D = {d 1, d 2,..., d n} of n documents is partitioned among the indexes of these m sites so that each site s i maintains a local index comprising a disjoint subset of the documents D i (D = 1 i m Di and Di Dj = ).1 We denote by size(d) the size of document d, measured in the number of postings contributed to the inverted index. 2 Additionally, each site s i serves a set Q i of queries that are issued to the site. We denote by f i(q) the number of times that a query q was issued to s i (in a given query set). Finally, we denote by R(q) D the global top-k result set for the query q (i.e., the result set returned by a centralized architecture). 3 Query forwarding. Upon the reception of a query q at some search site s i, a query forwarder is responsible for deciding which of the m sites might have documents that are among the best-matching documents for the query. In accordance with this decision, the query is then forwarded to a number of remote sites, processed on their indexes, and the 1 In this work, we assume that this partition of documents into sites is given by some external design factor. 2 This is the number of unique terms the document contains. For any set of documents D, we also define size(d) = d D size(d). 3 For the sake of simplicity, throughout the rest of this paper we assume a standard value of k = 10. different top-k result sets returned by them are then merged at the local site s i and returned to the user. 4 Replication. As the network latencies between sites might be significant, and given that the occurrence of documents in result sets usually follows a power law distribution, one might be interested in replicating a certain subset of the documents (typically, the most popular) into some of the sites, expecting that this might help reducing the amount of forwarding [3, 21]. Thus, each site s i has a replicated index that contains a certain subset R i D \D i of the documents that are not present in the local index of the site. Caching. Finally, some form of result caching might be used in order to improve response time and reduce the overall system workload. 2.2 Performance metrics Query locality. We define the query locality as the fraction of queries that are answered by the local site they are submitted to without any forwarding, given a combination of document replication, query forwarding, and result caching strategies. A higher query locality will permit a faster query response time and help decrease the overall system load. Query forwarder accuracy rates. Given a query q received at site s i, the query forwarder on that site needs to decide whether to process q on the index of each site s j (1 j m). Each of these m binary decisions might turn out to be a true positive (TP), a true negative (TP), a false positive (FP) or a false negative (FN). If the decision is not to process the query on s j, we have an FN if indeed s j indexes some best-matching document in R(q) (and that document is not locally replicated), and a TN otherwise. If, on the contrary, the decision is to process the query on s j, we have an FP if s j indexes no best-matching document, and a TP otherwise. An FN will degrade the quality of the search results, as the result set will be missing some important documents, but on the other hand, it might reduce the average response time and system workload. A FP might degrade average response time (because the user might have to wait for unnecessary query processing and round-trip time due to forwarding), as well as increment the network congestion and system workload, but it will not affect the quality of the results. We report the FP and FN rates for different forwarding strategies, as well as the overall accuracy (defined as 1 FP FN). Result quality. We aim at generating the same result set that a centralized search engine can provide. In order to measure the degradation of the result quality caused by the FN rate of the query forwarder, we use a number of standard metrics to compare the result sets returned by the distributed architecture to those of a centralized architecture. We are interested in measuring their values at different ranks p, i.e., when considering only the first p results, although for practical reasons we restrict ourselves to p {1, 5, 10}: (i) Exact match rate at rank p (EMR@p). We define the exact match rate as the fraction of all queries for which the retrieved top-k results exactly coincide with the centralized top-k results. (ii) Average overlap at rank p (overlap@p). For any query q, let D p(q) and R p(q) be the set containing the first p results returned by, respectively, a distributed and a centralized engine. We define the overlap at rank p for 4 In contrast with previous approaches, we allow the query forwarder not to process q on the local site s i, as described later.

3 that query as the fraction of results from R p(q) that are present in D p(q), i.e., D p(q) R p(q) / R p(q), 5 and we average this value over the whole query test set (see Section 3). (iii) Average normalized discounted cumulative gain at rank p (NDCG@p). Finally, we are also interested in measuring the average NDCG of the query set [19]. For the sake of simplicity, we consider that a document is relevant if and only if the document belongs to the top-k result set of the centralized architecture. 6 We normalize each query DCG value by dividing it by the maximum DCG attainable, which is that of the centralized architecture. Query response time. We approximate the distribution of query response time by simulating the processing of a full query set. System workload. We compute the workload induced by a query as the sum of the lengths of the inverted lists that need to be traversed in order to process a query. We average this measure over all the queries in a given query set. 3. EXPERIMENTAL SETUP Datasets. We simulate a distributed search engine composed of five sites, located in five different major continents. For each site, we sample consecutive queries from the corresponding search frontend of a commercial web search engine. Queries are normalized by case-folding, stop-word removal, term uniquing, and alphabetical ordering of query terms. The full query set is split roughly 75% 25% between a training set (5.25 million queries), which we use to compute the replication utility of the documents and to train our query forwarding models, and a test set (1.72 million queries), which we use to evaluate the performance metrics. The document collection contains about 200 million web pages obtained after various cleansing and filtering steps such as spam filtering. A proprietary classifier is used to assign the documents to a home country (some documents are not assigned to any of the five sites). This classifier uses features such as document language, IP address, and domain name, among others. Simulation parameters. For each search site, we build a separate index using the documents assigned to it. Each site is assumed to have computational resources proportional to its index size. The simulated response times include query processing times and network latencies. For the former, we assume a processing cost of 200 ns per posting and 20 ms of fixed preprocessing overhead per query. The top k query results are retrieved using the open-source search engine Terrier. For the latter, we considered the user-to-site and siteto-site network latencies, estimated based on the geodesic distances between users and sites. Baselines. We compare the performance of our query forwarder to the state-of-the-art forwarding model proposed in [13]. In order to facilitate the interpretation of our results, we often show the performance of an ORACLE forwarder that always knows, for any given query, in which search sites the top k best-matching documents are indexed. 5 Note that this is similar to the query recall if R p(q) were the only results considered relevant. 6 This approach is rather restrictive, in the sense that it attributes no usefulness to any result beyond the k-th one. (a) No replication Search site 1 (b) Identical replication Search site 1 (c) Individual replication - Global budget Search site 1 (d) Individual replication - budget Search site 1 Search site 2 Search site 2 Search site 3 Search site 2 Search site 2 Search site 3 Search site 3 Search site 3 Figure 1: Different document replication strategies. 4. DOCUMENT REPLICATION As previously mentioned, document replication helps increasing the query locality of a distributed search engine. Replicating too large a fraction of D, however, defeats the very purpose of distributed search engines, since the query processing times increase when the replicated indexes become too large. It is thus customary to place a replication budget b% upon the storage overhead of replicated documents. Given a budget, we aim to devise a replication strategy that selects a set of documents R i D \ D i for each site s i to build the replicated index of that site such that (i) The size of the replicated documents is below a given percentage b% of the global size of the document collection: 1 i m size(ri) b% size(d). (ii) The query locality, i.e. the fraction of queries that can be locally processed without any quality loss (on both the local and replicated indexes D i R i) is maximized. In this section, we first categorize different strategies for document replication (Section 4.1). We then propose a new replication strategy (Section 4.2), and we finally show (Section 4.3) that this strategy outperforms existing heuristics in terms of query locality if assuming the use of an ORACLE query forwarder. We leave the discussion on the joint effect of the three techniques for Section Document replication strategies Figure 1 illustrates how different document replication strategies distribute the replication budget b% among the different search sites. Figure 1(a) depicts the search engine before any document is replicated, with each site indexing a disjoint document collection, shown in a distinct color. Identical replication. A simple strategy is to replicate the same set R of documents in all the sites (Figure 1(b)). Since part of the selected documents may already exist in the local index D i of site s i (e.g., the regions inside dotted lines on the figure), only those documents in R i = R\D i are

4 used to build the replicated index. This avoids redundancies when scoring documents that already exist in the local index during query processing time. To bound the total size of replicated documents by b% of the local indexes, we just need to ensure that size(r) b % size(d). m Individual replication. Identical replication possesses two obvious drawbacks: (i) those documents best suited for replicating at site s i might not be the best for site s j, since the users at each site might have different needs, and (ii) the set R of replicated documents may contain a significant fraction of documents that already exist in the local index of a search site, thus unnecessarily consuming replication budget. A possible improvement is to let each site select its own, distinct set of documents to replicate, so that it is tailored to its particular query stream and explicitly excludes documents in its local index. We refer to this kind of strategy as individual replication, and study the two following alternatives for distributing the allowed budget: (i) Global replication budget. This option only places a global constraint to the size of all the replicated indexes, namely 1 i m size(ri) b% size(d). As shown in Figure 1(c), a site with a small local index may have a large replicated index (e.g., site 3), whereas a site with a large local index may have a small replicated index (e.g., site 1). (ii) replication budget. In a distributed engine, the storage and computation resources available in each search site are usually tailored to the size of its local index and to the amount of queries it receives. With a global replication budget, replicating a large index to a site with limited resources might force the redistribution of resources among sites or an investment in new resources. To alleviate such need, we can set an additional constraint to the global replication budget strategy, requiring that the size of the replicated index in any site s i does not surpass b% of the size of the local index (size(r i) b% size(d i), see Figure 1(d)). The choice of a particular replication type (identical, individual) and of a budgeting alternative might depend on external system design factors. We next present a generic heuristic that, given a particular type and distribution scheme, selects the documents to replicate at each site. 4.2 Document selection heuristics For a given query set Q, it has been shown that the computation of the replicated subsets R i that maximize the fraction of top-k relevant documents that are indexed locally can be reduced to the 0-1 knapsack problem [14, 21] 7, which is known to be NP-hard. We therefore propose a greedy heuristic 8 that consists in ranking the documents in descending order of their utility of replication, and selecting them progressively until the replication budget is exhausted. We next define how we estimate this replication utility. For a given site s i, let R i(q) = R(q) \ D i be the set of documents in the top-k result set of q that are not indexed in the local index of s i. We define the utility of replicating 7 The basic idea consists on assigning each document a weight equal to its size and a utility proportional to the number of top-k result sets where it appears. 8 Note that our focus is on defining a new heuristic to compute the utility of replication, not to devise a new algorithm for the 0-1 knapsack problem. The proposed heuristic can be easily applied to more complicated solutions to this problem [23, 25]. a document d D \ D i to site s i with respect to query q as { 1 if d R(q) u i,q(d) = R i (q) s(d) 0 otherwise Once we have that, we define the global utility of replicating a document d D \D i to site s i as the sum of the individual utility of each query q weighted by its frequency f i(q), i.e., u i(d) = q Q i f i(q) u i,q(d) The choice of u i(d) stems from the observations that (i) the more often a query q was issued to a site s i (i.e. the larger f i(q)), the more likely replicating a non-local document in R(q) will increase the query locality; (ii) the less non-local documents in R(q), the more likely replicating one of these non-local documents will allow q to be processed locally; and (iii) the smaller a document, the larger the utility per unit size, and the larger the overall utility for a given replication budget. With this definition of utility u i(d), we build the replicated document sets R i for each of the replication options discussed in Section 4.1 as follows: Identical replication. We compute the utility u(d) of identically replicating a document (Figure 1(b)) as the sum of the utilities of replicating it to each site, i.e., u(d) = 1 i m ui(d), and select documents in descending order of u(d) until the replication budget b % size(d) is exhausted. m Individual replication with global budget. Since a site can replicate any number of documents as long as the global replication budget is not exhausted (Figure 1(c)), we rank the pairs u i(d), s i in descending order of u i(d) and select document d with highest u i(d) to replicate in site s i until the global replication budget b% size(d) is exhausted. Individual replication with local budget. Each site requires the size of replicated documents not to surpass its local budget (Figure 1(d)), so we select the set of documents to replicate at each site separately. For every site s i we rank the documents in descending order of u i(d) and progressively select them until the local budget b% size(d i) is exhausted. 4.3 Experimental performance In this section we evaluate the performance of the proposed heuristic with respect to query locality, and in comparison with the document frequency-based heuristic in [13] for identical replication (Identical-OF) and with that in [21] for individual replication both with global (Individual-ORT- G) and local (Individual-ORT-L). In [13], the utility of replicating a document is computed as the total frequency of the queries having the document in their top-k results divided by the size of this document. In [21], the utility of replicating a document, for the objective of increasing the number of locally processed queries, is computed in the same way as in [13], but only non-local documents are considered for replication. Our heuristic in the three replication strategies are denoted as Identical-OU, Individual-OU-G and Individual- OU-L respectively, where OU stands for optimizing utility. Query locality. If no document is replicated, 44.4% queries (44.8% unique queries) can be entirely processed locally for all the five sites. Figure 2 compares our heuristic against the baseline approaches for the three replication strategies with varying replication budget b%. We observe that (i) replicating documents significantly improves the number of locally processed queries (e.g., replicating only 1% of documents

5 Fraction of locally processed queries Size of replicated index w.r.t. original index Individual-OU-L Individual-ORT-G Individual-ORT-L 0 Individual-OU-G Identical-OF Identical-OU Replication percentage 0.1 (a) All queries. Fraction of locally processed queries Individual-OU-L Individual-ORT-G Individual-ORT-L Individual-OU-G Identical-OF Identical-OU Replication percentage 32 (b) Unique queries. Figure 2: Effectiveness of document replication. Individual-OU-G Individual-ORT-G Individual-OU-L Individual-ORT-L Identical-OU Identical-OF s 1 s 2 s 3 s 4 s 5 (a) Storage overhead. Average query locality Individual-OU-G Individual-ORT-G Individual-OU-L Individual-ORT-L Identical-OU Identical-OF NoRep s 1 s 2 s 3 s 4 s 5 (b) ly processed queries. Figure 3: Load balancing (b = 8%). improves the query locality by 36% for all queries); (ii) individual replication outperforms identical replication; (iii) for individual replication, global budgeting outperforms local budgeting; (iv) our heuristic (*-OU-*) improves the query locality by 4% to 1.32% and the unique query locality by 0.73% to 2.22% compared to the two individual replication strategies. The improvement for identical replication is up to 5.88%. Interestingly, the improvement for unique queries is more significant than that for all the queries, which will be important once we add a cache layer (Section 6). Load balancing. We study the load balance between different sites by measuring the per-site storage overhead caused by the replication. This overhead is measured as the size of the replicated index divided by the size of the local index. We observe from Figure 3(a) that individual replication with global budget replicates more documents in sites having smaller local indexes (e.g., s 5), while individual replication with local budget results in a more balanced replication. 9 From Figure 3(b) we observe that sites with larger local indexes have higher query locality. This explains why individual replication with global budget results in larger query locality than individual replication with local budget. 5. QUERY FORWARDING We describe next our proposed solution for the query forwarding problem. Given m distributed search sites, we cast the problem as m 2 independent binary decision problems arising from every possible site pair. Thus, each local site s i has m available binary classifiers c i,1, c i,2,..., c i,m, where c i,j is in charge of deciding, for any query received at site s i, whether it is worth processing the query on the index 9 For easier visualization, Figure 3 shows only a fixed replication budget b = 8%, but other values of b exhibit a similar behavior. Type Category # pre-retrieval Term lengths 8 pre-retrieval Term IDFs 20 pre-retrieval Term scores 32 pre-retrieval Query language 1 pre-retrieval Query popularity 15 pre-retrieval Query performance 12 post-retrieval query score 8 post-retrieval forwarder decision 4 TOTAL 100 Table 1: Query feature categories. residing at site s j. We present the type of machine learning features we consider in Section 5.1, and we evaluate them in Section 5.2. We detail the forwarder architecture in Section 5.3, and report on its performance in Section 5.4. Interplay with document replication. Query forwarding in multi-site search engines typically determines where to forward a query according to whether a remote site is likely to index some of its best-matching documents [3]. Although in Section 4.3 we saw that individual replication outperforms identical replication, it may replicate any given document only to a strict subset of all sites. In that case, determining the optimal subset of sites to which the query should be forwarded becomes much more involved, and is beyond the classical problem we aim to solve. We therefore only report the performance of query forwarders coupled with an identical replication strategy from here on Feature extraction Given the model described above, each query q submitted to a site s i can be seen as the source of m different machine learning instances q i,1,..., q i,m, where the data instance q i,j will solely be used to train the binary classifier c i,j, and contains features about (i) the local site s i, (ii) the remote site s j, (iii) some third sites s k, where k / {i, j}, 11 and (iv) sensible combinations of the previous options. Finally, the label (i.e., our ground truth) of every query instantiation q i,j is a binary variable indicating whether site s j contains at least one document from R(q)\(D i R i), i.e., one document not indexed in any of the indexes of site s i that should be in the top-k result set returned to the user. Table 1 shows a summary of the different feature categories that we have considered. It is important to note that some of these features (pre-retrieval) can be extracted prior to the local processing of the query, whereas some others need the query to be processed (post-retrieval). We next give a brief description of these different categories. Term lengths, IDFs and scores. We extract some basic statistics about the query term count and about the lengths of the different terms in the query. We also use some perterm and per-site frequency statistics, as well as information on the maximum contribution of any term appearing in the 10 The above-mentioned problem does not exist for identical replication. Once a document is replicated, it is available in all sites and can be obtained from the local index without any query forwarding. 11 For instance, by encoding the maximum score that the lowestscoring term is able to achieve at a third site, we expect our query forwarder to be more informed than the previous thresholding models [3, 13], which by nature only take into account information from the local-remote site pair.

6 IG score IG score χ 2 score Term IDF Query scores Term scores Third site χ 2 score q Pre-retrieval classifier ML query forwarder < F, c > Yes c > C No query processor query score F Post-retrieval classifier F Figure 5: Architecture of the ML query forwarder. 0 Figure 4: Individual feature scores. site index to the score of a query containing the term. 12 We extract a number of features from that information, including, as previously mentioned, features that try to capture the possible contribution of third sites to the result set. Query language. Another feature which intuitively can be expected to capture the relevance of the indexes in different sites is the language of the query, especially given that documents are assigned to sites using language as a feature. We therefore try to predict the language of the query by using a character-based, n-gram-based predictor. Per-site query popularity. A simple set of features encoding information about whether the query belongs to the most frequent queries issued to the local site s i, for different percentile-based definitions of most frequent queries. Query performance. We also extract some features related to the query performance predictors γ 1 and γ 2 described in [18], which predict how good the results of a query will be on a given index. These predictors are based on the distribution of IDF values over the different terms of a query and can be easily computed in a pre-retrieval fashion. query score. This set contains a number features related to the scores obtained after processing the query against the local index. forwarder decision. Finally, we also extract some features based on the model decision from [13]. 5.2 Feature evaluation We investigate the importance of individual query features by computing both the Informational Gain (IG) and the chi-squared (χ 2 ) scores of each feature. IG measures the reduction of entropy when the feature value is known, whereas χ 2 measures the lack of independence between a feature and class values. Both metrics have been previously reported to perform well as feature selection methods [26]. Figure 4 shows IG and χ 2 scores averaged over the query instances arising from the different site pairs, and ordered by decreasing IG score types. The first thing to note is that the relative performance of individual features is roughly the same for both scores. As expected, the most informative features are those extracted from the formulation. A bit surprisingly, the (post-retrieval) query score features based on the local processing of the query are less informative than pre-retrieval features based on term scores and term IDFs. Note, however, that we are only measuring the performance of features considered in isolation, and therefore the informativeness of some feature combinations might not be properly captured. It is also worth noticing that features capturing information of term scores in sites other than the local-remote pair (labeled as third site features) do carry 12 As pointed out in [12], the storage overhead of keeping this site term score matrix is negligible. 0 a certain amount of information. Finally, let us note that the features showing the worst performance are those related to query popularity, on the one hand, and to query language, on the other hand. This last phenomenon is probably due to the difficulty of accurately predicting the language of the queries, given their usually short length. For performance reasons, we exclude the query language feature henceforth. 5.3 Machine-learned query forwarder The machine-learned query forwarder that we propose is depicted on Fig. 5. The forwarder is composed of two different classifiers. On one hand, we have a pre-retrieval classifier which only uses pre-retrieval features and thus can be run right upon the reception of the query. Since this classifier dismisses post-retrieval features, however, it will predictably be less accurate than the full post-retrieval classifier, which employs all the available features. In order to exploit the trade-off between result quality and efficiency, we choose to trust the set of forwarding decisions F made by the preretrieval classifier only if the confidence score c [, 1] of the classifier is high enough, i.e., higher than a given threshold C. 13 This ML architecture is thus parametrized by the confidence threshold C, and different values of C will result in query forwarders with different properties: lower thresholds will be able to trigger the forwarding process earlier, but at the cost of being less precise and thus producing both worse results and a higher chance of unnecessarily loading remote sites. Also, note that this architecture gives us the option not to process the query locally, as opposed to previous approaches. As a classification model, we choose to combine a set of 100 base decision trees through the bagging ensemble method [9]. We use the off-the-shelf implementation of both the bagging method and the fast REP classification tree provided by the WEKA package [17]. In our preliminary experiments, other machine learning approaches such as random forest classifiers showed slightly better accuracy rates, but their training time was significantly larger. Therefore, and given the large number of classifiers and the size of the datasets, we chose to use the bagging classifier. In order to assess the value of using large training sets, we analyze the accuracy of our pre-retrieval classifiers as the amount of training queries increases, in the no-replication scenario. We hold out a fraction of our training dataset for testing purposes and train the forwarders with subsets of varying size from the remaining fraction. The resulting accuracies (averaged over all different site pairs) do not attain any saturation point before reaching the maximum training set size. Increasing the number of training queries from 10K to 100K, for instance, raises the accuracy from 88.72% to 96%; using all the available queries results in a further 13 Note that this confidence score is actually the estimated probability of decision being right; hence, since we are dealing with a binary decision problem, the score always lies in the range [, 1].

7 Accuracy No replication 8% replication ML Average query locality 0.7 No replication 8% replication ML TN rate TP rate FN rate FP rate TN rate TP rate FN rate FP rate Confidence threshold (a) Accuracy Confidence threshold (b) Query locality. Figure 6: Performance of different query forwarders. Replication 0% 8% TP TN FP FN TP TN FP FN ORACLE ML ML ML Table 2: Confusion matrix values (%). boost up to 92.75% accuracy. We thus only report henceforth on query forwarders trained with the full training set. 5.4 Experimental performance We report now the accuracy rates and query locality induced by our classification strategy, and defer the discussion of other metrics to Section 7. Figure 6(a) compares the accuracy rate of the baseline and those of our ML forwarder for different confidence thresholds, in two replication scenarios (no replication and 8% replication). Table 2 further presents their true positive (TP), true negative (TN), false positive (FP) and false negative (FN) rates (ML C stands for confidence threshold C). In both replication scenarios, the accuracy of the ML forwarder significantly exceeds that of the forwarder (for any confidence threshold), due to the high FP rate of the model. Unexpectedly, in the replicated scenario, the accuracy decreases as the confidence threshold increases, i.e., as more queries are forwarded according to the the post-retrieval query forwarder. To investigate this, Figure 7 shows a graphical breakdown of the evolution of the confusion matrix values as the confidence threshold increases. The accuracy decrease in the replicated scenario turns out to be caused by an increase of the FP rate, although the FN rate actually decreases. This might be due to the higher imbalance induced by the document replication; one way of solving this is to weight false positives and false negatives non-uniformly, e.g., penalizing false negatives more than false negatives when training the classifiers. We explore this possibility and its results in Section 7. Figure 6(b), on the other hand, compares the query locality achieved by the different forwarding strategies. As expected, both the and ML forwarders are able to exploit the higher query locality allowed by the document replication. Compared to the forwarder, however, the lower FP rates of our approach ensure significant increase of query locality (10%-20% more queries are processed locally). Again, note that in the replicated scenario, the increase in the FP rate that we just mentioned has a negative impact on the query locality for high confidence thresholds Confidence score (a) No replication Confidence score (b) 8% global replication. Figure 7: Confusion matrix values. The inset details the lower section of the graph for easier visualization. 6. RESULT CACHING Result caching is an important technique for improving the performance of search engines, since the results of recurrent queries can be served from the cache with no processing on the index. A difference with centralized engines is that in multi-site engines queries are issued to the search front-ends in different locations. Each site typically maintains its own result cache built from the queries it receives. It is thus important to determine, among all the sites, where a query and its result should be cached, such that given the constraints (on freshness of results, cache memory consumption, etc.), the maximum number of successive occurrences of this query can be responded without (or with less) query forwarding. This in turn ensures fast query response time and low system workload. In this section, we present different result caching strategies for multi-site web search and report on their experimental performance. 6.1 Result caching strategies We focus on determining where to cache a query and its result given that each site keeps a different cache. We consider unlimited capacity caches, as in [1]. To keep the freshness of results upon index updates, we invalidate cache entries with a TTL-based mechanism. cache. The most straightforward result caching strategy, as considered in the existing multi-site search engines [13], is that each search site only keeps in its cache the queries issued to its front-end. When a site receives a cached query, the answer is served straight from the cache only if the age of the query in the cache is smaller than the predefined TTL value. Otherwise, the query is either processed locally or forwarded to remote sites according to the query forwarder. This strategy is simple to implement, but popular queries issued to multiple sites have to be processed independently at each site. Global cache. An effective way to reduce the number of queries to process is to replicate cache entries to all sites. Once a query enters the cache of a site, it is also sent to the rest of sites to be cached there 14. This way we ensure that every query is only processed once (until the TTL expires). Partial cache. One problem of global caching is that it may replicate cache entries in sites where they will never be used, inducing unnecessary storage and transmission costs. To 14 This can be achieved either on a per query basis or in batch mode, according to the desired trade-off between the hit rate and the transmission cost.

8 Hit rate Partial/Forward Global TTL (min) (a) Hit rate w.r.t. TTL. Network overhead (MB) Forward Partial Global s 1 s 2 s 3 s 4 s 5 Site (b) Network o/h, 2h TTL. Figure 8: Performance of different caching strategies. alleviate this, we can replicate cache entries only to the sites where the query is forwarded during the query processing. The intuition is that, since these sites are likely to index documents relevant to the query, which additionally have been selected according to their likelihood of being necessary on the site (by the document replication strategy), the query will be issued to them with higher probability. Forward cache. A variation to reduce network congestion is that, whenever a query q is forwarded from site s i to site s j during the query processing, s j caches a pointer to the site s i. If s j receives the same query later, it simply requests the results from the cache on site s i instead of processing the query. This type of cache entries are expired with the same TTL-based mechanism as before. Additionally, the queries that are processed locally are maintained in the cache as normal cache entries. 6.2 Experimental performance We evaluate in this section the performance of the different result caching strategies, especially regarding their ability to improve the query locality. Figure 8(a) shows the hit rate as the value of TTL increases. As expected, global cache consistently ensures the highest hit rate compared to other strategies, as more queries and results are maintained in the cache of each site. Taking a TTL of 2 hours as an example, the average hit rate of global cache is 3.4% higher than that of local cache, and 1.1% higher than those of partial cache and forward cache 15. Although we observe from Figure 8(b) that it incurs higher network (and storage) overhead to maintain the global cache, considering that the caches have unlimited capacity and that the transmission of cache entries between sites can be performed off-line, we choose to focus only on the performance of global cache for the rest of the paper. We further report the query locality when global caching is used along with the replication and forwarding strategies presented in Sections 4 and 5. We observe from Figure 9 that the query locality increases along with the TTL value, for all the replication amounts and the query forwarding strategies that we take into consideration. More importantly, we observe that our ML query forwarder outperforms the baselines for all TTL values. Besides, even if the difference between not using cache (i.e., TTL=0) and using cache (e.g., TTL=15 min) decreases when documents are replicated, using cache still improves the query locality by at least 9.3% for the different query forwarding strategies. 15 Query forwarding is made by the ORACLE forwarder. Average query locality 0.7 ML,c= ML,c=0.75 ML,c= TTL (min) (a) No replication. Average query locality 0.7 ML,c= ML,c=0.75 ML,c= TTL (min) (b) 8% replication. Figure 9: Impact of global cache on query locality. Replication 0% 8% c p O@p N@p E@p O@p N@p E@p Table 3: Quality metrics at different rank positions p. 7. GLOBAL RESULTS We report in this section the evaluation of result quality, query response time and system workload. We first report on these metrics independently, and then study the combined effect of our different strategies on them. 16 Regarding quality, Table 3 shows some quality metrics of our ML forwarder for three different confidence thresholds c {, 0.75, } at different rank positions p {1, 5, 10}. O@p, N@p and E@p stand for overlap@p, NDCG@p and EMR@p, respectively. The results are consistent with what we saw in the previous sections: as the confidence threshold increases, the quality improves (according to all metrics), since a higher threshold means that the post-retrieval forwarder, which has a smaller FN rate, is used more often. With respect to query response time, we analyze how it varies across different queries in Figure 10, which shows the cumulative percentage of queries that are answered below a certain time threshold. In the scenario with replication, for instance, 53.51% of the test queries are answered in less than 300ms when using the forwarder, whereas by using our ML strategy, this percentage is increased to a value in the range 62.09%-71.43%, depending on the confidence threshold. We now turn to the analysis of the trade-off between both metrics, this time accounting for the impact of the caching strategies, which is visualized in Figure 11(a). Solid markers denote a simulation without caching; hollow markers include the effect of a global cache strategy with a 2h TTL. Markers from left to right correspond to ML forwarders with increasing confidence threshold C [, 1]. As expected, the higher the quality we want to achieve, the higher the average response time. Without caching, and taking the forwarder 16 Note that caching has no impact on the result quality.

9 (Accumulated) Fraction of queries Average response time (ms) ML,c= ML,c= Response time (ms) (a) No replication. (Accumulated) Fraction of queries ML,c= ML,c= Response time (ms) (b) 8% global replication. Figure 10: Distribution of the query response time. No replication 8% replication ML Global, TTL=2h No cache Average NDCG (a) No training weights. Average response time (ms) w=5 w=0 w=2 ML w=-5 w= Average NDCG (b) Varying training weights, 8% replication, 2h TTL. Figure 11: Trade-off between result quality and time. as a baseline, our ML forwarder achieves a significant reduction of average response time in the range 20%-22% (no replication) or 15%-25% (8% replication), depending on the confidence threshold, in exchange for a small quality loss. If we focus on the scenario with replication and pick a confidence threshold of 0.7, our approach offers an improvement of 19.5% on the average response time, while the average NDCG remains as high as 0.978, and 91.65% of the queries obtain a result set identical to that of a centralized architecture. As both variables are strongly correlated, the relative improvements in system workload follow a similar pattern, which we do not show for lack of space. Regarding the impact of the cache layer, we observe that caching neutralizes the advantages of replication as far as response time is concerned, up to the point where the larger index sizes caused by the replication and the induced larger query processing time outweigh the benefits achieved by the replication itself. This might be partly due to the fact that the replication strategy is specifically tailored to the query frequencies of the original query stream, which are largely modified by the presence of the cache layer. To circumvent this problem, one should simulate a fixed caching strategy paired with a fixed TTL value and tailor the replication decisions to the resulting, post-cache query stream. We leave this open as a future line of research. Varying training weights. In Section 5.4 we hinted at an alternative, orthogonal way of exploring the quality-time trade-off that consisted on modifying the balance between false positives and false negatives by weighting them nonuniformly during the query forwarders training phase. We show the results of applying this strategy on Figure 11(b), where for easier visualization only a scenario with 8% replication, coupled with a global cache with a 2h TTL, is depicted. w = x stands for a ML forwarder such that during training phase an FP is considered x times as negative as an FN (e.g., the w = 2 stands for those forwarders where an FN is considered twice as harmful as an FP). It can be seen that this strategy further widens the flexibility that the ML forwarder offers to the system designer. In exchange for a larger quality loss, the average query response time can now be decreased to as little as 137ms (a 40% reduction with respect to the forwarder). Even more interestingly, on the other extreme we can reach a 0.99 NDCG quality and decrease the average response time to 176ms, which is almost as good as the ORACLE forwarder and means a significant 24% reduction with respect to the state-of-the-art forwarder RELATED WORK There is a rich literature on query forwarding, document replication and result caching. Herein, we provide a brief survey of related work on these topics, limiting the context to multi-site web search. A broader survey on scalability and efficiency of web search engines can be found in [10]. Query forwarding. In the multi-site web search framework, the query forwarding problem is first investigated by Baeza-Yates et al. [3]. The authors propose a simple technique for estimating the maximum score a query can achieve on each search site. The forwarding decisions are made based on the outcome of a comparison between the estimated maximum scores and the lowest score in the top k result set computed by the local search site. The authors show that, with the proposed technique, many queries can be processed solely in local search sites without sacrificing the result quality. The work of Cambazoglu et al. [10] further increases the rate of locally processed queries by a linear-programming technique. The technique exploits the co-occurrence of query terms in documents and yields tighter upper bound score estimations. In our work, we use a machine-learned query forwarding technique, which achieves further performance improvements over the linear programming model used in [13]. Finally, Baeza-Yates et al. [4] also investigate the use of machine learning techniques in a distributed search engine, although the scope of their simulation is much more limited and their focus is on two-tiered systems with high index size imbalance. Document replication. There are two prior studies focusing on document replication in multi-site web search engines [8, 21]. Brefeld et al. [8] employ machine learning techniques to select the data center(s) where a newly discovered document should be replicated. Since no past popularity information (i.e., number of views in web search results) is available for new documents, the features used by the learning model are mainly extracted from the document content (e.g., the language feature). Kayaaslan et al. [21] consider a scenario where past popularity information is available for every document. They propose three algorithmic optimizations aiming to reduce the response latency, total workload, or the loss in search quality. The replication heuristic proposed in our work achieves further improvements over the heuristic described in [21] in terms of the query locality. Besides [8] and [21], Cambazoglu et al. [13] also evaluate a sim- 17 Not shown in the figure, these response time decrease percentages are virtually the same even without the cache layer.

Improving the Efficiency of Multi-site Web Search Engines

Improving the Efficiency of Multi-site Web Search Engines Improving the Efficiency of Multi-site Web Search Engines ABSTRACT Guillem Francès Universitat Pompeu Fabra Barcelona, Spain guillem.frances@upf.edu B. Barla Cambazoglu Yahoo Labs Barcelona, Spain barla@yahoo-inc.com

More information

Improving the Efficiency of Multi-site Web Search Engines

Improving the Efficiency of Multi-site Web Search Engines Improving the Efficiency of Multi-site Web Search Engines Xiao Bai ( xbai@yahoo-inc.com) Yahoo Labs Joint work with Guillem Francès Medina, B. Barla Cambazoglu and Ricardo Baeza-Yates July 15, 2014 Web

More information

Towards a Distributed Web Search Engine

Towards a Distributed Web Search Engine Towards a Distributed Web Search Engine Ricardo Baeza-Yates Yahoo! Labs Barcelona, Spain Joint work with many people, most at Yahoo! Labs, In particular Berkant Barla Cambazoglu A Research Story Architecture

More information

Leveraging Transitive Relations for Crowdsourced Joins*

Leveraging Transitive Relations for Crowdsourced Joins* Leveraging Transitive Relations for Crowdsourced Joins* Jiannan Wang #, Guoliang Li #, Tim Kraska, Michael J. Franklin, Jianhua Feng # # Department of Computer Science, Tsinghua University, Brown University,

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

More information

Second Chance: A Hybrid Approach for Dynamic Result Caching and Prefetching in Search Engines

Second Chance: A Hybrid Approach for Dynamic Result Caching and Prefetching in Search Engines Second Chance: A Hybrid Approach for Dynamic Result Caching and Prefetching in Search Engines RIFAT OZCAN, Turgut Ozal University ISMAIL SENGOR ALTINGOVDE, Middle East Technical University B. BARLA CAMBAZOGLU,

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Predictive Indexing for Fast Search

Predictive Indexing for Fast Search Predictive Indexing for Fast Search Sharad Goel, John Langford and Alex Strehl Yahoo! Research, New York Modern Massive Data Sets (MMDS) June 25, 2008 Goel, Langford & Strehl (Yahoo! Research) Predictive

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Reduction of Periodic Broadcast Resource Requirements with Proxy Caching

Reduction of Periodic Broadcast Resource Requirements with Proxy Caching Reduction of Periodic Broadcast Resource Requirements with Proxy Caching Ewa Kusmierek and David H.C. Du Digital Technology Center and Department of Computer Science and Engineering University of Minnesota

More information

A Bagging Method using Decision Trees in the Role of Base Classifiers

A Bagging Method using Decision Trees in the Role of Base Classifiers A Bagging Method using Decision Trees in the Role of Base Classifiers Kristína Machová 1, František Barčák 2, Peter Bednár 3 1 Department of Cybernetics and Artificial Intelligence, Technical University,

More information

K- Nearest Neighbors(KNN) And Predictive Accuracy

K- Nearest Neighbors(KNN) And Predictive Accuracy Contact: mailto: Ammar@cu.edu.eg Drammarcu@gmail.com K- Nearest Neighbors(KNN) And Predictive Accuracy Dr. Ammar Mohammed Associate Professor of Computer Science ISSR, Cairo University PhD of CS ( Uni.

More information

Business Club. Decision Trees

Business Club. Decision Trees Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building

More information

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München Evaluation Measures Sebastian Pölsterl Computer Aided Medical Procedures Technische Universität München April 28, 2015 Outline 1 Classification 1. Confusion Matrix 2. Receiver operating characteristics

More information

Mobile Cloud Multimedia Services Using Enhance Blind Online Scheduling Algorithm

Mobile Cloud Multimedia Services Using Enhance Blind Online Scheduling Algorithm Mobile Cloud Multimedia Services Using Enhance Blind Online Scheduling Algorithm Saiyad Sharik Kaji Prof.M.B.Chandak WCOEM, Nagpur RBCOE. Nagpur Department of Computer Science, Nagpur University, Nagpur-441111

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information

Migration Based Page Caching Algorithm for a Hybrid Main Memory of DRAM and PRAM

Migration Based Page Caching Algorithm for a Hybrid Main Memory of DRAM and PRAM Migration Based Page Caching Algorithm for a Hybrid Main Memory of DRAM and PRAM Hyunchul Seok Daejeon, Korea hcseok@core.kaist.ac.kr Youngwoo Park Daejeon, Korea ywpark@core.kaist.ac.kr Kyu Ho Park Deajeon,

More information

Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates?

Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Model Evaluation Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Methods for Model Comparison How to

More information

Lecture 6 K- Nearest Neighbors(KNN) And Predictive Accuracy

Lecture 6 K- Nearest Neighbors(KNN) And Predictive Accuracy Lecture 6 K- Nearest Neighbors(KNN) And Predictive Accuracy Machine Learning Dr.Ammar Mohammed Nearest Neighbors Set of Stored Cases Atr1... AtrN Class A Store the training samples Use training samples

More information

CS229 Final Project: Predicting Expected Response Times

CS229 Final Project: Predicting Expected  Response Times CS229 Final Project: Predicting Expected Email Response Times Laura Cruz-Albrecht (lcruzalb), Kevin Khieu (kkhieu) December 15, 2017 1 Introduction Each day, countless emails are sent out, yet the time

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Theorem 2.9: nearest addition algorithm

Theorem 2.9: nearest addition algorithm There are severe limits on our ability to compute near-optimal tours It is NP-complete to decide whether a given undirected =(,)has a Hamiltonian cycle An approximation algorithm for the TSP can be used

More information

CS229 Lecture notes. Raphael John Lamarre Townshend

CS229 Lecture notes. Raphael John Lamarre Townshend CS229 Lecture notes Raphael John Lamarre Townshend Decision Trees We now turn our attention to decision trees, a simple yet flexible class of algorithms. We will first consider the non-linear, region-based

More information

Static Index Pruning in Web Search Engines: Combining Term and Document Popularities with Query Views

Static Index Pruning in Web Search Engines: Combining Term and Document Popularities with Query Views Static Index Pruning in Web Search Engines: Combining Term and Document Popularities with Query Views ISMAIL SENGOR ALTINGOVDE L3S Research Center, Germany RIFAT OZCAN Bilkent University, Turkey and ÖZGÜR

More information

6. Parallel Volume Rendering Algorithms

6. Parallel Volume Rendering Algorithms 6. Parallel Volume Algorithms This chapter introduces a taxonomy of parallel volume rendering algorithms. In the thesis statement we claim that parallel algorithms may be described by "... how the tasks

More information

Inverted List Caching for Topical Index Shards

Inverted List Caching for Topical Index Shards Inverted List Caching for Topical Index Shards Zhuyun Dai and Jamie Callan Language Technologies Institute, Carnegie Mellon University {zhuyund, callan}@cs.cmu.edu Abstract. Selective search is a distributed

More information

CMPSCI 646, Information Retrieval (Fall 2003)

CMPSCI 646, Information Retrieval (Fall 2003) CMPSCI 646, Information Retrieval (Fall 2003) Midterm exam solutions Problem CO (compression) 1. The problem of text classification can be described as follows. Given a set of classes, C = {C i }, where

More information

Distributed Scheduling of Recording Tasks with Interconnected Servers

Distributed Scheduling of Recording Tasks with Interconnected Servers Distributed Scheduling of Recording Tasks with Interconnected Servers Sergios Soursos 1, George D. Stamoulis 1 and Theodoros Bozios 2 1 Department of Informatics, Athens University of Economics and Business

More information

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University of Economics

More information

Sentiment analysis under temporal shift

Sentiment analysis under temporal shift Sentiment analysis under temporal shift Jan Lukes and Anders Søgaard Dpt. of Computer Science University of Copenhagen Copenhagen, Denmark smx262@alumni.ku.dk Abstract Sentiment analysis models often rely

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

Timestamp-based Result Cache Invalidation for Web Search Engines

Timestamp-based Result Cache Invalidation for Web Search Engines Timestamp-based Result Cache Invalidation for Web Search Engines Sadiye Alici Computer Engineering Dept. Bilkent University, Turkey sadiye@cs.bilkent.edu.tr B. Barla Cambazoglu Yahoo! Research Barcelona,

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

WHITE PAPER Application Performance Management. The Case for Adaptive Instrumentation in J2EE Environments

WHITE PAPER Application Performance Management. The Case for Adaptive Instrumentation in J2EE Environments WHITE PAPER Application Performance Management The Case for Adaptive Instrumentation in J2EE Environments Why Adaptive Instrumentation?... 3 Discovering Performance Problems... 3 The adaptive approach...

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Link Prediction for Social Network

Link Prediction for Social Network Link Prediction for Social Network Ning Lin Computer Science and Engineering University of California, San Diego Email: nil016@eng.ucsd.edu Abstract Friendship recommendation has become an important issue

More information

Recommender Systems (RSs)

Recommender Systems (RSs) Recommender Systems Recommender Systems (RSs) RSs are software tools providing suggestions for items to be of use to users, such as what items to buy, what music to listen to, or what online news to read

More information

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT PhD Summary DOCTORATE OF PHILOSOPHY IN COMPUTER SCIENCE & ENGINEERING By Sandip Kumar Goyal (09-PhD-052) Under the Supervision

More information

Retrieval Evaluation. Hongning Wang

Retrieval Evaluation. Hongning Wang Retrieval Evaluation Hongning Wang CS@UVa What we have learned so far Indexed corpus Crawler Ranking procedure Research attention Doc Analyzer Doc Rep (Index) Query Rep Feedback (Query) Evaluation User

More information

Scalable Trigram Backoff Language Models

Scalable Trigram Backoff Language Models Scalable Trigram Backoff Language Models Kristie Seymore Ronald Rosenfeld May 1996 CMU-CS-96-139 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 This material is based upon work

More information

Improved Brute Force Search Strategies for Single Trace and Few Traces Template Attacks on the DES Round Keys

Improved Brute Force Search Strategies for Single Trace and Few Traces Template Attacks on the DES Round Keys Improved Brute Force Search Strategies for Single Trace and Few Traces Template Attacks on the DES Round Keys Mathias Wagner, Stefan Heyse mathias.wagner@nxp.com Abstract. We present an improved search

More information

WHITE PAPER: ENTERPRISE AVAILABILITY. Introduction to Adaptive Instrumentation with Symantec Indepth for J2EE Application Performance Management

WHITE PAPER: ENTERPRISE AVAILABILITY. Introduction to Adaptive Instrumentation with Symantec Indepth for J2EE Application Performance Management WHITE PAPER: ENTERPRISE AVAILABILITY Introduction to Adaptive Instrumentation with Symantec Indepth for J2EE Application Performance Management White Paper: Enterprise Availability Introduction to Adaptive

More information

CS249: ADVANCED DATA MINING

CS249: ADVANCED DATA MINING CS249: ADVANCED DATA MINING Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu April 24, 2017 Homework 2 out Announcements Due May 3 rd (11:59pm) Course project proposal

More information

Automatically Generating Queries for Prior Art Search

Automatically Generating Queries for Prior Art Search Automatically Generating Queries for Prior Art Search Erik Graf, Leif Azzopardi, Keith van Rijsbergen University of Glasgow {graf,leif,keith}@dcs.gla.ac.uk Abstract This report outlines our participation

More information

CHAPTER 5 OPTIMAL CLUSTER-BASED RETRIEVAL

CHAPTER 5 OPTIMAL CLUSTER-BASED RETRIEVAL 85 CHAPTER 5 OPTIMAL CLUSTER-BASED RETRIEVAL 5.1 INTRODUCTION Document clustering can be applied to improve the retrieval process. Fast and high quality document clustering algorithms play an important

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 08: Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu October 24, 2017 Learnt Prediction and Classification Methods Vector Data

More information

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Alejandro Bellogín 1,2, Thaer Samar 1, Arjen P. de Vries 1, and Alan Said 1 1 Centrum Wiskunde

More information

Improving Positron Emission Tomography Imaging with Machine Learning David Fan-Chung Hsu CS 229 Fall

Improving Positron Emission Tomography Imaging with Machine Learning David Fan-Chung Hsu CS 229 Fall Improving Positron Emission Tomography Imaging with Machine Learning David Fan-Chung Hsu (fcdh@stanford.edu), CS 229 Fall 2014-15 1. Introduction and Motivation High- resolution Positron Emission Tomography

More information

CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul

CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul 1 CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul Introduction Our problem is crawling a static social graph (snapshot). Given

More information

Using Statistics for Computing Joins with MapReduce

Using Statistics for Computing Joins with MapReduce Using Statistics for Computing Joins with MapReduce Theresa Csar 1, Reinhard Pichler 1, Emanuel Sallinger 1, and Vadim Savenkov 2 1 Vienna University of Technology {csar, pichler, sallinger}@dbaituwienacat

More information

FUTURE communication networks are expected to support

FUTURE communication networks are expected to support 1146 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL 13, NO 5, OCTOBER 2005 A Scalable Approach to the Partition of QoS Requirements in Unicast and Multicast Ariel Orda, Senior Member, IEEE, and Alexander Sprintson,

More information

VOID FILL ACCURACY MEASUREMENT AND PREDICTION USING LINEAR REGRESSION VOID FILLING METHOD

VOID FILL ACCURACY MEASUREMENT AND PREDICTION USING LINEAR REGRESSION VOID FILLING METHOD VOID FILL ACCURACY MEASUREMENT AND PREDICTION USING LINEAR REGRESSION J. Harlan Yates, Mark Rahmes, Patrick Kelley, Jay Hackett Harris Corporation Government Communications Systems Division Melbourne,

More information

Evaluating Unstructured Peer-to-Peer Lookup Overlays

Evaluating Unstructured Peer-to-Peer Lookup Overlays Evaluating Unstructured Peer-to-Peer Lookup Overlays Idit Keidar EE Department, Technion Roie Melamed CS Department, Technion ABSTRACT Unstructured peer-to-peer lookup systems incur small constant overhead

More information

A Class of Submodular Functions for Document Summarization

A Class of Submodular Functions for Document Summarization A Class of Submodular Functions for Document Summarization Hui Lin, Jeff Bilmes University of Washington, Seattle Dept. of Electrical Engineering June 20, 2011 Lin and Bilmes Submodular Summarization June

More information

Search Engines Chapter 8 Evaluating Search Engines Felix Naumann

Search Engines Chapter 8 Evaluating Search Engines Felix Naumann Search Engines Chapter 8 Evaluating Search Engines 9.7.2009 Felix Naumann Evaluation 2 Evaluation is key to building effective and efficient search engines. Drives advancement of search engines When intuition

More information

Fabric Image Retrieval Using Combined Feature Set and SVM

Fabric Image Retrieval Using Combined Feature Set and SVM Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 5.258 IJCSMC,

More information

Evaluation Metrics. (Classifiers) CS229 Section Anand Avati

Evaluation Metrics. (Classifiers) CS229 Section Anand Avati Evaluation Metrics (Classifiers) CS Section Anand Avati Topics Why? Binary classifiers Metrics Rank view Thresholding Confusion Matrix Point metrics: Accuracy, Precision, Recall / Sensitivity, Specificity,

More information

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks Archana Sulebele, Usha Prabhu, William Yang (Group 29) Keywords: Link Prediction, Review Networks, Adamic/Adar,

More information

Chapter S:II. II. Search Space Representation

Chapter S:II. II. Search Space Representation Chapter S:II II. Search Space Representation Systematic Search Encoding of Problems State-Space Representation Problem-Reduction Representation Choosing a Representation S:II-1 Search Space Representation

More information

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics Helmut Berger and Dieter Merkl 2 Faculty of Information Technology, University of Technology, Sydney, NSW, Australia hberger@it.uts.edu.au

More information

Rank Measures for Ordering

Rank Measures for Ordering Rank Measures for Ordering Jin Huang and Charles X. Ling Department of Computer Science The University of Western Ontario London, Ontario, Canada N6A 5B7 email: fjhuang33, clingg@csd.uwo.ca Abstract. Many

More information

Protecting Web Servers from Web Robot Traffic

Protecting Web Servers from Web Robot Traffic Wright State University CORE Scholar Kno.e.sis Publications The Ohio Center of Excellence in Knowledge- Enabled Computing (Kno.e.sis) 11-14-2014 Protecting Web Servers from Web Robot Traffic Derek Doran

More information

Salford Systems Predictive Modeler Unsupervised Learning. Salford Systems

Salford Systems Predictive Modeler Unsupervised Learning. Salford Systems Salford Systems Predictive Modeler Unsupervised Learning Salford Systems http://www.salford-systems.com Unsupervised Learning In mainstream statistics this is typically known as cluster analysis The term

More information

Maintaining Mutual Consistency for Cached Web Objects

Maintaining Mutual Consistency for Cached Web Objects Maintaining Mutual Consistency for Cached Web Objects Bhuvan Urgaonkar, Anoop George Ninan, Mohammad Salimullah Raunak Prashant Shenoy and Krithi Ramamritham Department of Computer Science, University

More information

A Session-based Ontology Alignment Approach for Aligning Large Ontologies

A Session-based Ontology Alignment Approach for Aligning Large Ontologies Undefined 1 (2009) 1 5 1 IOS Press A Session-based Ontology Alignment Approach for Aligning Large Ontologies Editor(s): Name Surname, University, Country Solicited review(s): Name Surname, University,

More information

A semi-incremental recognition method for on-line handwritten Japanese text

A semi-incremental recognition method for on-line handwritten Japanese text 2013 12th International Conference on Document Analysis and Recognition A semi-incremental recognition method for on-line handwritten Japanese text Cuong Tuan Nguyen, Bilan Zhu and Masaki Nakagawa Department

More information

Chapter 6 Evaluation Metrics and Evaluation

Chapter 6 Evaluation Metrics and Evaluation Chapter 6 Evaluation Metrics and Evaluation The area of evaluation of information retrieval and natural language processing systems is complex. It will only be touched on in this chapter. First the scientific

More information

Project Report: "Bayesian Spam Filter"

Project Report: Bayesian  Spam Filter Humboldt-Universität zu Berlin Lehrstuhl für Maschinelles Lernen Sommersemester 2016 Maschinelles Lernen 1 Project Report: "Bayesian E-Mail Spam Filter" The Bayesians Sabine Bertram, Carolina Gumuljo,

More information

Creating a Classifier for a Focused Web Crawler

Creating a Classifier for a Focused Web Crawler Creating a Classifier for a Focused Web Crawler Nathan Moeller December 16, 2015 1 Abstract With the increasing size of the web, it can be hard to find high quality content with traditional search engines.

More information

AI32 Guide to Weka. Andrew Roberts 1st March 2005

AI32 Guide to Weka. Andrew Roberts   1st March 2005 AI32 Guide to Weka Andrew Roberts http://www.comp.leeds.ac.uk/andyr 1st March 2005 1 Introduction Weka is an excellent system for learning about machine learning techniques. Of course, it is a generic

More information

Search Evaluation. Tao Yang CS293S Slides partially based on text book [CMS] [MRS]

Search Evaluation. Tao Yang CS293S Slides partially based on text book [CMS] [MRS] Search Evaluation Tao Yang CS293S Slides partially based on text book [CMS] [MRS] Table of Content Search Engine Evaluation Metrics for relevancy Precision/recall F-measure MAP NDCG Difficulties in Evaluating

More information

Citation for published version (APA): He, J. (2011). Exploring topic structure: Coherence, diversity and relatedness

Citation for published version (APA): He, J. (2011). Exploring topic structure: Coherence, diversity and relatedness UvA-DARE (Digital Academic Repository) Exploring topic structure: Coherence, diversity and relatedness He, J. Link to publication Citation for published version (APA): He, J. (211). Exploring topic structure:

More information

Analytical Modeling of Parallel Systems. To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003.

Analytical Modeling of Parallel Systems. To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. Analytical Modeling of Parallel Systems To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. Topic Overview Sources of Overhead in Parallel Programs Performance Metrics for

More information

Combining PGMs and Discriminative Models for Upper Body Pose Detection

Combining PGMs and Discriminative Models for Upper Body Pose Detection Combining PGMs and Discriminative Models for Upper Body Pose Detection Gedas Bertasius May 30, 2014 1 Introduction In this project, I utilized probabilistic graphical models together with discriminative

More information

Weka ( )

Weka (  ) Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) The phases in which classifier s design can be divided are reflected in WEKA s Explorer structure: Data pre-processing (filtering) and representation Supervised

More information

The Comparative Study of Machine Learning Algorithms in Text Data Classification*

The Comparative Study of Machine Learning Algorithms in Text Data Classification* The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification

More information

Classification. 1 o Semestre 2007/2008

Classification. 1 o Semestre 2007/2008 Classification Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 2 3 Single-Class

More information

Computer Sciences Department

Computer Sciences Department Computer Sciences Department SIP: Speculative Insertion Policy for High Performance Caching Hongil Yoon Tan Zhang Mikko H. Lipasti Technical Report #1676 June 2010 SIP: Speculative Insertion Policy for

More information

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS Chapter 6 Indexing Results 6. INTRODUCTION The generation of inverted indexes for text databases is a computationally intensive process that requires the exclusive use of processing resources for long

More information

Chapter III.2: Basic ranking & evaluation measures

Chapter III.2: Basic ranking & evaluation measures Chapter III.2: Basic ranking & evaluation measures 1. TF-IDF and vector space model 1.1. Term frequency counting with TF-IDF 1.2. Documents and queries as vectors 2. Evaluating IR results 2.1. Evaluation

More information

Use of Synthetic Data in Testing Administrative Records Systems

Use of Synthetic Data in Testing Administrative Records Systems Use of Synthetic Data in Testing Administrative Records Systems K. Bradley Paxton and Thomas Hager ADI, LLC 200 Canal View Boulevard, Rochester, NY 14623 brad.paxton@adillc.net, tom.hager@adillc.net Executive

More information

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier Data Mining 3.2 Decision Tree Classifier Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Basic Algorithm for Decision Tree Induction Attribute Selection Measures Information Gain Gain Ratio

More information

COMP 465: Data Mining Still More on Clustering

COMP 465: Data Mining Still More on Clustering 3/4/015 Exercise COMP 465: Data Mining Still More on Clustering Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Describe each of the following

More information

Analyzing the performance of top-k retrieval algorithms. Marcus Fontoura Google, Inc

Analyzing the performance of top-k retrieval algorithms. Marcus Fontoura Google, Inc Analyzing the performance of top-k retrieval algorithms Marcus Fontoura Google, Inc This talk Largely based on the paper Evaluation Strategies for Top-k Queries over Memory-Resident Inverted Indices, VLDB

More information

Interactive Campaign Planning for Marketing Analysts

Interactive Campaign Planning for Marketing Analysts Interactive Campaign Planning for Marketing Analysts Fan Du University of Maryland College Park, MD, USA fan@cs.umd.edu Sana Malik Adobe Research San Jose, CA, USA sana.malik@adobe.com Eunyee Koh Adobe

More information

Sample questions with solutions Ekaterina Kochmar

Sample questions with solutions Ekaterina Kochmar Sample questions with solutions Ekaterina Kochmar May 27, 2017 Question 1 Suppose there is a movie rating website where User 1, User 2 and User 3 post their reviews. User 1 has written 30 positive (5-star

More information

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING Chapter 3 EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING 3.1 INTRODUCTION Generally web pages are retrieved with the help of search engines which deploy crawlers for downloading purpose. Given a query,

More information

Downloading Hidden Web Content

Downloading Hidden Web Content Downloading Hidden Web Content Alexandros Ntoulas Petros Zerfos Junghoo Cho UCLA Computer Science {ntoulas, pzerfos, cho}@cs.ucla.edu Abstract An ever-increasing amount of information on the Web today

More information

Parallel Crawlers. 1 Introduction. Junghoo Cho, Hector Garcia-Molina Stanford University {cho,

Parallel Crawlers. 1 Introduction. Junghoo Cho, Hector Garcia-Molina Stanford University {cho, Parallel Crawlers Junghoo Cho, Hector Garcia-Molina Stanford University {cho, hector}@cs.stanford.edu Abstract In this paper we study how we can design an effective parallel crawler. As the size of the

More information

modern database systems lecture 4 : information retrieval

modern database systems lecture 4 : information retrieval modern database systems lecture 4 : information retrieval Aristides Gionis Michael Mathioudakis spring 2016 in perspective structured data relational data RDBMS MySQL semi-structured data data-graph representation

More information

Hierarchical Intelligent Cuttings: A Dynamic Multi-dimensional Packet Classification Algorithm

Hierarchical Intelligent Cuttings: A Dynamic Multi-dimensional Packet Classification Algorithm 161 CHAPTER 5 Hierarchical Intelligent Cuttings: A Dynamic Multi-dimensional Packet Classification Algorithm 1 Introduction We saw in the previous chapter that real-life classifiers exhibit structure and

More information

Theme Identification in RDF Graphs

Theme Identification in RDF Graphs Theme Identification in RDF Graphs Hanane Ouksili PRiSM, Univ. Versailles St Quentin, UMR CNRS 8144, Versailles France hanane.ouksili@prism.uvsq.fr Abstract. An increasing number of RDF datasets is published

More information

CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE

CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE 143 CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE 6.1 INTRODUCTION This chapter mainly focuses on how to handle the inherent unreliability

More information

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 188 CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 6.1 INTRODUCTION Image representation schemes designed for image retrieval systems are categorized into two

More information

Chapter 5: Analytical Modelling of Parallel Programs

Chapter 5: Analytical Modelling of Parallel Programs Chapter 5: Analytical Modelling of Parallel Programs Introduction to Parallel Computing, Second Edition By Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar Contents 1. Sources of Overhead in Parallel

More information

Computationally Efficient Serial Combination of Rotation-invariant and Rotation Compensating Iris Recognition Algorithms

Computationally Efficient Serial Combination of Rotation-invariant and Rotation Compensating Iris Recognition Algorithms Computationally Efficient Serial Combination of Rotation-invariant and Rotation Compensating Iris Recognition Algorithms Andreas Uhl Department of Computer Sciences University of Salzburg, Austria uhl@cosy.sbg.ac.at

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information

On the automatic classification of app reviews

On the automatic classification of app reviews Requirements Eng (2016) 21:311 331 DOI 10.1007/s00766-016-0251-9 RE 2015 On the automatic classification of app reviews Walid Maalej 1 Zijad Kurtanović 1 Hadeer Nabil 2 Christoph Stanik 1 Walid: please

More information

Lightweight caching strategy for wireless content delivery networks

Lightweight caching strategy for wireless content delivery networks Lightweight caching strategy for wireless content delivery networks Jihoon Sung 1, June-Koo Kevin Rhee 1, and Sangsu Jung 2a) 1 Department of Electrical Engineering, KAIST 291 Daehak-ro, Yuseong-gu, Daejeon,

More information

Developing Focused Crawlers for Genre Specific Search Engines

Developing Focused Crawlers for Genre Specific Search Engines Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com

More information