Optimizing Content Freshness of Relations Extracted From the Web Using Keyword Search


Mohan Yang (Shanghai Jiao Tong University, mh.yang.sjtu@gmail.com), Haixun Wang (Microsoft Research Asia, haixunw@microsoft.com), Lipyeow Lim (University of Hawai'i at Mānoa, lipyeow@hawaii.edu), Min Wang (HP Labs China, min.wang6@hp.com)

Work done while the author was at Microsoft Research Asia.

ABSTRACT

An increasing number of applications operate on data obtained from the Web. These applications typically maintain local copies of the web data to avoid network latency in data accesses. As the data on the Web evolves, it is critical that the local copy be kept up-to-date. Data freshness is one of the most important data quality issues, and has been extensively studied for various applications including web crawling. However, web crawling is focused on obtaining as many raw web pages as possible. Our applications, on the other hand, are interested in specific content from specific data sources. Knowing the content or the semantics of the data enables us to differentiate data items based on their importance and volatility, which are key factors that impact the design of the data synchronization strategy. In this work, we formulate the concept of content freshness, and present a novel approach that maintains content freshness with the least amount of web communication. Specifically, we assume data is accessible through a general keyword search interface, and we form keyword queries based on their selectivity, as well as their contribution to the content freshness of the local copy. Experiments show the effectiveness of our approach compared with several naive methods for keeping data fresh.

Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications

General Terms: Performance

Keywords: content freshness, synchronization strategy, web crawling

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD'10, June 6-11, 2010, Indianapolis, Indiana, USA. Copyright 2010 ACM.

1. INTRODUCTION

Today, an increasing number of applications operate on data from the Web. Mashup applications are particularly good examples of such applications: they combine data from multiple external (web) sources to create a new service. In order to provide reliable, high quality, value added services, data availability and freshness are of paramount importance. The Internet provides no guarantees on connectivity and data availability; hence data availability is often achieved by maintaining a local copy of the data of interest. These applications are then faced with the problem of keeping the data in the local copy fresh.

In this paper, we study the problem of refreshing the local data maintained by these applications. The problem is particularly challenging because the Internet that sits between the local data and the data source is unreliable and has relatively high latency, the change characteristics of the data at its source are generally not observable, and most web data sources support only very limited data retrieval APIs, mainly keyword search. Keeping a local data set fresh has been studied previously in the context of web crawling.
However, the problem we face is quite different. First, the goal of web crawling is to maintain a local copy of raw web pages from millions of web sites. Commercial search engines are built upon web crawling techniques, and it can take weeks for a commercial search engine to completely refresh its entire repository. For the purpose of web search, it is often acceptable that a local copy of a web page is not updated for days or even weeks. It is true that stale pages may reduce the relevancy of web search results, but the problem is not as serious as in our case, where stale content may lead to wrong answers in applications that provide specific services (e.g., checking the availability of a product, as in ticket reservations).

Second, web crawling handles raw web pages and is agnostic to the content of the web pages. In contrast, the applications we are concerned with understand the content or the semantics of the data that needs to be kept fresh in the local copy. While such applications are much more sensitive to the staleness of the data than applications such as search engines, this also creates an opportunity for developing new mechanisms for keeping the content (not just the raw web pages) fresh. For example, knowing which data items are more volatile or more important enables the refresh algorithm to raise their priorities for synchronization. Furthermore, we can group data items together according to application semantics, and use one Web access to refresh them.

1.1 Applications

Before we formally describe the problem and the challenges, let us consider some representative application scenarios.

Online Specialty Stores. Suppose one manages a specialty store on the Web. Instead of selling his own stock of merchandise, he links products on his Web pages to major e-commerce sites such as Amazon.com, Buy.com, eBay.com, etc., and earns commissions from them. Usually, a specialty store sells products in focused categories, which can be narrow or broad. For example, one can build an online drugstore that has the most comprehensive coverage of medications by pulling data from other general purpose stores, or one can specialize in books on closely related topics, or furniture of particular styles. To maintain such a store, one needs to synchronize with the data source (e.g., Amazon.com) frequently to keep information such as price and availability up-to-date.

Paper Citations. Citations are a good measure of scientific impact, and more and more institutes take citations into consideration when they evaluate job applicants or award candidates. Suppose a web application maintains multiple lists of publications, and the publications in each list are on a particular topic (e.g., a reading list of web crawling literature), authored by a particular researcher, or by a group of researchers (e.g., faculty in Stanford's CS department). Suppose, among everything else, the application obtains the citation count of each publication from Google Scholar. The challenge is to maintain the freshness of this materialized view.

1.2 Challenges

Many applications that rely on third party data sources face challenges similar to the two examples given above. The architectures of these applications have many things in common: to ensure data availability, a web application maintains one or multiple data sets, each materializing certain content from one or more external sources on the Web. In order to serve its clients using its materialized content, the application must poll the data sources regularly to ensure the freshness of the materialized content. The challenges lie in how to poll the sources.

The first challenge is that the content we need to materialize locally is often large and dynamic. Take paper citations as an example. The application's goal is to maintain publication lists for multiple researchers, for multiple institutes, and on multiple topics in order to provide more useful services such as topic analysis. The content is also dynamic, since new papers and new citations are added constantly. Consequently, the application needs intelligent methods for materializing the large and dynamic content. Similar challenges exist for online specialty stores, as the availability and prices of products of interest also change constantly.

The second challenge is the lack of efficient data access methods on the Web. Ideally, we want to query the entities (e.g., publications and products) we are interested in as if they were structured relations. That is, we would like to view the data source as a database, so that we can issue SQL-like queries against it. However, the data sources are either not structured or, if they are, the schema of the data is usually not published. There are many reasons for not doing so, among which is cost: for example, during schema evolution, APIs must be changed, which may cause interruption of service.
The most common interface for querying an external data source is keyword search, which is the case for searching papers on Google Scholar and searching for products on Amazon.com.

The third challenge is access restrictions. Most web sites have heavy restrictions against crawling. Take Google Scholar for example. Google will direct users that generate frequent polling to a CAPTCHA test to ensure that the requests are not coming from a robot. A naive method is to limit the number of queries sent to the data source within a given period of time (e.g., limit it to, say, 10 queries in any given 5 minute period). However, this approach will limit the functionality of the services we want to provide, especially when the data sets are large.

Some of these issues are also faced by the traditional web crawling problem (usually carried out by search engines), but our problem is unique in several aspects. First, our goal is to obtain specific information for a specific set of entities (papers, products, etc.). The goal of Web crawling is to obtain as many raw pages as possible. Typically, the applications we consider have one or several targeted data sources on the Web, while web crawlers employed by commercial search engines scan millions of hosts. Recently, the topic of deep web crawling has attracted much interest. Just like general purpose web crawling, its goal is also to obtain as much information from the database beneath the surface web as possible.

Second, we are concerned about content freshness. For instance, a significant change of a product's price or a paper's citation count justifies a high priority for refreshing the information about that item. Web crawling, on the other hand, is only concerned with whether a raw page has been updated; it does not differentiate between pages, nor consider the nature of the change within a page.

Third, we have a more stringent requirement on data being fresh. For search engines, there are typically no correct answers when it comes to search results. Instead, search engines provide a list of results, hopefully ranked by their relevancy. It is acceptable that the crawled pages are a few days or a few weeks old. However, the applications we are concerned with must analyze (e.g., aggregate) the content to provide a new type of service. Thus, most of them require that the content (e.g., product availability and price) is accurate and up-to-date.

1.3 Our Approach

Our goal is to maintain the freshness of a set of entities by sending as few web requests as possible to the sites where such information is hosted. We assume that each data item has a unique ID assigned by the data source. For instance, each product on Amazon.com has a unique ID, and each paper hosted by Google Scholar has a unique ID.

Furthermore, we assume we can form a web request using an item's ID to obtain the information about the item. However, accessing information by ID is not efficient: in order to refresh the entire repository of N items, we need to send N requests. Besides being inefficient, sending many requests within a short period of time is usually unacceptable due to web access restrictions (excessive accesses of a Web site such as Google Scholar will be rejected by the Web site).

In addition to ID-based data retrieval, data sources usually provide keyword based search. For each search, items that contain the keywords in the search are returned (unless the keywords are too popular, in which case a subset of the items is returned). Thus, if the result of one search contains information about multiple entities we are interested in, then we save on the number of web accesses. This is possible because the entities we are concerned with are not a random set of entities. Instead, they usually have some commonality. For instance, publications in a list are either on a specific topic, or authored by one researcher, or by a group of researchers in the same institute, etc. For specialty stores, the seller focuses on grouping and presenting products in a better way than general retailers such as Amazon.com, which means products in a specialty store are often related. This offers an opportunity to refresh multiple items in a single Web request. In view of this, our approach focuses on finding the best set of keywords that can be used to refresh the data set in the most efficient way, i.e., with the least number of searches. In the following, we discuss two factors that affect the keyword selection process.

Keyword Selectivity. First, we should choose keywords that have high selectivity on the local data set (i.e., find the keywords that maximize the number of items in the local copy that contain the keywords). Second, we should choose keywords that have low selectivity on the server, such that most returned items are relevant to the local data set.

Refreshing Priority. Not all items are the same. There are at least three factors that may impact the priority: time, volatility, and importance. Take papers in a publication list as an example. The time factor says that a paper that was refreshed a long time ago should have high priority to be updated. Second, different papers have different volatility. For example, highly cited papers, papers on hot topics, or papers written by well known authors are more likely to attract citations, and are hence more volatile. Thus, they should have higher priority to be updated. Finally, certain items, e.g., items that many users are interested in, are more important, and should be kept fresh with higher priority.

In this paper, we introduce the concept of content freshness. Unlike previous work on maintaining data freshness, content freshness takes into consideration not only the time factor, but also factors related to items' volatility and importance. We then conceive a greedy approach: each time, we select keywords so as to maximize the improvement in content freshness.

1.4 Paper Organization

The rest of the paper is organized as follows. Section 2 compares our work with related work. Section 3 presents the mathematical formulation of the problem. Section 4 presents our greedy algorithm for query generation and our techniques for query order and frequency optimization. Section 5 presents the experimental results. The paper concludes in Section 6.

2. RELATED WORK
Cho et al. [3] did some pioneering work on designing a crawling policy to keep a web repository fresh. The goal of their work is to answer the following questions: How often should we synchronize the copy to keep, say, 80% of the copy up-to-date? How much fresher does the copy get if we synchronize it twice as often? In what order should data items be synchronized? For instance, would it be better to synchronize a data item more often when we believe that it changes more often than the other items? Cho et al. introduced a definition of data freshness, and proposed a crawling policy to increase the freshness. In particular, the crawling policy takes web page update patterns (e.g., Poisson) into consideration [3].

There are two major differences between Cho's work and ours. First, Cho et al. maintain the freshness of a local data set that corresponds to a set of URLs. Each URL potentially points to a different web site. Thus, each web request is for one URL and refreshes one element of the local data set: in other words, it is not possible to refresh multiple elements in the local copy with one Web request. In our work, the local data consists of a set of entities, which are extracted from a remote database hosted on a single Web site (e.g., paper citations based on Google Scholar) or a small number of Web sites (e.g., online specialty stores based on Amazon.com, Buy.com, etc.). Our goal is thus to maximize the number of entities retrieved using one query. The second major difference is that Cho et al.'s freshness model is based on the assumption that updates on a web site follow a Poisson process. Because Cho et al. maintain a set of URLs of different web sites, this is probably the best assumption they can make, since they do not have any knowledge of the content of the data. In our case, we maintain a set of entities. Thus, we know the semantics of the data, so we can make more reasonable assumptions. For example, consider the publication/citation graph. We can assume that the citations follow a power-law distribution. In other words, a highly cited paper is more likely to attract more citations, and hence should be updated more frequently. The same phenomenon occurs in online specialty stores. A popular item (highly ranked on a sales list) is more volatile, that is, it is more likely to go through changes (ranking, price adjustment, availability, etc.). Furthermore, a popular item, just like a highly cited paper, is more likely to be queried on the local site as well.

Sia et al. [14] studied how RSS aggregation services should monitor the data sources to retrieve new content quickly using minimal resources and to provide their subscribers with fast news alerts. Their optimal resource allocation and retrieval scheduling policy enables the RSS aggregator to provide news alerts significantly faster than previous approaches. Our work differs from theirs in two aspects. First, their requests are clustered by data source to begin with, as users subscribe to news by domain. In our work, one challenge is to cluster entities so that they can be covered by a set of queries. Second, their optimization is performed at the data source level while ours is performed at the query level.

Their work cares about how much resource should be allocated to each data source and when a specific data source should be synchronized within the allocated time slot, based on the posting pattern of the data source. Our work cares about the order in which queries should be sent and how frequently we should send queries. There is also a lot of research work on aggregation services, including [4, 10]. However, these methods passively wait for new data to come in instead of actively pulling data from the data sources.

Deep Web crawling [11, 9, 1, 2, 17, 8, 6, 7, 16] is another area that is closely related to our work. Raghavan and Garcia-Molina [11] proposed a generic operational model of a deep web crawler. They mainly focused on learning hidden web query interfaces. Ntoulas et al. [9] developed a greedy algorithm to automatically generate keyword queries. The algorithm tries to maximize the efficiency of every query, thus obtaining as many pages as possible with a relatively small number of queries. Wu et al. [17] also proposed a similar keyword query selection technique for efficient crawling of structured web sources. Barbosa et al. [1, 2] worked on a project called DeepPeep which gathered hidden web sources in different domains based on novel focused crawler techniques. Madhavan et al. [8] described a system for surfacing deep web content, i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages to a search engine index. Wang et al. [16] developed a greedy algorithm to solve a weighted set covering problem targeted at deep web crawling. However, deep web crawling focuses on the problem of discovering as many items as possible from the data source, while our work mainly focuses on the scenario of maintaining a specific set of items. Thus, the crawling strategies in the above work do not apply, as we are not interested in the irrelevant items returned by the queries. Additionally, content freshness (or result freshness) is never considered in these works; it is an important dimension of deep web crawling and the main focus of our paper.

3. PROBLEM FORMULATION

We model a web data source as a server database. Without loss of generality, consider the case where the server database contains a single relational table. Note that the table is used as the database backend of a web application: although it has a schema, it does not support relational queries from the user. Instead, all queries must go through a web interface. Thus, we assume, as far as queries are concerned, that (1) each row in the table is represented by a bag of words, and (2) the web interface only supports keyword queries on the table. Each query consists of a set of keywords, and can retrieve up to K tuples that contain all of the keywords in the query.

An application that keeps a local copy of the web data is modeled as a local database. The local copy contains a subset of the rows in the server database, and its content is kept up-to-date by querying the server database periodically using (hopefully) a small number of queries. Furthermore, the local database can only query or probe the server database via keyword queries. At each synchronization, the application needs to decide on a set of keyword queries to probe the server database in order to maximally refresh the content of its local copy. In most practical applications, the synchronization occurs at fixed intervals.
Table 1: Notations
S — Server copy of the relation.
L — Local copy of the relation.
k(q) — Keywords associated with query q.
w(r) — The weight associated with tuple r in the local relation.
R_S(k) — The set of tuples in the server relation associated with keyword k.
R_L(k) — The set of tuples in the local relation associated with keyword k.
K_L — The set of all keywords associated with the local copy of the relation.

There are two optimization objectives when choosing the set of keyword queries at each synchronization point:

1. the local relation should be kept as up-to-date as possible, i.e., as similar to the current content on the server as possible, and
2. the number of keyword queries should be as small as possible.

3.1 Optimization Objective 1

A key element in the first objective is that we need a similarity measure between the local data and the data on the server. Let r_L be a tuple in the local database L, and let r_S be the corresponding tuple in the server database S. Assume we have a function δ(r_L, r_S) that measures the dissimilarity between r_L and r_S. However, not all tuples are equal: the freshness of some tuples is more important than others. To reflect the importance, we assign a weight w(·) to each tuple. Thus, the priority of a record is given by its weighted dissimilarity:

w(r_L) · δ(r_L, r_S)

In other words, the higher the weight of a tuple, and the bigger the difference between the server and the local copy, the higher the priority to refresh the tuple. The content freshness of the entire local relation with respect to the server relation can then be computed as:

δ(L, S) = (1/|L|) Σ_{r_L ∈ L} w(r_L) · δ(r_L, r_S)

Note that a content freshness of zero means that the local database is perfectly in sync with the source. The larger the content freshness, the less fresh the local content.

Both the dissimilarity function and the weight function are application dependent. For example, for the paper citation application, the dissimilarity function can be defined as the absolute value of the difference between the number of citations in the local and the server databases, i.e.,

δ(r_L, r_S) = |r_L.Num_of_citation − r_S.Num_of_citation|

where Num_of_citation is a database column. Similarly, we can conceive a weight function which says that the more citations a paper has, the more important the paper is, i.e.:

w(r_L) = r_L.Num_of_citation

In the same spirit, we can define the importance based on the popularity of a paper's authors, the relevance of its topic, or the prestige of the conference or journal where the paper appears, etc.
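To make the weighted-dissimilarity computation concrete, here is a minimal Python sketch for the paper citation example. It is an illustration, not the paper's code: the record layout and the field name num_citations are hypothetical, and the dissimilarity and weight functions simply follow the definitions above.

```python
from dataclasses import dataclass

@dataclass
class Paper:
    paper_id: str
    num_citations: int  # hypothetical field mirroring Num_of_citation

def dissimilarity(local: Paper, server: Paper) -> float:
    # delta(r_L, r_S): absolute difference of citation counts
    return abs(local.num_citations - server.num_citations)

def weight(local: Paper) -> float:
    # w(r_L): the more citations, the more important the paper
    return local.num_citations

def content_freshness(local_copy: dict, server_copy: dict) -> float:
    # delta(L, S) = (1/|L|) * sum over r_L in L of w(r_L) * delta(r_L, r_S)
    total = sum(weight(r) * dissimilarity(r, server_copy[pid])
                for pid, r in local_copy.items())
    return total / len(local_copy)

# Example: one paper gained 5 citations on the server, the other is in sync.
local = {"p1": Paper("p1", 100), "p2": Paper("p2", 10)}
server = {"p1": Paper("p1", 105), "p2": Paper("p2", 10)}
print(content_freshness(local, server))  # (100*5 + 10*0) / 2 = 250.0
```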

As we can see, the weighted dissimilarity is the key to the concept of content freshness, as it employs the semantics of the data to decide on the refreshing strategy. This also makes our problem different from web crawling, which is mainly concerned with whether a web page has been updated or not.

Given the weighted dissimilarity, we design a synchronization strategy to keep the data set fresh. Intuitively, the local application measures the difference between the local tuples and the server tuples, and gives priority to refreshing those that have a big weighted difference when it synchronizes with the server. But there is a big catch: the local application does not know what the current r_S is. We assume r_S has changed since the last synchronization (otherwise δ(r_L, r_S) will be 0, which means there is no need to refresh it), but we do not know how much it has changed. In our approach, we estimate r_S based on (i) the content of r_L, (ii) the volatility of r_L, and (iii) the time elapsed since the last synchronization of r_L. In other words, we assume δ(r_L, r_S) can be estimated by a function F(r_L; t), or simply

δ(r_L, r_S) = F(r_L; t)

where t is the current time. In other words, we create an update model of the content. We consider r_L's volatility and previous synchronization information as features of r_L. Thus, from r_L and t we can derive r_S. This leads to our definition of content freshness.

Definition 1. The content freshness of the entire local relation at time t is defined as:

F(L; t) = (1/|L|) Σ_{r_L ∈ L} w(r_L) · F(r_L; t)    (1)

As an example, a simple deterministic update model for citations can be the following. Let t_s be the time at which r_L was last synchronized. Let ω be a parameter that reflects the volatility of r_L; for instance, assume the citation count of a paper increases by 1 every ω days. Thus, we have

F(r_L; t) = δ(r_L, r_S)
          = r_S.Num_of_citation − r_L.Num_of_citation
          = (r_L.Num_of_citation + (t − t_s)/ω) − r_L.Num_of_citation
          = (t − t_s)/ω

Certainly, this is a very naive update model. In Section 4, we introduce more realistic models where the change of the citation count is modeled by a Poisson process with a rate parameter obtained from historical data.

3.2 Optimization Objective 2

The second objective seeks to minimize the number of keyword queries necessary for synchronization. This requirement is due to a limitation of web based keyword search: for each query, the web server only returns the top K results that match the query. In other words, even if there exists a common keyword that matches every tuple in the server database, a query containing that keyword will not retrieve the entire database. More likely, it returns K tuples that the local application has little interest in. Thus, a good keyword query is one that ensures most of the returned K tuples are useful and important to the local application. By useful and important, we mean they are part of the local copy and they have high weighted dissimilarity, so that obtaining these tuples through synchronization will greatly improve the content freshness of the local data. This requires the query to contain keywords that match as many important tuples in the local copy, and as few irrelevant tuples, as possible. To make this optimization decision, the local database needs to estimate the selectivity of keyword queries on the server database. That is, given a set of keywords, we need to estimate how many tuples in the server database contain these keywords.
Then, we will avoid keywords that match a large number of tuples in the server database.

3.3 Problem Statement

We state our problem as follows: at each synchronization, find a set of queries Q such that after getting the results of Q, the content freshness of the local relation is maximally improved (that is, F(L; t) is maximally reduced). A couple of constraints on queries are necessary to fully specify the problem: (1) only relevant keywords are to be used in the queries, i.e., ∀q ∈ Q, k(q) ⊆ K_L; (2) the keywords in the queries cover the local relation, i.e., L ⊆ ∪_{q ∈ Q} R_L(k(q)); (3) the number of queries |Q| should be minimized; and (4) the number of tuples transmitted, Σ_{q ∈ Q} |R_S(k(q))|, should be minimized.

A Note on NP-Hardness. Consider a more constrained version of the problem where the weights and the dissimilarities between the local and server relations are all set to one. The problem then is to find the smallest number of queries that cover the whole local relation while the number of tuples transmitted is minimized. Let Q be the set of all possible queries (i.e., all possible combinations of keywords in K_L); then we are trying to solve the following optimization problem:

minimize   Σ_{q ∈ Q} |R_S(k(q))| · x_q
subject to Σ_{q : r ∈ R_L(k(q))} x_q ≥ 1,  ∀ r ∈ L
           x_q ∈ {0, 1},  ∀ q ∈ Q

where x_q indicates whether query q ∈ Q is selected (1 for yes, 0 for no) when the objective function is minimized. This is the covering integer program [15], which is a generalization of the set cover problem.

4. OUR APPROACH

Our approach to refreshing the content of a local relation L with respect to a server relation S is outlined in Algorithm 1. The refreshlocal algorithm contains an infinite loop of refresh events. Each refresh event, i.e., each iteration of the loop, generates a set of keyword queries using the greedyprobes algorithm, uses the generated set of keyword queries to retrieve data from the server relation, updates the local relation using the retrieved data, and sleeps for τ time units. At this point, for ease of exposition, we assume that the results of all the keyword queries are retrieved instantaneously. In Section 4.3, we remove this assumption and analyze the effect of ordering the keyword queries to obtain maximum freshness.

6 Algorithm 1 refreshlocal(l, S) Input: local relation L, information related to server relation S Output: synchronized local relation L 1: loop 2: P greedyprobes(l, S) 3: D query S using P 4: update L using results D 5: sleep for time interval τ 6: end loop Algorithm 2 GreedyProbes(L, S) Input: local relation L, information related to server relation S and statistics to compute R S (q) for query q Output: P 1: P 2: R notcovered L 3: while not stopping condition do 4: K find all keywords associated with R notcovered 5: Pick q from powerset P(K) using greedy heuristic Eqn. (2) 6: P P {q} 7: R notcovered R notcovered R L (q) 8: end while 9: return P for all the keyword queries are retrieved instantaneously. In Section 4.3, we remove this assumption and analyze the effect of ordering the keyword queries for obtaining maximum freshness. The essence of our solution lies in the greedyprobes algorithm that directly addresses the problem statement of the previous section. Given the NP-hardness of the problem, greedyprobes uses a greedy heuristic-based algorithm outlined in Algorithm 2 to find the set of keyword queries for each refresh event. The greedprobes algorithm starts with an initially empty set of query probes P and adds a query probe to this set using a greedy query efficiency heuristic (to be described in Section 4.1). The stopping condition of the while-loop can be application dependent and some examples are: (1) when some quota of keyword queries has been reached, or (2) when the current set of queries completely covers the local relation. A query probe can be a single keyword or a conjunction of keywords; hence, a query is picked from the powerset of all possible keywords associated with the local relation. Note that in practice, the power set is never materialized, because of its exponential complexity. Next, we describe the query efficiency heuristic in detail. The rest of the section will discuss how relaxing the assumption that the set of query probes are executed instantaneously will affect the change of content freshness. The paper citation scenario will be used as a running example in our analysis. 4.1 Query Efficiency Heuristic For a keyword query q, let R L (q) and R S (q) be the local coverage and server coverage of q respectively. To minimize the number of queries, the local coverage of every query should be as high as possible. On the other hand, the search result of most keyword search interfaces is partitioned into pages while each page covers only a limited number of results. So if a query has large server coverage, i.e., the results of this query is partitioned into many pages, then a lot of additional queries are needed to retrieve all the result pages from the server. To minimize the number of tuples transmitted, the server coverage of every query should be as low as possible. Intuitively, a good query should maximize the query efficiency as defined by q := arg max q R L (q) R S (q). This is similar to the tf/idf measure in IR. The term frequency (tf) represents the number of times a term occurs in a document. The inverse document frequency (idf) is a measure of the general importance of the term. A high weight of tf/idf measure is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents. As different tuples have different priorities to be refreshed, it is preferable to cover high priority tuples. 
As different tuples have different priorities to be refreshed, it is preferable to cover high-priority tuples. Let w(r) be the importance of tuple r, and let F(r; t) be the dissimilarity between r and the current version of r on the server. Then we can change the local coverage into a weighted local coverage, i.e., for a query q we use Σ_{r ∈ R_L(q)} w(r) · F(r; t) instead of |R_L(q)| (which is the special case where all w(r) and all F(r; t) are equal). Still, we want to maximize the efficiency:

q* := arg max_q ( Σ_{r ∈ R_L(q)} w(r) · F(r; t) ) / |R_S(q)|    (2)

Estimating Server Coverage. In Equation (2), the numerator can be calculated directly from the local relation L and an update model, which we introduce in Section 4.2. But to compute the denominator, we need information from the server. Usually, a keyword search interface reports the total number of results related to a query on the result page. This is one way to obtain the value of |R_S(q)|. However, this method requires sending queries to the server for every candidate q, and the query generation may need this information for every possible combination of keywords; such a huge number of queries may be blocked by the keyword search interface. So we need alternative methods to indirectly estimate the value of |R_S(q)|.

One possible approach is to use the technique developed by Li and Church [5], which estimates word associations (co-occurrences of words, i.e., the frequency with which a set of words occur together). If we regard the server relation as a corpus, answering the keyword query can be seen as calculating the word association for the words in the query. Li and Church's method samples the inverted index of the corpus and constructs a contingency table for this sample. A straightforward method for deriving the contingency table for the entire population is then to use scaling. Li and Church use a maximum likelihood estimator (MLE) that takes advantage of the margins (also known as document frequencies) to make the estimation more accurate [5]. However, in order to use Li and Church's method, we need to create the inverted index first, which means we would need to crawl the entire server.

Another possible approach is to use a similar data source to estimate the value of |R_S(q)|. Data sources from the same domain are similar both in terms of their attribute values and their attribute value distributions. Sometimes, downloadable data sources in the same domain are available, and the downloaded data can act as a good estimate of our target server data. With the help of the downloaded data, the estimation can be rather simple.
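The sketch below illustrates this second option under hypothetical helpers: surrogate_count is assumed to run the keyword query against a downloaded surrogate corpus (e.g., a DBLP dump), and the scaling factor c is calibrated from a handful of probe queries whose true server result counts are known.

```python
def calibrate_scale_factor(probe_queries, server_count, surrogate_count):
    """Estimate c ~= |R_S(q)| / |R_surrogate(q)| from a few probe queries.

    server_count(q)    : result count reported by the server for query q
    surrogate_count(q) : result count on the downloaded surrogate data set
    """
    ratios = [server_count(q) / max(surrogate_count(q), 1) for q in probe_queries]
    return sum(ratios) / len(ratios)

def estimate_server_coverage(query, surrogate_count, c):
    # |R_S(q)| is approximated as c * |R_surrogate(q)|
    return c * surrogate_count(query)

# Usage sketch (the helper functions and counts are assumptions, not real APIs):
# c = calibrate_scale_factor(probe_queries, gscholar_count, dblp_count)
# est = estimate_server_coverage(("web", "crawling"), dblp_count, c)
```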

For example, if we need the |R_S(q)| value for computer science papers on Google Scholar, we can use the data from DBLP (The DBLP Computer Science Bibliography, informatik.uni-trier.de/~ley/db) as an estimate. DBLP includes more than one million articles on computer science. Let R_DBLP(q) be the number of items related to query q in the DBLP data dump; then the value of |R_S(q)| can be estimated as R_DBLP(q) · c, where c is a scaling factor representing the volume difference between the two data sources. c can be estimated by comparing the result counts of several queries on the two data sources.

4.2 Poisson Analysis

In Section 4.1, we introduced the greedyprobes algorithm to find a set of queries that cover the local dataset. Suppose the set of queries found is P = {q_1, q_2, ..., q_k}, and R_L(q_1), R_L(q_2), ..., R_L(q_k) denote the corresponding results of the k queries. In many practical applications, the number of queries, k, can be quite large. For instance, a website that keeps track of thousands of products sold on Amazon.com may need hundreds if not thousands of queries to synchronize its list. Previously, we have assumed that these k queries can be executed and their results obtained from the server database instantaneously. When k is large, this assumption is not valid. Moreover, the server may not even allow such a large number of queries from a single client in a single refresh event. Thus, we need to spread out the queries over time and try to maximize the content freshness with less frequent queries to the server.

One of the simplest models is to issue one query at fixed time intervals. Let I be the fixed time interval. At each time slot, we issue exactly one query, and thus we finish one iteration of the synchronization loop in the refreshlocal algorithm within a time span of kI + τ time units. Two questions immediately arise with this model: first, in what order should the queries be issued, and second, how does the fixed time interval I impact the content freshness of the dataset? To answer these two questions, we introduce a theoretical content freshness measure and a Poisson update model of the source data.

Definition of F(e; t). Suppose we measure the content freshness of an item e at time t as follows:

F(e; t) = 0                    if e is up-to-date at time t
F(e; t) = Σ_{t_i ∈ U} (t − t_i)  otherwise

where U is the set of time points at which an update occurred on the server (e changed value) since the last synchronization. Then the content freshness of the local relation L at time t is

F(L; t) = (1/|L|) Σ_{e ∈ L} w(e) F(e; t)    (3)

Fig. 1 shows an example illustrating this definition of content freshness. The content freshness of e stays at zero before the first update. After that, the content freshness grows at a speed of one until the first synchronization. After the first synchronization, the content freshness drops to zero and remains zero until the next update. The content freshness keeps increasing from the third update: since no synchronization has occurred since the third update, the content freshness grows at a speed of two after the fourth update and at a speed of three after the fifth update. If the times of the third, fourth and fifth updates are 5, 6 and 7 respectively, then the content freshness of e at times 6, 7 and 8 is shown in Table 2.

Figure 1: An Example of the Time Evolution of Content Freshness (F(e; t) over time, with markers for when the element is updated and when it is synchronized).

Table 2: F(e; t) of e at time 6, 7 and 8
time | content freshness
6    | (6 − 5) = 1
7    | (7 − 5) + (7 − 6) = 3
8    | (8 − 5) + (8 − 6) + (8 − 7) = 6
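The following few lines (a sketch, not from the paper) compute F(e; t) directly from its definition and reproduce the Table 2 values for updates at times 5, 6 and 7 with no intervening synchronization.

```python
def item_freshness(t, update_times, last_sync):
    # F(e; t) = sum of (t - t_i) over the updates t_i since the last synchronization
    return sum(t - ti for ti in update_times if last_sync < ti <= t)

updates = [5, 6, 7]
for t in (6, 7, 8):
    print(t, item_freshness(t, updates, last_sync=0))  # -> 1, 3, 6
```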
Poisson Update Model. Suppose we model the update characteristics of a source data element (e.g., a paper's citation count) using a Poisson process. (Cho et al. [3] verified that changes of Web pages follow a Poisson process; as the information of the source data elements is extracted from Web pages, their update characteristics should also follow a Poisson process.) For an item e, the parameter of the Poisson process is λ, which represents the number of update events per time unit. A Poisson process implies that update events do not occur simultaneously; in the paper citation scenario, this means that the citation count can only increase by one at each update event. Let N(t) be the number of update events within a time period of t. Then, based on the Poisson assumption, we have

P(N(s + t) − N(s) = k) = ((λt)^k / k!) e^{−λt}
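As a quick numerical check of this model (a simulation sketch, not part of the paper), one can draw Poisson update events and average F(e; t) over many runs; the empirical mean lands close to (1/2)λt², the approximation derived next.

```python
import random

def simulate_expected_freshness(lam, t, runs=20000, seed=7):
    """Monte Carlo estimate of E[F(e;t)] under Poisson(lam) updates since the last sync."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        x = rng.expovariate(lam)  # first inter-arrival time
        while x <= t:
            total += t - x        # each update at time x contributes (t - x)
            x += rng.expovariate(lam)
    return total / runs

lam, t = 0.5, 4.0
print(simulate_expected_freshness(lam, t))  # close to 0.5 * lam * t**2 = 4.0
```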

Next, we derive the expected content freshness at time t under this update model. Let the sequence of Poisson update event times be τ_1, τ_2, ..., τ_k, .... At time x, we have

P(τ_k = x) = lim_{dt→0} P(N(x + dt) ≥ k, N(x) = k − 1) / dt
           = lim_{dt→0} (1 − P(N(dt) = 0)) / dt · P(N(x) = k − 1)
           = lim_{dt→0} (1 − (1 − λ dt + O(dt²))) / dt · ((λx)^{k−1} / (k − 1)!) e^{−λx}
           = λ ((λx)^{k−1} / (k − 1)!) e^{−λx}

Hence, the contribution of the k-th update to the expectation of content freshness is

∫_0^t (t − x) λ ((λx)^{k−1} / (k − 1)!) e^{−λx} dx

Adding up the above expression for k = 1, 2, ... gives

Σ_{k=1}^{∞} ∫_0^t (t − x) λ ((λx)^{k−1} / (k − 1)!) e^{−λx} dx = (1/2) λt²

However, this expression does not guarantee the order τ_1 < τ_2 < ... < τ_k < ..., so some unsatisfactory situations should be excluded from this value, i.e.,

E[F(e; t)] < (1/2) λt²

On the other hand, E[F(e; t)] must be bigger than the expected contribution of τ_1, so

E[F(e; t)] > ∫_0^t (t − x) λ e^{−λx} dx = t (1 − (1 − e^{−λt}) / (λt)) ≈ (1/2) λt²

Combining the upper and lower bounds, we use E[F(e; t)] ≈ (1/2) λt² as the expression for the expectation of content freshness at time t since time 0.

4.3 Query Order Optimization

Let g_i be the group of data elements covered by issuing query q_i, i.e., we use g_i as a shorthand for R_L(q_i). After issuing a query q_i, the elements in the group g_i are synchronized together. For our analysis, we assume g_i ∩ g_j = ∅ for i ≠ j (if the groups are not disjoint, we can artificially remove duplicates from them). Let x_i denote the time slot at which g_i was last synchronized; x_i satisfies the constraints x_i ∈ {0, 1, 2, ..., k} for i = 1, 2, ..., k, and x_i ≠ x_j if i ≠ j. Then, if we send one query per time unit I, the expected content freshness of L at time t = kI is

E[F(L; kI)] = (1/|L|) Σ_{i=1}^{k} Σ_{e_j ∈ g_i} w(e_j) E[F(e_j; (k − x_i) I)]
            = (1/|L|) Σ_{i=1}^{k} Σ_{e_j ∈ g_i} (1/2) w(e_j) λ_j (k − x_i)² I²

Let a_i be a shorthand for (1/(2|L|)) Σ_{e_j ∈ g_i} w(e_j) λ_j I², which, for fixed I and fixed local copy L, is a positive constant. We have

E[F(L; kI)] = Σ_{i=1}^{k} a_i (k − x_i)²    (4)

To achieve the smallest content freshness of L, we want to minimize Eqn. (4). Clearly, the order of the queries, which is embodied by the vector of x_i, has an impact on the content freshness. In Table 3, we use an example to demonstrate the relationship between the ordering and the content freshness. Assume (i) the local dataset maintains 4 items, L = {e_1, e_2, e_3, e_4}; (ii) every item has the same weight, that is, w(e_i) = 1; (iii) each item is updated by a Poisson process, with parameters (λ_i) = (1, 2, 3, 4); and (iv) e_1 and e_2 are covered by query q_1, while e_3 and e_4 are covered by query q_2.

Table 3: Content freshness of L for k = 2, I = 1, (λ_i) = (1, 2, 3, 4), w(e_i) = 1
          x_1 = 0   x_1 = 1   x_1 = 2
x_2 = 0   N/A       3.875     3.5
x_2 = 1   2.375     N/A       0.875
x_2 = 2   1.5       0.375     N/A

For I = 1, we have a_1 = 0.375 and a_2 = 0.875, and Eqn. (4) becomes 0.375(2 − x_1)² + 0.875(2 − x_2)². We need to decide the order of the queries, i.e., the exact values of the x_i, so that Eqn. (4) is minimized. As k = 2, we can enumerate the values of x_i; the result is shown in Table 3. We see that 0.375 is the minimum, which means we should send q_1 first and then q_2. In this case, a_1 < a_2, and Eqn. (4) is minimized when q_1 is sent before q_2. This indicates that the order in which the queries are sent and the order of the a_i's should be the same in order to minimize Eqn. (4). We have the following greedy algorithm.

Greedy Query Order Optimization. Sort the a_i's in Eqn. (4) in ascending order, that is, a_{p_1} < a_{p_2} < ... < a_{p_k}, where {p_1, ..., p_k} = {1, ..., k}. Then x_{p_i} = i for i = 1, 2, ..., k, that is, sending query q_{p_i} at time iI, minimizes Eqn. (4). The proof is simple: according to the rearrangement inequality, Eqn. (4) is minimal when (k − x_{p_1})² > (k − x_{p_2})² > ... > (k − x_{p_k})², i.e., x_{p_1} < x_{p_2} < ... < x_{p_k}.
Since we can send k queries, every x_i is positive when Eqn. (4) is minimized; thus x_{p_i} = i for i = 1, 2, ..., k. After all queries have been sent in the order specified by the vector x, the algorithm sleeps for τ time units and another loop starts from the first query in the list. Each loop lasts for kI + τ time units. Sometimes the number of queries we can issue in each synchronization loop is low; then, some elements will not be synchronized within a loop. In the subsequent loops, the content freshness of the elements left unsynchronized in previous loops becomes significantly bigger, because they have large values of t in (1/2)λt². So these unsynchronized elements will eventually get synchronized, as doing so leads to a great decrease in the content freshness of the local relation L.
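A compact sketch of the greedy query order optimization follows (the data structures are hypothetical; groups maps each query to the (w, λ) pairs of the elements it covers). It computes a_i for each group, sorts in ascending order, and assigns time slots accordingly.

```python
def order_queries(groups, interval, n_local):
    """Assign time slots so that queries with larger a_i are sent later.

    groups   : dict mapping query id -> list of (weight, lam) pairs it covers
    interval : the fixed time interval I between queries
    n_local  : |L|, the size of the local relation
    Returns a list of (query id, slot) pairs; slot i means "send at time i*I".
    """
    def a(members):
        # a_i = (1 / (2*|L|)) * sum_j w(e_j) * lambda_j * I^2
        return sum(w * lam for w, lam in members) * interval ** 2 / (2 * n_local)

    ordered = sorted(groups, key=lambda qid: a(groups[qid]))  # ascending a_i
    return [(qid, slot) for slot, qid in enumerate(ordered, start=1)]

# The Table 3 example: q1 covers items with lambda 1, 2 and q2 covers 3, 4 (all weights 1).
groups = {"q1": [(1, 1), (1, 2)], "q2": [(1, 3), (1, 4)]}
print(order_queries(groups, interval=1, n_local=4))  # [('q1', 1), ('q2', 2)]
```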

4.4 Query Frequency Optimization

We send a query every I time units, so k queries require kI time units. If we do not send these queries, the content freshness of L will grow to the order of kI, assuming the content freshness of L increases as a linear function of time. Our target is to ensure that the content freshness of L does not go beyond αkI, where α is a threshold, i.e., E[F(L; kI)] < αkI. For fixed L, E[F(L; kI)] is a function of x = (x_1, ..., x_k) and I. Let f(x, I) = E[F(L; kI)], x̄ := arg min_x f(x, I), and g(x, I) = f(x, I) / (kI); then the target becomes g(x̄, I) < α, where g is given as

g(x, I) = (1/(2k|L|)) Σ_{i=1}^{k} Σ_{e_j ∈ g_i} w(e_j) λ_j (k − x_i)² I

g(x, I) is an increasing function of I for fixed x and L. If I_1 > I_2 > 0, with x̄_1 := arg min_x f(x, I_1) and x̄_2 := arg min_x f(x, I_2), we have

g(x̄_1, I_1) > g(x̄_1, I_2)          (5)
            = f(x̄_1, I_2) / (kI_2)   (6)
            ≥ f(x̄_2, I_2) / (kI_2)   (7)
            = g(x̄_2, I_2)            (8)

(5) holds because g is an increasing function of I; (7) is based on the definition of x̄_2; (6) and (8) both follow from the definition of g.

We start with an initial value I_0 (a simple choice is 1), and calculate the corresponding x̄_0 and the value of g(x̄_0, I_0). If the value is smaller than α, we try 2¹I_0, 2²I_0, ..., 2^s I_0 until g(x̄_s, 2^s I_0) is bigger than α; otherwise we try 2^{−1}I_0, 2^{−2}I_0, ..., 2^{−s}I_0. After this process, we have

g(x̄_{s−1}, 2^{s−1} I_0) ≤ α ≤ g(x̄_s, 2^s I_0), s ∈ Z

Then we do a binary search in the interval [2^{s−1} I_0, 2^s I_0] to get the final value of I.

5. EXPERIMENTS

In this section we describe the implementation of our prototype and present an experimental evaluation of our greedy synchronization method.

5.1 Settings

Methods. Our implementation considers all possible keyword subsets up to a fixed size. We found empirically that considering all subsets up to size two is already sufficient for good performance. A keyword query often returns multiple pages of ranked results. We use two methods to extract results from them.

Method 1: Scan through all result pages to extract as many tuples in the local relation as possible.
Method 2: Scan only the first page of results.

Data sets. We present experimental results for two application scenarios: paper citations and on-line DVD stores. For the paper citation scenario, we used the real citation data from Google Scholar as the server database. To estimate the server coverage in the query efficiency heuristic, we used statistics extracted from DBLP. We also generated synthetic citation update data for the experiments measuring content freshness, since it is not possible to obtain update histories from the server; the data generation procedure is described in Sec. 5.2. For the on-line DVD stores, we used real DVD pricing data from Amazon.com as our server database and statistics extracted from the Internet Movie Database (IMDB) to estimate the query efficiency heuristic.

Figure 2: Content Freshness of Different Methods with the Same Query Frequency (content freshness over time for the Naive 1, Naive 2 and Greedy methods).

Performance Measures. We use three different measures in our experiments. Content freshness measures the up-to-dateness of the local database with respect to a server database. Since the update characteristics of the data elements on the server database cannot be observed, we use the formula with the Poisson approximation in Eqn. (3). Coverage ratio measures the expected fraction of the tuples in the local relation covered by a set of queries Q:

Coverage Ratio = Σ_{q ∈ Q} E_coverage(q) / |L|

E_coverage(q) is the expected number of tuples in the local relation associated with query q. In Method 1, E_coverage(q) is |R_L(q)|. In Method 2, E_coverage(q) is computed as

E_coverage(q) = p · |R_L(q)| / |R_S(q)|

where p is the number of tuples in a result page and the server coverage term |R_S(q)| is estimated using statistics (e.g., R_S(q) ≈ c · R_DBLP(q) in our citation scenario).
Finally, speedup ratio measures the expected number of tuples in the local relation covered by the queries divided by the number of queries sent to the server:

Speedup Ratio = Σ_{q ∈ Q} |R_L(q)| / Σ_{q ∈ Q} N(q)

N(q) is the total number of requests sent to the server for query q. In Method 1, N(q) is computed as N(q) = ⌈|R_S(q)| / p⌉, where p is the number of tuples in a result page. In Method 2, N(q) is simply 1 for all q ∈ Q. Coverage ratio measures the number of tuples in the local relation covered by Q; for the tuples covered by Q, we would need Σ_{q ∈ Q} |R_L(q)| simple queries (i.e., ID based queries) to synchronize them all, whereas using our method only Σ_{q ∈ Q} N(q) keyword based queries are needed. Speedup ratio measures the speedup between these two methods. The two measures should be considered together, as we need both a high coverage ratio and a high speedup ratio.
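For concreteness, here is a small sketch computing the two query-quality measures under the two retrieval methods; local_cov and server_cov are assumed to return |R_L(q)| and the estimated |R_S(q)|, and page_size is the number of tuples per result page.

```python
import math

def coverage_and_speedup(queries, local_cov, server_cov, n_local,
                         page_size, first_page_only):
    """Compute the coverage ratio and speedup ratio of a query set.

    first_page_only=False corresponds to Method 1 (scan all result pages),
    first_page_only=True corresponds to Method 2 (scan only the first page).
    """
    expected_covered = 0.0
    requests_sent = 0.0
    for q in queries:
        rl, rs = local_cov(q), max(server_cov(q), 1)
        if first_page_only:
            expected_covered += page_size * rl / rs   # expected hits on page one
            requests_sent += 1
        else:
            expected_covered += rl
            requests_sent += math.ceil(rs / page_size)  # one request per result page
    coverage_ratio = expected_covered / n_local
    speedup_ratio = sum(local_cov(q) for q in queries) / requests_sent
    return coverage_ratio, speedup_ratio
```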

5.2 Content Freshness

In this experiment we investigate the performance, in terms of content freshness, of the Naive 1, Naive 2 and Greedy refresh algorithms. Naive 1 sends simple queries (that is, ID based queries). Naive 2 also sends simple queries, but only for items whose λ is less than 1 (i.e., items that do not change too frequently). Greedy sends keyword queries picked by our approach. The experiment is performed under the paper citations scenario.

We generate synthetic citation data for this experiment based on statistics from the Physical Review journal [12]. The generated data contains 1000 papers with a citation distribution [12] following a power law, NC(x) ∝ x^{−α}, with α ≈ 3. We set the minimum number of citations to one, and use α = 3 to get the normalization factor 2 (i.e., we calculate the constant factor in the citation distribution so that it is normalized for x ≥ 1). So our citation distribution is NC(x) = 2/x³. Every paper is assigned an initial citation number x drawn from the NC(x) distribution. [13] uses the attachment rate A_k to characterize the development of citations: A_k gives the likelihood that a paper with k citations will be cited by a new article. The data from 110 years of Physical Review suggests that A_k is a linear function of k (the scale factor is about 0.25), especially for k less than 150. We choose A_k to be 0.25k; thus the λ of every paper is equal to a quarter of its initial citation number. As the update model of paper citations is assumed to be a Poisson process, we generate the citation update events of every paper based on its λ. In order to simulate the grouping effect of the greedyprobes algorithm, we generate a group number for each paper. We have 200 groups, each of which can cover at most 10 papers; on average, a group covers about 5 papers. The papers in a group are then synchronized together by one query.

Fig. 2 shows the content freshness of the different refresh algorithms with the same query frequency. The Naive 1, Naive 2 and Greedy lines represent the content freshness when sending 10 queries per time unit using the corresponding algorithm. During the first 100 time units, the content freshness of the three methods looks similar. From time unit 100 to 500, the content freshness of the Naive 1 and Naive 2 lines climbs in steps. Both lines grow beyond 2000 at time 200, while the Greedy line is only at about 200. Using our greedy synchronization strategy, more than 10 papers are synchronized per time unit, while the other two methods synchronize only 10 papers, so the content freshness of the Greedy line is much lower. Although the Naive 1 and Naive 2 methods synchronize the same number of papers per time unit, some papers never get synchronized under the Naive 2 strategy. When enough time has passed, the content freshness of these unsynchronized elements becomes extremely big and dominates the final value of the content freshness of the set. So the content freshness of the Naive 2 line eventually goes beyond the Naive 1 line; it grows beyond 2500 at time unit 500 and appears to keep growing afterwards. The content freshness of the Greedy line stays under 600, and the average content freshness is only about 350. It is clear that the Greedy algorithm performs best among the three methods.
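Below is a sketch of the synthetic data generation used in this experiment (our reconstruction under the stated assumptions, not the authors' code): initial citation counts are drawn from the power law NC(x) = 2/x³ by inverse-transform sampling of its continuous version, each paper's update rate is λ = 0.25 times its initial citation count, and papers are assigned to groups of up to 10.

```python
import random

def generate_papers(n=1000, groups=200, group_cap=10, seed=42):
    rng = random.Random(seed)
    papers = []
    slots = {g: 0 for g in range(groups)}
    for pid in range(n):
        # Inverse-transform sample of the continuous power law p(x) = 2/x^3, x >= 1:
        # CDF F(x) = 1 - x^(-2), so x = (1 - u)^(-1/2) for uniform u.
        citations = max(1, int(round((1 - rng.random()) ** -0.5)))
        lam = 0.25 * citations          # attachment rate A_k = 0.25 * k
        g = rng.choice([g for g in slots if slots[g] < group_cap])
        slots[g] += 1
        papers.append({"id": pid, "citations": citations, "lam": lam, "group": g})
    return papers

papers = generate_papers()
print(sum(p["citations"] for p in papers) / len(papers))  # mean initial citation count
```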
5.3 Quality of Query Probes

In the next series of experiments, we investigate the performance of our greedy synchronization approach by looking at the quality of the query probes generated by our greedyprobes algorithm at each refresh event. The quality of a set of query probes is measured using the coverage ratio and the speedup ratio.

Paper citations scenario. In the paper citations scenario, we want to maintain the citations of a set of papers. We use the DBLP data dump to estimate the value of |R_S(q)| for a query q. R_DBLP(q) is the set of papers in the DBLP data dump associated with query q. Assume the ratio R_S(q) / R_DBLP(q) is a specific constant c; based on the results of several queries, we estimate the constant c as 4.6. We create a paper set with 671 papers on 7 topics from the Google Scholar search results. Title and author are the attributes used in the experiment.

Using author as the keyword, we make a bag holding all authors related to papers that are not yet covered. These authors are then ranked by their occurrences, and the top-k frequent authors are selected to make up the keyword set K. A smaller k makes the greedyprobes algorithm faster, while a bigger k is slower but more likely to find a good solution, as it tries more candidate queries. So we vary the value of k and observe the coverage ratio and the speedup ratio, and then select the smallest k which achieves a high enough coverage and speedup ratio, i.e., we use the smallest amount of computational resources to execute the greedyprobes algorithm. The results for different k are shown in Fig. 3(a) and Fig. 3(b). The Method 1 on Author line and the Method 2 on Author line in Fig. 3(a) overlap; both lines stay at the same value, so k does not affect the coverage ratio in this case. As for the speedup ratio, the values of Method 1 and Method 2 both remain constant, with Method 2 performing slightly better than Method 1. To sum up, the keyword set size does not affect the performance of the algorithm for the author attribute, so directly choosing the most frequent author as a query already gives a satisfactory result.

The experiment on title is carried out similarly. Every title is broken into single words and collected in a bag. Stopwords are removed from the bag and the remaining words are ranked by their occurrences; the top-k frequent words are selected as the keyword set. The results are also shown in Fig. 3(a) and Fig. 3(b). When the keyword set size is 1 (k = 1), the algorithm always returns single-word queries, as the keyword set has only one word. But the selectivity of single-word queries from titles is extremely low; that is, you will get millions of papers if you send a single word from a title to the search engine. It is easy to see that single-word queries from titles can always cover the local paper set, so the coverage ratio of Method 1 on Title is 100%. But if you use Method 2 to retrieve the results, you will get an extremely low coverage ratio, as the expected number of relevant papers on the first result page is small; in the experiment, the coverage ratio of Method 2 on Title is nearly zero when k is 1. However, when the keyword set size is 2 (k = 2), the iteration procedure is more likely to return queries of two keywords, as the selectivity of a two-keyword query is much higher than that of a single-word query. The Method 2 on Title line starts from about 55%, drops below 20% immediately, and then increases to about 60% as k varies from 2 to 100.
Intuitively, large keyword set size is more likely to return a


More information

Downloading Hidden Web Content

Downloading Hidden Web Content Downloading Hidden Web Content Alexandros Ntoulas Petros Zerfos Junghoo Cho UCLA Computer Science {ntoulas, pzerfos, cho}@cs.ucla.edu Abstract An ever-increasing amount of information on the Web today

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Effective Keyword Search over (Semi)-Structured Big Data Mehdi Kargar

Effective Keyword Search over (Semi)-Structured Big Data Mehdi Kargar Effective Keyword Search over (Semi)-Structured Big Data Mehdi Kargar School of Computer Science Faculty of Science University of Windsor How Big is this Big Data? 40 Billion Instagram Photos 300 Hours

More information

Downloading Textual Hidden Web Content Through Keyword Queries

Downloading Textual Hidden Web Content Through Keyword Queries Downloading Textual Hidden Web Content Through Keyword Queries Alexandros Ntoulas UCLA Computer Science ntoulas@cs.ucla.edu Petros Zerfos UCLA Computer Science pzerfos@cs.ucla.edu Junghoo Cho UCLA Computer

More information

/ Approximation Algorithms Lecturer: Michael Dinitz Topic: Linear Programming Date: 2/24/15 Scribe: Runze Tang

/ Approximation Algorithms Lecturer: Michael Dinitz Topic: Linear Programming Date: 2/24/15 Scribe: Runze Tang 600.469 / 600.669 Approximation Algorithms Lecturer: Michael Dinitz Topic: Linear Programming Date: 2/24/15 Scribe: Runze Tang 9.1 Linear Programming Suppose we are trying to approximate a minimization

More information

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING Chapter 3 EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING 3.1 INTRODUCTION Generally web pages are retrieved with the help of search engines which deploy crawlers for downloading purpose. Given a query,

More information

LECTURE NOTES OF ALGORITHMS: DESIGN TECHNIQUES AND ANALYSIS

LECTURE NOTES OF ALGORITHMS: DESIGN TECHNIQUES AND ANALYSIS Department of Computer Science University of Babylon LECTURE NOTES OF ALGORITHMS: DESIGN TECHNIQUES AND ANALYSIS By Faculty of Science for Women( SCIW), University of Babylon, Iraq Samaher@uobabylon.edu.iq

More information

Search Engines. Information Retrieval in Practice

Search Engines. Information Retrieval in Practice Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Web Crawler Finds and downloads web pages automatically provides the collection for searching Web is huge and constantly

More information

How to Crawl the Web. Hector Garcia-Molina Stanford University. Joint work with Junghoo Cho

How to Crawl the Web. Hector Garcia-Molina Stanford University. Joint work with Junghoo Cho How to Crawl the Web Hector Garcia-Molina Stanford University Joint work with Junghoo Cho Stanford InterLib Technologies Information Overload Service Heterogeneity Interoperability Economic Concerns Information

More information

Mining Frequent Itemsets in Time-Varying Data Streams

Mining Frequent Itemsets in Time-Varying Data Streams Mining Frequent Itemsets in Time-Varying Data Streams Abstract A transactional data stream is an unbounded sequence of transactions continuously generated, usually at a high rate. Mining frequent itemsets

More information

Effective Page Refresh Policies for Web Crawlers

Effective Page Refresh Policies for Web Crawlers For CS561 Web Data Management Spring 2013 University of Crete Effective Page Refresh Policies for Web Crawlers and a Semantic Web Document Ranking Model Roger-Alekos Berkley IMSE 2012/2014 Paper 1: Main

More information

Conclusions. Chapter Summary of our contributions

Conclusions. Chapter Summary of our contributions Chapter 1 Conclusions During this thesis, We studied Web crawling at many different levels. Our main objectives were to develop a model for Web crawling, to study crawling strategies and to build a Web

More information

CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM. Please purchase PDF Split-Merge on to remove this watermark.

CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM. Please purchase PDF Split-Merge on   to remove this watermark. 119 CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM 120 CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM 5.1. INTRODUCTION Association rule mining, one of the most important and well researched

More information

Query Likelihood with Negative Query Generation

Query Likelihood with Negative Query Generation Query Likelihood with Negative Query Generation Yuanhua Lv Department of Computer Science University of Illinois at Urbana-Champaign Urbana, IL 61801 ylv2@uiuc.edu ChengXiang Zhai Department of Computer

More information

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost

More information

Searching the Deep Web

Searching the Deep Web Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index

More information

3 No-Wait Job Shops with Variable Processing Times

3 No-Wait Job Shops with Variable Processing Times 3 No-Wait Job Shops with Variable Processing Times In this chapter we assume that, on top of the classical no-wait job shop setting, we are given a set of processing times for each operation. We may select

More information

Optimizing Search Engines using Click-through Data

Optimizing Search Engines using Click-through Data Optimizing Search Engines using Click-through Data By Sameep - 100050003 Rahee - 100050028 Anil - 100050082 1 Overview Web Search Engines : Creating a good information retrieval system Previous Approaches

More information

Featured Archive. Saturday, February 28, :50:18 PM RSS. Home Interviews Reports Essays Upcoming Transcripts About Black and White Contact

Featured Archive. Saturday, February 28, :50:18 PM RSS. Home Interviews Reports Essays Upcoming Transcripts About Black and White Contact Saturday, February 28, 2009 03:50:18 PM To search, type and hit ente SEARCH RSS Home Interviews Reports Essays Upcoming Transcripts About Black and White Contact SUBSCRIBE TO OUR MAILING LIST First Name:

More information

Selecting Topics for Web Resource Discovery: Efficiency Issues in a Database Approach +

Selecting Topics for Web Resource Discovery: Efficiency Issues in a Database Approach + Selecting Topics for Web Resource Discovery: Efficiency Issues in a Database Approach + Abdullah Al-Hamdani, Gultekin Ozsoyoglu Electrical Engineering and Computer Science Dept, Case Western Reserve University,

More information

11.1 Facility Location

11.1 Facility Location CS787: Advanced Algorithms Scribe: Amanda Burton, Leah Kluegel Lecturer: Shuchi Chawla Topic: Facility Location ctd., Linear Programming Date: October 8, 2007 Today we conclude the discussion of local

More information

Advanced Data Management Technologies

Advanced Data Management Technologies ADMT 2017/18 Unit 13 J. Gamper 1/42 Advanced Data Management Technologies Unit 13 DW Pre-aggregation and View Maintenance J. Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Acknowledgements:

More information

Implementation of Enhanced Web Crawler for Deep-Web Interfaces

Implementation of Enhanced Web Crawler for Deep-Web Interfaces Implementation of Enhanced Web Crawler for Deep-Web Interfaces Yugandhara Patil 1, Sonal Patil 2 1Student, Department of Computer Science & Engineering, G.H.Raisoni Institute of Engineering & Management,

More information

Visualization and Analysis of Inverse Kinematics Algorithms Using Performance Metric Maps

Visualization and Analysis of Inverse Kinematics Algorithms Using Performance Metric Maps Visualization and Analysis of Inverse Kinematics Algorithms Using Performance Metric Maps Oliver Cardwell, Ramakrishnan Mukundan Department of Computer Science and Software Engineering University of Canterbury

More information

Deep Web Content Mining

Deep Web Content Mining Deep Web Content Mining Shohreh Ajoudanian, and Mohammad Davarpanah Jazi Abstract The rapid expansion of the web is causing the constant growth of information, leading to several problems such as increased

More information

Notes for Lecture 24

Notes for Lecture 24 U.C. Berkeley CS170: Intro to CS Theory Handout N24 Professor Luca Trevisan December 4, 2001 Notes for Lecture 24 1 Some NP-complete Numerical Problems 1.1 Subset Sum The Subset Sum problem is defined

More information

KEYWORD search is a well known method for extracting

KEYWORD search is a well known method for extracting IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 7, JULY 2014 1657 Efficient Duplication Free and Minimal Keyword Search in Graphs Mehdi Kargar, Student Member, IEEE, Aijun An, Member,

More information

RSDC 09: Tag Recommendation Using Keywords and Association Rules

RSDC 09: Tag Recommendation Using Keywords and Association Rules RSDC 09: Tag Recommendation Using Keywords and Association Rules Jian Wang, Liangjie Hong and Brian D. Davison Department of Computer Science and Engineering Lehigh University, Bethlehem, PA 18015 USA

More information

General properties of staircase and convex dual feasible functions

General properties of staircase and convex dual feasible functions General properties of staircase and convex dual feasible functions JÜRGEN RIETZ, CLÁUDIO ALVES, J. M. VALÉRIO de CARVALHO Centro de Investigação Algoritmi da Universidade do Minho, Escola de Engenharia

More information

Searching the Deep Web

Searching the Deep Web Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Analytical Modeling of Parallel Systems. To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003.

Analytical Modeling of Parallel Systems. To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. Analytical Modeling of Parallel Systems To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. Topic Overview Sources of Overhead in Parallel Programs Performance Metrics for

More information

Week - 04 Lecture - 01 Merge Sort. (Refer Slide Time: 00:02)

Week - 04 Lecture - 01 Merge Sort. (Refer Slide Time: 00:02) Programming, Data Structures and Algorithms in Python Prof. Madhavan Mukund Department of Computer Science and Engineering Indian Institute of Technology, Madras Week - 04 Lecture - 01 Merge Sort (Refer

More information

CTL.SC4x Technology and Systems

CTL.SC4x Technology and Systems in Supply Chain Management CTL.SC4x Technology and Systems Key Concepts Document This document contains the Key Concepts for the SC4x course, Weeks 1 and 2. These are meant to complement, not replace,

More information

On Biased Reservoir Sampling in the Presence of Stream Evolution

On Biased Reservoir Sampling in the Presence of Stream Evolution Charu C. Aggarwal T J Watson Research Center IBM Corporation Hawthorne, NY USA On Biased Reservoir Sampling in the Presence of Stream Evolution VLDB Conference, Seoul, South Korea, 2006 Synopsis Construction

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

CS47300: Web Information Search and Management

CS47300: Web Information Search and Management CS47300: Web Information Search and Management Web Search Prof. Chris Clifton 18 October 2017 Some slides courtesy Croft et al. Web Crawler Finds and downloads web pages automatically provides the collection

More information

Presented by: Dimitri Galmanovich. Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Gengxin Miao, Chung Wu

Presented by: Dimitri Galmanovich. Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Gengxin Miao, Chung Wu Presented by: Dimitri Galmanovich Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Gengxin Miao, Chung Wu 1 When looking for Unstructured data 2 Millions of such queries every day

More information

A mining method for tracking changes in temporal association rules from an encoded database

A mining method for tracking changes in temporal association rules from an encoded database A mining method for tracking changes in temporal association rules from an encoded database Chelliah Balasubramanian *, Karuppaswamy Duraiswamy ** K.S.Rangasamy College of Technology, Tiruchengode, Tamil

More information

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google, 1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to

More information

A Framework for Securing Databases from Intrusion Threats

A Framework for Securing Databases from Intrusion Threats A Framework for Securing Databases from Intrusion Threats R. Prince Jeyaseelan James Department of Computer Applications, Valliammai Engineering College Affiliated to Anna University, Chennai, India Email:

More information

Looking at both the Present and the Past to Efficiently Update Replicas of Web Content

Looking at both the Present and the Past to Efficiently Update Replicas of Web Content Looking at both the Present and the Past to Efficiently Update Replicas of Web Content Luciano Barbosa Ana Carolina Salgado Francisco de Carvalho Jacques Robin Juliana Freire School of Computing Centro

More information

High Dimensional Indexing by Clustering

High Dimensional Indexing by Clustering Yufei Tao ITEE University of Queensland Recall that, our discussion so far has assumed that the dimensionality d is moderately high, such that it can be regarded as a constant. This means that d should

More information

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES Journal of Defense Resources Management No. 1 (1) / 2010 AN OVERVIEW OF SEARCHING AND DISCOVERING Cezar VASILESCU Regional Department of Defense Resources Management Studies Abstract: The Internet becomes

More information

Sentiment analysis under temporal shift

Sentiment analysis under temporal shift Sentiment analysis under temporal shift Jan Lukes and Anders Søgaard Dpt. of Computer Science University of Copenhagen Copenhagen, Denmark smx262@alumni.ku.dk Abstract Sentiment analysis models often rely

More information

Document Retrieval using Predication Similarity

Document Retrieval using Predication Similarity Document Retrieval using Predication Similarity Kalpa Gunaratna 1 Kno.e.sis Center, Wright State University, Dayton, OH 45435 USA kalpa@knoesis.org Abstract. Document retrieval has been an important research

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

12.1 Formulation of General Perfect Matching

12.1 Formulation of General Perfect Matching CSC5160: Combinatorial Optimization and Approximation Algorithms Topic: Perfect Matching Polytope Date: 22/02/2008 Lecturer: Lap Chi Lau Scribe: Yuk Hei Chan, Ling Ding and Xiaobing Wu In this lecture,

More information

Foundations of Computing

Foundations of Computing Foundations of Computing Darmstadt University of Technology Dept. Computer Science Winter Term 2005 / 2006 Copyright c 2004 by Matthias Müller-Hannemann and Karsten Weihe All rights reserved http://www.algo.informatik.tu-darmstadt.de/

More information

Semantic Search in s

Semantic Search in  s Semantic Search in Emails Navneet Kapur, Mustafa Safdari, Rahul Sharma December 10, 2010 Abstract Web search technology is abound with techniques to tap into the semantics of information. For email search,

More information

Exploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge

Exploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge Exploiting Internal and External Semantics for the Using World Knowledge, 1,2 Nan Sun, 1 Chao Zhang, 1 Tat-Seng Chua 1 1 School of Computing National University of Singapore 2 School of Computer Science

More information

Advanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs

Advanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs Advanced Operations Research Techniques IE316 Quiz 1 Review Dr. Ted Ralphs IE316 Quiz 1 Review 1 Reading for The Quiz Material covered in detail in lecture. 1.1, 1.4, 2.1-2.6, 3.1-3.3, 3.5 Background material

More information

Chapter 15 Introduction to Linear Programming

Chapter 15 Introduction to Linear Programming Chapter 15 Introduction to Linear Programming An Introduction to Optimization Spring, 2015 Wei-Ta Chu 1 Brief History of Linear Programming The goal of linear programming is to determine the values of

More information

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18 601.433/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18 22.1 Introduction We spent the last two lectures proving that for certain problems, we can

More information

CS446: Machine Learning Fall Problem Set 4. Handed Out: October 17, 2013 Due: October 31 th, w T x i w

CS446: Machine Learning Fall Problem Set 4. Handed Out: October 17, 2013 Due: October 31 th, w T x i w CS446: Machine Learning Fall 2013 Problem Set 4 Handed Out: October 17, 2013 Due: October 31 th, 2013 Feel free to talk to other members of the class in doing the homework. I am more concerned that you

More information

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

More information

Theorem 2.9: nearest addition algorithm

Theorem 2.9: nearest addition algorithm There are severe limits on our ability to compute near-optimal tours It is NP-complete to decide whether a given undirected =(,)has a Hamiltonian cycle An approximation algorithm for the TSP can be used

More information

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Lecture #3: PageRank Algorithm The Mathematics of Google Search Lecture #3: PageRank Algorithm The Mathematics of Google Search We live in a computer era. Internet is part of our everyday lives and information is only a click away. Just open your favorite search engine,

More information

Empirical risk minimization (ERM) A first model of learning. The excess risk. Getting a uniform guarantee

Empirical risk minimization (ERM) A first model of learning. The excess risk. Getting a uniform guarantee A first model of learning Let s restrict our attention to binary classification our labels belong to (or ) Empirical risk minimization (ERM) Recall the definitions of risk/empirical risk We observe the

More information

SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases

SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases Jinlong Wang, Congfu Xu, Hongwei Dan, and Yunhe Pan Institute of Artificial Intelligence, Zhejiang University Hangzhou, 310027,

More information

Handout 9: Imperative Programs and State

Handout 9: Imperative Programs and State 06-02552 Princ. of Progr. Languages (and Extended ) The University of Birmingham Spring Semester 2016-17 School of Computer Science c Uday Reddy2016-17 Handout 9: Imperative Programs and State Imperative

More information

A Randomized Algorithm for Minimizing User Disturbance Due to Changes in Cellular Technology

A Randomized Algorithm for Minimizing User Disturbance Due to Changes in Cellular Technology A Randomized Algorithm for Minimizing User Disturbance Due to Changes in Cellular Technology Carlos A. S. OLIVEIRA CAO Lab, Dept. of ISE, University of Florida Gainesville, FL 32611, USA David PAOLINI

More information

Searching the Web What is this Page Known for? Luis De Alba

Searching the Web What is this Page Known for? Luis De Alba Searching the Web What is this Page Known for? Luis De Alba ldealbar@cc.hut.fi Searching the Web Arasu, Cho, Garcia-Molina, Paepcke, Raghavan August, 2001. Stanford University Introduction People browse

More information

Integer Programming Theory

Integer Programming Theory Integer Programming Theory Laura Galli October 24, 2016 In the following we assume all functions are linear, hence we often drop the term linear. In discrete optimization, we seek to find a solution x

More information

Implementation Techniques

Implementation Techniques V Implementation Techniques 34 Efficient Evaluation of the Valid-Time Natural Join 35 Efficient Differential Timeslice Computation 36 R-Tree Based Indexing of Now-Relative Bitemporal Data 37 Light-Weight

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information

Web Service Usage Mining: Mining For Executable Sequences

Web Service Usage Mining: Mining For Executable Sequences 7th WSEAS International Conference on APPLIED COMPUTER SCIENCE, Venice, Italy, November 21-23, 2007 266 Web Service Usage Mining: Mining For Executable Sequences MOHSEN JAFARI ASBAGH, HASSAN ABOLHASSANI

More information

THE WEB SEARCH ENGINE

THE WEB SEARCH ENGINE International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com

More information

Design and Analysis of Algorithms Prof. Madhavan Mukund Chennai Mathematical Institute. Module 02 Lecture - 45 Memoization

Design and Analysis of Algorithms Prof. Madhavan Mukund Chennai Mathematical Institute. Module 02 Lecture - 45 Memoization Design and Analysis of Algorithms Prof. Madhavan Mukund Chennai Mathematical Institute Module 02 Lecture - 45 Memoization Let us continue our discussion of inductive definitions. (Refer Slide Time: 00:05)

More information

Algorithms and Data Structures

Algorithms and Data Structures Algorithms and Data Structures Spring 2019 Alexis Maciel Department of Computer Science Clarkson University Copyright c 2019 Alexis Maciel ii Contents 1 Analysis of Algorithms 1 1.1 Introduction.................................

More information

Boolean Model. Hongning Wang

Boolean Model. Hongning Wang Boolean Model Hongning Wang CS@UVa Abstraction of search engine architecture Indexed corpus Crawler Ranking procedure Doc Analyzer Doc Representation Query Rep Feedback (Query) Evaluation User Indexer

More information

A General Sign Bit Error Correction Scheme for Approximate Adders

A General Sign Bit Error Correction Scheme for Approximate Adders A General Sign Bit Error Correction Scheme for Approximate Adders Rui Zhou and Weikang Qian University of Michigan-Shanghai Jiao Tong University Joint Institute Shanghai Jiao Tong University, Shanghai,

More information

AM 221: Advanced Optimization Spring 2016

AM 221: Advanced Optimization Spring 2016 AM 221: Advanced Optimization Spring 2016 Prof. Yaron Singer Lecture 2 Wednesday, January 27th 1 Overview In our previous lecture we discussed several applications of optimization, introduced basic terminology,

More information

3 Media Web. Understanding SEO WHITEPAPER

3 Media Web. Understanding SEO WHITEPAPER 3 Media Web WHITEPAPER WHITEPAPER In business, it s important to be in the right place at the right time. Online business is no different, but with Google searching more than 30 trillion web pages, 100

More information

LP-Modelling. dr.ir. C.A.J. Hurkens Technische Universiteit Eindhoven. January 30, 2008

LP-Modelling. dr.ir. C.A.J. Hurkens Technische Universiteit Eindhoven. January 30, 2008 LP-Modelling dr.ir. C.A.J. Hurkens Technische Universiteit Eindhoven January 30, 2008 1 Linear and Integer Programming After a brief check with the backgrounds of the participants it seems that the following

More information

Evolutionary Algorithms

Evolutionary Algorithms Evolutionary Algorithms Proposal for a programming project for INF431, Spring 2014 version 14-02-19+23:09 Benjamin Doerr, LIX, Ecole Polytechnique Difficulty * *** 1 Synopsis This project deals with the

More information

Department of Electronic Engineering FINAL YEAR PROJECT REPORT

Department of Electronic Engineering FINAL YEAR PROJECT REPORT Department of Electronic Engineering FINAL YEAR PROJECT REPORT BEngCE-2007/08-HCS-HCS-03-BECE Natural Language Understanding for Query in Web Search 1 Student Name: Sit Wing Sum Student ID: Supervisor:

More information

Mining Frequent Itemsets for data streams over Weighted Sliding Windows

Mining Frequent Itemsets for data streams over Weighted Sliding Windows Mining Frequent Itemsets for data streams over Weighted Sliding Windows Pauray S.M. Tsai Yao-Ming Chen Department of Computer Science and Information Engineering Minghsin University of Science and Technology

More information

End-To-End Delay Optimization in Wireless Sensor Network (WSN)

End-To-End Delay Optimization in Wireless Sensor Network (WSN) Shweta K. Kanhere 1, Mahesh Goudar 2, Vijay M. Wadhai 3 1,2 Dept. of Electronics Engineering Maharashtra Academy of Engineering, Alandi (D), Pune, India 3 MITCOE Pune, India E-mail: shweta.kanhere@gmail.com,

More information

UML-Based Conceptual Modeling of Pattern-Bases

UML-Based Conceptual Modeling of Pattern-Bases UML-Based Conceptual Modeling of Pattern-Bases Stefano Rizzi DEIS - University of Bologna Viale Risorgimento, 2 40136 Bologna - Italy srizzi@deis.unibo.it Abstract. The concept of pattern, meant as an

More information

analyzing the HTML source code of Web pages. However, HTML itself is still evolving (from version 2.0 to the current version 4.01, and version 5.

analyzing the HTML source code of Web pages. However, HTML itself is still evolving (from version 2.0 to the current version 4.01, and version 5. Automatic Wrapper Generation for Search Engines Based on Visual Representation G.V.Subba Rao, K.Ramesh Department of CS, KIET, Kakinada,JNTUK,A.P Assistant Professor, KIET, JNTUK, A.P, India. gvsr888@gmail.com

More information

Bipartite Graph Partitioning and Content-based Image Clustering

Bipartite Graph Partitioning and Content-based Image Clustering Bipartite Graph Partitioning and Content-based Image Clustering Guoping Qiu School of Computer Science The University of Nottingham qiu @ cs.nott.ac.uk Abstract This paper presents a method to model the

More information

An Extension of the Multicut L-Shaped Method. INEN Large-Scale Stochastic Optimization Semester project. Svyatoslav Trukhanov

An Extension of the Multicut L-Shaped Method. INEN Large-Scale Stochastic Optimization Semester project. Svyatoslav Trukhanov An Extension of the Multicut L-Shaped Method INEN 698 - Large-Scale Stochastic Optimization Semester project Svyatoslav Trukhanov December 13, 2005 1 Contents 1 Introduction and Literature Review 3 2 Formal

More information

Ensemble methods in machine learning. Example. Neural networks. Neural networks

Ensemble methods in machine learning. Example. Neural networks. Neural networks Ensemble methods in machine learning Bootstrap aggregating (bagging) train an ensemble of models based on randomly resampled versions of the training set, then take a majority vote Example What if you

More information

Identifying Redundant Search Engines in a Very Large Scale Metasearch Engine Context

Identifying Redundant Search Engines in a Very Large Scale Metasearch Engine Context Identifying Redundant Search Engines in a Very Large Scale Metasearch Engine Context Ronak Desai 1, Qi Yang 2, Zonghuan Wu 3, Weiyi Meng 1, Clement Yu 4 1 Dept. of CS, SUNY Binghamton, Binghamton, NY 13902,

More information

Crew Scheduling Problem: A Column Generation Approach Improved by a Genetic Algorithm. Santos and Mateus (2007)

Crew Scheduling Problem: A Column Generation Approach Improved by a Genetic Algorithm. Santos and Mateus (2007) In the name of God Crew Scheduling Problem: A Column Generation Approach Improved by a Genetic Algorithm Spring 2009 Instructor: Dr. Masoud Yaghini Outlines Problem Definition Modeling As A Set Partitioning

More information

Applied Algorithm Design Lecture 3

Applied Algorithm Design Lecture 3 Applied Algorithm Design Lecture 3 Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Applied Algorithm Design Lecture 3 1 / 75 PART I : GREEDY ALGORITHMS Pietro Michiardi (Eurecom) Applied Algorithm

More information