Decomposition-based Optimization of Reload Strategies in the World Wide Web

Dirk Kukulenz
Luebeck University, Institute of Information Systems, Ratzeburger Allee 160, Lübeck, Germany

Abstract. Web sites, Web pages and the data on pages are available only for specific periods of time and, from a client's point of view, are deleted afterwards. An important task in order to retrieve information from the Web is to consider Web information in the course of time. Different strategies like push and pull strategies may be applied for this task. Since push services are usually not available, pull strategies have to be conceived in order to optimize the retrieved information with respect to the age of retrieved data and its completeness. In this article we present a new procedure to optimize the data retrieved from Web pages by page decomposition. By deploying an automatic wrapper induction technique a page is decomposed into functional segments. Each segment is considered as an independent component for the analysis of the time behavior of the page. Based on this decomposition we present a new component-based download strategy. By applying this method to Web pages it is shown that for a fraction of Web data the freshness of retrieved data may be improved significantly compared to traditional methods.

1 Introduction

The information in the World Wide Web changes in the course of time. New Web sites appear in the Web, old sites are deleted. Pages in Web sites exist for specific periods of time. Data on pages are inserted, modified or deleted. There are important reasons to consider the information in the Web in the course of time. From a client's point of view, information that has been deleted from the Web is usually no longer accessible. Pieces of information like old news articles or (stock) prices may however still be of much value for a client. One conceivable task is the analysis of the evolution of specific information, e.g. a stock chart or the news coverage concerning a specific topic. A Web archive that mirrors the information in a specific Web area over a period of time may help a client to access information that is no longer available in the real Web. A different aspect concerning information changes in the Web is to consider information that appears in the future. Continuous queries in the Web may help a client to query future states of the Web, similar to triggers in a database context [11], [12].

There are different techniques available to realize such history- or future-based Web information analysis. In a push system a server actively provides a client with information. Information changes on a server may directly trigger the notification of a passive client [10]. In a distributed, heterogeneous environment like the World Wide Web push services are difficult to realize and are usually not available. Pull systems, on the other hand, require an active client to fetch the information from the Web when it becomes available [7]. In contrast to push systems, in a pull system the respective tool is usually not informed about the times of information changes. The pull system has to apply strategies in order to optimize the retrieved information with respect to the staleness and the completeness of the information [4]. In this article we consider the problem of retrieving information from single Web pages that appears at unknown points in time. By observing a Web page over a period of time we acquire certain aspects of the change characteristic of the page. This knowledge is used to optimize a strategy to access the information appearing on the page at future periods of time.

The basic approach presented in this article is to decompose a Web page into segments. The change dynamics of a whole Web page is usually very complex. However, the change behavior of single segments is frequently relatively simple, and the respective update patterns may easily be predicted, as is shown by examples in this article. In the article we discuss different approaches to construct a segmentation of a Web page. We motivate the use of wrapper induction techniques for page decomposition. Wrappers are tools to extract data from Web pages automatically. Recently, automatic wrapper induction techniques were introduced to learn a wrapper from a set of sample pages. The resulting wrapper is expressed by a common page grammar of the sample pages. We apply a modified wrapper induction process so that a wrapper is acquired from subsequent versions of the same page. Based on the resulting page grammar and the corresponding page segmentation technique we present a new reload strategy for information contained in Web pages. This new technique decreases the costs in terms of network traffic and optimizes the quality of the retrieved information in terms of the freshness and the completeness of the data.

The paper is organized as follows: After an overview of recent related research, the contribution of this article is described. Section 2 gives an introduction to the theoretical background of wrapper induction techniques and the applied model for the dynamic Web. Section 3 describes a framework to define page changes based on page segmentation. The main contribution of this article, a new reload optimization strategy that is based on page decomposition, is presented in sections 4 and 5. In section 6, the decomposition-based change prediction is applied to Web pages. Section 7 summarizes the results and describes further aspects.

1.1 Related research

The prediction of the times of information changes on a remote source plays an important role for diverse software systems like search engines, Web crawlers, Web caches and Web archives.

In these fields, different prediction strategies have been presented. [1] gives an introduction to problems related to optimal page refresh in the context of search engine optimization. In [4] and [13] the problem of minimizing the average level of staleness of local copies of remote Web pages is considered in the context of Web crawler optimization. The basic assumption is usually an independent and identical distribution of the time intervals between remote data changes; the series of update times is usually modeled by Poisson processes. In [3] an optimization of this approach is presented with respect to a reduction of the bias of the estimator. In a previous publication we considered the case that remote data change approximately deterministically and update times may be modeled by regular grammars [8]. The latter approach may only be applied to a fraction of Web data; for this fraction, however, the freshness of local copies may be improved significantly. Similar questions are important in order to optimize continuous queries in the Web, i.e. standing queries that monitor specific Web pages [11], [12].

In the above publications a change of a Web page is usually defined as an arbitrary change in the HTML code of the page. However, new approaches in the field of automatic Web data extraction may be applied to develop more precise definitions of Web changes. In [9] an overview of common approaches to extract data from the Web is given. The article presents a taxonomy of Web wrapping techniques, and different groups of data extraction tools are identified, e.g. wrapper induction and HTML-aware, natural language- and ontology-based tools. A fully automatic approach in the field of HTML-aware wrapper induction is described in [5]. This technique is based on the assumption that there exist similar pages in the Web that may be regarded as created by the same grammar. The task is to estimate this common grammar based on page examples. Based on the knowledge of the page grammar, data may be extracted automatically, and insight into the function of certain page components, e.g. lists, may be obtained. A further development of this approach accepting a more general class of HTML pages is presented in [2]. We apply a similar grammar inference approach in this work because successive page versions may frequently also be regarded as based on a single grammar.

1.2 Contribution

The main contribution of this article is a new strategy to reload remote data sources in the Web. In this article we consider only HTML content; however, similar considerations may also apply to XML. In most previous publications about information changes in the Web a change is defined as an arbitrary modification of the respective HTML code of the page. In contrast, for a more precise change definition we propose a decomposition of HTML pages based on automatically induced wrappers. The automatic wrapper induction is based on methods that were recently presented in the literature [5]. An extension of the respective methods is considered in order to learn wrappers from example pages that appear at subsequent and unknown points in time. Based on a decomposition of pages we propose a method to analyze changes of Web pages.

The main aspect considered in this article is a new strategy for the reload of remote data sources. The presented method is based on a decomposition step and then on a prediction of the times of remote changes that considers each component individually. We demonstrate in this article that for a fraction of Web data this decomposition-based prediction may improve pull services in terms of the freshness of retrieved data.

2 Theoretical Background

2.1 Introduction to automatic wrapper induction

The theoretical background concerning data extraction from the Web by automatic wrapper induction is similar to previous publications in this field [6], [5], [2]. We consider the fraction of pages on the Web that are created automatically based on data that are stored in a database or obtained from a different data source. The data are converted to HTML code and sent to a client. This conversion may be regarded as an encoding process, i.e. as the application of a grammar which produces HTML structures. We consider nested data types as the underlying data structure.

  <html>
    <a href="weath.html"> #PCDATA </a>
    (<p><a href="conc.html">concert</a></p>)?
    <ul>
      (<li> <b> #PCDATA </b> (<p> #PCDATA </p>)+ </li>)+
    </ul>
  </html>

Fig. 1. A common grammar of the pages in figure 2.

Given a set of sample HTML pages belonging to the same class of pages, the task is therefore to find the nested data type of the source data set that was used to create the HTML pages. In the wrapper application phase the task is to extract the respective source data instances from which the individual pages were generated. There exist different approaches for this task [5], [2]. In the following we summarize the approach in [5], which is used as a basis for the model applied in section 2.2.

Nested data types may be modeled by union-free regular grammars [5]. Let $\#PCDATA$ be a special symbol and $\Sigma$ an alphabet of symbols not containing $\#PCDATA$. A union-free regular expression (UFRE) is a string over the alphabet $\Sigma \cup \{\#PCDATA, \cdot, +, ?, (, )\}$ with the following restrictions. At first, the empty string $\epsilon$ and the elements of $\Sigma \cup \{\#PCDATA\}$ are UFREs. If $a$ and $b$ are UFREs, then $a \cdot b$, $(a)^+$ and $(a)?$ are UFREs, where $a \cdot b$ denotes a concatenation, $(a)?$ denotes an optional pattern and $(a)^+$ an iteration. Further, $(a)^*$ is a shortcut for $((a)^+)?$. In [6] it is shown that the class of union-free regular expressions has a straightforward mapping to nested data types. The $\#PCDATA$ element models string fields, $+$ models (possibly nested) lists and $?$ models nullable data fields. For a given UFRE $\sigma$ the corresponding nested type $\tau = type(\sigma)$ may be constructed in linear time. It is obvious that UFREs may not be used to model every structure appearing in the Web. They are however suitable for a significant fraction of Web data.
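The following sketch illustrates the mapping $type(\sigma)$ from UFREs to nested data types: $\#PCDATA$ yields a string field, $(a)^+$ a (possibly nested) list and $(a)?$ a nullable field. The node classes and the textual rendering of types are illustrative assumptions, not the representation used in [5] or [6]:

```python
# A minimal sketch of the mapping type(sigma); one visit per node (linear time).
from dataclasses import dataclass
from typing import List, Union

@dataclass
class PCData:                     # the #PCDATA symbol: a string field
    pass

@dataclass
class Token:                      # a fixed HTML token from the alphabet Sigma
    text: str

@dataclass
class Seq:                        # concatenation a.b: a tuple of fields
    parts: List["UFRE"]

@dataclass
class Plus:                       # (a)+: a (possibly nested) list
    body: "UFRE"

@dataclass
class Hook:                       # (a)?: a nullable field
    body: "UFRE"

UFRE = Union[PCData, Token, Seq, Plus, Hook]

def nested_type(sigma: UFRE) -> str:
    """Construct the nested type of a UFRE."""
    if isinstance(sigma, PCData):
        return "string"
    if isinstance(sigma, Token):
        return ""                 # fixed tokens carry no data
    if isinstance(sigma, Seq):
        fields = [t for t in (nested_type(p) for p in sigma.parts) if t]
        return fields[0] if len(fields) == 1 else "(" + ", ".join(fields) + ")"
    if isinstance(sigma, Plus):
        return "list<" + nested_type(sigma.body) + ">"
    inner = nested_type(sigma.body)            # Hook: nullable field
    return "optional<" + (inner or "()") + ">"

# The grammar of figure 1 as a UFRE over HTML tokens:
fig1 = Seq([Token("<html>"), Token('<a href="weath.html">'), PCData(), Token("</a>"),
            Hook(Token('<p><a href="conc.html">concert</a></p>')),
            Token("<ul>"),
            Plus(Seq([Token("<li><b>"), PCData(), Token("</b>"),
                      Plus(Seq([Token("<p>"), PCData(), Token("</p>")])),
                      Token("</li>")])),
            Token("</ul>"), Token("</html>")])

print(nested_type(fig1))  # (string, optional<()>, list<(string, list<string>)>)
```

Applied to the grammar of figure 1, the resulting nested type is a tuple containing a string field (the temperature), a nullable field (the concert link) and a nested list (the news sections).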

Let $p_1, p_2, \ldots, p_n$ be a set of HTML strings that correspond to encodings of a source data set $d_1, d_2, \ldots, d_n$ of a nested type $\tau$. It is shown in [5] that the type $\tau$ may be estimated by inferring the minimal UFRE $\sigma$ whose language $L(\sigma)$ contains the encoded strings $p_1, p_2, \ldots, p_n$. In [5] a containment relationship is defined as $\sigma_a \leq \sigma_b$ iff $L(\sigma_a) \subseteq L(\sigma_b)$. The (optimal) UFRE to describe the input strings is their least upper bound; [5] reduces this problem to the problem of finding the least upper bound of two UFREs. The grammar in figure 1 describes e.g. the pages in figure 2.

  p(a_1):
  <html>
  <a href="weath.html">temp: 14C</a>
  <p><a href="conc.html">concert</a></p>
  <ul>
  <li><b>world</b>
  <p>Togolese vote count under way</p>
  </li>
  <li><b>business</b>
  <p>A mix it up Monday</p>
  </li>
  </ul>
  </html>

  p(a_2):
  <html>
  <a href="weath.html">temp: 18C</a>
  <ul>
  <li><b>world</b>
  <p>Space crew returns</p>
  <p>Markey: Energy off base</p>
  </li>
  <li><b>business</b>
  <p>Stocks wait on rates?</p>
  </li>
  <li><b>sports</b>
  <p>Burns: Fast Breaks</p>
  <p>Day two NFL</p>
  </li>
  </ul>
  </html>

Fig. 2. The HTML sources of a Web page at two different points in time $a_1$ and $a_2$ with different kinds of changes like insertions, deletions and modifications (the temperature string is modified, the concert link is deleted, news items are deleted and inserted, and a new list element, sports, is inserted).

2.2 A model for data changes in the Web

In contrast to the data model in the previous section, where structurally similar pages appear at the same time, in this article we consider page versions appearing at different time intervals. For this purpose a time stamp is attached to every page version. Let $u_i \in \mathbb{R}^+$ denote the point in time at which the $i$-th update of a page occurs, where $0 \le u_1 \le u_2 \le \ldots \le u_n \le T \in \mathbb{R}^+$, $n \in \mathbb{N}$. The time span between the $(i-1)$-st and $i$-th update will be denoted by $t_i := u_i - u_{i-1}$, $i \in \mathbb{N}$. This is the lifetime of a page version. The different page versions are denoted as $p(u_1), p(u_2), \ldots, p(u_n)$. Let $a_1, a_2, \ldots, a_m \in \mathbb{R}^+$ denote the points in time at which reload operations of the remote source are executed, where $0 \le a_1 \le a_2 \le \ldots \le a_m \le T$. The set of local copies of remote page versions $q(a_1), q(a_2), \ldots$ is obviously a subset of the remote page versions. For $t \in \mathbb{R}^+$ let $N^u(t)$ denote the largest index of an element in the sequence $u$ that does not exceed $t$, i.e. $N^u(t) := \max\{n \mid u_n \le t\}$. Let $A^u(t) \in \mathbb{R}^+$ denote the size of the time interval since the last update, i.e. $A^u(t) := t - u_{N^u(t)}$. If $t$ is the time of a reload ($t = a_i$ for $i \le m$), we denote $A^u(t)$ as the age of $q(a_i)$. The age of a local copy denotes how much time has passed since the last remote data update and thus how long an old copy of the data was stored although a new version should have been considered. Finding an optimal reload strategy means that after each update of the remote data source, the data should be reloaded as soon as possible, i.e. the sum of ages $\mathrm{sumage} := \sum_{i=1}^{m} A^u(a_i)$ has to be minimal. In addition, the number of reloads should be as small as possible, and no change of the data source should go unobserved. The number of lost (unobserved) data objects will be denoted as loss in the experiments.
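As a concrete illustration, the freshness measures defined above may be computed as in the following sketch; the function names follow the text, and the example update and reload times are made up for illustration:

```python
# A minimal sketch of the freshness measures of section 2.2: N^u(t), A^u(t),
# the age of a local copy q(a_i) and the sum of ages ("sumage").
from bisect import bisect_right
from typing import Sequence

def N(u: Sequence[float], t: float) -> int:
    """Largest (1-based) index of an update not later than t: max{n | u_n <= t}."""
    return bisect_right(u, t)                  # u must be sorted

def A(u: Sequence[float], t: float) -> float:
    """Time passed since the last update before t: A^u(t) = t - u_{N^u(t)}."""
    n = N(u, t)
    return t - u[n - 1] if n > 0 else 0.0      # age 0 before the first update

def sumage(u: Sequence[float], a: Sequence[float]) -> float:
    """Sum of ages over all reload times a_1, ..., a_m; to be minimized."""
    return sum(A(u, t) for t in a)

u = [10.0, 110.0, 210.0]                       # update times u_1, ..., u_n
a = [15.0, 130.0, 215.0]                       # reload times a_1, ..., a_m
print(sumage(u, a))                            # 5.0 + 20.0 + 5.0 = 30.0
```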

One question considered in the following is the prediction of remote update times. For this purpose we consider in this article the special case that remote updates of page components are performed after deterministic time intervals. Let $Q := \{t_j \mid j \le n \in \mathbb{N}\}$ denote the set of time intervals between updates of a page component. We assign a symbol $s_i$, $i \le n$, to every element of $Q$ and call the set of symbols $\{s_i \mid i \le n\}$ the alphabet of the sequence $(u_j)_{j \le T}$.¹

¹ Due to the size of the sampling interval and due to network delays, intervals between updates registered at the client side are distorted. The symbols are used to comprise registered intervals that are assumed to result from identical update intervals on the server side.

Let $S$ denote a starting symbol, let $r_1, r_2, \ldots, r_n$ denote terminals and the symbols $R_1, R_2, \ldots, R_n$ non-terminals. In the following we refer to a regular grammar $\Gamma$ corresponding to the non-deterministic finite automaton in figure 3 as a cyclic regular grammar [8]. In figure 3, $R_0$ is a starting state which leads to any of the $n$ states $R_1, \ldots, R_n$. After this, the list of symbols is accepted in a cyclic way. Every state is an accepting state. To abbreviate this definition we will use the notation $(r_1 r_2 \ldots r_n) := \Gamma$.

Fig. 3. Nondeterministic automaton corresponding to the grammar $(r_1 r_2 \ldots r_n)$: from the starting state $R_0$ the automaton may enter any of the states $R_1, \ldots, R_n$; the symbols $r_1, \ldots, r_n$ are then accepted cyclically.

One problem is to find an optimal sequence $a_1, a_2, \ldots, a_m$ in order to capture most of the data changes necessary to learn the wrapper, which may be applied for a page decomposition as described in section 3. The set of local page versions $q(a_1), q(a_2), \ldots, q(a_m)$ is the input set of positive examples for the wrapper induction task. The basic problem concerning the wrapper induction is to find a basic schema for the page versions that appear at the considered URL. In [5] it is shown that the considered language is not only identifiable in the limit but that a rich set of examples is sufficient to learn the grammar. This set of examples has a high probability of being found in a number of randomly chosen HTML pages. In this article we don't consider this problem in detail and assume in the experiments that a correct wrapper has been learned from previous page versions.
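The following sketch illustrates the cyclic acceptance behavior of such a grammar; the plain list representation of the cycle is an assumption made for illustration. It also anticipates the phase detection of section 5: given a sequence of observed symbols, it returns all start states at which the automaton accepts the sequence.

```python
# A minimal sketch of the cyclic regular grammar (r_1 r_2 ... r_n) of figure 3.
from typing import List, Sequence

def accepting_phases(cycle: Sequence[str], observed: Sequence[str]) -> List[int]:
    """All start offsets i with observed[k] == cycle[(i + k) mod n] for all k."""
    n = len(cycle)
    return [i for i in range(n)
            if all(observed[k] == cycle[(i + k) % n] for k in range(len(observed)))]

cycle = ["a", "b", "a", "b", "c"]               # the grammar (ababc) of section 5
print(accepting_phases(cycle, ["a"]))           # [0, 2] -> current state ambiguous
print(accepting_phases(cycle, ["a", "b", "c"])) # [2]    -> state determined
```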

3 A definition of page changes

3.1 Motivation

The question considered in this section is how to find a description for changes in a Web page. An appropriate definition is necessary in order to acquire an understanding of the nature of changes. If e.g. a page provides a certain stock price every minute and in addition provides news related to the respective company, which appear about three times a day, it may be helpful to separate the two change dynamics in order to optimize e.g. a pull client.

3.2 Decomposition strategies

Different concepts for a segmentation of pages are conceivable that are suitable for a change analysis of components. One approach is to consider the DOM (document object model) tree of an HTML page. Changes may be attached to nodes (HTML tags) in this tree. The problem with this strategy is that HTML tags are in general only layout directives, and there may be additional structures that are not apparent from the HTML tag structure, as e.g. in the case of the inner list in $p(a_2)$ in figure 2. A method to acquire additional information about the page structure is to consider a set of similar pages as described in section 2 and to determine the common grammar. This grammar may be used to decompose a page into segments as described below and to attach changes in a page to these segments.

3.3 Wrapper-based change definition

Fig. 4. Abstract syntax tree (AST) corresponding to the pages in figure 2, with nodes C1-C7 at the first level, C41 below the optional element, C61-C65 below the outer list element and C641 below the inner list. AND-nodes refer to tuples, PLUS-nodes refer to lists and HOOK-nodes refer to optional elements. Boxes framing AST nodes denote time components (i.e. groups of nodes with a similar change characteristic). These groups constitute nodes of the TCT tree.

The decomposition-based time analysis is based on the estimated page grammar acquired from a sample set of pages as described in section 2. One notation for the page grammar is an abstract syntax tree (AST), shown in figure 4 for the pages in figure 2; it corresponds to the grammar in figure 1. By merging adjacent nodes or subtrees in the AST that have the same change dynamics, the AST may be transformed into the time-component-type tree (TCT tree). The TCT tree, illustrated by the boxes in figure 4, shows the different time components of the AST. A time component is a node (or a set of adjacent nodes in the AST that have the same time characteristic) and the respective subtrees. A change of a component requires a change in the node or the attached subtrees. Components that actually change in time are marked by thick rectangles.

Based on the TCT tree, different strategies to define page changes are conceivable. The any-difference method described above, which considers an arbitrary change in the HTML source of a page, is equivalent to considering only the root node of the TCT tree. More specific methods consider deeper levels of the TCT tree up to a certain depth. It is also conceivable that only specific nodes or subtrees of the TCT tree are considered, chosen e.g. manually by a user. In order to analyze the change dynamics of a page over time it may not be sufficient to consider the TCT tree, because nodes in the TCT tree may correspond to a number of instances of the respective type in real page instances. We will refer to the respective instance tree as the time-component-instance tree (TCI tree). Now we may finally define changes of a Web page. A (single) change $c$ is a tuple $c = (node_{TCI}, t)$, where $t$ is the time of the change. This change definition depends on the definition of the TCI tree, which itself is based on the decomposition by the grammar. The possible causes of changes depend on the position of the respective node in the TCI tree as described in the example below. A change pattern is a set of time series, each of which is connected to a node in the considered TCI tree.

As an example the Web page in figure 2 may be considered. If the page change function is defined to consider the first level of the TCT tree (including levels 0 and 1), the TCT tree has 7 nodes at level 1, similar to the corresponding TCI tree. Three of these nodes (C2, C4 and C6) actually change in the course of time. A change is detected if the string in element C2 is changed, if the optional element C4 appears, is modified or deleted, or if a change in the list element C6 occurs, i.e. a list element is inserted, modified or deleted.
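The grouping of AST nodes into time components may be sketched as follows. The flat list of sibling nodes and the representation of a component's change characteristic as a set of change times are simplifying assumptions; the example assumes an observation period in which, of the nodes of figure 4, only C2 and C6 changed:

```python
# A minimal sketch: adjacent sibling nodes with identical observed
# change-time series are merged into one time component of the TCT tree.
from collections import defaultdict
from itertools import groupby
from typing import Dict, FrozenSet, List, Tuple

Change = Tuple[str, float]        # a single change c = (node, t)

def change_series(observed: List[Change]) -> Dict[str, FrozenSet[float]]:
    """Collect the time series of changes per node."""
    series = defaultdict(set)
    for node, t in observed:
        series[node].add(t)
    return {node: frozenset(ts) for node, ts in series.items()}

def time_components(siblings: List[str],
                    series: Dict[str, FrozenSet[float]]) -> List[List[str]]:
    """Merge adjacent siblings with the same change characteristic."""
    return [list(group) for _, group in
            groupby(siblings, key=lambda node: series.get(node, frozenset()))]

observed = [("C2", 60.0), ("C2", 120.0), ("C6", 30.0), ("C6", 90.0)]
print(time_components(["C1", "C2", "C3", "C4", "C5", "C6", "C7"],
                      change_series(observed)))
# [['C1'], ['C2'], ['C3', 'C4', 'C5'], ['C6'], ['C7']]
```

The static nodes C3, C4 and C5 are merged into one component, while the changing nodes C2 and C6 each form a component of their own.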

4 Component-based update estimation

In this article we consider the special case that components of a specific Web page are updated deterministically. In order to model the update characteristics of each component we apply a cyclic regular grammar (fig. 3). In [8] a method to learn similar update characteristics of entire Web pages is presented. In this section we present a method to apply the respective algorithm to pages where not entire pages but page components change deterministically. The inputs of this Component-Update-Estimation algorithm are the target URL of the considered page and a wrapper of the respective page. In this algorithm the data source is reloaded after constant periods of time. In each cycle the data contained in page components are extracted by applying the wrapper. Based on the current and the previous data vector, changes occurring in page components are registered. Based on detected changes of components, the intervals between updates of specific components may be estimated.² The intervals are used to estimate symbols by a clustering process (section 2.2) [8]. Based on the sequences of symbols associated to components, the cyclic regular grammars (section 2.2) may be estimated. A termination criterion concerning the grammar estimation is applied to mark the grammar of a component as determined. This termination criterion may e.g. consider the number of times a new symbol is detected which has also been predicted correctly by the current grammar estimation. Finally, after the detection of the cyclic regular grammar of each component, the respective grammars are stored in a vector, which is denoted as timegrammarvector in the following.

² Due to the finite sampling interval length, the interval length between updates is only an estimation.
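The clustering step that maps registered intervals to symbols may be sketched as follows; the simple threshold clustering used here is an illustrative assumption, and [8] describes the actual estimation procedure:

```python
# A minimal sketch: registered inter-update intervals are distorted by the
# sampling raster and network delays, so intervals that lie close together
# are comprised into one symbol.
from typing import Dict, List, Tuple

def cluster_symbols(intervals: List[float],
                    tol: float) -> Tuple[List[str], Dict[str, float]]:
    """Map each registered interval to a symbol; intervals within `tol`
    of an existing cluster centre reuse that cluster's symbol."""
    centres: List[float] = []
    sequence: List[str] = []
    for length in intervals:
        for i, c in enumerate(centres):
            if abs(length - c) <= tol:
                sequence.append(f"s{i}")
                break
        else:                           # no cluster close enough: open a new one
            centres.append(length)
            sequence.append(f"s{len(centres) - 1}")
    return sequence, {f"s{i}": c for i, c in enumerate(centres)}

# Intervals (seconds) registered for one component, distorted by sampling:
seq, alphabet = cluster_symbols([98.7, 101.2, 89.5, 99.8, 90.4], tol=5.0)
print(seq)        # ['s0', 's0', 's1', 's0', 's1']
print(alphabet)   # {'s0': 98.7, 's1': 89.5}
```

The resulting symbol sequence per component is the input from which the cyclic regular grammar of that component is estimated.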

5 Download optimization

The new download optimization strategy may now be described by the algorithm Component-Based-Reload in figure 5.

  Component-Based-Reload(wrapper, timegrammarvector, URL)
   1  set previous content vector = ∅
   2  reload source(URL)
   3  extract current content vector (wrapper)
   4  copy current content vector to previous content vector
   5  while ∃ component where phase not detected
   6      reload source(URL)
   7      extract current content vector (wrapper)
   8      for each component: compare previous and current content
   9      for each component: extract symbols (timegrammarvector)
  10      for each component: match phase (timegrammarvector)
  11      if phase of component j is determined
  12          mark phase of component j as determined
  13          start download thread for component j (timegrammarvector)
  14      wait(Δt)
  15      copy current content vector to previous content vector

Fig. 5. The Component-Based-Reload algorithm. For each component an independent reload thread is started after a phase detection.

The main aspect of the algorithm is to determine the different phases³ of the components. If e.g. the cyclic regular grammar of a component is (ababc) and we register a symbol a, the current state of the respective automaton is ambiguous. A sequence of successive symbols has to be considered in order to disambiguate the current state (symbol) of a component. In the algorithm in figure 5 the remote source is reloaded frequently from the considered URL (steps 2, 6). The data contained in the respective components are extracted by applying the wrapper (steps 3, 7), and the content is compared to the content of previous component versions (step 8). By this method current symbols may be extracted and the symbols may be compared to the respective grammar in the time-grammar vector (obtained from the grammar estimation algorithm) until the current symbol is unique. These comparisons are performed in steps 9 and 10 of the algorithm. If the phase of a component has been detected, in step 11 a download thread is started for this component that predicts further symbols of the respective component and performs reload operations. In particular, in this reload strategy the remote source is loaded shortly before the expected remote change (as provided by the cyclic regular grammar of a component) and then reloaded with a high frequency until the change has been detected [8]. By this method a feedback is acquired that is necessary to compensate deviations of the registered update times of a remote source due to server and network delays. After the phase detection, a number of threads corresponding to the number of page components is running and performing reload operations.

³ With the term phase we denote the current symbol of the update characteristic of a component (fig. 3) and roughly the position in the respective interval.
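The phase detection may be sketched for a single component as follows. The helpers fetch(), extract() and to_symbol() are hypothetical stand-ins for reloading the source, applying the wrapper and the symbol clustering of section 4; accepting_phases() is repeated from the sketch in section 2.2. Only the control flow of figure 5 is illustrated, under these assumptions, not an actual implementation:

```python
# A minimal sketch of the phase detection (steps 5-12 of figure 5).
import time
from typing import Callable, List, Sequence

def accepting_phases(cycle: Sequence[str], observed: Sequence[str]) -> List[int]:
    # as in the sketch of section 2.2
    n = len(cycle)
    return [i for i in range(n)
            if all(observed[k] == cycle[(i + k) % n] for k in range(len(observed)))]

def detect_phase(cycle: Sequence[str],
                 fetch: Callable[[], str],          # reload source (steps 2, 6)
                 extract: Callable[[str], str],     # apply wrapper (steps 3, 7)
                 to_symbol: Callable[[float], str], # clustering of section 4
                 dt: float) -> int:
    """Reload with period dt until the observed symbols match the cyclic
    grammar at exactly one offset; return the index of the next symbol."""
    previous = extract(fetch())
    observed: List[str] = []
    last_change = time.monotonic()
    while True:
        time.sleep(dt)                              # wait(dt), step 14
        current = extract(fetch())
        if current != previous:                     # step 8: content changed
            now = time.monotonic()
            observed.append(to_symbol(now - last_change))  # step 9
            last_change, previous = now, current
            phases = accepting_phases(cycle, observed)     # step 10
            if len(phases) == 1:                    # steps 11-12: determined
                return (phases[0] + len(observed)) % len(cycle)
```

Once the phase is known, the download thread of step 13 may schedule the next reload shortly before the time of the last change plus the interval associated with the predicted symbol.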

6 Experiments

6.1 Experimental setup

The experiment consists of two basic stages. In the estimation stage the change dynamics of a Web page is estimated by observing the page over a period of time. This stage consists of four phases that are performed successively.

Phase 1: In the first phase the basic reload frequency for the wrapper induction and subsequent sampling processes is determined [8].

Phase 2: In this phase the wrapper is induced based on successive versions of the page. The estimated wrapper is subject to a development in the course of time. A heuristic criterion is applied to terminate the induction process.

Phase 3: Based on the wrapper and the respective parser, in this phase changes on the page are examined over a period of time. The result is a vector of component change series indicating at which times a specific component changed in the considered period of time.

Phase 4: In this phase the detected change behavior of the page obtained in the previous step is examined. The result is a vector of update grammars of the different components (section 4).

In the second, application stage the knowledge about the page dynamics acquired in the previous stage is applied for the extraction of data in the course of time.

Phase 5: The data extraction algorithm considers each component independently. The phase of the update pattern of each component has to be detected (section 5).

Phase 6: In this phase data are extracted by component-based reload requests.

6.2 Application examples

In the experiments we apply artificial and real pages in order to demonstrate the main benefits of the new method. In order to measure the quality of a reload strategy we consider the costs in terms of the number of reload operations (causing network traffic etc.), the number of lost data objects and the age of components (section 2.2). The steps of the procedure in section 6.1 are successively applied to a page and the respective results are demonstrated.

In a first example we consider a page that contains links to images of different geostationary satellites. After the wrapper induction the final TCI graph has 117 leaf nodes, all of which are related to links to specific satellite images in the page. Figures 6 and 7 show the result of step 3 of the procedure.

Fig. 6. Detected changes of the Web page by the "any-change" method over a period of 2 days and 7 hours.

Fig. 7. Change dynamics of sample components (c-38, c-25, c-26, c-55) of the Web page.

The "any-change" update detection in figure 6 shows a complex change pattern. After a grammar-based decomposition, however, different components in figure 7 show simple and deterministic update characteristics. The underlying process applied to generate the page automatically may easily be understood. The change patterns obtained from step 3 of the experiment may be used to estimate the grammar vector in step 4 as described in section 4. The grammar vector is then used for the multi-phase detection (phase 5). As a visualization of phase 6 of the procedure, figure 8 shows the application of different reload strategies. In this experiment we consider, for reasons of simplicity, only a fraction of the entire page, in particular the fraction consisting of the components c-38 and c-26 in figure 7.

The superposition of the update patterns of the components c-38 and c-26, as depicted at the top of figure 8, reveals very close update operations. Because of the finite resolution of the estimation process due to network delays etc., the superposed update pattern may not be regarded as a simple deterministic pattern. A quasi-deterministic prediction as presented in [8] may therefore not be applied, and only common constant-frequency sampling methods are applicable, as depicted in the second graph ("freq.") in the center of figure 8. After a grammar-based decomposition in phase 3 of the procedure, the simple update characteristics of the components c-38 and c-26 shown in figure 7 are revealed. After this decomposition step the different update characteristics of the components may be estimated and applied for the prediction of future updates of the remote source, as shown in the third graph ("dec.") of figure 8, which depicts the reload operations triggered by the different components.

Fig. 8. Visualization of different reload strategies. The first graph ("updates") shows original updates. The second graph ("freq.") shows reloads applying a constant reload frequency. The third graph ("dec.") shows a decomposition-based reload strategy where the data source consists of two components.

In contrast to the constant-frequency reload strategy illustrated in the center of figure 8 ("freq."), it may be observed that the decomposition-based reload operations are performed close to the points in time of remote update operations.

We present a second example to demonstrate the applicability of a grammar-based decomposition in order to obtain a clear presentation of the update characteristics of a page. In this example an artificial page is considered that consists of two components. Each component is updated after a constant period of time (figure 10). The two update intervals are however slightly different (100 and 90 seconds). The superposition of the components' update patterns as registered by the "any-change" difference measure is shown in figure 9.

Fig. 9. Detected changes by the "any-change" method in the second example. No regularity in the change pattern may be observed.

Although the basic scenario is simple, the update characteristic of the page is quite complex and no regularities in the update pattern may be observed.⁴ A prediction of future update times is hardly possible. After a grammar-based decomposition the basic update characteristics of the components are easily revealed (figure 10).

Fig. 10. Analysis of the change dynamics of the two components in the second experiment.

⁴ Regularities may obviously be observed at a larger time scale. However, the length of the periods may become arbitrarily large for more complex examples.
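The effect may also be reproduced with a small simulation in the spirit of this second example. The evaluation function, the time horizon and the reload budgets below are made-up illustration values, not the actual experimental setup:

```python
# A rough simulation of the second example: two components updated every
# 100 s and 90 s, compared under the measures of section 2.2.
def updates(period: float, horizon: float) -> list:
    return [i * period for i in range(1, int(horizon / period) + 1)]

def evaluate(u: list, a: list) -> tuple:
    """(downloads, lost versions, sum of ages) for reload times a."""
    last = lambda t: max((x for x in u if x <= t), default=None)
    seen = {last(t) for t in a}                     # versions caught by a reload
    loss = sum(1 for x in u if x not in seen)
    age = lambda t: t - last(t) if last(t) is not None else 0.0
    return len(a), loss, sum(age(t) for t in a)

horizon = 1800.0
c1, c2 = updates(100.0, horizon), updates(90.0, horizon)

# Page-level constant-frequency strategy: one reload every 45 seconds.
page = sorted(set(c1) | set(c2))
print(evaluate(page, [45.0 * i for i in range(1, 41)]))

# Component-based strategy with a similar budget: each component is
# reloaded shortly (2 s) after its predicted update times.
d1, d2 = evaluate(c1, [t + 2.0 for t in c1]), evaluate(c2, [t + 2.0 for t in c2])
print(d1[0] + d2[0], d1[1] + d2[1], d1[2] + d2[2])  # far smaller sum of ages
```

With a comparable number of downloads, the component-based schedule keeps the ages of the local copies close to zero, which mirrors the qualitative behavior reported in table 1.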

The decomposition-based reload is performed similarly to the first example (figure 8). Table 1 shows numeric results for the experiments described above.

                        downloads    loss    sumage (seconds)
  experiment 1
    constant method
    decomp. method
  experiment 2
    constant method
    decomp. method

Table 1. Comparison of the constant-frequency sampling and the decomposition-based reload strategy.

In order to compare the methods, the costs, i.e. the number of downloads triggered by a respective download strategy, are fixed in this comparison. Since the quality parameters may be different for different components (if the decomposition-based method is applied), the values in table 1 constitute mean values with respect to all components. The table shows that the values for lost information are similar for both methods, while the age of the data may be reduced significantly by the decomposition-based reload strategy.

7 Conclusion

In this article we presented a new reload strategy for Web information that is based on a decomposition of pages into functional segments. For the segmentation we applied automatic wrapper induction techniques.

Successive versions of a Web page are used as sample pages for the wrapper induction process. Using artificial and real examples we showed that the quality of retrieved information may be improved significantly compared to traditional techniques (sampling with constant frequency). The (deterministic) change prediction based on page decomposition presented in this article is a method that may be applied only to a fraction of Web pages. If page components change statistically, further optimization strategies have to be developed. However, also in this case page decomposition may reveal new optimization strategies for client-side data retrieval tools. A further research aspect is to achieve a higher degree of automation, e.g. if different kinds of deterministic and statistical change characteristics are involved on a single page.

References

1. A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan. Searching the web. ACM Trans. Inter. Tech., 1(1):2-43, 2001.
2. A. Arasu and H. Garcia-Molina. Extracting structured data from web pages. In SIGMOD '03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 2003. ACM Press.
3. J. Cho and H. Garcia-Molina. Estimating frequency of change. ACM Trans. Inter. Tech., 3(3), 2003.
4. E. Coffman, Z. Liu, and R. R. Weber. Optimal robot scheduling for web search engines. Journal of Scheduling, 1(1):15-29, June 1998.
5. V. Crescenzi and G. Mecca. Automatic information extraction from large websites. J. ACM, 51(5), 2004.
6. S. Grumbach and G. Mecca. In search of the lost schema. In ICDT '99: Proc. of the 7th Int. Conf. on Database Theory, London, UK, 1999. Springer-Verlag.
7. J. E. Kendall and K. E. Kendall. Information delivery systems: an exploration of web pull and push technologies. Commun. AIS, 1(4es), 1999.
8. D. Kukulenz. Capturing web dynamics by regular approximation. In X. Zhou et al., editors, WISE '04, Web Information Systems, LNCS 3306. Springer-Verlag, Berlin Heidelberg, 2004.
9. A. Laender, B. Ribeiro-Neto, A. Silva, and J. Teixeira. A brief survey of web data extraction tools. SIGMOD Record, June 2002.
10. C. Olston and J. Widom. Best-effort cache synchronization with source cooperation. In Proceedings of SIGMOD, pages 73-84, May 2002.
11. S. Pandey, K. Ramamritham, and S. Chakrabarti. Monitoring the dynamic web to respond to continuous queries. In WWW '03: Proc. of the 12th Int. Conf. on World Wide Web, New York, NY, USA, 2003. ACM Press.
12. M. A. Sharaf, A. Labrinidis, P. K. Chrysanthis, and K. Pruhs. Freshness-aware scheduling of continuous queries in the dynamic web. In 8th Int. Workshop on the Web and Databases (WebDB 2005), Baltimore, Maryland, pages 73-78, 2005.
13. J. L. Wolf, M. S. Squillante, P. S. Yu, J. Sethuraman, and L. Ozsen. Optimal crawling strategies for web search engines. In Proceedings of the Eleventh International Conference on World Wide Web. ACM Press, 2002.


More information

Adaptive Data Dissemination in Mobile ad-hoc Networks

Adaptive Data Dissemination in Mobile ad-hoc Networks Adaptive Data Dissemination in Mobile ad-hoc Networks Joos-Hendrik Böse, Frank Bregulla, Katharina Hahn, Manuel Scholz Freie Universität Berlin, Institute of Computer Science, Takustr. 9, 14195 Berlin

More information

A Framework for Incremental Hidden Web Crawler

A Framework for Incremental Hidden Web Crawler A Framework for Incremental Hidden Web Crawler Rosy Madaan Computer Science & Engineering B.S.A. Institute of Technology & Management A.K. Sharma Department of Computer Engineering Y.M.C.A. University

More information

Searching the Web What is this Page Known for? Luis De Alba

Searching the Web What is this Page Known for? Luis De Alba Searching the Web What is this Page Known for? Luis De Alba ldealbar@cc.hut.fi Searching the Web Arasu, Cho, Garcia-Molina, Paepcke, Raghavan August, 2001. Stanford University Introduction People browse

More information

Search Engines. Information Retrieval in Practice

Search Engines. Information Retrieval in Practice Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Web Crawler Finds and downloads web pages automatically provides the collection for searching Web is huge and constantly

More information

Recognizing regular tree languages with static information

Recognizing regular tree languages with static information Recognizing regular tree languages with static information Alain Frisch (ENS Paris) PLAN-X 2004 p.1/22 Motivation Efficient compilation of patterns in XDuce/CDuce/... E.g.: type A = [ A* ] type B =

More information

The Markov Reformulation Theorem

The Markov Reformulation Theorem The Markov Reformulation Theorem Michael Kassoff and Michael Genesereth Logic Group, Department of Computer Science Stanford University {mkassoff, genesereth}@cs.stanford.edu Abstract In this paper, we

More information

Web Scraping Framework based on Combining Tag and Value Similarity

Web Scraping Framework based on Combining Tag and Value Similarity www.ijcsi.org 118 Web Scraping Framework based on Combining Tag and Value Similarity Shridevi Swami 1, Pujashree Vidap 2 1 Department of Computer Engineering, Pune Institute of Computer Technology, University

More information

On Distributed Algorithms for Maximizing the Network Lifetime in Wireless Sensor Networks

On Distributed Algorithms for Maximizing the Network Lifetime in Wireless Sensor Networks On Distributed Algorithms for Maximizing the Network Lifetime in Wireless Sensor Networks Akshaye Dhawan Georgia State University Atlanta, Ga 30303 akshaye@cs.gsu.edu Abstract A key challenge in Wireless

More information

Stochastic Models of Pull-Based Data Replication in P2P Systems

Stochastic Models of Pull-Based Data Replication in P2P Systems Stochastic Models of Pull-Based Data Replication in P2P Systems Xiaoyong Li and Dmitri Loguinov Presented by Zhongmei Yao Internet Research Lab Department of Computer Science and Engineering Texas A&M

More information

A SIMPLE APPROXIMATION ALGORITHM FOR NONOVERLAPPING LOCAL ALIGNMENTS (WEIGHTED INDEPENDENT SETS OF AXIS PARALLEL RECTANGLES)

A SIMPLE APPROXIMATION ALGORITHM FOR NONOVERLAPPING LOCAL ALIGNMENTS (WEIGHTED INDEPENDENT SETS OF AXIS PARALLEL RECTANGLES) Chapter 1 A SIMPLE APPROXIMATION ALGORITHM FOR NONOVERLAPPING LOCAL ALIGNMENTS (WEIGHTED INDEPENDENT SETS OF AXIS PARALLEL RECTANGLES) Piotr Berman Department of Computer Science & Engineering Pennsylvania

More information

Scalability via Parallelization of OWL Reasoning

Scalability via Parallelization of OWL Reasoning Scalability via Parallelization of OWL Reasoning Thorsten Liebig, Andreas Steigmiller, and Olaf Noppens Institute for Artificial Intelligence, Ulm University 89069 Ulm, Germany firstname.lastname@uni-ulm.de

More information

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 The Encoding Complexity of Network Coding Michael Langberg, Member, IEEE, Alexander Sprintson, Member, IEEE, and Jehoshua Bruck,

More information

ISSN: (Online) Volume 2, Issue 3, March 2014 International Journal of Advance Research in Computer Science and Management Studies

ISSN: (Online) Volume 2, Issue 3, March 2014 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) Volume 2, Issue 3, March 2014 International Journal of Advance Research in Computer Science and Management Studies Research Article / Paper / Case Study Available online at: www.ijarcsms.com

More information

Managing test suites for services

Managing test suites for services Managing test suites for services Kathrin Kaschner Universität Rostock, Institut für Informatik, 18051 Rostock, Germany kathrin.kaschner@uni-rostock.de Abstract. When developing an existing service further,

More information

The 3-Steiner Root Problem

The 3-Steiner Root Problem The 3-Steiner Root Problem Maw-Shang Chang 1 and Ming-Tat Ko 2 1 Department of Computer Science and Information Engineering National Chung Cheng University, Chiayi 621, Taiwan, R.O.C. mschang@cs.ccu.edu.tw

More information

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures Springer Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web

More information

PathStack : A Holistic Path Join Algorithm for Path Query with Not-predicates on XML Data

PathStack : A Holistic Path Join Algorithm for Path Query with Not-predicates on XML Data PathStack : A Holistic Path Join Algorithm for Path Query with Not-predicates on XML Data Enhua Jiao, Tok Wang Ling, Chee-Yong Chan School of Computing, National University of Singapore {jiaoenhu,lingtw,chancy}@comp.nus.edu.sg

More information

A noninformative Bayesian approach to small area estimation

A noninformative Bayesian approach to small area estimation A noninformative Bayesian approach to small area estimation Glen Meeden School of Statistics University of Minnesota Minneapolis, MN 55455 glen@stat.umn.edu September 2001 Revised May 2002 Research supported

More information

On Reduct Construction Algorithms

On Reduct Construction Algorithms 1 On Reduct Construction Algorithms Yiyu Yao 1, Yan Zhao 1 and Jue Wang 2 1 Department of Computer Science, University of Regina Regina, Saskatchewan, Canada S4S 0A2 {yyao, yanzhao}@cs.uregina.ca 2 Laboratory

More information

DATA MINING II - 1DL460. Spring 2014"

DATA MINING II - 1DL460. Spring 2014 DATA MINING II - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

Core Membership Computation for Succinct Representations of Coalitional Games

Core Membership Computation for Succinct Representations of Coalitional Games Core Membership Computation for Succinct Representations of Coalitional Games Xi Alice Gao May 11, 2009 Abstract In this paper, I compare and contrast two formal results on the computational complexity

More information

Automata-Theoretic LTL Model Checking. Emptiness of Büchi Automata

Automata-Theoretic LTL Model Checking. Emptiness of Büchi Automata Automata-Theoretic LTL Model Checking Graph Algorithms for Software Model Checking (based on Arie Gurfinkel s csc2108 project) Automata-Theoretic LTL Model Checking p.1 Emptiness of Büchi Automata An automation

More information

A Method for Construction of Orthogonal Arrays 1

A Method for Construction of Orthogonal Arrays 1 Eighth International Workshop on Optimal Codes and Related Topics July 10-14, 2017, Sofia, Bulgaria pp. 49-54 A Method for Construction of Orthogonal Arrays 1 Iliya Bouyukliev iliyab@math.bas.bg Institute

More information

Parallel Model Checking of ω-automata

Parallel Model Checking of ω-automata Parallel Model Checking of ω-automata Vincent Bloemen Formal Methods and Tools, University of Twente v.bloemen@utwente.nl Abstract. Specifications for non-terminating reactive systems are described by

More information

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.

More information

6 Distributed data management I Hashing

6 Distributed data management I Hashing 6 Distributed data management I Hashing There are two major approaches for the management of data in distributed systems: hashing and caching. The hashing approach tries to minimize the use of communication

More information

AGG: A Graph Transformation Environment for Modeling and Validation of Software

AGG: A Graph Transformation Environment for Modeling and Validation of Software AGG: A Graph Transformation Environment for Modeling and Validation of Software Gabriele Taentzer Technische Universität Berlin, Germany gabi@cs.tu-berlin.de Abstract. AGG is a general development environment

More information

CIS 1.5 Course Objectives. a. Understand the concept of a program (i.e., a computer following a series of instructions)

CIS 1.5 Course Objectives. a. Understand the concept of a program (i.e., a computer following a series of instructions) By the end of this course, students should CIS 1.5 Course Objectives a. Understand the concept of a program (i.e., a computer following a series of instructions) b. Understand the concept of a variable

More information

Web Data Extraction and Alignment Tools: A Survey Pranali Nikam 1 Yogita Gote 2 Vidhya Ghogare 3 Jyothi Rapalli 4

Web Data Extraction and Alignment Tools: A Survey Pranali Nikam 1 Yogita Gote 2 Vidhya Ghogare 3 Jyothi Rapalli 4 IJSRD - International Journal for Scientific Research & Development Vol. 3, Issue 01, 2015 ISSN (online): 2321-0613 Web Data Extraction and Alignment Tools: A Survey Pranali Nikam 1 Yogita Gote 2 Vidhya

More information

I. INTRODUCTION. Fig Taxonomy of approaches to build specialized search engines, as shown in [80].

I. INTRODUCTION. Fig Taxonomy of approaches to build specialized search engines, as shown in [80]. Focus: Accustom To Crawl Web-Based Forums M.Nikhil 1, Mrs. A.Phani Sheetal 2 1 Student, Department of Computer Science, GITAM University, Hyderabad. 2 Assistant Professor, Department of Computer Science,

More information

Automatic Generation of Graph Models for Model Checking

Automatic Generation of Graph Models for Model Checking Automatic Generation of Graph Models for Model Checking E.J. Smulders University of Twente edwin.smulders@gmail.com ABSTRACT There exist many methods to prove the correctness of applications and verify

More information

Rough Set Approaches to Rule Induction from Incomplete Data

Rough Set Approaches to Rule Induction from Incomplete Data Proceedings of the IPMU'2004, the 10th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, Perugia, Italy, July 4 9, 2004, vol. 2, 923 930 Rough

More information

Consistency and Set Intersection

Consistency and Set Intersection Consistency and Set Intersection Yuanlin Zhang and Roland H.C. Yap National University of Singapore 3 Science Drive 2, Singapore {zhangyl,ryap}@comp.nus.edu.sg Abstract We propose a new framework to study

More information

New Optimal Load Allocation for Scheduling Divisible Data Grid Applications

New Optimal Load Allocation for Scheduling Divisible Data Grid Applications New Optimal Load Allocation for Scheduling Divisible Data Grid Applications M. Othman, M. Abdullah, H. Ibrahim, and S. Subramaniam Department of Communication Technology and Network, University Putra Malaysia,

More information