Decomposition-based Optimization of Reload Strategies in the World Wide Web


Dirk Kukulenz
Luebeck University, Institute of Information Systems, Ratzeburger Allee 160, 23538 Lübeck, Germany
kukulenz@ifis.uni-luebeck.de

Abstract. Web sites, Web pages and the data on pages are available only for specific periods of time and, from a client's point of view, are deleted afterwards. An important task in retrieving information from the Web is therefore to consider Web information in the course of time. Different strategies, such as push and pull strategies, may be applied for this task. Since push services are usually not available, pull strategies have to be conceived in order to optimize the retrieved information with respect to the age of the retrieved data and its completeness. In this article we present a new procedure to optimize data retrieved from Web pages by page decomposition. By deploying an automatic wrapper induction technique a page is decomposed into functional segments. Each segment is considered as an independent component for the analysis of the time behavior of the page. Based on this decomposition we present a new component-based download strategy. By applying this method to Web pages it is shown that, for a fraction of Web data, the freshness of retrieved data may be improved significantly compared to traditional methods.

1 Introduction

The information in the World Wide Web changes in the course of time. New Web sites appear in the Web, old sites are deleted. Pages in Web sites exist for specific periods of time. Data on pages are inserted, modified or deleted. There are important reasons to consider the information in the Web in the course of time. From a client's point of view, information that has been deleted from the Web is usually no longer accessible. Pieces of information like old news articles or (stock) prices may, however, still be of much value for a client. One conceivable task is the analysis of the evolution of specific information, such as a stock chart or the news coverage concerning a specific topic. A Web archive that mirrors the information in a specific Web area over a period of time may help a client to access information that is no longer available in the real Web. A different aspect concerning information changes in the Web is to consider information that appears in the future. Continuous queries in the Web may help a client to query future states of the Web, similar to triggers in a database context [11], [12].

There are different techniques available to realize such history- or future-based Web information analysis. In a push system a server actively provides a client with information; information changes on a server may directly trigger the notification of a passive client [10]. In a distributed, heterogeneous environment like the World Wide Web push services are difficult to realize and are usually not available. Pull systems, on the other hand, require an active client to fetch the information from the Web when it becomes available [7]. In contrast to push systems, in a pull system the respective tool is usually not informed about the times of information changes. The pull system has to apply strategies in order to optimize the retrieved information with respect to the staleness and the completeness of the information [4].

In this article we consider the problem of retrieving information from single Web pages that appears at unknown periods of time. By observing a Web page over a period of time we acquire certain aspects of the change characteristic of the page. This knowledge is used to optimize a strategy to access the information appearing on the page at future periods of time. The basic approach presented in this article is to decompose a Web page into segments. The change dynamics of whole Web pages is usually very complex. However, the change behavior of single segments is frequently relatively simple and the respective update patterns may easily be predicted, as is shown by examples in this article. We discuss different approaches to construct a segmentation of a Web page and motivate the use of wrapper induction techniques for page decomposition. Wrappers are tools to extract data from Web pages automatically. Recently, automatic wrapper induction techniques were introduced to learn a wrapper from a set of sample pages; the resulting wrapper is expressed by a common page grammar of the sample pages. We apply a modified wrapper induction process so that a wrapper is acquired from subsequent versions of the same page. Based on the resulting page grammar and the corresponding page segmentation technique we present a new reload strategy for information contained in Web pages. This new technique decreases the costs in terms of network traffic and improves the quality of the retrieved information in terms of the freshness and the completeness of the data.

The paper is organized as follows: After an overview of recent related research, the contribution of this article is described. Section 2 gives an introduction to the theoretical background of wrapper induction techniques and the applied model for the dynamic Web. Section 3 describes a framework to define page changes based on page segmentation. The main contribution of this article, a new reload optimization strategy based on page decomposition, is presented in sections 4 and 5. In section 6 the decomposition-based change prediction is applied to Web pages. Section 7 summarizes the results and describes further aspects.

1.1 Related research

The prediction of the times of information changes on a remote source plays an important role for diverse software systems like search engines, Web crawlers, Web caches and Web archives.

In these fields different prediction strategies have been presented. [1] gives an introduction to problems related to optimal page refresh in the context of search engine optimization. In [4] and [13] the problem of minimizing the average staleness of local copies of remote Web pages is considered in the context of Web crawler optimization. The main basic assumption is usually an independent and identical distribution of the time intervals between remote data changes; the series of update times is usually modeled by Poisson processes. In [3] an optimization of this approach is presented with respect to a reduction of the bias of the estimator. In a previous publication we considered the case that remote data change approximately deterministically and update times may be modeled by regular grammars [8]. The latter approach may only be applied to a fraction of Web data; the freshness of local copies may, however, be improved significantly. Similar questions are important in order to optimize continuous queries in the Web, i.e. standing queries that monitor specific Web pages [11], [12].

In the above publications a change of a Web page is usually defined as an arbitrary change in the HTML code of the page. However, new approaches in the field of automatic Web data extraction may be applied to develop more precise definitions of Web changes. In [9] an overview of common approaches to extract data from the Web is given. The article presents a taxonomy of Web wrapping techniques and identifies different groups of data extraction tools, e.g. wrapper induction and HTML-aware, natural language based and ontology based tools. A fully automatic approach in the field of HTML-aware wrapper induction is described in [5]. This technique is based on the assumption that there exist similar pages in the Web that may be regarded as created by the same grammar; the task is to estimate this common grammar based on page examples. Based on the knowledge of the page grammar, data may be extracted automatically and insight into the function of certain page components, e.g. lists, may be obtained. A further development of this approach accepting a more general class of HTML pages is presented in [2]. We apply a similar grammar inference approach in this work because successive page versions may frequently also be regarded as based on a single grammar.

1.2 Contribution

The main contribution of this article is a new strategy to reload remote data sources in the Web. We consider only HTML content; however, similar considerations may also apply to XML. In most previous publications about information changes in the Web a change is defined as an arbitrary modification of the respective HTML code of the page. In contrast, for a more precise change definition we propose a decomposition of HTML pages based on automatically induced wrappers. The automatic wrapper induction is based on methods that were recently presented in the literature [5]. An extension of the respective methods is considered in order to learn wrappers from example pages that appear at subsequent and unknown periods of time. Based on a decomposition of pages we propose a method to analyze changes of Web pages. The main aspect considered in this article is a new strategy for the reload of remote data sources. The presented method first performs a decomposition step and then predicts the times of remote changes for each component individually. We demonstrate that for a fraction of Web data this decomposition-based prediction may improve pull services in terms of the freshness of retrieved data.

2 Theoretical Background

2.1 Introduction to automatic wrapper induction

The theoretical background concerning data extraction from the Web by automatic wrapper induction is similar to previous publications in this field [6], [5], [2]. We consider the fraction of pages on the Web that are created automatically from data that are stored in a database or obtained from a different data source. The data are converted to HTML code and sent to a client. This conversion may be regarded as an encoding process, i.e. as the application of a grammar which produces HTML structures. We consider nested data types as the underlying data structure. Given a set of sample HTML pages belonging to the same class of pages, the task is to find the nested data type of the source data set that was used to create the HTML pages. In the wrapper application phase the task is to extract the respective source data instances from which the individual pages were generated. There exist different approaches for this task [5], [2]. In the following we summarize the approach in [5], which is used as a basis for the model applied in this work in section 2.2.

<html>
  <a href="weath.html"> #PCDATA </a>
  (<p><a href="conc.html">concert</a></p>)?
  <ul>
    (<li> <b> #PCDATA </b> (<p> #PCDATA </p>)+ </li>)+
  </ul>
</html>

Fig. 1. A common grammar of the pages in figure 2.

Nested data types may be modeled by union-free regular grammars [5]. Let #PCDATA be a special symbol and Σ an alphabet of symbols not containing #PCDATA. A union-free regular expression (UFRE) is a string over the alphabet Σ ∪ {#PCDATA, ·, +, ?, (, )} with the following restrictions: the empty string ε and the elements of Σ ∪ {#PCDATA} are UFREs; if a and b are UFREs, then a · b, (a)+ and (a)? are UFREs, where a · b denotes concatenation, (a)? denotes an optional pattern and (a)+ denotes an iteration. Further, (a)* is a shortcut for ((a)+)?. In [6] it is shown that the class of union-free regular expressions has a straightforward mapping to nested data types: the #PCDATA element models string fields, + models (possibly nested) lists and ? models nullable data fields. For a given UFRE σ the corresponding nested type τ = type(σ) may be constructed in linear time. UFREs cannot model every structure appearing in the Web; they are, however, suitable for a significant fraction of Web data.
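
As an illustration of this mapping, the following Python sketch (our own illustrative construction, not code from [5] or [6]; the class names and the nested_type function are invented for this example) builds the nested type for the grammar of figure 1:

# Illustrative sketch only: a UFRE represented as a small AST and the linear-time
# mapping type(sigma) to a nested data type.
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Pcdata:                      # #PCDATA, a string field
    pass

@dataclass
class Tag:                         # a fixed HTML token from the alphabet Sigma
    name: str

@dataclass
class Seq:                         # concatenation a . b . c ...
    parts: List["Ufre"]

@dataclass
class Plus:                        # (a)+, a non-empty list
    body: "Ufre"

@dataclass
class Hook:                        # (a)?, an optional element
    body: "Ufre"

Ufre = Union[Pcdata, Tag, Seq, Plus, Hook]

def nested_type(e):
    """type(sigma): #PCDATA -> string field, + -> list, ? -> optional,
    concatenation -> tuple of fields; fixed tags carry no data."""
    if isinstance(e, Pcdata):
        return "str"
    if isinstance(e, Tag):
        return None
    if isinstance(e, Seq):
        fields = [t for t in map(nested_type, e.parts) if t is not None]
        return fields[0] if len(fields) == 1 else tuple(fields)
    if isinstance(e, Plus):
        return ("list", nested_type(e.body))
    if isinstance(e, Hook):
        return ("optional", nested_type(e.body))

# The grammar of figure 1: a temperature string, an optional concert element and a
# list of sections, each with a title and a non-empty list of news paragraphs.
fig1 = Seq([Tag("<html>"), Tag("<a>"), Pcdata(), Tag("</a>"),
            Hook(Tag("<p>concert</p>")),
            Tag("<ul>"),
            Plus(Seq([Tag("<li>"), Tag("<b>"), Pcdata(), Tag("</b>"),
                      Plus(Seq([Tag("<p>"), Pcdata(), Tag("</p>")])),
                      Tag("</li>")])),
            Tag("</ul>"), Tag("</html>")])

print(nested_type(fig1))
# -> ('str', ('optional', None), ('list', ('str', ('list', 'str'))))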

Let p_1, p_2, ..., p_n be a set of HTML strings that correspond to encodings of a source data set d_1, d_2, ..., d_n of a nested type τ. It is shown in [5] that the type τ may be estimated by inferring the minimal UFRE σ whose language L(σ) contains the encoded strings p_1, p_2, ..., p_n. In [5] a containment relationship is defined as σ_a ≤ σ_b iff L(σ_a) ⊆ L(σ_b). The (optimal) UFRE describing the input strings is the least upper bound of the input strings, and [5] reduces this problem to the problem of finding the least upper bound of two UFREs. The grammar in figure 1 describes, for example, the pages in figure 2.

p(a_1):
 1: <html>
 2: <a href="weath.html">temp: 14C</a>
 3: <p><a href="conc.html">concert</a></p>
 4: <ul>
 5: <li><b>world</b>
 6: <p>Togolese vote count under way</p>
 7: </li>
 8: <li><b>business</b>
 9: <p>A mix it up Monday</p>
10: </li>
11: </ul>
12: </html>

p(a_2):
 1: <html>
 2: <a href="weath.html">temp: 18C</a>
 3: <ul>
 4: <li><b>world</b>
 5: <p>Space crew returns</p>
 6: <p>Markey: Energy off base</p>
 7: </li>
 8: <li><b>business</b>
 9: <p>Stocks wait on rates?</p>
10: </li>
11: <li><b>sports</b>
12: <p>Burns: Fast Breaks</p>
13: <p>Day two NFL</p>
14: </li>
15: </ul>
16: </html>

Fig. 2. The HTML sources of a Web page at two different points in time a_1 and a_2 with different kinds of changes like insertions, deletions and modifications.

2.2 A model for data changes in the Web

In contrast to the data model in the previous section, where structurally similar pages appear at the same time, in this article we consider page versions appearing at different points in time. For this purpose a time stamp is attached to every page version. Let u_i ∈ R^+ denote the point in time at which the i-th update of a page occurs, where 0 ≤ u_1 ≤ u_2 ≤ ... ≤ u_n ≤ T ∈ R^+, n ∈ N. The time span between the (i−1)-st and the i-th update is denoted by t_i := u_i − u_{i−1}, i ∈ N; this is the lifetime of a page version. The different page versions are denoted by p(u_1), p(u_2), ..., p(u_n). Let a_1, a_2, ..., a_m ∈ R^+ denote the points in time at which reload operations of the remote source are executed, where 0 ≤ a_1 ≤ a_2 ≤ ... ≤ a_m ≤ T. The set of local copies of remote page versions q(a_1), q(a_2), ... is obviously a subset of the remote page versions. For t ∈ R^+ let N^u(t) denote the largest index of an element in the sequence u that is smaller than t, i.e. N^u(t) := max{n | u_n ≤ t}. Let A^u(t) ∈ R^+ denote the length of the time interval since the last update, i.e. A^u(t) := t − u_{N^u(t)}. If t is the time of a reload (t = a_i for some i ≤ m), we call A^u(t) the age of q(a_i). The age of a local copy denotes how much time has passed since the last remote data update and thus how long an old copy of the data was stored although a new version should have been considered. Finding an optimal reload strategy means that after each update of the remote data source the data should be reloaded as soon as possible, i.e. the sum of ages, sumage := Σ_{i=1}^{m} A^u(a_i), has to be minimal. At the same time the number of reloads should be as small as possible and no change of the data source should remain unobserved. The number of lost (unobserved) data objects is denoted as loss in the experiments.
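
To make these quality measures concrete, the following sketch (an illustrative assumption, not code from the paper; the function names are invented) computes age, sumage and loss for given sequences of update times u and reload times a:

# Sketch: computing the quality measures of section 2.2 for given update times u
# and reload times a.
import bisect

def age(u, t):
    """A^u(t) = t - u_{N^u(t)}: time elapsed since the last update before t."""
    n = bisect.bisect_right(u, t)          # N^u(t), number of updates with u_i <= t
    return t - u[n - 1] if n > 0 else 0.0

def sumage(u, a):
    """sumage = sum of ages over all reload times a_i."""
    return sum(age(u, t) for t in a)

def loss(u, a):
    """Number of updates that were never observed, i.e. overwritten by the next
    update before any reload happened in between."""
    lost = 0
    for i in range(len(u) - 1):
        if not any(u[i] <= t < u[i + 1] for t in a):
            lost += 1
    return lost

u = [100, 200, 300, 400]                    # remote update times (seconds)
a = [110, 205, 310, 405]                    # reload times of the pull client
print(sumage(u, a), loss(u, a))             # -> 30.0 0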

One question considered in the following is the prediction of remote update times. For this purpose we consider the special case that remote updates of page components are performed after deterministic time intervals. Let Q := {t_j | j ∈ N, j ≤ n} denote the set of time intervals between updates of a page component. We assign a symbol s_i, i ≤ n, to every element of Q and call the set of symbols {s_i | i ≤ n} the alphabet of the sequence (u_j), u_j ≤ T. (Due to the length of the sampling interval and due to network delays, intervals between updates registered at the client side are distorted; the symbols are used to group registered intervals that are assumed to result from identical update intervals on the server side.) Let S denote a starting symbol, let r_1, r_2, ..., r_n denote terminals and R_1, R_2, ..., R_n non-terminals. In the following we refer to a regular grammar Γ corresponding to the non-deterministic finite automaton in figure 3 as a cyclic regular grammar [8]. In figure 3, R_0 is a starting state which leads to any of the n states R_1, ..., R_n. After this, the list of symbols is accepted in a cyclic way, and every state is an accepting state. To abbreviate this definition we use the notation (r_1 r_2 ... r_n) := Γ.

Fig. 3. Nondeterministic automaton corresponding to the grammar (r_1 r_2 ... r_n).

One problem is to find an optimal sequence a_1, a_2, ..., a_m in order to capture most of the data changes necessary to learn the wrapper, which may then be applied for a page decomposition as described in section 3. The set of local page versions q(a_1), q(a_2), ..., q(a_m) is the input set of positive examples for the wrapper induction task. The basic problem concerning the wrapper induction is to find a basic schema for the page versions that appear at the considered URL. In [5] it is shown that the considered language is not only identifiable in the limit but that a rich set of examples is sufficient to learn the grammar, and this set of examples has a high probability of being found in a number of randomly chosen HTML pages. In this article we do not consider this problem in detail and assume in the experiments that a correct wrapper has been learned from previous page versions.
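
A minimal sketch of such a cyclic regular grammar, under our own assumptions (the class name and the representation of symbol durations are not taken from [8]), predicts the time of the next remote update given the current state of the automaton:

# Sketch: a cyclic regular grammar (r_1 r_2 ... r_n) over interval symbols,
# used to predict the next update interval of a component.
class CyclicGrammar:
    def __init__(self, symbols, durations):
        self.symbols = symbols        # e.g. ['a', 'a', 'b']
        self.durations = durations    # representative interval length per symbol, seconds

    def next_state(self, state):
        """States R_1..R_n are visited cyclically; every state is accepting."""
        return (state + 1) % len(self.symbols)

    def predict_next_update(self, state, last_update_time):
        """Expected time of the next remote update, given the current phase."""
        sym = self.symbols[self.next_state(state)]
        return last_update_time + self.durations[sym]

# A component updated every 3600 s, except for one 7200 s gap per cycle.
g = CyclicGrammar(['a', 'a', 'b'], {'a': 3600.0, 'b': 7200.0})
print(g.predict_next_update(state=1, last_update_time=1000.0))   # -> 8200.0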

3 A definition of page changes

3.1 Motivation

The question considered in this section is how to describe changes in a Web page. An appropriate definition is necessary in order to acquire an understanding of the nature of changes. If, for example, a page provides a certain stock price every minute and in addition provides news related to the respective company which appear about three times a day, it may be helpful to separate the two change dynamics in order to optimize e.g. a pull client.

3.2 Decomposition strategies

Different concepts for a segmentation of pages are conceivable that are suitable for a change analysis of components. One approach is to consider the DOM (document object model) tree of an HTML page; changes may then be attached to nodes (HTML tags) in this tree. The problem with this strategy is that HTML tags are in general only layout directives, and there may be additional structures that are not apparent from the HTML tag structure alone, as e.g. in the case of the inner list in figure 2, p(a_2). A method to acquire additional information about the page structure is to consider a set of similar pages as described in section 2 and to determine their common grammar. This grammar may be used to decompose a page into segments as described below and to attach changes in a page to these segments.

3.3 Wrapper-based change definition

Fig. 4. Abstract syntax tree (AST) corresponding to the pages in figure 2. AND-nodes refer to tuples, PLUS-nodes refer to lists and HOOK-nodes refer to optional elements. Boxes framing AST nodes denote time components (i.e. groups of nodes with a similar change characteristic); these groups constitute the nodes of the TCT tree. The nodes at the first level of the AST are labeled C1 to C7; deeper nodes are labeled C41, C61 to C65 and C641.

The decomposition-based time analysis is based on the estimated page grammar acquired from a sample set of pages as described in section 2. One notation for the page grammar is an abstract syntax tree (AST), shown in figure 4 for the pages in figure 2, which corresponds to the grammar in figure 1. By merging adjacent nodes or subtrees in the AST that have the same change dynamics, the AST may be transformed into the time-component-type tree (TCT tree). The TCT tree, illustrated by the boxes in figure 4, shows the different time components of the AST. A time component is a node (or a set of adjacent nodes in the AST that have the same time characteristic) together with the respective subtrees. A change of a component requires a change in the node or the attached subtrees. Components that actually change in time are marked by thick rectangles in figure 4.

Based on the TCT tree, different strategies to define page changes are conceivable. The any-difference method described above, which considers an arbitrary change in the HTML source of a page, is equivalent to considering only the root node of the TCT tree. More specific methods consider deeper levels of the TCT tree up to a certain depth. It is also conceivable that only specific nodes or subtrees of the TCT tree are considered, chosen e.g. manually by a user. In order to analyze the change dynamics of a page over time it may not be sufficient to consider the TCT tree, because a node in the TCT tree may correspond to a number of instances of the respective type in real page instances. We refer to the respective instance tree as the time-component-instance tree (TCI tree). Now we may finally define changes of a Web page: a (single) change c is a tuple c = (node_TCI, t), where node_TCI is a node of the TCI tree and t is the time of the change. This change definition depends on the definition of the TCI tree, which itself is based on the decomposition by the grammar. The possible causes of changes depend on the position of the respective node in the TCI tree, as described in the example below. A change pattern is a set of time series, each of which is connected to a node in the considered TCI tree. As an example, consider the Web page in figure 2. If the page change function is defined to consider the first level of the TCT tree (including levels 0 and 1), the TCT tree has 7 nodes at level 1, similar to the corresponding TCI tree. Three of these nodes (C2, C4 and C6) actually change in the course of time. A change is detected if the string in element C2 is changed, if the optional element C4 appears, is modified or deleted, or if a change occurs in the list element C6, i.e. a list element is inserted, modified or deleted.
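
The following sketch illustrates this change definition under our own assumptions (the extract function stands for an induced wrapper and the example values are hypothetical); it compares the component values of two page versions and emits change tuples c = (node, t):

# Sketch: detecting component changes between two page versions. `extract` stands
# for an induced wrapper that maps a page to a dictionary {component_id: value};
# values of list components are tuples.
def detect_changes(extract, previous_html, current_html, t):
    """Return change tuples (component_id, t) for all components whose extracted
    content differs between the two page versions."""
    old = extract(previous_html)
    new = extract(current_html)
    changed = []
    for node in set(old) | set(new):
        if old.get(node) != new.get(node):   # value changed, appeared or disappeared
            changed.append((node, t))
    return changed

# Hypothetical wrapper output for the pages of figure 2:
def extract_fig2_example(html):
    if "14C" in html:
        return {"C2": "temp: 14C", "C4": "concert",
                "C6": (("world", 1), ("business", 1))}
    return {"C2": "temp: 18C",
            "C6": (("world", 2), ("business", 1), ("sports", 2))}

print(detect_changes(extract_fig2_example, "... 14C ...", "... 18C ...", t=3600))
# -> changes in C2, C4 and C6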

4 Component-based update estimation

In this article we consider the special case that components of a specific Web page are updated deterministically. In order to model the update characteristics of each component we apply a cyclic regular grammar (fig. 3). In [8] a method to learn similar update characteristics of entire Web pages is presented; in this section we present a method to apply the respective algorithm to pages where not entire pages but page components change deterministically. The input of this Component-Update-Estimation algorithm is the target URL of the considered page and a wrapper for the respective page. In this algorithm the data source is reloaded after constant periods of time. In each cycle the data contained in the page components are extracted by applying the wrapper. Based on the current and the previous data vector, changes occurring in page components are registered. Based on the detected changes of components, the intervals between updates of specific components may be estimated (due to the finite sampling interval length, the interval length between updates is only an estimate). The intervals are used to estimate symbols by a clustering process (section 2.2) [8]. Based on the sequences of symbols associated with the components, the cyclic regular grammars (section 2.2) may be estimated. A termination criterion concerning the grammar estimation is applied to mark the grammar of a component as determined; this criterion may e.g. consider the number of times a newly detected symbol has also been predicted correctly by the current grammar estimate. Finally, after the detection of the cyclic regular grammar of each component, the respective grammars are stored in a vector, which is denoted as timegrammarvector in the following.
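
A simplified sketch of this estimation stage, under our own assumptions (fetch and extract are hypothetical callables, and the symbol assignment is a crude stand-in for the clustering step of section 2.2), is shown below:

# Sketch: constant-rate sampling that records, per component, the time intervals
# between detected content changes, and a simple interval-to-symbol assignment.
import time
from collections import defaultdict

def estimate_update_intervals(fetch, extract, url, sampling_interval, duration):
    """Reload the page at a constant rate and record, per component, the intervals
    between detected content changes."""
    intervals = defaultdict(list)
    last_change = {}
    previous = extract(fetch(url))
    start = time.time()
    while time.time() - start < duration:
        time.sleep(sampling_interval)
        current = extract(fetch(url))
        now = time.time()
        for node in set(previous) | set(current):
            if previous.get(node) != current.get(node):
                if node in last_change:
                    intervals[node].append(now - last_change[node])
                last_change[node] = now
        previous = current
    return intervals

def to_symbols(intervals, tolerance):
    """Map each registered interval to a symbol by grouping intervals whose lengths
    differ by less than `tolerance` (a crude stand-in for clustering)."""
    symbols, representatives = [], []
    for length in intervals:
        for i, rep in enumerate(representatives):
            if abs(length - rep) < tolerance:
                symbols.append(chr(ord('a') + i))
                break
        else:
            representatives.append(length)
            symbols.append(chr(ord('a') + len(representatives) - 1))
    return symbols

print(to_symbols([98, 101, 201, 99, 199], tolerance=10))   # -> ['a', 'a', 'b', 'a', 'b']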

5 Download optimization

The new download optimization strategy may now be described by the algorithm Component-Based-Reload in figure 5.

Component-Based-Reload(wrapper, timegrammarVector, URL)
 1  set previous content vector = ∅
 2  reload source(URL)
 3  extract current content vector (wrapper)
 4  copy current content vector to previous content vector
 5  while there exists a component whose phase is not detected
 6      reload source(URL)
 7      extract current content vector (wrapper)
 8      for each component: compare previous and current content
 9      for each component: extract symbols (timegrammarVector)
10      for each component: match phase (timegrammarVector)
11      if phase of component j is determined
12          mark phase of component j as determined
13          start download thread for component j (timegrammarVector)
14      wait(Δt)
15      copy current content vector to previous content vector

Fig. 5. The Component-Based-Reload algorithm. For each component an independent reload thread is started after the phase detection.

The main aspect of the algorithm is to determine the different phases of the components, where the term phase denotes the current symbol of the update characteristic of a component (fig. 3) and, roughly, the position in the respective interval. If, for example, the cyclic regular grammar of a component is (ababc) and we register a symbol a, the current state of the respective automaton is ambiguous; a sequence of successive symbols has to be considered in order to disambiguate the current state (symbol) of a component. In the algorithm in figure 5 the remote source is reloaded frequently from the considered URL (steps 2, 6). The data contained in the respective components are extracted by applying the wrapper (steps 3, 7) and the contents are compared to the contents of the previous component versions (step 8). By this method current symbols may be extracted and compared to the respective grammar in the time-grammar vector (obtained from the grammar estimation algorithm) until the current symbol is unique; these steps are performed in steps 9 and 10 of the algorithm. If the phase of a component has been detected, in step 11 a download thread is started for this component that predicts further symbols of the respective component and performs reload operations. In particular, in this reload strategy the remote source is loaded shortly before the expected remote change (as provided by the cyclic regular grammar of the component) and then reloaded with a high frequency until the change has been detected [8]. By this method a feedback is acquired that is necessary to compensate for deviations of the registered update times of a remote source due to server and network delays. After the phase detection, a number of reload threads corresponding to the number of page components is running.
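
The phase detection and the component-based reload scheduling might be sketched as follows (our own illustrative assumptions, not the original implementation; the grammar is represented as a plain list of symbols with representative interval lengths):

# Sketch: disambiguating the phase of a component against its cyclic regular
# grammar and scheduling the next reload shortly before the expected change.
def possible_phases(grammar, observed):
    """All start positions in the cyclic symbol sequence that are consistent with
    the observed symbols; the phase is detected once exactly one position remains."""
    n = len(grammar)
    return [p for p in range(n)
            if all(grammar[(p + k) % n] == s for k, s in enumerate(observed))]

def next_reload_time(grammar, durations, phase, last_change, safety_margin):
    """Reload shortly before the expected next remote change of the component."""
    next_symbol = grammar[(phase + 1) % len(grammar)]
    return last_change + durations[next_symbol] - safety_margin

grammar = ['a', 'b', 'a', 'b', 'c']                  # the (ababc) example from section 5
print(possible_phases(grammar, ['a']))               # -> [0, 2]: still ambiguous
print(possible_phases(grammar, ['a', 'b', 'c']))     # -> [2]: phase detected
p = possible_phases(grammar, ['a', 'b', 'c'])[0]
current = (p + 3 - 1) % len(grammar)                 # position of the last observed symbol
durations = {'a': 60.0, 'b': 60.0, 'c': 300.0}
print(next_reload_time(grammar, durations, phase=current, last_change=1000.0, safety_margin=5.0))
# -> 1055.0: reload 5 s before the expected change, then poll at a high rate until it appears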

6 Experiments

6.1 Experimental setup

The experiment consists of two basic stages. In the estimation stage the change dynamics of a Web page is estimated by observing the page over a period of time. This stage consists of four phases that are performed successively.

Phase 1: The basic reload frequency for the wrapper induction and the subsequent sampling processes is determined [8].
Phase 2: The wrapper is induced based on successive versions of the page. The estimated wrapper develops in the course of time; a heuristic criterion is applied to terminate the induction process.
Phase 3: Based on the wrapper and the respective parser, changes on the page are examined over a period of time. The result is a vector of component change series indicating at which times a specific component changed in the considered period of time.
Phase 4: The change behavior of the page detected in the previous step is examined. The result is a vector of update grammars of the different components (section 4).

In the second, application stage the knowledge about the page dynamics acquired in the previous stage is applied for the extraction of data in the course of time.

Phase 5: The data extraction algorithm considers each component independently. The phase of the update pattern of each component has to be detected (section 5).
Phase 6: Data are extracted by component-based reload requests.

6.2 Application examples

In the experiments we apply artificial and real pages in order to demonstrate the main benefits of the new method. To measure the quality of a reload strategy we consider the costs in terms of the number of reload operations (causing network traffic etc.), the number of lost data objects and the age of components (section 2.2). The steps of the procedure in section 6.1 are applied successively to a page and the respective results are demonstrated.

In a first example we consider the page http://www.sat.dundee.ac.uk/pdus.html. This page contains links to images of different geostationary satellites. After the wrapper induction the final TCI graph has 117 leaf nodes, all of which are related to links to specific satellite images on the page. Figures 6 and 7 show the result of step 3 of the procedure.

Fig. 6. Detected changes of the Web page http://www.sat.dundee.ac.uk/pdus.html by the any-change method over a period of 2 days and 7 hours (time axis in seconds).

Fig. 7. Change dynamics of sample components (c-25, c-26, c-38, c-54, c-55) of the Web page http://www.sat.dundee.ac.uk/pdus.html (time axis in seconds).

The any-change update detection in figure 6 shows a complex change pattern. After a grammar-based decomposition, however, different components in figure 7 show simple and deterministic update characteristics, and the underlying process used to generate the page automatically may easily be understood. The change patterns obtained from step 3 of the experiment may be used to estimate the grammar vector in step 4 as described in section 4. The grammar vector is then used for the multi-phase detection (phase 5).

As a visualization of phase 6 of the procedure, figure 8 shows the application of different reload strategies. In this experiment we consider, for reasons of simplicity, only a fraction of the entire page, namely the components c-38 and c-26 of figure 7. The superposition of the update patterns of these components, depicted at the top of figure 8, reveals very closely spaced update operations. Because of the finite resolution of the estimation process, due to network delays etc., the superposed update pattern may not be regarded as a simple deterministic pattern. A quasi-deterministic prediction as presented in [8] may therefore not be applied, and only common constant-frequency sampling methods are applicable, as depicted in the second graph ("freq.") in the center of figure 8. After a grammar-based decomposition in phase 3 of the procedure, the simple update characteristics of the components shown in figure 7 (c-38 and c-26) are revealed. After this decomposition step the different update characteristics of the components may be estimated and applied for the prediction of future updates of the remote source, as shown in the third graph ("dec.") of figure 8, which depicts the reload operations triggered for the different components.

Fig. 8. Visualization of different reload strategies (time axis in seconds). The first graph ("updates") shows the original updates. The second graph ("freq.") shows reloads applying a constant reload frequency. The third graph ("dec.") shows a decomposition-based reload strategy where the data source consists of two components.

In contrast to the constant-frequency reload strategy illustrated in the center of figure 8 ("freq."), it may be observed that reload operations are performed close to the points in time of the remote update operations.

We present a second example to demonstrate the applicability of a grammar-based decomposition in order to obtain a clear presentation of the update characteristics of a page. In this example an artificial page is considered that consists of two components. Each component is updated after a constant period of time (figure 10); the two update intervals are, however, slightly different (100 and 90 seconds). The superposition of the update patterns of the components as registered by the any-change difference measure is shown in figure 9. Although the basic scenario is simple, the update characteristics of the page are quite complex and no regularities in the update pattern may be observed (regularities may obviously be observed at a larger time scale; however, the length of the periods may become arbitrarily large for more complex examples). A prediction of future update times is hardly possible. After a grammar-based decomposition the basic update characteristics of the components are easily revealed (figure 10).

Fig. 9. Detected changes by the any-change method in the second example (time axis in seconds). No regularity in the change pattern may be observed.

Fig. 10. Analysis of the change dynamics of the components c-1 and c-2 in the second experiment (time axis in seconds).

The decomposition-based reload is performed similarly to the first example (figure 8). Table 1 shows numeric results for the experiments described above.

                               downloads   loss   sumage (seconds)
experiment 1   constant method       393      0              46157
               decomp. method        393      0               2903
experiment 2   constant method       654      0                130
               decomp. method        654      0                 89

Table 1. Comparison of constant-frequency sampling and the decomposition-based reload strategy.

In order to compare the methods, the costs, i.e. the number of downloads triggered by the respective download strategy, are fixed. Since the quality parameters may differ between components (if the decomposition-based method is applied), the values in table 1 constitute mean values with respect to all components. The table shows that the values for lost information are similar, whereas the age of the data may be reduced significantly by the decomposition-based reload strategy.

7 Conclusion

In this article we presented a new reload strategy for Web information that is based on a decomposition of pages into functional segments.

For the segmentation we applied automatic wrapper induction techniques; successive versions of a Web page are used as sample pages for the wrapper induction process. Using artificial and real examples we showed that the quality of the retrieved information may be improved significantly compared to traditional (constant-frequency sampling) techniques. The (deterministic) change prediction based on page decomposition presented in this article may be applied only to a fraction of Web pages. If page components change stochastically, further optimization strategies have to be developed; however, also in this case page decomposition may reveal new optimization strategies for client-side data retrieval tools. A further research aspect is to achieve a higher degree of automation, e.g. if different kinds of deterministic and statistical change characteristics are involved on a single page.

References

1. A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan. Searching the Web. ACM Trans. Inter. Tech., 1(1):2-43, 2001.
2. A. Arasu and H. Garcia-Molina. Extracting structured data from Web pages. In SIGMOD '03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 337-348, New York, NY, USA, 2003. ACM Press.
3. J. Cho and H. Garcia-Molina. Estimating frequency of change. ACM Trans. Inter. Tech., 3(3):256-290, 2003.
4. E. Coffman, Z. Liu, and R. R. Weber. Optimal robot scheduling for Web search engines. Journal of Scheduling, 1(1):15-29, June 1998.
5. V. Crescenzi and G. Mecca. Automatic information extraction from large websites. J. ACM, 51(5):731-779, 2004.
6. S. Grumbach and G. Mecca. In search of the lost schema. In ICDT '99: Proc. of the 7th Int. Conf. on Database Theory, pages 314-331, London, UK, 1999. Springer-Verlag.
7. J. E. Kendall and K. E. Kendall. Information delivery systems: an exploration of Web pull and push technologies. Commun. AIS, 1(4es):1-43, 1999.
8. D. Kukulenz. Capturing Web dynamics by regular approximation. In X. Zhou et al., editor, WISE '04, Web Information Systems, LNCS 3306, pages 528-540. Springer-Verlag, Berlin Heidelberg, 2004.
9. A. Laender, B. Ribeiro-Neto, A. Silva, and J. Teixeira. A brief survey of Web data extraction tools. SIGMOD Record, June 2002.
10. C. Olston and J. Widom. Best-effort cache synchronization with source cooperation. In Proceedings of SIGMOD, pages 73-84, May 2002.
11. S. Pandey, K. Ramamritham, and S. Chakrabarti. Monitoring the dynamic Web to respond to continuous queries. In WWW '03: Proc. of the 12th Int. Conf. on World Wide Web, pages 659-668, New York, NY, USA, 2003. ACM Press.
12. M. A. Sharaf, A. Labrinidis, P. K. Chrysanthis, and K. Pruhs. Freshness-aware scheduling of continuous queries in the dynamic Web. In 8th Int. Workshop on the Web and Databases (WebDB 2005), Baltimore, Maryland, pages 73-78, 2005.
13. J. L. Wolf, M. S. Squillante, P. S. Yu, J. Sethuraman, and L. Ozsen. Optimal crawling strategies for Web search engines. In Proceedings of the Eleventh International Conference on World Wide Web, pages 136-147. ACM Press, 2002.