DeepLibrary: Wrapper Library for DeepDesign

Size: px

Start display at page:

Download "DeepLibrary: Wrapper Library for DeepDesign"

Melvyn McKinney
6 years ago
Views:

Research Collection Master Thesis DeepLibrary: Wrapper Library for DeepDesign Author(s): Ebbe, Jan Publication Date: 2016 Permanent Link: https://doi.org/10.

1 Research Collection Master Thesis DeepLibrary: Wrapper Library for DeepDesign Author(s): Ebbe, Jan Publication Date: 2016 Permanent Link: Rights / License: In Copyright - Non-Commercial Use Permitted This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use. ETH Library

2 DeepLibrary: Wrapper Library for DeepDesign Master Thesis Jan Ebbe Prof. Dr. Moira C. Norrie Alfonso Murolo Global Information Systems Group Institute of Information Systems Department of Computer Science ETH Zurich 28th April 2016

4 Abstract The data extraction community proposed many solutions to extract records from web pages. Many of those solutions use the concept of wrappers to encapsulate the extraction rules from the application itself. Most of them focus on data extraction using wrappers, but neglect to specify how to store, verify and maintain those wrappers. The ones that actually specify how to maintain a wrapper often describe approaches that fail if a website performs large changes on its template. We present a novel approach on how to handle wrappers for content extraction. We describe an efficient way how to store wrappers, that allows to automatically re-execute, verify and maintain them whenever the template of the underlying website changes. To build this novel approach we analyzed historical template changes of websites and compared our results against existing approaches from previous work, obtaining very promising results. The described approach has been implemented as a wrapper-management library component for DeepDesign [16], which is a data extraction tool developed by the GlobIS 1 research group. The implemented wrapper-management library component is called DeepLibrary. 1 Global Information Systems Group, ETH Zurich iii

5 iv

6 Contents 1 Introduction Contributions Background Wrapper Induction Wrapper Verification Wrapper Maintenance DeepDesign Approach Wrapper Structure Example Structure Annotation Structure Url Container Structure Wrapper Induction CSS Rule Selection Text Nodes Wrapper Verification Visual Similarity Wrapper Maintenance Ensemble of Classifiers Tag Name Classifier Siblings Classifier Text Content Classifier XPath Classifier Visual Classifier Content Extraction v

7 vi CONTENTS 4 Implementation & Architecture System Architecture Client-Side Server-Side Interfaces Wrapper Representation Wrapper Creation CSS Rules Extraction Wrapper Verification Wrapper Maintenance Server-Side Maintenance Implementation Wrapper Suggestion Wrapper Lookup Dialog System DeepDesign Results Export Evaluation Data Sets Classifier Evaluation Text Content Classifier XPath Classifier Visual Classifier Tag Name Classifier Siblings Classifier Backtest Test Task Based Comparison Single Annotation Based Record Based Runtime Comparison Adaptability Test Conclusion & Future Work 51

8 CONTENTS vii A Training Data Set 53 B Test Data Set 57 C Lerman et al. Data Set 61 D Adaptability Data Set 63

9 viii CONTENTS

10 1 Introduction Although the web was designed as a source of information for human use, its data can be retrieved and processed automatically using content extraction tools. Those tools transform the semi-structured HTML formatted data on the web into structured data that then can be further processed. Commercial applications, for example, use the extracted data to closely follow competitor pricing changes and current product trends in their market. Not all of the webs information is presented in an uniform way. This makes the task of information extraction very challenging, since different web sites require different rules for information extraction. Also the information and its representation isn t static. This means that extraction rules may need to be updated as soon as the template of their corresponding website changes. This implies the challenge to verify whether an extraction rule is extracting the correct data. There have been many approaches presented that try to tackle those challenges. Chang et al. [4] proposed dimensions for comparing those approaches. The dimension mostly used in this thesis is the automation degree of a approach. Some approaches are completely unsupervised and do not relay on labeled training examples to generate a wrapper. Others are semi-supervised or totally supervised approaches which often require one or more labeled web pages with examples of the data to be extracted. Many approaches fail to extract the correct information after a website changes much of its template and are not able to update the extraction rules correctly. This is due to the fact that the extraction rules used by those approaches often only focus on a few features (e.g. text content, DOM subtree, HTML tag) but ignore the others. To solve this challenge, the approach presented in this paper uses five different features to identify the correct information in a changed website. In addition to HTML tags, XPaths, siblings and text content, this approach also processes visual features. These features allow to maintain a wrapper more reliable compared to other approaches, as our evaluation has shown. Our approach has been implemented as a wrapper-management library called DeepLibrary in form of a Chrome browser extension. 1

11 CONTRIBUTIONS The rest of this thesis is organized as follows. We review some of the related work in chapter 2. Our approach is presented in chapter 3. Chapter 4 contains architecture and implementation details of DeepLibrary. Our approach is evaluated based on two existing approaches by other authors in chapter 5. Chapter 6 contains a conclusion and future work. 1.1 Contributions The main contributions of this thesis are the following: Wrapper-Management Library In this thesis, we present a system that is able to generate, validate and maintain wrappers. To do so, we introduce a data structure that stores wrappers in Section 3.1. Our approach on wrapper induction is presented in Section 3.2, the approach on wrapper verification in Section 3.3 and the approach on wrapper maintenance in Section 3.4. All approaches have been implemented in a wrapper-management library called DeepLibrary, presented in Chapter 4. Evaluation & Comparison This thesis contains an evaluation of DeepLibrary in Chapter 5. We evaluate the runtime and task based performance of its wrapper maintenance process using multiple data sets containing snapshots of websites from 2006 to We also compare the runtime and task based performance of DeepLibrary to other approaches. For this comparison, we use the approaches presented by Ferrara et al. [7] and Lerman et al [11]. Additionally, we evaluate the adaptability of wrappers generated by DeepLibrary. To do so, we test if a wrapper generated from a specific web page of a website can also be used for other web pages of the same website. Tag Name Change Likelihoods We introduce a naive-bayes classifier in Section 3.4.2, which is able to predict the likelihood of a tag name A to change into a tag name B after a template change of a website. This is useful to maintain a annotation, since we know the tag name of the annotated element. If we can t find this annotated element anymore in a web page that changed its template, we can use this classifier to identify tag names of candidate elements. The likelihoods used by this classifier can be presented visually as a matrix. Such a matrix is shown in Figure 3.6. Since those likelihoods were deviated from a specific data set, the resulting matrix can be seen as a signature of this data. This signature might look different depending on the age and type of the used websites in the data set.

12 2 Background Web content extraction tools often use wrappers to encapsulate extraction rules from the application itself. The extraction rules are responsible for extracting the right records from a web page. Depending on the complexity of the extraction tools, the extraction rules can be simple regular expressions or entire program code blocks that are executed whenever something needs to be extracted. A wrapper contains extraction rules for a specific purpose, for example to extract product names and prices from a web shop. Also a wrapper can only extract data from a specific set of web pages. Figure 2.1 shows the typical life-cycle of a wrapper. The process, in which a wrapper is generated is called wrapper induction. Whenever a wrapper should be used, the wrapper verification process checks if the wrapper is able to extract to correct information. If the wrapper is correctly working, it can be executed. After its execution, the correct information should have been extracted. Should the wrapper verification process fail, we need to repair this wrapper. This can be done by updating its extraction rules. Usually this happens when a website changes its template. We call this process wrapper maintenance. After successfully performing wrapper maintenance, the wrapper should be able to extract the correct information. 2.1 Wrapper Induction Wrapper Induction or Wrapper Generation describe the process in which a new wrapper with its extraction rules is generated. This process can be done fully automatic as Lerman et al. [11] proposed. In their approach they learn the extraction rules automatically from training examples. Each extraction rule is described as the pattern of the record which they want to extract. The extraction rule for a address would look like this: <Number Upper Upper>, which describes a string that starts with a number followed by two terms that begin with an 3

13 WRAPPER INDUCTION Figure 2.1: The life-cycle of a wrapper uppercase letter. Any string like 9478 River Road in a web page would be extracted by this rule as an address. RoadRunner [6] is another fully automatic approach that requires a set of similar web pages from the same website as input. Those web pages are then analyzed and divided into two parts: Static content which is assumed to be the same on each page and dynamic content which will differ. The static and dynamic content is then used by a classifier to infer a wrapper. The generated wrapper is an union-free regular expression, which describes the structure of the pages used as input. The to be extracted data is marked as parsed character data. In addition to the fully automatic approaches there are also semiautomatic ones. Often this approaches offer tools that assist the user to create a wrapper. Meng et al. [14] proposed a schema-guided approach for wrapper generation. The user defines the structure of the to be extracted records by providing an XML schema. Using a GUI toolkit which renders the target web page, the user then matches values from the web page to corresponding XML schema elements. This matching is then transformed into extraction rules as XQuery expressions which are stored in the wrapper. Another semiautomatic system is IEPAD [3]. This system takes a web page as input and parses it into a so-called PAT tree. This tree can then be used by the pattern discoverer module to find repetitive patterns. The found repetitive patterns are turned into extraction rules by the rule composer. Each extraction rule is represented as a regular expression. Finally, the user selects the extraction rules that extract the desired information.

14 CHAPTER 2. BACKGROUND 5 OLERA [2] is another semiautomatic approach, which uses user specified example records in order to generate extraction patterns. In a fist step, OLERA tries to identify similar records to the given example record by the user. To do so, a approximate string matching technique is used, which tries to align tokenized HTML content of records. Finally, the user can select and label the relevant parts of the identified records. Depending on this selection by the user, the extraction rules are generated. Each extraction rule describes a pattern of string tokens. Irmak et al. [9] also presented a semiautomatic approach. They are using a browser toolbar to allow the user to directly interact with the browser. To create a wrapper, the user has to select one complete record of the data he wants to extract on a training web page. Once the record is selected, the system tries to find similar records on the same web page and the user can select which ones are relevant for the extraction. The system then generates a wrapper which contains the extraction rules for the relevant records. The extraction rules define a set of predicates of the from (attribute=value) for each depth of the DOM tree. Only elements in the DOM tree where each of their ancestors satisfy all the predicates at their level are extracted. Miled et al. [15] presented a domain specific semiautomatic approach. This approach relies on a domain specific knowledge data base. To generate a wrapper, the user has to provide a set of sample web pages. The knowledge base is then used to identify records in those web pages. The identified records are presented to the user, who has then the option to correct wrongly identified records. Once the record identification is complete, the extraction rules are generated. The extraction rules describe the text pattern of the record content. There are also completely manual approaches. Gruser et al. [8] presented such an approach. Although they offer a GUI based toolkit to the user, the actual extraction rules have to be created manually. To do so, they use a qualified path expression extractor language, which allows to address elements in a web page similar to XPaths. WebOQL [1] offers a functional language that can be used to query data from a web page. The main data structure provided by this language is similar to the HTML DOM tree. Elements in a web page can be queried using their tag name, HTML code or text content. The wrapper induction approach presented in this thesis is semiautomatic, but requires minimal user effort. Similar to [9] our approach uses the browser itself to directly interact with a web page. Unlike [6] and [15] this approach only needs one web page to generate a wrapper and does not require any training data like [11]. In order to generate a wrapper that extracts a list of similar records from a web page, it s sufficient to add the relevant elements of one example record to the wrapper. To add elements from a web page to a wrapper, it s enough to simply click on these elements in the browser. Once the user is done with adding elements, a wrapper can be generated. For each element in the wrapper we store its XPath, tag name, content, CSS properties and the tag names of its siblings. Although not all of these features are needed to extract content from a web page, they are necessary to verify and maintain a wrapper.

15 WRAPPER VERIFICATION 2.2 Wrapper Verification A Wrapper Verification process is used to verify if a wrapper is extracting the right data. In the approach of Lerman et al. [11], they check if the extracted data by a wrapper is significantly different than previous data extracted by the same wrapper. This is done by comparing newly extracted data to correctly extracted examples. If the density of tokens in both samples is at some significance level statistically the same, the wrapper is judged to be extracting correctly. Kushmerick [10] proposed a very similar approach. In addition to the token density he also considers the token length density. Chang et al. [5] proposed a different approach to verify a wrapper. They store a wrapper as a schema tree which defines the XPath and content features of each element in the page which needs to be extracted. As content features they compute the density of letters, digits and punctuations. They also store if a string begins with http and if the first letter is capitalized. Before they start extracting data for a page, they try to match each element from the schema tree in the wrapper with an element from the web page. To match two elements, they compute their matching probability based on their content features. If they can match the whole schema tree in the wrapper with the page in the right order, the consider the wrapper to be valid. Pek et al. [17] proposed an approach that verifies a wrapper by checking the DOM tree of the web page from which the wrapper was created. Each extraction rule in a wrapper is described as a path in the DOM tree. The wrapper also contains the number of child elements for each element along such a path. If the number of children of an element has changed, this approach checks if newly added branches also follow the structure describe by the extraction rule. We present a novel approach to verify a wrapper in this thesis. For each extraction rule in the wrapper, we test if its XPath points to an element in the DOM tree of the current web page from which the user wants to extract data. If the XPath is valid, we check if the siblings of the element are still the same as the ones stored in the wrapper. We do this by comparing their tag names. In a last step we compute the visual similarity between the element when the wrapper was generated and the element now. If the similarity is above a given threshold, we consider this extraction rule to be valid. If all extraction rules inside the wrapper are valid, we consider the whole wrapper to be valid. Unlike [11], [10] and [5], this approach does not use content features for verification, since we want our wrappers to be applicable to multiple web pages from the same website. Many web pages from the same website are likely to use the same template, but their dynamic content usually differs. Another advantage of this approach is that we do not need to run a wrapper for its validation unlike [11], [10] and [5]. 2.3 Wrapper Maintenance As soon as a wrapper fails to extract the correct data, a process is needed to repair the wrapper. Usually this happens after a website changes its template. This process is called Wrapper Maintenance or Wrapper Re-induction [11]. During this process, each extraction rule inside the wrapper which is extracting the wrong data is updated. The wrapper re-induction approach described by Lerman et al. [11] tries to find the data in

16 CHAPTER 2. BACKGROUND 7 the changed website that is the most similar to older correctly extracted data. This data is then used to learn new extraction rules for this web site. To reduce the amount of data to be processed, the content of a web page is divided into a static and a dynamic part with results of a database. To achieve this, multiple web pages of the same website are needed. Only the dynamic part of a web page is considered to contain important data. A similar approach was described by Meng et al. [14]. To maintain their wrappers, they look for elements in the changed web page with similar features to their previous correctly extracted elements. They use three features to describe an element: a boolean value to indicate whether the element has an associated hyperlink or not, a text that appears close to the element in the web page and a regular expression that describes the pattern of the data. Using the similar elements in the changed web page, they re-induct their extraction rules as XQueries. Ferrara et al. [7] proposed to use tree matching algorithms to maintain a wrapper. To do so, they store the DOM subtree of one correctly matched record in the wrapper. As soon as the wrapper fails to extract correct data, the wrapper maintenance procedure searches for the most similar DOM subtree in the changed web page. The wrapper is then updated according to this newly found subtree. To compute the similarity between two subtrees, they present two recursive tree matching algorithms. The first algorithm is called simple tree matching and computes the similarity between two trees by producing the maximum matching through dynamic programming. Nodes are considered to be equal if their tag names are identical. The second tree matching algorithm is called clustered tree matching and is similar to the simple tree matching algorithm. The only difference is that they attribute less importance to slight changes in the structure of a tree if they occur in deep sublevels. To do so, they weight changes according to their position in a tree. Another wrapper maintenance method is EDG-WM [13], which consists out of three steps. In a first step, a support vector machine is trained using previously correctly extracted data. To do so, attributes, position, font, color, size and path are used as features to describe the data. In the next step, the trained classifier is used on the changed web page to find new examples that have similar features than the previously correctly extracted data. Finally, a new wrapper is generated from the newly found examples using a wrapper induction method. The approach by Miled et al. [15] used a knowledge base to identify relevant records in order to generate a wrapper. Once the wrapper needs to be maintained, they repeat the induction process and replace the old rules. To do so, they again require a set of sample web pages. Then they use the knowledge base to identify new relevant records, let the user check the identified records and finally generate the new extraction rules. The wrapper maintenance approach presented in this thesis is similar to [14]. To maintain an extraction rule, we look for elements in the changed web page that are similar to the element described by the extraction rule. This is possible, since we store the values of all features of an element addressed by an extraction rule in the wrapper. To find the most similar element in the changed page, we need to iterate through each element and compute its similarity to the element addressed by the extraction rule. Once the most similar element has been found, the extraction rule in the wrapper is updated. Unlike [11] and [14] this approach doesn t need to store old extracted data in order to maintain a wrapper. Also we only need one web page to maintain the wrapper, unlike [11] and [15]. Another advantage of our approach is that we

17 DEEPDESIGN don t need to store the whole DOM subtree of an annotation, unlike [7]. 2.4 DeepDesign Murolo et al. [16] presented a semiautomatic tool for web data extraction called DeepDesign. This tool is able to extract data records from a web page given annotated example records. Annotations are used to to map a part of a record in a web page to a label. DeepDesign requires one, in some situations two annotated example records to extract similar records from a web page. Those annotated example records are called examples. The matching algorithm used by DeepDesign works in four steps: 1. Boundaries Detection In a first step, DeepDesign tries to find the boundaries of the given example records in a web page. The boundaries are given by the lowest common ancestor of all annotated elements. If not all records have a distinct common ancestor, the user needs to annotate a second example, which then helps DeepDesign to find the boundaries of records within a common lowest common ancestor. 2. Similar Records Using an approximated tree edit distance function, Deep- Design looks for subtrees in a web page that are similar to the subtree of the example record. If the tree edit distance is lower than a given threshold, a subtree is considered to be a record. 3. Record Propagation In this step, DeepDesign propagates the given annotations to the newly found records. DeepDesign decides which elements of a record are the best match for an annotation. To do so, DeepDesign uses a distance function based on structural and visual features. For each annotation, a list of candidates for each record is generated. 4. Label Propagation Labels are locally propagated within records. To do so, DeepDesign hierarchically clusters all candidates of an annotation according to a distance function. Each element in a cluster gets the same label. DeepDesign has been implemented as an extension for the Chrome browser. Therefore, Deep- Design can be used directly within the browser by interacting with a rendered web page. DeepDesign is used by the approach presented in thesis as an underlying content extraction system. While this thesis is focused on the handling of wrappers, DeepDesign is used to extract content from a web page given the annotations stored in a wrapper which was generated by our approach.

18 3 Approach The goal of this thesis is to build a wrapper-management library for DeepDesign that is able to generate, verify and maintain wrappers for web content extraction. Those wrappers should be applicable to web pages in three scenarios: A wrapper should be applicable to the same web page from which it was generated. Also, a wrapper should be applicable to web pages from the same website it was generated. Additionally, if the web site from which the wrapper was generated changes its template, our wrapper maintenance approach should be able to update the wrapper according to the changed template. In this approach, a wrapper is a data structure that contains all the information needed by DeepDesign to extract content from a web page. As described in Section 2.4, we need up to two annotated example records from a web page. This means, our wrappers need to be able to store two examples each with a set of annotations. Such a data structure is described in Section 3.1. In order to generate new wrappers, we present our wrapper induction approach in Section 3.2. To verify a wrapper, we introduce our wrapper verification approach in Section 3.3. Our wrapper maintenance approach is described in Section 3.4. Finally, we describe how our approach is working side by side with DeepDesign, in order to extract content from a web page using our wrappers. 3.1 Wrapper Structure In the context of DeepLibrary, a wrapper is a data structure that holds all the information needed to extract data from a web page. Additionally, the wrapper data structure contains information needed for its verification and maintenance. In Figure 3.1 we can see this structure in detail. The field Original URL stores the URL of the web page, on which the wrapper was created. The Maintenance Interval defines the time interval in which this wrapper will be 9

10 3.1. WRAPPER STRUCTURE automatically maintained by the server. The server-side maintenance is describe in Section 4.5.1. Apart from that, the wrapper data structure also contains some fields for its metadata such as name, description and the timestamp of its last modification.

This URLs point to the web pages to which this wrapper is applicable to, as described in Section 3.1.3. 3.1.1 Example Structure A wrapper contains exactly two examples.

19 WRAPPER STRUCTURE automatically maintained by the server. The server-side maintenance is describe in Section Apart from that, the wrapper data structure also contains some fields for its metadata such as name, description and the timestamp of its last modification. Figure 3.1: The structure of a wrapper The wrapper data structure contains a data structure for each example. Each example then holds a set of annotations, as described in Section There is also a data structure that holds a set of URLs in the wrapper data structure. This URLs point to the web pages to which this wrapper is applicable to, as described in Section Example Structure A wrapper contains exactly two examples. Each of those examples represents a collection of annotations, as shown in Figure 3.2. An example can be empty in cases where DeepDesign doesn t require it to extract records. This is the case for the second example if it isn t needed to detect the boundaries between records. Figure 3.2: The structure of an example Annotation Structure Each element annotated by the user is stored in an annotation structure. This annotation structure contains all the features of an element that are needed to extract data, verify if the

CHAPTER 3. APPROACH 11 annotation is still available, and maintain the annotation if the element has changed or isn t available anymore. This can be seen in Figure 3.3. DeepLibrary stores the XPath of each annotation, its text content, its tag name, its left and right siblings tag name and its defined CSS rules.

Each CSS rule is represented with its property name and its value. A CSS rule like color: red; consists out of the property color and its value red. Figure 3.4: The CSS rules container structure 3.1.

20 CHAPTER 3. APPROACH 11 annotation is still available, and maintain the annotation if the element has changed or isn t available anymore. This can be seen in Figure 3.3. DeepLibrary stores the XPath of each annotation, its text content, its tag name, its left and right siblings tag name and its defined CSS rules. Additionally each annotation has a name, which can be defined by the user. Figure 3.3: The structure of an annotation The CSS rules are stored in a separate data structure, as seen in Figure 3.4. Each CSS rule is represented with its property name and its value. A CSS rule like color: red; consists out of the property color and its value red. Figure 3.4: The CSS rules container structure Url Container Structure The URL data structure is shown in Figure 3.5. It contains a set of URLs, which point to web pages to which a wrapper is applicable to. The URLs in this set can contain wildcards like % which matches any number of characters and which matches exactly one character. If we want to use a wrapper to extract products from all categories of a web shop that uses URLs like we can simply add the URL to the wrapper.

12 3.2. WRAPPER INDUCTION 3.2 Wrapper Induction Figure 3.5: The URL container structure The wrapper induction approach presented in this thesis is semiautomatic.

21 WRAPPER INDUCTION 3.2 Wrapper Induction Figure 3.5: The URL container structure The wrapper induction approach presented in this thesis is semiautomatic. This means that the approach begins with a manual part, during which user interaction in required in order to add annotations to the wrapper, and ends with an automatic part, during which the features of each annotation are extracted and the wrapper is generated. To increase to usability of our approach, the wrapper induction will happen directly in a browser, allowing the user to interact with an web page. Our approach begins with a given empty wrapper. An empty wrapper is a wrapper that has two empty examples, i.e. examples that do not contain any annotations. The user then can add annotations to each of both examples. This can be done by directly clicking on elements in the browser. Once the user clicked on an element, a new annotation is added to the previously specified example and the user is prompted to name the new annotation. When the user is done adding annotations, he can initiate the wrapper generation. During this part of the approach, all features of each annotated element will be extracted and then stored in the data structure described in Section 3.1. DeepLibrary uses the following features: XPath Text Content Siblings Tag Name CSS Rules The path of an element in the DOM tree. If we know a website hasn t changed its template, this path can be used to directly retrieve an annotated element. The visible text content of an element. The tag name of the left and right sibling of an element. If there is no sibling available, we use NONE as tag name. The tag name of the element. Some selected CSS rules of the element. Section explains the selection process in detail. In a last step, the user will be prompted to define all URLs to which this wrapper will be applicable to. The URLs can contain wildcards, as described in Section The user also gets a chance to add a name, description and maintenance interval to the wrapper. Since all annotations in a wrapper are completely independent, this wrapper induction approach can also be used to modify existing wrappers. This means that existing annotations can be removed and new annotations can be added, without requiring to generate a new wrapper.

22 CHAPTER 3. APPROACH CSS Rule Selection One of the features that are extracted for each annotated element are its CSS rules. Since a large quantity of CSS rules can be defined for an element, this could increase the required space to store a wrapper and also increase the runtime of the visual similarity computation, described in Section To resolve this issue, DeepLibrary only extracts specific CSS rules. Those specific rules are the ones that are most likely not going to change during a template change of a website. To select those specific CSS rules, we evaluated our training data set, which is presented in Appendix A. This data set contains template changes of websites from 2006 to The way we obtained this data set is described in Section 5.1. In all template changes of a website the same set of elements was manually annotated. For each annotated element we then evaluated the CSS properties that didn t change during a template change. We computed a score for each CSS property which is equal to the number of times the property didn t change its value for the same element during a template change of a website. The results of this experiment can be seen in table 3.1. The experiment was run in a Chrome browser, since DeepLibrary and DeepDesign both are implemented as an extension for Chrome and therefore deal with the Chrome specific CSS properties. We decided to extract the top 50 CSS properties for each annotated element if they are defined. Rank Score CSS Property transform-origin perspective-origin text-align font-family font-size color outline-color webkit-text-fill-color webkit-text-stroke-color webkit-text-emphasis-color webkit-column-rule-color list-style-type border-top-color border-left-color border-bottom-color border-right-color line-height webkit-border-vertical-spacing webkit-border-horizontal-spacing webkit-locale height border-collapse width webkit-text-decorations-in-effect font-weight Rank Score CSS Property cursor padding-left padding-right text-decoration padding-bottom white-space padding-top margin-bottom vertical-align display margin-top margin-right margin-left overflow-x overflow-y border-bottom-width border-bottom-style border-right-width border-right-style unicode-bidi 46 8 word-break 47 7 background-color 48 6 float 49 6 overflow-wrap 50 6 word-wrap Table 3.1: CSS properties that didn t change during a template change

23 WRAPPER VERIFICATION Text Nodes Since DeepDesign supports the annotation of text nodes, DeepLibrary needs to support them as well. Text nodes have to be treated differently than normal element nodes, since they do not support the same methods and properties. If a text node in the web page is clicked in order to create an annotation, the user can decide if the text node itself or its parent element (which is in any case an element node) should be used. Should the user decide to use the text node itself, DeepDesign will internally wrap this text node inside of a TEXTTAG HTML element. This allows DeepLibrary to treat this wrapped text node like a normal element node in most cases. 3.3 Wrapper Verification Before an user is able to use a wrapper to extract content from a web page, the wrapper needs to be validated. Similarly to the wrapper induction, the wrapper verification happens directly in the clients browser. A wrapper is always validated against the currently displayed web page in the browser. Since a wrapper can be applied to multiple web pages, it is possible that a wrapper can be successfully verified for some of those web pages, but not for others. The wrapper verification approach presented in this thesis verifies each annotation separately. If all annotations are successfully validated, we consider the whole wrapper to be valid. The validation of an annotation happens in 3 steps: 1. Validate XPath In a first step, we try to relocate the annotated element using its stored XPath. If this fails, we consider the annotation to be invalid and stop its verification. Otherwise we use the relocated element as a candidate. 2. Validate Siblings Using only the XPath to validate an annotation isn t enough. A website might have changed its template and now the XPath could point to a completely different element. This is the reason why we need to validate the candidate element from step one further. We do so by comparing the sibling tag names stored in the annotation with the sibling tag names of the candidate element. If they aren t equal, we consider the annotation to be invalid and stop its verification. 3. Validate Visual Similarity In a last step, we compute the visual similarity between the element that was annotated and the candidate element. The method that computes the similarity is describe in Section If the elements have a visual similarity of at least 75%, we consider the annotation the be valid, otherwise not.

24 CHAPTER 3. APPROACH Visual Similarity The computation of the visual similarity between two elements relies on their defined CSS rules. The more CSS rules both elements have in common, the higher their visual similarity. Assume we compute the visual similarity between two elements a and b. The set of CSS rules defined for element a is called A, the set of CSS rules defined for element b is called B. Their visual similarity can be computed using the following formula: visual similarity(a, B) = A B A B The visual similarity will be 0 if two elements do not have any CSS rules in common. If the CSS rules defined for both elements are equal, their visual similarity will be equal to Wrapper Maintenance Our wrapper maintenance approach updates annotations that could not be verified on an updated web page. The idea behind the approach is simple: We look for the element in the updated web page that is the most similar to the annotated element. Similar to the wrapper verification approach, this happens for each annotation separately. This allows us to maintain only the annotations in a wrapper that really need to be maintained, instead of the whole wrapper. The maintenance of an annotation happens in three steps: 1. Candidate Extraction In a first step, we extract all candidate elements from the updated web page. The decision whether an element is considered to be a candidate is depending on its tag name. Only HTML tags that can have text node children are considered to be candidates. 2. Similarity Computation For each candidate element, we compute its similarity to the annotated element. To do so, we use an ensemble of classifiers. The ensemble consists out of a tag name classifier (described in Section 3.4.2), a siblings tag name classifier (described in Section 3.4.3), a text content classifier (described in Section 3.4.4), a XPath classifier (described in Section 3.4.5), and a visual classifier (described in Section 3.4.6). We then combine the results of each classifier to determine the most similar element, as described in Section Annotation Update The features of the annotated elements are overwritten by the values of the newly found most similar element. Once this is done, the annotation maintenance process is done.

25 WRAPPER MAINTENANCE Ensemble of Classifiers To compute the similarity between two elements, we use an ensemble of classifiers. Each classifier in this ensemble uses different features for its computation. The result of each classifier is a value between zero and one. The higher the value, the higher the similarity between the two elements. We combine all the values from the classifiers into one single weighted sum. To do so, we multiply the computed similarity of each classifier C i with its weight w i. The weight of each classifier is shown in Table 3.2. Those values result from experiments described in Section 5.2. Given an ensemble of N classifiers, the weighted sum can be computed using this formula: similarity(e 1, E 2 ) = N C i (E 1, E 2 ) w i i=1 Additionally, each classifier will produce a binary decision for each given pair of elements. The decision is based of whether the classifier decides the two elements are similar or not. Whenever the computed similarity value of a classifier C i is higher than a given threshold for this classifier t i then the decision will be positive, otherwise negative. For the classifiers in DeepLibrary, we use the thresholds shown in Table 3.2 Those values result from experiments described in Section 5.2. Classifier C Weight w Threshold t XPath Visual Text Content Siblings Tag Name Table 3.2: Weights and thresholds for classifiers used by DeepLibrary To find the most similar element in a web page to a given element, we start by looking for the element in the web page which gets the most positive votes from the classifiers. If this results in multiple elements, we would select the element with the highest similarity value out of those elements Tag Name Classifier The tag name classifier is based on the likelihood of an annotated element with tag name B changing to an element with tag name A in the updated web page. To compute a similarity score, we use a naive-bayes classifier: tag similarity(a, B) = P (tag at time t 2 = A tag at time t 1 = B) = Number of B tags at time t 1 that changed to A tags at time t 2 Number of B tags at time t 1 This computation relies on statistics generated from historical data. We used our training data set described in Appendix A as data source. For each annotated element in a snapshot, we

26 CHAPTER 3. APPROACH 17 retrieved its tag name and check which tag name is used for the same annotated element in the following snapshot. We ended up with 684 tag name changes and were able to generate a likelihood matrix, as seen in Figure 3.6. The tag names in red are those of the annotated element, the ones in green are those of the element in the changed web page. The darker a field, the higher the likelihood of a change. The fields on the diagonal of the matrix describe the likelihood that a given tag name wont change its tag name during a template change. Figure 3.6: Tag name change likelihood matrix Siblings Classifier The siblings classifier computes the similarity between an annotated element B and an element A in the updated web page, given their immediate left and right sibling element in the DOM tree. To do so, we use two naive-bayes classifiers. One for the left siblings and one for the right siblings. Both classifiers are equal but rely on different statistics. Given the two elements B and A, we can address the tag name of their immediate left and right siblings in the DOM tree by B.left, B.right and A.left, A.right. If a sibling does not exist, we say there is no tag. By focusing on the left sibling, the following transitions from element B to A can happen:

27 WRAPPER MAINTENANCE Tag to Same Tag Tag to Different Tag Tag to No Tag No Tag to No Tag No Tag to Tag A and B both have a left sibling and the tag names of those siblings are equal. Example: B.left = DIV, A.left = DIV A and B both have a left sibling but the tag names of those siblings are different. Example: B.left = DIV, A.left = SPAN B has a left sibling but A doesn t. Example: B.left = SPAN, A.left = - B and A don t have a left sibling. Example: B.left = -, A.left = - B has no left sibling but A does. Example: B.left = -, A.left = DIV The same transitions exist for the right sibling as well. We map each transition to a likelihood, which results from experiments run on our training data set described in Appendix A. The likelihoods are shown in Table 3.3. To combine the likelihood of the left sibling transition with the likelihood of the right sibling transition we multiply them: siblings similarity(a, B) = P (A.lef t, B.lef t B.lef t) P (A.right, B.right B.right) Transition Left Sibling Likelihood Right Sibling Likelihood Tag to Same Tag 74% 66% Tag to Different Tag 11% 13% Tag to No Tag 15% 21% No Tag to No Tag 82% 83% No Tag to Tag 18% 17% Table 3.3: Siblings classifier likelihoods Text Content Classifier Given two elements A and B, we can compute their similarity based on their text content. This computation can be done in two steps: 1. Character Counting In a first step, we iterate through each character in the text content of an element. This happens for each element separately. We map each character to one of those character classes: ASCII Spaces ASCII Digits

28 CHAPTER 3. APPROACH 19 ASCII Latin Uppercase Letters ASCII Latin Lowercase Letters ASCII Symbols Unicode Left to Right Characters Unicode Right to Left Characters Unicode Indic Characters Unicode African Characters Unicode Conlang Characters Unicode Near East Characters Unicode Undecipherable Characters Unicode North American Characters Unicode Hieroglyphics Characters Unicode Sumerian Characters Unicode Asian Characters Unicode Unmapped Characters Each character class has its own counter. Every time we map a character to a class, we increase the counter of this class by one. Note that we use individual counters for each element. 2. Comparing In a second step, we compare the character class counts of element A to the character class counts of element B. Let C A,i be the character count of class i for element A and l A the length of the text content of element A. We then can compute the content similarity of element A and B using this formula: N i=1 content similarity(a, B) = 1 C A,i C B,i l A + l B XPath Classifier The XPath classifier computes the similarity between the XPaths of two elements. This approach requires both XPaths to start at the root of a HTML document. XPaths like //*[@id= rso ]/div/div are not supported. The similarity computation can be done in three steps: 1. Split XPath Given two XPaths a and b, we split each XPath into a sequence of path nodes. For a given XPath a = /html/body/div/a, the associated sequence would be {html, body, div, a}. We call this sequence A. The i-th path node in this list can be addressed by A i.

29 WRAPPER MAINTENANCE 2. Edit Distance We compute the edit distance over the two sequences of path nodes. The edit distance describes the minimum number of needed deletions, replacements or insertions of path nodes to get from sequence A to sequence B. The computation of this XPath edit distance is similar to the Leveshtein Distance [12]. Instead of comparing each character of a string, we compare entire path nodes of a sequence. This changes the original algorithm slightly: Algorithm 1: XPath edit distance Data: Sequence A, Sequence B Result: Minimal edit distance of A and B: D m,n m A n B D 0,0 = 0 D i,0 = 1 for 1 i m D 0,j = 1 for 1 j n D i 1,j if A i = B j D D i,j = min i 1,j D i,j D i 1,j + 1 for 1 i m, 1 i n After running the algorithm, the edit distance will be in field D m,n. 3. Similarity The edit distance from step 2 is an absolute value that defines the number of needed edit operations. If we want to get a comparable edit distance, we need a value that is relative to the length of sequences A and B. To get such a value, we simply divide the edits distance by the length of the longest sequence. relative distance = D m+1,n+1 max(m, n)) This value will always be between zero and one. To get a similarity value, we subtract the relative distance from one. xpath similatity = 1 relative distance Visual Classifier The visual classifier uses the approach described in Section to compute the visual similarity between two elements.

30 CHAPTER 3. APPROACH Content Extraction The actual content extraction using wrappers generated by our approach is executed by Deep- Design. DeepDesign and our approach are both working side by side in the client browser and communicate through specified interfaces by passing messages. Our library has to validate and send all annotated elements of a wrapper to DeepDesign. The annotated elements can be directly accessed by the XPath stored inside of each annotation. After this has happened, DeepDesign is triggered to extract the content defined by the wrapper.

31 CONTENT EXTRACTION

32 4 Implementation & Architecture The approaches presented in Chapter 3 have been implemented as a wrapper-management library called DeepLibrary. DeepLibrary consists out of a client-sided Chrome extension and a server-sided application that is able to store wrappers in a centralized database. We describe the architecture of DeepLibrary in Section 4.1. In Section 4.2, we describe the different representations of a wrapper in our library. Sections 4.3, 4.4 and 4.5 present our implementation of the wrapper induction, wrapper verification and wrapper maintenance process. Finally, we describe the challenges that we had to solve during the development process in Section System Architecture The architecture of DeepLibrary can be divided into two parts: A client-sided part that was implemented as a Chrome extension and a centralized server-sided part that was implemented using Node.js and a MySQL database. The database allows to store and share wrappers between all users. Figure 4.1 shows the architecture of DeepLibaray Client-Side DeepLibrary has been implemented as a Chrome extension on the client-side. This allows us to use a popup window that appears whenever the user clicks on the icon of our extension, as shown in Figure 4.4. This popup window is used to interact with the user. Another feature of Chrome extensions are the so called background scripts. They can be used to run scripts during the whole browsing session like a daemon. DeepLibrary uses a background script to retrieve the number of applicable wrappers each time the user changes the web page, as described in Section

24 4.2. WRAPPER REPRESENTATION Figure 4.

33 WRAPPER REPRESENTATION Figure 4.1: The architecture of DeepLibrary The most important feature of Chrome extensions is their ability to inject JavaScript code into any web page that is currently opened in the browser. This allows us to inject code for inspecting the DOM tree of a web page and extracting features of elements. This is mainly used for our wrapper maintenance procedure and for the content extraction based on DeepDesign Server-Side On the server-side we use an application written in Node.js which accepts requests from the client-side Chrome extension. The Chrome extension can request to create, update or delete a wrapper from the database. Additionally the Chrome extension can request suggestions of wrappers for a given url. All those requests are forwarded to a MySQL database which stores all wrappers in the system Interfaces There are multiple interfaces in this system. To send messages inside the Chrome extension, for example between the popup and the injected JavaScript code in the web page, we use the chrome.runtime API. In order to send messages between the chrome extension and the serverside Node.js application, we use the Socket.IO library. For the communication between the Node.js application and the database, we use SQL queries. 4.2 Wrapper Representation A wrapper can be represented in multiple ways. During the runtime of the client-side Chrome extension, a wrapper might be represented as an object in the clients main memory. During that time, a wrapper is stored in the structure described in Section 3.1. To reuse a wrapper, it might be useful to store it persistently. We offer two ways to do so:

34 CHAPTER 4. IMPLEMENTATION & ARCHITECTURE 25 Centralized Database All wrappers created by DeepLibrary can be stored in a centralized database. This allows not only to reuse a wrapper but also to share it with different users. To store a wrapper in the database, we first dissassemble it and then store its parts in multiple tables, according to the first normal form (1NF). The entity-relationship diagram of our database is shown in Figure 4.2. Figure 4.2: Entity-relationship diagram of the database Local File In some cases it might be useful to store a wrapper locally. To do so, DeepLibrary offers an export and import function that is able to save and load local files. The save function uses the JSON.stringify function to transform a wrapper object in memory into a string that then can downloaded by the user into a local file. The load function does the opposite: It reads the content from a local file using the JavaScript File API and then parses it using the JSON.parse function into a wrapper object. 4.3 Wrapper Creation Wrappers are created directly in the browser, as described in Section 3.2. By clicking on rendered elements in a web page, annotations can be added to a wrapper. After clicking on an

35 WRAPPER CREATION element, the user is prompted to add a name to the annotation, as shown in Figure 4.3. Each annotated element is highlighted in the web page. Figure 4.3: Add annotation dialog The popup of the extension shows a list of all annotations that were added. This popup also acts as an editor for wrappers, since annotations can be added or removed. Additionally, this popup allows the user to directly run DeepDesign using the added annotations. This is the reason why we call this the wrapper controller popup. We explain it in Figure 4.4. Once all annotations have been added, the user can save the wrapper in the centralized database by clicking on the safe icon. Before the wrapper is sent to the database, the user has to add a name, description and a set of URLs to which this wrapper will become applicable to, as shown in Figure CSS Rules Extraction At the time writing this thesis, the only way to extract the CSS rules that apply to an element in the Chrome browser was by using the Window.getComputedStyle JavaScript function. This function returns an object that contains each CSS property known by Chrome associated with its computed value. If the web page doesn t specify a CSS property, the default value for this property by the browser will be applied. This introduces the problem how to distinguish between CSS properties with a value defined by the web page and CSS properties with a default value defined by the browser. To compute the visual similarity between to elements according to Section 3.3.1, we only consider CSS properties that were defined by the web page. This reduces the amount of CSS rules we need to store in a wrapper and increases the quality of the visual classifier. We solved this issue by storing the default values applied by Chrome for each CSS property for all supported tag names by DeepLibrary. During the CSS rules extraction of an annotation we then check each computed value against the stored default value. If those values are equal, we ignore this property. This solution works in most cases, but is far from perfect. If an element has a default weight of 100%, we will get a numeric value in pixels as result from calling Window.getComputedStyle. We can t identify this numeric value to be a default value by the browser since it s not in percent. Another issue is that the stored default values depend

CHAPTER 4. IMPLEMENTATION & ARCHITECTURE 27 Nr. Function 1 This section contains a list of all annotations that were added to the first example of the current wrapper.

36 CHAPTER 4. IMPLEMENTATION & ARCHITECTURE 27 Nr. Function 1 This section contains a list of all annotations that were added to the first example of the current wrapper. Each annotation is described with its name, tag name and whether the annotation is available on the current page. 2 This section contains a list of all annotations that were added to the second example of the current wrapper. 3 The sliders in this section can be used to change parameters that are used by DeepDesign to extract content. 4 Once DeepDesign has finished the content extraction, this section shows how many records have been extracted and how long it took. 5 By clicking the plus icon, annotations can be added to the first example. 6 By clicking the magnifying glass, the current web page will scroll to this annotation to make it visible. 7 By clicking this button, the annotation can be deleted. 8 By clicking this button, DeepDesign is run using the annotations defined by this wrapper. 9 By clicking this button, the wrapper can be stored in the centralized database. 10 By clicking this button, the wrapper can be downloaded as a file. 11 By clicking this button, the extracted results by DeepDesign will be shown in a dialog. This is shown in Figure By clicking this button, additional debug options are shown. 13 By clicking this button, the current wrapper will be deleted. Figure 4.4: Wrapper controller popup

37 WRAPPER VERIFICATION Figure 4.5: Wrapper save popup on the version of Chrome. If Chrome changes its default values or adds new CSS properties, our stored default values need to be updated. 4.4 Wrapper Verification While the user is browsing the web, our extension will continuously suggest wrappers to the user, depending on the web page the user is currently looking at. If the user opens the extension popup, a list of suggested wrappers is shown, as in Figure 4.6. Each of those wrappers is verified according to the approach described in Section 3.3. The result of the verification is shown in the colored bar below the wrapper name. The number in the colored bar shows how man annotations could be verified out of all the annotations in the wrapper. If all annotations could have been verified, the bar will be green. If some annotations could have been verified, the bar will be yellow. The red bar will appear if no annotations could have been verified.

38 CHAPTER 4. IMPLEMENTATION & ARCHITECTURE 29 Figure 4.6: Wrapper verification popup 4.5 Wrapper Maintenance In Figure 4.6, we see a list of wrappers suggested to the user. If the user clicks on the play icon to the right of the wrapper name, the corresponding wrapper will be loaded into the wrapper controller popup, shown in Figure 4.4. If all annotations are available, no wrapper maintenance is needed during this loading process. Otherwise the annotations that aren t available need to be maintained. To do so, we implemented the maintenance approach described in Section Server-Side Maintenance Depending on the number of elements in a web page and the number of annotations in a wrapper, the wrapper maintenance process might take a few seconds to run. To speed up this process, we implemented a server-side wrapper maintenance process. This server-side maintenance process is equal to the client-side process, with the exception that it is executed on the server-side. The wrapper maintenance for each wrapper is started periodically after a predefined time interval. This time interval can be chosen for each wrapper individually. It is also possible to exclude a wrapper from this automatic maintenance process. The server-side wrapper maintenance process has been implemented as a Node.js application. The application constantly checks our centralized database for wrappers that need to be maintained. To do so, we use the fields last changed and maintenance interval. If we find a wrapper that needs to be updated, we reconstruct the whole wrapper object from the database and use a headless browser called PhantomJS to open the url stored in the original url field of the wrapper. Once the page is loaded, we inject JavaScript code to validate and maintain

39 IMPLEMENTATION this wrapper. This code is equal to the code used in the Chrome extension. We then pass the wrapper object to PhantomJS, validate the wrapper and maintain it if needed. Finally we write the wrapper back to our centralized database and update the last changed field. 4.6 Implementation This section presents some implementation details about DeepLibrary. We explain how the wrapper lookup in the database works and how we suggest wrappers to the user. Additionally, we present our dialog system, which is able to interact with the user. Finally, we show how the extracted data by DeepDesign is presented to the user Wrapper Suggestion The Chrome extension is able to suggest wrappers to the user. This happens in two different places: Icon Badge Number The number of available wrappers for the currently visible web page is constantly shown next to the icon of the extension. This is shown in Figure 4.7. The background script of the extension is responsible to update this number every time the user changes the web page. To get the number of available wrappers for a given URL, the background script sends a request with the current URL to the server. The server then answers with the number of available wrappers. Figure 4.7: Icon badge wrapper suggestion Popup List If the user clicks on the icon of the extension, a popup will appear with a list of available wrappers for the currently visible web page. This is shown in Figure 4.6. This list is also obtained by sending a request with the current URL to the server. As a response to this request, the server sends all available wrappers as complete objects with all annotations. It is important to use two different requests for those cases, since it would generate very much traffic to send entire wrapper objects to the background script each time the user changes the web page. This is the reason why we just send the number of applicable wrappers in that case.

40 CHAPTER 4. IMPLEMENTATION & ARCHITECTURE Wrapper Lookup In Section we describe the process in which the Chrome extension requests available wrappers for a given URL. We have seen that such a request is sent whenever the user changes the web page in the browser. This results in a huge amount of such requests on the server-side and requires a fast handling. We try to answer such queries without performing a table scan over all the stored URLs in the database. To do so, we exploit the MySQL LIKE comparison operator. This operator allows to evaluate a given URL against stored URL patterns in the database using wildcards. We describe the format of the stored URLs in Section To get all wrappers that are available for a given URL, for example we would use the following MySQL query: SELECT DISTINCT wrapper id FROM tbl url WHERE LIKE url; This query would match the following stored URL pattens: The wildcards supported by the MySQL LIKE operator are % and. Since % and could appear in any URL, we need to encode them before storing them in the database, in order to prevent them from being interpreted as a wildcard. Since most browsers support URLencoding, we can replace % by %25 and by %5F.

41 IMPLEMENTATION Dialog System The DeepLibrary Chrome extension uses its own system to show dialogs to the user. Those dialogs can be used to simply show information, as seen in Figure 4.8, but also to request input from the user, as seen in Figure 4.3. The dialog system can be used as shown in Listing 4.1. A new dialog is created by calling the function dialogcreatenewdialog. This function requires a dialog title, its HTML formatted content and an array of buttons as arguments. Each button in this array has a label and a callback, which will be called if the user clicks on this button. Those buttons will be appended to the bottom of the newly created dialog. Once the dialogcreatenewdialog function is called, a HTML5 dialog element will be appended to the visible web page. By calling showmodal the newly created dialog element will be shown to the user. Finally, by calling closeandremovedialog, the dialog element will be hidden and removed from the web page. v a r d i a l o g = d i a l o g C r e a t e N e w D i a l o g ( D i a l o g T i t l e, <p>the c o n t e n t of t h i s d i a l o g. </ p>, [ { ] ) ; t i t l e : B u t t o n 1, c a l l b a c k : b u t t o n 1 C a l l b a c k }, { t i t l e : B u t t o n 2, c a l l b a c k : b u t t o n 2 C a l l b a c k }, { t i t l e : Cancel, c a l l b a c k : f u n c t i o n ( ) { closeandremovedialog ( d i a l o g ) ; } } d i a l o g. showmodal ( ) ; Listing 4.1: DeepLibrary dialog system

CHAPTER 4. IMPLEMENTATION & ARCHITECTURE 33 4.6.4 DeepDesign Results Export Once DeepDesign has extracted all available records from a web site, DeepLibrary offers functions to handle those results.

42 CHAPTER 4. IMPLEMENTATION & ARCHITECTURE DeepDesign Results Export Once DeepDesign has extracted all available records from a web site, DeepLibrary offers functions to handle those results. To show the extracted records to the user, DeepLibrary creates a dialog that lists all results in a table, as seen in Figure 4.8. In this dialog, the user has the option to download the extracted records as a CSV or JSON formatted file. Figure 4.8: Results dialog

43 IMPLEMENTATION

44 5 Evaluation In this chapter, we present an evaluation of our approach. In Section 5.1 we describe the process in which our training an test data set was created. In the following Section 5.2 we evaluate each of our classifiers from Section 3.4 individually using the training data set. We show the results of a test on the training and test data set using our wrapper maintenance approach in Section 5.3 and 5.4. In Section 5.5, we compare the task based performance of our wrapper maintenance approach with the approach by Ferrara et al. [7] and the approach by Lerman et al. [11]. In Section 5.6, we also compare the runtime performance. Finally, we test the adaptability of wrappers generated by our approach. To do so, we try to apply a wrapper generated from a specific web page of a website to other web pages from the same website. The results of this test are shown in Section Data Sets For the evaluation of our approach we needed to create a training and test data set using representative websites. The training data set is used to train the classifiers and for the backtest in Section 5.3. The test set is used for the test in Section 5.4 and the task based comparison in Section 5.5. Since we are evaluating our wrapper maintenance approach, we needed multiple snapshots of a website at different dates. If the interval between the dates of those snapshots is big enough, we will most likely observe a template change of a website. In order to build a training and test data set, we needed to decide which websites should be used. The Alexa ranking 1 seemed to be useful, since it lists the websites with the most traffic in the internet. We see this ranking as representative for most of the website in the internet,

45 CLASSIFIER EVALUATION since we argue that they use design patterns which are often adopted by less successful websites. Unfortunately, we had to skip some websites from this ranking, since some websites do not offer reasonable content that could be annotated and extracted. A website needs to contain at least one record based listing with dynamic content, in order to be included in our data sets. To get multiple snapshots of a website, we used the WayBackMachine 2, a service which offers to look at older version of a website. For each website chosen from the Alexa ranking, we tried to find older snapshots. We tried to find all snapshots of a website between 2006 and 2016 in an interval of one year. Unfortunately, this was not possible for all websites. Often a website changed that much that the annotated content could not be found anymore after some years. In other cases, the WayBackMachine suddenly stopped archiving a website after a few years. During the evaluation, we realized that the WayBackMachine is not a very reliable source. Sometimes it takes really long to load a snapshot of a website, other times things suddenly stop working. After obtaining all snapshots of a website, we manually annotated the same elements in each snapshot. The result of this process was stored in our data sets and has been used for the evaluation as a ground truth. The training data set contains snapshots of 18 websites and is described in Appendix A. The test data set contains snapshots of 9 websites and is described in Appendix B. 5.2 Classifier Evaluation In Section 3.4, we describe all the classifiers used by our wrapper maintenance approach. Each of those classifiers uses a threshold value and weight as described in Section This section describes how we optimized those values for each classifier. We used our training data set for the optimization of the classifiers. This training data set contains a set of websites where each website contains a list of snapshots. A snapshot of a website is an exact copy of this website at a given date. The list of snapshots is ordered, which means that the oldest snapshot will be found at the head of the list. We annotated the same set of elements in each snapshot of a website. To find the optimal threshold value for a classifier, we test all threshold values between zero and one using a given step size of The threshold value has to be between zero and one, since this is the range of the computed similarity by our classifiers. Given a classifier C and a threshold t, we can evaluate this classifier using our training data. For each consecutive pair of snapshots of each website, we count how many elements of the older snapshot would be mapped to the newer snapshot correctly, given the threshold t. Since we have an annotated set of the same elements for both snapshots, we can do this. We say a classifier would map an element e1 from the older snapshot to element e2 from the newer snapshot, if the computed similarity by the classifier C(e1, e2) is greater or equal to the threshold t. Optimally, this is the case for all of the same annotated elements, but not for all other elements. Using this approach, we can compute the number of true positives, false positives, true negatives and false negatives for each classifier and each threshold between zero and one. Using 2

46 CHAPTER 5. EVALUATION 37 those numbers, we can then compute the precision and recall of a classifier. To measure the quality of a classifier, we also compute the F1-score and the Matthews correlation coefficient. Finally, we use the threshold t, that results in the highest Matthews correlation coefficient for a classifier. As a weight for this classifier, we use the highest Matthews correlation coefficient itself. This way, a classifier get weighted according to its quality Text Content Classifier Figure 5.1 shows the evaluation of the text content classifier, which was described in Section We see that the precision of this classifier never reaches a value above 0.4 for every threshold value. The reason for this is that we find very similar text contents in all snapshots. This results in many false positives which keep the precision value low. The highest Matthews correlation coefficient of was measured using a threshold of At that point, we measured a recall of and a precision of Figure 5.1: Text content classifier evaluation XPath Classifier Figure 5.2 shows the evaluation of the XPath classifier, which was described in Section Unlike all of the other classifiers, the precision of this classifier almost reaches the value one. The reason for this is the fact that all of our XPaths are unique in a web page, since they start at the root of a HTML document and end at a specific element. Therefore we can get rid of many false positives using a high threshold value. The highest Matthews correlation coefficient of was measured using a threshold of At that point, we measured a recall of and a precision of

47 CLASSIFIER EVALUATION Figure 5.2: XPath classifier evaluation Visual Classifier Figure 5.3 shows the evaluation of the visual classifier, which was described in Section Similar to almost all other classifiers, the precision of this classifier never reaches a value close to one. This is because a web page often contains many very similarly looking or even identically looking elements. If there are visually equal elements to an annotated element, which is often the case, then we will get false positives and therefore a lower precision value. The highest Matthews correlation coefficient of was measured using a threshold of At that point, we measured a recall of and a precision of Tag Name Classifier Figure 5.4 shows the evaluation of the tag name classifier, which was described in Section The plots are not as smooth as the plots of the other classifiers, since we are using discrete likelihood values for this classifier, as described in Section For each threshold, a tag name can either be classified positive or negative. As soon as we change the classification of a tag name, the precision and recall value will abruptly change. Similar to the other classifiers, this classifier never achieves a high precision value, since it s not possible to filter out false positives by just looking at the tag name of element, since multiple elements in a web page might use the same tag name as one of the annotated elements. The highest Matthews correlation coefficient of was measured using a threshold of At that point, we measured a recall of and a precision of

48 CHAPTER 5. EVALUATION 39 Figure 5.3: Visual classifier evaluation Figure 5.4: Tag name classifier evaluation

49 BACKTEST Siblings Classifier Figure 5.5 shows the evaluation of the siblings classifier, which was described in Section The results of this classifier are very similar to those of the tag name classifier, since we also use discrete likelihood values. Again, we get a low precision, since often multiple elements in a web page have the same left and right sibling tag name as one of the annotated elements. The highest Matthews correlation coefficient of was measured using a threshold of At that point, we measured a recall of and a precision of Figure 5.5: Siblings classifier evaluation 5.3 Backtest After we optimized the parameters of our classifiers, we run a backtest on the training data set. For each consecutive pair of snapshots of each website, we counted how many annotated elements from the older snapshot could have been maintained to point to the right element in the newer snapshot. From those results, we computed the accuracy of our maintenance approach for each website and finally for the whole training set. We achieved an overall accuracy of 80%. The accuracy was computed using this formula: accuracy = #correctly maintained annotations #annotations The results can be seen in Table 5.1 and Figure 5.6.

CHAPTER 5. EVALUATION 41 Website Correct Incorrect Accuracy shopping.yahoo.co.jp 6 3 0.67 stores.ebay.com 5 2 0.71 news.google.co.jp 16 11 0.59 yandex.ru 19 5 0.79 espn.go.com 24 2 0.92 amazon.co.jp 10 1 0.

50 CHAPTER 5. EVALUATION 41 Website Correct Incorrect Accuracy shopping.yahoo.co.jp stores.ebay.com news.google.co.jp yandex.ru espn.go.com amazon.co.jp reddit.com wordpress.com/top-posts buzz.blogger.com imdb.com stackoverflow.com questions stackoverflow.com tags stackoverflow.com users craigslist.org jd.com kat.cr flipkart.com Table 5.1: Backtest results Figure 5.6: Backtest chart

42 5.4. TEST 5.4 Test To evaluate the performance of our maintenance approach, we run a test on unseen data. To do so, we used our test data set, which is described in Appendix B.

51 TEST 5.4 Test To evaluate the performance of our maintenance approach, we run a test on unseen data. To do so, we used our test data set, which is described in Appendix B. The testing procedure is equal to the one used for the backtest in Section 5.3. We achieved an overall accuracy of 76%. This accuracy might be lower than the resulting accuracy from the backtest, because we re handling unseen data in this test. The results of this test can be seen in Table 5.2 and Figure 5.7. Website Correct Incorrect Accuracy news.google.com youtube.com dir.yahoo.com shopping.yahoo.com wikipedia.org shopping.msn.com amazon.com listing amazon.com detail twitter.com Table 5.2: Test results Figure 5.7: Test chart

52 CHAPTER 5. EVALUATION Task Based Comparison We compare the task based performance of multiple wrapper maintenance approaches in this section. To do so, we implemented the approach presented by Ferrara et al. [7] and evaluated it on our data sets. In order to compare to the approach presented by Lerman et al. [11], we were able to use their data set and compared our results with those presented in their paper. Since the approach presented by Ferrara et al. [7] requires that annotated elements contain a rather large subtree of elements in order to work properly, we perform a record based comparison in Section in addition to the annotation based comparison in Section This is due to the tree matching algorithms used by Ferrara et al. [7], which compute the distance between two subtrees. To maintain an annotation, they look for the element with the most similar subtree in the changed web page to the subtree of the annotated element. If the subtree of the annotated element is too small, this approach might find multiple false positive similar elements in the changed web page. This approach can be performed using two different tree matching algorithms called simple tree matching and clustered tree matching. Both of them are described in Section 2.3. We use both of them for this comparison Single Annotation Based In this section, we compute and compare the accuracy of multiple wrapper maintenance approaches in their ability to maintain a single annotation. In difference to a record, a single annotation refers to a HTML element which often only contains a few text node childrens and generally has a very small subtree. It s important to mention that the approach presented by Ferrara et al. [7] was intended to maintain records with large subtrees. This explains the bad performance of their approach in this annotation based evaluation. Since DeepDesign fully supports this kind of annotated elements, it still makes sense to run this comparison, even if it s to the disfavor of the approach by Ferrara et al. [7]. Figure 5.8 shows the computed accuracy of the approach by Ferrara et al. [7] and DeepLibrary, obtained by using our training data set. Since DeepLibrary was trained on this data set, this comparison is to the disfavor of the approach by Ferrara et al. [7]. The accuracy of DeepLibrary is in average about 60% higher than the accuracy of the approach by Ferrara et al. [7]. In Figure 5.9 we do the same comparison, but use our test data set. Although none of the compared approaches was trained on this data, the results are similar to those obtained by using the training data set. We were able to evaluated the appraoch by Ferrara et al. [7] and DeepLibrary on the data set used by Lerman et al. [11]. The results can be seen in Figure This comparison is based on precision, recall and the F1-score, since those are the only values we got from Lerman et al. [11]. The snapshots of the 6 websites in their data set are from 1999 to 2000, and the template changes between snapshots are very small. The data set is described in Appendix C. Due to the small changes between snapshots, DeepLibrary was almost always able to maintain an annotation correctly. Also, the approach by Lerman et al. [11] performed good.

44 5.5. TASK BASED COMPARISON Figure 5.8: Single annotation based evaluation on training data set Figure 5.9: Single annotation based evaluation on test data set 5.5.2 Record Based Instead of maintaining each annotation individually, we maintain them on a record level in this evaluation.

53 TASK BASED COMPARISON Figure 5.8: Single annotation based evaluation on training data set Figure 5.9: Single annotation based evaluation on test data set Record Based Instead of maintaining each annotation individually, we maintain them on a record level in this evaluation. This means, that we try to maintain the record that holds all annotations instead of the annotation itself. For each two consecutive snapshots of a website, we try to maintain the record from the older snapshot to the correct record in the newer snapshot. To do so, we can use our training and test data set, but have to identify the record that hold all annotations for each snapshot for this comparison. This record is equal to the element in the DOM tree that is the lowest common ancestor of each annotated element. We perform this kind of comparison, since the comparison in Section was to the disfavor of the approach by Ferrara et al. [7] due to the small subtree of each annotated element. This comparison now uses records which contain larger subtrees. To maintain an entire record with DeepLibrary, we maintain each annotated element first, and then compute the lowest common ancestor of the maintained annotations. If all annotated elements haven been maintained correctly, this lowest common ancestor will point to the correct record. This comparison is unfair towards DeepLibrary, since in most cases we have to maintain all of the annotations correctly in order to get the correct lowest common ancestor.

54 CHAPTER 5. EVALUATION 45 Figure 5.10: Single annotation based evaluation on Lerman et al. [11] data set

46 5.6. RUNTIME COMPARISON In Figure 5.11, we compared the accuracy of the approaches on our training data set. Again, this comparison is to the disfavor of the approach by Ferrara et al.

The accuracy of DeepLibaray lost 13%. Still, the accurcy of DeepLibrary is 8% higher. Figure 5.11: Record based evaluation on training data set For Figure 5.

55 RUNTIME COMPARISON In Figure 5.11, we compared the accuracy of the approaches on our training data set. Again, this comparison is to the disfavor of the approach by Ferrara et al. [7], since DeepLibrary was trained using this data. We see that the accuracy of the Simple Tree Matching approach increased by 32% compared to the annotation based comparison. The accuracy of DeepLibaray lost 13%. Still, the accurcy of DeepLibrary is 8% higher. Figure 5.11: Record based evaluation on training data set For Figure 5.12 we compared the approaches on our test data set. We can see that all approaches performed slightly worse. The results are similar to those obtained using the training data set. Figure 5.12: Record based evaluation on test data set 5.6 Runtime Comparison In this section, we compare the runtime of DeepLibrary with the runtime of the approach presented by Ferrara et al. [7]. To do so, we measure the runtime each approach takes to maintain an annotation. We use the training and test data set. This measurement is not to the disfavor of the approach presented by Ferrara et al. [7] since the fact that DeepLibrary was trained on one of those data sets does not influence its runtime. In Figure 5.13 and 5.14 we can see the results of this measurement. For each website, we

56 CHAPTER 5. EVALUATION 47 show the average runtime of each wrapper maintenance approach and the average number of DOM elements. We can see that the runtime scales according to the average number of DOM elements. Figure 5.13: Runtime comparison We can also see that the runtime of DeepDesign is in average about four times higher than the runtime of the other approaches. The reason for this is that the wrapper maintenance approach used by DeepDesign is much more complex than the one presented by Ferrara et al. [7]. While Ferrara et al. [7] mostly needs to compare tag names of elements to perform their tree matching algorithm, DeepLibrary performs more expensive operations, such as the computation of set intersections and set unions as used by the visual classifier. Although the measured runtime of DeepLibrary is higher than the runtime of the approach presented by Ferrara et al. [7], we think it s still in an acceptable range with an average runtime of 821ms. Since the wrapper maintenance will only be needed occasionally, we think this is an acceptable time for the user to wait.

57 ADAPTABILITY TEST Figure 5.14: Runtime comparison chart 5.7 Adaptability Test In this adaptability test, we evaluate how likely a wrapper generated by DeepLibrary from a specific web page of a website can be used for other web pages of the same website. We created a new data set for this purpose, which is described in Appendix D and consists out of 10 websites from the Alexa top 26 ranking. For each of those websites, we use four of its web pages. One web page is used to create a wrapper and three are used to test if the wrapper is extracting the right data. The results of this test can be seen in Table 5.3. Website #Tested Web Pages #Wrapper Working #Maintenance Needed google.com youtube.com yahoo.com amazon.com twitter.com msn.com bing.com yandex.ru taobao.com ebay.com Table 5.3: Adaptability test results By looking at the results obtained from this test, we state that in 80% of all test cases, a wrapper generated by DeepLibrary for a specific web page of a website could be used for other web pages of the same website. In 20% of all test cases, this was possible after maintain-

58 CHAPTER 5. EVALUATION 49 ing the wrapper. There was no case where a wrapper wasn t working after the maintenance process.

59 ADAPTABILITY TEST

60 6 Conclusion & Future Work In summary, this thesis presented a novel approach how to generate, store, validate and maintain wrappers. Our wrapper generation approach is semiautomatic and allows the user to annotate relevant elements directly in the rendered web page. All relevant features of the annotated elements are then stored in a new wrapper, which can be saved in the centralized database or in a local file. To validate a previously generated wrapper, we compare the stored features of its annotated elements with the features of the coherent element in the current version of the web page, from which this wrapper was generated. The wrapper is considered to be valid, if the features are equal. If a wrapper can t be successfully validated, which is often the case when the web page from which a wrapper was generated changes its template, we use our wrapper maintenance approach to update the annotated elements in a wrapper. Our wrapper maintenance approach tries to identify the annotated elements in the updated web page using multiple classifiers which use different features. In difference to other approaches that focus on the management of wrappers, our approach requires very little input data. To generate, validate and maintain a wrapper, we need exactly one web page. Approaches presented by other authors often relay on a set of web pages, in order to divide the content into a static and dynamic part. Also, our approach is able to take multiple features of an annotated element into account for the validation and maintenance process. Currently we are considering structural, visual and content based features. Other approaches often focus on just one or two of these features. The presented approach has been implemented as a wrapper-management library called DeepLibrary which relies on the content extraction tool DeepDesign. The client-side part of DeepLibrary has been implemented as a Chrome extension which integrates the functionality of our wrapper-management library directly in the browser. This Chrome extension is also able to suggest wrappers from the centralized database to the users depending on the web page that is currently displayed in the browser. The most important part on the server-side is the centralized database which stores wrappers from all users and allows those wrappers to 51

61 52 be shared among the users. Also, we evaluated DeepLibrary in this thesis. To do so, we created multiple data sets with pages from the Alexa ranking and snapshots obtained from archive.org. We used the training data set, on which all classifiers were trained, to run a backtest on our wrapper maintenance approach. Other data sets were used to compare the accuracy of the wrapper maintenance process from DeepLibrary with the approaches presented by Ferrara et al. [7] and Lerman et al. [11]. In most test cases, DeepLibrary clearly outperformed the other approaches with an overall accuracy between 66% and 80%. We also compared the runtime of the wrapper maintenance approach by DeepLibrary with the approach by Ferrara et al. [7]. The runtime of DeepLibrary was in each test case higher than the runtime by Ferrara et al. [7]. This was expected, since the computations by DeepLibrary are much more expensive. Finally, we evaluated the adaptability of wrappers generated by DeepLibrary. To do so, we tried to use a wrapper generated on a specific web page on other web pages from the same website. In 80% of all test cases the wrapper could haven been adapted, in 20% of the test cases, we needed to use our wrapper maintenance approach in order to adapt the wrapper. As for future work, there are several things that could be done. It would be interesting to see if it is possible to identify an element in a web page after a template change by using its attributes and CSS classes. If this would be possible, one could add another classifier that uses attributes and CSS classes as features. In our evaluation, we saw that the simple tree matching algorithm by Ferrara et al. [7] achieved an accuracy of 59%. Although this is below the accuracy of DeepLibrary, one could consider to add another classifier that uses the subtree of an annotated element as feature. Since we saw that the tree matching algorithm only works for large subtrees, DeepLibrary should only use this classifier if the size of the subtree of an annotated element exceeds a certain threshold. To obtain a more detailed insight on the performance of DeepLibrary, it would be useful to extend the size of our data sets. This could be done by adding more websites or by adding more snapshots for websites that are already in our data sets.

62 A Training Data Set Website Annotations Snapshots shopping.yahoo.co.jp Product Listing stores.ebay.com Product Pricture Gallery news.google.co.jp News Headlines Product Title Description Price Product Title Price Title Description Source

63 54 yandex.ru Web Directory stores.ebay.com Product List View espn.go.com NFL Standings amazon.co.jp Product Listing reddit.com What s Hot wordpress.com Top Posts Title Description URL Product Title Price Time Team Wins Losses Ties Win Percentage Home Record Road Record Division Record Conference Recrod Points For Points Against Point Differential Streak Product Title Note Old Price New Price Discount Rank Title Points User #Comments Source Title Description Source

64 APPENDIX A. TRAINING DATA SET 55 blogger.com Blogger Buzz imdb.com Top 250 stackoverflow.com Newest Questions stackoverflow.com Tags stackoverflow.com Users craigslist.org Article Listing jd.com Product Listing Title Content Date Time User Rank Rating Title #Votes Title Description Tags #Votes #Answers #Views Time User Tag Frequency Description Name Location #Points Article Title Note Product Title Price

65 56 kat.cr Download Listing flipkart.com Product Listing Title User Category Comments Size #Files Age #Seed #Leech Product Title Price

66 B Test Data Set Website Annotations Snapshots news.google.com News Headlines youtube.com All Channels Title Description Source Title

67 58 dir.yahoo.com Web Directory shopping.yahoo.com Product Listing wikipedia.org Current Events shopping.msn.com Product Listing amazon.com Product Listing twitter.com Tweets Title URL Product Title Description Price Date Event Product Title Description Price Product Title List Price Price Discount Used Price Message

68 APPENDIX B. TEST DATA SET 59 amazon.com Product Detail Title Author List Price Price Discount

69 60

70 C Lerman et al. Data Set Website Annotations Snapshots aircharter.com Airport Listing Airport Name amazon.com Product Detail barnesandnoble.com Product Detail quote.com Ticker Overview smartpages.com Phonebook Record finance.yahoo.com Ticker Overview Product Title Author Price ISBN Author Title Price ISBN Availability Price Change Ticker Volume Share Price Name Street Phone Price Change Ticker Volume Share Price

71 62

72 D Adaptability Data Set All results were collected on the 10th of April google.com Search Results Annotations: Result Title Url Description Tested Search Queries: Haskell (Used to generate wrapper) OS X (Wrapper worked without maintenance) Cosine (Wrapper worked without maintenance) Spotify (Wrapper worked after maintenance) youtube.com Video Search Results Annotations: Video Title Username Description Tested Search Queries: Max Flow (Used to generate wrapper) 63

73 64 Porsche 911 turbo (Wrapper worked without maintenance) Gumball 3000 (Wrapper worked without maintenance) Saul Goodman (Wrapper worked without maintenance) yahoo.com Search Results Annotations: Result Title Url Description Tested Search Queries: Saul Goodman (Used to generate wrapper) OS X (Wrapper worked after maintenance) Gumball 3000 (Wrapper worked after maintenance) Max Flow (Wrapper worked after maintenance) amazon.com Book Listing Annotations: Book Title Author Price Tested Categories: Picture Book (Used to generate wrapper) Classics (Wrapper worked without maintenance) Action & Adventure (Wrapper worked without maintenance) Fantasy & Magic (Wrapper worked without maintenance) twitter.com Tweet Listing Annotations: Author Tweet Text Tested (Used to generate (Wrapper worked without (Wrapper worked without (Wrapper worked without maintenance)

74 APPENDIX D. ADAPTABILITY DATA SET 65 taobao.com Product Listing Annotations: Price Description Tested Categories: Phone Case (Used to generate wrapper) Laptop Parts (Wrapper worked without maintenance) Jumpsuit (Wrapper worked without maintenance) Sunglasses (Wrapper worked without maintenance) msn.com News Listing Annotations: Headline Source Tested Categories: Schweiz (Used to generate wrapper) International (Wrapper worked without maintenance) Wirtschaft (Wrapper worked without maintenance) Kultur (Wrapper worked without maintenance) bing.com Search Results Annotations: Result Title Url Description Tested Search Queries: Cosine (Used to generate wrapper) Max Flow (Wrapper worked after maintenance) Articulation Point (Wrapper worked without maintenance) Breadth-First-Search (Wrapper worked without maintenance) yandex.ru Search Results Annotations: Result Title Url

75 66 Description Tested Search Queries: OS X (Used to generate wrapper) Haskell (Wrapper worked after maintenance) Cosine (Wrapper worked without maintenance) Saul Goodman (Wrapper worked without maintenance) ebay.com Product Listing Annotations: Product Title Price Tested Categories: Screen Protectors (Used to generate wrapper) Batteries (Wrapper worked without maintenance) Camcorders (Wrapper worked without maintenance) Binoculares (Wrapper worked without maintenance)

76 List of Figures 2.1 The life-cycle of a wrapper The structure of a wrapper The structure of an example The structure of an annotation The CSS rules container structure The URL container structure Tag name change likelihood matrix The architecture of DeepLibrary Entity-relationship diagram of the database Add annotation dialog Wrapper controller popup Wrapper save popup Wrapper verification popup Icon badge wrapper suggestion Results dialog Text content classifier evaluation XPath classifier evaluation Visual classifier evaluation Tag name classifier evaluation Siblings classifier evaluation Backtest chart Test chart Single annotation based evaluation on training data set Single annotation based evaluation on test data set Single annotation based evaluation on Lerman et al. [11] data set

77 68 LIST OF FIGURES 5.11 Record based evaluation on training data set Record based evaluation on test data set Runtime comparison Runtime comparison chart

78 List of Tables 3.1 CSS properties that didn t change during a template change Weights and thresholds for classifiers used by DeepLibrary Siblings classifier likelihoods Backtest results Test results Adaptability test results

80 Bibliography [1] G.O. Arocena and Mendelzon A.O. Weboql: Restructuring documents, databases, and webs. Proceedings of the 14th IEEE International Conference on Data Engineering (ICDE), pages 24 33, [2] C. Chang and S. Kuo. Olera: Semisupervised web-data extraction with visual support. IEEE Intelligent Systems 19(6), pages 56 64, [3] C. Chang and S. Lui. Iepad: Information extraction based on pattern discovery. Proc. 10th Intl. Conf. on World Wide Web (WWW), pages , [4] C.H. Chang, M. Kayed, M.R. Girgis, and Shaalan K. A survey of web information extraction systems. IEEE Transactions on Knowledge and Data Engineering (Volume:18, Issue: 10), pages , [5] C.H. Chang, Y.L. Lin, K.C. Lin, and M. Kayed. Page-level wrapper verification for unsupervised web data extraction. Web Information Systems Engineering WISE 2013, pages , [6] V. Crescenzi, G. Mecca, and Merialdo P. Roadrunner: Towards automatic data extraction from large web sites. Proceedings of the 27th International Conference on Very Large Data Bases, pages , [7] E. Ferrara and R. Baumgartner. Automatic wrapper adaptation by tree edit distance matching. Combinations of Intelligent Methods and Applications, pages 41 54, [8] J.R. Gruser, L. Raschid, Vidal E.M., and L. Bright. Wrapper generation for web accessible data sources. Cooperative Information Systems, Proceedings. 3rd IFCIS International Conference, pages 14 23, [9] U. Irmak and Torsten S. Interactive wrapper generation with minimal user effort. Proceedings of the 15th International Conference on World Wide Web, pages , [10] N. Kushmerick. Wrapper verification. World Wide Web, pages 79 94, [11] K. Lerman, S.N. Minton, and C.A. Knoblock. Wrapper maintenance: A machine learning approach. Journal of Artificial Intelligence Research 18, pages , [12] V.I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8), pages ,

81 72 BIBLIOGRAPHY [13] W. Luo, Q. Li, and Ding Y. An approach based on extracted data for wrapper maintenance. Pervasive Computing and Applications (ICPCA), th International Conference, pages 88 92, [14] X. Meng, D. Hu, and C. Li. Schema-guided wrapper maintenance for web-data extraction. Proceedings of the 5th ACM International Workshop on Web Information and Data Management, pages 1 8, [15] Z.B. Miled, M. Mahoui, M. Dippold, A. Farooq, N. Li, and O. Bukhres. A wrapper induction application with knowledge base support: A use case for initiation and maintenance of wrappers. Fifth IEEE Symposium on Bioinformatics and Bioengineering (BIBE 05), pages 65 72, [16] A. Murolo and M.C. Norrie. Revisiting web data extraction using in-browser structural analysis and visual cues in modern web designs. 16th Intl. Conf. on Web Engineering (ICWE), [17] E Pek, Li X., and Y. Liu. Web wrapper validation. Web Technologies and Applications: 5th Asia-Pacific Web Conference 2003 Proceedings, pages , 2003.

Deccansoft Software Services

Deccansoft Software Services (A Microsoft Learning Partner) HTML and CSS COURSE SYLLABUS Module 1: Web Programming Introduction In this module you will learn basic introduction to web development. Module