WebJDBC Relational Data Extractor


WebJDBC Relational Data Extractor

Daniel Filipe Piedade Santana

Dissertation submitted to obtain the Master Degree in Information Systems and Computer Engineering

Jury
Chairman: Prof. João Paulo Marques da Silva
Supervisor: Prof. Paulo Jorge Fernandes Carreira
Co-Supervisor: Prof. Pável Pereira Calado
Members: Prof. João Carlos Serrenho Dias Pereira

June 2012

Resumo

One of the main purposes of a JDBC driver is to access relational databases from a Java application. This work proposes accessing web pages through the abstraction of relational databases, which is very important when building applications that need to integrate multiple sources of information. However, web data extraction is not a trivial task: there are many web pages with different HTML structures and, consequently, different data representations. The goal of this work is to contribute to the construction of a WebJDBC driver that enables structured and systematic access to web page data from Java applications. The construction of the WebJDBC driver was split into two distinct tasks: (i) the implementation of a relational web data extractor; (ii) the implementation of a query processor able to evaluate queries over the data obtained by the extractor. This work focuses on the implementation of the extractor and on the study of existing approaches to web data extraction. The work achieved the integration of several techniques into the Relational Web Data Extractor, which is designed to ease the integration of new techniques.

Keywords: Web pages, Data extraction, Data region, Data record, WebJDBC driver, Query processing

Abstract

The main goal of a JDBC driver is to access relational databases from a Java application. This work proposes accessing web pages using a relational database abstraction, which is very important when building applications that need to integrate data from multiple data sources, such as web data. However, extracting data from web page lists and tables is not a trivial task. There are many web pages with different HTML structures and, consequently, different data representations. Our goal is to contribute to the construction of a WebJDBC driver that allows systematic and structured access to data from web pages in Java applications. The construction of the WebJDBC driver was split into two tasks: (i) the implementation of a relational data extractor for web page lists and tables; (ii) the implementation of a query processor capable of evaluating SQL queries over the data returned by the extractor. This work focuses on the implementation of the relational data extractor and studies existing approaches to web data extraction. The work achieved the integration of multiple techniques into the Relational Web Data Extractor, which is designed to ease the integration of new techniques.

Keywords: Web pages, Data extraction, Data Region, Data Record, WebJDBC driver, Query Processing

Contents

Resumo
Abstract
1 Introduction
  1.1 Objectives and Contributions
  1.2 Document Layout
2 Concepts
  2.1 Webpage Representation
  2.2 Data Record and Data Region
  2.3 Wrapper
3 Related Work
  3.1 Autonomous Pattern Discovery Techniques
  3.2 Machine Learning Techniques
  3.3 Semantic Model Techniques
  3.4 Technique Overview
    3.4.1 Input Complexity
    3.4.2 Output Format
    3.4.3 Field Labelling
    3.4.4 CPU Time
4 Solution Design and Implementation
  4.1 Extractor
    4.1.1 Model Manager
    4.1.2 Technique Factory
    4.1.3 Technique
  4.2 Techniques
    Autonomous Technique
    Regex Technique
    Supervised Technique
5 Validation
  Evaluation Methodology
  Evaluation Measures
  Experimental Results
    Football - Task
    Quota - Task
    Bigbook - Task
    Computer Brands - Task
    IST Student - Task
    UEFA Rank - Task
  Summary and Discussion
6 Conclusions
  Work review and Conclusions
  Conclusions Summary
  Future Work
Bibliography
Appendix
A User Tests Statement
B Profiling Form
C Profiles
  C.1 User
  C.2 User
  C.3 User
  C.4 User
  C.5 User
  C.6 User
  C.7 User
  C.8 User
D Experimental Results
  D.1 Football - Task
  D.2 Quota - Task
  D.3 Bigbook - Task
  D.4 Computer Brand - Task
  D.5 IST Student - Task
  D.6 UEFA Rank - Task

List of Tables

3.1 Autonomous Pattern Discovery Features
3.2 Supervised and Unsupervised Machine Learning Features
3.3 Semantic Model Features
D.4 Football Results - User
D.5 Football Results - User
D.6 Football Results - User
D.7 Football Results - User
D.8 Football Results - User
D.9 Football Results - User
D.10 Football Results - User
D.11 Football Results - User
D.12 Quota Results - User
D.13 Quota Results - User
D.14 Quota Results - User
D.15 Quota Results - User
D.16 Quota Results - User
D.17 Quota Results - User
D.18 Quota Results - User
D.19 Quota Results - User
D.20 Bigbook Results - User
D.21 Bigbook Results - User
D.22 Bigbook Results - User
D.23 Bigbook Results - User
D.24 Bigbook Results - User
D.25 Bigbook Results - User
D.26 Bigbook Results - User
D.27 Bigbook Results - User
D.28 Computer Brand Results - User
D.29 Computer Brand Results - User
D.30 Computer Brand Results - User
D.31 Computer Brand Results - User
D.32 Computer Brand Results - User
D.33 Computer Brand Results - User
D.34 Computer Brand Results - User
D.35 Computer Brand Results - User
D.36 IST Student Results - User
D.37 IST Student Results - User
D.38 IST Student Results - User
D.39 IST Student Results - User
D.40 IST Student Results - User
D.41 IST Student Results - User
D.42 IST Student Results - User
D.43 IST Student Results - User
D.44 UEFA Rank Results - User
D.45 UEFA Rank Results - User
D.46 UEFA Rank Results - User
D.47 UEFA Rank Results - User
D.48 UEFA Rank Results - User
D.49 UEFA Rank Results - User
D.50 UEFA Rank Results - User
D.51 UEFA Rank Results - User

List of Figures

1.1 Project Overview Diagram
2.2 Tag Tree Example
2.3 Webpage List of Products Example
2.4 String Edit Distance Matrix Example
3.5 Clean Fields Example
4.6 General Diagram
4.7 Extractor Diagram
4.8 ModelManager Diagram
4.9 TechniqueFactory Diagram
TechniqueStrategy Diagram
Autonomous Technique Diagram
DocumentVisitor Diagram
DataRegionSelector Diagram
RecordExtractor Diagram
RegexTechnique Diagram
Supervised Example File
ExtractionContext Diagram
TupleState Diagram
FieldState Diagram
Task Flow Diagram

Chapter 1

Introduction

A JDBC driver is a software component that enables a Java application to interact with a relational database. JDBC drivers exist for distinct relational databases such as MySQL, Oracle, SQL Server or PostgreSQL. This access is done through a uniform interface. For each type of information source, a driver must be constructed according to the specifications of the JDBC API Fisher et al. (2003). Essentially, a JDBC driver provides classes that allow creating connections to sources of information, executing SQL queries and accessing query results. Over the years, JDBC drivers have appeared that abstract other sources of information, such as CSV and XML files, through the same interface used for relational databases Jigyasu et al. (2006). This allows similar JDBC primitives to be used for different data sources, making it more convenient to integrate different sources of information in the same application.

The Internet is a huge source of information. Web pages contain information that can be extracted and used. However, if we intend to use this information in our Java applications and/or integrate it with information from other sources, constructing a WebJDBC driver is worthwhile. Such a driver provides a systematic and structured way to access this information and to integrate it with other sources of information through other drivers. This kind of driver is valuable for Java developers who need to work with web information in their applications. To construct a WebJDBC driver, we need some way to extract data from web pages.

Web data extraction is a problem that has been studied for many years Laender et al. (2002b). The study of web data extraction algorithms and techniques is profitable and worthwhile because the Internet is a vast repository of free and accessible information, and many web pages present lists or tables of products. One common solution for web data extraction is creating an ad-hoc program that performs the extraction based on the structure of a web page Laender et al. (2002b). Although this approach can be the most effective, since a human is able to understand the pattern or structure of a web page and its exceptions, and to identify data and its semantics,

the manual approach is not the most efficient. The number of existing web pages, the constant modification of their structure and their extensive HTML code, which is hard to capture efficiently, make an ad-hoc manual solution neither scalable nor efficient. There are sophisticated techniques for web data extraction that are more efficient than the manual approach. Some try to infer page schemas in an autonomous way: they discover patterns in the HTML code of the pages without any previous knowledge Álvarez et al. (2010); Liu et al. (2003). Other approaches use a learning algorithm over training sets of examples to create a systematic extraction process Crescenzi et al. (2001); Muslea et al. (1998). Finally, there are techniques that follow an alternative approach, based on semantic models such as ontologies, to infer relations between entities in text and extract them directly into relational tables Embley et al. (1999); Limaye et al. (2010). These techniques have different features, such as their input and output.

1.1 Objectives and Contributions

There is no published work on JDBC drivers for web pages. Since web pages can be seen as HTML documents rich in information, implementing a WebJDBC driver that allows the information of these web pages to be integrated into Java applications is worthwhile. To construct a driver of this type we must implement an extractor based on existing web data extraction techniques. This allows filtering the relevant information from an HTML document and using it in much the same way as we use relational database tables. Unlike CSV and XML files, web pages do not present information in a regular structure. For this reason, a mechanism is needed to filter an HTML document and extract the relevant data into a structured format over which the driver can process a query. The relational data extractor is oriented towards the extraction of data from lists and tables of web pages, because these lists and tables commonly contain organized information that can be useful for the user of the driver. This work has the following objectives/contributions: (i) creation of an open-source relational data extractor for lists and tables of web pages; (ii) integration of existing or adapted extraction techniques into the extractor; (iii) integration of the web relational data extractor with a WebJDBC driver; (iv) description of some frequent web extraction problems and existing web extraction techniques.

Figure 1.1 presents an overview of the overall project. A Java application submits SQL statements, which can be queries or updates, through the WebJDBC driver. The WebJDBC driver then processes the SQL statement, which is either an update (useful for giving necessary input to the extractor for further extractions) or a query that results in a request for a cursor, a known interface for the driver. Herein, we focus on the Relational Web Data Extractor component of the project. The Extractor integrates multiple techniques and chooses, based on the input provided by the SQL statements, the most suitable technique for the extraction task. To accomplish this, we studied existing web extraction techniques to better understand how to integrate them into the Extractor. Finally, we implemented some of these techniques to construct the extractor and to evaluate whether the extractor approach is worthwhile for a WebJDBC driver. For a given SQL query, the web relational extractor is able to choose the best technique for data extraction. This technique (or algorithm) implements the JDBC cursor interface. The results are extracted from the web page and processed by the cursor, which delivers data on demand to the Java application.
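To make this usage pattern concrete, the following is a minimal sketch of how a Java application could query a web page through such a driver using only the standard JDBC API. The connection URL scheme, the table naming convention and the column names are assumptions made for illustration; they are not defined at this point of the document.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class WebJdbcUsageSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection URL; the scheme accepted by the WebJDBC driver is assumed.
        try (Connection conn = DriverManager.getConnection("jdbc:webjdbc:");
             Statement stmt = conn.createStatement();
             // Hypothetical query: the web page URL is used as the table name and the
             // extractor turns the product list on that page into rows.
             ResultSet rs = stmt.executeQuery(
                     "SELECT name, price FROM \"http://www.example.com/products\"")) {
            while (rs.next()) {
                // Rows are produced on demand by the cursor backing this ResultSet.
                System.out.println(rs.getString("name") + " - " + rs.getString("price"));
            }
        }
    }
}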

Figure 1.1: Project Overview Diagram - SQL statements are performed in the WebJDBC driver and their parameters are used to provide input information about the web pages to the Extractor. Given that information, the Extractor is able to create a cursor, an interface that the WebJDBC driver is able to use and that also provides results to the driver. These results can be used by a Java application through the WebJDBC driver.

1.2 Document Layout

This document is structured in the following way: Chapter 2 presents some relevant concepts related to web relational data extraction. Chapter 3 provides an overview of related work on existing extraction techniques and algorithms. Chapter 4 presents details about the implemented solution and its design, along with the decisions taken in the implementation and design process. Chapter 5 presents a description of the evaluation methodology and the experimental results of the user test cases. Finally, Chapter 6 presents the main conclusions and ideas of this work, and possible future work.

Chapter 2

Concepts

This chapter describes some important concepts about web data extraction, which are important for a better understanding of the following chapters. These concepts are described along with their relation to web data extraction.

2.1 Webpage Representation

A webpage is a document written in HTML code. HTML is a markup language and uses markup tags to describe webpages. Webpages can represent data in a structured, semi-structured or unstructured way. In this work we mainly consider webpages that represent data through lists or tables; these webpages are structured or semi-structured. An example of a webpage is shown in Figure 2.3. As we can see, webpages do not contain only relevant information. Commonly, webpages have visual information, descriptive information, advertising and separators, which can be irrelevant. Usually, for web data extraction, many approaches use some kind of webpage representation. There are two frequent representations: (i) a tag tree based on the HTML code Álvarez et al. (2010); Chang & Lui (2001); Dalvi et al. (2009); Liu et al. (2003); Nie & Yu (2010); Zheng et al. (2009); (ii) a tag tree based on visual features of the webpage Hiremath & Algur (2010); Liu & Zhai (2005); Zhai & Liu (2005). A tag tree fragment example is shown in Figure 2.2. Commonly, the first is constructed by parsing the HTML code of a webpage; a DOM tree is an example of this type of representation. Sometimes, for the construction of a tag tree, the HTML code needs to be cleaned, since it may not be well formed. Conversion of HTML into XHTML is common because XHTML has a well-formed structure. The second is constructed using tools, like MSHTML or VIPS Deng et al. (2003), that allow discovering properties of the HTML tags, such as the location, width and height they occupy in a webpage.
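As an illustration of the first kind of representation, the following sketch builds a tag tree from an HTML fragment and walks it depth-first. It assumes the jsoup HTML parser, which also cleans malformed markup; the library is used here only for the example and is not prescribed by this work.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TagTreeSketch {
    public static void main(String[] args) {
        // A tiny HTML fragment standing in for a product listing page.
        String html = "<html><body><table>"
                + "<tr><td>Product 1</td><td>$10</td></tr>"
                + "<tr><td>Product 2</td><td>$12</td></tr>"
                + "</table></body></html>";
        Document doc = Jsoup.parse(html);  // parsing also repairs badly formed HTML
        printTree(doc.body(), 0);
    }

    // Depth-first walk over the tag tree, printing one tag per level of indentation.
    static void printTree(Element node, int depth) {
        System.out.println("  ".repeat(depth) + node.tagName());
        for (Element child : node.children()) {
            printTree(child, depth + 1);
        }
    }
}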

Figure 2.2: Tag Tree Fragment (adapted from Liu et al. (2003)) - This fragment represents a table in a webpage.

2.2 Data Record and Data Region

Data records are items of data, as seen in Figure 2.3 and Figure 2.2. They are a well-known concept in the domain of web data extraction, as we can see in Álvarez et al. (2010); Hiremath & Algur (2010); Liu & Zhai (2005); Liu et al. (2003); Zhai & Liu (2005). Although data records can differ in structure, because some records have optional fields, they have a similar HTML structure. They contain data with relevant information such as, for example, the price, name or type of a product. They may also contain potentially irrelevant data concerning the visual presentation of the page, such as <b>price:</b>. We can extract relational tuples from data records. For example, from a data record that represents a product and contains fields representing name, brand and price, the following relational tuple can be extracted: product(name, brand, price). There exist at least three types of data records:

Flat Data Records have fields concerning only one item Hiremath & Algur (2010), i.e., the fields of this kind of record represent only one entity. An example is shown in Figure 2.3.

Nested Data Records represent information about many items of the same type in a single data record Hiremath & Algur (2010). Figure 2.3 shows an example of a nested data record: within the record that contains the field Plastic Mixing Bowls by Zack, which represents a type, there are three products that differ in the name and price fields. We can also see that the list of data records contains both flat and nested data records.

Crossed Data Records have their fields mixed in the code. Suppose, for example, that we have products with image, name and price attributes. Consider that we have
product 1 and product 2. For crossed records, their attributes may appear in the HTML code of the page in the order image 1, image 2, name 1, name 2 and then price 1, price 2, but in the webpage they visually appear to be distinct records Zheng et al. (2009).

Figure 2.3: Webpage Sample - This figure shows a list of products, which have some information that can be extracted.

Big regions that are rich in data can be found in webpages. These regions are known as data regions in many approaches Álvarez et al. (2010); Hiremath & Algur (2010); Liu & Zhai (2005); Liu et al. (2003); Nie & Yu (2010); Zhai & Liu (2005). They can be lists or tables of items; an example can be seen in Figure 2.3. The identification of these regions is useful for web data record extraction: it avoids a lot of information that is possibly irrelevant for the user, such as advertisements or navigation bars, and focuses only on finding relevant information, such as data records.

Data regions and data records can be identified and extracted from a webpage using techniques that detect their structural similarity. Similarity techniques are used by some approaches to discover how different or similar two strings or two trees are Álvarez et al. (2010); Amin & Jamil (2009); Baeza-Yates & Ribeiro-Neto (1999); Chang & Lui (2001); Liu & Zhai (2005); Liu et al. (2003); Zhai & Liu (2005). These techniques are used in web data record extraction because data records have similar tag patterns in a webpage. We can view data records as sequences of tags, which can be converted to a string or a tree; this way we can see whether two data records are similar or not. It is possible to back-trace string and tree similarity computations in order to align the strings or the trees, and this alignment can be used for data record extraction Liu & Zhai (2005); Yang (1991); Zhai & Liu (2005). An alignment technique can infer which fields in two data records correspond to each other. This allows grouping fields of the same type and finding optional or irrelevant fields. Figure 2.4 presents an example of a matrix generated by the string edit distance algorithm. The algorithm calculates each cell of the matrix based on the adjacent cells and the matching of the characters of each row and column. The last cell of the matrix holds the edit distance of the two strings, i.e. the number of operations (character insertion,
deletion or substitution) needed to convert one string into the other. In the end we can back-trace the matrix calculation (gray cells) and through that infer a possible alignment for the two strings, which can be seen in Figure 2.4.

2.3 Wrapper

A wrapper is a procedure that automatically extracts data from webpages. Wrappers can be constructed manually or automatically. Many approaches generate wrappers automatically based on examples or by discovering patterns in webpages Crescenzi et al. (2001); Dalvi et al. (2009); Zheng et al. (2009). One example of a wrapper is a state machine.

Figure 2.4: String Edit Distance Matrix Example - String edit distance and alignment between two similar strings (String1 TAAGGTCA and String2 TACAGGTACC), showing the matrix, the traceback and the resulting alignment with character removals and substitutions marked.
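To make the matrix computation and its traceback concrete, here is a minimal sketch of the string edit distance described in Section 2.2, assuming unit costs for insertion, deletion and substitution (the exact cost model used in Figure 2.4 is not specified in the text).

public class EditDistanceSketch {

    // Fills the dynamic-programming matrix: cell (i, j) holds the minimum number of
    // insertions, deletions and substitutions needed to turn a[0..i) into b[0..j).
    static int[][] matrix(String a, String b) {
        int[][] m = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) m[i][0] = i;
        for (int j = 0; j <= b.length(); j++) m[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int subst = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                m[i][j] = Math.min(m[i - 1][j - 1] + subst,      // substitution or match
                          Math.min(m[i - 1][j] + 1,              // deletion
                                   m[i][j - 1] + 1));            // insertion
            }
        }
        return m;
    }

    public static void main(String[] args) {
        String a = "TAAGGTCA", b = "TACAGGTACC";
        int[][] m = matrix(a, b);
        // The last cell holds the edit distance; tracing back the cells chosen by the
        // minimum above yields an alignment of the two strings, as in Figure 2.4.
        System.out.println("edit distance = " + m[a.length()][b.length()]);
    }
}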

Chapter 3

Related Work

This chapter describes and identifies the different extraction techniques and their features, which are important to decide on and design a better solution for a web relational extractor. It presents a study of existing web data extraction techniques, how they work and their strong and weak points. In Laender et al. (2002b) we can see a brief description of some existing techniques grouped into classes, such as HTML-aware techniques, Natural Language Processing based techniques, wrapper induction techniques, modelling based techniques and ontology based techniques. Here, we propose another classification for these techniques that is more appropriate for our goals, because we must be aware of the differences between the techniques concerning input, output and the way they work.

Web data record extraction is important for this work because it makes the extraction of relational tuples from data records possible; we can see a tuple as a set of data record fields. There are many approaches to extract data records or their fields from web pages. Some approaches try to find patterns in a web page and only need a single web page as input. Commonly, they have a data region identification step and a data record identification and extraction step. We will refer to them as Autonomous Pattern Discovery Techniques. Other techniques use a set of training examples to generate a wrapper that extracts data records or their fields. These techniques have a learning process or algorithm that receives a set of training examples as input and processes them to create a wrapper. We will call these Machine Learning Techniques. There is an alternative approach that uses semantic models: a set of known entities and relationships, such as ontologies, is used to identify data and infer relational tuples in a web page. We will call these Semantic Model Techniques.

3.1 Autonomous Pattern Discovery Techniques

Some web data record extraction approaches consist in inferring the schema of a web page and extracting data from it, without previous knowledge. They just need a single web page to perform the extraction. In this work, they are referred to as Autonomous Pattern Discovery Techniques.

This kind of technique makes some assumptions about web pages, based on observations. Most of them consider that there is a data region with data records in a web
page Álvarez et al. (2010); Hiremath & Algur (2010); Liu et al. (2003); Nie & Yu (2010). Visually, data regions appear as lists or tables and data records as elements of these lists and tables. They also assume that there is a similar pattern between data records. Some use similarity and alignment techniques, which are frequently used to find data regions and data records and to extract data record fields correctly Liu & Zhai (2005); Zhai & Liu (2005). Most of them have a step to identify data regions and a step to identify and extract data records. We separate the description of these approaches into two groups: (i) techniques that use a tag tree; (ii) techniques that use other representations, such as suffix trees Amin & Jamil (2009); Chang & Lui (2001).

MDR Liu et al. (2003) and DEPTA Zhai & Liu (2005) try to find a data region by comparing adjacent generalized nodes in a tag tree. The first constructs the tag tree based on the HTML code; the second constructs the tree based on visual cues. They consider that a generalized node is a combination of one or more adjacent tags that contain a data record. In Figure 2.2 the two data records are generalized nodes. These techniques use string edit distance to decide whether a pair of generalized nodes is similar or not. The number of comparisons is not very high, because only adjacent nodes are compared and some combinations are excluded from the comparison, since further nodes are a combination of the previously compared nodes. A data region is found when there is a collection of two or more generalized nodes that have the same parent and the same length, that are adjacent, and whose normalized string edit distance between adjacent generalized nodes is below a fixed threshold. In MDR, data records are contained in generalized nodes. Data records that are similar are extracted according to similarity functions and a fixed threshold. This approach extracts flat data records but does not extract their fields. DEPTA Zhai & Liu (2005) identifies data records in the same way as MDR, and in addition has a solution for crossed data records. It identifies crossed data records based on two observations: (i) there are crossed records when two or more generalized nodes contain different tag nodes with similar children; (ii) there are crossed records when there are two or more adjacent different data regions, each data region containing a part of the data records. After the data records are identified, a tree is constructed for each one and a partial tree alignment algorithm is used to join those trees. This way data records and their fields are aligned, allowing a correct extraction.

NET Liu & Zhai (2005) uses the Simple Tree Matching algorithm (STM) to detect similar data records Yang (1991). This technique is also able to detect nested data records because it performs bottom-up STM: it first detects the nested data records in the lower levels of the tree, and then the other data records. After that there is a back-trace of STM and the fields are aligned. In the end, tables are created with tuples from the aligned data record fields.

Hiremath & Algur (2010) also identify data regions using visual cues. They use the MSHTML tool to discover tag parameters on the web page, such as coordinates, width and height. Those parameters allow calculating the rectangle area that each tag occupies in a web page. In this work they assume that the data region is contained in the tag that occupies the largest area of the page. After the data region has been identified,
visual properties of the web page are used to get all rectangles from that region. The average area of the rectangles is calculated and all rectangles with an area smaller than the average are excluded, thus excluding all unnecessary information. Data records are identified as either flat or nested data records, based on the observation that flat records commonly have 40% of the fields of nested records. This technique only identifies data records and their type.

Álvarez et al. (2010) consider that each node in a tag tree has a score value. They divide text nodes with the same tag path from the root into different groups. Then, for each pair of nodes in the same group, the deepest common parent node is found and its score value is incremented. In the end, the node with the highest score is considered the root node of the data region. They consider, based on empirical results, that if two text nodes with the same tag path are in different data records, then the deepest common parent node of both is the data region node. However, if the two text nodes with the same tag path are in the same data record, the deepest common parent node will be deeper than the data region node. The data region node will be assigned the highest score with higher probability, based on the observation that there are more pairs of nodes from different data records than from the same data record. Then the data region is divided into candidate data record lists. The similarity between the sub-trees of the data region is computed and one cluster is assigned to each sub-tree. The clusters are joined according to their inter-similarity, based on an established threshold. Candidate record lists are formed based on the composition of the data records; this work considers that a data record starts or ends with a cluster's sub-tree. The list with the highest similarity between data records is chosen as the right one.

XmlE Nie & Yu (2010) is a technique based on XML encoding. The target HTML page is converted to XML and then the document is transformed into a linear sequence. Tuples indicating the order of appearance, the inverse order of appearance and the depth level of a node in the tag tree are assigned to tag tree nodes. Their algorithm detects regions by analysing these tuples and tag names, which is more efficient than using traditional tree similarity techniques. They only consider data records that present attributes in text form; as data records are detected, text nodes can easily be extracted from them. They assume that a web page can have multiple data regions and, as a consequence, different representations of data records.

IEPAD Chang & Lui (2001) and FastWrap Amin & Jamil (2009) are similar techniques. They convert an HTML file to a suffix tree Baeza-Yates & Ribeiro-Neto (1999). Through this tree they find the longest repeated pattern. The pattern is then refined and converted to a regular expression that extracts all text nodes of the pattern. The regular expression is further refined to have a valid HTML pattern. In IEPAD Chang & Lui (2001) the user has a graphical interface where he can choose the regular expression that gives the best results. IEPAD Chang & Lui (2001) only extracts data records, in contrast to FastWrap Amin & Jamil (2009), which detects data records and extracts their text fields.
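As an illustration of the tree-matching step used by NET (and, in partial-alignment form, by DEPTA), the following is a minimal sketch of the Simple Tree Matching algorithm Yang (1991) over a toy tag-tree node type. The Node class and the example trees are assumptions made for the sketch, not part of any of the cited systems.

import java.util.Arrays;
import java.util.List;

public class SimpleTreeMatchingSketch {

    // Minimal tag-tree node assumed for this sketch.
    record Node(String tag, List<Node> children) {
        static Node of(String tag, Node... children) {
            return new Node(tag, Arrays.asList(children));
        }
    }

    // Returns the size of the maximum matching between the two trees: 0 if the roots
    // differ, otherwise 1 plus the best matching of their child sequences, computed
    // by dynamic programming over the children.
    static int stm(Node a, Node b) {
        if (!a.tag().equals(b.tag())) {
            return 0;
        }
        int m = a.children().size(), n = b.children().size();
        int[][] dp = new int[m + 1][n + 1];
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                dp[i][j] = Math.max(Math.max(dp[i - 1][j], dp[i][j - 1]),
                        dp[i - 1][j - 1] + stm(a.children().get(i - 1), b.children().get(j - 1)));
            }
        }
        return dp[m][n] + 1;
    }

    public static void main(String[] args) {
        // Two table rows with slightly different structure, as in a product table.
        Node row1 = Node.of("tr", Node.of("td"), Node.of("td"), Node.of("td"));
        Node row2 = Node.of("tr", Node.of("td"), Node.of("td"));
        System.out.println("matching nodes: " + stm(row1, row2)); // prints 3
    }
}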

3.2 Machine Learning Techniques

Machine learning approaches consist in the construction of a wrapper, or something similar, based on training examples or on a set of pages. Usually, the training examples are web page samples labelled by users; we will call the techniques that use labelled training examples Supervised Machine Learning Techniques. The ones that do not receive labelled training data and learn from a set of web pages we will call Unsupervised Machine Learning Techniques. Stalker Muslea et al. (1998) constructs an automaton by creating rules from web page samples. Dalvi et al. (2009) use examples to construct a probabilistic model that decides on a robust wrapper for extraction. RoadRunner Crescenzi et al. (2001) generates a wrapper based on a set of web pages. Next, we describe how these techniques work.

Stalker Muslea et al. (1998) is given a training set composed of data record samples that contain relevant information, and some indexes that indicate where a data field starts and ends. The algorithm tries to create simple rules that can extract as many positive examples as possible. It tries to detect symbols and tags that appear in all examples and that precede the beginning of the field. As it creates the rules, it excludes the examples caught by those rules and the symbols already used, and tries to generate other rules with the remaining symbols to catch all remaining examples. The output is an automaton based on the created rules. The automaton is then used as a wrapper to extract data record fields: it parses HTML code until it finds a match for a rule of the automaton, and when a match occurs the data record fields are extracted.

SoftMealy Hsu & Dung (1998) considers that an HTML file is a sequence of tokens. It considers different classes of tokens, such as an upper-case string class, a lower-case string class, an HTML tag class, a punctuation class, etc. The rules to extract fields in a web page are composed of two separators, a left separator and a right separator, where a separator is a sequence of tokens. A field can be extracted by knowing the left separator (prefix) and the right separator (suffix). The algorithm tries to generalize the rules by aligning them. The rules are then tested and, if an incorrect field is extracted, the algorithm tries to make the rule more specific. A wrapper based on an automaton is generated, with states for extracting fields and states for skipping tokens, where the edges correspond to input tokens.

WIEN Kushmerick (2000) describes six classes of wrappers. Left Right (LR) is a wrapper that uses separators as rules, similarly to SoftMealy Hsu & Dung (1998); the algorithm tries to find the minimal prefix and suffix of a field to be extracted. Head Left Right Tail (HLRT) is the same as LR but avoids a lot of irrelevant information at the beginning and the end of a web page by using delimiters. Open Close Left Right (OCLR) uses delimiters to infer where a tuple begins and ends, reaching tuples directly for extraction and avoiding irrelevant information between tuples. Head Open Close Left Right Tail (HOCLRT) combines the functionalities of HLRT and OCLR. In Nested Left Right (N-LR) the principle is the same as in LR, but it is designed to also extract nested records: the wrapper extracts the tuple's fields sequentially, and it only extracts the next field if there are no more prefix occurrences of previous fields. Finally, Nested Head Left Right Tail (N-HLRT)
combines the functionalities of N-LR and HLRT.

WL2 Cohen et al. (2002) uses an extensible learning algorithm that generates Disjunctive Normal Form (DNF) expressions in which the primitive elements are predicates. Predicates are sets, so they can easily be manipulated in DNF expressions; basically, a predicate indicates whether an element is contained in a set. The algorithm receives as a training set a set of triples (Outer x, Scope x, InnerSet x), where Outer x usually corresponds to a web page, Scope x is the part of Outer x that a user has completely labelled, and InnerSet x is the set of all spans that should be extracted from Outer x. The algorithm also receives a set of builders. A builder is a learning component of the system that exploits a representation of the web page. We can add these components to the algorithm, which makes it extensible. For example, one builder considers a document as a sequence of tokens, another considers the document as a DOM tree. The builders implement two operations. The Least General Generalization (LGG) operation consists in finding the LGG of a training set, i.e. the smallest set of predicates that covers all example data to be extracted in the training set. The other operation, Refine LGG, consists in splitting an LGG into smaller sets that cover only part of the example data in the training set. There is a proposed builder for table extraction that relies on visual aspects of the tables. From the builder LGG, the algorithm creates DNF expressions for field extraction.

DEbyE Laender et al. (2002a) is based on examples chosen by a user through a graphical interface. The user chooses pieces of information to be extracted and puts them in a table. Based on these examples, the extractor component of DEbyE generates two types of patterns. An Attribute Value Pair Pattern (AVP-Pattern) is an extraction pattern based on the string prefix and string suffix of the selected piece of information that corresponds to a field for extraction; it is used to extract all fields that have an equal prefix and suffix. An Object Extraction Pattern (OE-Pattern) is the hierarchical structure of the user's selected example objects, which contains AVP-Patterns at the leaves. After the patterns are generated, there are two types of extraction. In Top-Down Extraction the objects are directly recognized by OE-Patterns and then broken into their components; this works better for web pages with little variation in their structure. Bottom-Up Extraction first uses the AVP-Patterns to extract the fields and then tries to generate the object structure from the fields; this works better for more complex web pages that have nested data records.

Zheng et al. (2009) receive a set of training pages that are converted to DOM trees by an HTML parser. Then, semantic labels of specific fields are manually assigned, through a graphical interface, to certain DOM tree nodes, to indicate their semantic functions. Based on these labels, the algorithm constructs a tag path from the root to the tag that encloses all labelled fields. This path is called a stick. The tag that encloses the fields is called a boundary record, and is adjacent to other boundary records. A wrapper is generated from the boundary record to the labelled fields, based on the tag paths. The resulting wrappers are aligned and marked with symbols that specify whether the fields are optional or occur more than once. The output is a set of wrappers, one for each type of data record identified from the labelled fields; they call these wrappers record-level wrappers. For each wrapper there is an associated stick. This approach also deals with
crossed data records: it identifies them when there are labelled fields from different data records in the same boundary record. A record-level wrapper library is generated to extract the data records in new pages, and data record fields are extracted according to the rules of those wrappers.

Dalvi et al. (2009) is based on a probabilistic model, used to infer future changes in web pages. Basically, the technique consists in choosing, through the probabilistic model, the most appropriate wrapper from a set of wrappers. The model is based on web page snapshots, i.e., the web page tag tree at a given moment, and tries to infer future web page snapshots. The technique considers the following components. Archival Data is a sequence of web page snapshots, which can be obtained by monitoring a set of web pages over time. The Model is a probabilistic model that gives a probability distribution over the possible next states of a web page given its current snapshot; it is specified by a set of parameters, each defining the probability of an atomic edit operation such as inserting, deleting or changing tags. The Model Learner takes the Archival Data as input and learns the Model that best fits the data, i.e. the parameter values that maximize the probability of observing the Archival Data. Training Data is a small subset of the set of pages of interest, along with labels that specify the value of the field to be extracted. The Candidate Generator takes the Training Data and generates a set of alternative wrappers, using some known wrapper generation techniques. The Robustness Evaluator takes the set of candidate wrappers, evaluates the robustness of each using the probabilistic model learned on the Archival Data, and chooses the most robust wrapper, i.e. the one that keeps working best on the predicted web page snapshots.

RoadRunner Crescenzi et al. (2001) is an Unsupervised Machine Learning Technique that tries to infer a common schema from a set of web pages and is able to deal with flat and nested data records. The algorithm deals with two pages at a time. It considers the first page a wrapper, if a wrapper has not been created yet, which is then aligned with the other pages of the set; in each iteration a more complete wrapper is created. The idea is to match tags until a mismatch is found, which is used to improve the wrapper. This technique considers two types of mismatches. String Mismatches are used to find fields with relevant information and descriptive fields: if the fields are equal, the field is considered descriptive, because descriptive fields such as title: or price: appear frequently repeated in the web page code; otherwise it is considered a field with information. Tag Mismatches are used to find iterators (tag repetitions) or optional tags. If it does not find iterators, it tries to find optional tags, by assuming that the optional tag appears either in the wrapper or in the other page; when the tag appears, it is added to or marked in the wrapper as optional. For iterators, it must find where the tag repetition begins and ends in the wrapper or in the training page, which is done by looking at the start and terminal tags in both. Once discovered, the wrapper's tag is marked with a symbol that indicates that the tag occurs more than once.
It uses the generated wrapper to extract all data records with fields aligned.
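To illustrate the separator idea shared by WIEN's LR wrappers, SoftMealy's left/right separators and DEbyE's AVP-Patterns, the following is a minimal sketch that extracts a field by scanning for a left delimiter (prefix) and a right delimiter (suffix). The delimiters and the HTML snippet are made up for the example and are not taken from any of the cited systems, which learn their separators from training examples.

import java.util.ArrayList;
import java.util.List;

public class LrWrapperSketch {

    // Extracts every substring found between the given left and right delimiters.
    static List<String> extract(String html, String left, String right) {
        List<String> fields = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = html.indexOf(left, from);
            if (start < 0) break;
            start += left.length();
            int end = html.indexOf(right, start);
            if (end < 0) break;
            fields.add(html.substring(start, end));
            from = end + right.length();
        }
        return fields;
    }

    public static void main(String[] args) {
        String html = "<li><b>Price:</b> $10</li><li><b>Price:</b> $12</li>";
        // A learned LR rule would supply these delimiters; here they are hand-picked.
        System.out.println(extract(html, "<b>Price:</b> ", "</li>")); // [$10, $12]
    }
}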

3.3 Semantic Model Techniques

This approach consists in data extraction based on semantic models, such as ontologies or semantic knowledge bases. In this work we consider both ontologies and semantic knowledge bases as semantic models, because both work through the semantics of entities. An ontology Guarino (1998) is a representation of knowledge through a set of concepts and the relationships between them; an ontology describes a certain domain through these concepts and relationships. Semantic knowledge bases, like YAGO, are repositories of entities and facts about them. Semantic models are used in some approaches to web data extraction. With these models we are able to infer tuples from a web page, automatically creating database tables. Next, we describe some of these techniques.

Embley et al. (1999) use ontologies that describe a domain. They describe entities, relationships, lexical forms, and context words. The ontology is parsed and database tables are constructed by automatic recognition and extraction of entities in a web page. The entities are recognized and extracted using their lexical form and context words; context words help to locate an entity and to understand its type. The ontologies are constructed manually for certain domains, but they are effective for multiple pages concerning those domains.

Limaye et al. (2010) label tables in such a way that each column has an associated type and there is a binary relationship between pairs of columns. The objective is to associate entities with table cells, types with table columns and relationships with column pairs, based on potential functions. As input it takes a semantic knowledge base, such as YAGO, and a table structure that is not totally labelled. Entities have a type that may be described in multiple ways; entities are instances of the type and occur in web pages. The algorithm consists in selecting entities, types and relationships for the table structure based on potential functions. These potential functions are based on some pre-defined probabilistic formulas and they give the probability of a given entity belonging to a cell, of a type belonging to a given column, or of a relationship existing between columns. The associations are made according to the highest probability. Newly collected data is used for learning new relationships between entities. We can see entities as attributes in a database table and a collection of entities with relationships between them as tuples.

3.4 Technique Overview

In this section we give an overview of the previously described techniques. The description focuses on some important features related to this work. We address the Input Complexity, i.e. the amount of information and user interaction needed to create the input of the extraction techniques, which can be provided through a SQL statement of a WebJDBC driver; the Output Format, i.e. the format of the output of each technique; Field Labelling, i.e. whether the extracted data of the techniques comes
already labelled or not; and the CPU Time, i.e. an average of the time that some techniques spend to extract data from web pages. Tables 3.1, 3.2 and 3.3 summarize these features for Autonomous Pattern Discovery, Supervised and Unsupervised Machine Learning, and Semantic Model Techniques, respectively.

Table 3.1: Autonomous Pattern Discovery Features - N/A: Not Applicable, L: Low, M: Medium, H: High, VH: Very High, ms: milliseconds, sec: seconds, min: minutes.

Features          MDR           DEPTA   NET              FastWrap  XmlE    IEPAD         Álvarez et al. (2010)  Hiremath & Algur (2010)
Input Complexity  L             L       L                L         L       L             L                      L
Output Format     data records  fields  database tables  fields    fields  data records  fields                 data records
Labelling         no            no      no               yes       no      no            no                     no
CPU Time Order    ms            ms      N/A              ms        N/A     N/A           ms                     N/A

3.4.1 Input Complexity

The complexity of the input is measured as the amount of information and/or user interaction necessary to create the input of the techniques. All of the previously described Autonomous Pattern Discovery Techniques receive only a single web page as input. This is a great advantage of this class of techniques because they simplify the task of the user and the complexity of the SQL query: only the URL of a web page needs to be given as a parameter of the SQL query. These techniques have a low Input Complexity. Machine Learning Techniques have a more complex input because they usually need labelled training examples. The input of Unsupervised Machine Learning Techniques, such as RoadRunner Crescenzi et al. (2001), is simpler than that of most Supervised Machine Learning Techniques, because they only need a set of similar web pages to generate a wrapper; a SQL statement would only need a set of web page URLs as input parameters. Supervised Machine Learning Techniques generate wrappers using labelled training examples, which require manual and cognitive effort in the process of choosing good examples and labelling them. The complexity of the input for these techniques is high because a user needs to know the structure of the page and create a set of examples to use the techniques. Semantic Model Techniques have the most complex input. The work of Embley et al. (1999) requires an ontology as input, which requires ontology expertise and great manual and cognitive effort. To use these techniques in a WebJDBC driver, a SQL query would need an ontology or a semantic knowledge base as a parameter, along with a semi-labelled table corpus as input. We conclude from this feature that the input complexity increases by technique class in the following (ascending) order: (i) Autonomous Pattern Discovery Techniques; (ii) Machine Learning Techniques; (iii) Semantic Model Techniques.

Table 3.2: Supervised and Unsupervised Machine Learning Features - N/A: Not Applicable, L: Low, M: Medium, H: High, VH: Very High, ms: milliseconds, sec: seconds, min: minutes.

Features          RoadRunner    Stalker  SoftMealy        DEbyE   WIEN     WL2           Zheng et al. (2009)  Dalvi et al. (2009)
Input Complexity  M             H        H                H       H        H             H                    H
Output Format     data records  fields   database tables  fields  fields   data records  fields               fields
Labelling         no            yes      yes              yes     yes      yes           yes                  yes
CPU Time Order    ms/sec        sec/min  N/A              ms/sec  sec/min  N/A           ms/sec               N/A

Table 3.3: Semantic Model Features - N/A: Not Applicable, L: Low, M: Medium, H: High, VH: Very High, ms: milliseconds, sec: seconds, min: minutes.

Features          Embley et al. (1999)  Limaye et al. (2010)
Input Complexity  VH                    VH
Output Format     database tables       database tables
Labelling         yes                   yes
CPU Time Order    N/A                   ms

3.4.2 Output Format

For most of the techniques, and in particular for the techniques that generate wrappers, the output format is a set of fields. We can see these fields, which are values extracted from web pages, as relational tuples. This is a positive feature, as we can easily group these tuples and create a relational table. Next, we describe the techniques that have a different output. MDR Liu et al. (2003), XmlE Nie & Yu (2010) and IEPAD Chang & Lui (2001) only extract data records, which implies that the data record fields still need to be extracted to create database tables. NET Liu & Zhai (2005), Embley et al. (1999) and Limaye et al. (2010) return the extracted data in database tables, which is the output format that we intend for the WebJDBC driver. The output of RoadRunner Crescenzi et al. (2001) is an integrated schema from multiple web pages that contains all data records aligned. Notice that it is common for the output of Autonomous Pattern Discovery Techniques to contain additional noise, i.e., the extracted fields do not represent exactly the values that are supposed to be extracted. This happens because they focus on the extraction of the text nodes of the HTML. Figure 3.5 shows examples of a clean text field (product name) and an unclean text field (price) in a data record HTML fragment.

3.4.3 Field Labelling

Autonomous Pattern Discovery Techniques do not label the output data, except Álvarez et al. (2010), which uses a heuristic labelling based on the fields that present the same value in all data records (descriptive fields). Supervised Machine Learning Techniques can easily label the output data by assigning labels to training examples that correspond to a field to be extracted. Zheng et al. (2009) allows the labelling of some field examples through a graphical interface. Semantic Model Techniques have the output data labelled because they work with the semantics of the data.

Figure 3.5: Data Record HTML fragment - The text with the product name contains only the product name (the clean field). The field that represents the price, Great Deal: $179.95, contains the extra text Great Deal:, which is noise.

3.4.4 CPU Time

Techniques spend time preparing and performing the data extraction. Here, the time cost of the techniques is analysed. Some works present, in their experimental results, the average time of their technique for learning examples and/or extracting data from web pages. Of course, we have to consider that this time depends on the CPU that is used and on the types and sizes of the web pages tested; this analysis only gives a rough view of this feature. The Autonomous Pattern Discovery works that report extraction times perform the extraction in the order of milliseconds. This is a low CPU Time, which is positive for users who would perform SQL queries through a WebJDBC driver using this kind of technique. RoadRunner Crescenzi et al. (2001) presents an average sum of learning and extraction times in the order of milliseconds in most tests, although there are some tests where this technique takes times in the order of seconds. The time cost of preparing examples for RoadRunner Crescenzi et al. (2001) is acceptable, since this process consists in selecting a number of similar pages. This time, combined with the processing time of the technique, makes it suitable for integration in the WebJDBC driver. Zheng et al. (2009) presents an average extraction time in the order of milliseconds in most of the tests and in the order of seconds in two other tests, which is a suitable order of time for use in a WebJDBC driver; however, they do not present the average time spent in the example learning/labelling process. DEbyE Laender et al. (2002a) presents an average time for example learning in the order of seconds. Stalker Muslea et al. (1998) and WIEN Kushmerick (2000) present average times in the order of minutes; the first also has one example where the average time is in the order of seconds, and the second also has cases where the average time is in the order of seconds, depending on which type of wrapper is used. Learning and/or extraction times in the order of seconds and minutes are very costly, because if a user of a WebJDBC driver performs a SQL query that uses multiple pages, the amount of waiting time will be extremely high. Finally, Limaye et al. (2010) presents an average extraction time in the order of milliseconds per table extracted. The remaining works do not present their average times.

Chapter 4

Solution Design and Implementation

Web data extraction opens the opportunity to create an extractor that can provide web data extraction services to a WebJDBC Driver. We decided to implement two important interfaces for the WebJDBC Driver - Extractor communication. One is the Model, a structure holding input provided by a SQL statement; the Model is used by the Extractor to create a set of parameters for the extraction. The other interface is the MetaDataCursor, which is a representation of a table for the WebJDBC Driver. The MetaDataCursor holds the logic for web data extraction: its implementation is responsible for creating connections to web pages and for the web data extraction itself. The WebJDBC Driver can request MetaDataCursors from the Extractor. In the next sections we look into the artefacts created during the solution development, their design and their purpose.

Figure 4.6: General Project Diagram - The WebJDBC Driver can request Model creation and MetaDataCursors for extraction, according to the input of a SQL query.

4.1 Extractor

To provide the integration of multiple techniques and the standardization of their use, the Extractor class was created with the objective of managing the techniques and the creation of technique inputs. The Extractor provides an interface that the WebJDBC driver can use to create some important technique parameters, which can help the web data extraction. This interface also allows the creation of MetaDataCursors, which encapsulate extraction techniques. We can see the code of the Extractor interface in Listing 4.1.

public class Extractor {
    public void createModel(RegexModel rmodel) { ... }
    public void createModel(SupervisedModel smodel) { ... }
    public MetaDataCursor<String[]>[] createFromName(String name) { ... }
    public MetaDataCursor<String[]>[] createFromUrl(String url) { ... }
}

Listing 4.1: The Extractor class selects the correct createModel method according to the given Model. MetaDataCursor creation is possible through the createFromName or createFromUrl methods, which create techniques using the given model name or the webpage URL.

Basically, the Extractor is composed of three important classes, ModelManager, TechniqueFactory and Technique, as we can see in Figure 4.7. ModelManager is responsible for the creation and storage of technique parameters; technique parameters are created using models provided by the WebJDBC driver. TechniqueFactory uses the created parameters to instantiate a Technique whenever the WebJDBC driver requests table(s) from a web page, i.e. MetaDataCursor(s). Finally, the Extractor uses the Technique to create an array of MetaDataCursor(s), which is the representation of the web page table(s) for the WebJDBC Driver. In the next sections these classes are described in more detail.

Figure 4.7: Extractor Diagram - The Extractor uses ModelManager to create and reuse TechniqueParameters. These parameters are used by TechniqueFactory to create a Technique, which provides implementations of the MetaDataCursor interface for the WebJDBC driver.
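As a usage illustration, the following sketch shows how the driver side might exercise the Extractor interface of Listing 4.1. The RegexModel constructor arguments and the element type returned by the cursor are assumptions, since they are not specified in the listings; the MetaDataCursor methods used (hasNext, next) are those shown in Figure 4.7.

public class ExtractorUsageSketch {
    public static void main(String[] args) {
        Extractor extractor = new Extractor();

        // Hypothetical model: a name, the page URL and some technique-specific input
        // (here a regular expression). The real RegexModel constructor is not shown
        // in this chapter, so this call is only illustrative.
        RegexModel model = new RegexModel("products",
                "http://www.example.com/products", "<td>(.*?)</td>");
        extractor.createModel(model);

        // One MetaDataCursor per table recognized in the page; rows are delivered
        // on demand through the cursor interface.
        MetaDataCursor<String[]>[] cursors = extractor.createFromName("products");
        for (MetaDataCursor<String[]> cursor : cursors) {
            while (cursor.hasNext()) {
                String[] tuple = (String[]) cursor.next();
                System.out.println(String.join(" | ", tuple));
            }
        }
    }
}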

Model Manager

This class provides an interface to create, store and retrieve technique parameters used in the creation of extraction techniques, as we can see in Listing 4.2. This separates the construction of technique parameters from the construction of techniques and makes it possible to reuse parameters in further technique constructions. This separation allows splitting the time spent in parameter construction from the time spent in technique extraction, and reusing parameters saves time. The creation of technique parameters can be seen as the creation of indexes for a database: indexes increase the performance of some SQL queries, and technique parameters have a similar function, increasing efficiency and efficacy by helping the creation of extraction techniques appropriate to the web page.

public class ModelManager {
    public void create(RegexModel regexModel) { ... }
    public void create(SupervisedModel supervisedModel) { ... }
    public TechniqueParameter getTechniqueParameterFromName(String modelName) { ... }
    public TechniqueParameter getTechniqueParameterFromUrl(String url) { ... }
}

Listing 4.2: The ModelManager class selects the create method according to the Model. The create methods use specific code to create and store TechniqueParameters; the storage is done using Java serialization. TechniqueParameters can be retrieved with the get methods.

To simplify the design of parameter creation, the Abstract Factory Pattern was chosen. This way all the construction logic of technique parameters is hidden in the ModelManager class, and we can always add new methods for the construction of new parameters. As we can see in Figure 4.8, ModelManager uses different types of models as input and returns a concrete parameter that implements the TechniqueParameter interface. The TechniqueParameter interface hides the implementation of a concrete parameter, allowing parameters to be processed in the same way by the TechniqueFactory.

To standardize the creation of models, the Model class was created with the objective of holding the necessary input for the creation of technique parameters. A Model is composed of a URL, which is the URL of the web page where the data extraction is required, a name that identifies the model, and other input concerning the extraction techniques, which is described later in this document. According to the model, the correct method is selected to create and store the technique parameters. After its construction, the parameter is stored in a file using Java object serialization; the file has the name of the model and is saved in a specific, configurable directory. Those parameters can then be retrieved by the TechniqueFactory to be used in the construction of techniques. The parameters can be retrieved by model name or by URL. If we choose to retrieve a parameter by model name, ModelManager has an index whose search key is the model name; this index is a Java HashMap and makes the search faster. Otherwise, if we choose to retrieve a parameter by URL, ModelManager has another index whose search key is the concatenation of a URL and a technique type identifier; this index is a Java HashMap mapping the search key to the name of the file that holds the technique parameter.
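A minimal sketch of the storage and indexing scheme just described, assuming standard Java serialization and a simple key format; the class, field and method names below are illustrative, not the thesis implementation:

import java.io.*;
import java.util.HashMap;
import java.util.Map;

// Sketch of the two ModelManager indexes: model name -> parameter file,
// and (url + technique type) -> parameter file.
class ParameterIndex {
    private final Map<String, String> byName = new HashMap<String, String>();
    private final Map<String, String> byUrlAndType = new HashMap<String, String>();
    private final File directory;   // configurable storage directory

    ParameterIndex(File directory) { this.directory = directory; }

    void store(String modelName, String url, String techniqueType, Serializable parameter)
            throws IOException {
        File file = new File(directory, modelName + ".param");
        ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file));
        try { out.writeObject(parameter); } finally { out.close(); }
        byName.put(modelName, file.getName());
        byUrlAndType.put(url + "#" + techniqueType, file.getName());
    }

    Object loadByUrl(String url, String techniqueType)
            throws IOException, ClassNotFoundException {
        String fileName = byUrlAndType.get(url + "#" + techniqueType);
        if (fileName == null) return null;
        ObjectInputStream in = new ObjectInputStream(
                new FileInputStream(new File(directory, fileName)));
        try { return in.readObject(); } finally { in.close(); }
    }
}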

Figure 4.8: ModelManager Diagram - ModelManager uses a Model of one of the different technique types to create a TechniqueParameter. The models are known and provided by the WebJDBC driver, which creates them using the input provided by the SQL statements.

Technique Factory

There was the need for a systematic and dynamic process of technique creation, and TechniqueFactory was created to fulfil it. This class provides an interface to create techniques from a TechniqueParameter, which allows holding all the logic of technique construction in this class. The interface specification can be seen in Listing 4.3. Given the TechniqueParameter, TechniqueFactory creates one of the techniques defined in its methods. By default, i.e. if there is no specific parameter to use, TechniqueFactory creates a technique that requires only a URL as parameter (such as an Autonomous Pattern Discovery Technique).

public class TechniqueFactory {
    public Technique create(TechniqueParameter tParam) { ... }
    private Technique create(RegexParameter param) { ... }
    private Technique create(SupervisedParameter param) { ... }
    private Technique createDefault(String url) { ... }
}

Listing 4.3: TechniqueFactory class sample - For each type of technique parameter there is a method that creates a technique according to that parameter. This makes the technique creation process dynamic.

The Abstract Factory Pattern is used to hide all the construction logic of techniques in the TechniqueFactory class, which makes it easier to add new methods for the construction of new types of Techniques. As we can see in Figure 4.9, TechniqueFactory takes a TechniqueParameter as input and returns a concrete technique based on the TechniqueParameter implementation. The Technique interface hides the implementation of a concrete technique, allowing techniques to be processed in the same way.
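The dispatch from the public create method to the private overloads is not shown in Listing 4.3; one plausible way to write it, assuming the parameter classes of Figure 4.8 and an assumed getUrl accessor on DefaultParameter, is an instanceof test (a sketch, not the actual implementation):

public Technique create(TechniqueParameter tParam) {
    // Dispatch on the concrete parameter type.
    if (tParam instanceof RegexParameter) {
        return create((RegexParameter) tParam);
    }
    if (tParam instanceof SupervisedParameter) {
        return create((SupervisedParameter) tParam);
    }
    // No specific parameter: fall back to the default technique, which only
    // needs the URL of the web page (getUrl() is an assumed accessor).
    return createDefault(((DefaultParameter) tParam).getUrl());
}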

Figure 4.9: TechniqueFactory Diagram - Technique parameters are used by TechniqueFactory to instantiate techniques.

Technique

Technique is an interface (Listing 4.4) that all extraction techniques must implement because it allows other classes, such as the Extractor, to initialize or configure the techniques. That logic is kept separate from the logic of the MetaDataCursor implementations, an interface that some techniques implement themselves or for which they are able to create implementations. For example, the MetaDataCursor interface does not allow the Extractor to define configurations for techniques; such configurations should be done at the level of the Extractor, which is the class that should hold them. MetaDataCursor is the interface that should be used by the WebJDBC driver once all initializations and configurations of the technique are done.

public interface Technique {
    MetaDataCursor<String[]>[] extract();
}

Listing 4.4: Technique interface - This interface allows techniques to be initialized and configured.

To allow the Extractor to initialize or configure techniques in the same way, independently of the implementation of each technique, the Strategy Pattern was used in the design of Technique. See Figure 4.10.

Figure 4.10: TechniqueStrategy Diagram - The Technique interface allows the Extractor to initialize and configure a technique in the same way, independently of the Technique implementation.

4.2 Techniques

Techniques are classes that hold all the logic of web data extraction. That logic must be exposed through an interface appropriate for the WebJDBC driver, which is MetaDataCursor. This way we can do the extraction by parts, extracting only one row of a web data table every time the WebJDBC driver requests a row of data. The advantage is that we do not need to perform the whole extraction first and only then process all the results; instead, we try to use the results at the same time as we extract them, performing pipelining. However, some techniques may require doing the whole extraction at once due to their characteristics. Multiple kinds of techniques were implemented in this work: the Autonomous Technique, the RegexTechnique and the SupervisedTechnique. In this section we describe each of these techniques, their design and implementation, and also how the technique parameters are created for each technique.

Autonomous Technique

This technique is an Autonomous Pattern Discovery Technique. It is based on the work of Liu et al. (2003) regarding data region and data record identification. Basically, the technique detects repeated patterns in a web page, chooses one or more data regions from the web page and extracts the data records from them. The only parameter this technique needs is a URL, because the technique is unsupervised and autonomous, i.e. it does not need any input or intervention from the user to extract web data. So the technique parameter for this technique consists only of a URL, which is provided in a Model generated by the WebJDBC driver.

Figure 4.11: AutonomousTechnique Diagram - This technique is composed of many classes that help the web data extraction, most of them configurable. Initially, HtmlDomParser is used to convert the web page into Java DOMTree format. Then DocumentVisitor is used to search the DOMTree for data regions and constructs a tree with the structure of the web page, filled with the found data regions. DataRegionSelector is used to select one or more data regions from the data region tree. Finally, RecordExtractor is responsible for the extraction of the data records from the data regions.

This technique is composed of the HtmlDomParser, DocumentVisitor, DataRegionSelector and RecordExtractor classes, as we can see in Figure 4.11. The technique has well defined extraction steps that use those classes. These steps are isolated, so we can change the implementation of the classes of one step without great impact on the other steps. This allows the technique to be fully configured to achieve different extraction results. A sample of the code that represents the steps of web data extraction, using these classes, is shown in Listing 4.5. Java design patterns were used in the design of these classes, and we now explain how they were applied.

public class AutonomousTechnique implements Technique {

    private String url;
    private DataRegionSelector drSelector;
    private RecordExtractor recordExtractor;
    private DocumentVisitor<Tree<DataRegionInfo>> docVisitor;
    private HtmlDomParser parser;

    public MetaDataCursor<String[]>[] extract() {

        // Step 1: convert the web page into a DOM tree.
        HtmlConverter htmlConverter = new HtmlConverter(url);
        htmlConverter.setDomParser(parser);

        // Step 2: visit the DOM tree and build a tree of data region information.
        new XNode(htmlConverter.convertToDOM()).accept(docVisitor);
        Tree<DataRegionInfo> tree = docVisitor.getResult();

        // Step 3: select the data regions of interest.
        List<DataRegion> dr = selectDR(tree);
        Iterator<DataRegion> it = dr.iterator();

        // Step 4: extract the data records of each selected data region.
        MetaDataCursor<String[]>[] results = new MetaDataCursor[dr.size()];
        for (int i = 0; it.hasNext(); i++) {
            results[i] = extractR(it.next());
        }
        return results;
    }

    private List<DataRegion> selectDR(Tree<DataRegionInfo> tree) {
        return drSelector.select(tree);
    }

    private MetaDataCursor<String[]> extractR(DataRegion dr) {
        return recordExtractor.extract(dr);
    }
}

Listing 4.5: AutonomousTechnique class sample - Here we can see all the classes that compose the Autonomous Technique being used to obtain the MetaDataCursors that will be used by the WebJDBC Driver.

51 4.2. TECHNIQUES 31 Tree<DataRegionInfo> «interface» DocumentVisitor<T> +visitnode() +visitnodes() +getresult() : <unspecified> DRTreeVisitor «interface» DVElement +accept() XNode «interface» Document/Node 1..* «interface» DataRegionIdentifier +identify() : <unspecified> 1 MDRIdentifier DataRegion GeneralizedNode 1 1..* - Classes that implement Strategy Pattern - Classes that implement Visitor Pattern - Classes that implement Decorator Pattern Figure 4.12: DocumentVisitor Diagram - DocumentVisitor was created to visit nodes of a DOMTree. DOMTree nodes had to be encapsulated by XNode to be visited and processed. MDRIdentifier has all the logic to process the XNodes, i.e to find data regions in each node. DOM API, to be visited and processed. For this purpose, XNode was created to encapsulate the Node class, adding new capabilities allowing the Java classnode to be visited and processed by DocumentVisitor. DataRegionIdentifier was created to apply Strategy Pattern in the identification of Data Regions step. Strategy Pattern is used in the processing of nodes to allow the change of type of processing be easier and have few impact in the technique implementation. Our implemented strategy consists in finding repeated patterns, to identify whether a node is a data region or not. This process can be done by techniques of similarity as string edit distance or tree similarity. For this, nodes are converted in text and compared. LingPipe API is used to compute similarity between nodes 1. Our implementation of DataRegionIdentifier, which is MDRIdentifier, is based on MDR Liu et al. (2003). We can see in Listing 4.6 the code of MDRIdentifier and a little explanation of its algorithm. 1 Visit to learn more details about LingPipe API

52 32 CHAPTER 4. SOLUTION DESIGN AND IMPLEMENTATION 1 public c l a s s MDRIdentifier implements D a t a R e g i o n I d e n t i f i e r { 2 3 public ArrayList<DataRegion> i d e n t i f y (Node node ) { 4 ArrayList<DataRegion> drs = new ArrayList<DataRegion > ( ) ; 5 DataRegion dr = n u l l ; 6 GeneralizedNode gn1, gn2, gnlast = n u l l ; 7 NodeList c h i l d s = n u l l ; 8 i n t maxgnlength = 0 ; 9 i n t childslength = 0 ; 10 i n t maxcompares = 0 ; i f ( node. haschildnodes ( ) ) { 13 c h i l d s = node. getchildnodes ( ) ; 14 childslength = c h i l d s. getlength ( ) ; 15 maxgnlength = childslength / 2; 16 } 17 f o r ( i n t n=0;n<maxgnlength ; n++) { 18 f o r ( i n t s t a r t =0; s t a r t <=n ; s t a r t ++){ 19 maxcompares = ( ( childslength s t a r t ) / ( n+1)) 1; 20 dr = new DataRegion ( node, n + 1 ) ; 21 f o r ( i n t i = s t a r t, j =0; j <maxcompares ; i = i +n+1, j ++) { 22 gn1 = new GeneralizedNode ( 23 NodeUtils. getnodes ( childs, i, i +n ) ) ; 24 gn2 = new GeneralizedNode ( 25 NodeUtils. getnodes ( childs, i +n+1, i +n * ) ) ; 26 i f (gn1. s i m i l a r (gn2, threshold ) ) { 27 dr. add (gn1 ) ; 28 gnlast = gn2 ; 29 } 30 e l s e i f ( gnlast!= n u l l ) { 31 dr. add ( gnlast ) ; 32 drs. add (dr ) ; 33 gnlast = n u l l ; 34 dr = new DataRegion ( node, n + 1 ) ; 35 } 36 } 37 i f ( gnlast!= n u l l ) { 38 dr. add ( gnlast ) ; 39 drs. add (dr ) ; 40 gnlast = n u l l ; 41 } 42 } 43 } 44 return drs ; 45 } 46 } Listing 4.6: MDRIdentifier Class Sample - This class identify all Data Regions in a node. First, we need to get all child nodes of the candidate node for Data Region. The first for instruction purpose is to find all possible combination of nodes that composes a Generalized Node, which is a child of a Data Region. Generalized nodes are composed by one or more nodes and are adjacent with other similar Generalized Nodes. The second for instruction purpose is to make node comparisons starting from different nodes. The third for instruction purpose is to compare all nodes beginning in one specific node position. The comparisons are always made with adjacent nodes, and for each set of nodes that are similar, a data region is created and added to a list.

53 4.2. TECHNIQUES 33 Finally, Strategy Pattern is also applied with the DocumentVisitor interface, which allows the developer create other implementations to visit and process DOMTree nodes. At this level implementations can be based on the structure of the DOMTree and how to visit it. DataRegionSelector was implemented by using strategy pattern allowing Tree<DataRegionInfo> «interface» DataRegionSelector +select() : List<DataRegion> List<DataRegion> 1..* MoreTupleSelector 1 DataRegion DataRegion Document/Node 1..* * GeneralizedNode «interface» RecordExtractor +extract() : MetaDataCursor[] «interface» MetaDataCursor +next() : object +hasnext() : bool +reset() : void TextFieldRExtractor - Classes that implement Strategy Pattern Figure 4.13: DataRegionSelector Diagram - Multiple implementations of DataRegionSelector can be created to select data regions. For this, is possible analysing all information provided by a tree that contains data regions info. - Classes that implement Strategy Pattern Figure 4.14: RecordExtractor Diagram - Multiple implementations of RecordExtractor can be created to extract tuples from data records in different ways. multiple implementations for the process of Data Region selection. MoreTupleSelector is one of those implementations and selects the Data Regions with more Data Records. That is an heuristic of choosing the region with more repeated patterns because probably is a region with a table full of data records. At this level we can make implementations for Data Region selection, based on Data Region location in the DOMTree, the number of data records, the name of the tag of the DOMTree Node, e.g table or div etc. Record Extractor is an interface for implementations that extract data from data records. Strategy Pattern is applied. We can have multiple implementations at this level, based on extraction of text fields or tag attributes. Data cleaning is possible too at this level, by having some implementation that clean data during extraction. Our implementation,textfieldextractor, extract all text fields from data records. Normally, the content of tables from web pages are text fields Regex Technique In the Chapter 1 we mentioned ad-hoc solutions for webdata extractions. To provide this possibility for the users of the WebJDBC driver, the Regex Technique was developed. This technique uses regular expressions to extract data from web pages. Basically, for each field, which we need to extract, is assigned a correspondent regular expression. In this section we explain how the Model is created for Regex Technique and the technique design

54 34 CHAPTER 4. SOLUTION DESIGN AND IMPLEMENTATION and implementation. The Regex Technique parameters are created by using Java Pattern class. Pattern class allows the compilation of regular expressions that are in string format. This way we can create a Java Map with fields associated to each compiled regular expression. This Map can be used further for other extraction requests. We can see this logic in Listing HashMap<String, Pattern> compiledregexs = new HashMap<String, Pattern > ( ) ; 2 Map<String, String > regexs = (Map<String, String >)regexmodel. getparameters ( 3 Model. REGEXS ) ; 4 I t e r a t o r <String > l a b e l s = regexs. keyset ( ). i t e r a t o r ( ) ; 5 S t r i n g l a b e l ; 6 while ( l a b e l s. hasnext ( ) ) { 7 l a b e l = l a b e l s. next ( ) ; 8 compiledregexs. put ( l a b e l, Pattern. compile ( regexs. get ( l a b e l ) ) ) ; 9 } Listing 4.7: Regex Compile Code Sample - The strings are provided and compiled by this code. The compiled regular expressions are stored in a Java Map. This technique has low complexity in the extraction process, which difficult a modular development. All logic of extraction is on RegexTechnique. This class implements Technique and MetaDataCursor interfaces as we can see in figure The extraction is performed by applying the regular expressions on the web page. This is done in 3 main steps. First we use HtmlConverter class to get the HTML code of an web page, in raw string format. Then Java Matcher class is used to create matchers for each field by using the compiled patterns. Finally, the matchers are used to extract field by field. We can see the logic of creating matchers and finding fields on a web page in Listing 4.8. «interface» Technique +extract() : MetaDataCursor[] «interface» MetaDataCursor +next() : object +hasnext() : bool +reset() : void RegexTechnique +extract() : MetaDataCursor[] HtmlConverter +converttostring()() : string Figure 4.15: RegexTechnique Diagram - RegexTechnique uses HtmlConverter to obtain the HTML code string for extraction and can be used by WebJDBC driver as a MetaDataCursor.

55 4.2. TECHNIQUES 35 1 p r i v a t e void creatematchers (Map<String, Pattern> p a t t e r n s ) { 2 matchers = new HashMap<String, Matcher> ( ) ; 3 l a b e l s = p a t t e r n s. keyset ( ). toarray (new S t r i n g [ p a t t e r n s. keyset ( ). s i z e ( ) ] ) ; 4 S t r i n g l a b e l ; 5 f o r ( i n t i = 0, len = l a b e l s. length ; i<len ; i ++) { 6 l a b e l = l a b e l s [ i ] ; 7 matchers. put ( l a b e l, p a t t e r n s. get ( l a b e l ). matcher ( stream ) ) ; 8 } 9 } 10 p r i v a t e boolean find ( ) { 11 Matcher matcher = n u l l ; 12 boolean e x i s t s = f a l s e ; 13 S t r i n g [ ] row = new S t r i n g [ value. length ] ; 14 f o r ( i n t i =0, length= l a b e l s. length ; i<length ; i ++) { 15 matcher=matchers. get ( l a b e l s [ i ] ) ; 16 i f ( matcher. find ( ) ) { 17 row [ i ]= matcher. group ( ) ; 18 e x i s t s =true ; 19 } 20 e l s e { 21 row [ i ]=new S t r i n g ( ) ; 22 } 23 } 24 r e s u l t. addrow( row ) ; 25 return e x i s t s ; 26 } Listing 4.8: RegexTechnique Class Sample - Matchers are created using the compiled patterns. Then, the matchers process the HTML stream to extract the data.

56 36 CHAPTER 4. SOLUTION DESIGN AND IMPLEMENTATION Supervised Technique To represent the class of the Supervised Machine Learning Techniques we developed the Supervised Technique. This allows the users have the option of choosing the data they want to extract by providing examples. Technique is supervised because the extraction is based on learning of labelled examples. The implementation of this technique was based on work Hsu & Dung (1998). Examples are created from one example file, which is used to create a state machine that is able to extract data from webpages. Examples are composed by tokens. The state machine use these tokens to match and extract the data from the webpage. Before of this technique description, some concepts are described for better comprehension of the technique. Literal : This interface represents special classes for text tokens. These special classes are represented by one or more characters. There are two different groups of special classes. Word class that represents all sequence of characters, which are letters. NonWord class that represents punctuation, HTML tags, control characters, e.g. newline, tab, space and others. Converting the HTML code in a sequence of these literal classes helps the generalization of some tokens. Generalization means that the extraction is also based on the classes of the literals and not only on the exact match of extracted literals from the web page. Separator : Separator is composed by a sequence of literals. This class delimits fields. Every field has a left Separator and a right Separator. This way we can find and extract those fields on a web page. Field : This class is composed by a left Separator, a right separator and a label. The label indicates the meaning of the field. Each field has a position, which is the position of that Field in a tuple (set of fields similar to a row of a database table). Example : This class is composed by a sequence of fields. This class represents an example of a tuple to extract from a web page sample. Examples can have different sequences of the same fields and is used to create a state machine. FieldState : This class represents a state responsible for extraction of a field from a web page. Field State is composed by a field and use the two separators to perform the extraction. Position is used to place the extracted field on a correct position in the tuple. TupleState : This class represents a state responsible for extraction of fields until a complete tuple is extracted. TupleState is composed by FieldStates and use them to extract the next field. Once a field is extracted TupleState is changed because it is necessary extract the next field of the same tuple. ExtractionContext : This class use TupleState to extract fields and save them to construct a tuple. Supervised parameters are created by using an example file. Figure 4.16 shows an example of these files. This example file is composed by a MetaData section, which is an

57 4.2. TECHNIQUES 37 Figure 4.16: This is an example file that the Supervised technique uses to create the state machine. 1-Label that is associated to a field, 2-Left separator of the field., 3-NonWord Literal (HTML Tag), 4-Word Literal, 5-NonWord Literal (Spaces, newlines), 6-Position that the field ocupies in a tuple. This position indicates which label in MetaData is associated to the field.,7- Right separator of the field. ordered sequence of labels of the fields, and by a sequence of examples, which are composed by some web pages samples. The examples are extracted from the file by using a parser, which is generated by a JavaCC grammar. The parser creates an object, Data, that has the labels information and the extracted examples. Then the examples are converted in a state machine by using the TupleStateFactory class. The SupervisedTechnique implements Technique and MetaDataCursor interfaces. This technique works as an wrapper. The technique is composed by an Extraction Context, which is composed by a TupleState. TupleState has access to Extraction Context variables and can change its state. This is a similar approach of State Pattern. This way ExtractionContext retrieves fields from TupleState and construct tuples. The change of state is abstracted from the ExtractionContext because TupleState has that logic and the next states. ExtractionContext uses TupleState to extract fields. See figure Extraction Context uses the extracted fields to construct a tuple. Whenever a field is extracted, ExtractionContext get its position in the tuple and place it there. In the end a tuple is extracted and TupleState is reseted, i.e. TupleState return to initial state. The extraction continues until there is no sufficient input to construct a tuple.


More information

XML Clustering by Bit Vector

XML Clustering by Bit Vector XML Clustering by Bit Vector WOOSAENG KIM Department of Computer Science Kwangwoon University 26 Kwangwoon St. Nowongu, Seoul KOREA kwsrain@kw.ac.kr Abstract: - XML is increasingly important in data exchange

More information

A Vision Recognition Based Method for Web Data Extraction

A Vision Recognition Based Method for Web Data Extraction , pp.193-198 http://dx.doi.org/10.14257/astl.2017.143.40 A Vision Recognition Based Method for Web Data Extraction Zehuan Cai, Jin Liu, Lamei Xu, Chunyong Yin, Jin Wang College of Information Engineering,

More information

Istat s Pilot Use Case 1

Istat s Pilot Use Case 1 Istat s Pilot Use Case 1 Pilot identification 1 IT 1 Reference Use case X 1) URL Inventory of enterprises 2) E-commerce from enterprises websites 3) Job advertisements on enterprises websites 4) Social

More information

Form Identifying. Figure 1 A typical HTML form

Form Identifying. Figure 1 A typical HTML form Table of Contents Form Identifying... 2 1. Introduction... 2 2. Related work... 2 3. Basic elements in an HTML from... 3 4. Logic structure of an HTML form... 4 5. Implementation of Form Identifying...

More information

Handling Irregularities in ROADRUNNER

Handling Irregularities in ROADRUNNER Handling Irregularities in ROADRUNNER Valter Crescenzi Universistà Roma Tre Italy crescenz@dia.uniroma3.it Giansalvatore Mecca Universistà della Basilicata Italy mecca@unibas.it Paolo Merialdo Universistà

More information

Exploring Information Extraction Resilience

Exploring Information Extraction Resilience Journal of Universal Computer Science, vol. 14, no. 11 (2008), 1911-1920 submitted: 30/9/07, accepted: 25/1/08, appeared: 1/6/08 J.UCS Exploring Information Extraction Resilience Dawn G. Gregg (University

More information

A Comparative Study of Selected Classification Algorithms of Data Mining

A Comparative Study of Selected Classification Algorithms of Data Mining Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 6, June 2015, pg.220

More information

Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis. Ying Shen, SSE, Tongji University Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN: IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T

More information

Chapter 2. Architecture of a Search Engine

Chapter 2. Architecture of a Search Engine Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them

More information

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.

More information