Experimenting with Open Data


Experimenting with Open Data
Aybüke Öztürk
August, 2013
Master's Thesis in Computing Science, 15 credits
Under the supervision of: Asst. Prof. Henrik Björklund, Umeå University, Sweden; Herve Dejean, Xerox Research Centre Europe, France
Examined by: Dr. Jerry Eriksson, Umeå University, Sweden
Umeå University, Department of Computing Science, Umeå, Sweden


Abstract

Public (open) data are now provided by many governments and organizations. Access to them can be made through central repositories or applications such as Google Public Data. On the other hand, usage is still very much human oriented: there is no global data download, the data need to be selected and prepared manually, and the data format needs to be decided. The aim of the Experimenting with Open Data project is to design and evaluate a research prototype for crawling open data repositories and collecting the extracted data. A key issue is to be able to automatically collect and organize data in order to ease their re-use. Our scenario here is not searching for a single, specific dataset, but downloading a full repository to see what we can expect, automate, extract, and learn from this large set of data. The project involves conducting a number of experiments to achieve this.


Contents

Abstract
List of Figures
List of Tables
1 Introduction
  1.1 General Problem Statement
  1.2 Outline of the Thesis
2 Web Data Extraction
  2.1 Wrapping a Web Page
  2.2 Web Data Extraction Tools - Previous Works
3 Our Proposed Method For Automatic Data Extraction
  3.1 The Nature of Open Data Websites
  3.2 Automatic Extraction for Open Data
    3.2.1 Pagination Detection
    3.2.2 List Detection
  3.3 Data Formats in Open Data Websites
4 Conclusion and Future Work
  4.1 Conclusion
  4.2 Future Work
5 Acknowledgements
Bibliography


List of Figures

1.1 The Open Data Icons
1.2 The Experiment with Open Data Project Architecture
3.1 The US Government Open Data Website
3.2 The UK Government Open Data Website
3.3 The Kenya Government Open Data Website
3.4 The Pagination Web Design Examples
3.5 The Example HTML Pagination Structure-1
3.6 The Example HTML Pagination Structure-2
3.7 The Example DOM Tree for Pagination and List Structure-1
3.8 The Example DOM Tree for Pagination and List Structure-2
3.9 The Indian Government Open Data Website Data Format
3.10 One of The US Government Open Data Website Data Format
3.11 The Indian Open Data Website Data Format in Download Page

List of Tables

3.1 Example list for Government and Organization Websites


Chapter 1 Introduction

1.1 General Problem Statement

Users obtain Web data either by browsing the Web or by keyword search. Both strategies have numerous limitations. For instance, browsing does not locate a particular item of data, and it is easy to get lost while visiting uninteresting links. At the same time, keyword search often returns a huge amount of data far from what the user is looking for. Consequently, Web data cannot be manipulated properly even though it is publicly and readily available. For a long while, researchers have tried to apply traditional database techniques; however, those techniques require structured data, which Web data rarely provides.

A traditional approach for Web extraction is to write specialized programs called Wrappers. A Wrapper identifies Web data, for example using mark-up, in-line code, or navigation hints, and maps it to a suitable format such as XML or relational tables [1]. After this traditional approach, many tools were proposed to improve the generation of Web data extraction programs. Such tools are based on several distinct techniques, for instance declarative languages, HTML structure analysis, natural language processing, machine learning, data modelling, and ontologies [1].

Increasingly large amounts of Web data are being published with the aim of interoperability. However, Web data is rarely made available in a manner that makes it readily usable, because licenses are required that make explicit the terms under which the data can be used. By explicitly granting permissions, the grantor reassures those who may wish to use their data, and takes a conscious step to increase the pool of data available to the Web [5].

Open source is an interesting and demanded concept in both the commercial area and the academic sector. For instance, both research reports and the data produced by research are required by some funders and organizations, e.g. Creative Commons, to be made easily available for re-examination.

The Science Commons project is one of their projects and has received particular interest. In the meantime, only a small number of projects, e.g. OpenStreetMap, have been created in which data can be used and reused. According to [5], it is demanding to create and access such data with traditional models.

Open data was defined by Bizer et al. [12]. They describe open data as the idea that certain data should be openly available for everyone to use and republish, without restrictions from copyright, patents, or other mechanisms of control. The Web icons used for open data are shown in Figure 1.1. The term "open data" itself is not new, but its popularity has grown with the rise of the Internet and the Web and, especially, with the open data websites of governments and organizations. These websites are built using text-based mark-up languages, e.g. HTML, and often contain a wealth of useful data in different forms. On the other hand, most of those websites are designed for human end-users and not for ease of automated use.

Figure 1.1: The Open Data Icons

The goal of the Experimenting with Open Data project is to design and evaluate a research model for obtaining an open data repository and storing it in a way appropriate for re-use. This report mainly presents the issues we came across while crawling open data. The motivation for open data extraction is as follows: open data makes a lot of data available on the Web, and nowadays a huge amount of the information available on websites is coded in the form of HTML documents. The Web contains an enormous quantity of information that is usually formatted for human users, which makes it difficult to extract relevant content from various sources. In other words, the usage of public data is very much human oriented, and automatic data collection systems are not well suited to government and organization websites.

A critical issue of open data extraction is that websites have very heterogeneous layouts, e.g. they consist of tables, lists, images, etc. The first substantial question is how we can automatically locate a large number of Web pages that are structured. The second question is whether it is feasible to generate a large database from these pages. Based on these questions, the steps defined in this project are:

- automatically finding and collecting lists of items that contain data;
- storing the obtained data in an appropriate format and mining them.

The architecture of the Experimenting with Open Data project is shown in Figure 1.2. A list of websites is given as input, e.g. the main URLs of government and organization websites. Each URL is passed to a parser in order to extract all URLs in the given website. After collecting the URLs, each one is sent to the pagination detection algorithm, which detects whether the URL has a pagination structure or not. The detected pagination structures are then used to locate the list-of-data structure and to gather the connected pages that are associated with the same list of data. This information is used by the list detection algorithm to identify the record sequence and extract the data. The obtained data are recorded in a database for analysis and evaluation. Detailed information about the work-flow is given in Chapter 3.

Figure 1.2: The Experiment with Open Data Project Architecture
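As a concrete illustration of the first stages of this work-flow (taking a seed URL and extracting all hyper-links of the website), the following is a minimal sketch, assuming Python with the requests and beautifulsoup4 packages; the seed URL is only an example.

```python
# Sketch of the "parser" stage in Figure 1.2: fetch a seed page and collect
# every hyper-link it contains as an absolute URL. Assumes the requests and
# beautifulsoup4 packages are installed.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def collect_links(seed_url):
    """Return the absolute URLs of all hyper-links found on seed_url."""
    html = requests.get(seed_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return sorted({urljoin(seed_url, a["href"]) for a in soup.find_all("a", href=True)})

if __name__ == "__main__":
    # Example seed: the US government open data website mentioned above.
    for link in collect_links("https://www.data.gov/"):
        print(link)
```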

1.2 Outline of the Thesis

This report is organized as follows. Chapter 2 presents existing techniques for Web data extraction and discusses various Web data extraction tools. Chapter 3 describes the nature of open data websites, the pagination detection strategy, the issues encountered while extracting pagination structures, the list detection strategy, the implementation of our automatic data extraction model, and the data formats found in open data websites. Chapter 4 gives the conclusion, the future work, and the natural next step for this project. Finally, Chapter 5 contains the acknowledgements.

Chapter 2 Web Data Extraction

In this chapter, we present traditional methods for Web data extraction and previous research on Web data extraction tools.

2.1 Wrapping a Web Page

Several systems have been implemented for extracting data from the Web. As mentioned in the introduction chapter, the traditional means of data extraction is the Wrapper; in much of the early research such programs are also called "Extractors" [6]. According to [4], data is extracted from semi-structured or unstructured Web sources by algorithms that seek out the data required by users. These data are transformed into structured data and merged for further processing, either semi-automatically or fully automatically. This is, however, the most basic way of extracting data from the Web. According to [4], several systems were created based on this method, such as Stalker [13] and WIEN [14], which are not automatic Web extraction tools. The important question is how to generate Wrappers automatically: later research showed that hand-crafted Wrappers are expensive and do not scale, because too much human effort is needed to determine which instructions are required to examine each page [8].

E. Schlyter [8] characterizes Web Wrappers by three different steps. The first step is Wrapper generation, in which a Wrapper is defined using techniques such as regular expressions over the HTML documents. The second step is Wrapper execution, in which the information is extracted continuously by the Wrapper, for example using an inductive or a hybrid approach; the inductive approach needs highly automated strategies, whereas the hybrid approach runs Wrappers semi-automatically.

The last step is Wrapper maintenance: if the structure of the data source changes, the Wrapper should be updated so that it keeps working properly. Such changes may also badly affect other functionalities in the system. Web data extraction tools have gained importance partly because they define automatic strategies for Wrapper maintenance. In the same paper, three different methods for generating Wrappers with such tools are discussed: regular-expression-based approaches, Wrapper programming languages, and tree-based approaches.

The regular-expression-based approach identifies patterns in unstructured text using regular expressions. For instance, regular expressions written over HTML pages rely either on word boundaries or on HTML tag and table structure, and writing them manually requires great expertise. According to the paper, regular expressions also have advantages: the necessary expression can be inferred automatically from elements that the user selects in a Web page, a Wrapper can then be created, and similar elements can be extracted from other Web pages.

The logic-based approach comes from Web Wrapper programming languages. Tools based on Wrapper programming languages treat Web pages as semi-structured tree documents instead of simple text strings. As with the regular-expression-based approach, there are advantages: Wrapper programming languages can be designed to fully exploit both the semi-structured nature of Web pages and their content. The first implementation of such a wrapping language in real-world scenarios is by Baumgartner et al. [9].

The tree-based approach is called partial tree alignment in the paper. In Web documents, information is mostly collected in adjacent regions of the page, called record regions, and the aim of partial tree alignment is to describe and extract these regions. See [8] for more information regarding partial tree alignment.

According to [2], Web Wrappers can be classified by the kind of HTML pages they need to extract: unstructured, semi-structured, and structured pages. Free-text documents written in natural language are considered unstructured pages; apart from information extraction techniques, no technique can be applied to them with a reasonable degree of confidence. Structured pages are obtained only from a structured data source; for them, simple techniques based on syntactic matching are enough to complete the information extraction successfully. Semi-structured pages lie between the unstructured and the structured ones.
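To make the regular-expression-based approach concrete, the following is a minimal hand-written sketch in Python. Real Wrapper generators infer such expressions from elements selected by the user; here the pattern and the HTML fragment are invented for illustration only.

```python
# Sketch of a regular-expression Wrapper: extract the text of every <td>
# cell from an HTML fragment. The pattern is hand-written here; Wrapper
# generators would infer it from user-selected examples.
import re

HTML = """
<table>
  <tr><td>Dataset A</td><td>CSV</td></tr>
  <tr><td>Dataset B</td><td>XML</td></tr>
</table>
"""

CELL = re.compile(r"<td[^>]*>(.*?)</td>", re.IGNORECASE | re.DOTALL)

cells = [m.group(1).strip() for m in CELL.finditer(HTML)]
print(cells)  # ['Dataset A', 'CSV', 'Dataset B', 'XML']
```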

2.2 Web Data Extraction Tools - Previous Works

This section summarizes some of the Web data extraction methodologies presented in the literature. The first is the survey by Laender et al. [1], which introduced a set of criteria and a qualitative analysis of various Web data extraction tools: languages for Wrapper development, HTML-aware tools, natural language processing based tools, modelling-based tools, and ontology-based tools. Languages for Wrapper development are languages, such as Java or Perl, developed to assist users in constructing Wrappers. HTML-aware tools are based on turning the document into a parsing tree that reflects its HTML tag hierarchy. Natural language processing based tools usually apply techniques such as filtering, part-of-speech tagging, and lexical semantic tagging to build relationships between phrase and sentence elements, so that extraction rules can be derived; they are best suited to Web pages consisting of free text, such as apartment rental advertisements and job listings. Ontology-based tools locate constants present in the page and construct objects with them, whereas modelling-based tools try to locate portions of data in a Web page that implicitly conform to a given target structure. Wrapper induction tools do not use linguistic constraints or natural language processing, but rather formatting features that implicitly describe the structure of the pieces of data found [1].

Kushmerick [11][15][16] classified many information extraction tools into two distinct categories, finite-state and relational learning tools, and traced a profile of finite-state approaches to the Web data extraction problem; Web data extraction techniques derived from natural language processing and Hidden Markov Models were also discussed. Later, Chang et al. [7] introduced a tri-dimensional categorization of Web data extraction systems, based on task difficulty, the techniques used, and the degree of automation. Fiumara [2] applied these criteria to classify four of the latest Web data extraction systems. Among the large number of information extraction tools, Lixto [16] is an example of a powerful commercial semi-supervised Wrapper generator, while RoadRunner [20] is a prototype of a fully automatic tool. As of 2011, Web information extraction takes two forms: extracting information from natural language text, and extracting information from structured sources. Recently, the second line of work has been described as extracting information from lists on the Web.

To the best of our knowledge, the work by Ferrara et al. [4] is the most up-to-date survey at the time of writing. They define two main categories: the tree matching approach and the machine learning approach. As mentioned in the earlier section, tree matching algorithms describe and extract data regions by exploiting the semi-structured nature of Web pages, represented as labelled ordered rooted trees whose nodes are HTML tags. Machine learning algorithms offer different, interesting solutions; these techniques rely on training sessions during which the system acquires domain expertise, and they require a high level of human effort to label a huge number of Web pages.

In the next chapter, we present the nature of open data websites, our automatic extraction approach for them, the common issues encountered during the extraction process, and the data formats used by open data websites.

Chapter 3 Our Proposed Method For Automatic Data Extraction

As mentioned in the introduction chapter, the problem naturally divides into two components that each require individual consideration. The first part is how to find and extract the data from a given website; the extraction process has to be automated to make the system scalable. The second step is to find a way to store all this information so that it is available and searchable without any problem. Our work focuses on semi-structured pages, since a large portion of the HTML pages on the Web are semi-structured: the embedded data are often rendered regularly through HTML tags, and semi-structured data may be presented in HTML or non-HTML format [7]. Such pages consist of many tables and lists, the HTML code itself encourages their use, and the data usually sit inside these lists or tables. The proposed method therefore starts by exploring the nature of open data websites.

3.1 The Nature of Open Data Websites

When we examine different open data websites with the intention of collecting their general information, we stumble on a few basic problems. Open data websites have heterogeneous layouts consisting of tables, lists, and images. Two different illustrations are given in Figure 3.1 and Figure 3.2: Figure 3.1 shows the US government open data website and Figure 3.2 the UK government open data website. Both examples have many images and hyper-links for different categories, and both have a header with a navigation menu that contains a data category for locating the open data information. Moreover, both websites are served over the HTTP protocol. The following is general information about open data websites:

- They are semi-structured HTML pages.
- Generally, they consist of navigation menus, hyper-links, tables, and images.
- In each open data website, the data are located under a separate menu with a name such as "data" or "dataset". Often one has to navigate through the website to find the datasets.
- A data page contains a large amount of data, so pagination is used to display a limited number of results when viewing the website (pagination is discussed in detail in subsection 3.2.1).
- Either all data items are placed in a single pagination structure, or the website splits the same data over many pagination structures reached via category hyper-links.

Figure 3.1: The US Government Open Data Website

We conducted experiments on 20 different input open data websites, such as those given in Table 3.1; other example websites can be found on the Wikipedia open data page. The reason for using a limited set of websites is that we came across issues when trying to parse some of them, so we excluded those websites before starting the extraction process. The issues are summarized below:

- Some websites rely heavily on JavaScript, so it is not possible to reach all page content.
- Rarely, websites are protected by a username/password authentication mechanism.

Figure 3.2: The UK Government Open Data Website

- Belgium government open data website
- Russia government open data website
- Greek government open data website
- Norway government open data website
- Republic of Ghana government open data website
- Indian government open data website
- U.S. government open data website
- Indonesia government open data website
- The Open Database Of The Corporate World website
- Another U.S. government open data website
- Open-source crowd-sourcing website
- British Indian Ocean Territory government open data website
- Aquatic Biosystems online journal
- European Union open data portal
- UK government open data website

Table 3.1: Example list for Government and Organization Websites

Two further issues prevented parsing:

- Some websites do not change the current page URL while navigating between pages.
- Some websites are secured with HTTPS.

3.2 Automatic Extraction for Open Data

This section presents an information extraction algorithm that can locate and extract the data from open data websites. The algorithm does not depend on a training dataset and does not require any interaction with a user during the extraction process. It also imposes no requirements on its input: the input websites do not need to share a similar template. Our approach is as follows (a schematic sketch of this control flow is given after the list):

1. The system starts the extraction process by taking a list of websites as input.
2. For each website, a parser function is called to obtain the content of the website.
3. This content is used to find all hyper-links of the same website.
4. The found hyper-links are given to the pagination detection algorithm. First, since a pagination structure shows only a limited number of pagination hyper-links, the algorithm finds those hyper-links, completes them, and records them in a list. Second, the algorithm discovers the location of the pagination structure in order to discover the location of the data list structure.
5. In case no pagination structure is found among the hyper-links obtained in step 3, the system automatically repeats steps 2 and 3 for those hyper-links until a pagination structure is found. (The reason for descending into the obtained hyper-links is that the data may be divided into many pagination structures for the various data categories; checking only the first-level hyper-links is not enough to reach the data content.)
6. The list detection algorithm uses the information found in the pagination detection step in order to detect repetitive contiguous patterns and extract the data.
7. The extracted data are stored in a database for analysis.
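The following is a schematic, non-authoritative sketch of this control flow in Python. The functions detect_pagination and detect_lists stand in for the algorithms of subsections 3.2.1 and 3.2.2 and are shown only as stubs, and the depth limit on the step-5 fallback is an assumption added to keep the sketch terminating; it is not taken from the thesis.

```python
# Schematic driver for steps 1-7. detect_pagination and detect_lists are
# placeholder stubs for the algorithms of subsections 3.2.1 and 3.2.2; only
# the control flow, including the step-5 fallback, is illustrated here.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def collect_links(url):
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

def detect_pagination(url):
    return None  # placeholder: returns the list of connected pages (3.2.1)

def detect_lists(page_url):
    return []    # placeholder: returns the extracted records (3.2.2)

def store(records):
    print(records)  # placeholder for the database step (step 7)

def crawl(site_url, depth=0, max_depth=2):
    for link in collect_links(site_url):             # steps 2 and 3
        pages = detect_pagination(link)              # step 4
        if pages:
            for page in pages:
                store(detect_lists(page))            # steps 6 and 7
        elif depth < max_depth:                      # step 5: descend one level
            crawl(link, depth + 1, max_depth)
```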

3.2.1 Pagination Detection

In this subsection, we present the pagination algorithm and the issues encountered during its implementation. A pagination structure consists of connected hyper-links that let a user quickly pick content, as in Figure 3.4. It transforms the long content of a Web page into a series of virtual pages browsable via pagination hyper-links; each page is simply normal HTML wrapped in tags for effortless integration. A real example, the Kenya government open data website, is presented in Figure 3.3. This website lists different items, which we have circled in blue. As can be seen, it contains a header with a navigation menu, a table of data view types, a table of categories and topics, etc. The website organizes its 540 datasets into 19 different categories. At the bottom of the website there is a number list for navigating between the connected pages, which constitutes the pagination structure; the 540 datasets are split up and displayed over multiple pages using it.

Figure 3.3: The Kenya Government Open Data Website

The first task of the pagination detection algorithm is to extract the pagination hyper-links, complete the missing ones, and record them in a list. The following issues arise while extracting and updating pagination hyper-links.

Each pagination structure consists of an opening tag and a closing tag. Within each corresponding tag pair there can be other pairs of tags, resulting in nested blocks of HTML code. The opening HTML tag has a class name (as do other HTML tags) that is directly associated with the pagination structure, e.g. "pagination" is the tag class name in Figure 3.5. Finding this class name is challenging because website designers may choose different names, e.g. "pagination clear-block" or "hyper-links page-list". We therefore examined various websites to see the range of possible class names and observed that the only common point is that all of them contain the substring "pag". In our examples, most pagination structures are placed in a <div> or <ul> tag. We use regular expressions to capture the class names of HTML tags and find the pagination structure among those tags; a minimal sketch of this heuristic is given below.
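This sketch assumes the beautifulsoup4 package and uses a toy HTML fragment; real pages are of course larger and more varied.

```python
# Sketch of the class-name heuristic: find <div> or <ul> elements whose
# class attribute contains the substring "pag" and list the page links
# found inside them. The HTML fragment is a toy example.
import re
from bs4 import BeautifulSoup

HTML = """
<ul class="pagination clear-block">
  <li><a href="?page=1">1</a></li>
  <li><a href="?page=2">2</a></li>
  <li><a href="?page=83">last</a></li>
</ul>
"""

soup = BeautifulSoup(HTML, "html.parser")
pagination_blocks = [
    tag for tag in soup.find_all(["div", "ul"])
    if any(re.search("pag", cls, re.IGNORECASE) for cls in tag.get("class", []))
]
for block in pagination_blocks:
    print([a["href"] for a in block.find_all("a", href=True)])
# ['?page=1', '?page=2', '?page=83']
```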

Figure 3.4: The Pagination Web Design Examples

As shown in Figure 3.4, only a limited number of page links is displayed in a pagination structure; the user can navigate via the "previous", "next" or "last" hyper-links to reach the other connected pages. We therefore try to find all hyper-links that are not written out in the pagination structure. Finding the missing hyper-links requires determining the first and the last hyper-link of the pagination structure. For instance, in Figure 3.6 the last numbered hyper-link is page 6, and it is the same hyper-link as the "last" hyper-link of the pagination structure, so the page numbers in the pagination structure run from 1 to 6. In the example of Figure 3.5, on the other hand, the last numbered hyper-link is 83 and there is no "last" hyper-link, so the page numbers run from 1 to 83 and the missing hyper-links for pages 3 to 83 must be generated and added to the list. To do this, we take all page numbers between the first and the last one and create new hyper-links for the missing numbers.

In practice, capturing a page number from a hyper-link is not an easy task, because a hyper-link may contain other numbers besides the page number. For example, the hyper-link highlighted in orange in Figure 3.6 contains several numbers, such as "id=1", "itemid=15" and "limitstart=5". Creating a new hyper-link requires updating the existing hyper-link based on the number that actually represents the page. Moreover, if the pagination hyper-links carry a "results per page" limit, e.g. 10, 20 or 50 results per page, the page parameter advances by that limit, which makes completing the connected pagination hyper-links more complex: instead of looking only at the last two numbers to find the missing page numbers, we calculate the differences between the listed page values. For example, in Figure 3.6 the pages advance in steps of 5 and the last visible page value is 25; if the last page value were 50, then after page 25 the pages 30, 35, 40, 45, and 50 would have to be added to the list. A sketch of this completion step is given below.
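The sketch assumes that the page position is carried by a single query parameter (here "limitstart", as in the Figure 3.6 example); real hyper-links may encode the page number differently, so the parameter name and the URLs below are illustrative only.

```python
# Sketch of pagination hyper-link completion: infer the step size from the
# visible page links and generate the URLs of the missing pages. Assumes the
# page position is a single query parameter ("limitstart" in this example).
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def complete_pagination(visible_urls, param="limitstart", last_value=None):
    values = sorted({int(parse_qs(urlparse(u).query)[param][0]) for u in visible_urls})
    step = min(b - a for a, b in zip(values, values[1:])) if len(values) > 1 else 1
    stop = last_value if last_value is not None else values[-1]
    template = urlparse(visible_urls[0])
    pages = []
    for value in range(values[0], stop + 1, step):
        query = parse_qs(template.query)
        query[param] = [str(value)]
        pages.append(urlunparse(template._replace(query=urlencode(query, doseq=True))))
    return pages

visible = ["http://example.org/data?id=1&limitstart=0",
           "http://example.org/data?id=1&limitstart=5"]
print(complete_pagination(visible, last_value=25))   # limitstart = 0, 5, ..., 25
```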

A further issue is language: the pagination structure of the Greek government open data website, shown in Figure 3.6, is written in Greek. The meanings of the words "previous", "next", and "last" are therefore checked for various languages in order to determine the last hyper-link of the pagination structure. Some websites show a "more results" or "show all results" hyper-link, which produces duplicate data when collecting pagination hyper-links. Another issue is that some websites have two or more pagination structures with the same hyper-links within the same page; we ignore such websites in the experiment.

Figure 3.5: The Example HTML Pagination Structure-1

Second, the pagination algorithm discovers the location of the pagination structure in order to discover the location of the data list structure. Our observation is that the pagination structure and the data list are placed in a specific region: they are under one parent node. Our proposed method finds this parent using the HTML tag of the pagination structure; in other words, the data list is another sub-tree under the same parent node as the pagination sub-tree. A visualization step helps us to see this situation, and details are given in the list detection subsection. Figure 3.7 and Figure 3.8 are visualization examples that clearly show the pagination sub-tree and the list data sub-tree under the same parent node.
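A minimal sketch of this observation, assuming beautifulsoup4 and a toy HTML fragment shaped like the layout of Figure 3.7: once the pagination element is found, its sibling sub-trees under the shared parent node are taken as candidates for the data list.

```python
# Sketch: locate the pagination element, take its parent node, and treat the
# sibling sub-trees under that parent as candidate data-list regions.
from bs4 import BeautifulSoup

HTML = """
<div id="results">
  <div class="item">Dataset A</div>
  <div class="item">Dataset B</div>
  <ul class="pagination"><li><a href="?page=2">2</a></li></ul>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
pagination = soup.find(class_="pagination")
parent = pagination.parent                               # shared parent node
candidates = [child for child in parent.find_all(recursive=False)
              if child is not pagination]
for node in candidates:
    print(node.get_text(strip=True))                     # Dataset A, Dataset B
```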

24 Figure 3.6: The Example HTML Pagination Structure-2 visualization examples that clearly show pagination sub-tree and list data sub-tree are under the same parent node. Due to the difficulty in extracting and updating pagination structure, we spend quite a lot of time in this step. This approach has limitation that the target structure must be known in advance, which is not possible in all cases. A considerable amount of human effort is required to label a pagination structure and update pagination hyper-links in proper format List Detection Several methods have been proposed for the task of extracting information embedded in lists on the Web. Most of them rely on the underlying HTML mark-up and corresponding Document Object Model (DOM) 2 structure of a Web page. The general idea behind the Document Object Model is that HTML Web pages are represented by means of plain text, which contains HTML tags. HTML tags may be nested one

This hierarchy is captured in the DOM by the document tree, whose nodes represent HTML tags. The document tree has been successfully exploited for Web data extraction purposes, and various techniques are discussed in [12] [22].

Before applying the list detection algorithm, we visualise the structure of a website's content using the Document Object Model. Anything found in an HTML or XML document can be accessed, changed, deleted, or added through the DOM. For this task, we transform websites into a well-formed document using the Newick tree format. A tag tree representation is constructed based on the nested structure of start and end HTML tags; sub-tree examples are given in Figure 3.7 and Figure 3.8. In mathematics, the Newick tree format is a way of representing graph-theoretical trees with edge lengths using parentheses and commas. Examples:

(A,B,(C,D));                               leaf nodes are named
(A,B,(C,D)E)F;                             all nodes are named
(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;         distances and all names

The list detection algorithm was originally designed for paper documents, for tasks such as page frame detection, header and footer detection, or information extraction. We use the same algorithm to mine sequential patterns in a Web page, focusing on ordered repetitive contiguous patterns. Basically, the algorithm takes a flat list of elements as input and, based on the element features we selected (tags and attributes), it produces a structured, segmented list as output [10]. The method relies on the following steps (a simplified sketch of the n-gram steps follows the list):

1. Element characterization: features characterizing each element are computed.
2. Feature calibration: similar features are regrouped together (a normalization step, a kind of clustering, so that similar features are considered equal).
3. N-gram generation: a set of n-grams is generated for the sequence of elements.
4. Sequential n-grams are selected and ordered by frequency.
5. The most frequent n-gram is selected, and the sequences of elements matching this n-gram are structured (regrouped under the same node).
6. The identified sequences are enriched with additional n-grams.
7. The method is applied iteratively over the new sequence of elements as long as step 4 can generate a new n-gram.
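The following is a simplified sketch of steps 3 to 5 only, written for illustration: each element is characterized by its tag name, n-grams are counted over the tag sequence, and the elements matching the most frequent repeated n-gram are grouped. The actual algorithm [10] also calibrates features and iterates (steps 1, 2, 6, and 7), which this sketch omits.

```python
# Simplified sketch of steps 3-5: count n-grams over a sequence of element
# features (here just tag names) and group the elements matching the most
# frequent repeated n-gram into records.
from collections import Counter

def most_frequent_ngram(seq, max_n=3):
    counts = Counter()
    for n in range(2, max_n + 1):
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    repeated = [(gram, c) for gram, c in counts.items() if c > 1]
    return max(repeated, key=lambda gc: (gc[1], len(gc[0])))[0] if repeated else None

def group_by_ngram(seq, gram):
    groups, i, n = [], 0, len(gram)
    while i <= len(seq) - n:
        if tuple(seq[i:i + n]) == gram:
            groups.append(list(range(i, i + n)))   # element indices of one record
            i += n
        else:
            i += 1
    return groups

tags = ["h3", "p", "a", "h3", "p", "a", "h3", "p", "a", "div"]
gram = most_frequent_ngram(tags)
print(gram)                       # ('h3', 'p', 'a')
print(group_by_ngram(tags, gram))  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```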

Figure 3.7: The Example DOM Tree for Pagination and List Structure-1

As examples, Figure 3.7 and Figure 3.8 show sub-trees for the pagination and list detection structures. The pagination structure is coloured orange, which helps us reach the sub-tree of list items. In Figure 3.7, the parent of the pagination sub-tree is a <div> tag and its child nodes are <div> tags. In Figure 3.8, on the other hand, the parent of the pagination sub-tree is a <table> tag and the data are embedded in <table> sub-trees; the figure contains four data lists, coloured blue, that are structured in <table> tags. These examples show that:

Figure 3.8: The Example DOM Tree for Pagination and List Structure-2

1. Data can be placed in a short list in the Web page, e.g. using <table> tags, rather than in the longest list. This is the opposite of the assumption made by most extraction tools.
2. Tabular information on the Web is increasingly encoded with <div> tags instead of <table> tags as a result of the spread of CSS in Web page implementation; only a small number of the open data websites use tables.
3. The list data structure can be found by discovering repetitive patterns.

3.3 Data Formats in Open Data Websites

The task is to find out how to categorize the actual data. Some open data websites contain a lot of information that is not interesting for the extraction, like navigation hyper-links, etc. In our work, the data formats are divided into the following three categories according to their attributes and our requirements (a minimal sketch of this categorization is given after Figure 3.9):

1. Short text data: this kind of data always appears in data lists. Most of it ultimately contains all the extracted information, such as publication date, type of data, number of datasets, and popularity or rating of the data; an example from the Indian government open data website is shown in Figure 3.9.
2. Long text data: long text only, which is not downloadable data, such as the U.S. open data website shown in Figure 3.10.
3. Hyper-links: this kind of data corresponds to hyper-links in a Web page, which usually carry <a> tags in the HTML files. Web pages inside a website are connected to each other through hyper-links. For example, in Figure 3.9, when we click one of the titles to download data, it refers to another hyper-link giving the details of that data, such as post title, description, download and reference information, etc. An example page is given in Figure 3.11.

Figure 3.9: The Indian Government Open Data Website Data Format
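As a rough illustration of this three-way categorization, the following sketch uses simple heuristics (presence of an <a> tag, then text length); the 100-character threshold is an arbitrary choice for the example and is not a value taken from the thesis.

```python
# Minimal sketch of the three-way categorization: hyper-link, long text,
# or short text. The length threshold is arbitrary and illustrative only.
from bs4 import BeautifulSoup

def categorize(node):
    if node.name == "a" or node.find("a") is not None:
        return "hyper-link"
    text = node.get_text(strip=True)
    return "long text" if len(text) > 100 else "short text"

HTML = '<li><a href="/dataset/42">Census 2011</a></li>'
node = BeautifulSoup(HTML, "html.parser").li
print(categorize(node))   # hyper-link
```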

Figure 3.10: One of The US Government Open Data Website Data Format

Figure 3.11: The Indian Open Data Website Data Format in Download Page


Chapter 4 Conclusion and Future Work

In this chapter, we present the conclusion, the future work, and the natural next step for this project.

4.1 Conclusion

The World Wide Web holds a large amount of unstructured data. Automatically extracting structured information from Web sources requires the development and implementation of several strategies, and has a wide range of applications in several fields, ranging from commercial websites to open data websites. In the first part of this report, we provide a classification of the algorithmic techniques exploited to extract data from Web pages. We review previous work, starting with basic techniques such as Wrappers, and then focus on how Web data extraction systems work, offering different perspectives for classifying them.

The second part of the work is about a system that provides automatic extraction from open data websites based on their sub-structure. We present the nature of open data websites and briefly highlight the common issues encountered while extracting their pagination structures. We present the list detection steps and some real-world scenarios. This part ends with a discussion of the data formats found on open data websites.

In conclusion, this report focuses on the implementation of an automatic Web extraction system for open data websites, which are semi-structured documents. The methods used are new approaches in terms of finding the location of data via the pagination structure and of applying an algorithm originally used for page frame detection and header and footer detection. We believe that these approaches will open new perspectives for further research in the open data extraction area and show high potential for significant improvements in the future.

4.2 Future Work

The allotted time for the Experimenting with Open Data project was 6 months; because of visa issues, we worked on it for only 4 months. This limited working time affected the outcome of the project, and we did not complete the data storing step. As future work, we are going to store the data in databases and then analyse and evaluate them. The natural next steps for this project are: first, handling the websites that are problematic to parse; second, applying some natural language processing methods to improve the pagination detection algorithm so that it can extract and complete a wider range of hyper-links.

Chapter 5 Acknowledgements

This research project would not have been possible without the support of many people. I would like to express my greatest gratitude to the people who have helped and supported me throughout my project. I would like to express my sincere gratitude to my external supervisor Herve Dejean for his continuous support of my master's thesis project and for his motivation, enthusiasm, and immense knowledge. His guidance helped me throughout the project work; I am sure it would not have been possible without his help. Special thanks to Asst. Prof. Henrik Björklund, who gave me valuable advice on my project report. I would like to thank my parents and friends, who encouraged me to go my own way, for their support. And especially to God, who made all things possible.


Bibliography

[1] Alberto H. F. Laender, Berthier A. Ribeiro-Neto, Altigran S. da Silva, and Juliana S. Teixeira. A brief survey of web data extraction tools. SIGMOD Record, 31(2):84-93.
[2] Giacomo Fiumara. Automated information extraction from web sources: a survey.
[3] Xiaoqing Zheng, Yiling Gu, and Yinsheng Li. Data extraction from web pages based on structural-semantic entropy. In Alain Mille, Fabien L. Gandon, Jacques Misselis, Michael Rabinovich, and Steffen Staab, editors, WWW (Companion Volume). ACM.
[4] Emilio Ferrara, Pasquale De Meo, Giacomo Fiumara, and Robert Baumgartner. Web data extraction, applications and techniques: A survey. CoRR.
[5] Paul Miller, Rob Styles, and Tom Heath. Open data commons, a license for open data. In Christian Bizer, Tom Heath, Kingsley Idehen, and Tim Berners-Lee, editors, LDOW, volume 369 of CEUR Workshop Proceedings. CEUR-WS.org.
[6] Fabio Fumarola, Tim Weninger, Rick Barber, Donato Malerba, and Jiawei Han. Extracting general lists from web documents: a hybrid approach. In Proceedings of the 24th International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems (IEA/AIE 11), Part I, Berlin, Heidelberg. Springer-Verlag.
[7] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, and Khaled Shaalan. A survey of web information extraction systems.
[8] Erik Schlyter. Structured data extraction.
[9] Robert Baumgartner, Wolfgang Gatterbauer, and Georg Gottlob. Web data extraction system. In Ling Liu and M. Tamer Özsu, editors, Encyclopedia of Database Systems. Springer US, 2009.

[10] Hervé Déjean. Numbered sequence detection in documents. In Laurence Likforman-Sulem and Gady Agam, editors, DRR, volume 7534 of SPIE Proceedings. SPIE.
[11] Nickolas Kushmerick, Daniel S. Weld, and Robert B. Doorenbos. Wrapper induction for information extraction. In Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI 97).
[12] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, and Zachary Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference, Busan, Korea. Springer.
[13] I. Muslea, S. Minton, and C. Knoblock. A hierarchical approach to wrapper induction.
[14] Nicholas Kushmerick. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1-2):15-68.
[15] Nicholas Kushmerick. Finite-state approaches to web information extraction. In Proc. 3rd Summer Convention on Information Extraction. Springer-Verlag.
[16] Robert Baumgartner, Sergio Flesca, and Georg Gottlob. Visual web information extraction with Lixto. In The VLDB Journal.


Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono Web Mining Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann Series in Data Management

More information

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November ISSN

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November ISSN International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 398 Web Usage Mining has Pattern Discovery DR.A.Venumadhav : venumadhavaka@yahoo.in/ akavenu17@rediffmail.com

More information

Semantic Web Search Model for Information Retrieval of the Semantic Data *

Semantic Web Search Model for Information Retrieval of the Semantic Data * Semantic Web Search Model for Information Retrieval of the Semantic Data * Okkyung Choi 1, SeokHyun Yoon 1, Myeongeun Oh 1, and Sangyong Han 2 Department of Computer Science & Engineering Chungang University

More information

Sentiment Analysis for Customer Review Sites

Sentiment Analysis for Customer Review Sites Sentiment Analysis for Customer Review Sites Chi-Hwan Choi 1, Jeong-Eun Lee 2, Gyeong-Su Park 2, Jonghwa Na 3, Wan-Sup Cho 4 1 Dept. of Bio-Information Technology 2 Dept. of Business Data Convergence 3

More information

CADIAL Search Engine at INEX

CADIAL Search Engine at INEX CADIAL Search Engine at INEX Jure Mijić 1, Marie-Francine Moens 2, and Bojana Dalbelo Bašić 1 1 Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Croatia {jure.mijic,bojana.dalbelo}@fer.hr

More information

A Korean Knowledge Extraction System for Enriching a KBox

A Korean Knowledge Extraction System for Enriching a KBox A Korean Knowledge Extraction System for Enriching a KBox Sangha Nam, Eun-kyung Kim, Jiho Kim, Yoosung Jung, Kijong Han, Key-Sun Choi KAIST / The Republic of Korea {nam.sangha, kekeeo, hogajiho, wjd1004109,

More information

Template Extraction from Heterogeneous Web Pages

Template Extraction from Heterogeneous Web Pages Template Extraction from Heterogeneous Web Pages 1 Mrs. Harshal H. Kulkarni, 2 Mrs. Manasi k. Kulkarni Asst. Professor, Pune University, (PESMCOE, Pune), Pune, India Abstract: Templates are used by many

More information

An Evaluation of Geo-Ontology Representation Languages for Supporting Web Retrieval of Geographical Information

An Evaluation of Geo-Ontology Representation Languages for Supporting Web Retrieval of Geographical Information An Evaluation of Geo-Ontology Representation Languages for Supporting Web Retrieval of Geographical Information P. Smart, A.I. Abdelmoty and C.B. Jones School of Computer Science, Cardiff University, Cardiff,

More information

MIWeb: Mediator-based Integration of Web Sources

MIWeb: Mediator-based Integration of Web Sources MIWeb: Mediator-based Integration of Web Sources Susanne Busse and Thomas Kabisch Technical University of Berlin Computation and Information Structures (CIS) sbusse,tkabisch@cs.tu-berlin.de Abstract MIWeb

More information

DBpedia-An Advancement Towards Content Extraction From Wikipedia

DBpedia-An Advancement Towards Content Extraction From Wikipedia DBpedia-An Advancement Towards Content Extraction From Wikipedia Neha Jain Government Degree College R.S Pura, Jammu, J&K Abstract: DBpedia is the research product of the efforts made towards extracting

More information

Business Activity. predecessor Activity Description. from * successor * to. Performer is performer has attribute.

Business Activity. predecessor Activity Description. from * successor * to. Performer is performer has attribute. Editor Definition Language and Its Implementation Audris Kalnins, Karlis Podnieks, Andris Zarins, Edgars Celms, and Janis Barzdins Institute of Mathematics and Computer Science, University of Latvia Raina

More information

Exploiting Semantics Where We Find Them

Exploiting Semantics Where We Find Them Vrije Universiteit Amsterdam 19/06/2018 Exploiting Semantics Where We Find Them A Bottom-up Approach to the Semantic Web Prof. Dr. Christian Bizer Bizer: Exploiting Semantics Where We Find Them. VU Amsterdam,

More information

SK International Journal of Multidisciplinary Research Hub Research Article / Survey Paper / Case Study Published By: SK Publisher

SK International Journal of Multidisciplinary Research Hub Research Article / Survey Paper / Case Study Published By: SK Publisher ISSN: 2394 3122 (Online) Volume 2, Issue 1, January 2015 Research Article / Survey Paper / Case Study Published By: SK Publisher P. Elamathi 1 M.Phil. Full Time Research Scholar Vivekanandha College of

More information

Information mining and information retrieval : methods and applications

Information mining and information retrieval : methods and applications Information mining and information retrieval : methods and applications J. Mothe, C. Chrisment Institut de Recherche en Informatique de Toulouse Université Paul Sabatier, 118 Route de Narbonne, 31062 Toulouse

More information

Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity

Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity Tomáš Kramár, Michal Barla and Mária Bieliková Faculty of Informatics and Information Technology Slovak University

More information

The Wargo System: Semi-Automatic Wrapper Generation in Presence of Complex Data Access Modes

The Wargo System: Semi-Automatic Wrapper Generation in Presence of Complex Data Access Modes The Wargo System: Semi-Automatic Wrapper Generation in Presence of Complex Data Access Modes J. Raposo, A. Pan, M. Álvarez, Justo Hidalgo, A. Viña Denodo Technologies {apan, jhidalgo,@denodo.com University

More information

Chapter 2 BACKGROUND OF WEB MINING

Chapter 2 BACKGROUND OF WEB MINING Chapter 2 BACKGROUND OF WEB MINING Overview 2.1. Introduction to Data Mining Data mining is an important and fast developing area in web mining where already a lot of research has been done. Recently,

More information

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi.

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi. Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 18 Tries Today we are going to be talking about another data

More information

I R UNDERGRADUATE REPORT. Information Extraction Tool. by Alex Lo Advisor: S.K. Gupta, Edward Yi-tzer Lin UG

I R UNDERGRADUATE REPORT. Information Extraction Tool. by Alex Lo Advisor: S.K. Gupta, Edward Yi-tzer Lin UG UNDERGRADUATE REPORT Information Extraction Tool by Alex Lo Advisor: S.K. Gupta, Edward Yi-tzer Lin UG 2001-1 I R INSTITUTE FOR SYSTEMS RESEARCH ISR develops, applies and teaches advanced methodologies

More information

Emerging Technologies in Knowledge Management By Ramana Rao, CTO of Inxight Software, Inc.

Emerging Technologies in Knowledge Management By Ramana Rao, CTO of Inxight Software, Inc. Emerging Technologies in Knowledge Management By Ramana Rao, CTO of Inxight Software, Inc. This paper provides an overview of a presentation at the Internet Librarian International conference in London

More information

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search Seo tutorial Seo tutorial Introduction to seo... 4 1. General seo information... 5 1.1 History of search engines... 5 1.2 Common search engine principles... 6 2. Internal ranking factors... 8 2.1 Web page

More information

ISSN (Online) ISSN (Print)

ISSN (Online) ISSN (Print) Accurate Alignment of Search Result Records from Web Data Base 1Soumya Snigdha Mohapatra, 2 M.Kalyan Ram 1,2 Dept. of CSE, Aditya Engineering College, Surampalem, East Godavari, AP, India Abstract: Most

More information

Extension and integration of i* models with ontologies

Extension and integration of i* models with ontologies Extension and integration of i* models with ontologies Blanca Vazquez 1,2, Hugo Estrada 1, Alicia Martinez 2, Mirko Morandini 3, and Anna Perini 3 1 Fund Information and Documentation for the industry

More information

Crawler with Search Engine based Simple Web Application System for Forum Mining

Crawler with Search Engine based Simple Web Application System for Forum Mining IJSRD - International Journal for Scientific Research & Development Vol. 3, Issue 04, 2015 ISSN (online): 2321-0613 Crawler with Search Engine based Simple Web Application System for Forum Mining Parina

More information

Web Page Fragmentation for Personalized Portal Construction

Web Page Fragmentation for Personalized Portal Construction Web Page Fragmentation for Personalized Portal Construction Bouras Christos Kapoulas Vaggelis Misedakis Ioannis Research Academic Computer Technology Institute, 6 Riga Feraiou Str., 2622 Patras, Greece

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

Motivating Ontology-Driven Information Extraction

Motivating Ontology-Driven Information Extraction Motivating Ontology-Driven Information Extraction Burcu Yildiz 1 and Silvia Miksch 1, 2 1 Institute for Software Engineering and Interactive Systems, Vienna University of Technology, Vienna, Austria {yildiz,silvia}@

More information

Metadata Extraction with Cue Model

Metadata Extraction with Cue Model Metadata Extraction with Cue Model Wan Malini Wan Isa 2, Jamaliah Abdul Hamid 1, Hamidah Ibrahim 2, Rusli Abdullah 2, Mohd. Hasan Selamat 2, Muhamad Taufik Abdullah 2 and Nurul Amelina Nasharuddin 2 1

More information

AUTOMATIC VISUAL CONCEPT DETECTION IN VIDEOS

AUTOMATIC VISUAL CONCEPT DETECTION IN VIDEOS AUTOMATIC VISUAL CONCEPT DETECTION IN VIDEOS Nilam B. Lonkar 1, Dinesh B. Hanchate 2 Student of Computer Engineering, Pune University VPKBIET, Baramati, India Computer Engineering, Pune University VPKBIET,

More information

A Semi-automatic Support to Adapt E-Documents in an Accessible and Usable Format for Vision Impaired Users

A Semi-automatic Support to Adapt E-Documents in an Accessible and Usable Format for Vision Impaired Users A Semi-automatic Support to Adapt E-Documents in an Accessible and Usable Format for Vision Impaired Users Elia Contini, Barbara Leporini, and Fabio Paternò ISTI-CNR, Pisa, Italy {elia.contini,barbara.leporini,fabio.paterno}@isti.cnr.it

More information

A Tagging Approach to Ontology Mapping

A Tagging Approach to Ontology Mapping A Tagging Approach to Ontology Mapping Colm Conroy 1, Declan O'Sullivan 1, Dave Lewis 1 1 Knowledge and Data Engineering Group, Trinity College Dublin {coconroy,declan.osullivan,dave.lewis}@cs.tcd.ie Abstract.

More information

DESIGN AND EVALUATION OF A GENERIC METHOD FOR CREATING XML SCHEMA. 1. Introduction

DESIGN AND EVALUATION OF A GENERIC METHOD FOR CREATING XML SCHEMA. 1. Introduction DESIGN AND EVALUATION OF A GENERIC METHOD FOR CREATING XML SCHEMA Mahmoud Abaza and Catherine Preston Athabasca University and the University of Liverpool mahmouda@athabascau.ca Abstract There are many

More information

ISSN: (Online) Volume 2, Issue 3, March 2014 International Journal of Advance Research in Computer Science and Management Studies

ISSN: (Online) Volume 2, Issue 3, March 2014 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) Volume 2, Issue 3, March 2014 International Journal of Advance Research in Computer Science and Management Studies Research Article / Paper / Case Study Available online at: www.ijarcsms.com

More information

Gestão e Tratamento da Informação

Gestão e Tratamento da Informação Gestão e Tratamento da Informação Web Data Extraction: Automatic Wrapper Generation Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2010/2011 Outline Automatic Wrapper Generation

More information

ijade Reporter An Intelligent Multi-agent Based Context Aware News Reporting System

ijade Reporter An Intelligent Multi-agent Based Context Aware News Reporting System ijade Reporter An Intelligent Multi-agent Based Context Aware Reporting System Eddie C.L. Chan and Raymond S.T. Lee The Department of Computing, The Hong Kong Polytechnic University, Hung Hong, Kowloon,

More information

Support Notes (Issue 1) September Snap it! Certificate in Digital Applications (DA105) Coding for the Web

Support Notes (Issue 1) September Snap it! Certificate in Digital Applications (DA105) Coding for the Web Support Notes (Issue 1) September 2014 Certificate in Digital Applications (DA105) Coding for the Web Snap it! Introduction Before tackling the Summative Project Brief (SPB), students should have acquired

More information

Hidden Web Data Extraction Using Dynamic Rule Generation

Hidden Web Data Extraction Using Dynamic Rule Generation Hidden Web Data Extraction Using Dynamic Rule Generation Anuradha Computer Engg. Department YMCA University of Sc. & Technology Faridabad, India anuangra@yahoo.com A.K Sharma Computer Engg. Department

More information

Siteforce Pilot: Best Practices

Siteforce Pilot: Best Practices Siteforce Pilot: Best Practices Getting Started with Siteforce Setup your users as Publishers and Contributors. Siteforce has two distinct types of users First, is your Web Publishers. These are the front

More information

Azon Master Class. By Ryan Stevenson Guidebook #7 Site Construction 2/3

Azon Master Class. By Ryan Stevenson   Guidebook #7 Site Construction 2/3 Azon Master Class By Ryan Stevenson https://ryanstevensonplugins.com/ Guidebook #7 Site Construction 2/3 Table of Contents 1. Creation of Site Pages 2. Category Pages Creation 3. Home Page Creation Creation

More information

PROJECT REPORT. TweetMine Twitter Sentiment Analysis Tool KRZYSZTOF OBLAK C

PROJECT REPORT. TweetMine Twitter Sentiment Analysis Tool KRZYSZTOF OBLAK C PROJECT REPORT TweetMine Twitter Sentiment Analysis Tool KRZYSZTOF OBLAK C00161361 Table of Contents 1. Introduction... 1 1.1. Purpose and Content... 1 1.2. Project Brief... 1 2. Description of Submitted

More information

Text Mining: A Burgeoning technology for knowledge extraction

Text Mining: A Burgeoning technology for knowledge extraction Text Mining: A Burgeoning technology for knowledge extraction 1 Anshika Singh, 2 Dr. Udayan Ghosh 1 HCL Technologies Ltd., Noida, 2 University School of Information &Communication Technology, Dwarka, Delhi.

More information

LITERATURE SURVEY ON SEARCH TERM EXTRACTION TECHNIQUE FOR FACET DATA MINING IN CUSTOMER FACING WEBSITE

LITERATURE SURVEY ON SEARCH TERM EXTRACTION TECHNIQUE FOR FACET DATA MINING IN CUSTOMER FACING WEBSITE International Journal of Civil Engineering and Technology (IJCIET) Volume 8, Issue 1, January 2017, pp. 956 960 Article ID: IJCIET_08_01_113 Available online at http://www.iaeme.com/ijciet/issues.asp?jtype=ijciet&vtype=8&itype=1

More information

Ontology Matching with CIDER: Evaluation Report for the OAEI 2008

Ontology Matching with CIDER: Evaluation Report for the OAEI 2008 Ontology Matching with CIDER: Evaluation Report for the OAEI 2008 Jorge Gracia, Eduardo Mena IIS Department, University of Zaragoza, Spain {jogracia,emena}@unizar.es Abstract. Ontology matching, the task

More information

Biocomputing II Coursework guidance

Biocomputing II Coursework guidance Biocomputing II Coursework guidance I refer to the database layer as DB, the middle (business logic) layer as BL and the front end graphical interface with CGI scripts as (FE). Standardized file headers

More information