Extracting Content from Online News Sites


Extracting Content from Online News Sites

Sigrid Lindholm
January 31, 2011

Master's Thesis in Computing Science, 30 ECTS credits
Supervisor at CS-UmU: Johanna Högberg
Examiner: Per Lindström

Umeå University, Department of Computing Science, SE UMEÅ, SWEDEN


Abstract

Society is producing more and more data with every year. The number of unique URLs indexed by Google recently surpassed the one-trillion mark. To fully benefit from this surge in data, we need efficient algorithms for searching and extracting information. A popular approach is to use the so-called vector space model (VSM), which organises documents according to the terms that they contain. This thesis contributes to an investigation of how adding syntactical information to VSM affects search results. The thesis focuses on techniques for content extraction from online news sources, and describes the implementation and evaluation of a selection of these techniques. The extracted data is used to obtain test data for search evaluation. The implementation is generic and thus easily adapted to new data sources, and although the implementation lacks precision, its performance is sufficient for evaluating the syntax-based version of VSM.


Chapter 1

Preface

CodeMill is a Umeå-based IT company that offers system development and resource consulting. The company also has a Research & Development (R&D) division that collaborates with academia to turn scientific results into commercial products. A prioritised project at R&D is the implementation of a syntax-based search engine for information retrieval. The engine is tailored for companies with a constant need for updated information, which is generally the case within the financial and ICT (Information and Communications Technology) sectors. Concrete applications are monitoring media coverage of a company's affairs, or keeping track of an entire field of business.

The search engine consists of two components: a frontend that extracts and parses online news articles, and a backend for indexing and searching the gathered documents. Information is transferred from the frontend to the backend through a shared database. The author of this thesis contributes an implementation of the frontend written in Java, together with a survey and evaluation of current techniques for article extraction. The backend is described by Thomas Knutsson in [12].


Contents

1 Preface
2 Introduction and Problem Description
  2.1 The Vector Space Model
    Latent Semantic Analysis
    Random Indexing
    Improvements
  2.2 Parse Trees
  2.3 Motivation for Improvements
  2.4 Web Extraction
  2.5 Requirements
  2.6 Thesis Outline
3 Overview of Previous Work
  3.1 Wrapper Techniques in General
  3.2 Extracting Articles From News Sites
    Tree Edit Distance (2004)
    The Curious Negotiator (2006)
    Tag Sequence and Tree Based Hybrid Method (2006)
    Linguistic and Structural Features (2007)
    Visual and Perceptual
    A Generic Approach (2008)
4 Approach
  4.1 Web Content Syndication
  4.2 Extraction
  4.3 Parsing
  4.4 Database
  4.5 System Overview
5 Results
6 Conclusion
  Limitations
  Future work
Acknowledgements
References

List of Figures

2.1 A vector space indexed by terms
2.2 Possible parses of a pair of English sentences
3.1 Degree of structure of different types of documents
3.2 Mapping
3.3 Top down mapping
3.4 Restricted top down mapping
3.5 Node extraction patterns
4.1 DOM tree with path prefixes
4.2 Left and right numbering in the Nested Set Model
4.3 Outline of the complete information-retrieval system
A sample article from the Wall Street Journal
The continuation of the article in the previous figure
A sample article from USA Today
The continuation of the article in the previous figure


List of Tables

4.1 Corresponding html to Figure 4.1
4.2 Sample output from parser
Comparison of automatically and manually extracted text from a Wall Street Journal article
Comparison of automatically and manually extracted text from a USA Today article


Chapter 2

Introduction and Problem Description

Search algorithms are an important field of computer science, and will remain so in the years to come. There is more and more data available, and increasingly many people produce, analyse and manage information, both professionally and otherwise. Information and knowledge are themselves becoming a resource, and the need for good search methods is increasing.

In an attempt to evaluate a possible improvement to existing search methods, a small system will be built. The system will consist of a frontend, which extracts articles from websites and stores these articles in a database, and a backend, which indexes the data inserted into the database and provides an interface for users to make queries on the data. Given a word or phrase, the indexed documents/articles with the greatest relevance should be returned. By relevance one does not mean the documents with the greatest lexical similarity, but rather those which are semantically similar.

There are two main motivations behind the use of semantic indexing: synonyms and homographs. Synonyms are a great obstacle in lexical search. The descriptions that users give documents tend to vary to a large degree, and only documents which exactly match the search terms entered by a user are returned in lexical search. Any documents which are of interest, but do not contain an exact match, are left out, giving a large number of false negatives. A further source of trouble are words with more than one meaning, homographs, which give rise to false positives. Given the word desert, a lexical search algorithm will return both pages relating to dry geographic areas and pages where the word is used in the sense of abandoning something. [6]

2.1 The Vector Space Model

A common approach in semantic indexing is to use a vector space model. The base is a co-occurrence matrix, a sparse matrix where the rows represent unique

words and the columns represent either words (a word-by-word matrix) or, as is more common, longer pieces of text such as documents (a words-by-document matrix). Each entry in a words-by-document matrix is the frequency of occurrence, the number of times that a certain word occurs in a document. [18]

From the words-by-document matrix, each word (or term) is mapped to an axis in an n-dimensional coordinate system, where n is the number of words, i.e. the number of rows in the co-occurrence matrix. Vectors are then created, with coordinates corresponding to the entries in the matrix. If a document contains the word w with an occurrence frequency of a, the value of the vector coordinate corresponding to w's axis is a, otherwise 0. This results in a vector for each document. A tiny example with only three terms is depicted in Figure 2.1, where each vector is a representation of a document, and the coordinates of the vectors are given by the frequency of the terms in the document.

[Figure 2.1: A vector space indexed by terms (Term 1, Term 2 and Term 3), with one vector per document]

All vector space models are based on the distributional hypothesis: words with similar distributional properties have similar meanings. Words with similar meaning do not necessarily occur with each other, but they do tend to occur with the same other words [19]. It is possible to find other documents similar to a specific document by looking at the vector for the document and then calculating which vectors are close to it. [20]

It is possible to make queries using this model by inserting a vector into the vector space, created in the same fashion as the document vectors (with entries corresponding to the frequencies of the words in the query). Then, one calculates which vectors are the query vector's closest neighbours; these vectors correspond to documents which are likely to be good matches to the query.
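A minimal sketch of these ideas is given below: documents and queries are represented as sparse term-frequency vectors, and closeness is measured with cosine similarity, one common choice of distance in vector space models. The class and method names are illustrative assumptions, not part of the thesis implementation (which handles indexing and search in the backend).

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch: term-frequency vectors and cosine similarity in a vector space model. */
public class VsmSketch {

    /** Build one column of a words-by-document matrix: term -> occurrence frequency. */
    static Map<String, Integer> termFrequencies(String document) {
        Map<String, Integer> freq = new HashMap<>();
        for (String token : document.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                freq.merge(token, 1, Integer::sum);
            }
        }
        return freq;
    }

    /** Cosine similarity between two sparse term-frequency vectors; 1.0 means identical direction. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> document = termFrequencies("the desert is a dry area of land");
        Map<String, Integer> query = termFrequencies("dry desert land");
        System.out.println("similarity = " + cosine(document, query));
    }
}
```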

Latent Semantic Analysis

The above description is a simplified one. The vector space for any real-life problem is very large, and a number of optimisations and reduction techniques are employed to make this model useful. LSA (Latent Semantic Analysis) is one method that is based on the vector space model. Here the co-occurrence matrix is analysed using a technique called SVD (Singular Value Decomposition). Any terms or documents whose contribution is very small are removed from the matrix. The resulting vector space is a good approximation of the real one, and hopefully easier to compute.

Random Indexing

As in LSA, a vector space is constructed and the clustering of vectors indicates similarity, but the method employed is quite different. Each word is represented by an index vector of limited length, randomly filled with the values 0, +1 and -1. A context vector is constructed for each document by adding the index vectors of all words occurring in the document. To make a query vector using Random Indexing, simply add the index vectors corresponding to the words in the query. Then, find context vectors which are close to it in the vector space. [18]

Improvements

By using parse trees instead of words, and collections of parse trees or annotated texts instead of documents, in the above-mentioned methods (LSA and Random Indexing), some increase in search performance is expected.

2.2 Parse Trees

As described by [10], a language L consists of a set of strings, and can be described by a context-free grammar (CFG, or simply grammar). A grammar G = (V, T, P, S) has the following components:

- T: Terminals, the symbols that the strings of L are built from.
- V: Variables/nonterminals, each representing a language (a set of strings).
- S: The start symbol.
- P: Rules/productions, recursive rules which define the sentences of the language.

Some small examples of parse trees are shown in Figure 2.2. A bit more formally, a parse tree (also known as a concrete syntax tree) is a structural representation of a valid string in a language L. All interior nodes of a parse tree are nonterminals from the grammar G; its leaves are either nonterminals, terminals, or, if the leaf is an only child, ε (the empty string). Also, the children of any interior node must follow one of the productions in P: if

the node's children are named X_1, X_2, ..., X_k (left to right), there must be a production in P of the form A → X_1 X_2 ... X_k, where A is the label of the interior node.

[Figure 2.2: Possible parses of a pair of English sentences ("Johanna left her keys on the table" and "Jeremy is afraid of mopeds")]

2.3 Motivation for Improvements

There are several motivations given in [9] for why parse trees might provide better results. It is possible to draw conclusions about a term's semantics depending upon its location in the syntactical structure, which is inherent in parse trees. Furthermore, a term's location in the structure may also provide information about how important a word is, providing a means to remove less important words, called syntactic filtering.

2.4 Web Extraction

This thesis covers only the frontend of this project, which is concerned with the extraction of articles from news sites. Extraction is a technique to find relevant parts of a document and store these in a structured way, connecting data values with their semantics. Here, the aim is to produce a sufficient amount of data for the indexing and data analysis of the backend.

News articles were chosen as an appropriate data type due to the Penn Treebank, a corpus (a structured, large set of texts) of parse trees originating from articles in the Wall Street Journal. The Penn Treebank is a parsed corpus, where each sentence has been annotated as a parse tree. It is well known and often used in the research community, making comparisons of this work to previous ones easier and more credible. The main target of extraction is therefore the Wall Street Journal, but preferably extraction should also work on other U.S. news sites; ideally, it should

be as generic as possible. (The restriction on the news origin is due to the fact that the Penn Treebank, of course, is in American English.)

2.5 Requirements

So, briefly, the task is to build a system which extracts articles, parses them and inserts them into a database. More specifically, the requirements are the following:

- Fetch and extract articles, primarily from the Wall Street Journal and secondarily from other U.S. websites. Extraction should be as accurate as possible.
- Include the Charniak natural language parser in the project, and build an interface to it.
- Find a suitable database structure.
- Insert parse trees into the database, while maintaining the syntactical structure.

After each stage is completed, testing should be performed if appropriate. If time is available: research to find another, better natural language parser to generate parse trees from the documents found, and research the possibility of using web crawlers and which difficulties might arise from the use of these. If possible, the implementation will be extended with results from this research.

There were some technical requirements from the company which commissioned the work, most importantly that all programming must be done in Java, using the Eclipse IDE and SVN. It was also required that RSS be used to find pages to extract articles from.

2.6 Thesis Outline

The outline of this thesis is as follows. Chapter 2 describes the problem at hand, first the project as a whole, and then narrowing in on the frontend. A survey of web extraction methods in general, but with focus on the field of extracting articles from news sites, is given in Chapter 3. The approach used in this project is presented in Chapter 4, followed by the results of this labour in Chapter 5. The last chapter contains concluding remarks and gives pointers for future work.


Chapter 3

Overview of Previous Work

Web extraction is the retrieval of relevant parts of a document and storing these in a structured way, connecting data values with their semantics. [11] Websites tend to be dynamic and heterogeneous, which makes the task non-trivial. The most common approach to web extraction (henceforth: extraction) is the use of a wrapper, a small program which applies a set of scripts or rules to identify relevant data and output it in some structured fashion. Initially this was all done manually, but considerable research has been done, mainly within machine learning, to create semi- or fully automated induction (generation) of wrappers.

Pages from the same site tend to share the same underlying structure, or template, whose content often is filled in by the backend database. Wrapper induction (WI) is about finding this hidden template. Each site has its own template and therefore requires a wrapper of its own. There are several advantages of using WI:

1. If the underlying structure of a web page is changed, generating a new wrapper is easier than writing a new set of rules.
2. Creating rules manually demands domain knowledge. WI lessens the demands on the end user, who does not need the same computer and programming experience.
3. Sites have different underlying structures and must be processed differently. Creating wrappers manually is time consuming and therefore rarely feasible for larger data sets. Maintaining wrappers over time is particularly cumbersome.

Many recent tools are either fully or semi-automated. Typically, manual and semi-automated tools require user interaction, often through a GUI, where the user selects areas of interest, and the tool attempts to learn which areas the user is interested in. Naturally, this is not a feasible method when a large number of sites are involved. Fully automated tools also require some kind of user interaction, but this rarely takes place in the extraction phase; rather, it occurs in the post-processing phase during data labelling.

3.1 Wrapper Techniques in General

Various techniques have been used to extract data or generate wrappers, and different ways to classify wrappers have been presented. Among these is [3], which is mainly concerned with the degree of supervision required from the user (not to be confused with supervised learning in AI). Supervision may be through different inputs such as training sets or through a GUI where the user chooses which parts of a page he finds interesting. A semi-supervised system requires less specific examples in its training set than a supervised system; alternatively, it only needs some user response after extraction has taken place to evaluate the result. Unsupervised systems need very little interaction with the user during the extraction process, although some post-extraction filtering or labelling of data often is needed.

Laender et al. [13] instead divide wrapper generation techniques into the following six categories. (All references to the degree of supervision in the comparison below are from [3].)

Languages for Wrapper Development. This is one of the first methods of wrapper generation. Special-purpose languages were proposed in addition to existing programming languages such as Perl or Java. These were the earliest attempts at web extraction systems, and were more or less rule based and required that users had a large knowledge about computers and programming [3]. However, these languages, which include Minerva, TSIMMIS and Web-OQL, did not make a large lasting impact on the field. They were often based on simple rules in different formats. TSIMMIS is one of the very first web wrappers. Input was a long list of commands, stating where the data of interest was to be found in a page, and which variables to instantiate with the data [3]. Minerva used grammar-based production rules, written in EBNF, that expressed a pattern to help locate the target data [3].

HTML-aware Tools. By examining the structure of html documents, these tools transform documents into trees, retaining the structure of the html document. The trees are then processed to generate wrappers. W4F, XWrap and RoadRunner are examples of tools from this category. W4F and XWrap are both considered by [3] to be constructed manually. RoadRunner, on the other hand, is far newer and one of few unsupervised systems. It attempts to find similar structures at the html tag level in several different pages and works best on data-dense sites (which tend to be dense in structure as well) [11].

NLP-based Tools. These tools include RAPIER, SRV and WHISK. They use natural language processing (NLP), e.g. filtering or lexical semantic tagging, to identify relevant parts of a document, and build rules on how to extract these parts. These tools only work for documents containing mainly plain text, such

as different kinds of ads (job, rent, dating etc.). RAPIER and WHISK are both supervised WI systems.

Wrapper Induction Tools. These tools use the structure and formatting of documents. A number of example training sets must be supplied by the user, and are used to induce extraction rules. Systems belonging to this classification are WIEN, SoftMealy and STALKER, and they are all seen as supervised WI systems due to the large amount of input (training sets) required.

Modelling-based Tools. Given a structure of interest, formatted using a simple primitive such as a tuple or a list, a modelling-based tool will try to find the same structure in the document of interest. NoDoSE and DEByE are both user-supervised, modelling-based systems that provide a GUI where the user selects which regions are interesting.

Ontology-based. In contrast with all tools previously described, an ontology-based tool does not use the underlying structure/presentation, but instead examines the data. This is only applicable to specific domains and often must be constructed by a human well acquainted with the domain. Being a very abstract concept, this has not been researched much. Ontology-based tools are hard to create, demanding a very high level of expertise, nor are they very generic due to the domain constraint. However, they are easily adapted to pages within the same domain, and formatting changes will not affect their performance. Brigham Young University has developed the most mature tool in this field.

Some newer tools which are not included in [13] are for instance IEPAD, OLERA and Thresher (semi-supervised), and DeLa, EXALG and DEPTA (unsupervised). A semi-supervised system, as previously mentioned, requires less exact examples in the training set than a supervised system, or it only needs some user response after extraction has taken place to evaluate the result. IEPAD is able to find, given only a few examples, other similar repetitive patterns, provided these patterns are large enough; it is not able to find any single instances of target data. [3] Unsupervised systems, such as RoadRunner or DEPTA, require very little interaction with the user before extraction; no training sets are needed. However, some post-extraction filtering or labelling of extracted data is often needed. Unfortunately they have the drawback of only working on pages which are data-dense, since they require a certain amount of input to correctly identify patterns. [3]

One of the tools which has attracted the most attention recently is a supervised system called LiXto. A GUI is provided where the user interactively selects regions of interest from an integrated web browser (Mozilla), without dealing with the underlying html. The system stores the path in the html tree

[Figure 3.1: Degree of structure of different types of documents, ranging from structured (databases, XML, cgi-generated HTML; Elmasri & Navathe) via semi-structured (hand-written HTML, postings on newsgroups such as apartment rentals, equipment maintenance logs, medical records; Soderland) to unstructured free text such as news articles, ordered from easy to hard for a machine to understand]

of this selected region, and notes its pattern. LiXto is then able to find similar patterns using a programming language called Elog, which is related to Datalog. The system is quite friendly to end users. Neither Elog nor html is ever presented to the user, therefore no knowledge about these is needed.

As Laender et al. point out in [13], there is a trade-off between a tool's degree of flexibility and its degree of automatisation. A tool with a high degree of automatisation tends to have parameters and heuristics tuned to a certain type of page. These must be changed if the tool is applied to another domain; there are different challenges to face when extracting data from a page selling hard drives compared to a page listing available jobs.

3.2 Extracting Articles From News Sites

The level of difficulty for a particular extraction task (in general, not necessarily web extraction) is dependent upon how structured a document is. The method of measuring degree of structure varies between different research domains, as seen in Figure 3.1 from [3], which depicts the relationship between difficulty and structure as viewed by a linguist (Soderland) or by database researchers (Elmasri and Navathe). Most challenging of all is unstructured text, such as news articles. This applies when one extracts data from a news article. However, this project is concerned with the extraction of the news article itself from html, a semi-structured document type, which is a less overwhelming task. [3]

Nonetheless, news extraction is a specific domain in itself. Semantically, the content of the article is not important; the only thing that matters is that the article as a whole has been successfully identified. This is a much easier task than extracting, for instance, product information from different sites, where data first must be correctly extracted and semantically labelled. If extracting information from sites selling race bikes, it is not enough to find the bike information, e.g. model number, colour, frame size and price; these data fields must also correctly

be detected as precisely these. However, there are also numerous difficulties. News extraction is less about separating one structure from a larger structure or many structures, and more about filtering content from unwanted clutter. The lack of structure becomes a problem. Fully automated wrapper generation is unsuitable for news page extraction, since news pages do not contain enough recurring block patterns. [24] Just like on most other sites, there are many factors creating nuisances: every news site has its own design, which also changes frequently, and pages are covered with scripts, images and various other distracting elements. There are also some more minor details complicating the process which are more specific to news sites. According to [15], web designers even try to make it harder to extract data from news sites, motivated by the need to complicate the function of ad-blockers. Sites might require users to log in before allowing them to read (full) articles. Papers may choose to divide their news articles into several separate pages.

There is not a large amount of work specifically aimed at this domain; the most relevant works are described briefly in the rest of this chapter. There are not many of them since, as claimed by Zhang et al. [22], wrapper techniques had, as of 2006, never been aimed at news extraction. (This, however, is not true, as one of the works listed below is from 2004, but it was apparently overlooked.)

Tree Edit Distance (2004)

Reis et al. [17] conducted the best-known work in this field in 2004; it has since been the major work of reference. Their method is aimed at finding hidden templates, using structural similarities. In short, training-set pages from a site of interest are clustered based on similarity. A pattern with wildcards is then calculated to match all of the pages clustered together, and data is matched and labelled. During extraction, the current page is compared to the patterns of the clustered pages to find the most similar cluster/pattern.

Similarity is based on tree edit distance, which is the cost of transforming one labelled ordered rooted tree into another. There are three permissible operations involved in transformations: node insertion, node deletion and node replacement. All operations have a cost, which may be unit cost. An alternative way of describing the cost of transformations is through a mapping [17]:

Definition. Let T_x be a tree and let T_x[i] be the i-th vertex of tree T_x in a preorder walk of the tree. A mapping between a tree T_1 of size n_1 and a tree T_2 of size n_2 is a set M of ordered pairs (i, j), satisfying the following conditions for all (i_1, j_1), (i_2, j_2) ∈ M:

- i_1 = i_2 iff j_1 = j_2.
- T_1[i_1] is to the left of T_1[i_2] iff T_2[j_1] is to the left of T_2[j_2].
- T_1[i_1] is an ancestor of T_1[i_2] iff T_2[j_1] is an ancestor of T_2[j_2].

[Figure 3.2: Mapping between two trees T_1 and T_2]

[Figure 3.3: Top down mapping]

An example of a mapping can be seen in Figure 3.2, taken from [17]. Any dotted lines going from T_1 to T_2 indicate that the corresponding vertex should be changed (replaced) in T_1. Any vertices in T_1 which are not connected by lines to T_2 should be deleted and, similarly, any vertices in T_2 which are not connected to T_1 should be inserted.

Calculating the edit distance is quite computationally expensive, and many simplified, restricted definitions exist, e.g. the top-down mapping defined below, which disallows removal and insertion in all positions but the leaves; see Figure 3.3 from [17] for an example.

Definition. A mapping M between a tree T_1 and a tree T_2 is said to be top-down only if for every pair (i_1, i_2) ∈ M there is also a pair (parent(i_1), parent(i_2)) ∈ M, where i_1 and i_2 are non-root nodes of T_1 and T_2 respectively.

This definition is restricted further by the authors, by adding that replacements, too, may only take place at the leaves, resulting in a restricted top-down mapping (RTDM). An example provided by [17] is depicted in Figure 3.4.

Definition. A top-down mapping M between a tree T_1 and a tree T_2 is said to be restricted top-down only if for every pair (i_1, i_2) ∈ M, such that T_1[i_1] ≠ T_2[i_2], there is no descendant of i_1 or i_2 in M, where i_1 and i_2 are non-root nodes of T_1 and T_2 respectively.

The algorithm locates all identical subtrees of T_1 and T_2 at the same level and groups them into equivalence classes. For each class, a mapping is found between the trees.

[Figure 3.4: Restricted top down mapping example]

[Figure 3.5: Node extraction patterns]

To extract data from a web site one must use a training set, a large number of pages from the same site. Pages are compared using the RTDM algorithm and then clustered using a standard clustering technique. For each of the clusters a node extraction pattern (ne-pattern) is generated, a form of regular expression for trees; see Figure 3.5 for some examples given in [17].

Definition. Let a pair of sibling sub-trees be a pair of sub-trees rooted at sibling vertices. A node extraction pattern is a rooted ordered labelled tree that can contain special vertices called wildcards. Every wildcard must be a leaf in the tree, and each wildcard can be one of the following types:

- Single (•): a wildcard that captures one sub-tree and must be consumed.
- Plus (+): a wildcard that captures sibling sub-trees and must be consumed.
- Option (?): a wildcard that captures one sub-tree and may be discarded.
- Kleene (*): a wildcard that captures sibling sub-trees and may be discarded.

If one wishes to extract a page, the tree representing this page is compared with the node extraction patterns generated for each of the clusters from the training data. Once a page is matched against a pattern, data is easily located. Data is labelled using a number of heuristics, including length of text.
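To make the notion of tree edit distance more concrete, the sketch below computes a simple top-down edit distance with unit costs, in the spirit of the mappings above. Note that it implements the general top-down variant, not the restricted RTDM of [17], and the class and method names are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of a top-down tree edit distance with unit costs (not the thesis code). */
public class TopDownEditDistance {

    static class TreeNode {
        final String label;
        final List<TreeNode> children = new ArrayList<>();
        TreeNode(String label) { this.label = label; }
    }

    /** Number of vertices in the subtree rooted at t; used as the cost of inserting or deleting it. */
    static int size(TreeNode t) {
        int n = 1;
        for (TreeNode c : t.children) n += size(c);
        return n;
    }

    /** Top-down distance: roots are mapped to each other, and a node may only be mapped if its
     *  parent is mapped, so non-mapped material is inserted or deleted as whole subtrees. */
    static int distance(TreeNode a, TreeNode b) {
        int rootCost = a.label.equals(b.label) ? 0 : 1;   // replace the root label if it differs
        int n = a.children.size(), m = b.children.size();
        int[][] d = new int[n + 1][m + 1];                // edit distance over the child sequences
        for (int i = 1; i <= n; i++) d[i][0] = d[i - 1][0] + size(a.children.get(i - 1));
        for (int j = 1; j <= m; j++) d[0][j] = d[0][j - 1] + size(b.children.get(j - 1));
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int del = d[i - 1][j] + size(a.children.get(i - 1));   // delete a's i-th subtree
                int ins = d[i][j - 1] + size(b.children.get(j - 1));   // insert b's j-th subtree
                int map = d[i - 1][j - 1]
                        + distance(a.children.get(i - 1), b.children.get(j - 1)); // map them
                d[i][j] = Math.min(map, Math.min(del, ins));
            }
        }
        return rootCost + d[n][m];
    }

    public static void main(String[] args) {
        TreeNode t1 = new TreeNode("html");
        t1.children.add(new TreeNode("head"));
        t1.children.add(new TreeNode("body"));
        TreeNode t2 = new TreeNode("html");
        t2.children.add(new TreeNode("body"));
        System.out.println(distance(t1, t2)); // prints 1: the "head" subtree is deleted
    }
}
```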

The Curious Negotiator (2006)

The Curious Negotiator is an agent system for negotiations described in [22]. The negotiated items are pieces of information, specifically news articles. As a part of the system, articles must be fetched, classified and stored for later use by negotiation agents. A data extraction agent added to the system performs the following three stages: 1) data extraction, 2) text filtering using a dynamic filter, 3) keyword validation.

The key ideas behind the extraction are that web sites are constructed from hidden or visible nested tables in combination with CSS (Cascading Style Sheets), and that a news article is the largest block of text in a web page. Extracting news from a page becomes the task of identifying the largest portion of text in a table. This is done by inserting all html tags of a page into an array and removing all tags which are not table tags (<table>...</table>) or nested within table tags. The array is then iterated, and each text item is appended to a container which holds all text at its nesting depth. Once the array has been iterated, the container with the largest amount of text is returned.

A second page, preferably similar to the first one, is then fetched, and extraction is done in the same manner. The result is compared to the extracted body of text from the first page. Any identical sentences are considered static parts of the page and are then removed (filtered) from the end result.

To ensure that the URL used during extraction was a valid one, and that nothing else went wrong, the result is validated. Validation is performed using keywords that should reasonably occur in the text. The keywords used are the words appearing in the title, except for stop words (words which do not add any real information, e.g. she, the). If they are found to a satisfactory degree, the text is accepted as an article. Naïve as this approach may seem, the authors claim the method is fully functional, although they do not compare it to any other.

Tag Sequence and Tree Based Hybrid Method (2006)

Li et al. view all extraction techniques as either tag-sequence-based or tree-based, and propose in [14] a hybrid method using both of these. With a tag-sequence-based approach, one may use existing techniques such as languages with good support for regular expressions. Unfortunately, the nesting structure of the document is hard to keep track of using this approach, which on the other hand is an inherent property of tree-based approaches. These, however, do not have the same support for comparing similarity or pattern matching.

The html document is transformed into a novel representation format for web pages, which the authors call a TSReC (Tag Sequence with Region Code): a list whose entries are called tag sequences (TS),

TS = <N, RC_b, RC_e, RC_p, RC_l, C>

Tag sequences contain information about region beginning (RC_b), region end (RC_e), parent (RC_p) and level (RC_l). C is the content: inner html tags or nothing at all. Together these provide the possibility to treat the TSReC as a tree. Any web page can be divided into three different kinds of areas:

- Common parts: areas which are common to all pages of a site, such as the top part of a news site, which usually has items such as the paper name, the date and a small navigational menu with the sections of the paper.
- Regular parts: for instance navigational fields, which occur on all pages but whose content may change.
- Content parts: the target of extraction.

Another page from the same web site, as similar as possible to the current page, is compared to it. The common parts of both pages are first identified using sequence matching, and then the regular parts using tree matching techniques, leaving the content part as the final result. The sequence matching is done by finding the string edit distance between the two pages, and the tag sequences of the common parts are removed from the sequence. String edit distance (the number of operations, taken from some predefined set, that are needed to convert one string into the other) is used since the common parts are not always completely identical; for instance, a date or time may be included. Finding the regular parts is harder than finding the common parts, in particular when they are too flexible, too different. The solution is to create subtrees of all tag sequences which belong together, using the region codes in the tag sequences. Subtrees of one page are then compared to the subtrees of the other page. The remaining tag sequences contain the targeted content part.

Linguistic and Structural Features (2007)

Ziegler et al. [24] suggest a system which identifies text blocks in a document and finds threshold values for different features (properties) of the blocks, used to determine whether a block is part of the article or not. The actual values of the thresholds are calculated using a stochastic non-linear optimisation method called Particle Swarm Optimisation on a large number of pages. These threshold values are then used when extracting other pages.

Text Block Identification

First, all the text blocks of a page must be isolated. An html page is converted to xhtml to allow the page to be parsed into a DOM tree. The tree is traversed and its nodes are either removed, pruned or left untouched. When pruning a node n,

the entire subtree of n is deleted along with n, as opposed to when a node is removed, where only n, or rather its value, is deleted from the tree. The removal of an element or subtree may also mean that the value of the node is replaced by whitespace, to ensure that whitespace in the text is preserved. This is the case for the <br> and <p> tags. Pure formatting tags, such as <big>, <span> or <i>, are simply discarded. Pruning removes any parts of the tree which never contain any real content, such as <textarea>, <img> or script elements. The tree is then traversed and the text blocks are identified.

Features

For each block, minimum or maximum thresholds are calculated for eight separate features (properties), which can be either linguistic or structural. There are four of each:

Structural features

1. Anchor ratio: the proportion of words which are links compared to the total number of words; tends to be low in article text.
2. Format tag ratio: plain text contains relatively many formatting tags.
3. List element ratio: list elements are not very common in article text compared to its surroundings.
4. Text structuring ratio: headings, paragraphs and text block alignment tags are expected to be more frequent in text.

Linguistic features

1. Average sentence length: text tends to have longer sentences than, for instance, link lists.
2. Number of sentences: often higher in plain text.
3. Character distribution: the ratio of alphanumeric characters compared to other characters; expected to be higher in text.
4. Stop-word ratio: should be high for continuous text.

Thresholds are found using a large set of pages, where the correct text blocks have been manually evaluated beforehand.

Extraction

Once the thresholds have been found, extraction from an html page can be done by parsing it into a DOM tree, which is then processed as described above to identify blocks. Which blocks belong to the article is determined by calculating the features for each block and then comparing them to the threshold values for each feature.
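As an illustration, the sketch below computes two of the features listed above for a text block. The stop-word list and the example thresholds are assumptions made for the sake of the example; they are not the values learned with Particle Swarm Optimisation in [24].

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch of two block features; names, stop words and thresholds are assumptions. */
public class BlockFeatures {

    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of", "in", "and", "to", "is", "she"));

    /** Anchor ratio: words inside links divided by all words; tends to be low in article text. */
    static double anchorRatio(int wordsInLinks, int totalWords) {
        return totalWords == 0 ? 0 : (double) wordsInLinks / totalWords;
    }

    /** Stop-word ratio: continuous prose is expected to contain many stop words. */
    static double stopWordRatio(String blockText) {
        String[] tokens = blockText.toLowerCase().split("\\W+");
        if (tokens.length == 0) return 0;
        long stops = 0;
        for (String t : tokens) {
            if (STOP_WORDS.contains(t)) stops++;
        }
        return (double) stops / tokens.length;
    }

    /** A block is kept as article text only if every feature passes its threshold
     *  (the numbers below are example values, not the published ones). */
    static boolean isContent(double anchorRatio, double stopWordRatio) {
        return anchorRatio < 0.25 && stopWordRatio > 0.20;
    }
}
```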

Visual and Perceptual

Compared to the previously described approaches to news content extraction, visual and perceptual methods are quite different. In contrast to other methods, where the most common course of action is to process html either at the tag level and/or as trees, often as a parsed DOM, visual and perceptual approaches are only concerned with how the html has been rendered, not with its structure.

Visual Consistency (2007)

As observed by Zheng et al. [23], humans are easily able to locate the actual content of a news page, regardless of whether it is written in a language they are familiar with or not. The reason is that papers, perhaps by convention, always give the area containing the news article content the same characteristics; they are visually consistent:

- Its area is relatively large compared to surrounding objects.
- There is a title at the top, with a contrasting size and/or font.
- It consists mainly of plain text, and a few other items such as pictures or diagrams.
- The area is fairly centred on the page.

Even if the hidden structure behind the page changes, these four characteristics remain. Motivated by this observation, Zheng et al. suggest dividing a page into blocks corresponding to the visual parts that it is built from. These blocks are rendered from a pair of html tags or the text between them, and are given an ID. They are nested, and have a size and a position. Inner blocks have at least one child block, leaf blocks none. Leaf blocks are labelled as one of Title, Content or Others. The largest block is the entire page itself, everything within the <body>...</body> tags. A page can be viewed as a visual tree made from blocks. The tree contains a parent-child relationship for blocks where one is on top of another.

The authors use an html parsing tool from Microsoft to obtain a visual tree. (It should perhaps be mentioned that two of the three authors work for Microsoft in China.) The tool is able to provide a lot of information about each block, including coordinates, width, height, and information about formatting, e.g. font size and whether text is bold or italic. Lastly, some statistics are given, including the number of images, hyperlinks and paragraphs, and the length of the text, among others.

Machine learning is used to train the tool and induce a template-independent wrapper. Since the visual properties of any news page are quite stable even when the structure behind it is changed, there is no need to find any hidden template. Blocks of the pages in a training set are manually labelled as inner or leaf, and further: a positive inner block is one which has at least one child labelled as Title or Content; otherwise it is called negative.

By using the values of the inner blocks and their labels, and the information about each block given by the html parsing tool, the system is trained to determine which blocks contain some news. A second learning iteration is performed on these blocks to more precisely locate the news and the title.

Perception Oriented (2008)

Another, quite similar approach, which also focuses on how a human is believed to go about finding news, is given by Chen et al. in [4]. Properties quite similar to the ones mentioned in [23] are attributed to the areas of a page which contain the actual news content, and which humans use to identify news content. These properties are:

- Functional property: the key function of this area is to provide information.
- Space continuity: the contents are located continuously in space, separated only by non-informational areas, such as images, navigational or decorational areas.
- Formatting continuity: all news areas should have approximately the same formatting.

The authors claim that their method mimics humans by first locating the area containing news and then locating the actual news content. Each page is turned into a structural representation consisting of objects, called a Function-based Object Model (FOM). Each object represents a content part of a page. Objects have different functions, basic or specific, depending on what kind of content they represent and what the author is trying to convey with this particular piece of content [5]. There are four types: Information, Navigation, Interaction, or Decoration Objects. Objects carrying more than one function are called mixed. There are also a number of objects specific to the domain of news extraction:

- Block Object: separated from other objects by spaces.
- Inline Object: objects which are displayed one after another within a block object.
- Text Information Object / Media Information Object: specific types of information objects.
- Leaf Block Object: a special kind of block object. Leaf block objects cannot contain any other kind of block objects, only inline objects.
- Leaf Block Information Object (LBIO): a leaf block object whose main functionality is to supply information. If the media type of the object is text, it is a Text Leaf Block Information Object (TLBIO).

The authors use a five-step algorithm which is based on the following theorems (the authors call these axioms):

Theorem 3.2.1. News content of a news Web page is presented as a set of TLBIOs in the page.

Theorem 3.2.2. A news TLBIO can only be contained in an information or mixed object.

Theorem 3.2.3. News TLBIOs of a news page are presented in one or more rectangular areas. Vertically, these rectangular areas are separated by media information objects and/or non-information objects.

Theorem 3.2.4. The major content format in a news area is similar to the formats used by the majority of objects inside all news areas.

The five-stage algorithm is, in short, as follows:

1. In the first stage of the algorithm, the page is transformed into a tree structure of FOMs. We refer the reader to [5] for the details of this procedure.
2. Next, all TLBIOs are detected. The tree is traversed top to bottom. Based on Theorems 3.2.1 and 3.2.2, all blocks which are information or mixed objects are added to a set of TLBIOs. The children of any composite block whose relative area compared to the rest of the page is large enough are also added to the set (to avoid missing an information object in a large navigational object).
3. Any areas in the set of TLBIOs which are close enough (Theorem 3.2.3) or have similar formatting (Theorem 3.2.4) are merged. If the resulting area is larger than a predetermined threshold value, this step is performed again, but with a more conservative view on closeness.
4. The merged areas and the TLBIOs are examined and, based on position and formatting among other criteria, the system decides which parts are the actual news content.
5. In the final stage the title is extracted, using a few heuristics. Some of the factors which are considered include that titles tend to be short, close to the article and relatively large in font size.

A Generic Approach (2008)

Dong et al. [7] present a method which does not need to find any hidden template for each page, but instead uses a few heuristics based on the following observations about the DOM trees of articles on news web pages: news articles (including text, date and title) are generally located under a separate node; they are comprised of a number of paragraphs, located close to each other and often with other, unrelated material between them. Format-wise, they contain a lot of text and few links. The authors use the following terminology to describe this more concisely:

- Block Node: an ancestor of nodes which contribute to the structure of the page. Usually contains html tags such as <table>, <div>, <td>, <tr> or <body>.
- Paragraph Node: represents news; tags such as <p> or <br>.
- Invalid Node: does not contribute to the content of the article, such as <script>, <form>, <input> and <select>, or nodes whose children are empty.
- Node's Semantic Value: the number of characters of the content below the node which are not included in a hyperlink.

The previous observations can now be put into these general rules: 1) news, including text, is located below a block node; 2) text is below a paragraph node; 3) the block node containing the news will have the largest semantic value; 4) invalid nodes are irrelevant and should be deleted. These rules are then applied during extraction in the following manner:

1. To remove invalid nodes, the page is transformed into XML.
2. Next, the tree is traversed top to bottom, layer by layer, until a paragraph node is reached. In each step, the semantic value for each node is calculated and only the children of the node with the largest semantic value are evaluated next. The news article text is composed of the text nodes under the current block node.
3. From the current block and upwards, until a <table> or <div> is reached, the content of the nodes is matched against regular expressions in an attempt to find a date indication.
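A minimal sketch of the semantic-value heuristic is given below, using the standard org.w3c.dom API. The method names and the exact set of tags treated as link or invalid content are illustrative assumptions, not the precise rules of [7].

```java
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Illustrative sketch of the "semantic value" heuristic (not code from [7]). */
public class SemanticValue {

    /** Characters of text below the node that are not inside hyperlinks or invalid elements. */
    static int semanticValue(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            return node.getTextContent().trim().length();
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String tag = ((Element) node).getTagName().toLowerCase();
            if (tag.equals("a") || tag.equals("script") || tag.equals("form")) {
                return 0; // link text and invalid nodes do not count towards the semantic value
            }
        }
        int sum = 0;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            sum += semanticValue(children.item(i));
        }
        return sum;
    }

    /** One step of the top-down traversal: pick the child with the largest semantic value
     *  (returns null for a node without children, i.e. when the descent stops). */
    static Node childWithLargestSemanticValue(Node node) {
        Node best = null;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (best == null || semanticValue(child) > semanticValue(best)) {
                best = child;
            }
        }
        return best;
    }
}
```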

Chapter 4

Approach

There are a number of design choices to take into consideration. First, the major tasks of the system are identified as page fetching, extraction, parsing and database interaction. Motivated by the aim to keep the program modular, each of these major tasks, and other less central ones, is implemented in a separate part. All four major parts are described in more detail below. Apart from these, some additional functionality is required, and is described briefly at the end of this chapter.

4.1 Web Content Syndication

Compiling information from different web-based sources and presenting it in a concise manner to a user is called web (content) syndication. As a service to its readers, a news provider, or even a simple blogger, may choose to provide automatic updates on subjects of the user's chosen interests by publishing what is called a feed. To be able to receive a feed one must subscribe to it using a program called an aggregator or feed reader, which polls the site for updated feeds. Updates are sent to the reader through a web feed or syndication feed, an XML-formatted message containing the headline, a link to the page, and often also a short summary or sometimes even the actual content. There are a number of end-user syndication publishers and readers available, both standalone, web-based and as extensions to most modern web browsers.

Instead of visiting a number of different sites to see if there have been any updates, a user can simply use the feed reader to get an overview, and decide whether this is information he or she is interested in. All this allows for an easy way to manage information flow, relieving users of the need to constantly check their favourite sites for updates. At the same time, it is easier to get an overview of incoming updates, giving the opportunity to easily determine whether this is information of interest, in which case the content

of the link location can be investigated further. [21]

There are two major formats for syndication feeds, RSS [2] and Atom. The majority of feeds use RSS. The actual meaning behind this acronym depends on which version is being referred to; the current version, RSS 2.0, is said to stand for Really Simple Syndication [1]. A multitude of different versions of RSS exist, and there are numerous compatibility issues between them [16]. Atom is not as widespread, despite having no compatibility issues and a stricter view on document formatting to avoid the use of poorly structured feeds.

These techniques can be utilised for our purposes as well, as all major news sites provide feeds as a service to their readers. Instead of using spiders or crawlers, these feeds can be used in data gathering. In addition to existing RSS and Atom readers there are also various libraries aiding programmers in directly publishing and subscribing to feeds, among others ROME, Informa, Eddie and the Universal Feed Parser. All support several versions of RSS and Atom, but none of the projects except ROME has been updated in a long time, and Eddie is also unsuitable for this project from a licensing point of view. ROME is free, recently updated and still maintained, and is released under the Apache License 2.0. It supports all versions of RSS and Atom and relieves the user from worrying about incompatible versions and other details. For these reasons, ROME is used in the system to read RSS and Atom feeds.

4.2 Extraction

There are many factors to consider when designing an extraction module. As argued in Section 3.2, it is clear that there are quite varied approaches to the problem. However, as work was in an initial stage, the focus was set on getting a simple prototype working, with less consideration of its actual performance. This was motivated by the desire to get a basic version of the entire system up and running at an early stage, partly to avoid delaying the development of the backend of the system.

Inarguably, the simplest of all the algorithms in Section 3.2 is the one used in the Curious Negotiator. However, questions were raised as to how effective this algorithm really is. It was considered slightly too naïve, and some alterations, described below, were made. The Curious Negotiator algorithm presented by Zhang and Simoff [22] is entirely tag-oriented, and also makes the assumption that the page structure is completely constructed from tables.

The extraction implemented in this system is based on Zhang and Simoff's algorithm, but it is less focused on tables and is not tag-oriented; instead the page is converted into a DOM tree. In order for conversion to work, the page

must be processed. Few web pages conform to set standards, and therefore they must be cleaned before conversion can be done. This is done by an open source library called HtmlCleaner, which interprets html in a fashion similar to that of browsers. Among other things, the library rearranges tags to produce more well-formed html, and is finally able to return a DOM tree. A very small, simplified example of a DOM tree is given in Figure 4.1. The corresponding html is provided in Table 4.1.

[Figure 4.1: DOM tree with path prefixes, showing the DOM tree of the html document in Table 4.1 with a numeric path prefix next to each node]

<html>
  <head>
    <title>A short document!</title>
  </head>
  <body>
    <h1>My Hobby!</h1>
    <p>I like watching TV <i><b>a lot</b></i>!!!</p>
    <p>Some of my favourite shows are the Simpsons and the Big Bang Theory.</p>
  </body>
</html>

Table 4.1: Corresponding html to Figure 4.1

Instead of looking at tags, the DOM tree returned from HtmlCleaner is examined, and this examination is not limited to nodes below table nodes. Instead, all text nodes (with a few exceptions, most notably lists of links) are

included in the process. All nodes are given a unique path name based on their location in the tree. These are the numbers close to each node in the tree in Figure 4.1, which are not part of the DOM tree but a manner of describing the path or location of each node. All text nodes which have a common path name prefix are appended after each other. The longest body of text with a common prefix is considered to be the article.

This description is a simplified one. Several alterations are made to the DOM tree prior to finding the longest common path name prefix. The alterations are simple heuristics which have been found, by observation, to improve results. To further improve results, [22] applies a dynamic filter, which is constructed by extracting the content from two pages on a site. Any common material is added to the filter. The Curious Negotiator algorithm uses a web crawler to fetch pages from news sites, and to avoid including invalid pages, or pages not containing news, keyword validation is applied. Since this system will use RSS feeds from news sites, the risk of encountering invalid pages is considered small enough to completely ignore this stage.

To cope with pages of a less well-structured character, with plenty of formatting such as <i>, <b> or <blockquote> which might be used in a non-standard manner, any text node which occurs below a node representing simple, purely text-formatting tags is given the same prefix as its parent node. Although no dynamic filters are available, a user can add static filter sentences which will never be included in an extraction. By providing the name of the html tag pointing to the next page of a multi-page article, the extractor can fetch all pages in the article, and extract the article from all of them.

4.3 Parsing

The original requirement that some implementation of the Charniak parser must be used was changed, as there apparently is no Java implementation easily available. The Stanford NLP parser is considered an acceptable alternative, is available in Java, and is therefore used in place of the Charniak parser. Unfortunately, it is not the ideal choice when it comes to software licences, since the Stanford NLP parser is released under the GNU GPL v2, which does not allow incorporation into proprietary software. However, by building the system modularly, the parser can be replaced if needed, as long as the output of a new parser follows the standard Penn Treebank annotation style.

Two small examples of output from the Stanford parser are provided in Table 4.2; the sentences are "Negative, I am a meat popsicle" and "Jeremy is afraid of mopeds". A corresponding parse tree for the second sentence can be seen in Figure 2.2. (There is a small mismatch between them; they are, however, both

valid, but the result of two different parsers. The tree in Figure 2.2 is generated using the Stanford parser and the one in Table 4.2 using a Charniak parser.)

(ROOT
  (S
    (ADVP (RB Negative))
    (, ,)
    (NP (PRP I))
    (VP (VBP am)
      (NP (DT a) (NN meat) (NN popsicle)))
    (. .)))

(ROOT
  (S
    (NP (NNP Jeremy))
    (VP (VBZ is)
      (ADJP (JJ afraid)
        (PP (IN of)
          (NP (NNS mopeds)))))
    (. .)))

Table 4.2: Sample output from parser

4.4 Database

Each document must be stored in the database for later processing by the backend. Information that must be stored includes a URL to the document, the title of the article, all sentences from the article along with their corresponding parse trees, and information about each site/RSS feed. Parse trees are stored in a separate table, where each entry in the table is a node of a parse tree and includes a tree ID, a node ID, a word classification ID and the actual value (which is null for all inner nodes).

The simple approach to storing trees would be to use an adjacency list, where each node entry contains a reference to its parent. To find an entire tree, one recursively queries the database for each level from the root and downwards. The adjacency list model is quite slow and inefficient, and troublesome to work with for applications such as ours. This work uses another model called the Nested Set Model. It is based on an idea found in [8] on how to preserve the hierarchical structure of trees. Each node entry in the database contains a left and a right value, which are numbered in a preorder tree traversal fashion. The left value of each node will be the smallest number among all its descendants, and its right value will be the largest number among all its descendants. All leaves will have a difference of one between their left and right values. A small example taken from [12] is depicted in Figure 4.2. In order to improve performance during backend queries, tree depth was also added during a later stage of development (which actually removed the need for a right value altogether). The database runs on a MySQL server provided by CodeMill AB.

4.5 System Overview

In short, the components described above are the following.

Extractor. Extracts articles, provides the possibility to add static filters, and fetches multi-page articles.

Figure 4.2: Left and right numbering in the Nested Set Model

4.5 System Overview

In short, the components described above are the following.

Extractor. Extracts articles, provides the possibility to add static filters, and fetches multi-page articles.

FeedReader. Takes a URL pointing to a feed and extracts data from the feed, using a library called ROME.

Parser. Returns a string representation of a parse tree, using the Stanford NLP parser.

DatabaseInteraction. Converts any parse tree into a format more suitable for insertion into the database, and performs all interaction with the database.

In addition to these, some other minor tasks must be carried out. They include managing information about each site, providing a common policy for handling strings, and supporting DOM processing. These tasks are implemented in the following parts:

DOMparser. Auxiliary methods for examining DOM documents.

SiteManagement. Handles adding and removing sites.

TextProcessing. Ensures identical processing of text in different parts of the system. Detects sentences in the text using OpenNLP 7, a natural language processing library.

A simplified outline of the entire system is depicted in Figure 4.3, with the frontend in the left part of the figure, the backend to the right, and the database in the middle. The system has a small administration GUI which allows a user to add and remove RSS/Atom URLs, and to add filters to any existing feed.

7 Project homepage:
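To give an impression of how the FeedReader component obtains article links, the following minimal sketch reads a feed with ROME. It assumes a ROME release using the com.sun.syndication package layout and uses a placeholder feed URL; it is not taken from the thesis code.

    import java.net.URL;

    import com.sun.syndication.feed.synd.SyndEntry;
    import com.sun.syndication.feed.synd.SyndFeed;
    import com.sun.syndication.io.SyndFeedInput;
    import com.sun.syndication.io.XmlReader;

    /** Minimal sketch of reading an RSS/Atom feed with ROME. */
    public class FeedReaderSketch {

        public static void main(String[] args) throws Exception {
            String feedUrl = "http://example.com/rss";  // placeholder feed address

            SyndFeedInput input = new SyndFeedInput();
            SyndFeed feed = input.build(new XmlReader(new URL(feedUrl)));

            // Each entry carries at least a title and a link to the article page;
            // the link is what the frontend fetches and hands to the extractor.
            for (Object o : feed.getEntries()) {
                SyndEntry entry = (SyndEntry) o;
                System.out.println(entry.getTitle() + " -> " + entry.getLink());
            }
        }
    }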

Figure 4.3: Outline of the complete information-retrieval system, showing the frontend components (SiteManagement, FeedReader, DomParser, Extractor, Parser, Database Interaction), the shared database, and the backend components (Mapping, Indexing, Query processing) serving the user


Chapter 5 Results

Testing the effectiveness of the implemented extraction method is a complicated matter. It is difficult to determine what to include in an article, since this is a matter of subjective opinion, as noted by Ziegler et al. in [24]. Humans do not always agree on which text excerpts should be included in an article, so it is difficult to establish a ground truth for evaluating automated extraction.

Furthermore, there is the question of how to perform comparisons. The results can, for instance, be compared document-, character- or sentence-wise. Evaluating results through document-to-document comparison is not a feasible approach, since it is too coarse-grained to be useful. Character comparisons will most likely give better results than sentence-wise comparisons, if one looks only at the percentage of correctly extracted text. However, not only the amount of text but also the text quality is of major importance: if sentences are erroneously detected, then the input to the parser, and thus also its output, will not be meaningful. For this reason, we decided to perform comparisons at the sentence-to-sentence level. The disadvantage of comparing sentences is that the outcome depends on the performance of the library used to detect sentences in free text.

The extraction system is thus evaluated as follows. First, articles from a number of pages are extracted by hand. The system employs a library called OpenNLP to perform sentence detection on all extracted text before invoking the parser sentence by sentence. During evaluation, all hand-extracted articles are divided into sentences using OpenNLP. The HTML pages from which the articles were extracted are fetched by the system, extraction is performed on the fetched pages, and all the extracted text is divided into sentences, again using the OpenNLP library. Finally, the machine-extracted sentences are compared to the hand-extracted sentences. Testing is performed using a simple JUnit 1 class.

1 A framework for unit testing in Java. Available at
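Since both the reference text and the extracted text pass through the same sentence detector, a minimal sketch of that step is given here. It assumes OpenNLP's 1.5-style API and a pre-trained English sentence model file (en-sent.bin); neither the library version nor the model file is specified in the thesis, so this is an illustration rather than the actual TextProcessing code.

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;

    /** Minimal sketch of splitting extracted text into sentences with OpenNLP. */
    public class SentenceSplitter {

        public static String[] split(String text) throws Exception {
            // Pre-trained English sentence model; the file name is an assumption.
            InputStream modelIn = new FileInputStream("en-sent.bin");
            try {
                SentenceModel model = new SentenceModel(modelIn);
                SentenceDetectorME detector = new SentenceDetectorME(model);
                return detector.sentDetect(text);
            } finally {
                modelIn.close();
            }
        }
    }

Running the hand-extracted reference and the machine-extracted text through the same splitter means that any weaknesses in sentence detection affect both sides of the comparison equally.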

It is difficult to choose a good metric for measuring the system's performance. Simply counting the number of extracted sentences that are identical to those of the reference document is not reasonable: an exact sentence may be evaluated as missing from the extracted article compared to its hand-extracted counterpart, while a very similar sentence is detected instead.

Example 1. The Wall Street Journal

An attempted extraction of the article 2 shown in Figure 5.1 and Figure 5.2 detects 32 common (correctly extracted) sentences. Three sentences are labelled as missing from the extracted article and three sentences are mistakenly included, as seen in Table 5.1. On closer inspection, the errors do not seem too severe.

Sentences erroneously found by automatic extraction, which should not be included:

By. WILLIAM TUCKER There isn't much doubt that Congress and incoming President Barack Obama will try to impose some kind of limits on carbon emissions.

Mr. Tucker is author of "Terrestrial Energy: How Nuclear Power Will Lead the Green Revolution and End America's Long Energy Odyssey", published in October by Bartleby Press.

Please add your comments to the Opinion Journal forum.

Sentences present in the manually extracted text but missing from the automatic extraction:

Wind and biofuel could become the next subprime mortgage fiasco.

By WILLIAM TUCKER There isn't much doubt that Congress and incoming President Barack Obama will try to impose some kind of limits on carbon emissions.

Mr. Tucker is author of "Terrestrial Energy: How Nuclear Power Will Lead the Green Revolution and End America's Long Energy Odyssey", published in October by Bartleby Press.

Table 5.1: Comparison of automatically and manually extracted text from a Wall Street Journal article

The extraction does miss the subtitle "Wind and biofuel could become the next subprime mortgage fiasco.", and it wrongly includes the invitation to add comments about the article. The other two sentences, however, are correctly detected as belonging to the text, although they are not correctly split into sentences.

2 Carbon Limits, Yes; Energy Subsidies, No
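The comparison itself amounts to plain set operations on the two sentence lists: common sentences, reference sentences that are missing, and sentences that were erroneously included. The sketch below illustrates this; the class and method names are invented and do not come from the actual JUnit test class.

    import java.util.LinkedHashSet;
    import java.util.Set;

    /** Illustrative sketch of the sentence-level comparison. */
    public class SentenceComparison {

        /** Sentences found both in the hand-extracted reference and in the extraction. */
        public static Set<String> common(Set<String> reference, Set<String> extracted) {
            Set<String> result = new LinkedHashSet<String>(reference);
            result.retainAll(extracted);
            return result;
        }

        /** Reference sentences that are missing from the extraction. */
        public static Set<String> missing(Set<String> reference, Set<String> extracted) {
            Set<String> result = new LinkedHashSet<String>(reference);
            result.removeAll(extracted);
            return result;
        }

        /** Sentences erroneously included by the extraction. */
        public static Set<String> erroneous(Set<String> reference, Set<String> extracted) {
            Set<String> result = new LinkedHashSet<String>(extracted);
            result.removeAll(reference);
            return result;
        }
    }

As Table 5.1 shows, such counts must be read with some care, since a sentence reported as missing may reappear, segmented slightly differently, among the erroneously included ones.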

Figure 5.1: A sample article from the Wall Street Journal

Example 2. USA Today

The previous example demonstrated a fairly well executed extraction. The Wall Street Journal is, in general, rather kind to this extraction approach, but there are occasions where the result is not quite as successful. Let us look at the result of extracting an article published in USA Today 3. The extraction correctly finds 54 sentences, but a number of sentences in the extracted article are not present in the manually extracted article; see Table 5.2.

3 What's the attraction? Look to society, biology, not logic

The last four sentences in the list of erroneously included sentences can easily be removed with a filter for this site (the same sentences occur on every page where comments are allowed); see Figure 5.4. The sentences "BRAIN SCANS: Honeymoon period doesn't always end" and "BETTER LIFE: More on sexual health" cannot be excluded using filters. These are links in the middle of the page (see Figure 5.3), at the same level of the DOM tree as the article, inviting the reader to follow up on the subject by reading articles with similar content. The links are unique to this page, so no site filter can be applied to solve this problem. Another common cause of error can also be seen here: the initial sentence is often prefixed with information such as the byline, news provider and newspaper name.
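The site filters mentioned above are essentially a list of sentences that must never appear in an extraction. A rough sketch of applying such a static filter (illustrative names, not the thesis code) follows; it simply drops every extracted sentence that matches a filter sentence.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;

    /** Illustrative sketch of applying a site's static filter sentences. */
    public class StaticFilter {

        /** Drops every extracted sentence that exactly matches a filter sentence. */
        public static List<String> apply(List<String> extracted, Set<String> filterSentences) {
            List<String> kept = new ArrayList<String>();
            for (String sentence : extracted) {
                if (!filterSentences.contains(sentence.trim())) {
                    kept.add(sentence);
                }
            }
            return kept;
        }
    }

As noted above, this handles recurring boilerplate such as comment invitations, but not links that are unique to a single page.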

Figure 5.2: The continuation of the article in Figure 5.1

There are some situations where extraction fails outright. One such example is blog posts, where presumably only the largest post on the page will be returned. Extracting an article from a page that allows readers to comment may produce a comment that is longer than the article, or all comments (and perhaps even the article) concatenated into one massive blob of text containing a very large number of sentences. Further, articles which are very short may be passed over entirely in favour of larger blocks of text on the page.

A small timing test was performed on 50 pages, measuring only the time for extraction, sentence detection and parsing; page fetching, database management, and so on were not included. On average, the proportion of time spent parsing the extracted text is 0.96, so parsing by far dominates the total computation time.

Figure 5.3: A sample article from USA Today

Figure 5.4: The continuation of the article in Figure 5.3
