WICE- Web Informative Content Extraction


Swe Swe Nyein*, Myat Myat Min**
*(University of Computer Studies, Mandalay)
**(University of Computer Studies, Mandalay)

ABSTRACT

With the accelerated development of the Internet, a huge amount of data has been accumulated and stored on the Web. Web pages usually contain various content, some relevant and some irrelevant to the main topic of the page. Extracting the useful or relevant information from this mass of information is becoming more complex and time consuming. Identifying the useful data region is a significant problem for information extraction from Web documents. In this paper, we propose a system that can extract informative or useful content from Web pages across different sites. XPath-based extraction rules are generated to facilitate later extraction from other similar pages. We performed experimental studies using real Web pages from several kinds of Web sites, namely commerce, business directory and publication sites. The extraction accuracy is also compared with prior research, and the results support the validity of the approach.

Keywords - Web informative extraction, Web mining, XPath.

I. INTRODUCTION

Nowadays, the World Wide Web has become one of the most significant information resources. It delivers information mainly in the form of Web pages. Web sites are becoming more sophisticated, and to be competitive, a site needs to engage the visitor. This means a dynamic site with features such as polls, surveys, newsletters and a discussion forum. However, the overwhelming volume of information on the Web, together with its dynamic nature and huge size, makes compressing, ranking, indexing or mining the Web very difficult. Due to the heterogeneity and lack of structure of Web information, automated discovery of relevant information becomes a difficult task [1].
Web content that is not accessible through general search engines is called the hidden Web or invisible Web [2]. Data in Web pages may be unstructured, semi-structured or structured. Structured data usually contain important information. These data are often retrieved from underlying databases and displayed in Web pages using fixed templates; such structured data objects are called data records [3]. Hence, the structured data on the Web are often very important since they represent their host pages' essential information, e.g., details about lists of products and services. In this paper, we extract data items from the data records as informative or useful content. Extracting useful content is an important task because it allows us to integrate information from multiple sources to provide value-added services, e.g., customizable Web information gathering, comparative shopping, and meta querying and searching. It can also be applied in recommendation and Decision Support Systems (DSS). The extraction problem has been studied by researchers in the AI, database, data mining and Web communities [4]. There are several techniques for structured data extraction, which is also called wrapper generation. E-commerce Web sites are an example: one may want to extract items of information such as product name, price and description from a page for comparative shopping. We call these items target items. We first observe a few sample pages from multiple sources, together with their visual information, to learn extraction rules. The generated rules are then used to extract

target items from the remaining pages. XPath is one of the ways to solve the extraction problem. Although there are also automatic approaches to generating wrappers or extraction rules, they are usually less accurate and still need manual post-processing to identify the items of interest.

This work aims to extract informative content from dynamic Web documents. A particularly promising approach for extracting informative content from HTML documents, which the proposed system adopts, is to employ XML technologies to translate HTML into pure Extensible Markup Language (XML). We apply XPath (XML Path) expressions over the DOM tree obtained from the XML document to generate extraction rules. The origins of XPath wrappers can be traced to N. K. Tran et al. [5] and the extraspec system [6] for the extraction of relevant information.

The rest of the paper is organized as follows: related work is reviewed in Section II. Background on information extraction is introduced in Section III. Our solution to Web informative content extraction is described in Section IV. Experimental results and the conclusion are reported in Sections V and VI respectively.

II. RELATED WORKS

With the wide application of Web 2.0, traditional Web information extraction technology can no longer meet the needs of users. Traditional Web information extraction is mainly directed at static HTML pages, and it is powerless against dynamic Web pages that rely on JSP, Ajax, ASP, PHP, etc. How to efficiently extract information from dynamic Web pages has become one of the difficult problems in the information extraction field. There has been considerable research on the general problem of extracting information from Web pages. Since a large percentage of dynamically generated Web pages have some form of underlying template, RoadRunner [7] and the Vertex system [8] try to extract structured data by identifying and exploiting the templates. W. Liu and X.
Meng [9] introduced the Vision-based Data Extractor (ViDE) to extract structured results from deep Web pages automatically. The visual information of Web pages was obtained by calling the programming APIs of IE, which was a time-consuming process. R. Baumgartner et al. [10] proposed a system that was semi-automatic or even manual, relying on training and human assistance to different extents. This technique is becoming impractical as more and more large-scale Web applications emerge, such as building large-scale meta-search engines or building meta-search engines on demand. MDR [11] extracts the data-rich sub-tree indirectly by detecting the existence of multiple similar generalized nodes, each of which is a collection of child nodes of the sub-tree. Each generalized node is then checked to extract records. Y. Zhai and B. Liu proposed an unsupervised approach that automatically detects Web blocks and extracts the Web data from those blocks [12]. Extraction tools are compared in [13], [14] and [15]. Chang and Lui [16] attempted to generate repetitive patterns from unlabeled Web pages. Their system failed when a page contained only a single data record. Existing methods thus still have some limitations. In particular, most of the research did not take into account the data extraction time for all the tested sources. Our proposed system concentrates on informative content extraction based on multiple pages and addresses this limitation.

III. BACKGROUND ON INFORMATION EXTRACTION

Web information extraction is the problem of extracting target information items from Web pages. Work on the extraction problem follows three main approaches: manual data extraction, wrapper induction (semi-automatic data extraction) and automatic extraction. Manually constructed systems require programmers to derive the extraction rules, so they are costly and difficult to scale up. Wrapper induction requires less skill from users, who only label sample pages from which the system induces the extraction rules.
Automatic extraction systems generate the wrappers without any user intervention and have received a lot of attention.

We can differentiate the various IE systems by the type of data used as their source: structured data, semi-structured data and unstructured data. Unstructured data extraction aims at extracting data from totally unstructured free text written in natural language, where the data is embedded in full sentences within continuous text. In semi-structured data extraction, no semantics is applied to the data, and no Natural Language Understanding (NLU), such as analysis of words or sentences, is required to extract the relevant information. Examples are advertisements in newspapers, job postings and highly structured HTML pages. HTML, however, is rather human-oriented or presentation-oriented. It lacks the separation of data structure from layout that XML provides. Structured data on the Web are typically data records retrieved from underlying databases and displayed in Web pages following some fixed templates. Extracting such data records is useful because it enables us to obtain and integrate data from multiple sources (Web sites and pages) to provide value-added services, e.g., customizable Web information gathering, comparative shopping and meta-search. For this purpose a number of computer programs and systems were developed for semi-automatic and automatic information extraction, but it was not until the beginning of the Web that the most important developments came about.

IV. THE PROPOSED WEB INFORMATIVE CONTENT EXTRACTION

This section describes how to extract the informative or useful content from Web pages. First, we observe the content, layout, style and structure of Web pages in order to construct the extraction rules. We then characterize the informative content, including product title, price and description, in general. The content and structure of a Web page may change drastically over time, so that none of its syntactic features is retained. Within a small time interval, however, the contents of the pages have a lot of commonalities.
The proposed system is based on the following two steps: extraction rule generation and informative content extraction.

Rules Generation

In order to extract informative content from Web pages, the extraction rules are generated first. Before generating the rules, it is important to determine which regular data records are useful to the user. In a particular application, the user is usually interested in only a specific type of data record, e.g., a list of products. Simple extraction rules can be designed to output the required fields of such data records, e.g., product name, price and description. The extraction rules are specified as XPath (XML Path Language) expressions. By specifying nodes in terms of the structure and attribute values of their adjacent nodes, an XPath is likely to be reusable for other similar pages. We use the structure of an XML document to locate particular parts of the document. Location paths are the most useful and widely used feature of XPath. A location path is an expression that specifies how to navigate an XPath tree from one node to another. A location path can be absolute or relative. Location paths are composed of sequences of location steps. A location step contains an axis and a node test separated by a double colon (::), optionally followed by a predicate enclosed in square brackets ([]). The use of XPaths for Web data extraction has been previously explored by Myllymaki and J. Jackson [17]. They used content-based (based on text on the Web page), attribute-based (based on the values of node attributes) and structure-based (based on local node structure) XPaths. In this work, we emphasize the use of attribute-based XPaths. Most Web page content (image, text) is not found again on other pages, so a better cue can be derived from attribute values on the page. It is common to see important data items highlighted in a certain font size or other text attribute.
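The anatomy of a location step described above (axis, node test, optional predicate) can be sketched in a few lines of Python. This is only an illustration and not part of WICE: the helper names `parse_step` and `parse_location_path` are hypothetical, the parser handles only simple steps, and it assumes that predicates contain no slash characters and that the `//` abbreviation is not used.

```python
import re

# One location step: optional "axis::", a node test, optional "[predicate]".
STEP_PATTERN = re.compile(
    r"^(?:(?P<axis>[a-z-]+)::)?"      # axis followed by a double colon
    r"(?P<node_test>[^\[\]]+)"        # node test, e.g. a, span, *, text()
    r"(?:\[(?P<predicate>.*)\])?$"    # predicate in square brackets
)


def parse_step(step):
    """Split one location step into axis, node test and predicate."""
    match = STEP_PATTERN.match(step.strip())
    if match is None:
        raise ValueError(f"not a recognised location step: {step!r}")
    axis = match["axis"]
    node_test = match["node_test"]
    if axis is None:
        # Abbreviated syntax: "@name" stands for attribute::name,
        # any other step defaults to the child axis.
        if node_test.startswith("@"):
            axis, node_test = "attribute", node_test[1:]
        else:
            axis = "child"
    return {"axis": axis, "node_test": node_test,
            "predicate": match["predicate"]}


def parse_location_path(path):
    """Split a location path into steps (simplified: no '/' in predicates)."""
    absolute = path.startswith("/")
    steps = [parse_step(s) for s in path.split("/") if s]
    return {"absolute": absolute, "steps": steps}


if __name__ == "__main__":
    print(parse_location_path("/child::a[@class='productlist-ex-info-name']"))
    print(parse_location_path("div/span[@class='sale-price']"))
```

An attribute-based rule in this sense is a step whose predicate tests an attribute value, such as `@class='sale-price'`, rather than the text content of the node.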
Pages within a Web site have a similar structure, and page content is displayed using precise design parameters contained in attributes; the attribute values (informative content) occur at fixed positions within pages. These fixed positions can be defined as paths, called XPaths, from the root of the page's DOM tree to the node containing the attribute values. In

some applications, one needs to extract data from detail pages as they contain more information. For example, in a list page, the information on each product is usually quite brief, e.g., containing only the name, image and price. However, if an application also needs the product description, it has to be extracted from the detail pages. On some Web sites, detail pages have different layouts and structures.

Informative Content Extraction

The extraction rules stored in the database are applied to other input pages in order to extract the informative content. The system first checks and validates the syntax of the input HTML using the tidy tool, which automatically fixes markup errors. A well-formed document (XML document) is necessary in order to construct a correct DOM tree. Second, the well-formed XML document is parsed into a DOM tree. The system then automatically extracts the informative content using the extraction rules. The WICE system consists of the following steps:

Well-Formed Web Pages (HTML - XML)

HTML often includes bad constructions, and language standards are frequently broken, i.e., improperly closed tags, wrongly nested tags, bad parameters and incorrect parameter values. Pages that are not well-formed can be converted to well-formed pages. A well-formed XML document should start with an XML declaration that indicates the version of XML being used as well as any other relevant attributes. It must follow the syntactic guidelines of the tree model. This means that there should be a single root element, and every element must consist of a matching pair of start tag and end tag placed within the start and end tags of its parent element. A well-formed XML document is syntactically correct. This allows it to be processed by generic processors that traverse the document and create an internal tree representation. First, we check the HTML code using HTML Tidy and transform the semi-structured HTML documents into structured XML documents.
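The clean-up step above can be illustrated with a minimal sketch. The paper uses HTML Tidy; the code below does not call Tidy but uses Python's standard `html.parser` as a simplified stand-in that closes unclosed tags and quotes attribute values. The names `LenientTreeBuilder` and `html_to_xml`, the wrapper element `root` and the sample markup are all assumptions made for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have an end tag in HTML.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class LenientTreeBuilder(HTMLParser):
    """Build an element tree from sloppy HTML, closing tags where needed."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("root")   # single root element
        self.stack = [self.root]         # currently open elements

    def handle_starttag(self, tag, attrs):
        attributes = {name: (value or "") for name, value in attrs}
        element = ET.SubElement(self.stack[-1], tag, attributes)
        if tag not in VOID_TAGS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close everything up to the nearest matching open tag;
        # stray end tags with no matching start tag are ignored.
        for index in range(len(self.stack) - 1, 0, -1):
            if self.stack[index].tag == tag:
                del self.stack[index:]
                break

    def handle_data(self, data):
        if not data.strip():
            return
        current = self.stack[-1]
        if len(current):                 # text after a child element
            last = current[-1]
            last.tail = (last.tail or "") + data
        else:                            # text directly inside the element
            current.text = (current.text or "") + data


def html_to_xml(html):
    """Return a well-formed XML string for a (possibly broken) HTML string."""
    builder = LenientTreeBuilder()
    builder.feed(html)
    builder.close()
    return ET.tostring(builder.root, encoding="unicode")


if __name__ == "__main__":
    broken = "<div><p>One<p class=x>Two<br></div>"   # unclosed tags, unquoted attribute
    print(html_to_xml(broken))
```

The output can be parsed by any generic XML processor, which is the property the later DOM and XPath steps rely on. A real tidy tool applies many more repairs than this sketch, such as HTML-specific nesting rules.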
After that, the XML document is transformed into a DOM tree.

DOM Tree

In order to handle structured documents written in HTML or XML more efficiently and consistently, the World Wide Web Consortium (W3C) published the Document Object Model (DOM) specification. DOM gives the ability to access and manipulate information stored in a structured HTML or XML document. Figure 1 shows the DOM tree resulting from a sample XML page.

Figure 1. DOM Tree

Extraction based on XPath

To find information in an XML document without XPath, the document would need to be parsed and the returned elements then examined one by one. This is an inefficient approach for large documents. XPath provides a way of locating specific parts of an XML document. We use XPath expressions that identify the node in a new page corresponding to each attribute, together with other components that extract the actual text value of interest from the node. An XPath expression returns a collection of element nodes that satisfy certain patterns specified in the expression. The names in the XPath expression are node names in the XML document tree that are either tag (element) names or attribute names, possibly with additional qualifier conditions to further restrict

the nodes that satisfy the pattern. We can obtain XPaths for the informative content in all sample pages. Our method assumes that informative content appears in similar positions in pages with similar layout. In order to apply the XPath information to the remaining pages, we generate extraction rules for the informative content. The generated rules are applied to the crawled Web pages to extract informative content from them. An advantage of XPath is that it is portable to other applications; most popular programming languages support executing XPath statements on a DOM parsed from a well-formed document.

An XPath expression defines a traversal through a DOM tree. A location path is used to address a certain node-set of a document. A location step is the most important construct of a location path in XPath, making it possible to select a number of nodes from a given set of nodes according to certain criteria (e.g., selecting only the elements of a node-set which have a given name or a given relation to the context node). Location steps are separated by slash characters. When we want to extract the product name and price from the sample Web page, we can apply the following expression rules:

Exp. (1) /child::a[@class='productlist-ex-info-name']
Exp. (2) /div/span[@class='sale-price']

Each location step consists of three distinct parts: an axis, a node test and a predicate. Each step is evaluated on a set of DOM nodes and yields a set of DOM nodes as its result. For example, Exp. (1) selects all child a nodes whose class attribute has the value productlist-ex-info-name in the sample DOM tree, and yields the text nodes inside those a elements. Exp. (2) selects all span nodes whose class attribute has the value sale-price and that are children of the div element. The text nodes inside the span elements are yielded. In a path expression, node filters match elements whose tag name corresponds to the value of the node filter.
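As a rough illustration of how rules such as Exp. (1) and Exp. (2) might be applied to a well-formed page, the sketch below uses Python's standard `xml.etree.ElementTree`. Several things here are assumptions: the sample markup is invented, the rule names `name` and `price` and the function `extract` are hypothetical, and the expressions are rewritten with a leading `.//` because ElementTree supports only a subset of XPath (no explicit `child::` axis and no absolute paths on elements).

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed list page with two product records.
SAMPLE_PAGE = """<html><body>
  <div class="item">
    <a class="productlist-ex-info-name" href="/p/1">Sample Phone</a>
    <div><span class="sale-price">$99.00</span></div>
  </div>
  <div class="item">
    <a class="productlist-ex-info-name" href="/p/2">Sample Camera</a>
    <div><span class="sale-price">$249.50</span></div>
  </div>
</body></html>"""

# Attribute-based extraction rules, one XPath per target item.
# These are ElementTree-compatible adaptations of Exp. (1) and Exp. (2).
RULES = {
    "name": ".//a[@class='productlist-ex-info-name']",
    "price": ".//div/span[@class='sale-price']",
}


def extract(xml_text, rules):
    """Apply every rule to the page and collect the matched text values."""
    root = ET.fromstring(xml_text)          # parse into an element tree
    record = {}
    for field, xpath in rules.items():
        nodes = root.findall(xpath)         # nodes selected by the rule
        record[field] = [(node.text or "").strip() for node in nodes]
    return record


if __name__ == "__main__":
    print(extract(SAMPLE_PAGE, RULES))
```

Because the rules depend only on tag names and class attribute values, the same dictionary can be reused for other pages generated from the same template, which is the reuse property the paper relies on. A full XPath engine would be needed to run the expressions exactly as written in Exp. (1) and Exp. (2).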
Special node filters are *, which matches all element nodes but no text nodes; text(), which matches all text nodes but no element nodes; and node(), which matches both. Predicates ([ ]) are used for further filtering the nodes selected by the axis and the node test (and possibly other predicates), and they are applied to each node in the node set. Expressions inside predicates are evaluated in a boolean context, i.e., if a predicate evaluates to true, the node remains in the resulting node set; otherwise it is removed from the node set. Path expressions starting with @ are interpreted as accesses to attributes. Finally, the proposed system extracts the informative content from a variety of list and detail pages using the generated rules, and the extracted results for each page are stored as attributes of a record.

V. EXPERIMENTAL EVALUATION

This section presents the empirical evaluation of the WICE system presented in the previous section. The objective of WICE is to provide a very compact form containing the informative (useful) information without cluttering the view. We also compare the performance of WICE with previous methods [18, 19]. In the experiments, we trained on a few sample pages from each domain for extracting informative content. We then performed Web content extraction on the evaluation dataset. Extraction rules are stored in the database; for each document in the evaluation dataset we parse the HTML code, obtain the DOM tree, traverse the DOM with the extraction rules and finally obtain the extracted informative content. Extraction results are evaluated in the next section.

Experimental Datasets and Performance Metrics

We prepared three sets of Web pages for the empirical evaluation of the WICE system. The first set of experiment data consists of Web pages collected from 8 commercial Web sites: ebay, etsy, amazon, buy, bestbuy, productwiki, jr and myshopping. These sites contain Web pages for many categories of products.
We selected the Web pages that focus on the following categories of products: Books, Cell phones and Accessories, Clothing and accessories, Baby, Computers, Cameras and Photos, Jewelry and

Watches, Toys and games, Home and kitchen, Electronics, Movie and Television, Art, Software, and so on. The other two sets of experiment data are Web pages collected from two business directory sites, YellowPages and Myanmaryp, and one publication site, Citeseer. These sites contain Web pages listing many business names and publications. In a business directory, we can find the addresses of businesses such as banks, hotels and restaurants. The sites used in the commercial dataset contain many introduction or overview pages for different kinds of products. The Web pages from these sites contain a large amount of noisy information such as advertisements, navigation bars, directory lists, headers, footers and copyright notices.

To measure the accuracy of the extraction task, we apply metrics adapted from information retrieval (IR), given in Equation (1). We assume the availability of an evaluation set. This is a set of completely annotated pages from the same domain as the extraction task, assumed to be representative of that domain. Let T denote the total number of target elements (informative content) for the extraction task in the evaluation set. Extraction rules are applied to the evaluation set. Target elements that are extracted correctly are called true positives (TP). Extracted elements that are not target elements are called false positives (FP). Target elements from the evaluation set that were not extracted by the system are called false negatives (FN). To measure whether the extraction achieves both reasonable precision and reasonable recall, the F-measure is used. The F-measure is defined as the harmonic mean of precision and recall. We experimented with several well-known commercial Web sites including amazon (A), ebay (EB), etsy (ET), buy, bestbuy (Best), productwiki (PW), myshopping (S) and JandR (jr). Evaluation results for some domains are shown in Tables 1 and 2.
The average extraction time of our proposed system over all tested Web pages from the different domains is less than 250 milliseconds, except for yellowpages.com, which takes nearly 1 second, as shown in Figure 2.

Figure 2. Average run-times for 11 Web sites. The vertical axis represents the execution time (in milliseconds) for the different Web sites (plotted on the horizontal axis).

Table 2. Evaluation results for different domains
Columns: Domain | P | R | F
Rows: Books; Cell phones; Clothing and; Computers; Cameras and; Home and; Movie and; Art and Crafts; Software; Publication; Business

Actual class        Extracted    Not extracted
Target item         TP           FN
Not target item     FP           -

Precision: P = TP / (TP + FP)    (1)
Recall: R = TP / T = TP / (TP + FN)
F-measure: F = 2PR / (P + R)

Precision is defined as the percentage of the elements extracted by the system that are extracted correctly. Recall is the percentage of the target elements in the evaluation set that are extracted by the system. For example, if the system extracts eight items of which only six are correct, and the evaluation set contains six target items, then the precision is 6/8 and the recall is 6/6.
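The metrics defined above can be checked with a short Python sketch. The function name `precision_recall_f` is hypothetical, and returning 0.0 for an empty denominator is a convention assumed here rather than something the paper specifies. The usage example reproduces the worked case from the text: eight extracted items, six of them correct, and six target items in the evaluation set.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) from extraction counts.

    tp: target items extracted correctly
    fp: extracted items that are not target items
    fn: target items the system failed to extract
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0   # tp + fn equals T
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Worked example: 8 items extracted, 6 correct, 6 targets in total.
    p, r, f = precision_recall_f(tp=6, fp=2, fn=0)
    print(f"P={p:.2f} R={r:.2f} F={f:.2f}")   # P=0.75 R=1.00 F=0.86
```

Since the F-measure is a harmonic mean, it stays closer to the lower of the two values, so a system cannot score well by maximizing only precision or only recall.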


Comparison of Performance Results

To evaluate WICE's extraction accuracy, we compared it with previous approaches on the Book, Publication and Nokia domains. We ran WICE on the same domains and compared the systems using both their metrics and ours. We selected the Books and Publication domains from ObjectRunner (OR) [18]. For Books, OR extracts title, price, date and author as its published standard classification, while WICE extracts the same attributes plus product details such as description, ISBN and format from the same Web sites. For the Publication domain, OR extracted the title, author(s) and date attributes. In addition to the same attributes, WICE extracts the proceedings or conference name, publisher and abstract. We randomly selected 50 pages per site. The pages within a site are list and detail pages, in order to illustrate that WICE performs well on multiple pages. Both systems are evaluated on the precision of extraction: the precision for correctness (Pc) and the precision for partial correctness (Pp). According to the OR measure, the extraction results of WICE for the Book and Publication domains are summarized in Table 3 and Figure 3. Table 4 shows both precision values for the Book and Publication domains. We observe that overall, WICE outperforms OR by a significant margin. We also compare the results of the two measures on the Book and Publication domains. WICE's result exceeds that of OR in the Book domain and matches it in the Publication domain, as shown in Figure 4.

Table 4. Comparison results with OR
Columns: Domain | WICE (Pc, Pp) | ObjectRunner (Pc, Pp)
Rows: Books; Publication

Figure 3. Overall accuracy of Book and Publication on different sources

Figure 4. Result of the comparison of two measures

Second, we also compare our extraction results with the second approach [19]. They used precision and recall as the metrics to evaluate their approach. We selected the Nokia products domain used in [19].
In their system, the Size, Display, Ringtones, Memory, Data, Features and Battery attributes were extracted as the standard classification. In addition to the same attributes, we extracted the other informative attributes from the following Web sites and
According to the method of [19], the overall accuracy of WICE on Nokia products is as shown in Figure 5.



Figure 5. Overall accuracy of Nokia Product on different sources

Table 5 shows precision, recall and average values for Nokia products on both systems. We observe that overall, WICE outperforms the previous approach.

Table 5. Comparison results with Previous Approach
Columns: Precision (%) | Recall (%) | F-measure
Rows: WICE; Previous approach

We also compare the results of the two measures on Nokia products, as shown in Figure 6. WICE achieves a higher average extraction accuracy than the previous approach.

VI. CONCLUSION AND FUTURE WORK

With the increase of information on the Web, users have a great opportunity to benefit from the rich information in it. In general, many Web pages are generated automatically from an underlying database. Therefore, the HTML structure of such pages is fairly specific and regular (semi-structured). However, the output is intended for human consumption, not machine interpretation. An IE system for such generated pages allows the Web site to be viewed as a structured database. The desired information is embedded in the Web pages in the form of data records returned by Web databases when they respond to users' queries. Thus, it is often necessary to extract the data embedded in the pages into a relational or other structured format for further processing. In the future, we plan to extend the WICE system to open domains, namely News, Blogs and Forums.

VII. REFERENCES

[1] P. S. Hiremath and Siddu P. Algur, "Extraction of data from web pages: a vision based approach," International Journal of Computer and Information Science and Engineering, Vol. 3, pp. 50-59,
[2] Y. Yang and H. Zhang, "HTML Page Analysis Based on Visual Cues," in 6th International Conference on Document Analysis and Recognition, Seattle, Washington, USA,
[3] Y. Zhai and B. Liu, "Automatic wrapper generation using tree matching and partial tree alignment," Proceedings of the National Conference on Artificial Intelligence, Vol. 21, No. 2.
Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999,
[4] Bing Liu and Kevin Chen-Chuan Chang, "Editorial: Special issue on Web content mining," WWW 02,

Figure 6. Result of the comparison of two measures

[5] N. K. Tran, K. C. Pham and Q. T. Ha, "XPath-Wrapper induction for data extraction," 2010

International Conference on Asian Language Processing, IEEE.
[6] T. Kaczmarek, D. Zyskowski, A. Walczak and W. Abramowicz, "Information Extraction from Web Pages for the Needs of Expert Finding," ISBN , ISSN X,
[7] V. Crescenzi, G. Mecca and P. Merialdo, "RoadRunner: Towards Automatic Data Extraction from Large Web Sites," VLDB Conference,
[8] P. Gulhane, "Web-Scale Information Extraction with Vertex," Proc. 27th International Conference on Data Engineering, IEEE, 2011, pages
[9] W. Liu and X. Meng, "ViDE: A Vision-Based Approach for Deep Web Data Extraction," IEEE Transactions on Knowledge and Data Engineering,
[10] R. Baumgartner, S. Flesca and G. Gottlob, "Visual Web Information Extraction with Lixto," Proc. 27th International Conference on Very Large Data Bases: ,
[11] B. Liu, R. L. Grossman and Y. Zhai, "Mining Data Records in Web Pages," Proc. Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp ,
[12] Y. Zhai and B. Liu, "Web Data Extraction Based on Partial Tree Alignment," WWW
[13] C.-H. Chang, M. Kayed, M. R. Girgis and K. F. Shaalan, "A Survey of Web Information Extraction Systems," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 10, pp , Oct
[14] A. Laender, B. Ribeiro-Neto, A. da Silva and J. Teixeira, "A Brief Survey of Web Data Extraction Tools," SIGMOD Record, vol. 31, no. 2, pp , 2002.
[15] Leipzig, "A comparison of HTML-aware tools for Web Data extraction,"
[16] C.-H. Chang and S.-C. Lui, "IEPAD: Information extraction based on pattern discovery," Proceedings of the Tenth International Conference on World Wide Web (WWW), Hong-Kong, pp ,
[17] Myllymaki and J. Jackson, "Robust web data extraction with XML path expressions," Technical report, IBM Research Report RJ 10245, May
[18] N. Derouiche, B. Cautis and T. Abdessalem, "Automatic Extraction of Structured Web Data with Domain Knowledge," Data Engineering (ICDE), 2012 IEEE 28th International Conference on, pp. 726-737, 1-5 April
[19] M. Shaker, H. Ibrahim, A. Mustapha and L. N. Abdullah, "Information Extraction from Hypertext Mark-Up Language Web Pages," Journal of Computer Science, 5(8),
[20] [21] [22] [23] [24]

Table 1. Evaluation results for each domain
Columns: Domains | Sites | Number of target items extracted (name, price 1, price 2, author/publisher, model ID, date, description) | No. of records | TP | FN | FP
Row labels (in extracted order): A, Buy, Home & Kitchen, Movies & Television, Art & Craft, Best, EB, ET, S, PW, jr, A, Best, Buy, Jr, EB, S, PW, A, ET, EB, PW, Buy, Best, S, jr

Table 3. Evaluation results for Book and Publication domains
Columns: Domain | Sites | No. of Pages | Optional | Attributes (Ac, Ap, Ai) | Objects (No, Oc, Op, Oi)
Domain labels: Books; Publication
Rows (in extracted order):
barnesandnoble | 50 | Yes | 6/
bookdepository | 50 | Yes | 5/
powells | list | Yes | 7/
detail | No | 8/11 0 3/
list | Yes | 5/6 1/
detail | No | 15/
walmart | list | Yes | 5/
detail | No | 10/
Booksamillion
citeseer | 50 | No | 5/
acm | 50 | No | 5/
googlescholar | 50 | No | 4/
