Reverse method for labeling the information from semi-structured web pages

Z. Akbar and L.T. Handoko
Group for Theoretical and Computational Physics, Research Center for Physics, Indonesian Institute of Sciences (LIPI)
Kompleks Puspiptek Serpong, Tangerang 15310, Indonesia
zaenal@teori.fisika.lipi.go.id, handoko@teori.fisika.lipi.go.id

arXiv:0906.0080v2 [cs.IR] 6 Aug 2009

Abstract: We propose a new technique to infer the structure and extract the tokens of data from semi-structured web sources which are generated using a consistent template or layout with some implicit regularities. The attributes are extracted and labeled in reverse, starting from the region of interest of the targeted content. This is in contrast with the existing techniques, which always generate the trees from the root. We argue and show that our technique is simpler, more accurate and more effective, especially at detecting changes in the templates of targeted web pages.

Index Terms: data extraction; data mining; web-based information system

I. INTRODUCTION

During the last decade, most websites have been providing information generated from structured data in an underlying database through certain predefined templates or layouts. Given the great number of web pages available on the net, these semi-structured web sources contain rich and practically unlimited valuable data for a variety of purposes. Extracting those data and rebuilding them into a structured database is the challenge of realizing automatic data mining from web sources.

Several methods for these purposes have been proposed in the literature. Some of them can be classified as so-called wrappers, for instance [1], [2], [3], [4], [5], which have been briefly surveyed in [10]. The wrapper technique allows automatic data extraction through a predefined wrapper created for each target data source. The wrapper then accepts a query against the data source and returns a set of structured results to the calling application. On the other hand, there are several automatic methods without a manual initial learning process. For example, some methods generate a template automatically from a few initial pages before extracting the rest of the data based on that template [6], [7], [8]. A more comprehensive method that does not require multiple pages has also been proposed, using a page creation model which captures the main characteristics of semi-structured web pages to derive a set of properties [9]. However, that method is intended mainly to extract lists of data records from a single web page with sibling subtrees.

Recently we have worked on extracting semi-structured data from targeted web pages with specific topics, the so-called focused web-harvesting [11]. The method is suitable in particular for indirect data integration, which does not tolerate any error. The architecture is inspired by, and combines, focused web-crawling and regular web-harvesting. Focused web-crawling does not indiscriminately crawl web pages like general purpose search engines, but attempts to download pages that are similar to each other [12]. To meet the data integration purpose and its requirement of high accuracy, the focused web-harvesting technique adopts human intervention in the initial setup: the administrator provides the targeted URL of the list of data, for instance a publication list, and defines the template of the final page that contains the relevant information by labeling the attributes. In the case of the detailed publication page shown in Fig. 1, the relevant information to be labeled ranges from the title, authors and abstract to the full text if available.
The method has been applied to the Indonesian Scientific Index (ISI), which integrates scientific data from scientific and academic institutions across Indonesia through their websites [13]. The system is also open to the public under the GNU License at SourceForge.net [14]. However, in spite of its high accuracy and its freedom from any machine-learning-like component, the method suffers from tedious labor at the initial setup to determine the tokens and label the attributes, and it lacks the ability to effectively detect later changes of the targeted page templates. This is because the labels and tokens are represented as DOM trees, which are sensitive to later changes of the targeted web templates [15], [16].

In this paper, in order to overcome the above-mentioned problems, we present a new method that extracts the tokens in reverse, starting from the region-of-interest (RoI) of the final web pages, and then labels each attribute following the usual procedure. This is in contrast with the existing mechanisms, which always start from the root of the web source and also from the root of the RoI. We argue that this is more efficient and accurate for extracting the tokens, and that it detects later template changes without any ambiguity. We should also remark that the reverse method is applicable to any existing data extraction method, especially those which require an initial setup by human intervention to define and label the attributes. This is actually similar to the previous method of [17], but instead of using the PAT tree [18] we use a DOM-tree-like mechanism [16]. Moreover, we improve its accuracy by taking into account the lower part of the tree and not only the trees grown from the root.

The paper is organized as follows. After this brief introduction, we describe our approach in Sec. II. In the subsequent section, Sec. III, we present the implementation and the web interface used to define the initial setup for each targeted web source. Finally, we summarize the paper and discuss some open issues and further development.

Fig. 1. The example of the final RoI page of a publication on the web and its RoI's HTML source.

II. THE REVERSE MECHANISM

No matter which method is used to extract and label the tokens from a web template or layout, a correct initial setup is crucial for further data extraction. As mentioned before, this point plays an important role in indirect data integration, which has no tolerance for errors. This makes some methods based on machine learning unsuitable. Note that from here on we do not deal with the algorithms for mining the labeled data, since the tasks after labeling can be carried out with any existing method, nor with finding the relevant list-of-data pages, which has been discussed in our previous work [11] and in many earlier works elsewhere.

The reverse mechanism can be outlined as follows:
1) Determine the URL of the final web page with the desired information, as in Fig. 1.
2) Provide the whole sentences of the RoI by copying and pasting the displayed desired text.
3) Provide the whole sentence(s) of each sub-RoI and assign an attribute to each of them.
4) Crawl the source.
5) Parse the source and clean the text-format HTML tags such as <b>, <i>, etc.
6) Take the upper part of the source, from the top down to the last tag before the first sentence of the RoI. Parse it and clean all text inside except the layout-format HTML tags, such as <tr>, etc. Do the same for the lower part, that is, from the end of the last sentence of the RoI down to the bottom.
7) Calculate the number of open tags (n_ot) and closed tags (n_ct), starting from the deepest part in terms of the desired content, that is, from the tags nearest to the RoI.

We should stress here that there is no need for the administrators to provide the web page sources at all. An open tag here means a tag which has no pair within the upper or lower part, while a closed tag denotes a tag whose pair lies within the same part. Of course, our interest is only in the open tags, which should describe the whole structure of the web template.

Following the above procedure, we obtain a kind of DOM tree as shown in Fig. 2. We can further calculate the number of trees according to the number of open tags. Note that the counting is done horizontally, from left to right, as shown by the arrow in the figure. The number of trees in the upper and lower parts is determined by

    Σ = n_ot - n_ct .   (1)

Considering all possibilities for the numbers of trees in the upper and lower parts, we can generally categorize the web structures through the discrepancy between both numbers as

    Σ_upper - Σ_lower  = 0 : fully symmetric
                       < 0 : lower asymmetry    (2)
                       > 0 : upper asymmetry

Fig. 2. The tree for the example of the RoI given in Fig. 1. The dashed box denotes the RoI.

Fig. 2 provides an example of such a tree for the case of Fig. 1, which happens to be asymmetric: the numbers of trees in the upper and lower parts are not the same, Σ_upper ≠ Σ_lower.
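The counting of steps 5)-7) and the classification of Eqs. (1)-(2) are simple enough to prototype in a few lines. The sketch below is only our illustration of the idea, not code from the ISI system; the set of text-format tags, the use of the first and last 40 characters of the pasted RoI text to locate the region, and all function names are assumptions made for the example.

# Illustrative sketch only (not the ISI implementation): split a page around the
# pasted RoI text, keep only layout-format tags, count open/closed tags per part,
# and classify the template symmetry as in Eqs. (1)-(2).
import re

TEXT_FORMAT_TAGS = {"b", "i", "u", "em", "strong", "font"}   # assumed set of text-format tags
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")

def layout_tags(part):
    """Tags of one part in document order, with text-format tags removed (steps 5-6)."""
    return [(m.group(2).lower(), m.group(1) == "/")
            for m in TAG_RE.finditer(part)
            if m.group(2).lower() not in TEXT_FORMAT_TAGS]

def count_open_closed(tags):
    """n_ot = tags without a pair inside the part, n_ct = paired tags (step 7)."""
    stack, n_ct, unmatched_close = [], 0, 0
    for name, is_close in tags:
        if not is_close:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
            n_ct += 1
        else:
            unmatched_close += 1
    return len(stack) + unmatched_close, n_ct

def sigmas(html, roi_text):
    """Return (Sigma_upper, Sigma_lower) with Sigma = n_ot - n_ct, Eq. (1).
    Assumes the pasted RoI text appears verbatim in the page source."""
    head, tail = roi_text[:40], roi_text[-40:]
    start, end = html.find(head), html.rfind(tail) + len(tail)
    result = []
    for part in (html[:start], html[end:]):
        n_ot, n_ct = count_open_closed(layout_tags(part))
        result.append(n_ot - n_ct)
    return tuple(result)

def symmetry(s_upper, s_lower):
    """Categorize the template as in Eq. (2)."""
    d = s_upper - s_lower
    return "fully symmetric" if d == 0 else ("lower asymmetry" if d < 0 else "upper asymmetry")

For the page of Fig. 1, symmetry(*sigmas(page_source, pasted_roi_text)) would report the upper or lower asymmetry discussed above.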

Again, one of the existing methods can be used to calculate the number of trees, such as the PAT tree algorithm [18] and so forth. From the discussion above, it is clear that the present method has several advantages. We can handle independently the structure and the rules used to obtain the RoI and the structure inside it. We can also detect template changes and their relevance to the desired RoI, since we can compare the pairing tokens between the upper and lower parts. We discuss these points in more detail through the real implementation at ISI in the subsequent section.

III. THE IMPLEMENTATION

Our approach to web page information extraction has been experimentally implemented in the ISI system. Following the wrapper induction programs, which are usually supplemented by a user-friendly GUI, we have also developed a web-based interface to perform the initial setup. The example used at ISI is given in Fig. 3, which shows the web interface for defining the content of the RoI from a chosen final page and the labels for each attribute.

ISI can now efficiently detect changes of the targeted web templates by comparing the old and new numbers of trees. This is done by executing a check procedure each time before a new crawl of the same targeted web pages. Since the parameters shown in Fig. 3 are stored in the system, they can be used not only for the initial setup but also for rechecking the templates on a regular basis. The discrepancies between the old and new numbers of trees are useful for easily detecting template changes over time, and at the same time for determining where the changes happened. The possible outcomes can be summarized as follows:
1) No change at all:
   Σ_upper^new - Σ_lower^new = Σ_upper^old - Σ_lower^old ; Σ_upper^new = Σ_upper^old ; Σ_lower^new = Σ_lower^old
2) Simultaneous changes of the same size in both the upper and lower trees:
   Σ_upper^new - Σ_lower^new = Σ_upper^old - Σ_lower^old ; Σ_upper^new ≠ Σ_upper^old ; Σ_lower^new ≠ Σ_lower^old
3) Only one tree has changed:
   Σ_upper^new ≠ Σ_upper^old ; Σ_lower^new = Σ_lower^old, or Σ_upper^new = Σ_upper^old ; Σ_lower^new ≠ Σ_lower^old
4) Both trees have changed differently:
   Σ_upper^new ≠ Σ_upper^old ; Σ_lower^new ≠ Σ_lower^old
Clearly, in case 1 there is no need to alter the saved initial parameters. In contrast, cases 2, 3 and 4 indicate that the template has been changed, either in the upper tree, the lower tree, or both. The important point is that, once the template changes have been detected, the system automatically replaces the old version of the template with the new one.
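As an illustration of how such a check might look, here is a small sketch; it is our own example and not code from ISI, and storing one (Σ_upper, Σ_lower) pair per targeted page is an assumption.

# Illustrative sketch only: decide which of the four cases of Sec. III applies
# when a stored (old) pair of tree counts is compared with a freshly computed one.
def template_change(old, new):
    """old and new are (sigma_upper, sigma_lower) pairs for one targeted page."""
    (ou, ol), (nu, nl) = old, new
    if (nu, nl) == (ou, ol):
        return 1, "no change; keep the saved parameters"
    if nu - nl == ou - ol:
        return 2, "both trees changed by the same amount; relearn the template"
    if nu == ou or nl == ol:
        return 3, "only one tree changed; relearn the template"
    return 4, "both trees changed differently; relearn the template"

# Example: an advertisement block newly inserted in the upper part only.
print(template_change(old=(3, 2), new=(4, 2)))   # -> (3, 'only one tree changed; ...')

In cases 2-4 the system would then rerun the reverse extraction to replace the stored template, as described above.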

Fig. 3. The web interface to define the RoI and to label the attributes.

Furthermore, the reverse method can be used to rebuild the content of the RoI in terms of the labels, which enables us to restructure the database for further purposes. The procedure is exactly the same as for extracting the RoI. Table I shows the delimiters extracted from the RoI of Fig. 1 using the same interface as in Fig. 3.

TABLE I
THE DELIMITERS EXTRACTED FROM THE EXAMPLE IN FIG. 1 USING THE REVERSE MECHANISM DEFINED IN FIG. 3.

Start   Attributes     End
        Title          <br>
<p>     Author
<p>     Abstract
<p>     Publication    <br>
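To show how delimiters like those of Table I can drive the rebuilding of the RoI into labeled attributes, here is a toy sketch. It is our illustration only: the exact delimiter semantics of Table I, the sample RoI snippet and all names are assumptions, not the actual ISI template or code.

# Toy sketch only: cut the RoI into labeled attributes using start/end delimiters
# in the spirit of Table I. Delimiters, sample RoI and names are assumed values.
import re

# (attribute, start delimiter, end delimiter); None means the attribute is bounded
# by the RoI itself or by the next attribute's start delimiter.
DELIMITERS = [
    ("Title",       None,  "<br>"),
    ("Author",      "<p>", None),
    ("Abstract",    "<p>", None),
    ("Publication", "<p>", "<br>"),
]

def label_roi(roi_html):
    """Walk the RoI once, extracting each attribute between its delimiters."""
    labeled, pos = {}, 0
    for i, (name, start, end) in enumerate(DELIMITERS):
        if start:
            pos = roi_html.index(start, pos) + len(start)
        if end:
            stop = roi_html.index(end, pos)
        else:
            nxt = DELIMITERS[i + 1][1] if i + 1 < len(DELIMITERS) else None
            stop = roi_html.index(nxt, pos) if nxt else len(roi_html)
        labeled[name] = re.sub(r"<[^>]+>", "", roi_html[pos:stop]).strip()
        pos = stop
    return labeled

roi = "A reverse method<br><p>Z. Akbar, L.T. Handoko<p>We propose ...<p>Some Proceedings (2008) 1-10<br>"
print(label_roi(roi))
# -> {'Title': 'A reverse method', 'Author': 'Z. Akbar, L.T. Handoko',
#     'Abstract': 'We propose ...', 'Publication': 'Some Proceedings (2008) 1-10'}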

IV. SUMMARY

We have discussed a simple method based on a reverse algorithm and the DOM tree to extract the RoI and label the relevant attributes in the initial setup. The resulting patterns can then be used to automatically extract the data from the crawled targeted web pages. We argue that the method and its web interface reduce the administrator's work significantly, while improving the accuracy and speed of finding the tokens and labeling the attributes. We have found that the method is very effective at detecting template changes, for instance a newly inserted advertorial in the middle of the upper or lower tree, which occurs often on many websites and causes difficulties for existing methods.

The experimental work of applying the method to the huge amount of data already stored at ISI is still in progress. The results and the method's effectiveness in detecting altered web pages will be analysed and published in a more complete and detailed paper elsewhere. However, according to our trial experiments using 10,000 records from several web sources, the method performs very well: it succeeded in detecting all template changes and improved the speed of the whole process by up to 20%. Finally, we should remark that the method is in principle applicable to web sources in the form of lists of data. Work on this matter is also in progress.

ACKNOWLEDGMENT

This work is partially supported by the Riset Kompetitif LIPI in fiscal year 2009.

REFERENCES

[1] R. Baumgartner, S. Flesca, G. Gottlob, Visual Web Information Extraction with Lixto, Proceedings of the 27th International Conference on Very Large Data Bases, September 11-14, 2001, pp. 119-128.
[2] I. Muslea, S. Minton, C.A. Knoblock, Hierarchical Wrapper Induction for Semistructured Information Sources, Autonomous Agents and Multi-Agent Systems 4 (2001) 93-114.
[3] A. Pan, J. Raposo, M. Álvarez, J. Hidalgo, A. Viña, Semi-Automatic Wrapper Generation for Commercial Web Sources, Proceedings of the IFIP TC8/WG8.1 Working Conference on Engineering Information Systems in the Internet Context, September 25-27, 2002, pp. 265-283.
[4] J. Raposo, A. Pan, M. Álvarez, J. Hidalgo, Automatically maintaining wrappers for semi-structured web sources, Data and Knowledge Engineering 61 (2007) 331-358.
[5] Y. Zhai, B. Liu, Extracting web data using instance-based learning, Proceedings of the Web Information Systems Engineering Conference (WISE), 2005, pp. 318-331.
[6] L. Arlotta, V. Crescenzi, G. Mecca, P. Merialdo, Automatic annotation of data extracted from large websites, Proceedings of the WebDB Workshop, 2003, pp. 7-12.
[7] N. Kushmerick, Regression testing for wrapper maintenance, Proceedings of the Sixteenth National Conference on Artificial Intelligence and the Eleventh Innovative Applications of Artificial Intelligence Conference, July 18-22, 1999, pp. 74-79.
[8] J. Wang, F.H. Lochovsky, Data extraction and label assignment for web databases, Proceedings of the 12th International Conference on World Wide Web, May 20-24, 2003, pp. 187-196.
[9] M. Álvarez, A. Pan, J. Raposo, F. Bellas, F. Cacheda, Extracting lists of data records from semi-structured web pages, Data and Knowledge Engineering 64 (2007) 491-509.
[10] A.H.F. Laender, B.A. Ribeiro-Neto, A. Soares da Silva, J.S. Teixeira, A brief survey of web data extraction tools, ACM SIGMOD Record 31 (2002) 84-93.
[11] Z. Akbar, L.T. Handoko, A Simple Mechanism for Focused Web-harvesting, Proceedings of the International Conference on Advanced Computational Intelligence and Its Applications, May 13-19, 2008.
[12] S. Chakrabarti, M. van den Berg, B. Dom, Focused crawling: A new approach to topic-specific web resource discovery, Computer Networks 31 (1999) 1623-1640.
[13] Z. Akbar et al., Indonesian Scientific Index - ISI, http://www.isi.lipi.go.id.
[14] Z. Akbar et al., openisi, http://sourceforge.net/projects/openisi/.
[15] J. Wang, F. Lochovsky, Data extraction and label assignment for web databases, Proceedings of the 12th International World Wide Web Conference, 2003.
[16] The W3 Consortium, The Document Object Model, http://www.w3.org/dom/.
[17] C.-H. Chang, C.-N. Hsu, S.-C. Lui, Automatic information extraction from semi-structured web pages by pattern discovery, Decision Support Systems 35 (2003) 129-147.
[18] G.H. Gonnet, R.A. Baeza-Yates, T. Snider, New indices for text: PAT trees and PAT arrays, in Information Retrieval: Data Structures and Algorithms, Prentice Hall, 1992.