DataRover: A Taxonomy Based Crawler for Automated Data Extraction from Data-Intensive Websites

Similar documents
EXTRACTION AND ALIGNMENT OF DATA FROM WEB PAGES

MetaNews: An Information Agent for Gathering News Articles On the Web

Information Discovery, Extraction and Integration for the Hidden Web

A Hybrid Unsupervised Web Data Extraction using Trinity and NLP

Keywords Data alignment, Data annotation, Web database, Search Result Record

Automatic Generation of Wrapper for Data Extraction from the Web

A survey: Web mining via Tag and Value

WEB DATA EXTRACTION METHOD BASED ON FEATURED TERNARY TREE

Template Extraction from Heterogeneous Web Pages

Interactive Learning of HTML Wrappers Using Attribute Classification

Mining Structured Objects (Data Records) Based on Maximum Region Detection by Text Content Comparison From Website

EXTRACTION INFORMATION ADAPTIVE WEB. The Amorphic system works to extract Web information for use in business intelligence applications.

Reverse method for labeling the information from semi-structured web pages

A Survey on Unsupervised Extraction of Product Information from Semi-Structured Sources

Handling Irregularities in ROADRUNNER

An Efficient Technique for Tag Extraction and Content Retrieval from Web Pages

Manual Wrapper Generation. Automatic Wrapper Generation. Grammar Induction Approach. Overview. Limitations. Website Structure-based Approach

Web Data Extraction Using Tree Structure Algorithms A Comparison

Improving Web Data Annotations with Spreading Activation

Extraction of Flat and Nested Data Records from Web Pages

Te Whare Wananga o te Upoko o te Ika a Maui. Computer Science

ISSN (Online) ISSN (Print)

ISSN: (Online) Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

Heading-Based Sectional Hierarchy Identification for HTML Documents

Using Clustering and Edit Distance Techniques for Automatic Web Data Extraction

Exploring Information Extraction Resilience

Web Data Extraction. Craig Knoblock University of Southern California. This presentation is based on slides prepared by Ion Muslea and Kristina Lerman

Using Data-Extraction Ontologies to Foster Automating Semantic Annotation

Title: Artificial Intelligence: an illustration of one approach.

RoadRunner for Heterogeneous Web Pages Using Extended MinHash

Finding and Extracting Data Records from Web Pages *

A Hierarchical Web Page Crawler for Crawling the Internet Faster

Hidden Web Data Extraction Using Dynamic Rule Generation

MIWeb: Mediator-based Integration of Web Sources

International Journal of Research in Computer and Communication Technology, Vol 3, Issue 11, November

Automatic Extraction of Informative Blocks from Webpages

analyzing the HTML source code of Web pages. However, HTML itself is still evolving (from version 2.0 to the current version 4.01, and version 5.

Object Extraction. Output Tagging. A Generated Wrapper

E-MINE: A WEB MINING APPROACH

Multi-Modal Data Fusion: A Description

Comparison of Online Record Linkage Techniques

EXTRACTION OF TEMPLATE FROM DIFFERENT WEB PAGES

Extracting Product Data from E-Shops

Extraction of Automatic Search Result Records Using Content Density Algorithm Based on Node Similarity

AAAI 2018 Tutorial Building Knowledge Graphs. Craig Knoblock University of Southern California

Mining Data Records in Web Pages

Deep Web Content Mining

WEBTracker: A Web Crawler for Maximizing Bandwidth Utilization

Decomposition-based Optimization of Reload Strategies in the World Wide Web

Web Data Extraction and Alignment Tools: A Survey Pranali Nikam 1 Yogita Gote 2 Vidhya Ghogare 3 Jyothi Rapalli 4

Information Extraction from Tree Documents by Learning Subtree Delimiters

RecipeCrawler: Collecting Recipe Data from WWW Incrementally

Deep Web Data Extraction by Using Vision-Based Item and Data Extraction Algorithms

Anatomy of a search engine. Design criteria of a search engine Architecture Data structures

The Wargo System: Semi-Automatic Wrapper Generation in Presence of Complex Data Access Modes

PS Web IE. End hand-ins Wednesday, April 27, 2005

Accuracy Avg Error % Per Document = 9.2%

ISSN: (Online) Volume 2, Issue 3, March 2014 International Journal of Advance Research in Computer Science and Management Studies

Similarity-based web clip matching

CONCEPT FORMATION AND DECISION TREE INDUCTION USING THE GENETIC PROGRAMMING PARADIGM

Omnibase: Uniform Access to Heterogeneous Data for Question Answering

A Review on Identifying the Main Content From Web Pages

SEMI-AUTOMATIC WRAPPER GENERATION AND ADAPTION Living with heterogeneity in a market environment

This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore.

Web Crawling and Basic Text Analysis. Hongning Wang

Gestão e Tratamento da Informação (Information Management and Processing)

AN APPROACH FOR EXTRACTING INFORMATION FROM NARRATIVE WEB INFORMATION SOURCES

Web Information Modeling, Extraction and Presentation

Much information on the Web is presented in regularly structured objects. A list

Web site Image database. Web site Video database. Web server. Meta-server Meta-search Agent. Meta-DB. Video query. Text query. Web client.

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

F(Vi)DE: A Fusion Approach for Deep Web Data Extraction

Lessons Learned and Research Agenda for Big Data Integration of Product Specifications (Discussion Paper)

Information Extraction from Web Pages Using Automatic Pattern Discovery Method Based on Tree Matching

Automatic Wrapper Adaptation by Tree Edit Distance Matching

Data Extraction from Web Data Sources

A SMART WAY FOR CRAWLING INFORMATIVE WEB CONTENT BLOCKS USING DOM TREE METHOD

Wrapper Induction for End-User Semantic Content Development

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Annotating Multiple Web Databases Using Svm

Recognising Informative Web Page Blocks Using Visual Segmentation for Efficient Information Extraction

Architecture of A Scalable Dynamic Parallel WebCrawler with High Speed Downloadable Capability for a Web Search Engine

Reconfigurable Web Wrapper Agents for Web Information Integration

Focused crawling: a new approach to topic-specific Web resource discovery. Authors

Looking at both the Present and the Past to Efficiently Update Replicas of Web Content

Effective Metadata Extraction from Irregularly Structured Web Content

Sentiment Analysis for Customer Review Sites

CS 8803 AIAD Prof Ling Liu. Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai

Semistructured Data Store Mapping with XML and Its Reconstruction

Incremental Knowledge Acquisition Approach for Information Extraction on both Semi-structured and Unstructured Text from the Open Domain Web

A Probabilistic Approach for Adapting Information Extraction Wrappers and Discovering New Attributes

Deepec: An Approach For Deep Web Content Extraction And Cataloguing

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

Similarity Matrix Based Session Clustering by Sequence Alignment Using Dynamic Programming

Implementation of Enhanced Web Crawler for Deep-Web Interfaces

Ontology Extraction from Tables on the Web

VIPS: a Vision-based Page Segmentation Algorithm

News Page Discovery Policy for Instant Crawlers

Poster Session: An Indexing Structure for Automatic Schema Matching

DataRover: A Taxonomy Based Crawler for Automated Data Extraction from Data-Intensive Websites

H. Davulcu, S. Koduri, S. Nagarajan
Department of Computer Science and Engineering
Arizona State University, Tempe, AZ, 85287, USA
{hdavulcu, koduri, nrsaravana}@asu.edu

ABSTRACT
The advent of e-commerce has created a trend that brought thousands of catalogs online. Most of these websites are taxonomy-directed. A Web site is said to be "taxonomy-directed" if it contains at least one taxonomy for organizing its contents and it presents the instances belonging to a category in a regular fashion. This paper describes the DataRover system, which can automatically crawl and extract products from taxonomy-directed online catalogs. DataRover utilizes heuristic rules to discover the structural regularities among taxonomy segments, list-of-products pages, and single-product pages, and it uses these regularities to turn online catalogs into a database of categorized products without user interaction or the wrapper maintenance burden. We provide experimental results to demonstrate the efficacy of DataRover and point to its current limitations.

Categories and Subject Descriptors
H.m [Information Systems]: Miscellaneous.

General Terms
Algorithms, Design.

Keywords
Web Data Extraction, Web Annotation, Web Data Integration

1. INTRODUCTION
The advent of e-commerce has created a trend that brought thousands of catalogs online. Most of these websites are taxonomy-directed. A Web site is said to be "taxonomy-directed" if it contains at least one taxonomy for organizing its products and it presents the instances belonging to a category in a regular fashion. Notice that neither the presentation of the taxonomy among different pages, nor the presentation of instances for different concepts, needs to be regular for a Web site to be classified as taxonomy-directed. A taxonomy-directed web site presents its content as a sequence of taxonomy pages that eventually leads to a list-of-products page, which in turn leads to single-product pages. For example, at http://store.yahoo.com/shoedini all the top-level categories are listed on the homepage (see Figure 1), and each of these top-level URLs leads to a list-of-products page similar to the one in Figure 6. Figure 8 shows samples of single-product pages reachable from list-of-products pages. This kind of structure is characteristic of most online catalogs.

Figure 1: Home page of [18] with taxonomies

DataRover is a specialized domain-specific crawler that uses domain-specific heuristics and the regularities within taxonomy-directed online catalogs to turn these web sites into product databases. Such product databases enable the construction of domain-specific search engines, such as Froogle.com [17]. DataRover uses a Page Segmentation Algorithm that takes the DOM tree of a web page as input and finds the logical segments in it. Intuitively, this algorithm groups contiguous similar structures in the Web page into segments by detecting a high concentration of neighboring repeated nodes with similar root-to-leaf tag-paths. Next, it applies taxonomy tests to identify segments corresponding to taxonomies and crawls the taxonomies recursively until no more sub-taxonomies are detected. It then applies a list-of-products test to the page segments to see if any qualifies as a list of products. If a segment qualifies as a list-of-products segment, DataRover fetches two distinct single-product pages, then segments and aligns them to learn the root-to-leaf HTML tag-paths to the product data. Once it has learned these paths, it applies them to all the pages reachable from the list-of-products page to extract and collect all product data, organized by root-to-leaf tag-paths.
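This overall control flow can be sketched as follows. It is a minimal reconstruction, not the authors' code: the helper functions (fetch, segment_page, is_taxonomy, is_list_of_products, learn_paths, extract_by_paths) and the segment object with a .links attribute are hypothetical stand-ins for the components described in Sections 3 and 4.

from urllib.parse import urljoin

def crawl_catalog(url, fetch, segment_page, is_taxonomy, is_list_of_products,
                  learn_paths, extract_by_paths, seen=None):
    """Crawl taxonomy pages recursively; on a list-of-products segment, learn
    tag-paths from two sample product pages and apply them to the rest."""
    seen = set() if seen is None else seen
    if url in seen:
        return []
    seen.add(url)
    records = []
    for segment in segment_page(fetch(url)):                 # Section 3
        links = [urljoin(url, u) for u in segment.links]
        if is_taxonomy(segment, seen):                       # Section 4.1
            for link in links:                               # recurse into sub-taxonomies
                records += crawl_catalog(link, fetch, segment_page, is_taxonomy,
                                         is_list_of_products, learn_paths,
                                         extract_by_paths, seen)
        elif is_list_of_products(segment) and len(links) >= 2:   # Section 4.2
            sample_a, sample_b = (fetch(u) for u in links[:2])   # two distinct samples
            paths = learn_paths(sample_a, sample_b)              # Section 4.3
            records += [extract_by_paths(fetch(u), paths) for u in links]
    return records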

The early method for information extraction from web sites was to use wrappers. A wrapper is a program script created using the location of the data in the HTML page. But wrappers are site-specific, need user interaction to create, and must be maintained whenever a site changes. A second kind of approach is to learn and use a template: a template is generated from a set of input pages and then used to do the actual extraction. A key characteristic of DataRover is that, unlike the early pioneering systems described in [9, 10, 12], it does not make any assumptions about the roles and usage patterns of the HTML tags within Web pages.

The organization of the rest of the paper is as follows. Section 2 outlines the architecture of DataRover. Section 3 describes the Flat Partitioning algorithm. Section 4 presents the data extraction algorithms. Section 5 presents ideas for data labeling. Section 6 presents experimental results. Section 7 presents related work and concludes the paper.

2. ARCHITECTURE OF DATAROVER
Figure 2 depicts the components of the DataRover system; the following sections describe each in turn.

Figure 2: Architecture of the DataRover System

3. FLAT PARTITIONING
The Page Segmentation component creates segments of a given web page, i.e., it detects the various logical sections of the page. Consider the product page at http://store.yahoo.com/shoedini/007850.html. Its logical sections are marked with boxes in Figure 3. The function of the Page Segmentation component can be understood by considering the parse-tree view of the page: the component finds the boundaries indicated by the dotted lines in Figure 3.

The Page Segmentation Algorithm takes the DOM tree of the web page as input and finds the segments in it. Intuitively, it groups contiguous similar structures in the page into segments by detecting a high concentration of neighboring repeated nodes with similar root-to-leaf tag-paths. First, the segment boundary is initialized to the first leaf node in the DOM tree. Two leaf nodes are said to have a similarity link if they share the same path from the root of the tree and the leaf nodes in between have different paths. The algorithm then counts the number of similarity links that cross the candidate boundary and computes the ratio of that count to the total number of similarity links inside the current segment. If this ratio is less than a threshold δ, the previous node is marked as the segment boundary. Otherwise, the current node is added to the current segment and the next node is considered as the candidate boundary. The process stops when the last element in the list is reached.

A Path Index is built from the DOM tree, from which the similarity links between the leaf nodes can be determined. The Path Index Tree (PIT) is a trie data structure that contains all unique tag-paths in the tree. In its leaf nodes, the PIT also stores the list of DOM leaf nodes that share the same root-to-leaf tag-path.

Figure 3: Snapshot of a web page and its logical segments

Figure 4 illustrates the Page Segmentation Algorithm, presented in Figure 5, on an HTML DOM tree. The arrows in Figure 4 denote similarity links between leaf nodes, with segment boundaries marked B1-B6. The threshold δ is set to 60%.
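For illustration only (the paper gives no code for the PIT), the sketch below approximates the Path Index with a dictionary from root-to-leaf tag-path to the leaves sharing it, built with Python's standard html.parser; a trie over the tag-path components would store the same information with shared prefixes.

from collections import defaultdict
from html.parser import HTMLParser

class PathIndex(HTMLParser):
    """Collects root-to-leaf tag-paths for text leaves; leaves sharing a
    tag-path are exactly the candidates for similarity links (Section 3)."""
    def __init__(self):
        super().__init__()
        self.stack = []                       # open tags, root to current
        self.leaves = []                      # (leaf_id, tag_path) in document order
        self.by_path = defaultdict(list)      # tag_path -> list of leaf ids

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # pop up to and including the most recent matching open tag
            i = len(self.stack) - 1 - self.stack[::-1].index(tag)
            del self.stack[i:]

    def handle_data(self, data):
        if data.strip():                      # a non-empty text leaf
            path = "/".join(self.stack)
            leaf_id = len(self.leaves)
            self.leaves.append((leaf_id, path))
            self.by_path[path].append(leaf_id)

html = "<html><body><ul><li><a>shoes</a></li><li><a>boots</a></li></ul></body></html>"
index = PathIndex()
index.feed(html)
print(dict(index.by_path))   # {'html/body/ul/li/a': [0, 1]}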
For example, when the current node is 320, the total number of outgoing unique similarity links (out) is 1 (line 9) and the total number of unique similarity links (total) is 1 (line 10). Hence the ratio of out to total is 100%, which is greater than the threshold (the condition in line 11 fails), and the current node becomes the next leaf node (line 6). At node 341, out becomes 2 and total is also 2, so the ratio is still greater than 60%. When node 345 is reached, out becomes 1 whereas total is 2, so the ratio falls below the threshold δ (line 11). Node 345 (B1 in Figure 4) is therefore added to segment_boundaries (line 12) and all of its similarity links are removed from segment_nodes (line 13). The same condition is encountered when the algorithm reaches node 360, where out becomes 1 and total is 2; hence node 360 (B2 in Figure 4) is also added to segment_boundaries.
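The worked example above suggests the following reconstruction of the boundary test, under two assumptions the paper does not spell out: links are counted per tag-path (the "unique" links of the example), and a path's link is taken to cross position i whenever the path occurs both inside the current segment and after i. A sketch of the idea, not the Figure 5 code:

def flat_partition(paths, delta=0.6):
    """paths[i] is the root-to-leaf tag-path of leaf i, in document order.
    Returns the indices of the leaves that end a segment."""
    occ = {}
    for i, p in enumerate(paths):
        occ.setdefault(p, []).append(i)
    boundaries, start = [], 0
    for i in range(len(paths) - 1):
        # tag-paths with a (unique) similarity link touching the current segment
        inside = [p for p, pos in occ.items()
                  if len(pos) >= 2 and any(start <= a <= i for a in pos)]
        # of those, the links that still cross the candidate boundary after leaf i
        out = [p for p in inside if any(b > i for b in occ[p])]
        if inside and len(out) / len(inside) < delta:
            boundaries.append(i)        # leaf i is the last leaf of this segment
            start = i + 1
    return boundaries

# A navigation menu followed by a run of product cells:
paths = ["ul/li/a"] * 3 + ["table/tr/td"] * 4
print(flat_partition(paths))            # [2]: the menu is split from the cells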

Figure 4: Parse-tree view of the web page

4. DATA EXTRACTION
Data extraction consists of three phases:
1. Finding the sequence of taxonomies from the home page
2. Identifying list-of-products segments
3. Extracting data from single-product pages

4.1 Finding Taxonomies
The first step in the data extraction phase is finding the taxonomies, or top-level categories. A taxonomy is identified using the following algorithm. Its inputs are the DOM trees of the segments identified by flat partitioning and the set of URLs previously seen on the website during the crawl.

INPUT: <S1, ..., Sk>: set of segment DOM trees; SeenURLs: hash table of URLs found in previous pages
OUTPUT: list of taxonomy segments
BEGIN
1) Find sequences <U1, U2, ..., Uk>, where Ui is a list of consecutive links within Si.
2) Find those Ui whose sequences of URLs are similar: <U1, U2, ..., Ul>.
3) IF (|U1| + ... + |Ul|) / |Si| > d THEN Si is a taxonomy segment.
END

Figure 5: Page Segmentation Algorithm

Figure 6: An example list-of-products page

Two URLs are similar if they differ only by the values of their CGI arguments; d is an experimentally determined threshold. The algorithm measures whether sequences of similar consecutive URLs form the majority of the segment. If this test succeeds, the segment is a taxonomy segment and DataRover crawls its URLs.
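A rough reconstruction of this test in Python (the function names are mine; the threshold d=0.5 is an assumption, since the paper only says d is determined experimentally):

from urllib.parse import urlsplit, parse_qsl

def url_shape(url):
    """Everything except the CGI argument values: scheme, host, path, arg names."""
    parts = urlsplit(url)
    return (parts.scheme, parts.netloc, parts.path,
            tuple(sorted(k for k, _ in parse_qsl(parts.query))))

def is_taxonomy_segment(segment_urls, segment_size, d=0.5):
    """Do runs of similar consecutive URLs form the majority of the segment?
    segment_size stands in for |Si| in the algorithm above."""
    runs, run = [], 1
    for a, b in zip(segment_urls, segment_urls[1:]):
        if url_shape(a) == url_shape(b):       # similar: differ only in arg values
            run += 1
        else:
            runs.append(run)
            run = 1
    runs.append(run)
    covered = sum(r for r in runs if r >= 2)   # URLs inside runs of similar links
    return covered / max(segment_size, 1) > d

urls = ["http://x.com/cat?id=1", "http://x.com/cat?id=2",
        "http://x.com/cat?id=3", "http://x.com/help"]
print(is_taxonomy_segment(urls, segment_size=5))   # True: 3 of 5 nodes covered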

4.2 Identifying List-of-Products Segments

INPUT: DOM tree of a segment
OUTPUT: Product Nodes (if the page is a list-of-products page, else null)
BEGIN
1. Use a regular expression for PRICE ($ followed by digits) to mark all matching leaf nodes in the HTML tree with PRICE=true.
2. Mark the leaf nodes corresponding to HTML anchors with URL=true.
3. Propagate the PRICE and URL markings to ancestors as long as there are no conflicts. A conflict occurs when an ancestor has two children nodes such that (i) PRICE=true and (ii) URL=true with two distinct URLs.
4. Mark the children of all such conflict nodes as Product Nodes.
5. If the number of Product Nodes within a segment is greater than two, then the segment is a list of products.
END

Figure 7: DOM tree for a list-of-products page and its markings

In Figure 7, u1, ..., u6 are the URL markings and p1, ..., p6 are the PRICE markings. When we propagate these markings to the td[i] nodes in line 3, no conflicts are detected yet. Upon propagating one more level up, to the tr nodes, we detect a conflict, since the markings at tr[1] = {u1, u2, p1, p2} cover two td's, td[1] and td[2], with two distinct URLs, u1 and u2. Hence the children td[1] and td[2] are marked as Product Nodes. Similar reasoning marks the remaining td's as Product Nodes too. (A sketch of this test appears at the end of Section 4.3.)

4.3 Extracting Data from Single-Product Pages
When the list-of-products page identifier succeeds and returns the list of Product Nodes, the Extractor component retrieves two sample product pages and gives them to the Segmentation component, which segments these two pages as described in Section 3. Next, the Segment Aligner takes the resulting sequences of segments and aligns them as explained in Section 4.3.1. All the data from the aligned but dissimilar segments are taken to be Product Information. At this stage we also learn the root-to-leaf tag-paths of the Product Information nodes in the HTML tree and apply these tag-paths iteratively to the remaining product pages to collect and organize all the product information.

4.3.1 Segment Aligner
The Segment Aligner takes two sequences of segments and aligns them based on the content similarity of the segments. This procedure separates the product data from the template: segments that are aligned but dissimilar are considered product data, whereas segments that are aligned and similar are discarded as page template. To align the two segment sequences we use the minimum edit distance algorithm, and to check the similarity between two segments we use the Jaccard coefficient [11], computed as the ratio of the number of words common to the two segments to the total number of words in the two segments. For example, from Figure 8:

P1 = <S1, S2, S3, S4>, where S1, S2, S3, S4 are segments
P2 = <S1', S2', S3', S4'>, where S1', S2', S3', S4' are segments

These two sequences of segments are aligned as M X X M:

<S1  S2  S3  S4>
<S1' S2' S3' S4'>

Here S1 aligns with S1' with a content match (M), S2 aligns with S2' with a content mismatch (X), S3 aligns with S3' with a content mismatch (X), and S4 aligns with S4' with a content match (M). Hence we identify segments S2, S3 and S2', S3' as the product content regions. After finding the product content regions, we learn the root-to-leaf HTML tag-paths and apply them to the remaining single-product pages obtained from the list of products.
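Returning to the Section 4.2 test, the following sketch reconstructs the PRICE/URL conflict propagation on a toy tree of dictionaries. The node representation and helper names are mine, not the paper's; conflict nodes are taken to be the lowest nodes whose subtree contains a price and two distinct URLs, matching the tr[1] example above.

import re

PRICE = re.compile(r"\$\s*\d")   # '$' followed by digits, as in step 1 above

def marks(node):
    """Aggregate (PRICE seen?, set of distinct URLs) over a subtree; a node is
    a dict with optional 'text', 'href', and 'children' keys."""
    price = bool(PRICE.search(node.get("text", "")))
    urls = {node["href"]} if node.get("href") else set()
    for child in node.get("children", []):
        p, u = marks(child)
        price, urls = price or p, urls | u
    return price, urls

def product_nodes(node):
    """Children of the lowest conflict nodes: a conflict node's subtree holds a
    PRICE and two distinct URLs, but no single child does on its own."""
    price, urls = marks(node)
    if not (price and len(urls) >= 2):
        return []
    deeper = [pn for child in node.get("children", []) for pn in product_nodes(child)]
    return deeper if deeper else node.get("children", [])

def is_list_of_products(segment_root):
    return len(product_nodes(segment_root)) > 2    # step 5 above

row = lambda url, price: {"children": [{"href": url}, {"text": price}]}
segment = {"children": [row("/item1", "$9.99"), row("/item2", "$19.99"),
                        row("/item3", "$4.50")]}
print(is_list_of_products(segment))   # True: three Product Nodes found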

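The aligner itself can be sketched compactly: a minimum-edit-distance alignment over the two segment sequences with substitution cost 1 minus the Jaccard coefficient, keeping the aligned-but-dissimilar pairs as product data. The similarity cutoff of 0.7 is an assumption; the paper does not give one.

def jaccard(s1, s2):
    """Ratio of words common to both segments to all words in the two segments [11]."""
    w1, w2 = set(s1.split()), set(s2.split())
    return len(w1 & w2) / len(w1 | w2) if w1 | w2 else 1.0

def product_regions(p1, p2, sim=0.7):
    """Align two segment sequences by minimum edit distance; return the aligned
    pairs whose contents mismatch (X), i.e. the product data."""
    n, m = len(p1), len(p2)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]    # dp[i][j]: cost for p1[:i], p2[:j]
    for i in range(1, n + 1):
        dp[i][0] = float(i)
    for j in range(1, m + 1):
        dp[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete a segment
                           dp[i][j - 1] + 1,        # insert a segment
                           dp[i - 1][j - 1] + 1 - jaccard(p1[i - 1], p2[j - 1]))
    pairs, i, j = [], n, m                          # trace the alignment back
    while i and j:
        if dp[i][j] == dp[i - 1][j - 1] + 1 - jaccard(p1[i - 1], p2[j - 1]):
            pairs.append((p1[i - 1], p2[j - 1]))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return [(a, b) for a, b in reversed(pairs) if jaccard(a, b) < sim]

p1 = ["ShoeDini home about", "Nike Air size 10", "$79.99 in stock", "copyright 2004"]
p2 = ["ShoeDini home about", "Adidas Gazelle size 9", "$59.99 low stock", "copyright 2004"]
print(product_regions(p1, p2))   # the two middle, product-data segment pairs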
Figure 8: Two single-product pages with their partitions (S1-S4 and S1'-S4')

5. DATA LABELING
The objective of the data labeler is to annotate the columns obtained from the data extraction phase. Once we have the data, we manually label the attributes of a few tables, or use techniques such as [16] to extract the labels from the web pages themselves, and then train column classifiers based on the syntactic properties of the data values. The syntactic features include string length, word count, and counts of special characters such as commas, dots, and hyphens. Using these features, we build a decision tree classifier for assigning data values to columns, similar to the techniques used in [14]. This way, we can automate the labeling of product data columns from hundreds of vendors. Whenever the classifiers fail, we ask a domain expert to label the new columns on demand, thus providing more supervision for the column classifiers. In our experiments, we used the C4.5 decision tree classifier package from [15].

6. EXPERIMENTAL RESULTS
We ran DataRover on the following sites and obtained the results shown in Table 1, which reports its performance on 8 online catalogs. A cell marked YES means that DataRover successfully identified all the segments corresponding to a taxonomy, list-of-products, or single-product page and extracted the corresponding data.

Table 1: Experimental Results

Web Catalog             Taxonomy Finder   List-of-Products Finder   Single-Product Finder
www.zappos.com          YES               YES                       YES
www.onestepahead.com    YES               YES                       YES
www.homevisions.com     YES               YES                       YES
www.walgreens.com       YES               YES                       YES
www.smartbargains.com   YES               YES                       YES
www.overstock.com       YES               YES                       YES
www.walmart.com         YES               NO                        YES
www.officedepot.com     YES               NO                        YES

For two web sites, our list-of-products test failed to find the item boundaries, since the children of the conflicting parent node do not correspond to single product nodes but to sequences of nodes that together make up a single product. Hence, more sophisticated techniques for product record boundary detection, such as [13], are necessary. The extracted product data for the above experiments are available at http://www.public.asu.edu/~skoduri/datarover

7. RELATED WORK
Information extraction from the web is a well-studied problem, but most of the work is wrapper based. A wrapper is a small program that extracts the data from web sites; it is created either manually or semi-automatically after analyzing the location of the data in the HTML page. Wrapper-based approaches are described in [4, 5, 6, 7, 8]. The problems with these approaches are that wrapper creation requires user effort and site-specific changes in the web pages require maintenance. A second kind of approach is to learn and use a template, generated from a set of input pages and then used to do the actual extraction. A key characteristic of DataRover is that, unlike the early pioneering systems described in [9, 10, 12], it does not make any assumptions about the roles and usage patterns of the HTML tags within Web pages.

8. REFERENCES
[1] http://www.w3.org/dom/, DOM
[2] http://www.w3.org/tr/xpath, XPath
[3] http://lempinen.net/sami/jtidy/, JTidy parser

[4] Joachim Hammer, Hector Garcia-Molina, Junghoo Cho, Arturo Crespo, and Rohan Aranha. Extracting semistructured information from the Web. In Proceedings of the Workshop on Management of Semistructured Data, 1997.
[5] N. Kushmerick, D. Weld, and R. Doorenbos. Wrapper induction for information extraction. In Proc. of the 1997 Intl. Joint Conf. on Artificial Intelligence, 1997.
[6] I. Muslea, S. Minton, and C. A. Knoblock. A hierarchical approach to wrapper induction. In Proceedings of the Third International Conference on Autonomous Agents, Seattle, WA, 1999.
[7] Sergey Brin. Extracting patterns and relations from the World Wide Web. In WebDB Workshop at the 6th International Conference on Extending Database Technology, EDBT 98, 1998.
[8] Stephen Soderland. Learning extraction rules for semi-structured and free text. Machine Learning, 34:233-272, 1999.
[9] Valter Crescenzi, Giansalvatore Mecca, and Paolo Merialdo. RoadRunner: Towards automatic data extraction from large web sites. In Proc. of the 2001 Intl. Conf. on Very Large Data Bases, 2001.
[10] Arvind Arasu and Hector Garcia-Molina. Extracting structured data from web pages. In Fourth International Workshop on the Web and Databases, WebDB 2001.
[11] R. R. Korfhage. Information Storage and Retrieval. John Wiley Computer Publications, New York, 1999.
[12] C. Y. Chung, M. Gertz, and N. Sundaresan. Reverse engineering for web data: From visual to semantic structures. In Intl. Conf. on Data Engineering, 2002.
[13] D. W. Embley, Y. Jiang, and Y.-K. Ng. Record-boundary discovery in Web documents. In ACM SIGMOD International Conference on Management of Data, 1999.
[14] B. Chidlovskii. Automatic repairing of web wrappers. In WIDM 2001.
[15] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, October 1999.
[16] Luigi Arlotta, Valter Crescenzi, Giansalvatore Mecca, and Paolo Merialdo. Automatic annotation of data extracted from large web sites. In WebDB 2003, San Diego, CA.
[17] http://www.froogle.com
[18] http://store.yahoo.com/shoedini