Automatically Constructing a Directory of Molecular Biology Databases

Luciano Barbosa, Sumit Tandon, Juliana Freire
School of Computing, University of Utah
{lbarbosa, sumitt, juliana}@cs.utah.edu

Online Databases and the Hidden Web
- Millions of online databases [Hsieh et al., SIGMOD 2006]
- Web content hidden behind form interfaces
  - Look for books, airfare tickets, gene information
- Not accessible through traditional search engines
- How to find and leverage this information?
June 28, 2007

Searching for Molecular Biology Databases
- Google query "Molecular Biology Database": 27 million results
- Automatic, but imprecise

Making the Hidden Web More Accessible: Current Approaches
- Database directories [NAR database compilation, Galperin NAR 2007; CompletePlanet.com]
- Web integration systems and metasearchers [Google Base; Chang et al., CIDR 2005]
- Hidden-Web crawlers [Raghavan & Molina, VLDB 2001; Barbosa & Freire, SBBD 2004]
Challenge: automation and scalability
Requirements:
- Automatically locating Web forms, the entry points to these databases [Barbosa and Freire, WebDB 2005; Barbosa and Freire, WWW 2007a]
- Organizing databases: domain discovery [Barbosa et al., ICDE 2007]; identifying relevant forms [Barbosa and Freire, WWW 2007b]

Outline
- Locating online databases
- Organizing databases: identifying relevant forms
- The ACHE framework
- Case study: creating a high-quality Molecular Biology Databases directory

Find and Identify the Entry Points to the Hidden Web
- Given the set of all Web pages W, find W_f ⊂ W such that W_f contains searchable forms
- Very sparse space: the Web is huge and there are relatively few databases, so |W_f| << |W|
  - Google indexes several billion (>16 billion) pages
  - A few million databases [Hsieh et al., 2006]
  - Looking for a few needles in a haystack
- Forms are not precisely defined
  - There is great variation in the way Web forms are designed, even within a well-defined domain
- Requirements
  - Perform a broad search
  - Avoid visiting pages unnecessarily

Traditional Web Crawlers
- Start from a seed set of URLs
- Recursively follow links to other documents
- Problems:
  - Too many documents must be retrieved before hitting the target
  - Inefficient: it takes too long to crawl the whole Web
[Figure: crawl graph with labels "seed", "page", "links", and "f"]
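The crawl loop on this slide can be sketched as a breadth-first traversal. This is a minimal illustration over a small in-memory link graph, not the live Web; the `crawl_bfs` helper, the `toy_web` graph, and its page names are all hypothetical and do not come from the talk.

```python
from collections import deque


def crawl_bfs(graph, seeds, max_pages):
    """Breadth-first crawl over `graph` (dict: url -> list of linked urls).

    Returns the URLs in the order they were fetched.
    """
    frontier = deque(seeds)
    seen = set(seeds)
    fetched = []
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        fetched.append(url)  # stands in for downloading the page
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched


# Hypothetical toy Web: only the page named "form" holds a searchable form.
toy_web = {
    "seed": ["a", "b"],
    "a": ["c", "d"],
    "b": ["e"],
    "e": ["form"],
}
order = crawl_bfs(toy_web, ["seed"], max_pages=10)
print(order)
```

In this toy graph the form page is the last one fetched, which mirrors the problem stated on the slide: an unfocused crawler retrieves many documents before hitting the target.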

Focused Crawlers
- Search and retrieve only the subset of the Web that pertains to a specific topic of relevance: consider the page contents
- Retrieve a small subset of the documents on the Web
- Problem: still inefficient
  - Too many pages retrieved: few forms even within a restricted domain
[Figure labels: "Not relevant", "f"]

The ACHE Form-Focused Crawler
- Focuses on a topic, just like a focused crawler
- Also focuses on finding forms: prioritizes links to follow based on hyperlink path patterns
- Goal: ensure a high harvest rate, i.e., the fraction of pages fetched that contain forms
[Figure labels: "Not relevant", "f", "f"]

A Form-Focused Crawler
- Page classifier: focuses on a specific topic based on the page content
- Link classifier: prioritizes promising links that are close to forms
- Frontier manager: implements the crawling policy to maximize reward
[Figure: crawler architecture with components Web, Crawler, Page Classifier, Link Classifier, Frontier Manager, and Form Database; edge labels Page, Links, (Link, Relevance), Most relevant link, and Forms. A second diagram shows a form page with its link neighborhoods at level 1 and level 2, among on-topic and off-topic pages.]
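The three components above can be sketched as a priority-queue crawl loop. This is an illustration under stated assumptions, not the ACHE implementation: the `Frontier` class, the `crawl` function, the hand-written `link_score` rule, and the toy `pages` dictionary are all hypothetical, and the page classifier is reduced to a precomputed `on_topic` flag. The function also reports the harvest rate, the fraction of fetched pages that contain forms.

```python
import heapq


class Frontier:
    """Priority queue of links; the highest relevance score is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker: equal scores pop in insertion order

    def push(self, url, relevance):
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-relevance, self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def crawl(pages, seeds, link_score, max_pages):
    """pages: dict url -> {"links": [...], "has_form": bool, "on_topic": bool}."""
    frontier = Frontier()
    for seed in seeds:
        frontier.push(seed, 1.0)
    order, with_forms = [], 0
    while len(frontier) > 0 and len(order) < max_pages:
        url = frontier.pop()
        page = pages[url]
        order.append(url)
        with_forms += int(page["has_form"])
        if not page["on_topic"]:  # page classifier: do not expand off-topic pages
            continue
        for link in page["links"]:
            frontier.push(link, link_score(link))  # link classifier
    harvest_rate = with_forms / len(order) if order else 0.0
    return order, harvest_rate


def link_score(url):
    # Hypothetical stand-in for a learned link classifier.
    return 0.9 if url in ("search", "query") else 0.1


pages = {
    "seed": {"links": ["news", "search"], "has_form": False, "on_topic": True},
    "news": {"links": [], "has_form": False, "on_topic": False},
    "search": {"links": ["query"], "has_form": False, "on_topic": True},
    "query": {"links": [], "has_form": True, "on_topic": True},
}
order, harvest_rate = crawl(pages, ["seed"], link_score, max_pages=3)
print(order, harvest_rate)
```

With a budget of three pages, the promising links are followed before the low-scoring `news` link, so the form page is reached within the budget.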

Form-Focused Crawler Effectiveness
[Figure: results for the Car domain plotted against the number of pages fetched (0 to 30,000), vertical axis 0 to 3,000. Legend text as extracted: form crawler (3 levels) / form crawler (2 levels) / form crawler (1 level) / fixed depth baseline SC]
- FFC obtains harvest rates that are substantially higher than other focused crawlers
- Multiple levels lead to improved efficiency

Identifying Relevant Forms
- Even a form-focused crawler retrieves a large percentage of irrelevant forms
  - Only 16% on average!
- Problem: given a set F of Web forms automatically gathered by a focused crawler in an online database domain D, select from F only the forms that are entry points to databases in D
- Challenge: form variability

Form Variability
- Not all forms lead to databases: searchable vs. non-searchable
[Figure: example forms labeled "Searchable" and "Non-searchable"]

Form Variability
- Different domains with similar content
[Figure: example forms labeled "Hotel" and "Airfare"]

Form Variability
- Heterogeneity within the same domain

HIFI: Identifying Relevant Forms
- HIFI: a new hierarchical classification framework for identifying forms in a domain
- Intuition: partition the set of features and classify in parts
[Figure: pipeline. Locating forms: Web pages -> Focused Crawler -> Forms. Identifying relevant forms: Forms -> Generic Form Classifier -> Searchable forms -> Domain-Specific Form Classifier -> Relevant forms. Feature labels: page textual content, form structure, form textual content.]

Looking at Form Structure
- Searchable forms share similar structure
- Structural features are good indicators of whether forms are searchable or not

Generic Form Classifier
- Uses structural features
- GFC is domain independent
- Previous classifiers for identifying searchable forms are domain dependent
  - Use the content inside tags
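To make "structural features" concrete, here is a minimal sketch that counts element types in a form using Python's standard `html.parser`. The specific features chosen (counts per input type, select and option counts, submission method) are an assumption for illustration; the slide does not list the features the GFC actually uses, and the `sample_form` markup is made up.

```python
from collections import Counter
from html.parser import HTMLParser


class FormStructure(HTMLParser):
    """Collects simple structural counts from a single HTML form."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self.method = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input":
            # An <input> without a type attribute defaults to a text box.
            input_type = (attrs.get("type") or "text").lower()
            self.counts["input_" + input_type] += 1
        elif tag in ("select", "option", "textarea"):
            self.counts[tag] += 1


def structural_features(form_html):
    parser = FormStructure()
    parser.feed(form_html)
    parser.close()
    features = dict(parser.counts)
    features["method_post"] = int(parser.method == "post")
    return features


# Hypothetical form markup, used only to exercise the extractor.
sample_form = """
<form action="/search" method="get">
  <select name="category">
    <option value="a">A</option>
    <option value="b">B</option>
  </select>
  <input name="keywords" type="text">
  <input type="checkbox" name="nearby" checked />
  <input type="hidden" name="country" value="usa">
  <input type="submit" value="Search">
</form>
"""
features = structural_features(sample_form)
print(features)
```

None of these features depends on the words in the form, which is what allows a classifier built on them to be domain independent.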

Looking inside the Form Content
<form id="search_form" name="search" action="http://us.rd.yahoo.com/hotjobs/search/home/*" method="get">
  Search for Jobs Across the Web
  Job Category
  <select tabindex="4" name="industry1" id="industry1">
    <option value="fin">accounting/finance</option>
    <option value="adv">advertising/public Relations</option>
    <option value="art">arts/entertainment/publishing</option>
    <option value="bam">banking/mortgage</option>
  </select>
  Keyword(s) <input name="keywords_all" id="keywords_all" type="text" value="">
  (e.g. Job title, company, occupation)
  City & State or Zip
  <input tabindex="3" type="checkbox" align="left" name="metro_area" id="metro_area" value="1" checked /> Include surrounding cities
  <input type="hidden" name="country1" id="country1" value="usa">
</form>

Domain-Specific Form Classifier (DSFC)
- Forms in a given domain contain a well-defined and restricted vocabulary [He et al., CIKM 2004]
- Uses textual content that can be automatically extracted from forms: forms as a bag of words
- Previous work relied on the ability to extract labels, a task that is hard to automate
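The "forms as a bag of words" idea can be sketched as follows: collect the text between the tags of a form, without trying to associate labels with fields, and count the words. The tokenization rule (lowercase alphabetic tokens of two or more letters) and the `sample_form` markup are assumptions made for this example, not details from the talk.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class FormText(HTMLParser):
    """Collects the text that appears between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def form_bag_of_words(form_html):
    parser = FormText()
    parser.feed(form_html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    # Assumed tokenization: alphabetic tokens with at least two letters.
    return Counter(re.findall(r"[a-z]{2,}", text))


# Hypothetical form markup, used only to exercise the extractor.
sample_form = (
    "<form>Search for Jobs Job Category "
    '<select><option value="fin">accounting/finance</option></select> '
    'Keyword(s) <input type="text" name="keywords_all"></form>'
)
bag = form_bag_of_words(sample_form)
print(bag)
```

Attribute values such as `keywords_all` are not part of the bag in this sketch because only text nodes are collected; whether attribute values should also be included is a design choice the slide does not settle.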

Hierarchical Classification
- GFC
  - Coarse classification: high recall
  - Domain independent
- DSFC
  - Smaller search space: high precision
  - Domain specific
- Benefits
  - Simplification of the search space: construction of simpler classifiers
  - Appropriate learning techniques for each feature space
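The two-stage composition can be sketched as a filter chain in which the second classifier only sees forms accepted by the first. The classifiers in the talk are learned models; the hand-written rules, the `hifi_filter` helper, and the toy forms below are hypothetical stand-ins chosen only to show how the stages compose.

```python
def hifi_filter(forms, is_searchable, is_in_domain):
    """Apply a generic stage, then a domain-specific stage, to a list of forms."""
    searchable = [form for form in forms if is_searchable(form)]  # stage 1 (GFC role)
    return [form for form in searchable if is_in_domain(form)]    # stage 2 (DSFC role)


def is_searchable(form):
    # Hypothetical structural rule: a text box and no password field.
    structure = form["structure"]
    return structure.get("input_text", 0) > 0 and structure.get("input_password", 0) == 0


def is_in_domain(form):
    # Hypothetical vocabulary rule for a molecular biology domain.
    return len(form["words"] & {"gene", "protein", "sequence"}) >= 1


forms = [
    {"id": "login", "structure": {"input_text": 1, "input_password": 1},
     "words": {"user", "password"}},
    {"id": "hotel", "structure": {"input_text": 2},
     "words": {"city", "checkin"}},
    {"id": "gene", "structure": {"input_text": 1},
     "words": {"gene", "sequence", "keyword"}},
]
relevant = [form["id"] for form in hifi_filter(forms, is_searchable, is_in_domain)]
print(relevant)
```

Because each stage works on its own feature space (structure, then text), each can be trained and tuned separately, which is the benefit the slide names.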

HIFI Performance for Commercial Databases
- HIFI = GFC + DSFC
- High accuracy: from 0.89 to 0.99
- High recall: from 0.73 to 0.96
- High precision: from 0.80 to 0.97

HIFI vs. Monolithic Classifier
- Monolithic configuration 1: content
- Monolithic configuration 2: structure + content
- High recall
- More specific model
- Low precision over non-searchable forms
- Combining classifiers gives the best tradeoff between precision and recall

ACHE: Overview
- FFC + online learning
- Learning from experience: dynamically adapts the crawler policy
- Simplifies crawler configuration
- Improves harvest rate!
- Details in Barbosa & Freire, WWW 2007
[Figure: ACHE architecture. Components: Crawler, Page Classifier, Searchable Form Classifier, Domain-Specific Form Classifier, Form Database, Frontier Manager, Link Classifier, Adaptive Link Learner, Automatic Feature Selection. Edge labels: Page, Forms, Searchable Forms, Relevant Forms, Links, (Link, Relevance), Most relevant link, Form path. The form classifiers are grouped under the label HIFI.]

Constructing the Molecular Biology Database Directory: Page Classifier
- Data gathered from dmoz.org
  - 2,800 positive examples
  - 4,671 negative examples: great variety, plus spam communities such as casino and porn
  - Iterative process
- Rainbow, a freely available Naïve Bayes classifier
- Best words in terms of information gain: biology, molecular, protein, genome, ncbi
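Information gain, used above to rank words, measures how much knowing whether a word occurs in a page reduces uncertainty about the page's class. A minimal sketch for a binary on-topic/off-topic split follows; the four toy documents are invented for the example, and this is not how Rainbow itself is invoked.

```python
import math


def entropy(pos, neg):
    """Binary entropy, in bits, of a set with `pos` and `neg` members."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(docs, labels, word):
    """docs: non-empty list of word sets; labels: list of bool (True = on-topic)."""
    n = len(docs)
    pos = sum(1 for y in labels if y)
    neg = n - pos
    with_pos = sum(1 for d, y in zip(docs, labels) if word in d and y)
    with_neg = sum(1 for d, y in zip(docs, labels) if word in d and not y)
    n_with = with_pos + with_neg
    n_without = n - n_with
    conditional = (
        (n_with / n) * entropy(with_pos, with_neg)
        + (n_without / n) * entropy(pos - with_pos, neg - with_neg)
    )
    return entropy(pos, neg) - conditional


# Hypothetical training pages, reduced to sets of words.
docs = [
    {"protein", "genome"},
    {"protein", "biology"},
    {"casino", "poker"},
    {"casino", "biology"},
]
labels = [True, True, False, False]
ig_protein = information_gain(docs, labels, "protein")
ig_biology = information_gain(docs, labels, "biology")
print(ig_protein, ig_biology)
```

In this toy set "protein" separates the classes perfectly and gets the maximum gain of one bit, while "biology" appears equally in both classes and gets zero.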

Constructing the Molecular Biology Database Directory: Link Classifier
- Data: 64 form pages from the NAR collection
- Backward crawl of depth 3
- Features: the 5 most frequent words in each feature space: URL, anchor, and text in the link neighborhood
- Tool: Weka
- Algorithm: Naïve Bayes
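The feature construction described above (the most frequent words per feature space) can be sketched like this. The record layout with `url`, `anchor`, and `around` keys, the helper names, and the example links are assumptions for illustration; the resulting binary vectors would then be handed to a classifier such as Naïve Bayes, which is not shown here.

```python
import re
from collections import Counter

FEATURE_SPACES = ("url", "anchor", "around")  # assumed record keys


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def top_words(texts, k=5):
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(k)]


def link_vocabulary(links, k=5):
    """Pick the k most frequent words separately in each feature space."""
    return {
        space: top_words([link[space] for link in links], k)
        for space in FEATURE_SPACES
    }


def link_features(link, vocabulary):
    """Binary feature per (space, word): does the link contain that word there?"""
    return {
        f"{space}:{word}": int(word in tokenize(link[space]))
        for space, words in vocabulary.items()
        for word in words
    }


# Hypothetical links gathered near form pages.
training_links = [
    {"url": "http://example.org/search/gene", "anchor": "Search genes",
     "around": "query the gene database"},
    {"url": "http://example.org/blast/search", "anchor": "BLAST search",
     "around": "sequence search tool"},
]
vocabulary = link_vocabulary(training_links, k=5)
new_link = {"url": "http://example.org/about", "anchor": "About us", "around": "contact"}
seen_features = link_features(training_links[0], vocabulary)
new_features = link_features(new_link, vocabulary)
print(vocabulary)
```

A real setup would likely drop uninformative tokens such as "http" before counting; the sketch keeps them to stay short.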

Constructing the Molecular Biology Database Directory: Form Classification
- The Generic Form Classifier (GFC) was not so effective
- The classifier was refined: misclassified forms were added to the positive examples
- Accuracy: ~96%

Constructing the Molecular Biology Database Directory: Form Classification
- Domain-Specific Form Classifier
  - Positive examples from the NAR collection
  - Negative examples
    - Gathered from the searchable forms returned by GFC
    - About chemistry, agriculture, authors and journals about molecular biology, non-English pages
  - Top stemmed terms
    - name, type, select, result, keyword, gene, sequenc

Experimental Evaluation
- Assess the effectiveness of ACHE for molecular biology databases
- Effectiveness measures: accuracy, recall, precision, and specificity with respect to the number of relevant forms
- Crawler setup
  - 35 seeds from dmoz.org
  - 100,000 pages crawled
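For reference, the four measures named above follow the standard definitions over a confusion matrix, where "positive" means a relevant form. The counts in the example call are made up to exercise the function and are not results from the talk.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard measures from confusion-matrix counts (positive = relevant form)."""

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "recall": ratio(tp, tp + fn),        # relevant forms that were kept
        "precision": ratio(tp, tp + fp),     # kept forms that are relevant
        "specificity": ratio(tn, tn + fp),   # irrelevant forms that were rejected
    }


# Hypothetical counts, chosen only for illustration.
metrics = classification_metrics(tp=40, fp=5, tn=45, fn=10)
print(metrics)
```

Specificity matters in this setting because most crawled forms are irrelevant, so a classifier can reach high accuracy while still letting many irrelevant forms through.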

Form Classification: Results
- GFC
  - Recall: 0.82
  - Specificity: 0.96
- Form filtering: GFC + DSFC
  - Recall: 0.73
  - Precision: 0.93
  - Accuracy: 0.96
- Form classification is effective

Performance of Crawler Configurations over Time
- Adaptive outperforms Static and Baseline

Comparison with NAR Collection
- NAR collection
  - 968 databases
  - Maintained over 7 years
  - Different database concept: there is not necessarily a searchable Web form
  - Crawled from the links in the NAR collection to depth 1: 20,000 pages retrieved and 700 relevant forms
- ACHE: 513 relevant forms in 2 hours

Conclusion and Future Work
- Scalable and effective solution to build and maintain high-quality database directories
  - Efficient crawling strategy to automatically discover hidden-Web databases
  - Accurate form classification
- With some modest tuning, the system can be customized for different domains
- Future directions
  - Simplify the process of gathering positive and negative examples
  - Query interface to access the data in the directory
  - Automatic generation of snippets describing databases