Automatically Constructing a Directory of Molecular Biology Databases

Luciano Barbosa, Sumit Tandon, Juliana Freire
School of Computing, University of Utah
{lbarbosa, sumitt, juliana}@cs.utah.edu

Online Databases and the Hidden Web
- Millions of online databases [Hsieh et al. SIGMOD 2006]
- Web content hidden behind form interfaces
  - Look for books, airfare tickets, gene information
- Not accessible through traditional search engines
- How to find and leverage this information?
June 28, 2007

Searching for Molecular Biology Databases
- Google: "Molecular Biology Database" returns 27 million results
- Automatic, but imprecise

Making the Hidden Web more Accessible: Current Approaches
- Database directories [NAR database compilation - Galperin NAR2007, CompletePlanet.com]
- Web integration systems and metasearchers [Google Base; Chang et al. CIDR 2005]
- Hidden-Web crawlers [Raghavan & Garcia-Molina VLDB 2001; Barbosa & Freire SBBD 2004]
- Challenge: automation and scalability
- Requirements:
  - Automatically locating Web forms, the entry points to these databases [Barbosa and Freire WebDB2005, Barbosa and Freire WWW2007a]
  - Organizing databases: domain discovery [Barbosa et al., ICDE2007], identifying relevant forms [Barbosa and Freire WWW2007b]

Outline
- Locating online databases
- Organizing databases: identifying relevant forms
- The ACHE framework
- Case study: creating a high-quality Molecular Biology Databases directory

Find and identify the entry points to the hidden Web
- Given the set of all Web pages $W$, find $W_f \subset W$ such that $W_f$ contains searchable forms
- Very sparse space: the Web is huge and there are relatively few databases, so $|W_f| \ll |W|$
  - Google indexes several billion (>16 billion) pages
  - A few million databases [Hsieh et al., 2006]
  - Like looking for a few needles in a haystack
- Forms are not precisely defined
  - There is great variation in the way Web forms are designed, even within a well-defined domain
- Requirements
  - Perform a broad search
  - Avoid visiting pages unnecessarily

Traditional Web Crawlers
- Start from a seed set of URLs
- Recursively follow links to other documents
- Problems:
  - Too many documents must be retrieved before hitting the target
  - Inefficient: it takes too long to crawl the whole Web
[Figure: crawl graph; labels: seed, f, page, links]
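The crawl loop described on this slide can be sketched as a breadth-first traversal. This is a minimal illustration, not the crawler used in the talk: the `crawl` function, the `get_links` callback, and the toy link graph `TOY_WEB` are all assumptions introduced here so the example runs without network access.

```python
from collections import deque


def crawl(seeds, get_links, max_pages):
    """Breadth-first crawl: visit the seeds, then follow links level by level."""
    frontier = deque(seeds)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:  # never enqueue the same page twice
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical link graph standing in for the Web.
TOY_WEB = {
    "seed": ["a", "b"],
    "a": ["c", "form_page"],
    "b": ["d"],
    "c": [],
    "d": [],
    "form_page": [],
}

order = crawl(["seed"], lambda url: TOY_WEB.get(url, []), max_pages=10)
print(order)
```

In this toy graph the form page is only reached after every page at shallower depth has been fetched, which is the inefficiency the slide points out.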

Focused Crawlers
- Search and retrieve only the subset of the Web that pertains to a specific topic
  - Relevance is judged from the page contents
- Retrieves a small subset of the documents on the Web
- Problem: still inefficient
  - Too many pages retrieved: there are few forms even within a restricted domain
[Figure: crawl graph; labels: not relevant, f]

The ACHE Form-Focused Crawler
- Focus on a topic, just like a focused crawler
- Also focus on finding forms:
  - Prioritizes links to follow based on hyperlink path patterns
- Goal: ensure a high harvest rate (the fraction of pages fetched that contain forms)
[Figure: crawl graph; labels: not relevant, f]

A Form-Focused Crawler
- Page classifier: focuses the crawl on a specific topic based on the page content
- Link classifier: prioritizes promising links that are close to forms
- Frontier manager: implements the crawling policy to maximize reward
[Figure: crawler architecture with components Crawler, Page Classifier, Link Classifier, Frontier Manager, and Form Database; inset shows link neighborhoods at levels 1 and 2 around a form page, over on-topic and off-topic pages]
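How the three components might fit together can be sketched with a priority queue as the frontier. This is a simplified illustration under stated assumptions, not the ACHE implementation: the `Frontier` class, the callbacks `score_link`, `is_on_topic`, and `has_form`, and the toy data are hypothetical stand-ins for the learned classifiers. The harvest rate is computed as the fraction of fetched pages that contain forms.

```python
import heapq
import itertools


class Frontier:
    """Frontier manager: always hands out the link with the highest relevance."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker for equal scores
        self._seen = set()

    def add(self, link, relevance):
        if link in self._seen:
            return
        self._seen.add(link)
        # heapq is a min-heap, so store the negated relevance.
        heapq.heappush(self._heap, (-relevance, next(self._order), link))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def form_focused_crawl(seeds, get_links, score_link, is_on_topic, has_form, max_pages):
    frontier = Frontier()
    for seed in seeds:
        frontier.add(seed, 1.0)
    fetched, forms = 0, []
    while len(frontier) and fetched < max_pages:
        url = frontier.pop()
        fetched += 1
        if has_form(url):
            forms.append(url)
        if not is_on_topic(url):
            continue  # page classifier: do not expand off-topic pages
        for link in get_links(url):
            frontier.add(link, score_link(link))  # link classifier score
    harvest_rate = len(forms) / fetched if fetched else 0.0
    return forms, harvest_rate


# Hypothetical toy data standing in for the Web and the trained classifiers.
TOY_WEB = {"seed": ["news", "search"], "news": ["sports"], "search": ["query_form"]}
SCORES = {"news": 0.1, "search": 0.9, "query_form": 0.95, "sports": 0.05}

forms, harvest_rate = form_focused_crawl(
    seeds=["seed"],
    get_links=lambda url: TOY_WEB.get(url, []),
    score_link=lambda link: SCORES.get(link, 0.0),
    is_on_topic=lambda url: url != "sports",
    has_form=lambda url: url == "query_form",
    max_pages=3,
)
print(forms, harvest_rate)
```

With a budget of three pages, the prioritized frontier reaches the form page through `search` before spending a fetch on the low-scoring `news` link.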

Form-Focused Crawler Effectiveness
[Figure: plot for the Car domain; x-axis: number of pages (0 to 30,000); y-axis range 0 to 3000; series: form crawler (3 levels), form crawler (2 levels), form crawler (1 level), fixed depth, baseline SC]
- FFC obtains harvest rates that are substantially higher than those of other focused crawlers
- Multiple levels lead to improved efficiency

Identifying Relevant Forms
- Even a form-focused crawler retrieves a large percentage of irrelevant forms
  - Only 16% on average!
- Problem: given a set F of Web forms automatically gathered by a focused crawler in an online database domain D, select from F only the forms that are entry points to databases in D
- Challenge: form variability

Form Variability
- Not all forms lead to databases: searchable vs. non-searchable
[Figure: example forms labeled Searchable and Non-searchable]

Form Variability
- Different domains with similar content
[Figure: example forms labeled Hotel and Airfare]

Form Variability
- Heterogeneity within the same domain

HIFI: Identifying Relevant Forms
- HIFI: a new hierarchical classification framework for identifying forms in a domain
- Intuition: partition the set of features and classify in parts
[Figure: pipeline. Locating forms: Web pages -> Focused Crawler -> Forms. Identifying relevant forms: Forms -> Generic Form Classifier -> Searchable forms -> Domain-Specific Form Classifier -> Relevant forms. Feature spaces: page textual content, form structure, form textual content]

Looking at Form Structure
- Searchable forms share similar structure
- Structural features are good indicators of whether forms are searchable or not

Generic Form Classifier
- Uses structural features
- GFC is domain independent
- Previous classifiers for identifying searchable forms are domain dependent
- Use the content inside tags
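As an illustration of what structural features could look like, the sketch below counts form elements with Python's standard `html.parser`. The slides do not list the exact features the GFC uses, so the feature names here (`input:<type>`, `select`, `textarea`, `method`) are assumptions chosen for the example, and the two sample forms are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class FormStructure(HTMLParser):
    """Collects simple structural counts from the tags of an HTML form."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self.method = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input":
            # An input without a type attribute defaults to a text box.
            self.counts["input:" + (attrs.get("type") or "text").lower()] += 1
        elif tag in ("select", "textarea"):
            self.counts[tag] += 1


def structural_features(html):
    parser = FormStructure()
    parser.feed(html)
    parser.close()
    features = dict(parser.counts)
    features["method"] = parser.method
    return features


# Hypothetical example forms.
SEARCH_FORM = (
    '<form action="/search" method="get"><input type="text" name="q">'
    '<select name="cat"><option>a</option></select>'
    '<input type="submit" value="Go"></form>'
)
LOGIN_FORM = (
    '<form action="/login" method="post"><input type="text" name="user">'
    '<input type="password" name="pw"><input type="submit"></form>'
)

search_features = structural_features(SEARCH_FORM)
login_features = structural_features(LOGIN_FORM)
print(search_features)
print(login_features)
```

None of these features depend on the vocabulary of a particular domain, which is what makes a classifier built on them domain independent.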

Looking inside the Form Content
<form id="search_form" name="search" action="http://us.rd.yahoo.com/hotjobs/search/home/*" method="get">
  Search for Jobs Across the Web
  Job Category
  <select tabindex="4" name="industry1" id="industry1">
    <option value="fin">accounting/finance</option>
    <option value="adv">advertising/public Relations</option>
    <option value="art">arts/entertainment/publishing</option>
    <option value="bam">banking/mortgage</option>
  </select>
  Keyword(s) <input name="keywords_all" id="keywords_all" type="text" value="">
  (e.g. Job title, company, occupation)
  City & State or Zip
  <input tabindex="3" type="checkbox" align="left" name="metro_area" id="metro_area" value="1" checked /> Include surrounding cities
  <input type="hidden" name="country1" id="country1" value="usa">
</form>

Domain-Specific Form Classifier (DSFC)
- Forms in a given domain contain a well-defined and restricted vocabulary [He et al., CIKM 2004]
- Use textual content that can be automatically extracted from forms: forms as a bag of words
- Previous work relied on the ability to extract labels, a task that is hard to automate
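Treating a form as a bag of words can be sketched as follows: collect every text node between the form tags and count the tokens, with no attempt to match labels to fields. The tokenization rule (lowercase alphabetic runs) and the sample page are assumptions made for this example; the slides do not specify the preprocessing beyond stemming.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class FormText(HTMLParser):
    """Collects the text that appears between <form> and </form>."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False

    def handle_data(self, data):
        if self.in_form:
            self.chunks.append(data)


def bag_of_words(html):
    parser = FormText()
    parser.feed(html)
    parser.close()
    tokens = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    return Counter(tokens)


# Hypothetical page: text outside the form is ignored.
PAGE = (
    "<p>Welcome</p>"
    '<form action="/s"><b>Gene</b> name <input type="text" name="g"> Organism '
    '<select name="o"><option>human</option><option>mouse</option></select></form>'
)

bag = bag_of_words(PAGE)
print(bag)
```

Because no label extraction is involved, this step can run unattended over every form the crawler collects.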

Hierarchical Classification
- GFC
  - Coarse classification: high recall
  - Domain independent
- DSFC
  - Smaller search space: high precision
  - Domain specific
- Benefits
  - Simplification of the search space: construction of simpler classifiers
  - Appropriate learning techniques for each feature space
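The two-stage composition can be sketched as a simple cascade: the generic stage filters on structure, and the domain-specific stage only sees what survives. The rules below are hand-written stand-ins, not the learned GFC and DSFC models; the form records, the vocabulary, and the thresholds are all hypothetical.

```python
# Hypothetical domain vocabulary for the second stage.
DOMAIN_VOCABULARY = {"gene", "protein", "sequence"}


def is_searchable(form):
    """Stand-in for the GFC: looks only at structural features."""
    structure = form["structure"]
    return structure["text_inputs"] >= 1 and not structure["has_password"]


def is_in_domain(form):
    """Stand-in for the DSFC: looks only at the form's words."""
    return len(form["words"] & DOMAIN_VOCABULARY) >= 1


def hifi(forms):
    """Run the coarse generic filter first, then the domain-specific one."""
    searchable = [form for form in forms if is_searchable(form)]
    relevant = [form for form in searchable if is_in_domain(form)]
    return searchable, relevant


FORMS = [
    {"id": "login", "structure": {"text_inputs": 1, "has_password": True},
     "words": {"user", "password"}},
    {"id": "hotel_search", "structure": {"text_inputs": 2, "has_password": False},
     "words": {"city", "checkin"}},
    {"id": "gene_search", "structure": {"text_inputs": 1, "has_password": False},
     "words": {"gene", "organism"}},
]

searchable, relevant = hifi(FORMS)
print([form["id"] for form in searchable])
print([form["id"] for form in relevant])
```

Each stage works on its own feature space, so each can use whichever learning technique suits that space.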

HIFI Performance for Commercial Databases
- HIFI = GFC + DSFC
- High accuracy: from 0.89 to 0.99
- High recall: from 0.73 to 0.96
- High precision: from 0.80 to 0.97

HIFI vs. Monolithic Classifier
- Configuration 1: content
- Configuration 2: structure + content
- High recall
- More specific model
- Low precision over non-searchable forms
- Combining classifiers gives the best tradeoff between precision and recall

ACHE: Overview
- FFC + online learning
- Learning from experience: dynamically adapts the crawler policy
- Simplifies crawler configuration
- Improves harvest rate!
- Details in Barbosa & Freire, WWW2007
[Figure: ACHE architecture. Crawler, Page Classifier, Link Classifier, Frontier Manager, and Form Database; HIFI form filtering (Searchable Form Classifier, then Domain-Specific Form Classifier) producing relevant forms; Adaptive Link Learner with Automatic Feature Selection, fed by form paths]

Constructing the Molecular Biology Database Directory: Page Classifier
- Data gathered from dmoz.org
  - 2800 positive examples
  - 4671 negative examples: a wide variety of topics, plus spam communities such as casino and porn sites
  - Iterative process
- Rainbow, a freely available Naïve Bayes classifier
- Best words in terms of information gain: biology, molecular, protein, genome, ncbi
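Ranking words by information gain can be illustrated with a small worked computation. The four documents below are invented for the example and are not drawn from the dmoz.org training data; the function names are likewise assumptions. Information gain is computed here as the entropy of the class labels minus the weighted entropy after splitting on the presence of the word.

```python
import math


def entropy(pos, neg):
    """Binary entropy, in bits, of a set with pos positive and neg negative items."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result


def label_entropy(labels):
    positives = sum(labels)
    return entropy(positives, len(labels) - positives) if labels else 0.0


def information_gain(docs, word):
    """docs is a list of (set_of_words, label) pairs; label is True for on-topic."""
    with_word = [label for words, label in docs if word in words]
    without_word = [label for words, label in docs if word not in words]
    n = len(docs)
    return (
        label_entropy([label for _, label in docs])
        - len(with_word) / n * label_entropy(with_word)
        - len(without_word) / n * label_entropy(without_word)
    )


# Hypothetical training documents.
DOCS = [
    ({"protein", "genome"}, True),
    ({"protein", "sequence"}, True),
    ({"casino", "bonus"}, False),
    ({"casino", "poker"}, False),
]

ig_protein = information_gain(DOCS, "protein")
ig_genome = information_gain(DOCS, "genome")
print(ig_protein, ig_genome)
```

In this toy set "protein" separates the classes perfectly and gets the maximum gain of 1 bit, while "genome" appears in only one positive document and scores lower.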

Constructing the Molecular Biology Database Directory: Link Classifier
- Data: 64 form pages from the NAR collection
- Backward crawl of depth 3
- Features: the 5 most frequent words in each feature space (URL, anchor, and text in the link neighborhood)
- Tool: Weka
- Algorithm: Naïve Bayes

Constructing the Molecular Biology Database Directory: Form Classification
- The Generic Form Classifier (GFC) was not very effective out of the box
- The classifier was refined
  - Misclassified forms were added to the positive examples
  - Accuracy ~96%

Constructing the Molecular Biology Database Directory: Form Classification
- Domain-Specific Form Classifier
  - Positive examples from the NAR collection
  - Negative examples
    - Gathered from the searchable forms returned by the GFC
    - Covering chemistry, agriculture, molecular biology authors and journals, and non-English pages
  - Top stemmed terms: name, type, select, result, keyword, gene, sequenc

Experimental Evaluation
- Assess the effectiveness of ACHE for molecular biology databases
- Effectiveness measures
  - Accuracy, recall, precision, and specificity with respect to the number of relevant forms
- Crawler setup
  - 35 seeds from dmoz.org
  - 100,000 pages crawled
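The four measures have standard definitions over a confusion matrix, which the sketch below computes. The counts used in the example are hypothetical and are not taken from the evaluation reported in this talk.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard measures from true/false positive and negative counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall": tp / (tp + fn),        # share of relevant forms that were found
        "precision": tp / (tp + fp),     # share of selected forms that are relevant
        "specificity": tn / (tn + fp),   # share of irrelevant forms that were rejected
    }


# Hypothetical counts for illustration only.
scores = classification_metrics(tp=8, fp=2, tn=88, fn=2)
print(scores)
```

Specificity matters in this setting because irrelevant forms far outnumber relevant ones, so accuracy alone can look high even when many irrelevant forms slip through.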

Form Classification: Results
- GFC
  - Recall: 0.82
  - Specificity: 0.96
- Form filtering: GFC + DSFC
  - Recall: 0.73
  - Precision: 0.93
  - Accuracy: 0.96
- Form classification is effective

Performance of Crawler Configurations over Time
- Adaptive outperforms Static and Baseline

Comparison with NAR Collection
- NAR collection
  - 968 databases
  - Maintained over 7 years
  - Different database concept: there is not necessarily a searchable Web form
  - Crawl from the links in the NAR collection to depth 1: 20,000 pages retrieved and 700 relevant forms
- ACHE: 513 relevant forms in 2 hours

Conclusion and Future Work
- Scalable and effective solution to build and maintain high-quality database directories
  - Efficient crawling strategy to automatically discover hidden-Web databases
  - Accurate form classification
- With some modest tuning, the system can be customized for different domains
- Future directions
  - Simplify the process of gathering positive and negative examples
  - Query interface to access the data in the directory
  - Automatic generation of snippets describing databases