E-FFC: an enhanced form-focused crawler for domain-specific deep web databases


Yanni Li · Yuping Wang · Jintao Du

Received: 17 March 2012 / Revised: 8 August 2012 / Accepted: 14 August 2012
© Springer Science+Business Media, LLC 2012

Abstract  A key problem in retrieving, integrating and mining the rich, high-quality information held in the massive number of Deep Web Databases (WDBs) online is how to automatically and effectively discover and recognize domain-specific WDBs entry points, i.e., forms, on the Web. The task is challenging because domain-specific WDBs forms, with their dynamic and heterogeneous properties, are very sparsely distributed over several trillion Web pages. Although significant efforts have been made to address the problem and its special cases, more effective solutions remain to be explored for achieving a satisfactory harvest rate and coverage rate of domain-specific WDBs forms simultaneously. In this paper, an Enhanced Form-Focused Crawler for domain-specific WDBs (E-FFC) is proposed as a novel framework to address the limitations of existing solutions. The E-FFC, based on a divide-and-conquer strategy, employs a series of novel and effective strategies/algorithms, including a two-step page classifier, a link scoring strategy, classifiers for advanced searchable and domain-specific forms, and crawling stopping criteria, to achieve an optimized harvest rate and coverage rate of domain-specific WDBs forms simultaneously. Experiments with the E-FFC over a number of real Web pages in a set of representative domains have been conducted, and the results show that the E-FFC outperforms the existing domain-specific Deep Web Form-Focused Crawlers in terms of harvest rate, coverage rate and crawling robustness.

Keywords  Deep Web Databases (WDBs) · Form-Focused Crawler (FFC) · Harvest rate · Coverage rate

Y. Li (corresponding author) · Y. Wang
School of Computer Science and Technology, Xidian University, Xi'an, People's Republic of China
yannili@mail.xidian.edu.cn

J. Du
School of Software, Xidian University, Xi'an, People's Republic of China

1 Introduction

The Web has recently been rapidly deepened by massive numbers of WDBs online (Steve and Giles 1998, 1999; Bergman 2001; Ghanem and Aref 2004; Chang et al. 2004; He et al. 2007; BrightPlanet.com 2001). The information hidden in WDBs can only be accessed by filling out WDBs entry points, i.e., forms, and submitting queries. Conventional search engines cannot automatically fill out these forms and submit queries, and therefore they are unable to index the information in WDBs directly. The part of the Web that cannot be indexed by conventional search engines is called the Deep Web (also called the Invisible Web or Hidden Web); in contrast, the set of static Web pages that can be indexed is called the Surface Web. Studies have shown that the majority of the high-quality information in the Deep Web originates from WDBs (Chang et al. 2004; He et al. 2007). According to the estimate in the white paper issued by BrightPlanet in July 2000, the number of WDB sites had reached the range of 43,000 to 96,000, the information involved was estimated at about 7,500 terabytes, and the amount of information in the Deep Web was about 550 times that in the Surface Web (BrightPlanet.com 2001). Later research conducted by He et al. in 2004 and 2007, who randomly collected and systematically analyzed 1,000,000 independent IPs as experimental samples, has shown that, within 99 % confidence intervals, the number of WDB sites had reached about 236,000 to 377,000, the number of WDBs had risen dramatically to more than 366,000, and the amount of information in the Deep Web had increased 3 to 7 times in 4 years (from 2000 to 2004) (Chang et al. 2004; He et al. 2007).

The information in WDBs is usually characterized as structured in representation, prodigious in quantity and subject-oriented in content. The retrieval, mining and integration of the relevant information in WDBs are therefore vital for various applications. At present, the relevant research has mainly been conducted along the following lines: (1) large-scale integration and retrieval of WDBs information in a specific domain (Chang et al. 2005; Peng et al. 2004; Wu et al. 2004); (2) surfacing the Deep Web (Madhavan et al. 2008, 2009; Hornung et al. 2009); (3) task/service-oriented strategies (Rocco et al. 2004; Dehua 2010; He et al. 2006). To apply these methods effectively, a fundamental yet challenging problem is how to discover and recognize domain-specific WDBs entry points, i.e., forms, on the Web automatically and efficiently. The following factors make this problem particularly complicated:

WDBs cover almost all areas of the real world, and the number of WDBs is huge and still growing explosively. Nevertheless, WDBs forms are sparsely distributed, given the sheer scale of the Web. Experiments estimate that the ratio of Web pages containing WDBs forms to all Web pages is only about 0.0066 % (1,258,000/19.2 billion Web pages), and the number of forms for any single domain is even sparser (Chang et al. 2004; He et al. 2007, 2006). Discovering and recognizing domain-specific WDBs forms among such a great number of Web pages is like looking for a needle in a haystack;

WDBs forms are dynamic, heterogeneous and diverse, because the WDBs on the Web are constantly changing: new WDBs are added, while old WDBs are removed or modified.
Moreover, it is difficult to abstract and clearly specify the WDBs form schemas, even in a

well-defined domain (e.g., the book domain) where the mean number of form attributes is small and the form structures are relatively simple. This in turn makes further target-oriented search difficult in the absence of clear WDBs form schemas;

The WDBs forms (also known as searchable forms) are very similar to so-called non-searchable forms that do not represent WDBs, such as forms for login, mailing lists, subscriptions, quote requests, other Web-based forms, etc. In addition, because of the correlation between domains, WDBs forms from other related domains are often found during the search for domain-specific WDBs forms. For example, while forms from the Airfares domain are being searched for, a large number of WDBs forms from Rental Cars and Hotels are likely to be retrieved, since these forms are often co-located with Airfares forms on travel sites. It is therefore considerably difficult to distinguish searchable forms from non-searchable forms and to filter out forms from non-focused domains.

It is worth pointing out that several Deep Web portal services (e.g., CompletePlanet, Invisible-web) have emerged online, providing Deep Web directories classified manually into a taxonomy. However, the maximum coverage is only 15.6 % of the total WDBs on the whole Web, and some directories cover far less (Chang et al. 2004; He et al. 2007). Clearly, such directory-based WDBs indexing services can hardly solve the problem dealt with in this paper.

Although significant efforts (Chang et al. 2005; He et al. 2006; Wang et al. 2008; Li et al. 2010; Raghavan and Garcia-Molina 2001; Barbosa and Freire 2005, 2007) have been made to address the problem of automatic discovery and recognition of domain-specific WDBs forms on the whole Web, as well as its special cases (Aggarwal et al. 2001; Cope et al. 2003; Ester et al. 2004; Zhang et al. 2005; Akilandeswari and Gopalan 2008; Wang et al. 2009b), existing solutions still fall short of achieving a satisfactory harvest rate and coverage rate of domain-specific WDBs forms simultaneously. As a new attempt, this paper develops an Enhanced Form-Focused Crawler for domain-specific WDBs (E-FFC), a novel framework designed to overcome the limitations of the available solutions. The main contributions and characteristics of this research are as follows:

(1) The E-FFC is based on a divide-and-conquer strategy. A series of novel and effective strategies/algorithms, such as a two-step page classifier, a link scoring strategy, classifiers for advanced searchable and domain-specific forms, and crawling stopping criteria, have been proposed to automatically and effectively discover and recognize domain-specific WDBs forms on the Web;

(2) With a site-based view of the Web, the E-FFC gradually prunes its search space. Based on learning the link path features of target pages, a novel link scoring strategy and an effective two-step page classifier, the E-FFC can stay focused on its target throughout the crawling process and identify the promising links to target pages effectively. Moreover, rational crawling stopping criteria are employed to avoid unproductive search;

(3) Performance comparison experiments of the FC, FFC, and E-FFC over a number of real Web pages in a set of representative domains have been carried out.

Experimental results demonstrate that the E-FFC outperforms the existing domain-specific Deep Web Form-Focused Crawlers in terms of harvest rate, coverage rate and crawling robustness.

The remainder of the paper is organized as follows: Section 2 briefly reviews previous work on the problem. Section 3 gives a formal description of the problem of domain-specific WDBs Form-Focused Crawling and presents the E-FFC framework. Section 4 discusses the various strategies and relevant algorithms adopted by the E-FFC in detail. Section 5 presents experimental results and analysis. Section 6 summarizes the paper and outlines future work.

2 Related work

A brief overview of the main previous work on the automatic discovery and recognition of domain-specific WDBs forms is given below.

The automatic discovery and recognition of domain-specific WDBs forms was first realized with General-purpose Crawlers. For example, Chang et al. (2005) developed a Deep Web data integration system prototype called MetaQuerier, in which the WDBs Form Crawler (FC) was realized with General-purpose Crawler technology. A General-purpose Crawler is a program that automatically extracts Web pages, usually using breadth-first search to locate and crawl to its target on the Web. During the target search process, a General-purpose Crawler has to crawl exhaustively through a large number of irrelevant Web pages, leading to a huge search space and poor efficiency, which makes it practically unusable in many circumstances.

To overcome the drawbacks of General-purpose Crawlers, a topic-focused Crawler (Chakrabarti et al. 1999) (referred to as a Focused Crawler for short) and a variety of improved topic-focused Crawlers/algorithms (Rennie and McCallum 1999; Diligenti et al. 2000; Chakrabarti et al. 2002; Jamali et al. 2006; Castillo 2005; Chau and Chen 2008; Wang et al. 2009a; Yadav et al. 2009; Bazarganigilani et al. 2011) were developed. Based on a topic-focused approach, Focused Crawlers can filter out irrelevant, i.e., off-topic, links and put only the most valuable/promising links into the URL queue waiting to be crawled, thus greatly reducing the search space and improving search efficiency and quality. However, due to the correlation between domains, domain-specific WDBs forms reside not only on topic-focused Web pages but also on off-topic, related Web pages (e.g., with Hotels and Airfares being related domains, their WDBs forms are often co-located on travel Web sites). A topic-focused Crawler searching for domain-specific WDBs forms would therefore miss forms that reside on other related pages.

Using an object-focused, topic-neutral and coverage-comprehensive technique, He et al. (2006) introduced a structure-driven, yield-aware Web Form Crawler. Their experiments show that the Web Form Crawler can maintain a stable and very high form harvest rate and coverage rate throughout its crawling. However, the Web Form Crawler is topic-neutral, as its purpose is to build a database of online databases in each domain as a map of the Deep Web; therefore, it does not meet the requirement of this paper for the effective discovery and recognition of domain-specific WDBs forms on the Web.

Raghavan and Garcia-Molina (2001) proposed a task-specific hidden-Web Crawler, HiWE, which can semi-automatically fill out structured Web forms. This pioneering work automates Deep Web crawling to a great extent, but it requires some data to be input manually to set up the forms' attribute label sets.

On the other hand, a Focused Crawler's performance also depends heavily on its link selection algorithm during the crawling process (Cho et al. 1998; Bharat et al. 1998; Novak 2004). A link selection algorithm estimates the value of a link with respect to a search target by building up relations between Web documents, and then decides whether to select and fetch the link for crawling. Some preferable link selection algorithms have been proposed, e.g., the PageRank and HITS algorithms (Cho et al. 1998; Novak 2004; Jamali et al. 2006). However, because they are page-based rather than WDBs form-based, these link selection algorithms are not suitable for domain-specific WDBs form Crawlers and result in inefficiency. For example, in the movie domain, when a Focused Crawler had fetched 100,000 pages, it obtained only 94 relevant searchable forms (Chakrabarti et al. 1999); that is, the harvest rate is merely 0.094 %.

Barbosa and Freire (2005) combined the topic-focused technique with a link classifier and presented a Form-Focused Crawler (FFC), which can automatically and efficiently find domain-specific WDBs forms. The FFC achieved much better search results by focusing its crawling on a given topic, by judiciously choosing links to follow within a topic that are more likely to lead to pages containing forms, and by employing appropriate stopping criteria. However, the FFC depends greatly on the seed quality of the manually trained link classifier. Moreover, for a set of representative database domains, on average only 16 % of the forms retrieved by the FFC are actually relevant. To overcome these defects, Barbosa and Freire (2007) later presented an improved FFC, the Adaptive Crawler for Hidden-Web Entries (ACHE), which aims to efficiently and automatically locate other forms in the same domain by introducing an agent learning model. The existing literature and research results show that the FFCs proposed by Barbosa and Freire (2005, 2007) are relatively effective and achieve a higher harvest rate than other existing FFCs. Nevertheless, none of these FFCs effectively addresses the performance metric of form coverage rate.

The E-FFC proposed in this paper mainly differs from the FFCs of Barbosa and Freire (2005, 2007) in the following aspects: (1) The Page Classifier of the FFCs, which adopts the Naïve Bayesian categorization algorithm, always focuses the crawler on pages of the specific domain. However, as discussed above, due to the correlation among domains, Deep Web forms exist not only on pages of the specific domain but also, often, on pages of relevant domains (e.g., hybrid-domain pages). Thus, the Page Classifier of the FFCs (Barbosa and Freire 2005, 2007) can fail to find the forms on relevant-domain pages, which reduces the form coverage rate. To overcome this shortcoming, a novel two-step page classification strategy is proposed for the E-FFC in this paper, which can accurately crawl pages in both the specific domain and the relevant domains. (2) To efficiently identify promising links/delayed benefit links, Deep Web Crawlers depend heavily on their link scoring strategies.
However, these strategies were never clearly specified for the FFCs in Barbosa and Freire (2005, 2007), whereas in the E-FFC an efficient and effective link scoring strategy (see (6) and (7) in Section 4.2.4) is explicitly designed. (3) For a set of representative database domains, on average, only a 16 % harvest rate was obtained by the FFC

in Barbosa and Freire (2005), and no more than a 50 % harvest rate by the FFC in Barbosa and Freire (2007). Although these results are better than those obtained in other existing works, they are not good enough. To improve the harvest rate of the FFCs, the E-FFC investigates and utilizes some inherent characteristics of possible good links (see Section 4.2) and presents a novel Domain-Specific Form Classifier based on ontology technology. As a result, the harvest rate can be greatly improved. (4) The E-FFC adopts four crawling stopping criteria instead of the two crawling stopping criteria in the FFCs of Barbosa and Freire (2005, 2007), which greatly reduces the crawling time and improves the efficiency of the crawler. In addition, a breakpoint preservation mechanism is adopted in the E-FFC, which preserves the pages retrieved by the crawler even when crawling is suddenly interrupted. Together, these schemes greatly improve efficiency and robustness over the existing FFCs.

3 Problem description and framework of the E-FFC

3.1 Problem description

Each WDB presents only its external view to the public, in the form of either simple or advanced/complex forms (so-called query interfaces). These forms are the only entrances through which users can access the WDBs. Simple forms are similar to the search boxes of search engines (shown in Fig. 1a), while advanced/complex forms consist of multiple attributes. Each attribute in an advanced/complex form has a label marking its semantic information, corresponding to the attributes/columns of the underlying WDB tables (shown in Fig. 1b).

Fig. 1  Simple and advanced/complex WDBs forms in the Books domain

The performance of a domain-specific WDBs Form-Focused Crawler is generally measured by the harvest rate and the coverage rate (He et al. 2006). Let P denote the total number of Web pages, P_c the total number of Web pages crawled by a Form-Focused Crawler, N_f the total number of searchable forms in a specific domain, and N_cf the number of searchable forms in a specific domain crawled

by a Form-Focused Crawler. The harvest rate H and the coverage rate C are then defined as follows:

$$H = \frac{N_{cf}}{P_c}, \qquad C = \frac{N_{cf}}{N_f} \qquad (1)$$

where $0 < P_c, N_{cf}, N_f < P$ and $N_{cf} \le N_f$.

To construct an effective domain-specific WDBs Form-Focused Crawler, two conflicting requirements must be met: (1) to be efficient, the Form-Focused Crawler must have a high harvest rate H, acquiring as many forms as possible without crawling many pages; (2) to be comprehensive, it must have a high coverage rate C so as to widely cover a reasonable snapshot of all the WDBs forms on the Web. To overcome the shortcomings of the available FFCs, the E-FFC has been developed in this paper with the aim of achieving an optimized harvest rate H and coverage rate C of domain-specific forms simultaneously and efficiently.

3.2 The framework of the E-FFC

The framework of the E-FFC is illustrated in Fig. 2. It mainly consists of seven components:

Fig. 2  The framework of the E-FFC

(1) Crawler. It crawls and extracts pages based on the most relevant link that it has chosen;
(2) Page Classifier. It is trained to determine the domain that a page belongs to in a taxonomy (e.g., Books, Jobs, Automobiles in Dmoz);
(3) Link Classifier. It learns the features of links and paths leading, in one or several steps, to target pages that contain searchable forms, then extracts relevant links from pages and places them into the appropriate priority queue in the Frontier Manager;
(4) Frontier Manager. It manages two types of queues, i.e., a single-level site-link queue, which stores the root page links of relevant websites, and multi-level insite-link priority queues, which store promising links within a website, in order to conduct the crawling process efficiently;
(5) Searchable Form Classifier. It judges whether a form is searchable or non-searchable and filters out the non-searchable forms;
(6) Domain-Specific Form Classifier. It determines the domain that a searchable form belongs to, selects only the searchable forms in the domain of interest, and then adds them to the Form Database if they are not already present;
(7) Adaptive Domain Feature Learner. It automatically learns patterns from Form Database samples so as to improve the performance of the Page Classifier, Link Classifier and Domain-Specific Form Classifier.
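To make the interplay of these components concrete, the following is a minimal structural sketch of the E-FFC pipeline. All class and method names here are illustrative placeholders introduced for this sketch, not the authors' implementation.

```python
# Structural sketch only: each component is assumed to be an object exposing
# the (hypothetical) methods used below.

class EFFC:
    def __init__(self, crawler, page_clf, link_clf, frontier,
                 searchable_clf, domain_clf, feature_learner, form_db):
        self.crawler = crawler                  # (1) fetches pages for chosen links
        self.page_clf = page_clf                # (2) two-step page classifier
        self.link_clf = link_clf                # (3) scores links leading to target pages
        self.frontier = frontier                # (4) site-link queue + insite priority queues
        self.searchable_clf = searchable_clf    # (5) searchable vs. non-searchable forms
        self.domain_clf = domain_clf            # (6) keeps only focused-domain forms
        self.feature_learner = feature_learner  # (7) refines (2), (3), (6) from form samples
        self.form_db = form_db

    def crawl(self):
        while self.frontier.has_links():
            link = self.frontier.next_link()
            page = self.crawler.fetch(link)
            if page is None or not self.page_clf.is_relevant(page):
                continue
            # (3)+(4): score outgoing links and enqueue the promising ones
            for out_link, score in self.link_clf.score_links(page):
                self.frontier.enqueue(out_link, score)
            # (5)+(6): keep only domain-specific searchable forms
            for form in page.forms:
                if self.searchable_clf.is_searchable(form) and \
                        self.domain_clf.in_focused_domain(form):
                    if form not in self.form_db:
                        self.form_db.add(form)
                        self.feature_learner.update(form)
```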

By means of components (1)-(4) and (7) above, the E-FFC coordinates an efficient search and avoids crawling through unproductive paths, while components (5) and (6) are used in sequence to identify and collect domain-specific WDBs forms from the gradually pruned search space of Web forms. The implementation and underlying algorithms of the E-FFC are discussed in detail in Section 4.

4 Implementation and underlying algorithms of the E-FFC

4.1 Page classifier

Based on the similarity between an extracted Web page and a given specific domain, the Page Classifier of the E-FFC determines the domain that the page belongs to in a taxonomy (e.g., Books, Jobs, Automobiles in Dmoz). Considering the correlation between domains, and in order to overcome the inability of the Naïve Bayesian categorization algorithm to judge multi-domain pages, the Page Classifier of the E-FFC introduces a novel two-step page classification strategy so as to accurately classify an extracted Web page into a given specific or relevant domain.

4.1.1 Definition of the text feature vector

Because both a Web page and a specific domain can be viewed as collections of texts, they can be represented by text feature vectors. A text T can be seen as a set of words, whose features can be characterized by the Vector Space Model (VSM). That is, T can be represented by its feature vector V as:

$$V = (w_1 t_1, \ldots, w_i t_i, \ldots, w_n t_n) \qquad (2)$$

where $t_i$ is the i-th feature word of the text T, and $w_i$ is the weight of the word $t_i$, which is usually computed by (3):

$$w_i = \frac{tf(t_i, T)\,\log(N/n_{t_i})}{\sqrt{\sum_{t_i \in V} \left[ tf(t_i, T)\,\log(N/n_{t_i}) \right]^2}} \qquad (3)$$

where $tf(t_i, T)$ is the frequency of the word $t_i$ in the text T, N is the total number of training texts, $n_{t_i}$ is the number of training texts containing the word $t_i$, and the denominator is a normalization factor.

Based on (2) and (3), the feature vectors of both a Web page and a specific domain can be obtained. The feature vector of a Web page is composed of the feature words extracted from the page, while a domain-specific feature vector consists of all the feature words that are highly representative of the domain.
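As a concrete illustration of (2) and (3), the short sketch below computes the normalized TF-IDF weights of a text against a set of training texts. It is a minimal sketch under the assumption that stop-word removal and stemming have already been applied; the function and variable names are illustrative only.

```python
import math
from collections import Counter

def feature_vector(text_words, training_texts):
    """Normalized TF-IDF weights of equations (2)-(3).

    text_words     : list of pre-processed words of the text T
    training_texts : list of word lists, one per training text (N texts)
    Returns {word: weight}, i.e. the feature vector V of T.
    """
    N = len(training_texts)
    tf = Counter(text_words)
    # n_ti: number of training texts containing the word t_i
    doc_freq = Counter()
    for words in training_texts:
        for w in set(words):
            doc_freq[w] += 1
    raw = {t: tf[t] * math.log(N / doc_freq[t])
           for t in tf if doc_freq.get(t, 0) > 0}
    norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
    return {t: v / norm for t, v in raw.items()}
```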

4.1.2 Construction of the feature vector of domain-specific pages containing WDBs forms

The success of the E-FFC's Page Classifier depends heavily on the construction of the feature vector of domain-specific Web pages. This paper uses the UIUC dataset (The UIUC Web 2003) as learning samples to construct the feature vector of a domain-specific Web page; the dataset contains 447 real WDBs forms in 8 domains (Hotels, Automobiles, Books, CarRentals, MusicRecords, Airfares, Jobs and Movies). As the UIUC dataset was collected in 2003, some of the IPs have changed their DNS entries or no longer exist. A total of 243 accessible WDBs forms from the above 8 domains have been collected here, and the results are shown in Table 1.

Table 1  WDBs forms page sample numbers in 8 domains from the UIUC dataset (The UIUC Web 2003)
Airfares 28, Automobiles 40, Books 39, CarRentals 14, Hotels 25, Jobs 23, Movies 39, MusicRecords 35

For a domain-specific sample page that contains WDBs forms, the algorithm for extracting its feature vector is illustrated in Fig. 3. Based on the algorithm in Fig. 3, the page feature vectors of the Airfares, Books, Automobiles and Hotels domains were obtained from the sample dataset (shown in Table 1) and are shown in Table 2. Due to the paper length restriction, only the top ten items are listed, wherein the items in bold are highly representative of the domain.

4.1.3 A novel two-step page classifier algorithm

Given $V_i$, the feature vector of a page $P_i$, and $V_j$, a feature vector of domain-specific pages, the similarity degree between them can be calculated as follows:

$$Sim(V_i, V_j) = \frac{\sum_{k=1}^{M} w_{ik} w_{jk}}{\sqrt{\left(\sum_{k=1}^{M} w_{ik}^2\right)\left(\sum_{k=1}^{M} w_{jk}^2\right)}} \qquad (4)$$

where M is the dimension of a feature vector, $w_{ik}$ is the weight of the k-th feature word of feature vector $V_i$, and $w_{jk}$ is the weight of the k-th feature word of feature vector $V_j$.

The above discussion shows that the main task of the E-FFC's Page Classifier is to determine which domain a given page belongs to according to (4), so that only the irrelevant pages, i.e., the pages not belonging or not relevant to a specific domain, are filtered out. The Page Classifier of a traditional Form-Focused Crawler adopted the Naïve Bayesian classifier and failed to determine which domain a hybrid-domain page mostly belongs to, thus discarding some useful pages that may contain domain-specific WDBs forms. To overcome this limitation, a novel two-step page classifier algorithm is presented in this paper, which can accurately determine not only the specific-domain pages (step 1) but also the relevant-domain pages (step 2).
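A small sketch of the similarity measure (4), operating on feature vectors represented as word-to-weight dictionaries (for example, as produced by the TF-IDF sketch above); the names are illustrative.

```python
import math

def cosine_similarity(v_i, v_j):
    """Equation (4): cosine similarity between two feature vectors,
    each given as a {feature_word: weight} dict."""
    common = set(v_i) & set(v_j)
    num = sum(v_i[w] * v_j[w] for w in common)
    den = math.sqrt(sum(x * x for x in v_i.values())) * \
          math.sqrt(sum(x * x for x in v_j.values()))
    return num / den if den else 0.0
```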

Fig. 3  The pseudo code of the feature-vector extraction algorithm for a domain-specific page containing WDBs forms

Table 2  Top ten items of the feature vectors of the Airfares, Books, Automobiles and Hotels domains
Airfares: Flight, Hotel, Calplacehold, Travel, Swis, Southwest, Air, Vacation, Rate, Airport
Books: Book, Manual, Textbook, Subject, Buy, Store, Asset, Bookshop, Gift, Intro
Automobiles: Car, Rear, Sedan, Model, Hatchback, Coupe, Wagon, Array, Pickup, Minivan
Hotels: Resort, Bali, Villa, Lodge, Allhotel, City, Room, Htlclrd, Inn, Beach

We also design an efficient determination scheme for the similarity threshold δ (Section 4.1.4), by which the similarity measure (4) can efficiently distinguish the specific-domain pages from the relevant-domain pages. The details of the proposed two-step page classifier algorithm are as follows:

Algorithm (A novel two-step page classifier algorithm)

Step 1. Determine the similarity between a page and the focused domain. If the similarity value is greater than a certain threshold δ, the page is considered to belong to the focused domain, and the Page Classifier extracts the forms and/or links in the page; otherwise, go to Step 2.

Step 2. Determine the similarity between the page and the other domains (e.g., Automobiles, Books, MusicRecords, Airfares, Jobs, and Movies),

and find the domain with the biggest similarity. Then judge whether the found domain is relevant to the focused domain. If it is, the Page Classifier extracts the forms and/or links in the page; otherwise, the page is considered irrelevant to the focused domain and is discarded.

4.1.4 Determination of the page similarity threshold

To determine the domain that a page belongs to, the key issue is to define the threshold δ of the similarity degree. Experiments to determine the threshold δ using the dataset in Table 1 were designed as follows:

Experiment 1. Extract domain-specific pages from the dataset in Table 1, and then compute the similarity between each page and the feature vector of the domain it belongs to.

Experiment 2. Extract domain-specific pages from the dataset in Table 1, and compute the similarity between these pages and the feature vectors of the other domains.

The experimental results are shown in Fig. 4, in which the two marker types represent the results of Experiments 1 and 2, respectively. By observing and analyzing the experimental results shown in Fig. 4, three conclusions can be drawn: (1) pages belonging to the same domain almost always have similarity values above 0.18; (2) some pages, such as those belonging to comprehensive websites, may have a relatively high similarity with the feature vectors of some relevant domains; (3) only a few pages have similarity values of less than 0.18 with the feature vector of their own domain, owing to their limited content. Based on these experimental and analytical results, the appropriate threshold δ is determined to be 0.18.

Fig. 4  The results of the experiments to determine the similarity degree threshold δ (domain similarity vs. number of pages, for Experiments 1 and 2)
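A compact sketch of the two-step classification above, using the threshold δ = 0.18 and a similarity function such as the cosine measure sketched earlier. How the relevance between domains is established is not specified in this excerpt, so the `RELEVANT_DOMAINS` map below is only a hypothetical stand-in, as are the other names.

```python
DELTA = 0.18  # similarity threshold from Section 4.1.4

RELEVANT_DOMAINS = {            # hypothetical relevance map between domains
    "Airfares": {"Hotels", "CarRentals"},
    "Hotels": {"Airfares", "CarRentals"},
}

def classify_page(page_vec, focused_domain, domain_vectors, similarity):
    """Return True if the page should be kept (forms/links extracted)."""
    # Step 1: compare against the focused domain.
    if similarity(page_vec, domain_vectors[focused_domain]) > DELTA:
        return True
    # Step 2: find the most similar other domain and check its relevance.
    others = {d: similarity(page_vec, v)
              for d, v in domain_vectors.items() if d != focused_domain}
    best = max(others, key=others.get)
    return best in RELEVANT_DOMAINS.get(focused_domain, set())
```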

4.2 Link classifier

To deal with the very sparse distribution of domain-specific WDBs forms and to achieve an optimized harvest rate H and coverage rate C simultaneously, the E-FFC's Link Classifier should efficiently extract from a related page only those links that are promising for reaching, in one or multiple steps, a target page containing domain-specific WDBs forms. This differs from an Exhaustive Crawler, which extracts all the links of a page. By learning the features of good links and of paths that lead to target pages containing domain-specific WDBs forms, the E-FFC's Link Classifier assigns a score to each link using the link scoring strategy proposed in this paper; the score reflects the distance between the link and a relevant target page. In this section, several key strategies/algorithms of the E-FFC's Link Classifier are discussed.

4.2.1 Extracting possible good links

A link in a Web page, written in the html tag <a href=...> (<a> is shorthand for anchor), represents a linking point whose function is to connect the text or images at the current location to pages, text and images at other locations. The features of a link can be characterized by its anchor, its URL and the text in the proximity of the URL. The possible good links are those that may lead, in one or multiple steps, from related pages to target pages containing domain-specific WDBs forms. To extract possible good links efficiently, the following characteristics have been observed and investigated.

Firstly, 281 websites were crawled exhaustively, with the first 10 levels of each website crawled. The statistics of the levels at which forms were found indicate that none of the 129 forms lies deeper than level 5, and 94 % of WDBs forms appear within a depth of 3 levels (shown in Fig. 5) (Chang et al. 2004; He et al. 2007).

Secondly, after observing and analyzing a large number of Web pages possibly containing WDBs forms, two conclusions have been drawn: (1) all URLs of possible target pages use the http (Hypertext Transfer Protocol) or https (Hypertext Transfer Protocol over Secure Socket Layer) protocols; (2) all possible target pages are html files, i.e., the suffixes of their URLs are .html. Through analysis and summary of the suffixes of various types of Web resources, a string set S of non-html file suffixes has been introduced in this paper, i.e., S = {.bin, .oda, .pdf, .ai, .eps, .ps, .rtf, .mif, .csh, .dvi, .hdf, .latex, .nc, .cdf, .sh, .tcl, .texi, .tr, .roff, .man, .me, .ms, .src, .zip, .bcpio, .cpio, .gtar, .shar, .sv4crc, .sv4cpio, .tar, .ustar, .au, .snd, .aif, .wav, .gif, .ief, .jpg, .jpe, .tif, .ras, .pnm, .pbm, .pgm, .ppm, .rgb, .xbm, .xpm, .xwd, .rtx, .tsv, .etx, .mpe, .mpg, .qt, .mov, .avi, .java, .arj, .exe, .mp3, .mid, .icojpg, .idc, .gz, .z, .lib, .dll, .ram, .doc, .rm, .css, .c, .h, .cpp, .hpp, .cxx, .hxx, .inc, .asm, .jav, .bat, .cmd, .ini, .def, .mak, .rc, .sed, .em, .zip, .reg, .ico, .ppt, .lon, .ra, .wma, .asf, .bmp, .rar}.

Therefore, the Link Classifier of the E-FFC has been designed to extract only the links within the 3-level depth of a relevant WDBs website, the links using the http or https protocols, and the links whose suffix is not an element of the set S.
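These rules translate directly into a small filter. The sketch below uses only an excerpt of the suffix set S, and the function name and depth bookkeeping are illustrative assumptions.

```python
from urllib.parse import urlparse

# Excerpt of the non-HTML suffix set S excluded by the Link Classifier.
NON_HTML_SUFFIXES = {".pdf", ".zip", ".jpg", ".gif", ".exe", ".mp3", ".doc",
                     ".css", ".avi", ".rar", ".bmp", ".ppt"}

MAX_DEPTH = 3  # 94 % of WDBs forms lie within the first three levels

def is_candidate_link(url: str, depth: int) -> bool:
    """Keep only links within the 3-level depth, using http/https,
    whose suffix is not in the set S."""
    if depth > MAX_DEPTH:
        return False
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    path = parsed.path.lower()
    return not any(path.endswith(suffix) for suffix in NON_HTML_SUFFIXES)
```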

Fig. 5  The distribution of WDBs forms over each level of depth (proportion of WDBs forms vs. depth of the Web pages)

4.2.2 Constructing a proximate target-page backlinks diagram

To capture good links more efficiently, the features of the paths to target pages containing WDBs forms should be learned. Usually, these learning samples can be obtained from the connectivity diagrams pointing to the target pages of a series of representative WDBs websites. However, constructing the connectivity diagrams would require the Form Crawler to crawl all the links in the websites exhaustively. Since exhaustively crawling thousands of websites would be very time-consuming and impractical, this research instead constructs a backlinks diagram pointing to a target page using the free link-lookup service offered by Yahoo, which yields a practically proximate connectivity diagram pointing to a target page, as shown in Fig. 6.

Fig. 6  Proximate backlinks diagram pointing to a target page

4.2.3 Learning good link and path features and constructing the link feature vector

The elements of a link, namely its anchor, its URL and the text in the proximity of the URL, contain not only the domain features but also the features of pages containing WDBs forms.

For example, the links in a page containing WDBs forms often include the strings "advanced" and/or "search". Therefore, the link features of the E-FFC are learned and extracted from the elements of good links.

The precision of a Link Classifier depends heavily on the constructed link feature vector. To learn the features of the backlinks at the different levels of a target page, as shown in Fig. 6, a backlinks diagram is constructed for each of the good target pages in the Table 1 dataset. Meanwhile, in each domain, the link feature vectors at all levels of a target page are calculated. Because of the huge number of feature words extracted from the texts of the URLs and links, and the low word frequency of most of them, these feature words are pre-processed in this paper, i.e., stop-words are deleted, the remaining words are converted to their roots, and the words with higher numbers of appearances are selected as link feature words, where an appearance frequency threshold of 10 was obtained through a comprehensive experimental study. Owing to the paper length restriction, Table 3 only shows the first 10 feature words and their word frequencies for the first three levels' link feature vectors in the Automobiles and Jobs domains.

Table 3  Excerpt of feature words and their frequencies of the different-level link vectors for the Automobiles and Jobs domains
Automobiles, Level 3 (414 links): search 100, car 82, carpric 37, inventory 33, national 25, lead 24, external 21, motor 21, advance 21, exchangeandmart 18
Automobiles, Level 2 (384 links): car 103, autoseek 88, forum 78, price 45, deal 42, fee 41, inventory 34, omega 34, warranty 31, finance 30
Automobiles, Level 1 (218 links): deal 175, vehicle 154, autoseek 115, car 111, price 95, vehicle 77, face 59, warrantyweek 42, archive 41, motor 27
Jobs, Level 3 (775 links): job 139, search 66, jobsearch 42, employment 34, career 29, healthopp 25, seek 22, careerexchange 16, hotjob 12, advance 12
Jobs, Level 2 (397 links): job 341, edu 96, search 80, interbiznet 55, swap 44, care 42, marist 38, onmouseout 34, jobsearch 32, nofollow 31
Jobs, Level 1 (197 links): job 381, search 222, interbiznet 119, swap 110, youthntouch 88, teach 61, need 60, site 60, jillhackett 58, onmouseout 57

Table 3 shows that: (1) the feature words of a link are clearly associated with the domain as well as with WDBs forms; for instance, the words "car" and "job" are related to the Automobiles and Jobs domains, respectively, whereas "search" and "advanced" are related to WDBs forms. A similar phenomenon has been observed in other domains; e.g., in the Books domain, the feature words in the anchor also include

"search", "book" and "booktext"; (2) as a link gets farther away from the target page, the frequencies of the clearly related words tend to decrease. For example, in the Jobs domain the frequency of the word "job" is 381 at Level 1 and drops to 341 and 139 at Levels 2 and 3, respectively.

Because the frequency of a feature word in a link has a significant impact on calculating the distance between the current link and a target page, the inverse document frequency factor of (3) is not applied when calculating the weight of a feature word in a link. Instead, the term-frequency weighting in (5) is used in this paper:

$$w_i = \frac{tf(t_i, link)}{\sqrt{\sum_{t_i \in link} \left[ tf(t_i, link) \right]^2}} \qquad (5)$$

where $w_i$ is the weight of the word $t_i$ in the link, $tf(t_i, link)$ is the frequency of $t_i$ in the link, and the denominator is a normalization factor.

4.2.4 A new link scoring strategy

For a clear description, two commonly used terms in this paper are defined first:

Definition 1  Immediate benefit link: a link that directly points to a target page containing WDBs searchable forms; such links correspond to the Level 1 links in the backlinks diagram.

Definition 2  Promising link/delayed benefit link: a link that is likely to reach a target page in one or multiple steps; such links are the Level 2 and/or Level 3 links in the backlinks diagram.

Since the vast majority of WDBs forms are sparsely distributed over the first three levels of Web pages, the E-FFC Link Classifier should identify and select the above two types of links (i.e., immediate benefit links and delayed benefit links) efficiently, missing as few promising links/delayed benefit links as possible. To this end, the link scoring strategy of the E-FFC Link Classifier is designed as follows.

Firstly, given a link in a domain-specific or relevant page, the E-FFC Link Classifier assigns the link an initial score of 1, 2, or 3, corresponding to the distance between the link and the target page that is reachable from it. Assume that the feature vectors of the three levels of input backlinks of a target page are $V_{Level1}$, $V_{Level2}$ and $V_{Level3}$, respectively, and that, for the given link, the similarities between the link feature vector and $V_{Level1}$, $V_{Level2}$, $V_{Level3}$ are $S_1$, $S_2$ and $S_3$, respectively. The subscript of the maximum similarity $S_{max}$ in (6) indicates the distance between the given link and the target page:

$$S_{max} = \max_{i=1,\ldots,3} \{ S_i \} \qquad (6)$$

Secondly, the analytical and experimental results obtained in this research show that the score of the given link also depends on the similarity between the page that the link belongs to and the target page. Therefore, a new link scoring equation is proposed:

$$Score(link) = \lambda S_{max} + \mu\, Sim(V_{link\_page}, V_{target\_page}) \qquad (7)$$

where Score(link) is the final score of the given link, and $Sim(V_{link\_page}, V_{target\_page})$, calculated by (4), is the similarity between the feature vector $V_{link\_page}$ of the page that the given link belongs to and the feature vector $V_{target\_page}$ of the target page. A large number of experimental results show that the optimal values of the weight coefficients λ and μ are 0.7 and 0.3, respectively.

4.3 Searchable form classifier and domain-specific form classifier

The ultimate goal of the E-FFC is to identify all the domain-specific WDBs searchable forms on the Web. Although the aforementioned strategies focus the search on this goal, the E-FFC would still grab some non-searchable forms that resemble WDBs searchable forms. Meanwhile, due to the correlation among domains, the E-FFC would also grab WDBs forms from other domains related to the focused domain. Therefore, a generic (domain-independent) Searchable Form Classifier and a Domain-Specific Form Classifier are introduced in the E-FFC; they are used in sequence so as to automatically recognize and select only the focused domain-specific searchable forms in the gradually pruned search space.

4.3.1 Searchable and non-searchable forms

Forms can be classified into searchable and non-searchable forms. A searchable form is a WDB's entry point, through which the user can interact with the WDB: once a user submits a query through the form, the results are returned by the WDB. An example of a searchable form is shown in Fig. 7a.

Fig. 7  Examples of searchable and non-searchable forms

A non-searchable form, which is similar to a searchable form in outward appearance, is used only to submit information to the WDB rather than to query the WDB's information, e.g., forms

for login, registration, subscription, search engines, evaluation feedback, etc. Examples of non-searchable forms are shown in Fig. 7b.

4.3.2 A novel searchable form classifier

A large number of statistical experiments with different attributes of searchable and non-searchable forms conducted in this research show that the average numbers of selection lists, checkboxes and textboxes of searchable and non-searchable forms are quite different, as shown in Fig. 8: searchable forms have higher numbers of checkboxes and selection lists, whereas non-searchable forms have a higher number of textboxes. Therefore, a novel searchable-form identification tactic is proposed based on the learned features of searchable and non-searchable form controls: the numbers of checkboxes, selection lists, textboxes, etc.

A total of 180 WDBs forms (positive sample data) from the UIUC dataset (The UIUC Web 2003) and 156 non-WDBs forms (negative sample data) from Web pages over the 8 domains were manually extracted as learning samples, using Weka 3.6 as the experimental platform. The experiments were conducted using ten-fold cross-validation on the collected forms. Four types of classification algorithms, i.e., C4.5 decision trees, k-NN, SVM, and Naïve Bayes, were selected for the construction of the Form Classifier. The experimental results are shown in Table 4. Having the lowest test error rate (experimental results shown in Table 4), the C4.5 decision tree was chosen for the Searchable Form Classifier of the E-FFC to determine whether a form is searchable or not. For each form in the above sample set, the following features of the form controls are obtained: number of hidden tags; number of checkboxes; number of radio boxes; number of file inputs; number of submit tags; number of image inputs; number of buttons; number of resets; number of password tags; number of textboxes; number of selection lists; and submission method (post/get). The trained decision tree of the Searchable Form Classifier of the E-FFC is shown in Fig. 9.

It is worth pointing out that the best available Searchable Form Classifier, that of the FFC (Barbosa and Freire 2005), also uses a C4.5 decision tree to classify searchable and non-searchable forms, using 14 features of the form controls. In comparison, the Searchable Form Classifier of the E-FFC uses only 12 features of the form controls and obtains a 1.91 % accuracy improvement in identifying searchable forms.

Fig. 8  Comparison of the average numbers of checkboxes, selection lists and textboxes between searchable and non-searchable forms
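To illustrate the idea of classifying forms from control counts, the sketch below trains a decision tree on the twelve features listed above. The paper uses Weka's C4.5; scikit-learn's DecisionTreeClassifier (a CART-style tree) is used here only as a readily available stand-in, and the toy training rows are invented purely for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# The 12 form-control features of Section 4.3.2 (order chosen arbitrarily here).
FEATURES = ["hidden", "checkbox", "radio", "file", "submit", "image",
            "button", "reset", "password", "textbox", "select", "method_post"]

# Toy samples (invented): 1 = searchable form, 0 = non-searchable form.
X = [
    [1, 2, 0, 0, 1, 0, 0, 0, 0, 1, 3, 0],   # several checkboxes/selects, no password
    [0, 0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 1],   # password field and textboxes, e.g. a login form
]
y = [1, 0]

clf = DecisionTreeClassifier().fit(X, y)

# Classify a new (also invented) form described by its control counts.
new_form = [[0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 2, 0]]
print(clf.predict(new_form))
```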

Table 4  Experimental results of the four types of classification algorithms (confusion matrix on positive/negative data and accuracy in %, for the C4.5 decision tree, SVM and Naïve Bayes classifiers among others)

4.3.3 Domain-specific form classifier

A previous study has shown that different domains are not isolated but related; for example, Movies and Music Records are heavily linked, and Airfares, Hotels, and Car Rentals form another related cluster (Chang et al. 2004; He et al. 2007). Although the E-FFC can identify searchable forms with its Searchable Form Classifier, some searchable forms belonging to other domains may also be grabbed by the E-FFC because of this correlation between domains. Therefore, the main goal of the Domain-Specific Form Classifier of the E-FFC is to filter out the forms belonging to other domains.

From the above discussion, it is known that searchable forms are classified into simple forms and advanced/complex forms, as shown in Fig. 1.

Fig. 9  The trained decision tree of the Searchable Form Classifier of the E-FFC

For the two types of form, different classification methods should be adopted (He et al. 2004a, b; Barbosa et al. 2007). For simple forms, whose attributes provide little or no semantic information, a post-query method should be used to identify the form's domain: e.g., by submitting domain keywords to the attributes and judging the returned query results, the domain can be determined. If the form's domain is not the focused domain, the form is dropped; otherwise, the form is added to the domain-specific form database of the E-FFC if it is not already there. Complex forms carry a great deal of semantic information in their attribute labels, so a pre-query method is used to identify and filter out the forms of non-focused domains. The domain-judgment method for advanced/complex forms is discussed briefly below.

Observation and analysis (Wu et al. 2004; Barbosa et al. 2007; He and Chang 2003) show that, after pre-processing such as steps 2 to 6 of the algorithm shown in Fig. 3 and the normalization of synonymous attribute names in a specific domain, the attribute names (in short, attributes) from searchable forms cover almost the entire attribute set of all WDBs searchable forms in the domain, and certain domain attributes are always contained in it. For example, in the Books domain a searchable form typically contains domain attributes such as ISBN, author, and publisher; in the Airfares domain, each searchable form typically contains departure time, place of departure, cabin class, and number of passengers. Since these domain attributes uniquely characterize the domain, they are called domain anchor attributes in this paper.

For further accurate classification, a domain ontology is introduced, i.e., a tree that expresses the concepts of a given domain and the relationships between those concepts. In this tree, every node is a concept in the domain (expressed by words/phrases). A parent-child relationship between nodes expresses an inheritance/containment relationship between concepts in the domain, and a sibling relationship indicates sub-concepts of the same parent node in the domain. With a constructed domain ontology tree, synonyms among different attribute names across forms can be resolved to obtain a more precise domain identification and classification of forms. Owing to the fast convergence and the Zipf-distribution characteristics of the attributes of domain-specific searchable forms (Chang et al. 2004; He et al. 2007), a domain ontology tree and its anchor attribute set, including the representative domain attributes, have been built semi-automatically or manually according to the dataset in Table 1. Figure 10 shows the constructed ontology trees and their corresponding anchor attributes for the Books and Airfares domains. Note: the bold underlined attributes are the anchor attributes in the domain ontology tree, and the strings in square brackets correspond to synonyms of the attributes.

This paper uses (8) to determine the domain that an advanced/complex form belongs to:

$$D_i = Cov_{ij} \qquad (i = 1, \ldots, 8;\ j = 1, \ldots, N) \qquad (8)$$

where $D_i$ is the score of the i-th domain for the j-th form, and $Cov_{ij}$ is the coverage rate of the anchor attribute set of the i-th domain by the attributes contained in the j-th form. Given a form, each $D_i$ (i = 1, 2, ..., 8) is calculated by (8) in turn, and the form's domain is taken to be the domain with the maximum $D_i$.
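A small sketch of the anchor-attribute coverage scoring of (8). The anchor sets below are abbreviated, hypothetical examples based on the attributes mentioned in the text, and synonym normalization is assumed to have been applied already.

```python
# Abbreviated, illustrative anchor-attribute sets (not the paper's full ontology).
ANCHOR_ATTRIBUTES = {
    "Books": {"isbn", "author", "publisher", "title"},
    "Airfares": {"departure time", "place of departure", "cabin class",
                 "number of passengers"},
}

def domain_scores(form_attributes):
    """form_attributes: set of attribute names after synonym normalization.
    Returns D_i for each domain: coverage of that domain's anchor set."""
    attrs = {a.lower() for a in form_attributes}
    return {domain: len(attrs & anchors) / len(anchors)
            for domain, anchors in ANCHOR_ATTRIBUTES.items()}

def classify_form_domain(form_attributes):
    scores = domain_scores(form_attributes)
    return max(scores, key=scores.get)   # domain with the maximum D_i

print(classify_form_domain({"Author", "Title", "ISBN", "Keyword"}))  # -> Books
```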

Fig. 10  Ontology trees and the corresponding anchor attributes in the Books and Airfares domains

The experimental results, based on the constructed domain ontology trees, their anchor attribute sets and (8), show that the Domain-Specific Form Classifier of the E-FFC achieves very high accuracy, with an average precision of 94.6 % and an average recall of 95.4 %.

4.4 Link queue manager

To achieve its ultimate goal, the E-FFC, taking a site-based view of the Web, adopts two types of link queues, i.e., a single-level site-link queue and multi-level insite-link priority queues. The single-level site-link queue is used to store the root page links of every website to be searched by the E-FFC, in which all links are ranked in the order of their insertion. In order to locate target pages rapidly, make full use of the crawling stopping criteria and exit from the currently crawled website more efficiently, the E-FFC employs multi-level insite-link priority queues, in which links are placed into different level queues according to the distance between the current link and the target page. If the distance between a link and the target page is i, the link is put into the i-th queue of the multi-level insite-link priority queues. The links in each queue are then sorted by their scores computed with (7). The multi-level insite-link priority queues are shown in Fig. 11.
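A minimal sketch of such a Frontier Manager: a FIFO queue of site root links plus one score-ordered priority queue per level. The class and method names are illustrative, and the choice to poll the level-1 queue first is an assumption made for this sketch rather than a detail stated in the paper.

```python
import heapq
from collections import deque

class FrontierManager:
    def __init__(self, levels: int = 3):
        self.site_queue = deque()                         # root links, FIFO order
        self.insite_queues = [[] for _ in range(levels)]  # one heap per level

    def add_site(self, root_link):
        self.site_queue.append(root_link)

    def enqueue(self, link, level: int, score: float):
        # Level i (distance to the target page) selects the i-th queue;
        # higher score of (7) means higher priority within the queue.
        heapq.heappush(self.insite_queues[level - 1], (-score, link))

    def next_insite_link(self):
        # Assumption: prefer the queue closest to the target page (level 1 first).
        for queue in self.insite_queues:
            if queue:
                return heapq.heappop(queue)[1]
        return None

# usage sketch
frontier = FrontierManager()
frontier.enqueue("http://example.com/search", level=1, score=0.82)
frontier.enqueue("http://example.com/catalog", level=2, score=0.64)
print(frontier.next_insite_link())   # -> http://example.com/search
```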

4.5 Seed link selection, stopping criteria and breakpoint preservation

4.5.1 Selection of seed links

The harvest rate H and coverage rate C of the E-FFC are directly related to the selection of seed links. Jamali et al. used hub links and seed links provided by Yahoo for their Form Crawler crawling experiments (Jamali et al. 2006); the results showed that a higher harvest rate is obtained when using the seed links. Accordingly, the research in this paper used "Automobiles" and "Jobs" as keywords in Google, and the top query results returned by Google were used as the E-FFC crawling seed links for the crawling experiments in the Automobiles and Jobs domains, respectively.

4.5.2 New crawling stopping criteria

Due to the sparse distribution of domain-specific WDBs forms, an appropriate design of the crawler's stopping criteria is essential for achieving the optimization objectives of the E-FFC. Through a theoretical study and a large number of experiments, this paper presents a new set of crawling stopping criteria for the E-FFC:

(1) If the E-FFC has found a certain number of forms on the website currently being crawled, e.g., greater than or equal to 4, it quits the website immediately, since there are on average 4.2 forms per website (Chang et al. 2004; He et al. 2007);

(2) If the depth of the pages visited by the E-FFC in a website, measured from its root page, has exceeded 5 levels (because none of the WDBs forms has a depth deeper than 5, and 94 % of WDBs forms appear within a 3-level depth (Chang et al. 2004; He et al. 2007)), the E-FFC exits the website in order to avoid wasting time on it;

Fig. 11  The diagram of the multi-level insite-link priority queues in the E-FFC
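A minimal sketch of the first two stopping criteria, with thresholds taken from the statistics quoted above (about 4.2 forms per site on average; no form deeper than level 5). The function and constant names are illustrative.

```python
MAX_FORMS_PER_SITE = 4   # criterion (1): average of 4.2 forms per website
MAX_DEPTH_PER_SITE = 5   # criterion (2): no WDBs form lies deeper than level 5

def should_leave_site(forms_found_on_site: int, current_depth: int) -> bool:
    """Return True when the E-FFC should quit the website currently being crawled."""
    if forms_found_on_site >= MAX_FORMS_PER_SITE:   # criterion (1)
        return True
    if current_depth > MAX_DEPTH_PER_SITE:          # criterion (2)
        return True
    return False
```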


More information

Competitive Intelligence and Web Mining:

Competitive Intelligence and Web Mining: Competitive Intelligence and Web Mining: Domain Specific Web Spiders American University in Cairo (AUC) CSCE 590: Seminar1 Report Dr. Ahmed Rafea 2 P age Khalid Magdy Salama 3 P age Table of Contents Introduction

More information

Discovering Advertisement Links by Using URL Text

Discovering Advertisement Links by Using URL Text 017 3rd International Conference on Computational Systems and Communications (ICCSC 017) Discovering Advertisement Links by Using URL Text Jing-Shan Xu1, a, Peng Chang, b,* and Yong-Zheng Zhang, c 1 School

More information

Focused crawling: a new approach to topic-specific Web resource discovery. Authors

Focused crawling: a new approach to topic-specific Web resource discovery. Authors Focused crawling: a new approach to topic-specific Web resource discovery Authors Soumen Chakrabarti Martin van den Berg Byron Dom Presented By: Mohamed Ali Soliman m2ali@cs.uwaterloo.ca Outline Why Focused

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Automatically Constructing a Directory of Molecular Biology Databases

Automatically Constructing a Directory of Molecular Biology Databases Automatically Constructing a Directory of Molecular Biology Databases Luciano Barbosa, Sumit Tandon, and Juliana Freire School of Computing, University of Utah Abstract. There has been an explosion in

More information

A Two-stage Crawler for Efficiently Harvesting Deep-Web Interfaces

A Two-stage Crawler for Efficiently Harvesting Deep-Web Interfaces A Two-stage Crawler for Efficiently Harvesting Deep-Web Interfaces Md. Nazeem Ahmed MTech(CSE) SLC s Institute of Engineering and Technology Adavelli ramesh Mtech Assoc. Prof Dep. of computer Science SLC

More information

An Efficient Method for Deep Web Crawler based on Accuracy

An Efficient Method for Deep Web Crawler based on Accuracy An Efficient Method for Deep Web Crawler based on Accuracy Pranali Zade 1, Dr. S.W Mohod 2 Master of Technology, Dept. of Computer Science and Engg, Bapurao Deshmukh College of Engg,Wardha 1 pranalizade1234@gmail.com

More information

Combining Classifiers to Identify Online Databases

Combining Classifiers to Identify Online Databases Combining Classifiers to Identify Online Databases Luciano Barbosa School of Computing University of Utah lbarbosa@cs.utah.edu Juliana Freire School of Computing University of Utah juliana@cs.utah.edu

More information

Yunfeng Zhang 1, Huan Wang 2, Jie Zhu 1 1 Computer Science & Engineering Department, North China Institute of Aerospace

Yunfeng Zhang 1, Huan Wang 2, Jie Zhu 1 1 Computer Science & Engineering Department, North China Institute of Aerospace [Type text] [Type text] [Type text] ISSN : 0974-7435 Volume 10 Issue 20 BioTechnology 2014 An Indian Journal FULL PAPER BTAIJ, 10(20), 2014 [12526-12531] Exploration on the data mining system construction

More information

Design and Implementation of Agricultural Information Resources Vertical Search Engine Based on Nutch

Design and Implementation of Agricultural Information Resources Vertical Search Engine Based on Nutch 619 A publication of CHEMICAL ENGINEERING TRANSACTIONS VOL. 51, 2016 Guest Editors: Tichun Wang, Hongyang Zhang, Lei Tian Copyright 2016, AIDIC Servizi S.r.l., ISBN 978-88-95608-43-3; ISSN 2283-9216 The

More information

Advanced Crawling Techniques. Outline. Web Crawler. Chapter 6. Selective Crawling Focused Crawling Distributed Crawling Web Dynamics

Advanced Crawling Techniques. Outline. Web Crawler. Chapter 6. Selective Crawling Focused Crawling Distributed Crawling Web Dynamics Chapter 6 Advanced Crawling Techniques Outline Selective Crawling Focused Crawling Distributed Crawling Web Dynamics Web Crawler Program that autonomously navigates the web and downloads documents For

More information

Tag-based Social Interest Discovery

Tag-based Social Interest Discovery Tag-based Social Interest Discovery Xin Li / Lei Guo / Yihong (Eric) Zhao Yahoo!Inc 2008 Presented by: Tuan Anh Le (aletuan@vub.ac.be) 1 Outline Introduction Data set collection & Pre-processing Architecture

More information

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Proximity Prestige using Incremental Iteration in Page Rank Algorithm Indian Journal of Science and Technology, Vol 9(48), DOI: 10.17485/ijst/2016/v9i48/107962, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Proximity Prestige using Incremental Iteration

More information

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,

More information

Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm

Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm K.Parimala, Assistant Professor, MCA Department, NMS.S.Vellaichamy Nadar College, Madurai, Dr.V.Palanisamy,

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

Web Crawling As Nonlinear Dynamics

Web Crawling As Nonlinear Dynamics Progress in Nonlinear Dynamics and Chaos Vol. 1, 2013, 1-7 ISSN: 2321 9238 (online) Published on 28 April 2013 www.researchmathsci.org Progress in Web Crawling As Nonlinear Dynamics Chaitanya Raveendra

More information

Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky

Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky The Chinese University of Hong Kong Abstract Husky is a distributed computing system, achieving outstanding

More information

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra Pattern Recall Analysis of the Hopfield Neural Network with a Genetic Algorithm Susmita Mohapatra Department of Computer Science, Utkal University, India Abstract: This paper is focused on the implementation

More information

Web Data mining-a Research area in Web usage mining

Web Data mining-a Research area in Web usage mining IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 13, Issue 1 (Jul. - Aug. 2013), PP 22-26 Web Data mining-a Research area in Web usage mining 1 V.S.Thiyagarajan,

More information

Application of rough ensemble classifier to web services categorization and focused crawling

Application of rough ensemble classifier to web services categorization and focused crawling With the expected growth of the number of Web services available on the web, the need for mechanisms that enable the automatic categorization to organize this vast amount of data, becomes important. A

More information

A New Model of Search Engine based on Cloud Computing

A New Model of Search Engine based on Cloud Computing A New Model of Search Engine based on Cloud Computing DING Jian-li 1,2, YANG Bo 1 1. College of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China 2. Tianjin Key

More information

Domain Specific Search Engine for Students

Domain Specific Search Engine for Students Domain Specific Search Engine for Students Domain Specific Search Engine for Students Wai Yuen Tang The Department of Computer Science City University of Hong Kong, Hong Kong wytang@cs.cityu.edu.hk Lam

More information

Ontology Based Searching For Optimization Used As Advance Technology in Web Crawlers

Ontology Based Searching For Optimization Used As Advance Technology in Web Crawlers IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 19, Issue 6, Ver. II (Nov.- Dec. 2017), PP 68-75 www.iosrjournals.org Ontology Based Searching For Optimization

More information

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google, 1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to

More information

Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha

Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha Physics Institute,

More information

A Framework for adaptive focused web crawling and information retrieval using genetic algorithms

A Framework for adaptive focused web crawling and information retrieval using genetic algorithms A Framework for adaptive focused web crawling and information retrieval using genetic algorithms Kevin Sebastian Dept of Computer Science, BITS Pilani kevseb1993@gmail.com 1 Abstract The web is undeniably

More information

ResPubliQA 2010

ResPubliQA 2010 SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

More information

Research on Design and Application of Computer Database Quality Evaluation Model

Research on Design and Application of Computer Database Quality Evaluation Model Research on Design and Application of Computer Database Quality Evaluation Model Abstract Hong Li, Hui Ge Shihezi Radio and TV University, Shihezi 832000, China Computer data quality evaluation is the

More information

ASCERTAINING THE RELEVANCE MODEL OF A WEB SEARCH-ENGINE BIPIN SURESH

ASCERTAINING THE RELEVANCE MODEL OF A WEB SEARCH-ENGINE BIPIN SURESH ASCERTAINING THE RELEVANCE MODEL OF A WEB SEARCH-ENGINE BIPIN SURESH Abstract We analyze the factors contributing to the relevance of a web-page as computed by popular industry web search-engines. We also

More information

Content Discovery of Invisible Web

Content Discovery of Invisible Web 6 th International Conference on Applied Informatics Eger, Hungary, January 27 31, 2004. Content Discovery of Invisible Web Mária Princza, Katalin E. Rutkovszkyb University of Debrecen, Faculty of Technical

More information

DATA MINING II - 1DL460. Spring 2017

DATA MINING II - 1DL460. Spring 2017 DATA MINING II - 1DL460 Spring 2017 A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt17 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

Background. Problem Statement. Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web. Deep (hidden) Web

Background. Problem Statement. Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web. Deep (hidden) Web Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web K. C.-C. Chang, B. He, and Z. Zhang Presented by: M. Hossein Sheikh Attar 1 Background Deep (hidden) Web Searchable online

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

CHAPTER 5 OPTIMAL CLUSTER-BASED RETRIEVAL

CHAPTER 5 OPTIMAL CLUSTER-BASED RETRIEVAL 85 CHAPTER 5 OPTIMAL CLUSTER-BASED RETRIEVAL 5.1 INTRODUCTION Document clustering can be applied to improve the retrieval process. Fast and high quality document clustering algorithms play an important

More information

Research and Design of Key Technology of Vertical Search Engine for Educational Resources

Research and Design of Key Technology of Vertical Search Engine for Educational Resources 2017 International Conference on Arts and Design, Education and Social Sciences (ADESS 2017) ISBN: 978-1-60595-511-7 Research and Design of Key Technology of Vertical Search Engine for Educational Resources

More information

Analysis on the technology improvement of the library network information retrieval efficiency

Analysis on the technology improvement of the library network information retrieval efficiency Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2014, 6(6):2198-2202 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 Analysis on the technology improvement of the

More information

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit Page 1 of 14 Retrieving Information from the Web Database and Information Retrieval (IR) Systems both manage data! The data of an IR system is a collection of documents (or pages) User tasks: Browsing

More information

Searching the Web What is this Page Known for? Luis De Alba

Searching the Web What is this Page Known for? Luis De Alba Searching the Web What is this Page Known for? Luis De Alba ldealbar@cc.hut.fi Searching the Web Arasu, Cho, Garcia-Molina, Paepcke, Raghavan August, 2001. Stanford University Introduction People browse

More information

2015 Search Ranking Factors

2015 Search Ranking Factors 2015 Search Ranking Factors Introduction Abstract Technical User Experience Content Social Signals Backlinks Big Picture Takeaway 2 2015 Search Ranking Factors Here, at ZED Digital, our primary concern

More information

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information

More information

Accessing the Deep Web: A Survey

Accessing the Deep Web: A Survey Accessing the Deep Web: A Survey Bin He, Mitesh Patel, Zhen Zhang, Kevin Chen-Chuan Chang Computer Science Department University of Illinois at Urbana-Champaign {binhe,mppatel2,zhang2,kcchang}@uiuc.edu

More information

Enhancing K-means Clustering Algorithm with Improved Initial Center

Enhancing K-means Clustering Algorithm with Improved Initial Center Enhancing K-means Clustering Algorithm with Improved Initial Center Madhu Yedla #1, Srinivasa Rao Pathakota #2, T M Srinivasa #3 # Department of Computer Science and Engineering, National Institute of

More information

Inferring User Search for Feedback Sessions

Inferring User Search for Feedback Sessions Inferring User Search for Feedback Sessions Sharayu Kakade 1, Prof. Ranjana Barde 2 PG Student, Department of Computer Science, MIT Academy of Engineering, Pune, MH, India 1 Assistant Professor, Department

More information

Web Usage Mining: A Research Area in Web Mining

Web Usage Mining: A Research Area in Web Mining Web Usage Mining: A Research Area in Web Mining Rajni Pamnani, Pramila Chawan Department of computer technology, VJTI University, Mumbai Abstract Web usage mining is a main research area in Web mining

More information

MURDOCH RESEARCH REPOSITORY

MURDOCH RESEARCH REPOSITORY MURDOCH RESEARCH REPOSITORY http://researchrepository.murdoch.edu.au/ This is the author s final version of the work, as accepted for publication following peer review but without the publisher s layout

More information

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system. Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.

More information

An Approach To Web Content Mining

An Approach To Web Content Mining An Approach To Web Content Mining Nita Patil, Chhaya Das, Shreya Patanakar, Kshitija Pol Department of Computer Engg. Datta Meghe College of Engineering, Airoli, Navi Mumbai Abstract-With the research

More information

Design and Implementation of Search Engine Using Vector Space Model for Personalized Search

Design and Implementation of Search Engine Using Vector Space Model for Personalized Search Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 1, January 2014,

More information

INFORMATION RETRIEVAL SYSTEM: CONCEPT AND SCOPE

INFORMATION RETRIEVAL SYSTEM: CONCEPT AND SCOPE 15 : CONCEPT AND SCOPE 15.1 INTRODUCTION Information is communicated or received knowledge concerning a particular fact or circumstance. Retrieval refers to searching through stored information to find

More information

A Structure-Driven Yield-Aware Web Form Crawler: Building a Database of Online Databases

A Structure-Driven Yield-Aware Web Form Crawler: Building a Database of Online Databases UIUC Technical Report: UIUCDCS-R-6-7, UILU-ENG-6-79. July 6 A Structure-Driven Yield-Aware Web Form Crawler: Building a Database of Online Databases Bin He, Chengkai Li, David Killian, Mitesh Patel, Yuping

More information

Crawling the Hidden Web Resources: A Review

Crawling the Hidden Web Resources: A Review Rosy Madaan 1, Ashutosh Dixit 2 and A.K. Sharma 2 Abstract An ever-increasing amount of information on the Web today is available only through search interfaces. The users have to type in a set of keywords

More information

Keywords Data alignment, Data annotation, Web database, Search Result Record

Keywords Data alignment, Data annotation, Web database, Search Result Record Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Annotating Web

More information

Intelligent Web Crawler: A Three-Stage Crawler for Effective Deep Web Mining

Intelligent Web Crawler: A Three-Stage Crawler for Effective Deep Web Mining Intelligent Web Crawler: A Three-Stage Crawler for Effective Deep Web Mining Jeny Thankachan 1, Mr. S. Nagaraj 2 1 Department of Computer Science,Selvam College of Technology Namakkal, Tamilnadu, India

More information

INTRODUCTION. Chapter GENERAL

INTRODUCTION. Chapter GENERAL Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which

More information

Journal of Chemical and Pharmaceutical Research, 2014, 6(5): Research Article

Journal of Chemical and Pharmaceutical Research, 2014, 6(5): Research Article Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2014, 6(5):2057-2063 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 Research of a professional search engine system

More information

Parallel Crawlers. 1 Introduction. Junghoo Cho, Hector Garcia-Molina Stanford University {cho,

Parallel Crawlers. 1 Introduction. Junghoo Cho, Hector Garcia-Molina Stanford University {cho, Parallel Crawlers Junghoo Cho, Hector Garcia-Molina Stanford University {cho, hector}@cs.stanford.edu Abstract In this paper we study how we can design an effective parallel crawler. As the size of the

More information

Intelligent management of on-line video learning resources supported by Web-mining technology based on the practical application of VOD

Intelligent management of on-line video learning resources supported by Web-mining technology based on the practical application of VOD World Transactions on Engineering and Technology Education Vol.13, No.3, 2015 2015 WIETE Intelligent management of on-line video learning resources supported by Web-mining technology based on the practical

More information

A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2

A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2 A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2 1 Student, M.E., (Computer science and Engineering) in M.G University, India, 2 Associate Professor

More information

Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler

Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler Mukesh Kumar and Renu Vig University Institute of Engineering and Technology, Panjab University, Chandigarh,

More information

An Application of Genetic Algorithm for Auto-body Panel Die-design Case Library Based on Grid

An Application of Genetic Algorithm for Auto-body Panel Die-design Case Library Based on Grid An Application of Genetic Algorithm for Auto-body Panel Die-design Case Library Based on Grid Demin Wang 2, Hong Zhu 1, and Xin Liu 2 1 College of Computer Science and Technology, Jilin University, Changchun

More information

Keyword Extraction by KNN considering Similarity among Features

Keyword Extraction by KNN considering Similarity among Features 64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,

More information

Clustering Analysis based on Data Mining Applications Xuedong Fan

Clustering Analysis based on Data Mining Applications Xuedong Fan Applied Mechanics and Materials Online: 203-02-3 ISSN: 662-7482, Vols. 303-306, pp 026-029 doi:0.4028/www.scientific.net/amm.303-306.026 203 Trans Tech Publications, Switzerland Clustering Analysis based

More information

Semantic Website Clustering

Semantic Website Clustering Semantic Website Clustering I-Hsuan Yang, Yu-tsun Huang, Yen-Ling Huang 1. Abstract We propose a new approach to cluster the web pages. Utilizing an iterative reinforced algorithm, the model extracts semantic

More information

CS4624 Multimedia and Hypertext. Spring Focused Crawler. Department of Computer Science Virginia Tech Blacksburg, VA 24061

CS4624 Multimedia and Hypertext. Spring Focused Crawler. Department of Computer Science Virginia Tech Blacksburg, VA 24061 CS4624 Multimedia and Hypertext Spring 2013 Focused Crawler WIL COLLINS WILL DICKERSON CLIENT: MOHAMED MAGBY AND CTRNET Department of Computer Science Virginia Tech Blacksburg, VA 24061 Date: 5/1/2013

More information

Limitations of XPath & XQuery in an Environment with Diverse Schemes

Limitations of XPath & XQuery in an Environment with Diverse Schemes Exploiting Structure, Annotation, and Ontological Knowledge for Automatic Classification of XML-Data Martin Theobald, Ralf Schenkel, and Gerhard Weikum Saarland University Saarbrücken, Germany 23.06.2003

More information

ihits: Extending HITS for Personal Interests Profiling

ihits: Extending HITS for Personal Interests Profiling ihits: Extending HITS for Personal Interests Profiling Ziming Zhuang School of Information Sciences and Technology The Pennsylvania State University zzhuang@ist.psu.edu Abstract Ever since the boom of

More information

Automatic Query Type Identification Based on Click Through Information

Automatic Query Type Identification Based on Click Through Information Automatic Query Type Identification Based on Click Through Information Yiqun Liu 1,MinZhang 1,LiyunRu 2, and Shaoping Ma 1 1 State Key Lab of Intelligent Tech. & Sys., Tsinghua University, Beijing, China

More information

IJMIE Volume 2, Issue 3 ISSN:

IJMIE Volume 2, Issue 3 ISSN: Deep web Data Integration Approach Based on Schema and Attributes Extraction of Query Interfaces Mr. Gopalkrushna Patel* Anand Singh Rajawat** Mr. Satyendra Vyas*** Abstract: The deep web is becoming a

More information