CRAWLING THE CLIENT-SIDE HIDDEN WEB
Manuel Álvarez, Alberto Pan, Juan Raposo, Ángel Viña
Department of Information and Communications Technologies
University of A Coruña, A Coruña, Spain
{mad,apan,jrs,avc}@udc.es

ABSTRACT

There is a great amount of information on the web that cannot be accessed by conventional crawler engines. This portion of the web is usually called hidden web data. To deal with this problem, it is necessary to solve two tasks: crawling the client-side and crawling the server-side hidden web. In this paper we present an architecture and a set of related techniques for accessing the information placed in the client-side hidden web, dealing with aspects such as JavaScript technology, non-standard session maintenance mechanisms, client redirections, pop-up menus, etc. Our approach leverages current browser APIs and implements novel crawling models and algorithms.

KEYWORDS

Web Crawler, Hidden Web, Client Side.

1. INTRODUCTION

The Hidden Web or Deep Web [Bergman01] is usually defined as the part of WWW documents that is dynamically generated. The problem of crawling the hidden web can be divided into two tasks: crawling the client-side and crawling the server-side hidden web. Client-side hidden web techniques are concerned with accessing content dynamically generated in the client web browser, while server-side techniques focus on accessing the valuable content hidden behind web search forms [Raghavan01] [Ipeirotis02]. This paper proposes novel techniques and algorithms for dealing with the first of these problems.

1.1 The case for client-side hidden web

Today's complex web pages use scripting languages intensively (mainly JavaScript), session maintenance mechanisms, complex redirections, etc. Developers use these client technologies to add interactivity to web pages as well as to improve site navigation.
This is done through interface elements such as pop-up menus, or by arranging content in layers that are shown or hidden depending on user actions. In addition, many sources use scripting languages, such as JavaScript, for a variety of internal purposes, including dynamically building HTTP requests for submitting forms, managing HTML layers and/or performing complex redirections. This situation is aggravated because most of the tools used for visually building web sites generate pages that use scripting code for content generation and/or for improving navigation.

1.2 The problem with conventional crawlers

Several problems make it difficult for traditional web crawling engines to obtain data from client-side hidden web pages. The most important ones are described in the following sub-sections.
1.2.1 Client-side scripting languages

Many HTML pages make intensive use of JavaScript and other client-side scripting languages (such as JScript or VBScript) for a variety of purposes, such as: a) Generating content at runtime (e.g. document.write methods in JavaScript). b) Dynamically generating navigations. Scripting code may appear, for instance, in the href attribute of an anchor, or it can be executed when some event of the page is fired (e.g. onclick or onmouseover for unfolding a pop-up menu when the user clicks or moves the mouse over a menu option). It is also possible for the scripting code to rewrite a URL, to open a new window or to generate several navigations (more than one URL from which to continue the crawling process). c) Automatically filling out a form in a page and then submitting it.

Successfully dealing with scripting languages requires that HTTP clients implement all the mechanisms that make it possible for a browser to render a page and to generate new navigations. It also involves following anchors and executing all the actions associated with the events they fire. Using a specific interpreter (e.g. Mozilla Rhino for JavaScript [Rhino]) does not solve these problems, since real-world scripts assume a set of browser-provided objects to be available in their execution environment. In addition, in some situations, such as multi-frame pages, it is not always easy to locate and extract the scripting code to be interpreted. That is why most crawlers built to date, including the ones used in the most popular web search engines, do not provide support for this kind of page.

Providing a convenient execution environment for executing scripts is not the only problem associated with client-side dynamism. When conventional crawlers reach a new page, they scan it for new anchors to traverse and add them to a master list of URLs to access. Scripting code complicates this situation because it may be used to dynamically generate or remove anchors in response to some events.
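A crawler must at least be able to tell static anchors apart from script-controlled ones before deciding how to process them. A minimal sketch of such a check, using a hypothetical dictionary-of-attributes representation of an anchor (this is illustrative, not the paper's actual code):

```python
# Illustrative check, mirroring the distinction drawn above: an anchor is
# "static" only if its href holds a plain URL and no scripting code is
# wired to its events (onclick, onmouseover, ...).
SCRIPT_EVENT_ATTRS = ("onclick", "onmouseover", "onmousedown")

def is_static_anchor(anchor):
    """anchor: dict of HTML attribute name -> value (hypothetical model)."""
    href = anchor.get("href", "")
    if href.lower().startswith("javascript:"):
        return False
    return not any(attr in anchor for attr in SCRIPT_EVENT_ATTRS)

plain = {"href": "http://example.com/page2.html"}
scripted = {"href": "javascript:unfoldMenu(3)"}
eventful = {"href": "#", "onclick": "navigate('inbox')"}
```

Only the first anchor can be handled by reading its href; the other two require evaluating scripting code in a browser-like environment.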
For instance, many web pages use anchors to represent menus of options. When an anchor representing an option is clicked, some scripting code dynamically generates a list of new anchors representing sub-options. If the anchor is clicked again, the script code may fold the menu again, removing the anchors corresponding to the sub-options. A crawler dealing with the client-side deep web should be able to detect these situations and to obtain all the hidden anchors, adding them to the master URL list.

1.2.2 Session maintenance mechanisms

Many websites use session maintenance mechanisms based on client resources, like cookies or scripting code that adds session parameters to the URLs before sending them to the server. This provokes a number of problems:

- While most crawlers are able to deal with cookies, we have already stated that this is not the case with scripting languages.
- Another problem arises for distributed crawling. Conventional architectures for crawling are based on a shared master list of URLs from which crawling processes (possibly running on different machines) pick URLs and access them independently in a parallel manner. Nevertheless, with session-based sites, we need to ensure that each crawling process has all the session information it needs available (such as cookies or the context for executing the scripting code). Otherwise, attempts to access the page will fail. Conventional crawlers do not deal with these situations.
- The problem of later access to the documents. Most web search engines work by indexing the pages retrieved by a web crawler. The crawled pages are usually not stored locally, but are indexed with their URLs. When at a later moment a user obtains the page as a result of a query against the index, he can access the page through its URL. Nevertheless, in a context where session maintenance issues exist, the URLs may not work when used at a later time.
For instance, the URL may include a session number that expires a few minutes after being created.

1.2.3 Redirections

Many websites use complex redirections that are not managed by conventional crawlers. For instance, some pages include JavaScript redirections executed after an onload page event (the client redirects after the page has been completely loaded):
<BODY onload="executeJavascriptRedirectionMethod()">

In these cases, the HTTP client has to analyze and interpret the page content to detect and correctly manage these types of redirections.

1.2.4 Applets and Flash code

Other types of client technology are applets and Flash code. They are executed on the client side, so the HTTP client has to implement a container component to process them. Although accessing the content shown by programs written in these languages is difficult due to their compiled nature, a web crawler should at least be able to deal with the common situation where these components are used as an introduction that finally redirects the user to a conventional page where the crawler can proceed.

1.2.5 Other issues

Issues such as frames, dynamic HTML or HTTPS accentuate the aforementioned problems. In general terms, it is very difficult to account for all the factors that make a website visible and navigable through a web browser.

1.3 Our approach

For all the reasons mentioned above, many designers of web sites avoid these practices in order to make sure their sites are on good terms with the crawlers. Nevertheless, this forces them either to increase the complexity of their systems by moving functionality to the server, or to reduce interactivity with the user. Neither situation is desirable: web site designers should think in terms of improving the interactivity and friendliness of their sites, not about how crawlers work.

This paper presents an architecture and a set of related techniques to solve the problems involved in crawling the client-side hidden web. Our system has already been successfully used in several real applications in the fields of corporate search and technology watch. The main features of our approach are the following.

Our crawling processes are not based on HTTP clients.
Instead, they are based on automated mini web browsers, built using standard browser APIs (our current implementation is based on the Microsoft Internet Explorer WebBrowser Control [MSIE]). These mini-browsers understand NSEQL (see section 2), a language for expressing navigation sequences as macros of actions on the interface of a web browser. This enables our system to execute scripting code, manage redirections, etc.

To deal with pop-up menus and other dynamic elements that can generate new anchors in the current page, it is necessary to implement special algorithms to manage the process of generating new routes to crawl from a web page (see section 3.4).

To solve the problem of session maintenance, our system uses the concept of route to a document, which can be seen as a generalization of the concept of URL. A route specifies a URL, a session object containing the needed session context for the URL, and a NSEQL program for accessing the document when the session used for crawling the document has expired.

The system also includes some functionality to access pages hidden behind forms. More precisely, the system is able to deal with authentication forms and with value-limited forms: those composed exclusively of fields whose possible values are restricted to a certain finite list (e.g. forms composed only of select, checkbox and radio button fields).
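Deciding whether a form is value-limited reduces to checking the types of its fields. A minimal sketch, assuming a hypothetical list-of-dicts representation of the form's fields (not the system's actual data model):

```python
# A form is "value-limited" when every field draws its value from a
# finite list: select, checkbox and radio button fields qualify, while
# free-text fields do not.
VALUE_LIMITED_TYPES = {"select", "checkbox", "radio"}

def is_value_limited(fields):
    """Return True if every field's possible values form a finite list."""
    return all(f["type"] in VALUE_LIMITED_TYPES for f in fields)

# A free-text search form is not value-limited; a select/radio form is.
search_form = [{"type": "text", "name": "q"}]
filter_form = [{"type": "select", "name": "country"},
               {"type": "radio", "name": "sort"}]
```

Only forms passing this check can be exhaustively submitted by enumerating field-value combinations, as described in section 3.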
2. INTRODUCTION TO NSEQL

NSEQL [Pan02] is a language to declaratively define sequences of events on the interface provided by a web browser. NSEQL allows one to easily express macros representing a sequence of user events on a browser. NSEQL works at the browser layer instead of at the HTTP layer. This lets us forget about problems such as successfully executing JavaScript or dealing with client redirections and session identifiers.

Navigate(...);
FindFormByName("login_form", 0);
SetInputValue("login", 0, loginvalue);
SetInputValue("passwd", 0, passwordvalue);
ClickOnElement(".save", "Input", 0);
ClickOnAnchorByText("Go to Inbox", 0, false);

Figure 1. NSEQL Program

Figure 1 shows an example of an NSEQL program, which executes the login process at YahooMail and navigates to the list of messages in the Inbox folder. The Navigate command makes the browser navigate to the given URL. Its effect is equivalent to that of a human user typing the URL in his/her browser address bar and pressing ENTER. The FindFormByName(name, position) command looks for the position-th HTML form in the page with the given name. Then, the SetInputValue(fieldName, position, value) commands are used to assign values to the form fields. The ClickOnElement(name, type, position) command clicks on the position-th element of the given type and name in the currently selected form. In this case, it is used to submit the form and load the result page. The ClickOnAnchorByText(text, position) command looks for the position-th anchor that encloses the given text and generates a browser click event on it. This causes the browser to navigate to the page pointed to by the anchor. Although not included here, NSEQL also includes commands to deal with frames, pop-up windows, etc.

3. THE CRAWLING ENGINE

As in conventional crawling engines, the basic idea consists in maintaining a shared list of routes (pointers to documents), which is accessed by a certain number of concurrent crawling processes, possibly distributed over several machines. The list is initialized with a set of initial routes. Then, each crawling process picks a route from the list, downloads its associated document and analyzes it to obtain new routes from its anchors, which are then added to the master list. The process ends when there are no routes left or when a specified depth level is reached.

The structure of this section is as follows. In section 3.1, we introduce the concept of route in our system, and how it enables us to deal with sessions. Section 3.2 provides some detail about the mini-browsers used as the basic crawling processes in the system, as well as the advantages they provide. Section 3.3 describes the architecture and basic functioning of the system. Finally, section 3.4 reviews the algorithm used for generating new routes from anchors and forms controlled by scripting code (e.g. JavaScript).

3.1 Dealing with sessions: Routes

In conventional crawlers, routes are just URLs. Thus, they suffer from the problems with session mechanisms that we have already mentioned in section 1.2.2. In our system, a route is composed of three elements:

- A URL pointing to a document. In the routes from the initial list, this element may also be a NSEQL program. This is useful to start the crawling in a document which is not directly accessible through a URL (for instance, this is usually the case with websites requiring authentication).
Figure 2. Crawler Architecture

- A session object containing all the required information (cookies, etc.) for restoring the execution environment that the crawling process had at the moment of adding the route to the master list.
- A NSEQL program representing the navigation sequence followed by the system to reach the document.

The second and third elements are automatically computed by the system for each route. The second element allows a crawling process to access a URL added by another crawling process (even if the original crawling process was running on another machine). The third element is used to access the document pointed to by the route when the session originally used to crawl the document has expired. This is useful when session expiration times are short and, as we will see later, to allow for later access to crawled documents.

3.2 Mini-browsers as crawling processes

Conventional engines implement crawling processes using HTTP clients. Instead, the crawling processes in our system are based on automated mini web browsers, built using standard browser APIs (our current implementation is based on the MSIE WebBrowser Control), which are able to execute NSEQL programs. This allows our system to:

- Access content dynamically generated through scripting languages (e.g. JavaScript document.write methods).
- Evaluate the scripting code associated with anchors and forms, so we can obtain the real URLs these elements point to.
- Deal with client redirections: after commanding a navigation, the mini-browser waits until all the navigation events of the current page have finished.
- Provide an execution environment for technologies such as Java applets and Flash code. Although the mini-browsers cannot access the content shown by these compiled components, they can deal with the common situation where these components are used as a graphical introduction which finally redirects the browser to a conventional web page.

3.3 System architecture / basic functioning

The architecture of the system is shown in Figure 2. When the crawler engine starts, it reads its configuration parameters from the Configuration Manager module. The metainformation for configuring the system includes a list of URLs and/or NSEQL navigation sequences for accessing the initial sites, the desired depth for each initial route, download handlers for different kinds of documents, content filters, a list of DNS domains to be included in and excluded from the crawling, and other metainformation not dealt with here. The next step consists in starting the URL Manager component with the list of initial sites for the crawling, as well as starting the pool of crawling processes.
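The configuration metainformation just listed might be laid out as follows. The paper does not specify a concrete format, so both the structure and the field names below are hypothetical; only the kinds of parameters are taken from the text:

```python
# Hypothetical crawl configuration covering the parameters listed above:
# initial routes (URLs or NSEQL programs) with per-route depth, download
# handlers, content filters, and DNS domain include/exclude lists.
config = {
    "initial_routes": [
        {"url": "http://www.example.com/", "depth": 3},
        {"nseql": 'Navigate(...); ClickOnAnchorByText("Enter", 0, false);',
         "depth": 2},
    ],
    "download_handlers": {"text/html": "generic", "application/pdf": "pdf"},
    "content_filters": ["error", "generate_new_urls", "url"],
    "included_domains": ["example.com"],
    "excluded_domains": ["ads.example.com"],
}

def domain_allowed(domain, cfg):
    """Apply the include/exclude DNS domain lists to a candidate domain."""
    if any(domain == d or domain.endswith("." + d)
           for d in cfg["excluded_domains"]):
        return False
    return any(domain == d or domain.endswith("." + d)
               for d in cfg["included_domains"])
```

Exclusions take precedence over inclusions here; that ordering is our assumption, not something the paper states.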
The URL Manager is responsible for maintaining the master list of routes to be accessed, which all the crawlers share. As the crawling proceeds, the crawling processes add new routes to the list by analyzing the anchors and value-limited forms found in the crawled pages.

Once the crawling processes have been started, each one picks a route from the URL Manager. It is important to note that each crawling process can be executed either locally or remotely to the server, thus allowing for distributed crawling. As we have already remarked, each crawling process is a mini web browser able to execute NSEQL sequences. The crawling process then loads the session object associated with the route and downloads the associated document (using the Download Manager component to choose the right handler for the document). If the session has expired, the crawling process uses the NSEQL program to access the document again.

The content of each downloaded document is then analyzed using the Content Manager component. This component specifies a chain of filters to decide whether the document can be considered relevant and, therefore, whether it should be stored and/or indexed. For instance, the system includes filters which check whether the document satisfies a keyword-based boolean query with a minimum relevance, in order to decide whether to store/index it. Another chain of filters is used for post-processing the document. For instance, the system includes filters to extract relevant content from HTML pages or to generate a short document summary.

Finally, the system tries to obtain new routes from the analyzed documents and adds them to the master list. In a context where scripting languages can dynamically generate and remove anchors and forms, this involves some complexities; see section 3.4 for details. The system also includes a chain of filters to decide whether the new routes must be added to the master list or not.
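The basic functioning just described, together with the route structure of section 3.1, can be sketched as a loop. Everything below is a simplified illustration: the field and helper names (`fetch`, `extract_urls`, `route_filters`) are hypothetical stand-ins for the Download Manager, Content Manager and URL filter chain, and the real crawling processes are automated MSIE instances executing NSEQL, not Python callables:

```python
from collections import deque
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    """Generalization of a URL (section 3.1); field names are illustrative."""
    url: str
    session: tuple = ()   # cookies/context, computed by the system
    nseql: str = ""       # navigation sequence that reached the document

def crawl(initial_urls, fetch, extract_urls, route_filters, max_depth):
    """Pick a route, download its document, and add the new routes found,
    until the list is empty or the depth limit is reached."""
    master = deque((Route(url=u), 0) for u in initial_urls)
    seen = {r.url for r, _ in master}
    crawled = []
    while master:
        route, depth = master.popleft()
        document = fetch(route)      # mini-browser replays the NSEQL program
        crawled.append(document)     # if the stored session has expired
        if depth >= max_depth:
            continue
        for url in extract_urls(document):
            if url not in seen and all(f(url) for f in route_filters):
                seen.add(url)
                master.append((Route(url=url), depth + 1))
    return crawled

# Toy run over an in-memory two-level "site".
links = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
docs = crawl(["a"], fetch=lambda r: r.url,
             extract_urls=lambda d: links[d],
             route_filters=[lambda u: True], max_depth=1)
```

With `max_depth=1`, the toy run crawls "a" and its direct children "b" and "c", but does not expand "b" further.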
In the most usual configuration, while the maximum desired depth has not been reached, all the anchors of the documents generate new routes. Value-limited forms (those having only fields with a finite list of possible values, as commented in section 1.3) generate a new route for each possible combination of the values of their fields.

The architecture also includes components for indexing and searching the crawled contents, using state-of-the-art algorithms. The crawler generates an XML file for each crawled document, including metainformation such as its URL and the NSEQL sequence needed to access it. The NSEQL sequence is used by another component of the system architecture: the ActiveX component for automatic navigation. This component receives a NSEQL program as a parameter, downloads itself into the user's browser and makes it execute the given navigation sequence. In our system this is used to solve the problem of later access to documents (see section 1.2.2). When the user makes a search against the index and the list of answers contains results which cannot be accessed through their URLs due to session issues, the anchors associated with those results invoke the ActiveX component, passing the NSEQL sequence associated with the page as a parameter. Then, if the user clicks on the anchor, the ActiveX component makes the browser automatically navigate to the desired page.

3.4 Algorithm for generating new routes

This section describes the algorithm used in our system to generate the new routes to be crawled from a given HTML page. This algorithm deals with the difficulties associated with anchors and forms controlled by scripting languages.

In general, to get the new routes to be crawled from a given HTML document, it is necessary to analyze the page looking for anchors and value-limited forms. A new route is added for each anchor and for each combination of all the possible values of the fields of each value-limited form.
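Generating one route per combination of a value-limited form's fields is a Cartesian product over the finite value lists. A minimal sketch, assuming a GET form and a hypothetical name-to-values mapping for the fields (illustrative only):

```python
from itertools import product
from urllib.parse import urlencode

def routes_for_form(action, fields):
    """One candidate URL per combination of the finite field values.

    action: value of the form's action attribute.
    fields: dict mapping field name -> list of its possible values.
    """
    names = sorted(fields)
    return [action + "?" + urlencode(dict(zip(names, combo)))
            for combo in product(*(fields[n] for n in names))]

# A form with two select fields of two options each yields four routes.
urls = routes_for_form("http://example.com/search",
                       {"lang": ["en", "es"], "type": ["pdf", "html"]})
```

Real forms controlled by scripting code cannot be expanded this way from the action attribute alone; they require the event-simulation algorithm described next.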
The anchors and forms that are not controlled by scripting code can be dealt with as in conventional crawlers: for anchors, a new route is built from the value of the href attribute, while for static forms, the new routes for each combination of values can be routinely built by analyzing the action attribute of the form tag and the tags representing the form fields and their possible values. Nevertheless, if the HTML page contains client-side scripting technology, the situation is more complicated.

The main idea of the algorithm consists in generating click events on the anchors controlled by scripting languages in order to obtain valid URLs (note: we focus our discussion on the case of anchors; the treatment of value-limited forms is analogous), but there are several additional difficulties:

- Some anchors may appear or disappear from the page depending on the scripting code executed (e.g. pop-up menus).
- The script code associated with anchors must be evaluated in order to obtain valid URLs.
- One anchor can generate several navigations.
- In pages with several frames, it is possible for an anchor to generate new URLs in some frames and navigations in others.

Now we proceed to describe the algorithm. Remember that our crawling process is a mini-browser able to execute NSEQL programs. The browser can be in two states: in the navigation state the browser functions normally, and when it executes a click event on an anchor or submits a form, it performs the navigation and downloads the resulting page; in turn, in the simulation state the browser only captures the navigation events generated by the click or submit events, but it does not download the resource.

1. Let P be an HTML page that has been downloaded by the browser (navigation state).
2. The browser executes the script sections which are not associated with conditional events.
3. Let Ap be the set of all the anchors of the page, with the scripting code already interpreted.
4. For each ai ∈ Ap, remove ai from Ap and:
   a) If the href attribute of ai does not contain associated scripting code and ai has no onclick attribute (or, if the system is configured to do so, other attributes used to assign scripting code to specific events, such as onmouseover), the anchor ai is added to the master list of URLs.
   b) Otherwise, the browser changes to the simulation state and generates a click event on the anchor (and, if configured to do so, other relevant events such as mouseover):
      a. Some anchors, when clicked, can generate undesirable actions (e.g. a call to the javascript:close method closes the browser).
The approach followed to avoid this is to capture these undesirable events and ignore them.
      b. The crawler captures all the new navigation events that happen as a consequence of the click. Each navigation event produces a URL. Let An be the set of all the new URLs.
      c. Ap = An ∪ Ap.
      d. Once the execution of the events associated with a click on an anchor has finished, the crawler analyzes the same page again, looking for new anchors that could have been generated by the click event (for instance, new options corresponding to pop-up menus), Anp. These new anchors are also added to Ap: Ap = Anp ∪ Ap.
5. The browser changes to the navigation state, and the crawler is ready to process a new URL.

If the processed page has several frames, the system processes each frame in the same way. Note that the system processes the anchors in a page following a bottom-up approach, so new anchors are added to the list before the existing ones. This way, new anchors are processed before some other click can remove them from the page. Also note that the added anchors have to pass the filters for adding URLs mentioned in section 3.3.

4. RELATED WORK AND CONCLUSIONS

A well-known approach for discovering and indexing relevant information is to crawl a given information space (e.g. the WWW, the repositories of a corporate intranet, etc.) looking for information verifying certain requirements. Nevertheless, today's web crawlers or spiders [Brin98] do not deal with the hidden web. During the last few years, there have been some pioneering research efforts dealing with the complexities of accessing the hidden web [Raghavan01] [Ipeirotis02], using a variety of approaches. Nevertheless, these systems are only concerned with the server-side hidden web (that is, learning how to interpret and query HTML forms).
Some crawling systems [WebCopier] have included JavaScript interpreters [Rhino] [SpiderMonkey] in the HTTP clients they use, in order to provide some limited support for dealing with JavaScript. Nevertheless, our system offers several advantages over them:

- It is able to correctly execute any scripting code in the same manner it would be executed by a conventional web browser.
- It is able to deal with session maintenance mechanisms, both for crawling and for later access to documents (the latter through an ActiveX component able to execute NSEQL programs).
- It is able to deal with anchors and forms generated dynamically in response to events produced by the user (e.g. pop-up menus).
- It is able to deal with redirections (including those generated by Java applets and Flash programs).

Finally, we want to remark that the system presented in this paper has already been successfully used in several real-world applications in fields such as corporate search and technology watch. We have found the need for crawling the client-side hidden web to be very frequent in these application domains. The reason is that, although most popular mainstream websites avoid using JavaScript and other similar techniques in order to be correctly indexed by large-scale engines such as Google, many medium-scale websites containing information of great value continue to use them intensively. This is especially the case for websites requiring subscription or user authentication: since these sites do not have any incentive to ease the work of the large-scale search engines, many of them make intensive use of client dynamism. Nevertheless, these kinds of sites are usually the most valuable for many focused search applications, like technology watch or vertical search engines. Thus, our experience says the efforts for accessing the client-side deep web are valuable and should be continued.

ACKNOWLEDGEMENT

This research was partially supported by the Spanish Ministry of Science and Technology under project TIC. Alberto Pan's work was partially supported by the Ramón y Cajal programme of the Spanish Ministry of Science and Technology.

REFERENCES

[Bergman01] Bergman M.K. The Deep Web: Surfacing Hidden Value.
[Brin98] Brin S. and Page L. The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Proceedings of the 7th International World Wide Web Conference.
[Ipeirotis02] Ipeirotis P.G. and Gravano L., 2002. Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection. In Proceedings of the 28th International Conference on Very Large Databases (VLDB).
[MSIE] Microsoft Internet Explorer WebBrowser Control.
[Pan02] Pan A. et al. Semi-Automatic Wrapper Generation for Commercial Web Sources. In Proceedings of the IFIP WG8.1 Working Conference on Engineering Information Systems in the Internet Context (EISIC 2002).
[Raghavan01] Raghavan S. and García-Molina H. Crawling the Hidden Web. In Proceedings of the 27th International Conference on Very Large Databases (VLDB).
[Rhino] Mozilla Rhino, a JavaScript engine written in Java.
[SpiderMonkey] Mozilla SpiderMonkey, a JavaScript engine written in C.
[WebCopier] WebCopier. Feel the Internet in your Hands.
More informationPROCE55 Mobile: Web API App. Web API. https://www.rijksmuseum.nl/api/...
PROCE55 Mobile: Web API App PROCE55 Mobile with Test Web API App Web API App Example This example shows how to access a typical Web API using your mobile phone via Internet. The returned data is in JSON
More informationOU EDUCATE TRAINING MANUAL
OU EDUCATE TRAINING MANUAL OmniUpdate Web Content Management System El Camino College Staff Development 310-660-3868 Course Topics: Section 1: OU Educate Overview and Login Section 2: The OmniUpdate Interface
More informationEVENT-DRIVEN PROGRAMMING
LESSON 13 EVENT-DRIVEN PROGRAMMING This lesson shows how to package JavaScript code into self-defined functions. The code in a function is not executed until the function is called upon by name. This is
More informationThe figure below shows the Dreamweaver Interface.
Dreamweaver Interface Dreamweaver Interface In this section you will learn about the interface of Dreamweaver. You will also learn about the various panels and properties of Dreamweaver. The Macromedia
More informationFirefox for Android. Reviewer s Guide. Contact us:
Reviewer s Guide Contact us: press@mozilla.com Table of Contents About Mozilla 1 Move at the Speed of the Web 2 Get Started 3 Mobile Browsing Upgrade 4 Get Up and Go 6 Customize On the Go 7 Privacy and
More informationAbusing Windows Opener to Bypass CSRF Protection (Never Relay On Client Side)
Abusing Windows Opener to Bypass CSRF Protection (Never Relay On Client Side) Narendra Bhati @NarendraBhatiB http://websecgeeks.com Abusing Windows Opener To Bypass CSRF Protection Narendra Bhati Page
More informationAJAX Programming Overview. Introduction. Overview
AJAX Programming Overview Introduction Overview In the world of Web programming, AJAX stands for Asynchronous JavaScript and XML, which is a technique for developing more efficient interactive Web applications.
More informationScanning to SkyDrive with ccscan Document Capture to the Cloud
Capture Components, LLC White Paper Page 1 of 15 Scanning to SkyDrive with ccscan Document Capture to the Cloud 32158 Camino Capistrano Suite A PMB 373 San Juan Capistrano, CA 92675 Sales@CaptureComponents.com
More informationPart of this connection identifies how the response can / should be provided to the client code via the use of a callback routine.
What is AJAX? In one sense, AJAX is simply an acronym for Asynchronous JavaScript And XML In another, it is a protocol for sending requests from a client (web page) to a server, and how the information
More informationEmbedded WAYF A slightly new approach to the discovery problem. Lukas Hämmerle
Embedded WAYF A slightly new approach to the discovery problem Lukas Hämmerle lukas.haemmerle@switch.ch The Problem In a federated environment, the user has to declare where he wants to authenticate. The
More informationCHAPTER 2 MARKUP LANGUAGES: XHTML 1.0
WEB TECHNOLOGIES A COMPUTER SCIENCE PERSPECTIVE CHAPTER 2 MARKUP LANGUAGES: XHTML 1.0 Modified by Ahmed Sallam Based on original slides by Jeffrey C. Jackson reserved. 0-13-185603-0 HTML HELLO WORLD! Document
More informationFrequently Asked Questions Exhibitor Online Platform. Simply pick the subject (below) that covers your query and topic to access the FAQs:
Exhibitor Online Platform Simply pick the subject (below) that covers your query and topic to access the FAQs: 1. What is Exhibitor Online Platform (EOP)?...2 2. System requirements...3 2.1. What are the
More information(Refer Slide Time: 01:40)
Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #25 Javascript Part I Today will be talking about a language
More informationScreen Scraping. Screen Scraping Defintions ( Web Scraping (
Screen Scraping Screen Scraping Defintions (http://www.wikipedia.org/) Originally, it referred to the practice of reading text data from a computer display terminal's screen. This was generally done by
More information13. Databases on the Web
13. Databases on the Web Requirements for Web-DBMS Integration The ability to access valuable corporate data in a secure manner Support for session and application-based authentication The ability to interface
More informationThe Insanely Powerful 2018 SEO Checklist
The Insanely Powerful 2018 SEO Checklist How to get a perfectly optimized site with the 2018 SEO checklist Every time we start a new site, we use this SEO checklist. There are a number of things you should
More informationCSC105, Introduction to Computer Science I. Introduction and Background. search service Web directories search engines Web Directories database
CSC105, Introduction to Computer Science Lab02: Web Searching and Search Services I. Introduction and Background. The World Wide Web is often likened to a global electronic library of information. Such
More informationChecklist for Testing of Web Application
Checklist for Testing of Web Application Web Testing in simple terms is checking your web application for potential bugs before its made live or before code is moved into the production environment. During
More informationCS WEB TECHNOLOGY
CS1019 - WEB TECHNOLOGY UNIT 1 INTRODUCTION 9 Internet Principles Basic Web Concepts Client/Server model retrieving data from Internet HTM and Scripting Languages Standard Generalized Mark up languages
More information2013 Case Study 4for4
Case Study 4for4 The goal of SEO audit The success of website promotion in the search engines depends on two most important factors: the inner site condition and its link popularity. Also, a lot depends
More informationInformation Retrieval Spring Web retrieval
Information Retrieval Spring 2016 Web retrieval The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically infinite due to the dynamic
More informationCrownPeak Playbook CrownPeak Search
CrownPeak Playbook CrownPeak Search Version 0.94 Table of Contents Search Overview... 4 Search Benefits... 4 Additional features... 5 Business Process guides for Search Configuration... 5 Search Limitations...
More informationDeveloping a Basic Web Site
Developing a Basic Web Site Creating a Chemistry Web Site 1 Objectives Define links and how to use them Create element ids to mark specific locations within a document Create links to jump between sections
More informationWholesale Lockbox User Guide
Wholesale Lockbox User Guide August 2017 Copyright 2017 City National Bank City National Bank Member FDIC For Client Use Only Table of Contents Introduction... 3 Getting Started... 4 System Requirements...
More informationNetscape Introduction to the JavaScript Language
Netscape Introduction to the JavaScript Language Netscape: Introduction to the JavaScript Language Eckart Walther Netscape Communications Serving Up: JavaScript Overview Server-side JavaScript LiveConnect:
More informationEarly Data Analyzer Web User Guide
Early Data Analyzer Web User Guide Early Data Analyzer, Version 1.4 About Early Data Analyzer Web Getting Started Installing Early Data Analyzer Web Opening a Case About the Case Dashboard Filtering Tagging
More informationCrawling the Hidden Web Resources: A Review
Rosy Madaan 1, Ashutosh Dixit 2 and A.K. Sharma 2 Abstract An ever-increasing amount of information on the Web today is available only through search interfaces. The users have to type in a set of keywords
More informationTo find a quick and easy route to web-enable
BY JIM LEINBACH This article, the first in a two-part series, examines IBM s CICS Web Support (CWS) and provides one software developer s perspective on the strengths of CWS, the challenges his site encountered
More informationbla bla Open-Xchange Server Mobile Web Interface User Guide
bla bla Open-Xchange Server Mobile Web Interface User Guide Open-Xchange Server Open-Xchange Server: Mobile Web Interface User Guide Published Wednesday, 29. August 2012 version 1.2 Copyright 2006-2012
More informationMaster Syndication Gateway V2. User's Manual. Copyright Bontrager Connection LLC
Master Syndication Gateway V2 User's Manual Copyright 2005-2006 Bontrager Connection LLC 1 Introduction This document is formatted for A4 printer paper. A version formatted for letter size printer paper
More informationEclipse as a Web 2.0 Application Position Paper
Eclipse Summit Europe Server-side Eclipse 11 12 October 2006 Eclipse as a Web 2.0 Application Position Paper Automatic Web 2.0 - enabling of any RCP-application with Xplosion Introduction If todays Web
More information웹소프트웨어의신뢰성. Instructor: Gregg Rothermel Institution: 한국과학기술원 Dictated: 김윤정, 장보윤, 이유진, 이해솔, 이정연
웹소프트웨어의신뢰성 Instructor: Gregg Rothermel Institution: 한국과학기술원 Dictated: 김윤정, 장보윤, 이유진, 이해솔, 이정연 [0:00] Hello everyone My name is Kyu-chul Today I m going to talk about this paper, IESE 09, name is "Invariant-based
More informationE ECMAScript, 21 elements collection, HTML, 30 31, 31. Index 161
A element, 108 accessing objects within HTML, using JavaScript, 27 28, 28 activatediv()/deactivatediv(), 114 115, 115 ActiveXObject, AJAX and, 132, 140 adding information to page dynamically, 30, 30,
More informationIntroduction to JavaScript p. 1 JavaScript Myths p. 2 Versions of JavaScript p. 2 Client-Side JavaScript p. 3 JavaScript in Other Contexts p.
Preface p. xiii Introduction to JavaScript p. 1 JavaScript Myths p. 2 Versions of JavaScript p. 2 Client-Side JavaScript p. 3 JavaScript in Other Contexts p. 5 Client-Side JavaScript: Executable Content
More informationCS50 Quiz Review. November 13, 2017
CS50 Quiz Review November 13, 2017 Info http://docs.cs50.net/2017/fall/quiz/about.html 48-hour window in which to take the quiz. You should require much less than that; expect an appropriately-scaled down
More informationContent Publisher User Guide
Content Publisher User Guide Overview 1 Overview of the Content Management System 1 Table of Contents What's New in the Content Management System? 2 Anatomy of a Portal Page 3 Toggling Edit Controls 5
More informationAutomatically Maintaining Wrappers for Semi- Structured Web Sources
Automatically Maintaining Wrappers for Semi- Structured Web Sources Juan Raposo, Alberto Pan, Manuel Álvarez Department of Information and Communication Technologies. University of A Coruña. {jrs,apan,mad}@udc.es
More informationCHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER
CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER 4.1 INTRODUCTION In 1994, the World Wide Web Worm (WWWW), one of the first web search engines had an index of 110,000 web pages [2] but
More informationConnecting with Computer Science Chapter 5 Review: Chapter Summary:
Chapter Summary: The Internet has revolutionized the world. The internet is just a giant collection of: WANs and LANs. The internet is not owned by any single person or entity. You connect to the Internet
More information5/19/2015. Objectives. JavaScript, Sixth Edition. Introduction to the World Wide Web (cont d.) Introduction to the World Wide Web
Objectives JavaScript, Sixth Edition Chapter 1 Introduction to JavaScript When you complete this chapter, you will be able to: Explain the history of the World Wide Web Describe the difference between
More informationActive Server Pages Architecture
Active Server Pages Architecture Li Yi South Bank University Contents 1. Introduction... 2 1.1 Host-based databases... 2 1.2 Client/server databases... 2 1.3 Web databases... 3 2. Active Server Pages...
More informationWeb Programming Paper Solution (Chapter wise)
Introduction to web technology Three tier/ n-tier architecture of web multitier architecture (often referred to as n-tier architecture) is a client server architecture in which presentation, application
More informationNaresh Information Technologies
Naresh Information Technologies Server-side technology ASP.NET Web Forms & Web Services Windows Form: Windows User Interface ADO.NET: Data & XML.NET Framework Base Class Library Common Language Runtime
More informationAt the Forge JavaScript Reuven M. Lerner Abstract Like the language or hate it, JavaScript and Ajax finally give life to the Web. About 18 months ago, Web developers started talking about Ajax. No, we
More informationSmartAnalytics. Manual
Manual January 2013, Copyright Webland AG 2013 Table of Contents Help for Site Administrators & Users Login Site Activity Traffic Files Paths Search Engines Visitors Referrals Demographics User Agents
More informationWDD Fall 2016Group 4 Project Report
WDD 5633-2 Fall 2016Group 4 Project Report A Web Database Application on Loan Service System Devi Sai Geetha Alapati #7 Mohan Krishna Bhimanadam #24 Rohit Yadav Nethi #8 Bhavana Ganne #11 Prathyusha Mandala
More informationBEAWebLogic. Portal. Overview
BEAWebLogic Portal Overview Version 10.2 Revised: February 2008 Contents About the BEA WebLogic Portal Documentation Introduction to WebLogic Portal Portal Concepts.........................................................2-2
More informationResources required by the Bidders & Department Officials to access the e-tendering System
Resources required by the Bidders & Department Officials to access the e-tendering System Browsers supported This site generates XHTML 1.0 code and can be used by any browser supporting this standard.
More informationBuilding Mashups Using the ArcGIS APIs for FLEX and JavaScript. Shannon Brown Lee Bock
Building Mashups Using the ArcGIS APIs for FLEX and JavaScript Shannon Brown Lee Bock Agenda Introduction Mashups State of the Web Client ArcGIS Javascript API ArcGIS API for FLEX What is a mashup? What
More informationSession 6. JavaScript Part 1. Reading
Session 6 JavaScript Part 1 Reading Reading Wikipedia en.wikipedia.org/wiki/javascript Web Developers Notes www.webdevelopersnotes.com/tutorials/javascript/ JavaScript Debugging www.w3schools.com/js/js_debugging.asp
More informationLoad testing with WAPT: Quick Start Guide
Load testing with WAPT: Quick Start Guide This document describes step by step how to create a simple typical test for a web application, execute it and interpret the results. A brief insight is provided
More informationSkyway Builder Web Control Guide
Skyway Builder Web Control Guide 6.3.0.0-07/21/2009 Skyway Software Skyway Builder Web Control Guide: 6.3.0.0-07/21/2009 Skyway Software Published Copyright 2009 Skyway Software Abstract TBD Table of
More informationMythoLogic: problems and their solutions in the evolution of a project
6 th International Conference on Applied Informatics Eger, Hungary, January 27 31, 2004. MythoLogic: problems and their solutions in the evolution of a project István Székelya, Róbert Kincsesb a Department
More informationWHITE PAPER. Good Mobile Intranet Technical Overview
WHITE PAPER Good Mobile Intranet CONTENTS 1 Introduction 4 Security Infrastructure 6 Push 7 Transformations 8 Differential Data 8 Good Mobile Intranet Server Management Introduction Good Mobile Intranet
More informationApplication Security through a Hacker s Eyes James Walden Northern Kentucky University
Application Security through a Hacker s Eyes James Walden Northern Kentucky University waldenj@nku.edu Why Do Hackers Target Web Apps? Attack Surface A system s attack surface consists of all of the ways
More informationLesson 12: JavaScript and AJAX
Lesson 12: JavaScript and AJAX Objectives Define fundamental AJAX elements and procedures Diagram common interactions among JavaScript, XML and XHTML Identify key XML structures and restrictions in relation
More informationPerformance Evaluation of a Regular Expression Crawler and Indexer
Performance Evaluation of a Regular Expression Crawler and Sadi Evren SEKER Department of Computer Engineering, Istanbul University, Istanbul, Turkey academic@sadievrenseker.com Abstract. This study aims
More informationARCHITECTURE AND IMPLEMENTATION OF A NEW USER INTERFACE FOR INTERNET SEARCH ENGINES
ARCHITECTURE AND IMPLEMENTATION OF A NEW USER INTERFACE FOR INTERNET SEARCH ENGINES Fidel Cacheda, Alberto Pan, Lucía Ardao, Angel Viña Department of Tecnoloxías da Información e as Comunicacións, Facultad
More informationUNIT 3 SECTION 1 Answer the following questions Q.1: What is an editor? editor editor Q.2: What do you understand by a web browser?
UNIT 3 SECTION 1 Answer the following questions Q.1: What is an editor? A 1: A text editor is a program that helps you write plain text (without any formatting) and save it to a file. A good example is
More informationLecture 2 Advanced Scripting of DesignModeler
Lecture 2 Advanced Scripting of DesignModeler 1 Contents Supported API s of DesignModeler Attaching Debugger to DesignModeler Advanced scripting API s of DesignModeler Handlers Tree, Graphics, File, Event
More informationBIG-IP Access Policy Manager : Portal Access. Version 12.1
BIG-IP Access Policy Manager : Portal Access Version 12.1 Table of Contents Table of Contents Overview of Portal Access...7 Overview: What is portal access?...7 About portal access configuration elements...7
More informationDetects Potential Problems. Customizable Data Columns. Support for International Characters
Home Buy Download Support Company Blog Features Home Features HttpWatch Home Overview Features Compare Editions New in Version 9.x Awards and Reviews Download Pricing Our Customers Who is using it? What
More informationCHAPTER 7 WEB SERVERS AND WEB BROWSERS
CHAPTER 7 WEB SERVERS AND WEB BROWSERS Browser INTRODUCTION A web browser is a software application for retrieving, presenting, and traversing information resources on the World Wide Web. An information
More informationTabular Presentation of the Application Software Extended Package for Web Browsers
Tabular Presentation of the Application Software Extended Package for Web Browsers Version: 2.0 2015-06-16 National Information Assurance Partnership Revision History Version Date Comment v 2.0 2015-06-16
More informationSearching the Web for Information
Search Xin Liu Searching the Web for Information How a Search Engine Works Basic parts: 1. Crawler: Visits sites on the Internet, discovering Web pages 2. Indexer: building an index to the Web's content
More informationSEO According to Google
SEO According to Google An On-Page Optimization Presentation By Rachel Halfhill Lead Copywriter at CDI Agenda Overview Keywords Page Titles URLs Descriptions Heading Tags Anchor Text Alt Text Resources
More informationManipulating Database Objects
Manipulating Database Objects Purpose This tutorial shows you how to manipulate database objects using Oracle Application Express. Time to Complete Approximately 30 minutes. Topics This tutorial covers
More informationChapter 9. Web Applications The McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill
Chapter 9 Web Applications McGraw-Hill 2010 The McGraw-Hill Companies, Inc. All rights reserved. Chapter Objectives - 1 Explain the functions of the server and the client in Web programming Create a Web
More informationIntroduction to emanagement MGMT 230 WEEK 5: FEBRUARY 5
Introduction to emanagement MGMT 230 WEEK 5: FEBRUARY 5 Digital design and usability search engine optimization. Measurement and evaluation. Web analytics and data mining Today s Class Search Engine Optimization
More informationA crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index.
A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. The major search engines on the Web all have such a program,
More informationDeposit Wizard TellerScan Installation Guide
Guide Table of Contents System Requirements... 2 WebScan Overview... 2 Hardware Requirements... 2 Supported Browsers... 2 Driver Installation... 2 Step 1 - Determining Windows Edition & Bit Count... 3
More informationSelf Adjusting Refresh Time Based Architecture for Incremental Web Crawler
IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.12, December 2008 349 Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler A.K. Sharma 1, Ashutosh
More informationDeveloping Ajax Applications using EWD and Python. Tutorial: Part 2
Developing Ajax Applications using EWD and Python Tutorial: Part 2 Chapter 1: A Logon Form Introduction This second part of our tutorial on developing Ajax applications using EWD and Python will carry
More informationLesson 5: Introduction to Events
JavaScript 101 5-1 Lesson 5: Introduction to Events OBJECTIVES: In this lesson you will learn about Event driven programming Events and event handlers The onclick event handler for hyperlinks The onclick
More informationInformation Retrieval. Lecture 10 - Web crawling
Information Retrieval Lecture 10 - Web crawling Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 30 Introduction Crawling: gathering pages from the
More informationWebsite Report for facebook.com
Website Report for facebook.com Fife Website Design 85 Urquhart Crescent 07821731179 hello@fifewebsitedesign.co.uk www.fifewebsitedesign.co.uk This report grades your website on the strength of a range
More informationCreate and Apply Clientless SSL VPN Policies for Accessing. Connection Profile Attributes for Clientless SSL VPN
Create and Apply Clientless SSL VPN Policies for Accessing Resources, page 1 Connection Profile Attributes for Clientless SSL VPN, page 1 Group Policy and User Attributes for Clientless SSL VPN, page 3
More informationComprehensive AngularJS Programming (5 Days)
www.peaklearningllc.com S103 Comprehensive AngularJS Programming (5 Days) The AngularJS framework augments applications with the "model-view-controller" pattern which makes applications easier to develop
More informationThe Evaluation of Just-In-Time Hypermedia Engine
The Evaluation of Just-In-Time Hypermedia Engine Zong Chen 1, Li Zhang 2 1 (School of Computer Sciences and Engineering, Fairleigh Dickinson University, USA) 2 (Computer Science Department, New Jersey
More information