CRAWLING THE CLIENT-SIDE HIDDEN WEB
Manuel Álvarez, Alberto Pan, Juan Raposo, Ángel Viña
Department of Information and Communications Technologies
University of A Coruña, A Coruña, Spain
{mad,apan,jrs,avc}@udc.es

ABSTRACT

There is a great amount of information on the web that cannot be accessed by conventional crawler engines. This portion of the web is usually called hidden web data. To deal with this problem, it is necessary to solve two tasks: crawling the client-side and crawling the server-side hidden web. In this paper we present an architecture and a set of related techniques for accessing the information placed in the client-side hidden web, dealing with aspects such as JavaScript technology, non-standard session maintenance mechanisms, client redirections, pop-up menus, etc. Our approach leverages current browser APIs and implements novel crawling models and algorithms.

KEYWORDS

Web Crawler, Hidden Web, Client Side.

1. INTRODUCTION

The Hidden Web or Deep Web [Bergman01] is usually defined as the part of WWW documents that is dynamically generated. The problem of crawling the hidden web can be divided into two tasks: crawling the client-side and crawling the server-side hidden web. Client-side hidden web techniques are concerned with accessing content dynamically generated in the client web browser, while server-side techniques focus on accessing the valuable content hidden behind web search forms [Raghavan01] [Ipeirotis02]. This paper proposes novel techniques and algorithms for dealing with the first of these problems.

1.1 The case for client-side hidden web

Today's complex web pages use scripting languages intensively (mainly JavaScript), session maintenance mechanisms, complex redirections, etc. Developers use these client technologies to add interactivity to web pages as well as to improve site navigation.
This is done through interface elements such as pop-up menus, or by arranging content in layers that are shown or hidden depending on user actions. In addition, many sources use scripting languages, such as JavaScript, for a variety of internal purposes, including dynamically building HTTP requests for submitting forms, managing HTML layers and/or performing complex redirections. This situation is aggravated because most of the tools used for visually building web sites generate pages that use scripting code for content generation and/or for improving navigation.

1.2 The problem with conventional crawlers

Several problems make it difficult for traditional web crawling engines to obtain data from client-side hidden web pages. The most important ones are described in the following sub-sections.
1.2.1 Client-side scripting languages

Many HTML pages make intensive use of JavaScript and other client-side scripting languages (such as JScript or VBScript) for a variety of purposes, such as: a) Generating content at runtime (e.g. document.write methods in JavaScript). b) Dynamically generating navigations. Scripting code may appear, for instance, in the href attribute of an anchor, or it can be executed when some event of the page is fired (e.g. onclick or onmouseover for unfolding a pop-up menu when the user clicks or moves the mouse over a menu option). It is also possible for the scripting code to rewrite a URL, to open a new window or to generate several navigations (more than one URL from which to continue the crawling process). c) Automatically filling out a form in a page and then submitting it.

Successfully dealing with scripting languages requires that HTTP clients implement all the mechanisms that make it possible for a browser to render a page and to generate new navigations. It also involves following anchors and executing all the actions associated with the events they fire. Using a specific interpreter (e.g. Mozilla Rhino for JavaScript [Rhino]) does not solve these problems, since real-world scripts assume a set of browser-provided objects to be available in their execution environment. In addition, in some situations, such as multi-frame pages, it is not always easy to locate and extract the scripting code to be interpreted. That is why most crawlers built to date, including the ones used in the most popular web search engines, do not provide support for this kind of page.

Providing a convenient execution environment for executing scripts is not the only problem associated with client-side dynamism. When conventional crawlers reach a new page, they scan it for new anchors to traverse and add them to a master list of URLs to access. Scripting code complicates this situation because it may be used to dynamically generate or remove anchors in response to some events.
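A crawler must at least be able to tell static anchors apart from script-controlled ones before deciding how to process them. A minimal sketch of such a check, using a hypothetical dictionary-of-attributes representation of an anchor (this is illustrative, not the paper's actual code):

```python
# Illustrative check, mirroring the distinction drawn above: an anchor is
# "static" only if its href holds a plain URL and no scripting code is
# wired to its events (onclick, onmouseover, ...).
SCRIPT_EVENT_ATTRS = ("onclick", "onmouseover", "onmousedown")

def is_static_anchor(anchor):
    """anchor: dict of HTML attribute name -> value (hypothetical model)."""
    href = anchor.get("href", "")
    if href.lower().startswith("javascript:"):
        return False
    return not any(attr in anchor for attr in SCRIPT_EVENT_ATTRS)

plain = {"href": "http://example.com/page2.html"}
scripted = {"href": "javascript:unfoldMenu(3)"}
eventful = {"href": "#", "onclick": "navigate('inbox')"}
```

Only the first anchor can be handled by reading its href; the other two require evaluating scripting code in a browser-like environment.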
For instance, many web pages use anchors to represent menus of options. When an anchor representing an option is clicked, some scripting code dynamically generates a list of new anchors representing sub-options. If the anchor is clicked again, the script code may fold the menu again, removing the anchors corresponding to the sub-options. A crawler dealing with the client-side deep web should be able to detect these situations and to obtain all the hidden anchors, adding them to the master URL list.

1.2.2 Session maintenance mechanisms

Many websites use session maintenance mechanisms based on client resources, like cookies or scripting code that adds session parameters to the URLs before sending them to the server. This provokes a number of problems:

- While most crawlers are able to deal with cookies, we have already stated that this is not the case with scripting languages.
- Another problem arises for distributed crawling. Conventional architectures for crawling are based on a shared master list of URLs from which crawling processes (possibly running on different machines) pick URLs and access them independently in a parallel manner. Nevertheless, with session-based sites, we need to ensure that each crawling process has all the session information it needs available (such as cookies or the context for executing the scripting code). Otherwise, attempts to access the page will fail. Conventional crawlers do not deal with these situations.
- The problem of later access to the documents. Most web search engines work by indexing the pages retrieved by a web crawler. The crawled pages are usually not stored locally, but are indexed with their URLs. When at a later moment a user obtains the page as a result of a query against the index, he can access the page through its URL. Nevertheless, in a context where session maintenance issues exist, the URLs may not work when used at a later time.
For instance, the URL may include a session number that expires a few minutes after being created.

1.2.3 Redirections

Many websites use complex redirections that are not managed by conventional crawlers. For instance, some pages include JavaScript redirections executed after an onload page event (the client redirects after the page has been completely loaded):
<BODY onload="executeJavascriptRedirectionMethod()">

In these cases, the HTTP client has to analyze and interpret the page content to detect and correctly manage these types of redirections.

1.2.4 Applets and Flash code

Other types of client technology are applets and Flash code. They are executed on the client side, so the HTTP client has to implement a container component to process them. Although accessing the content shown by programs written in these languages is difficult due to their compiled nature, a web crawler should at least be able to deal with the common situation where these components are used as an introduction that finally redirects the user to a conventional page where the crawler can proceed.

1.2.5 Other issues

Issues such as frames, dynamic HTML or HTTPS accentuate the aforementioned problems. In general terms, it is very difficult to account for all the factors that make a website visible and navigable through a web browser.

1.3 Our approach

For all the reasons mentioned above, many designers of web sites avoid these practices in order to make sure their sites are on good terms with the crawlers. Nevertheless, this forces them either to increase the complexity of their systems by moving functionality to the server, or to reduce interactivity with the user. Neither situation is desirable: web site designers should think in terms of improving the interactivity and friendliness of their sites, not about how crawlers work.

This paper presents an architecture and a set of related techniques to solve the problems involved in crawling the client-side hidden web. Our system has already been successfully used in several real applications in the fields of corporate search and technology watch. The main features of our approach are the following.

Our crawling processes are not based on HTTP clients.
Instead, they are based on automated mini web browsers, built using standard browser APIs (our current implementation is based on the Microsoft Internet Explorer WebBrowser Control [MSIE]). These mini-browsers understand NSEQL (see section 2), a language for expressing navigation sequences as macros of actions on the interface of a web browser. This enables our system to execute scripting code, manage redirections, etc.

To deal with pop-up menus and other dynamic elements that can generate new anchors in the current page, it is necessary to implement special algorithms to manage the process of generating new routes to crawl from a web page (see section 3.4).

To solve the problem of session maintenance, our system uses the concept of route to a document, which can be seen as a generalization of the concept of URL. A route specifies a URL, a session object containing the needed session context for the URL, and a NSEQL program for accessing the document when the session used for crawling the document has expired.

The system also includes some functionality to access pages hidden behind forms. More precisely, the system is able to deal with authentication forms and with value-limited forms: those composed exclusively of fields whose possible values are restricted to a certain finite list (e.g. forms composed only of select, checkbox and radio button fields).
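Deciding whether a form is value-limited reduces to checking the types of its fields. A minimal sketch, assuming a hypothetical list-of-dicts representation of the form's fields (not the system's actual data model):

```python
# A form is "value-limited" when every field draws its value from a
# finite list: select, checkbox and radio button fields qualify, while
# free-text fields do not.
VALUE_LIMITED_TYPES = {"select", "checkbox", "radio"}

def is_value_limited(fields):
    """Return True if every field's possible values form a finite list."""
    return all(f["type"] in VALUE_LIMITED_TYPES for f in fields)

# A free-text search form is not value-limited; a select/radio form is.
search_form = [{"type": "text", "name": "q"}]
filter_form = [{"type": "select", "name": "country"},
               {"type": "radio", "name": "sort"}]
```

Only forms passing this check can be exhaustively submitted by enumerating field-value combinations, as described in section 3.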
2. INTRODUCTION TO NSEQL

NSEQL [Pan02] is a language to declaratively define sequences of events on the interface provided by a web browser. NSEQL allows one to easily express macros representing a sequence of user events on a browser. NSEQL works at the browser layer instead of at the HTTP layer. This lets us forget about problems such as successfully executing JavaScript or dealing with client redirections and session identifiers.

Navigate(...);
FindFormByName("login_form", 0);
SetInputValue("login", 0, loginvalue);
SetInputValue("passwd", 0, passwordvalue);
ClickOnElement(".save", "Input", 0);
ClickOnAnchorByText("Go to Inbox", 0, false);

Figure 1. NSEQL Program

Figure 1 shows an example of an NSEQL program, which executes the login process at YahooMail and navigates to the list of messages in the Inbox folder. The Navigate command makes the browser navigate to the given URL. Its effect is equivalent to that of a human user typing the URL in his/her browser address bar and pressing ENTER. The FindFormByName(name, position) command looks for the position-th HTML form in the page with the given name. Then, the SetInputValue(fieldName, position, value) commands are used to assign values to the form fields. The ClickOnElement(name, type, position) command clicks on the position-th element of the given type and name in the currently selected form. In this case, it is used to submit the form and load the result page. The ClickOnAnchorByText(text, position) command looks for the position-th anchor that encloses the given text and generates a browser click event on it. This causes the browser to navigate to the page pointed to by the anchor. Although not included here, NSEQL also includes commands to deal with frames, pop-up windows, etc.

3. THE CRAWLING ENGINE

As in conventional crawling engines, the basic idea consists in maintaining a shared list of routes (pointers to documents), which is accessed by a certain number of concurrent crawling processes, possibly distributed over several machines. The list is initialized with a set of initial routes. Then, each crawling process picks a route from the list, downloads its associated document and analyzes it to obtain new routes from its anchors, which are then added to the master list. The process ends when there are no routes left or when a specified depth level is reached.

The structure of this section is as follows. In section 3.1, we introduce the concept of route in our system, and how it enables us to deal with sessions. Section 3.2 provides some detail about the mini-browsers used as the basic crawling processes in the system, as well as the advantages they provide. Section 3.3 describes the architecture and basic functioning of the system. Finally, section 3.4 reviews the algorithm used for generating new routes from anchors and forms controlled by scripting code (e.g. JavaScript).

3.1 Dealing with sessions: Routes

In conventional crawlers, routes are just URLs. Thus, they suffer from the problems with session mechanisms that we have already mentioned in section 1.2.2. In our system, a route is composed of three elements:

- A URL pointing to a document. In the routes from the initial list, this element may also be a NSEQL program. This is useful to start the crawling in a document which is not directly accessible through a URL (for instance, this is usually the case with websites requiring authentication).
Figure 2. Crawler Architecture

- A session object containing all the required information (cookies, etc.) for restoring the execution environment that the crawling process had at the moment of adding the route to the master list.
- A NSEQL program representing the navigation sequence followed by the system to reach the document.

The second and third elements are automatically computed by the system for each route. The second element allows a crawling process to access a URL added by another crawling process (even if the original crawling process was running on another machine). The third element is used to access the document pointed to by the route when the session originally used to crawl the document has expired. This is useful when session expiration times are short and, as we will see later, to allow for later access to crawled documents.

3.2 Mini-browsers as crawling processes

Conventional engines implement crawling processes using HTTP clients. Instead, the crawling processes in our system are based on automated mini web browsers, built using standard browser APIs (our current implementation is based on the MSIE WebBrowser Control), which are able to execute NSEQL programs. This allows our system to:

- Access content dynamically generated through scripting languages (e.g. JavaScript document.write methods).
- Evaluate the scripting code associated with anchors and forms, so we can obtain the real URLs these elements point to.
- Deal with client redirections: after commanding a navigation, the mini-browser waits until all the navigation events of the current page have finished.
- Provide an execution environment for technologies such as Java applets and Flash code. Although the mini-browsers cannot access the content shown by these compiled components, they can deal with the common situation where these components are used as a graphical introduction which finally redirects the browser to a conventional web page.

3.3 System architecture / basic functioning

The architecture of the system is shown in Figure 2. When the crawler engine starts, it reads its configuration parameters from the Configuration Manager module. The metainformation for configuring the system includes a list of URLs and/or NSEQL navigation sequences for accessing the initial sites, the desired depth for each initial route, download handlers for different kinds of documents, content filters, a list of DNS domains to be included in and excluded from the crawling, and other metainformation not dealt with here. The next step consists in starting the URL Manager component with the list of initial sites for the crawling, as well as starting the pool of crawling processes.
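The configuration metainformation just listed might be laid out as follows. The paper does not specify a concrete format, so both the structure and the field names below are hypothetical; only the kinds of parameters are taken from the text:

```python
# Hypothetical crawl configuration covering the parameters listed above:
# initial routes (URLs or NSEQL programs) with per-route depth, download
# handlers, content filters, and DNS domain include/exclude lists.
config = {
    "initial_routes": [
        {"url": "http://www.example.com/", "depth": 3},
        {"nseql": 'Navigate(...); ClickOnAnchorByText("Enter", 0, false);',
         "depth": 2},
    ],
    "download_handlers": {"text/html": "generic", "application/pdf": "pdf"},
    "content_filters": ["error", "generate_new_urls", "url"],
    "included_domains": ["example.com"],
    "excluded_domains": ["ads.example.com"],
}

def domain_allowed(domain, cfg):
    """Apply the include/exclude DNS domain lists to a candidate domain."""
    if any(domain == d or domain.endswith("." + d)
           for d in cfg["excluded_domains"]):
        return False
    return any(domain == d or domain.endswith("." + d)
               for d in cfg["included_domains"])
```

Exclusions take precedence over inclusions here; that ordering is our assumption, not something the paper states.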
The URL Manager is responsible for maintaining the master list of routes to be accessed, which all the crawlers share. As the crawling proceeds, the crawling processes add new routes to the list by analyzing the anchors and value-limited forms found in the crawled pages.

Once the crawling processes have been started, each one picks a route from the URL Manager. It is important to note that each crawling process can be executed either locally or remotely to the server, thus allowing for distributed crawling. As we have already remarked, each crawling process is a mini web browser able to execute NSEQL sequences. The crawling process then loads the session object associated with the route and downloads the associated document (using the Download Manager component to choose the right handler for the document). If the session has expired, the crawling process uses the NSEQL program to access the document again.

The content of each downloaded document is then analyzed using the Content Manager component. This component specifies a chain of filters to decide whether the document can be considered relevant and, therefore, whether it should be stored and/or indexed. For instance, the system includes filters which check whether the document satisfies a keyword-based boolean query with a minimum relevance, in order to decide whether to store/index it. Another chain of filters is used for post-processing the document. For instance, the system includes filters to extract relevant content from HTML pages or to generate a short document summary.

Finally, the system tries to obtain new routes from the analyzed documents and adds them to the master list. In a context where scripting languages can dynamically generate and remove anchors and forms, this involves some complexities; see section 3.4 for details. The system also includes a chain of filters to decide whether the new routes must be added to the master list or not.
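The basic functioning just described, together with the route structure of section 3.1, can be sketched as a loop. Everything below is a simplified illustration: the field and helper names (`fetch`, `extract_urls`, `route_filters`) are hypothetical stand-ins for the Download Manager, Content Manager and URL filter chain, and the real crawling processes are automated MSIE instances executing NSEQL, not Python callables:

```python
from collections import deque
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    """Generalization of a URL (section 3.1); field names are illustrative."""
    url: str
    session: tuple = ()   # cookies/context, computed by the system
    nseql: str = ""       # navigation sequence that reached the document

def crawl(initial_urls, fetch, extract_urls, route_filters, max_depth):
    """Pick a route, download its document, and add the new routes found,
    until the list is empty or the depth limit is reached."""
    master = deque((Route(url=u), 0) for u in initial_urls)
    seen = {r.url for r, _ in master}
    crawled = []
    while master:
        route, depth = master.popleft()
        document = fetch(route)      # mini-browser replays the NSEQL program
        crawled.append(document)     # if the stored session has expired
        if depth >= max_depth:
            continue
        for url in extract_urls(document):
            if url not in seen and all(f(url) for f in route_filters):
                seen.add(url)
                master.append((Route(url=url), depth + 1))
    return crawled

# Toy run over an in-memory two-level "site".
links = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
docs = crawl(["a"], fetch=lambda r: r.url,
             extract_urls=lambda d: links[d],
             route_filters=[lambda u: True], max_depth=1)
```

With `max_depth=1`, the toy run crawls "a" and its direct children "b" and "c", but does not expand "b" further.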
In the most usual configuration, while the maximum desired depth has not been reached, all the anchors of the documents generate new routes. Value-limited forms (those having only fields with a finite list of possible values, as commented in section 1.3) generate a new route for each possible combination of the values of their fields.

The architecture also includes components for indexing and searching the crawled contents, using state-of-the-art algorithms. The crawler generates an XML file for each crawled document, including metainformation such as its URL and the NSEQL sequence needed to access it. The NSEQL sequence is used by another component of the system architecture: the ActiveX component for automatic navigation. This component receives a NSEQL program as a parameter, downloads itself into the user's browser and makes it execute the given navigation sequence. In our system this is used to solve the problem of later access to documents (see section 1.2.2). When the user makes a search against the index and the list of answers contains results which cannot be accessed through their URLs due to session issues, the anchors associated with those results invoke the ActiveX component, passing the NSEQL sequence associated with the page as a parameter. Then, if the user clicks on the anchor, the ActiveX component makes the browser automatically navigate to the desired page.

3.4 Algorithm for generating new routes

This section describes the algorithm used in our system to generate the new routes to be crawled from a given HTML page. This algorithm deals with the difficulties associated with anchors and forms controlled by scripting languages.

In general, to get the new routes to be crawled from a given HTML document, it is necessary to analyze the page looking for anchors and value-limited forms. A new route is added for each anchor and for each combination of all the possible values of the fields of each value-limited form.
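Generating one route per combination of a value-limited form's fields is a Cartesian product over the finite value lists. A minimal sketch, assuming a GET form and a hypothetical name-to-values mapping for the fields (illustrative only):

```python
from itertools import product
from urllib.parse import urlencode

def routes_for_form(action, fields):
    """One candidate URL per combination of the finite field values.

    action: value of the form's action attribute.
    fields: dict mapping field name -> list of its possible values.
    """
    names = sorted(fields)
    return [action + "?" + urlencode(dict(zip(names, combo)))
            for combo in product(*(fields[n] for n in names))]

# A form with two select fields of two options each yields four routes.
urls = routes_for_form("http://example.com/search",
                       {"lang": ["en", "es"], "type": ["pdf", "html"]})
```

Real forms controlled by scripting code cannot be expanded this way from the action attribute alone; they require the event-simulation algorithm described next.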
The anchors and forms that are not controlled by scripting code can be dealt with as in conventional crawlers: for anchors, a new route is built from the value of the href attribute, while for static forms, the new routes for each combination of values can be routinely built by analyzing the action attribute of the form tag and the tags representing the form fields and their possible values. Nevertheless, if the HTML page contains client-side scripting technology, the situation is more complicated.

The main idea of the algorithm consists in generating click events on the anchors controlled by scripting languages in order to obtain valid URLs (note: we focus our discussion on the case of anchors; the treatment of value-limited forms is analogous), but there are several additional difficulties:

- Some anchors may appear or disappear from the page depending on the scripting code executed (e.g. pop-up menus).
- The script code associated with anchors must be evaluated in order to obtain valid URLs.
- One anchor can generate several navigations.
- In pages with several frames, it is possible for an anchor to generate new URLs in some frames and navigations in others.

Now we proceed to describe the algorithm. Remember that our crawling process is a mini-browser able to execute NSEQL programs. The browser can be in two states: in the navigation state the browser functions normally, and when it executes a click event on an anchor or submits a form, it performs the navigation and downloads the resulting page; in turn, in the simulation state the browser only captures the navigation events generated by the click or submit events, but it does not download the resource.

1. Let P be an HTML page that has been downloaded by the browser (navigation state).
2. The browser executes the script sections which are not associated with conditional events.
3. Let Ap be the set of all the anchors of the page, with the scripting code already interpreted.
4. For each ai ∈ Ap, remove ai from Ap and:
   a) If the href attribute of ai does not contain associated scripting code and ai has no onclick attribute (or, if the system is configured to do so, other attributes used to assign scripting code to specific events, such as onmouseover), the anchor ai is added to the master list of URLs.
   b) Otherwise, the browser changes to the simulation state and generates a click event on the anchor (and, if configured to do so, other relevant events such as mouseover):
      a. Some anchors, when clicked, can generate undesirable actions (e.g. a call to the javascript:close method closes the browser).
The approach followed to avoid this is to capture these undesirable events and ignore them.
      b. The crawler captures all the new navigation events that happen as a consequence of the click. Each navigation event produces a URL. Let An be the set of all the new URLs.
      c. Ap = An ∪ Ap.
      d. Once the execution of the events associated with a click on an anchor has finished, the crawler analyzes the same page again, looking for new anchors that could have been generated by the click event (for instance, new options corresponding to pop-up menus), Anp. These new anchors are also added to Ap: Ap = Anp ∪ Ap.
5. The browser changes to the navigation state, and the crawler is ready to process a new URL.

If the processed page has several frames, the system processes each frame in the same way. Note that the system processes the anchors in a page following a bottom-up approach, so new anchors are added to the list before the existing ones. This way, new anchors are processed before some other click can remove them from the page. Also note that the added anchors have to pass the filters for adding URLs mentioned in section 3.3.

4. RELATED WORK AND CONCLUSIONS

A well-known approach for discovering and indexing relevant information is to crawl a given information space (e.g. the WWW, the repositories of a corporate intranet, etc.) looking for information verifying certain requirements. Nevertheless, today's web crawlers or spiders [Brin98] do not deal with the hidden web. During the last few years, there have been some pioneering research efforts dealing with the complexities of accessing the hidden web [Raghavan01] [Ipeirotis02], using a variety of approaches. Nevertheless, these systems are only concerned with the server-side hidden web (that is, learning how to interpret and query HTML forms).
Some crawling systems [WebCopier] have included JavaScript interpreters [Rhino] [SpiderMonkey] in the HTTP clients they use, in order to provide some limited support for dealing with JavaScript. Nevertheless, our system offers several advantages over them:

- It is able to correctly execute any scripting code in the same manner it would be executed by a conventional web browser.
- It is able to deal with session maintenance mechanisms, both for crawling and for later access to documents (the latter through an ActiveX component able to execute NSEQL programs).
- It is able to deal with anchors and forms generated dynamically in response to events produced by the user (e.g. pop-up menus).
- It is able to deal with redirections (including those generated by Java applets and Flash programs).

Finally, we want to remark that the system presented in this paper has already been successfully used in several real-world applications in fields such as corporate search and technology watch. We have found the need for crawling the client-side hidden web to be very frequent in these application domains. The reason is that, although most popular mainstream websites avoid using JavaScript and other similar techniques in order to be correctly indexed by large-scale engines such as Google, many medium-scale websites containing information of great value continue to use them intensively. This is especially the case for websites requiring subscription or user authentication: since these sites do not have any incentive to ease the work of the large-scale search engines, many of them make intensive use of client dynamism. Nevertheless, these kinds of sites are usually the most valuable for many focused search applications, like technology watch or vertical search engines. Thus, our experience says the efforts for accessing the client-side deep web are valuable and should be continued.

ACKNOWLEDGEMENT

This research was partially supported by the Spanish Ministry of Science and Technology under project TIC. Alberto Pan's work was partially supported by the Ramón y Cajal programme of the Spanish Ministry of Science and Technology.

REFERENCES

[Bergman01] Bergman M.K. The Deep Web: Surfacing Hidden Value.
[Brin98] Brin S. and Page L. The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Proceedings of the 7th International World Wide Web Conference.
[Ipeirotis02] Ipeirotis P.G. and Gravano L., 2002. Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection. In Proceedings of the 28th International Conference on Very Large Databases (VLDB).
[MSIE] Microsoft Internet Explorer WebBrowser Control.
[Pan02] Pan A. et al. Semi-Automatic Wrapper Generation for Commercial Web Sources. In Proceedings of the IFIP WG8.1 Working Conference on Engineering Information Systems in the Internet Context (EISIC 2002).
[Raghavan01] Raghavan S. and García-Molina H. Crawling the Hidden Web. In Proceedings of the 27th International Conference on Very Large Databases (VLDB).
[Rhino] Mozilla Rhino, a JavaScript engine written in Java.
[SpiderMonkey] Mozilla SpiderMonkey, a JavaScript engine written in C.
[WebCopier] WebCopier. Feel the Internet in your Hands.
More informationPROCE55 Mobile: Web API App. Web API. https://www.rijksmuseum.nl/api/...
PROCE55 Mobile: Web API App PROCE55 Mobile with Test Web API App Web API App Example This example shows how to access a typical Web API using your mobile phone via Internet. The returned data is in JSON
More informationOU EDUCATE TRAINING MANUAL
OU EDUCATE TRAINING MANUAL OmniUpdate Web Content Management System El Camino College Staff Development 310-660-3868 Course Topics: Section 1: OU Educate Overview and Login Section 2: The OmniUpdate Interface
More informationEVENT-DRIVEN PROGRAMMING
LESSON 13 EVENT-DRIVEN PROGRAMMING This lesson shows how to package JavaScript code into self-defined functions. The code in a function is not executed until the function is called upon by name. This is
More informationThe figure below shows the Dreamweaver Interface.
Dreamweaver Interface Dreamweaver Interface In this section you will learn about the interface of Dreamweaver. You will also learn about the various panels and properties of Dreamweaver. The Macromedia
More informationFirefox for Android. Reviewer s Guide. Contact us:
Reviewer s Guide Contact us: press@mozilla.com Table of Contents About Mozilla 1 Move at the Speed of the Web 2 Get Started 3 Mobile Browsing Upgrade 4 Get Up and Go 6 Customize On the Go 7 Privacy and
More informationAbusing Windows Opener to Bypass CSRF Protection (Never Relay On Client Side)
Abusing Windows Opener to Bypass CSRF Protection (Never Relay On Client Side) Narendra Bhati @NarendraBhatiB http://websecgeeks.com Abusing Windows Opener To Bypass CSRF Protection Narendra Bhati Page
More informationAJAX Programming Overview. Introduction. Overview
AJAX Programming Overview Introduction Overview In the world of Web programming, AJAX stands for Asynchronous JavaScript and XML, which is a technique for developing more efficient interactive Web applications.
More informationScanning to SkyDrive with ccscan Document Capture to the Cloud
Capture Components, LLC White Paper Page 1 of 15 Scanning to SkyDrive with ccscan Document Capture to the Cloud 32158 Camino Capistrano Suite A PMB 373 San Juan Capistrano, CA 92675 Sales@CaptureComponents.com
More informationPart of this connection identifies how the response can / should be provided to the client code via the use of a callback routine.
What is AJAX? In one sense, AJAX is simply an acronym for Asynchronous JavaScript And XML In another, it is a protocol for sending requests from a client (web page) to a server, and how the information
More informationEmbedded WAYF A slightly new approach to the discovery problem. Lukas Hämmerle
Embedded WAYF A slightly new approach to the discovery problem Lukas Hämmerle lukas.haemmerle@switch.ch The Problem In a federated environment, the user has to declare where he wants to authenticate. The
More informationCHAPTER 2 MARKUP LANGUAGES: XHTML 1.0
WEB TECHNOLOGIES A COMPUTER SCIENCE PERSPECTIVE CHAPTER 2 MARKUP LANGUAGES: XHTML 1.0 Modified by Ahmed Sallam Based on original slides by Jeffrey C. Jackson reserved. 0-13-185603-0 HTML HELLO WORLD! Document
More informationFrequently Asked Questions Exhibitor Online Platform. Simply pick the subject (below) that covers your query and topic to access the FAQs:
Exhibitor Online Platform Simply pick the subject (below) that covers your query and topic to access the FAQs: 1. What is Exhibitor Online Platform (EOP)?...2 2. System requirements...3 2.1. What are the
More information(Refer Slide Time: 01:40)
Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #25 Javascript Part I Today will be talking about a language
More informationScreen Scraping. Screen Scraping Defintions ( Web Scraping (
Screen Scraping Screen Scraping Defintions (http://www.wikipedia.org/) Originally, it referred to the practice of reading text data from a computer display terminal's screen. This was generally done by
More information13. Databases on the Web
13. Databases on the Web Requirements for Web-DBMS Integration The ability to access valuable corporate data in a secure manner Support for session and application-based authentication The ability to interface
More informationThe Insanely Powerful 2018 SEO Checklist
The Insanely Powerful 2018 SEO Checklist How to get a perfectly optimized site with the 2018 SEO checklist Every time we start a new site, we use this SEO checklist. There are a number of things you should
More informationCSC105, Introduction to Computer Science I. Introduction and Background. search service Web directories search engines Web Directories database
CSC105, Introduction to Computer Science Lab02: Web Searching and Search Services I. Introduction and Background. The World Wide Web is often likened to a global electronic library of information. Such
More informationChecklist for Testing of Web Application
Checklist for Testing of Web Application Web Testing in simple terms is checking your web application for potential bugs before its made live or before code is moved into the production environment. During
More informationCS WEB TECHNOLOGY
CS1019 - WEB TECHNOLOGY UNIT 1 INTRODUCTION 9 Internet Principles Basic Web Concepts Client/Server model retrieving data from Internet HTM and Scripting Languages Standard Generalized Mark up languages
More information2013 Case Study 4for4
Case Study 4for4 The goal of SEO audit The success of website promotion in the search engines depends on two most important factors: the inner site condition and its link popularity. Also, a lot depends
More informationInformation Retrieval Spring Web retrieval
Information Retrieval Spring 2016 Web retrieval The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically infinite due to the dynamic
More informationCrownPeak Playbook CrownPeak Search
CrownPeak Playbook CrownPeak Search Version 0.94 Table of Contents Search Overview... 4 Search Benefits... 4 Additional features... 5 Business Process guides for Search Configuration... 5 Search Limitations...
More informationDeveloping a Basic Web Site
Developing a Basic Web Site Creating a Chemistry Web Site 1 Objectives Define links and how to use them Create element ids to mark specific locations within a document Create links to jump between sections
More informationWholesale Lockbox User Guide
Wholesale Lockbox User Guide August 2017 Copyright 2017 City National Bank City National Bank Member FDIC For Client Use Only Table of Contents Introduction... 3 Getting Started... 4 System Requirements...
More informationNetscape Introduction to the JavaScript Language
Netscape Introduction to the JavaScript Language Netscape: Introduction to the JavaScript Language Eckart Walther Netscape Communications Serving Up: JavaScript Overview Server-side JavaScript LiveConnect:
More informationEarly Data Analyzer Web User Guide
Early Data Analyzer Web User Guide Early Data Analyzer, Version 1.4 About Early Data Analyzer Web Getting Started Installing Early Data Analyzer Web Opening a Case About the Case Dashboard Filtering Tagging
More informationCrawling the Hidden Web Resources: A Review
Rosy Madaan 1, Ashutosh Dixit 2 and A.K. Sharma 2 Abstract An ever-increasing amount of information on the Web today is available only through search interfaces. The users have to type in a set of keywords
More informationTo find a quick and easy route to web-enable
BY JIM LEINBACH This article, the first in a two-part series, examines IBM s CICS Web Support (CWS) and provides one software developer s perspective on the strengths of CWS, the challenges his site encountered
More informationbla bla Open-Xchange Server Mobile Web Interface User Guide
bla bla Open-Xchange Server Mobile Web Interface User Guide Open-Xchange Server Open-Xchange Server: Mobile Web Interface User Guide Published Wednesday, 29. August 2012 version 1.2 Copyright 2006-2012
More informationMaster Syndication Gateway V2. User's Manual. Copyright Bontrager Connection LLC
Master Syndication Gateway V2 User's Manual Copyright 2005-2006 Bontrager Connection LLC 1 Introduction This document is formatted for A4 printer paper. A version formatted for letter size printer paper
More informationEclipse as a Web 2.0 Application Position Paper
Eclipse Summit Europe Server-side Eclipse 11 12 October 2006 Eclipse as a Web 2.0 Application Position Paper Automatic Web 2.0 - enabling of any RCP-application with Xplosion Introduction If todays Web
More information웹소프트웨어의신뢰성. Instructor: Gregg Rothermel Institution: 한국과학기술원 Dictated: 김윤정, 장보윤, 이유진, 이해솔, 이정연
웹소프트웨어의신뢰성 Instructor: Gregg Rothermel Institution: 한국과학기술원 Dictated: 김윤정, 장보윤, 이유진, 이해솔, 이정연 [0:00] Hello everyone My name is Kyu-chul Today I m going to talk about this paper, IESE 09, name is "Invariant-based
More informationE ECMAScript, 21 elements collection, HTML, 30 31, 31. Index 161
A element, 108 accessing objects within HTML, using JavaScript, 27 28, 28 activatediv()/deactivatediv(), 114 115, 115 ActiveXObject, AJAX and, 132, 140 adding information to page dynamically, 30, 30,
More informationIntroduction to JavaScript p. 1 JavaScript Myths p. 2 Versions of JavaScript p. 2 Client-Side JavaScript p. 3 JavaScript in Other Contexts p.
Preface p. xiii Introduction to JavaScript p. 1 JavaScript Myths p. 2 Versions of JavaScript p. 2 Client-Side JavaScript p. 3 JavaScript in Other Contexts p. 5 Client-Side JavaScript: Executable Content
More informationCS50 Quiz Review. November 13, 2017
CS50 Quiz Review November 13, 2017 Info http://docs.cs50.net/2017/fall/quiz/about.html 48-hour window in which to take the quiz. You should require much less than that; expect an appropriately-scaled down
More informationContent Publisher User Guide
Content Publisher User Guide Overview 1 Overview of the Content Management System 1 Table of Contents What's New in the Content Management System? 2 Anatomy of a Portal Page 3 Toggling Edit Controls 5
More informationAutomatically Maintaining Wrappers for Semi- Structured Web Sources
Automatically Maintaining Wrappers for Semi- Structured Web Sources Juan Raposo, Alberto Pan, Manuel Álvarez Department of Information and Communication Technologies. University of A Coruña. {jrs,apan,mad}@udc.es
More informationCHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER
CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER 4.1 INTRODUCTION In 1994, the World Wide Web Worm (WWWW), one of the first web search engines had an index of 110,000 web pages [2] but
More informationConnecting with Computer Science Chapter 5 Review: Chapter Summary:
Chapter Summary: The Internet has revolutionized the world. The internet is just a giant collection of: WANs and LANs. The internet is not owned by any single person or entity. You connect to the Internet
More information5/19/2015. Objectives. JavaScript, Sixth Edition. Introduction to the World Wide Web (cont d.) Introduction to the World Wide Web
Objectives JavaScript, Sixth Edition Chapter 1 Introduction to JavaScript When you complete this chapter, you will be able to: Explain the history of the World Wide Web Describe the difference between
More informationActive Server Pages Architecture
Active Server Pages Architecture Li Yi South Bank University Contents 1. Introduction... 2 1.1 Host-based databases... 2 1.2 Client/server databases... 2 1.3 Web databases... 3 2. Active Server Pages...
More informationWeb Programming Paper Solution (Chapter wise)
Introduction to web technology Three tier/ n-tier architecture of web multitier architecture (often referred to as n-tier architecture) is a client server architecture in which presentation, application
More informationNaresh Information Technologies
Naresh Information Technologies Server-side technology ASP.NET Web Forms & Web Services Windows Form: Windows User Interface ADO.NET: Data & XML.NET Framework Base Class Library Common Language Runtime
More informationAt the Forge JavaScript Reuven M. Lerner Abstract Like the language or hate it, JavaScript and Ajax finally give life to the Web. About 18 months ago, Web developers started talking about Ajax. No, we
More informationSmartAnalytics. Manual
Manual January 2013, Copyright Webland AG 2013 Table of Contents Help for Site Administrators & Users Login Site Activity Traffic Files Paths Search Engines Visitors Referrals Demographics User Agents
More informationWDD Fall 2016Group 4 Project Report
WDD 5633-2 Fall 2016Group 4 Project Report A Web Database Application on Loan Service System Devi Sai Geetha Alapati #7 Mohan Krishna Bhimanadam #24 Rohit Yadav Nethi #8 Bhavana Ganne #11 Prathyusha Mandala
More informationBEAWebLogic. Portal. Overview
BEAWebLogic Portal Overview Version 10.2 Revised: February 2008 Contents About the BEA WebLogic Portal Documentation Introduction to WebLogic Portal Portal Concepts.........................................................2-2
More informationResources required by the Bidders & Department Officials to access the e-tendering System
Resources required by the Bidders & Department Officials to access the e-tendering System Browsers supported This site generates XHTML 1.0 code and can be used by any browser supporting this standard.
More informationBuilding Mashups Using the ArcGIS APIs for FLEX and JavaScript. Shannon Brown Lee Bock
Building Mashups Using the ArcGIS APIs for FLEX and JavaScript Shannon Brown Lee Bock Agenda Introduction Mashups State of the Web Client ArcGIS Javascript API ArcGIS API for FLEX What is a mashup? What
More informationSession 6. JavaScript Part 1. Reading
Session 6 JavaScript Part 1 Reading Reading Wikipedia en.wikipedia.org/wiki/javascript Web Developers Notes www.webdevelopersnotes.com/tutorials/javascript/ JavaScript Debugging www.w3schools.com/js/js_debugging.asp
More informationLoad testing with WAPT: Quick Start Guide
Load testing with WAPT: Quick Start Guide This document describes step by step how to create a simple typical test for a web application, execute it and interpret the results. A brief insight is provided
More informationSkyway Builder Web Control Guide
Skyway Builder Web Control Guide 6.3.0.0-07/21/2009 Skyway Software Skyway Builder Web Control Guide: 6.3.0.0-07/21/2009 Skyway Software Published Copyright 2009 Skyway Software Abstract TBD Table of
More informationMythoLogic: problems and their solutions in the evolution of a project
6 th International Conference on Applied Informatics Eger, Hungary, January 27 31, 2004. MythoLogic: problems and their solutions in the evolution of a project István Székelya, Róbert Kincsesb a Department
More informationWHITE PAPER. Good Mobile Intranet Technical Overview
WHITE PAPER Good Mobile Intranet CONTENTS 1 Introduction 4 Security Infrastructure 6 Push 7 Transformations 8 Differential Data 8 Good Mobile Intranet Server Management Introduction Good Mobile Intranet
More informationApplication Security through a Hacker s Eyes James Walden Northern Kentucky University
Application Security through a Hacker s Eyes James Walden Northern Kentucky University waldenj@nku.edu Why Do Hackers Target Web Apps? Attack Surface A system s attack surface consists of all of the ways
More informationLesson 12: JavaScript and AJAX
Lesson 12: JavaScript and AJAX Objectives Define fundamental AJAX elements and procedures Diagram common interactions among JavaScript, XML and XHTML Identify key XML structures and restrictions in relation
More informationPerformance Evaluation of a Regular Expression Crawler and Indexer
Performance Evaluation of a Regular Expression Crawler and Sadi Evren SEKER Department of Computer Engineering, Istanbul University, Istanbul, Turkey academic@sadievrenseker.com Abstract. This study aims
More informationARCHITECTURE AND IMPLEMENTATION OF A NEW USER INTERFACE FOR INTERNET SEARCH ENGINES
ARCHITECTURE AND IMPLEMENTATION OF A NEW USER INTERFACE FOR INTERNET SEARCH ENGINES Fidel Cacheda, Alberto Pan, Lucía Ardao, Angel Viña Department of Tecnoloxías da Información e as Comunicacións, Facultad
More informationUNIT 3 SECTION 1 Answer the following questions Q.1: What is an editor? editor editor Q.2: What do you understand by a web browser?
UNIT 3 SECTION 1 Answer the following questions Q.1: What is an editor? A 1: A text editor is a program that helps you write plain text (without any formatting) and save it to a file. A good example is
More informationLecture 2 Advanced Scripting of DesignModeler
Lecture 2 Advanced Scripting of DesignModeler 1 Contents Supported API s of DesignModeler Attaching Debugger to DesignModeler Advanced scripting API s of DesignModeler Handlers Tree, Graphics, File, Event
More informationBIG-IP Access Policy Manager : Portal Access. Version 12.1
BIG-IP Access Policy Manager : Portal Access Version 12.1 Table of Contents Table of Contents Overview of Portal Access...7 Overview: What is portal access?...7 About portal access configuration elements...7
More informationDetects Potential Problems. Customizable Data Columns. Support for International Characters
Home Buy Download Support Company Blog Features Home Features HttpWatch Home Overview Features Compare Editions New in Version 9.x Awards and Reviews Download Pricing Our Customers Who is using it? What
More informationCHAPTER 7 WEB SERVERS AND WEB BROWSERS
CHAPTER 7 WEB SERVERS AND WEB BROWSERS Browser INTRODUCTION A web browser is a software application for retrieving, presenting, and traversing information resources on the World Wide Web. An information
More informationTabular Presentation of the Application Software Extended Package for Web Browsers
Tabular Presentation of the Application Software Extended Package for Web Browsers Version: 2.0 2015-06-16 National Information Assurance Partnership Revision History Version Date Comment v 2.0 2015-06-16
More informationSearching the Web for Information
Search Xin Liu Searching the Web for Information How a Search Engine Works Basic parts: 1. Crawler: Visits sites on the Internet, discovering Web pages 2. Indexer: building an index to the Web's content
More informationSEO According to Google
SEO According to Google An On-Page Optimization Presentation By Rachel Halfhill Lead Copywriter at CDI Agenda Overview Keywords Page Titles URLs Descriptions Heading Tags Anchor Text Alt Text Resources
More informationManipulating Database Objects
Manipulating Database Objects Purpose This tutorial shows you how to manipulate database objects using Oracle Application Express. Time to Complete Approximately 30 minutes. Topics This tutorial covers
More informationChapter 9. Web Applications The McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill
Chapter 9 Web Applications McGraw-Hill 2010 The McGraw-Hill Companies, Inc. All rights reserved. Chapter Objectives - 1 Explain the functions of the server and the client in Web programming Create a Web
More informationIntroduction to emanagement MGMT 230 WEEK 5: FEBRUARY 5
Introduction to emanagement MGMT 230 WEEK 5: FEBRUARY 5 Digital design and usability search engine optimization. Measurement and evaluation. Web analytics and data mining Today s Class Search Engine Optimization
More informationA crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index.
A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. The major search engines on the Web all have such a program,
More informationDeposit Wizard TellerScan Installation Guide
Guide Table of Contents System Requirements... 2 WebScan Overview... 2 Hardware Requirements... 2 Supported Browsers... 2 Driver Installation... 2 Step 1 - Determining Windows Edition & Bit Count... 3
More informationSelf Adjusting Refresh Time Based Architecture for Incremental Web Crawler
IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.12, December 2008 349 Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler A.K. Sharma 1, Ashutosh
More informationDeveloping Ajax Applications using EWD and Python. Tutorial: Part 2
Developing Ajax Applications using EWD and Python Tutorial: Part 2 Chapter 1: A Logon Form Introduction This second part of our tutorial on developing Ajax applications using EWD and Python will carry
More informationLesson 5: Introduction to Events
JavaScript 101 5-1 Lesson 5: Introduction to Events OBJECTIVES: In this lesson you will learn about Event driven programming Events and event handlers The onclick event handler for hyperlinks The onclick
More informationInformation Retrieval. Lecture 10 - Web crawling
Information Retrieval Lecture 10 - Web crawling Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 30 Introduction Crawling: gathering pages from the
More informationWebsite Report for facebook.com
Website Report for facebook.com Fife Website Design 85 Urquhart Crescent 07821731179 hello@fifewebsitedesign.co.uk www.fifewebsitedesign.co.uk This report grades your website on the strength of a range
More informationCreate and Apply Clientless SSL VPN Policies for Accessing. Connection Profile Attributes for Clientless SSL VPN
Create and Apply Clientless SSL VPN Policies for Accessing Resources, page 1 Connection Profile Attributes for Clientless SSL VPN, page 1 Group Policy and User Attributes for Clientless SSL VPN, page 3
More informationComprehensive AngularJS Programming (5 Days)
www.peaklearningllc.com S103 Comprehensive AngularJS Programming (5 Days) The AngularJS framework augments applications with the "model-view-controller" pattern which makes applications easier to develop
More informationThe Evaluation of Just-In-Time Hypermedia Engine
The Evaluation of Just-In-Time Hypermedia Engine Zong Chen 1, Li Zhang 2 1 (School of Computer Sciences and Engineering, Fairleigh Dickinson University, USA) 2 (Computer Science Department, New Jersey
More information