Title: Artificial Intelligence: an illustration of one approach.


Name: Salleh Ahshim
Student ID:

Introduction

This essay examines the crawling algorithms and heuristics that web spiders use to retrieve relevant information from the Web. A web crawler is a program that automatically traverses the Web's hypertext structure by recursively retrieving every document that is referenced. For example, a crawler starts with some page and downloads all the pages that page links to. Then, for each of those pages, it downloads all the pages they link to, and so on, ad infinitum.

The following are examples of web applications that use crawlers:

1. Personal spiders are programs used to search for Web pages of interest. For example, businesses use spiders to improve their online experience, optimizing how they buy things, how they gather facts, how they are notified when things change, and how they enforce business rules when making online purchases.
2. Indexing functions, which are needed to create the underlying index of a search engine.
3. Naviguidance. This is a special web browser that assists the user with suggestions based on what it has learnt about the user and about the web page currently being browsed.

Typical design of a Web Crawler

Figure 1 [1] below shows the components of a web crawler and how these components interact with each other, with the Internet, and with the associated database to process the user's request.

Figure 1

To further aid our understanding of how a crawler operates, Figure 2 [1] below is a flow chart that details the workings of a web crawler in terms of the web pages and URLs involved. The diagram also shows how the various guiding classifiers interact with the crawler.

Figure 2

Example of Classifier and Algorithm used in Web Crawlers

Neural Networks

Figure 3 [1] below depicts a three-layer feed-forward neural network whose output layer nodes each represent either a positive or a negative case.

Figure 3

Each node in the input layer (shown as a circle in the figure above) represents one attribute of an example. Every arrow connecting two nodes, called a directed edge, has a weight assigned to it. A weighted input is obtained by multiplying the value at the source node by the weight assigned to the directed edge. The output of a node is obtained by summing the weighted inputs from all the edges connected to that node and passing the sum through a sigmoid function. In a multi-layered neural network, an output can in turn be used as an input to the next layer. The sigmoid function is of the form:

$$f(x) = \frac{1}{1 + e^{-x}}$$

where x is the sum of the weighted inputs from the source nodes in the previous layer. The threshold value typically seen in a sigmoid function is modeled as an additional weight connected to a source node with a constant output of 1. A trained classifier can then be used in the crawler to assign scores to unvisited URLs based on their respective parent pages.
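The node computation described above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the function names (`sigmoid`, `node_output`, `forward`) and the layer representation (a list of weight lists paired with a bias weight) are hypothetical and do not come from the cited paper, and the weights would normally be learnt by training rather than written by hand.

```python
import math


def sigmoid(x: float) -> float:
    """Sigmoid function f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def node_output(inputs, weights, bias_weight):
    """Output of one node: sigmoid of the sum of its weighted inputs.

    The threshold is modeled as an extra weight (bias_weight) on a
    source node whose output is the constant 1.
    """
    total = sum(value * weight for value, weight in zip(inputs, weights))
    total += 1.0 * bias_weight
    return sigmoid(total)


def forward(inputs, hidden_layer, output_layer):
    """Forward pass through a three-layer network (input, hidden, output).

    Each layer is a list of (weights, bias_weight) pairs, one per node
    (a hypothetical representation chosen for this sketch).
    """
    hidden = [node_output(inputs, w, b) for w, b in hidden_layer]
    return [node_output(hidden, w, b) for w, b in output_layer]
```

In a crawler, the input values would be attributes extracted from a parent page, and the resulting output could serve as the score for the unvisited URLs found on that page.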

Variation of Best First Crawler

A backward link, or backlink [2], is a URL on another page on the Internet that points to a particular web page. A backlink-based crawler starts on a given page and builds a list of URLs that have been seen but not yet visited. Once a page has been visited, it is stored in a second list. The last data structure the crawler maintains is a list of the URLs seen on a particular page.

There are several variations of this crawler. They differ in their importance and ordering metrics and in how they use these two metrics.

An importance metric defines how a page is evaluated. Four different criteria are used to evaluate pages:

1. Similarity to a Driving Query Q. This method takes into account the number of times the words used in the query or search appear in the document and in the document collection. The latter figure is usually an estimate.
2. Backlink Count. This method takes into account the number of links that point to a particular page.
3. PageRank. This method is similar to Backlink Count, but it recursively calculates a weighted sum of the backlinks of a page.
4. Location Metric. In this method, the importance of a page is determined by its location, not by its content.

The researchers discussed three types of ordering metrics: Breadth First, Backlink Count and PageRank. When used with the backlink-based crawling algorithm, the Breadth First ordering metric is just a null function, because this crawling algorithm already crawls breadth first.
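The three data structures and the Backlink Count ordering described above can be sketched in Python as follows. This is a minimal illustration, not the algorithm as published in [2]: the function name `backlink_crawl`, the `max_pages` limit, and the `links_of` dictionary (which stands in for downloading a page and extracting its links) are all assumptions made for the example, and backlinks are counted only from pages the crawler has already visited.

```python
def backlink_crawl(start, links_of, max_pages):
    """Visit pages in order of the backlink count observed so far.

    links_of is a hypothetical dict mapping a URL to the URLs it links
    to; a real crawler would fetch and parse each page instead.
    """
    seen_unvisited = [start]  # URLs seen but not yet visited
    visited = []              # pages that have been visited
    page_links = {}           # URLs seen on each visited page
    backlinks = {start: 0}    # backlinks counted from visited pages

    while seen_unvisited and len(visited) < max_pages:
        # Ordering metric: highest backlink count first. max() returns
        # the earliest such URL, so ties fall back to discovery order.
        url = max(seen_unvisited, key=lambda u: backlinks[u])
        seen_unvisited.remove(url)
        visited.append(url)

        # Record each distinct link on the page once, in page order.
        page_links[url] = list(dict.fromkeys(links_of.get(url, [])))
        for target in page_links[url]:
            backlinks[target] = backlinks.get(target, 0) + 1
            if target not in visited and target not in seen_unvisited:
                seen_unvisited.append(target)
    return visited


# Toy link graph (made up for this example).
toy_graph = {
    "a": ["b", "c", "d"],
    "b": ["d"],
    "c": [],
    "d": ["e"],
    "e": [],
}
crawl_order = backlink_crawl("a", toy_graph, max_pages=10)
```

On this toy graph, page "d" is visited before page "c" because it has collected two backlinks by the time "b" has been processed, whereas a plain breadth-first crawl would visit "c" first. Replacing the `key` function with a constant would reduce the ordering to breadth first, which matches the remark that the Breadth First metric acts as a null function.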
The Backlink Count and PageRank ordering metrics, in contrast, use the formulas described under the importance metrics above to sort the URLs that the crawler is to visit.

ID3 Classifying Algorithm used in Construction of Decision Tree

The experiment assumes that the web crawler is crawling in a limited URL domain [3] and that every URL domain has a start page, such as a home page. Anchor text is used to predict the relevancy of the target pages. The priority of an unvisited URL is decided by the output of the decision tree. To construct the decision tree, relevant pages are first identified using a Support Vector Machine classifier, which the user trains with pages that are considered relevant and some that are not. A positive example for the decision tree is a hyperlink that leads to the shortest path between the source and target page, while a negative example is a hyperlink on the source page that does not lead to the shortest path. The researchers then applied the ID3 algorithm to these positive and negative examples. In the event that a term set cannot be classified as either of these examples, all the terms in the set are further classified as either positive or negative using probability. If the set contains more terms that are likely positive, it is classified as a positive case.

Conclusion

Web crawler algorithms were chosen as the topic of this essay because of personal interest and as a possible future project. The research carried out while writing it showed that web crawlers are implemented more widely than previously thought. In addition, studying the results reported in these studies gave a better understanding of the strengths and limitations of each algorithm.

References

1. Gautam Pant, Padmini Srinivasan. Learning to crawl: comparing classification schemes. ACM Transactions on Information Systems, Vol. 23, No. 4, October 2005.
2. Junghoo Cho, Hector Garcia-Molina, Lawrence Page. Efficient crawling through URL ordering. Department of Computer Science, Stanford University, CA 94305, USA.
3. Jun Li, Kazutaka Furuse, Kazunori Yamaguchi. Focused Crawling by Exploiting Anchor Text Using Decision Tree.
