Chapter 2: Literature Review


2.1 Introduction

A literature review provides knowledge of, understanding of and familiarity with the research field undertaken. It is a critical study of related work from various sources. In line with the research objectives discussed in chapter one, a few critical areas related to the research were identified and reviewed. These reviews are presented in the following sections.

2.2 World Wide Web

The World Wide Web is a global repository of information. It has grown from a few thousand pages in 1993 to more than 2 billion pages at present, and has become one of the primary means of publishing and locating information (Shkapenyuk & Suel, 2002). The World Wide Web is the combination of resource identifiers, hyperlinks, the client server computing model and markup language. These elements are discussed next.

Resource identifiers are unique identifiers that locate a particular resource such as a text file, an image or another item of information. The Uniform Resource Locator (URL) is a resource identifier used to locate resources. A URL takes the form of a string that describes how to find a resource on the Internet. URLs have two main components: the protocol needed to access the resource and the location of the resource (Sun MicroSystems, 2008).

A hyperlink is a piece of information which links part of a document to another document. HTML requires an anchor element to create a hyperlink, with the href attribute of the anchor set to a valid URL.

In the client server computing model, client software requests resources or services from server software, and the server software provides the client with the requested resources and services. On the World Wide Web, the browser is the client program, which requests web pages from web servers using URLs.
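As a small illustration of the two URL components described above, the standard-library sketch below splits a URL into its protocol and its location; the URL itself is hypothetical.

```python
from urllib.parse import urlparse

# A hypothetical URL, used purely for illustration.
url = "http://intranet.example.com:8080/docs/manual.html"

parts = urlparse(url)
print(parts.scheme)  # protocol needed to access the resource -> "http"
print(parts.netloc)  # location (host and port) of the resource -> "intranet.example.com:8080"
print(parts.path)    # path to the resource on that host -> "/docs/manual.html"
```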

The markup language (e.g. HTML, or Hyper Text Markup Language) consists of character sequences called tags that indicate formatting elements such as headings, bulleted lists, hyperlinks and more (Shelly, Cashman & Mick, 2000).

Due to the huge size of the Web, manually browsing documents using the hypertext structure is no longer effective for resource discovery. This is where search engines appeared, to help users find the information they require (Mudassar, Yang & Adeel, 2004).

2.3 Search Engine

The traditional search engine is made up of a crawler, an indexer and a query processor. The web is seen as a large graph with pages as its nodes and hyperlinks as its edges (Gautam et al., 2004). The tag tree structure of HTML, which contains embedded text and HREF links, helps surfers locate information by clicking on links. To exploit this structure, people have sought an automatic program that emulates the human behavior of clicking links and discovering resources (Chang et al., 2005). That program is none other than a crawler.

A crawler is also known as a bot, a robot or a spider (Thom, Doug & Jim, 1998). The first crawler, the World Wide Web Wanderer, was developed in 1993. A crawler visits web sites and reads pages in order to create entries for a search engine index. It can either retrieve a particular document or use some specified searching algorithm to recursively retrieve all documents that are referenced from some beginning base document (Jansen, Spink & Pederson, 2003).

The crawler starts at a given URL, also called the seed URL (a node), downloads the page, parses it and follows the hyperlinks (edges) within the page to reach other pages. In this way, it creates a local collection of pages (Bullot, Gupta & Mahonia, 2003). Once the crawler retrieves the document at a URL, it may decide to parse it and index it by inserting it into the database (Thom et al., 1998). What is indexed differs among crawlers: it can be the document's HTML title, the first few paragraphs, the meta tags or even the whole document. The web page is then used as a source of new URLs to visit and index. General purpose crawlers insert the URLs into a queue and visit them in a breadth-first manner.
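A minimal sketch of this queue-driven, breadth-first crawl loop is given below, using only the Python standard library. The seed URL and the page limit are illustrative assumptions; a real crawler adds politeness delays, robots.txt handling and much more.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags, standing in for a full parser stage."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=50):
    queue = deque([seed])      # the frontier: URLs discovered but not yet visited
    visited = set()
    pages = {}                 # the local collection of downloaded pages
    while queue and len(pages) < max_pages:
        url = queue.popleft()  # FIFO order gives the breadth-first visiting order
        if url in visited:
            continue
        visited.add(url)
        try:
            with urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", "replace")
        except OSError:
            continue           # skip unreachable hosts, timeouts, HTTP errors
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)   # resolve relative links
            if absolute.startswith("http"):
                queue.append(absolute)      # enqueue for a later, deeper level
    return pages

# pages = crawl("http://intranet.example.com/")  # hypothetical seed URL
```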

A traditional search engine typically indexes most web pages on the World Wide Web in a centralized database used for all query processing (Ke & Wing, 2003). However, the expectation of fetching all pages is not realistic, given the web's growth and refresh rates (Altingovde & Ulusoy, 2004).

In brief, search engines work this way: crawlers start off with an initial set of URLs. The URLs are placed in a queue and retrieved in a specific order. The crawler downloads the page at each URL. The URLs within each page are extracted and added back into the queue. The downloaded pages are passed to the indexer program to be indexed, and the indexed information is stored in a database. This information in the database is then accessed by the query processor (Ricky & April, 2000).

2.4 Search Algorithms

The goal of crawlers is to provide search capabilities over the web as a whole. Such a goal lends them to search strategies like breadth-first search (Chang et al., 2005). Given the virtually limitless resources available on the World Wide Web, it is important for search engines to download only the best pages. These best pages are located using different strategies. Some of the popular algorithms are breadth first, best first, backlink, PageRank and random.

Breadth First: This is the simplest algorithm for crawling. Pages are visited as they are discovered: all pages in the current level are visited, in the order they were discovered, before pages in the next level. Breadth first begins at the root node and explores all neighboring nodes; then, for each neighboring node, it explores their unexplored neighbors. Breadth-first order is known to build domain specific collections of reasonable quality (Najork & Wiener, 2001).

Best First: Pages are not simply visited in the order they are discovered. Instead, the crawl expands the most promising node, chosen according to some rule; heuristics are used to rank the pages, and the rule aims to predict the path to the most relevant pages. Pages considered relevant are visited first, while non-relevant ones are pushed back in the queue. Though this probes in the direction of relevant pages, there is sometimes the danger of missing out many relevant pages (Bergmark, Lagoze & Sbityakov, 2002).
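A best-first frontier can be sketched with a priority queue, as below. The scoring heuristic (counting topic words in a link's anchor text) and the sample links are illustrative assumptions, not the specific rule of any system cited here.

```python
import heapq

def score(anchor_text, topic_words):
    """Toy relevance heuristic: count topic words appearing in the anchor text."""
    return sum(1 for word in anchor_text.lower().split() if word in topic_words)

topic = {"crawler", "search", "index"}
frontier = []  # a min-heap; scores are negated so the most promising URL pops first

# Hypothetical discovered links as (URL, anchor text) pairs.
discovered = [("http://a.example/ir", "search engine crawler design"),
              ("http://b.example/cats", "pictures of cats")]
for url, anchor in discovered:
    heapq.heappush(frontier, (-score(anchor, topic), url))

while frontier:
    neg, url = heapq.heappop(frontier)
    print(f"visit {url} (score {-neg})")  # pages predicted relevant come out first
```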

Backlink: Backlinks are incoming links to a website, used to rate how popular or important the website is; the more backlinks a website has, the more popular it is. This algorithm crawls the pages with the highest number of known incoming links first (Björneborn & Ingwersen, 2004).

PageRank: PageRank is a link analysis algorithm that assigns a numerical weighting to each page. The weighting depends not only on the number of incoming links to a page but also on the importance of the pages providing those links. Pages with more incoming links from more important pages are crawled first (Björneborn & Ingwersen, 2004).

Random: This algorithm selects the next page to crawl at random from the set of uncrawled pages (Cho, Garcia & Page, 1998).

Cho, Garcia and Page (Cho, Garcia & Page, 1998) used connectivity-based document quality metrics to direct a crawler towards high-quality pages. They ran individual crawls with different ordering metrics: breadth first, backlink count, PageRank and random. The goal of the experiment was to identify which ordering metric found the most hot pages fastest, where hot pages were pages with a high number of incoming links or a high PageRank. They found that both the PageRank and breadth-first metrics worked equally well in finding hot pages faster than the other ordering metrics, as shown in Table 2.1. Najork and Wiener (Najork & Wiener, 2001) concluded in their paper that a crawler that downloads pages in breadth-first search order discovers the highest quality pages during the early stages of the crawl. Discovering high-quality pages early in a crawl is desirable for search engines, as crawlers are only able to crawl a fraction of the web.
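The PageRank weighting described above can be sketched as a simple power iteration over a toy link graph. The graph, the damping factor of 0.85 and the iteration count are illustrative defaults, not values taken from the cited papers.

```python
def pagerank(graph, damping=0.85, iterations=20):
    """graph maps each page to the list of pages it links to."""
    n = len(graph)
    rank = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in graph}
        for page, outlinks in graph.items():
            if not outlinks:
                continue                                # dangling page: ignored in this sketch
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share               # importance flows along links
        rank = new_rank
    return rank

# A tiny hypothetical link graph: A and C both link to B, so B ends up ranked highest.
toy = {"A": ["B"], "B": ["C"], "C": ["A", "B"]}
print(pagerank(toy))
```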

Table 2.1: Search Algorithm Comparison

    Search Algorithm        Fastest in Finding Hot Pages
    Breadth First Search    Yes
    Best First Search       No
    Backlink                No
    PageRank                Yes
    Random                  No

2.5 Intranet

An intranet is similar to the Internet but is restricted to users within a company (Intranets, Enterprise Strategies and Solutions, 1998). It strengthens internal communication and is a central hub for accessing important forms, project lists, employee manuals, organizational policies, agreements and more. An intranet provides quick access to information. Organizations save costs by putting documents on the intranet, as this saves printing and distribution costs. Information is centralized and up to date, everyone has the same version of it, and its availability improves. In addition, an intranet allows for knowledge retention when employees leave the organization, as information is documented. This is why more and more organizations are putting their mission-critical business practices onto intranets (Preston, 1996).

Search Engines in Intranet Environments

Intranet search engines are much the same as public search engines such as AltaVista or Google. The search engine locates documents, extracts the text and stores it in an index file, making an entry for each word. When an end user types a word and clicks the search button, the search engine receives the search query, looks for matching words in the index file, gathers the related document information and sends the information back to the user (Intranets, Enterprise Strategies and Solutions, 1998).
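This index-then-look-up behavior can be sketched as a tiny in-memory inverted index; the sample documents and queries are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.lower().split():
            index[word].add(name)
    return index

def search(index, query):
    """Return the documents matching every word of the query."""
    sets = [index.get(word, set()) for word in query.lower().split()]
    return sorted(set.intersection(*sets)) if sets else []

# Hypothetical intranet documents.
docs = {"leave_policy.html": "annual leave policy and approval form",
        "expenses.html": "travel expenses claim form"}
index = build_index(docs)

print(search(index, "form"))        # both documents contain "form"
print(search(index, "leave form"))  # only leave_policy.html contains both words
```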

Google's Intranet Search Engine MOMA

Google uses its own search appliance to index more than 100 million internal documents. It gives its employees access to contacts, shared bookmarks and refinements. MOMA displays popular search terms, latency times and traffic statistics. All information about Google can be found on the intranet, whether it be product status or the number of employees working for Google (Google Inc, 2007).

Search Engine Studio

Search Engine Studio indexes intranet documents and saves the output files on the server. Indexing is done using one of four available methods: directory scan, FTP scan, link crawler or XML file. Search boxes are then added to existing HTML pages, through which searches can be made. Automatic database updates are also performed to detect newly added documents. It also allows permissions to be set, which decide whether a particular user is able to view a particular search result. It is very flexible: administrators are able to configure the search engine's functionality as they need (Extreme.com, 2008).

2.6 Monitoring

Monitoring is systematic and purposeful observation. It provides information that is useful in analyzing situations, determining whether inputs are well utilized, identifying problems and finding solutions. Monitoring only produces data, which must then be analyzed and utilized in order to manage.

2.6.1 Monitoring Study

Google applies monitoring in its search appliance and Mini products in the following ways (Google Inc, 2008):

- Monitor crawl status. The crawler status gives a summary of the crawl for the past 24 hours. Status messages give an overall picture of whether links were successfully crawled; unsuccessful messages include connection timeout, host unreachable and page not found. Through these status messages, administrators can trace why certain links could not be downloaded.
- Monitor current crawls. While the search appliance is crawling, its history can be viewed through reports. The reports show each link in the current domain that has been fetched, with timestamps for the last 10 fetches; if a fetch is unsuccessful, the error message is also listed. The user can also navigate to lower levels, such as a particular host, directory or link, and at each level the crawl status is given.

Applications Manager URL Monitoring (I3Systems, 2008) applies monitoring in these ways:

- Monitor the performance and availability of websites using HTTP and HTTPS requests, from an end-user perspective (a minimal sketch of such a check follows this list). If the website is not accessible, notifications are sent to the administrator or some corrective action is triggered.
- Monitor a single URL or a sequence of URLs, to ensure that they are always functioning; these URLs can also be checked for attributes such as response time.
- Record a sequence of HTTP requests and configure it to be checked at regular intervals, to ensure that certain transactions are carried out correctly.
- Get instant notifications when there are problems with the application, such as connectivity problems, slow page load times or content errors.
- Generate reports to view the performance of the website over a period of time.
- Validate web pages for specific error messages.
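Below is a minimal sketch of this kind of availability and response-time check, using only the standard library; the URL, the threshold and the alert action are illustrative assumptions.

```python
import time
from urllib.request import urlopen

def check_url(url, max_seconds=2.0):
    """Fetch the URL once and report its availability and response time."""
    start = time.monotonic()
    try:
        with urlopen(url, timeout=10) as response:
            status = response.status
    except OSError as exc:
        return {"url": url, "ok": False, "error": str(exc)}
    elapsed = time.monotonic() - start
    return {"url": url, "ok": status == 200 and elapsed <= max_seconds,
            "status": status, "seconds": round(elapsed, 3)}

result = check_url("http://intranet.example.com/")  # hypothetical URL
if not result["ok"]:
    print("ALERT:", result)  # a real monitor would notify the administrator here
```

A real monitor would run such checks at regular intervals and record the results for reporting, as the tools above do.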

Monitoring Context

Monitoring, in the context of the proposed tool, means collecting information on the resources that clients access most frequently. In this way, the importance of the resources placed on the web server can be assessed: resources which are not very popular can be removed, and more resources with content similar to the popular ones can be added. The application is also able to trace the frequency of client usage. This helps identify the users who access the application most frequently and the types of resources they request, so resources can be adjusted to accommodate these users better. Administrators can then judge how useful the intranet content is and manage it better. The application achieves the characteristics of monitoring by analyzing the frequency of client access and the types of resource content requested, and by recording application errors as logs. The resulting data can be used by administrators to identify problems and find solutions.
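The frequency analysis described here can be sketched with a counter over access-log records; the log format and its entries are hypothetical.

```python
from collections import Counter

# Hypothetical access-log records: (user, requested resource).
log = [("alice", "/docs/manual.pdf"),
       ("bob", "/docs/manual.pdf"),
       ("alice", "/forms/leave.doc"),
       ("alice", "/docs/manual.pdf")]

resource_hits = Counter(resource for _, resource in log)
user_hits = Counter(user for user, _ in log)

print(resource_hits.most_common(1))  # the most popular resource
print(user_hits.most_common(1))      # the most frequent user
```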

2.7 Architectural Design

Two architectural designs were explored for the prototype development. Both are explained in the following sections.

Client Server Architecture

Client server is a computational architecture in which client processes request resources from server processes (Roger, 2004). It divides a computer application into three basic components: client, server and network. The client is usually the front-end processor operated by the end user, the server provides processing capability or information to the client, and the network transmits data between clients and servers. This architecture is commonly used for file transfers, database applications and web applications (Aptech WorldWide, 2002).

P2P (Peer to Peer) Architecture

In this architecture, each workstation has equivalent capabilities and responsibilities, and the system relies on the computing power and bandwidth of the participating workstations. It is used for ad-hoc connections, and P2P technology is commonly used for file sharing, whether of audio, video or data. Workstations are equal peers, as each serves as both server and client. All nodes provide resources, including bandwidth, storage space and computing power, so the capacity of the system increases as demand increases. Robustness is also high: data is found on multiple peers, so there is no dependency on a single server and thus no single point of failure. Peer to peer is categorized into two types. One is pure peer to peer, where peers are equal and there is no central server. The other is hybrid peer to peer, where a central server manages the network, keeps information on the peers and responds to requests for information (Aptech WorldWide, 2002).

2.8 Search Engine Approaches

Search engines differ in their approach; three approaches are discussed here. Major search engines such as Google, Yahoo and AltaVista crawl and index a large portion of the web and are therefore able to return longer lists of results. Specialized content search engines, on the other hand, only crawl and index specific content, for example focusing only on technology-related searches; this produces a shorter but more focused list of results. Finally, individual web sites, especially larger corporate sites, may use a search engine to index and retrieve the content of just their own site.

2.8.1 Search Engine Study

The following search engines were studied in order to better understand the mechanisms of search engines.

Google

Google runs on a distributed network of thousands of low-cost computers and can therefore carry out fast parallel processing, a method of computation in which many calculations are performed simultaneously, significantly speeding up data processing. Google has four distinct parts: the Googlebot web crawler, the indexer, the query processor and the page ranker.

The Googlebot web crawler finds and fetches web pages: it requests pages from web servers and downloads them. Each fetched page is passed to the indexer, and all links on the page are inserted into a queue for subsequent crawling; Googlebot gives the indexer the full text of the pages it finds.

The indexer sorts every word on every page and stores the resulting index of words in Google's index database. Each index entry stores a list of the documents in which the term appears, together with its locations. This data structure allows rapid access to documents that contain the user's query terms.

The query processor compares the search query to the index and returns the pages on which the query is found. It comprises the user interface (the search box), the "engine" that evaluates queries and matches them to relevant documents, and the results formatter.

The page ranker ranks Google's web pages. Pages are ranked based on popularity, the position and size of the search terms, and the proximity of the search terms to one another (Google Inc, 2006).
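An index entry of the kind described above (documents plus the locations of each term) can be sketched as a positional inverted index; the sample pages are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {document: [word positions]}, storing term locations."""
    index = defaultdict(lambda: defaultdict(list))
    for name, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word][name].append(position)
    return index

# Hypothetical pages.
docs = {"page1": "web search engines index the web",
        "page2": "search the intranet"}
index = build_positional_index(docs)

print(dict(index["search"]))  # {'page1': [1], 'page2': [0]}
```

Storing positions as well as documents is what makes it possible to rank by the proximity of query terms to one another, as mentioned above.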

Become.com

Become.com is a search engine that helps people find product reviews and relevant buying information (Become.com, 2005). It uses a Java technology based web crawler, able to obtain information on over 3 billion web pages, writing well over 8 terabytes of data to 30 fully distributed servers in seven days. Java was the chosen platform due to its robust memory management, support for multi-threading, built-in network library and Remote Method Invocation.

The search engine uses Affinity Index Ranking, which understands the context of a page. It applies advanced concepts from physics and engineering dynamics and evaluates a page based on what other pages say about it, and can therefore give users better search results. The system consists of a crawler, a controller and a fetcher. The crawler controller finds seed pages and identifies further links from them. The fetcher classifies information by running checks on pages, identifying page type and language and filtering out duplicates. This information is sent back to the crawler to guide the crawl. The crawler builds a web index from the URLs, which serves as a searchable database (Janice, 2005).

Nutch

Nutch is a complete open source web search engine package that aims to index the World Wide Web as effectively as commercial search sites (Brin & Page, 1998). It operates at one of three scales: local file system, intranet, or the whole World Wide Web. The three scales differ from one another. Crawling a local file system is reliable compared with the other two scales, as network errors do not occur and caching of page content is unnecessary. At the other extreme, crawling the whole web creates many engineering problems, such as how to partition work between a set of crawlers and how to cope with broken links and duplicate content.

Nutch is divided into two parts, the crawler and the indexer. The crawler fetches pages and turns them into an inverted index, which is used to answer users' queries. The crawler uses a web database that stores pages and links, the number of links in each page, fetch information specifying when a page has to be refetched, and a page score indicating the importance of the page. A page represents a page on the web and is indexed by its URL, while a link represents a link from one page to another; the nodes are pages and the edges are links. A segment is a collection of pages fetched and indexed by the crawler in a single run. A fetchlist for a segment is the list of URLs for the crawler to fetch, and is generated from the web database.

The fetcher output is the data retrieved from the pages in the fetchlist; it is indexed and stored in the segment. The index is the inverted index of all the pages the system has retrieved, created by merging all the individual segment indexes.

Crawling is a cyclical process: the crawler generates a set of fetchlists from the web database, a set of fetchers downloads the pages, the web database is updated with newly found links, and the crawler then generates a new fetchlist. This cycle is called the generate/fetch/update cycle. The Nutch crawler works as follows. A new web database is created and root URLs are injected into it. A fetchlist is generated from the web database in a new segment, content from the URLs in the fetchlist is fetched, and the web database is updated with links from the fetched pages. These steps are repeated until the required depth is reached. The segments are then updated from the web database, the fetched pages are indexed, duplicate content is eliminated from the indexes, and the indexes are merged into a single index for searching (Tom, 2006).
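The generate/fetch/update cycle can be sketched as the loop below. The web-database structure and the fetch_page stub are simplified assumptions for illustration, not Nutch's actual data structures.

```python
def crawl_cycle(webdb, fetch_page, depth=3):
    """webdb maps url -> {"fetched": bool, "links": [...]}; depth bounds the cycles."""
    for _ in range(depth):
        # Generate: build a fetchlist of pages not yet fetched.
        fetchlist = [url for url, record in webdb.items() if not record["fetched"]]
        if not fetchlist:
            break
        for url in fetchlist:
            # Fetch: download the page and extract its outlinks.
            links = fetch_page(url)
            webdb[url] = {"fetched": True, "links": links}
            # Update: add newly discovered links to the web database.
            for link in links:
                webdb.setdefault(link, {"fetched": False, "links": []})
    return webdb

# A hypothetical stub standing in for a real fetcher.
fake_web = {"http://x.example/": ["http://x.example/a", "http://x.example/b"],
            "http://x.example/a": ["http://x.example/"]}
webdb = {"http://x.example/": {"fetched": False, "links": []}}
crawl_cycle(webdb, lambda url: fake_web.get(url, []))
print(sorted(webdb))
```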

Win Web Crawler

Win Web Crawler is a high-speed, multi-threaded crawler which saves data to the local disk. It has filters for URL, text, data, domain and date, and it offers user-selectable recursion levels, retrieval threads, timeout, proxy support and many other options (DownloadThat.com, 2006). Win Web Crawler queries all popular search engines, extracts all matching URLs from the search results, removes duplicate URLs and finally visits the websites and extracts data from them. The search engines to be queried can be selected, and a depth setting specifies how deep the crawler should crawl. Overall, this is purely a crawler program: it only crawls and finds pages; indexing and query processing functionality are not included (WinWebCrawler.com, 2006).

World Wide Web Crawler

The World Wide Web crawler points out the weaknesses of the client server architecture, in which the central server manages all status information about URLs visited and to be visited. This requires a significant amount of network resources, and crawling and indexing take a long time to complete, so it is not possible to provide up-to-date versions of frequently updated pages. In response, the research work exploits the distributed nature of web data by crawling the web with ordinary computers that are already distributed. It works as follows. URLs are partitioned by their hash values across the participating crawlers. When a crawler finds a URL in a page, it calculates the URL's hash value and sends the URL to the node that owns that value at the time (the home node). The home node checks whether the URL has been visited and, if not, schedules a future visit to it. When a new crawler joins a crawling session, a participating node splits its hash range in two and gives one half to the new node; if a crawler permanently leaves a session, it hands its hash range to another node. This solution is economical, as work is distributed among nodes and every node carries an equal share; it also reduces network traffic (Takahashi et al., 2002).
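The hash-based assignment of URLs to crawler nodes can be sketched as below. The use of MD5 and a simple modulus is an illustrative simplification: the scheme described above assigns hash ranges rather than a modulus, so that a joining or leaving node redistributes only part of the hash space.

```python
import hashlib

def home_node(url, nodes):
    """Assign a URL to one of the participating crawler nodes by its hash value."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

nodes = ["crawler-1", "crawler-2", "crawler-3"]  # hypothetical node names
for url in ["http://a.example/", "http://b.example/page", "http://c.example/x"]:
    print(url, "->", home_node(url, nodes))  # each node handles a share of the URLs
```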

Existing Offline Browser Tools Study

Available tools which support website downloading are reviewed below.

WebReaper (Meher, 2004)

WebReaper is a web crawler that downloads pages and pictures locally, enabling the user to view websites without being connected to the Internet. The locally saved files can be browsed as if they were being read directly from the Internet. WebReaper can resume downloads and update websites by downloading only the files which have changed. It supports different types of filter options: users can select the types of files they want to download by setting the filter option. The downloaded websites can then be viewed with any browser.

HTTrack (Meher, 2004)

This tool is able to download a website from the Internet to a local directory. Files and images are transferred from the server to the local computer. Though not connected to the server, the user can browse the website from link to link as though viewing it online. HTTrack can update an existing mirrored site and resume interrupted downloads.

Download URLs Utility (Soft82.com, 2007)

This tool allows websites to be downloaded to the hard disk. Downloading is done to a user-specified level using multi-threading; the number of threads used to download files can be selected by the user, who can also choose whether to download text or images.

Search Engine and Offline Browser Comparison

A typical search engine consists of three major conceptual components: a web crawler, an indexer and a query processor (Arasu et al., 2001). Table 2.2 shows a comparison of these conceptual components among the different search engines and offline browsers reviewed. These three components are to be adopted in the prototype tool, with the additional features of automatic link download, monitoring, and search and download from peer computers.

- Crawler: visits web sites, reads and downloads links.
- Indexer: the information on pages read and downloaded by the crawler is indexed.
- Query processor: when the user makes a search, the query processor compares the search entry with the indexed entries; if a match is found, the page information is presented to the user.
- Automatic link download: links within a page are downloaded automatically and recursively, to a user-specified depth, from the server to the local client, as sketched after this list. This allows the user to browse a website without being connected to the server.
- Monitoring: user access frequency and link access and download frequency are monitored to track the popularity both of the software and of the resources provided on the intranet. In addition, missing resources can be traced by viewing the errors that occur during crawls.
- Search and download from peer computers: the user is able to search and retrieve resources not only from the server but from all shared folders on connected computers in the network. This makes information sharing and downloading in an enterprise more efficient.
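The automatic link download feature can be sketched as a depth-limited recursive fetch. The crude regular-expression link extraction, the seed URL and the depth are illustrative assumptions; the earlier crawler sketch shows a more careful way to extract links.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

HREF = re.compile(r'href="([^"]+)"')  # crude link extraction, enough for a sketch

def download(url, depth, seen=None):
    """Fetch a page, then recursively fetch its links down to the given depth."""
    seen = set() if seen is None else seen
    if depth < 0 or url in seen:
        return {}
    seen.add(url)
    try:
        with urlopen(url, timeout=10) as response:
            html = response.read().decode("utf-8", "replace")
    except OSError:
        return {}
    pages = {url: html}  # a real tool would save these files to the local disk
    for link in HREF.findall(html):
        absolute = urljoin(url, link)
        if absolute.startswith("http"):
            pages.update(download(absolute, depth - 1, seen))
    return pages

# pages = download("http://intranet.example.com/", depth=2)  # hypothetical seed
```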

Table 2.2: Search Engine and Offline Browser Comparison

                            Crawler  Indexer  Query      Automatic Link  Monitor  Peer
                                              Processor  Download                 Access
    Google                  Yes      Yes      Yes        -               Yes      -
    Become.com              Yes      Yes      Yes        -               -        -
    Nutch                   Yes      Yes      Yes        -               -        -
    Win Web Crawler         Yes      -        -          -               -        -
    World Wide Web Crawler  Yes      -        -          -               -        -
    Search Engine Studio    Yes      Yes      Yes        -               -        -
    WebReaper               Yes      -        -          Yes             -        -
    HTTrack                 Yes      -        -          Yes             -        -
    Download URLs Utility   Yes      -        -          Yes             -        -

    (Yes = functionality exists.)

2.9 Outcome

The prototype tool, a desktop- and intranet-based searching, downloading and monitoring tool, is proposed to adopt the traditional search engine technology made up of a crawler, an indexer and a query processor. The additional features of automatic download, peer searching and downloading, and monitoring are also proposed to be incorporated.

Search engines do not expose their crawler programs to users; the crawler functions only to populate the database with content. The prototype application, in contrast, exposes the crawler functionality to users. Links are found within pages using the breadth-first-search algorithm, and users can download linked pages within linked pages simultaneously. They can thus obtain relevant pages without the effort of clicking and downloading each link; it is achieved automatically.

Monitoring takes the form of data collection. Data on the frequency of user access, the content requested and application errors (e.g. resources not found) are captured to assist administrators in managing the resources and the application more effectively.

The prototype tool supports both the client server architecture and the peer to peer architecture. Users are able to download resources from both server and peer machines, which avoids overburdening the server with traffic. Monitoring of user access, content and web page requests allows administrators to identify the popularity of the tool and of its content among users. Identifying content popularity allows efficient and economical management of server resources and space: content similar to popular content can be added, while unpopular content can be removed. In addition, application errors and missing resource links can be traced, helping administrators to debug the application and maintain its resources effectively.

In conclusion, this prototype application is developed as a combination of the existing components of a search engine, using the breadth-first-search algorithm, the client server and peer to peer architectures, and monitoring features, for the desktop and intranet environment.

Summary

This literature review has discussed search engines and their components, search algorithms, architectures, intranets, monitoring, and the features and implementations of existing search engines. The information and outcomes gathered from this review serve as the input to the next phase of the research, which is discussed in chapter three.


More information

Efficient Crawling Through Dynamic Priority of Web Page in Sitemap

Efficient Crawling Through Dynamic Priority of Web Page in Sitemap Efficient Through Dynamic Priority of Web Page in Sitemap Rahul kumar and Anurag Jain Department of CSE Radharaman Institute of Technology and Science, Bhopal, M.P, India ABSTRACT A web crawler or automatic

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How

More information

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search CSE 454 - Case Studies Indexing & Retrieval in Google Some slides from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm Logistics For next class Read: How to implement PageRank Efficiently Projects

More information

Google Search Appliance

Google Search Appliance Google Search Appliance Getting the Most from Your Google Search Appliance Google Search Appliance software version 7.4 Google, Inc. 1600 Amphitheatre Parkway Mountain View, CA 94043 www.google.com GSA-QS_200.03

More information

Web Technology. COMP476 Networked Computer Systems. Hypertext and Hypermedia. Document Representation. Client-Server Paradigm.

Web Technology. COMP476 Networked Computer Systems. Hypertext and Hypermedia. Document Representation. Client-Server Paradigm. Web Technology COMP476 Networked Computer Systems - Paradigm The method of interaction used when two application programs communicate over a network. A server application waits at a known address and a

More information

Automated Online News Classification with Personalization

Automated Online News Classification with Personalization Automated Online News Classification with Personalization Chee-Hong Chan Aixin Sun Ee-Peng Lim Center for Advanced Information Systems, Nanyang Technological University Nanyang Avenue, Singapore, 639798

More information

Connecting with Computer Science Chapter 5 Review: Chapter Summary:

Connecting with Computer Science Chapter 5 Review: Chapter Summary: Chapter Summary: The Internet has revolutionized the world. The internet is just a giant collection of: WANs and LANs. The internet is not owned by any single person or entity. You connect to the Internet

More information

1. Introduction. 2. Salient features of the design. * The manuscript is still under progress 1

1. Introduction. 2. Salient features of the design. * The manuscript is still under progress 1 A Scalable, Distributed Web-Crawler* Ankit Jain, Abhishek Singh, Ling Liu Technical Report GIT-CC-03-08 College of Computing Atlanta,Georgia {ankit,abhi,lingliu}@cc.gatech.edu In this paper we present

More information

Searching the Web for Information

Searching the Web for Information Search Xin Liu Searching the Web for Information How a Search Engine Works Basic parts: 1. Crawler: Visits sites on the Internet, discovering Web pages 2. Indexer: building an index to the Web's content

More information

IJESRT. [Hans, 2(6): June, 2013] ISSN:

IJESRT. [Hans, 2(6): June, 2013] ISSN: IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY Web Crawlers and Search Engines Ritika Hans *1, Gaurav Garg 2 *1,2 AITM Palwal, India Abstract In large distributed hypertext

More information