Performance Evaluation of a Regular Expression Crawler and Indexer


Sadi Evren SEKER
Department of Computer Engineering, Istanbul University, Istanbul, Turkey
academic@sadievrenseker.com

Abstract. This study seeks to optimize the indexer and crawler modules of a search engine for the case where the possible forms of the search phrases are known in advance as a regular expression. A search engine can be considered an expert in an area if the search domain is narrowed and the crawling and indexing modules are optimized for that domain. Such expertise can be modeled with regular expressions, for example by searching only for e-mail addresses or telephone numbers on the Internet. This paper discusses several alternatives for an expert search engine and evaluates their performance.

Keywords: Regular Expression, Search Engine, Crawler, Indexer

1. Introduction

Any advance in search engine technology benefits all Internet users, since search engines are the main gateways to information on the Internet. A classical search engine indexes the information on the Internet by crawling web pages. During the crawling phase, the web spider downloads a web page, extracts its information, indexes the extracted information, and then continues to the next web page. A general-purpose search engine indexes the extracted information in a general notation so that it can serve a great variety of searches.

As shown in Fig. 1, a spider connects to the Internet and supplies information to the indexer, which is responsible for keeping the information for queries. This information can be kept in a database or can stay in memory for faster results. Finally, a user connects to the search engine through a user interface and queries the data in the indexer.

This study concentrates on the question: what if the search engine knows in advance the regular expression representation of the keywords being searched?
In this case the search engine does not need to index and store information that is unnecessary for the search result, and the processing of web pages becomes correspondingly faster.

[Fig. 1. A sample view of a web spider and its components: Internet, Web Crawler, Indexer, User Interface, Database]

This approach is useful when an expert search engine is designed, for example, to search only for personal information on the Internet, such as the address or telephone number belonging to a given name and surname. In that case all search engine components should specialize in personal information only.

Throughout this study, the representation of the search query is assumed to be known in advance in regular expression notation. The web crawler is free to crawl any web page by following the classical crawling algorithms. The indexer is specially optimized for the regular expressions and is built on a B+ tree data structure [1], a choice that follows from our previous research [2]. An overview of the developed system is given in Fig. 2.

[Fig. 2. Deployment diagram of the expert search engine on a given regular expression. Elements: GUI, Regular Expression, Target Web Site, Internet, Web Crawler, Tree, User Interface]

As Fig. 2 shows, the web crawler receives the regular expression [3] that defines its expertise from the user and crawls and indexes the Internet using this regular expression. The user can then query the indexer data structure for any information that matches the regular expression provided initially. Throughout this paper, personal information is used as the example of regular expressions on the Internet. Note that the initial regular expression, and therefore the expertise of the search engine, can easily be updated through user interaction.

2. Regular Expression Spider

A web spider should find links and follow them while maintaining a list of traversed sites and a follow-up queue of sites to visit next. Fig. 3 holds the flow chart of the web spider algorithm. The spider gets a URL from the GUI and starts from this initial page. Before processing any URL, the spider must check the robots.txt file provided by the web site. If the site permits the spider to proceed, the spider finds all the links on the web page and adds them to a list for later traversal. Finally, the spider gets another link from the to-search list. This operation keeps looping until the to-search list is empty.

[Fig. 3. Flow chart of the spider. Steps: Get link from GUI; Check robots.txt; Get link and follow; Found another link?; Already searched?; Add searched; List is empty?; End]

While producing the to-search list for the spider's internal use, the list of already searched sites should also be checked to avoid entering a site twice.
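The spider loop described in this section can be sketched roughly as follows. This is a minimal illustration in Python, not the implementation evaluated in the paper: the names `crawl`, `fetch`, `allowed` and `LinkExtractor`, the `max_pages` limit, and the in-memory `SITE` dictionary are all assumptions introduced for the example. The robots.txt check is passed in as a callable so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, allowed, max_pages=100):
    """Visit pages until the to-search list is empty (or max_pages is hit).

    fetch(url) -> HTML text and allowed(url) -> bool are hypothetical
    callables supplied by the caller (page download and robots.txt check).
    """
    to_search = deque([start_url])  # follow-up queue of links
    searched = set()                # list of already searched pages
    order = []                      # pages in the order they were visited
    while to_search and len(order) < max_pages:
        url = to_search.popleft()
        if url in searched or not allowed(url):
            continue
        searched.add(url)
        order.append(url)
        parser = LinkExtractor()
        parser.feed(fetch(url))
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in searched:
                to_search.append(link)
    return order


# Tiny in-memory "web" used instead of real network access (made-up pages).
SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/">home</a> <a href="/private">P</a>',
    "http://example.test/b": '<a href="/a">A again</a>',
    "http://example.test/private": "secret",
}

visited = crawl(
    "http://example.test/",
    fetch=lambda url: SITE.get(url, ""),
    allowed=lambda url: not url.endswith("/private"),
)
```

In a real crawler, `allowed` could wrap the standard library's `urllib.robotparser.RobotFileParser` and its `can_fetch` method, and `fetch` would download the page over HTTP.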


Another job of the indexer is keeping the keywords in an appropriate data structure. There should also be a connection between the indexer data structure and the GUI modules.

[Fig. 4. IPO diagram of the spider. Elements: Graphical User Interface, Spider, Indexer]

A simple input is fetched from the GUI and all the outputs are sent to the indexer.

3. Indexer

This module is responsible for extracting the keywords from the URLs received from the spider. As already discussed in the analysis part, the HTML tokenizer is a part of the indexer and parses the keywords from the sites. The two most important modules of the indexer are the data structure and the tokenizer. The deployment diagram of the indexer is given in Fig. 5.

[Fig. 5. Connection between the spider and the indexers, and the tokenizers in the indexers. Elements: Internet, Spider, Indexer, Data Structure, Tokenizer, Regular Expression Extractor]

Fig. 5 demonstrates the connection between the spider and the indexers. Each indexer keeps a tokenizer to get the keywords from HTML pages.

[Fig. 6. Data structures between the GUI and the indexer. Elements: data structure to keep user inputs; data structure for search results (page by page) and a pointer]

Fig. 6 demonstrates the connection between the indexer and the graphical user interface.

4. HTML Tokenizer

This module is responsible for extracting the keywords from a given web page. Since the information on the Internet is transferred in HTML format, the indexer must parse HTML. For performance reasons all the modules should run concurrently, so each HTML tokenizer runs in its own thread. The multi-threaded implementation avoids busy waiting between the spider and the rest of the indexing jobs. A simple view of the HTML tokenizer is given in Fig. 7.
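As a rough illustration of such a tokenizer, the sketch below drops the HTML tags, keeps only the strings that match the engine's regular expression, and runs one tokenizer task per page in a thread pool. It is written in Python for brevity and is not the implementation evaluated in the paper. The names `TextCollector`, `tokenize` and `pages`, the sample pages, and the choice of an e-mail pattern (matched case-insensitively) as the engine's expertise are assumptions made for the example.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Keeps only the text that appears outside HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def tokenize(html, pattern):
    """Return the whitespace-separated strings on a page that fully match pattern."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    words = " ".join(collector.chunks).split()
    return [word for word in words if pattern.fullmatch(word)]


# Assumed expertise of the engine for this example: e-mail addresses.
EMAIL = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}", re.IGNORECASE)

# Hypothetical (URL, HTML) pairs as they might be handed over by the spider.
pages = [
    ("http://example.test/a", "<p>Contact <b>ali@baba.com</b> or call us</p>"),
    ("http://example.test/b", "<ul><li>john@hotmail.com</li><li>dean@mit.edu</li></ul>"),
]

# One tokenizer task per page, run concurrently; map() keeps the input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    keywords = list(pool.map(lambda page: (page[0], tokenize(page[1], EMAIL)), pages))
```

Each entry of `keywords` pairs a URL with the list of matching strings found on that page, which is the information the indexer needs for later queries.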

[Fig. 7. Flowchart of the HTML tokenizer. Steps: Get the target URL of the page; Get next string from the page; Match to regular expression?; Add to result string; Page contains more token?; Return result string as keywords list]

Fig. 7 demonstrates a simple flow chart of the HTML tokenizer. The initial step of the tokenizer is getting the target URL from the web spider. The URL and the keywords extracted from it are returned to the indexer for future queries. The tokenizer stops when all the keywords on the web page have been processed. This condition is detected through the file pointer created on the target web page: since the page is downloaded to local memory, the page currently being processed is kept on the local computer and accessed through a file pointer. The file operations at this level are left to the Java network library. The layer above them, which uses the file pointer, keeps track of the strings and HTML tokens. Fortunately, all HTML tags are enclosed within the < and > symbols, so a string tokenizer with knowledge of HTML tokens can easily be converted into an HTML tokenizer. The extracted keywords are kept in a result set with a separator. The result set can be a composite data structure or a well-formatted string, as long as the indexer follows the same protocol.

5. Reverse Indexing

One of the major improvements of this study is implementing the indexer in a manner suited to regular expressions. A regular expression can consist of multiple tokens. For example, Table 1 lists sites with their token varieties:

TABLE 1 SAMPLE SITES AND KEYWORDS
Site URL           Keywords
(not preserved)    Microsoft, product, support, help, research, training, Office, Windows, software, download
(not preserved)    research, offices, about, education, news, students, faculty,

Indexing of the above list can be done in two ways: from web site to keywords, or from keywords to web sites. The latter method is called reverse indexing [4] and improves access time. To keep the regular expression in a tree with reverse indexing, an efficient way of modeling the regular expression is required. The regular expressions are kept by tokens in the tree. For example, the regular expression of an e-mail address can be represented as below:

[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}

The regular expression above checks the validity of e-mail addresses. It has 5 sub parts:

TABLE 2 SUB PARTS OF THE REGULAR EXPRESSION
1   [A-Z0-9._%+-]+   name of mail account
2   @                @ sign of the e-mail address
3   [A-Z0-9.-]+      domain of the e-mail address
4   \.               . sign of the domain
5   [A-Z]{2,4}       extension of domain (2 to 4 chars like com, org, edu)

Consider the examples below and their separation according to Table 2.
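The separation into sub parts, and the idea of keeping the resulting tokens in a tree, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the paper's implementation: nested dictionaries stand in for the tree, the level order (extension, then domain, then name) is an assumption of the example, the group names and the functions `split_email`, `add_to_tree` and `lookup` are hypothetical, and the pattern is applied case-insensitively so that lowercase addresses match.

```python
import re

# The five sub parts of Table 2 as named groups (group names are illustrative).
EMAIL_PARTS = re.compile(
    r"(?P<name>[A-Z0-9._%+-]+)"
    r"(?P<at>@)"
    r"(?P<domain>[A-Z0-9.-]+)"
    r"(?P<dot>\.)"
    r"(?P<extension>[A-Z]{2,4})",
    re.IGNORECASE,
)


def split_email(address):
    """Return the (name, domain, extension) tokens, or None if there is no match."""
    match = EMAIL_PARTS.fullmatch(address)
    if match is None:
        return None
    return match["name"], match["domain"], match["extension"]


def add_to_tree(tree, address, url):
    """Store url under extension -> domain -> name (nested dicts stand in for the tree)."""
    parts = split_email(address)
    if parts is None:
        return False
    name, domain, extension = parts
    tree.setdefault(extension, {}).setdefault(domain, {}).setdefault(name, set()).add(url)
    return True


def lookup(tree, address):
    """Return the set of URLs on which address was found (empty if unknown)."""
    parts = split_email(address)
    if parts is None:
        return set()
    name, domain, extension = parts
    return tree.get(extension, {}).get(domain, {}).get(name, set())


# Index a few sample addresses; the URLs are made up for the example.
tree = {}
for address in ["ali@baba.com", "john@hotmail.com", "bill@hotmail.com", "dean@mit.edu"]:
    add_to_tree(tree, address, "http://example.test/" + address.split("@")[0])
```

A query such as `lookup(tree, "dean@mit.edu")` walks one level per token, so addresses that share an extension and domain share the upper part of the path.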

TABLE 3 PARSING OF SAMPLE E-MAILS
E-mail              Name   @   Domain    .   Extension
ali@baba.com        ali    @   baba      .   com
john@hotmail.com    john   @   hotmail   .   com
bill@hotmail.com    bill   @   hotmail   .   com
paul@yahoo.com      paul   @   yahoo     .   com
dean@mit.edu        dean   @   mit       .   edu

The results in Table 3, obtained from the regular expression, can be indexed in either direction, as in the tree below:

[Fig. 8. Tree representation of regular expression results. Node labels: com, edu; hotmail, yahoo, baba; @; john, bill, paul, ali, dean]

6. Performance Evaluation of Regular Expression Search Engine

Table 4 holds the time performance results of the search engine for different regular expressions.

TABLE 4 PERFORMANCE TABLE OF SAMPLE DOMAINS
Columns: URL, # of pages, Depth, # of links, # of Keywords

TABLE 5 PERFORMANCE TABLE OF SAMPLE DOMAINS (CONTINUED)
Columns: URL, Indexing, Reverse Indexing, Search, Reverse Search

Table 4 displays the performance of the indexing and reverse indexing algorithms for the regular expression covered previously. The data structure for the tree implementations is the B+ tree. Previous research on the data structure of the indexer showed that B-tree variants are the best-suited data structures. Based on this research, the name part of the regular expressions is kept in a B+ tree structure. The search results are measured over e-mail addresses, with 50% of the search queries for existing addresses and 50% for non-existing ones.

7. Conclusion

Some Internet queries can clearly be represented with regular expressions. This study seeks an optimized way of crawling and indexing for search engines in the case where the regular expression of the search queries is known in advance. To the best of our knowledge, the parsing result of a regular expression had not been kept in a leveled tree before this research. This new indexing approach has increased the speed of indexing and queries, and a possible data structure and reverse indexing algorithm are suggested for this case. A better indexing data structure could also be applied to this study in the future.
ACKNOWLEDGEMENT

This study was supported by Scientific Research Projects Coordination Unit of Istanbul University. Project number YADOP

References

[1] National Institute of Standards and Technology, nist.gov, 2009.
[2] Şadi Evren ŞEKER and Banu Diri, Web Spider Performance and Data Structure Analysis, 2009.
[3] Gerard Berry and Ravi Sethi, From regular expressions to deterministic automata, Theoretical Computer Science, 48, 1986.
[4] Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Computer Science Department, Stanford University, 1999.
[5] TUSSE (Turkish Speaking Search Engine).
[6] G.M. Adelson-Velsky and E.M. Landis, An algorithm for the organization of information, Soviet Mathematics 3 (1962).
