INFOQUEST- A META SEARCH ENGINE FOR USER FRIENDLY INTELLIGENT INFORMATION RETRIEVAL FROM THE WEB

Sachin Agarwal, Pallavi Agarwal
4th Year students, Indian Institute of Information Technology (IIIT) Allahabad
Deoghat, Jhalwa, Allahabad (U.P.), India
Telephone Number: +91-9839692545
{sachinagarwal, pallaviagarwal}@ug.iiita.ac.in

ABSTRACT

This paper describes a method used in the meta search engine InfoQuest to assist the user in accessing information on the web conveniently. InfoQuest classifies web documents into predefined classes and presents them to the user according to the user's fields of interest. It tracks the user's web navigation pattern to learn the user's preferences without interrupting him while he browses the web. To track the user, InfoQuest uses its own browser, Carnival, at the client end.

Keywords: Information Retrieval, user-tracking, Semi-supervised learning, Search-result personalization

1. INTRODUCTION

The World Wide Web is the most promising source of information. The traditional tool for retrieving documents from such a large reservoir of information is the search engine. Search engines treat the words in the query (excluding stop words) as keywords and present results based on the presence of these keywords in web pages. Meta search engines make use of the capabilities of existing search engines and process their results to present them to the user in a better way. InfoQuest takes the search results from top search engines, classifies them into different groups, and presents them according to the user's preferences among the categories. It uses a TSVM [1] classifier, constructed using the Clustering Based Classification (CBC) algorithm [2], to classify the documents into these predefined groups. InfoQuest uses a novel rule-based method to learn the user's interests while he browses the web.
Every activity of the user is tracked while he browses the web in order to learn his interests, so an intelligent browser, Carnival, was developed to record all user activities. This record is then used to present the search results according to the user's preferences.

1.1 Motivation

Search engines return results according to the occurrence of the keywords in web pages. Keywords can be ambiguous, so the results may be a collection of documents related to different subjects (classes). For example, given Apple as the keyword, the results comprise documents related to Apple the company, apple the fruit, the Big Apple Circus, and so on. After getting such results, the user has to scan through all the documents to find those of the category he was searching for. This is an additional overhead for the user and sometimes leads to frustration while browsing the web. Some additional processing of the results returned by search engines is therefore needed to help the user search the web. Available meta search engines perform clustering on the results returned by the search engines before presenting them to the user. InfoQuest instead classifies the search results and orders the classes according to the user's choice.

1.2 Organization of the paper

The rest of the paper is organized as follows. Related work in this field is discussed in Section 2. Section 3 describes the InfoQuest architecture, with a brief description of Carnival, the InfoQuest intelligent web browser, and of the InfoQuest server. Section 4 gives a complete description of how InfoQuest works. User tracking is explained in Section 5. The results are presented and analyzed in Section 6. Concluding remarks, future work and references are given in Sections 7, 8 and 9 respectively.

2. RELATED WORK

MetaCrawler [3] is a meta search engine that searches top search engines in parallel and performs sophisticated pruning on the responses returned. It removes irrelevant, outdated, or unavailable documents.

KartOO [4] is a meta search engine with visual display interfaces. KartOO sends the query to a set of search engines, gathers the results, compiles them, and represents them in a series of interactive maps through a proprietary algorithm.

Grouper [5] uses the results of HuskySearch (a meta search engine) and partitions them into clusters (using the Suffix Tree algorithm), that is, groupings of URLs that contain similar content. By generating high-quality clusters with simple descriptions for novice users, Grouper provided an effective way of organizing search results into collections for ease of browsing.

Google Personalized web search [6] gives personalized search results to the user. To obtain personalized results, the user first has to create a profile; the search engine then sorts the documents, presenting the results that are more relevant to the user first and the less relevant results after them. Google has developed new algorithms that dynamically reorder results by weighting the interests the user enters in his profile.

Newt [7] uses a keyword-based filtering algorithm, and the system learns user preferences through relevance feedback and genetic algorithms.

InfoQuest differs from these systems: none of the information retrieval systems described above both classifies the documents and tracks the user to learn how his interests change over time.

3. INFOQUEST ARCHITECTURE

InfoQuest has its own specialized browser, Carnival, dedicated to tracking the user's web navigation pattern. A brief overview of the browser and the server is given below.

3.1 Carnival- the InfoQuest browser

A normal web browser provides no assistance in personalizing search results. Effective personalization of the user's query results can be achieved by tracking the user's psychology while he browses web pages. InfoQuest incorporates an intelligent web browser, Carnival, dedicated to this purpose. The browser must be installed on the user's machine in order to use the InfoQuest meta search facility. Carnival keeps track of user activities, such as which kinds of web pages the user browses and prefers.
On the basis of the user's web browsing pattern, it builds a profile for the user that contains the user's preference information. Carnival sends this preference information to the server (which provides the search results in the prescribed format) along with the query, so that the results are ordered according to the user's choice. The user's preference information includes the classes of web pages and the page characteristics, such as font color and page background, that the user is interested in. The server then returns the results in the user's preference order. Whenever the user browses a web page, Carnival also obtains the class and page characteristics of that page from the server.

3.2 InfoQuest Server

The heart of the InfoQuest intelligent meta search system is the InfoQuest server. The InfoQuest browser sends the query and the user's preference order to the server, which fetches the search results by meta searching some of the top-rated search engines. The server then merges the results obtained from the search engines and removes URLs that are repeated across the results of different search engines. Next, the server classifies the web pages pointed to by the URLs into the predefined categories, ranks the results within each category according to the page characteristics that the user prefers, and sends the ranked results (ordered according to the user's preference order) to Carnival running at the client end. To avoid classifying the same documents repeatedly, the server maintains a database that stores the class of each web page and its page characteristics, such as font color, font size, page background color, and the ratio of the number of images to the text size (a measure of how graphical the page is).

4. HOW INFOQUEST WORKS

4.1 Interaction with Search Engines and fetching web pages

On receiving a query along with the user's preference information from the client through Carnival, the InfoQuest server directs the query to the search engines. The result sets obtained from the search engines contain many common URLs, so the result sets are merged and the repeated URLs are removed. The class of a document can be determined from the content of the web page hosted at the URL in the search result. Although search engines provide a title, a snippet (optional), and a URL for each result, classification is not based on this information alone, because the snippet often contains incomplete and ambiguous information that may lead to inaccurate classification. Consider, for example, the web result shown below:

Technorati: Tag: apple... This page shows goodies from the web about apple. To contribute, just make a post to your blog about apple and include the link below. More Info»... www.technorati.com/tag/apple

From this information alone it is not easy to determine the class of the document hosted at www.technorati.com/tag/apple. Matters become worse when no snippet describing the page content is provided. To overcome such situations, the web page hosted at the URL is used to obtain the class of the result. Since the snippet and title may sometimes also prove useful, their text is used along with the text of the web page to determine the class. To avoid fetching the same web page repeatedly, the server maintains a database that records the category and the look and feel of each web page.

4.2 Parsing and Document representation

The web pages are parsed to obtain the frequency of occurrence of every word in the document, after ignoring the stop words and stemming the remaining words using Porter's algorithm [8]. The set of these keywords forms the feature set of the document, and the features of all the documents form the feature space in which the documents are represented using the vector space model. A document matrix is then prepared that contains the Term Frequency Inverse Document Frequency (TFIDF) scores. The TFIDF score gives more weight to features that occur often in a document and less weight to features that occur in many documents, since the latter have less discriminating power. The TFIDF score is calculated using the relation

$$\mathrm{TFIDF} = \mathrm{TF} \times \mathrm{IDF}$$

where TF is the term frequency (the frequency of occurrence of a feature in a document) and IDF is the inverse document frequency, defined as

$$\mathrm{IDF} = \log(D/d)$$

where D is the number of documents and d is the number of documents in which the word occurs at least once. The document matrix is then used to classify the documents.

4.3 Classification of web documents

Classification assigns the predefined categories to the different web pages. It is a supervised learning problem, so a classifier has to be trained using a set of documents whose classes are already assigned (a labeled training set). Because labeling documents by hand to build training and test sets is time consuming and mechanical, InfoQuest uses CBC (Clustering Based Classification) [2], a semi-supervised learning method of classification. The Open Directory Project [9] is used to obtain the classes of the web documents.
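The TFIDF weighting of Section 4.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the InfoQuest implementation: the function name `tfidf_matrix` is hypothetical, the natural logarithm is assumed because the paper does not state the base, and the input documents are assumed to be already stop-word filtered and stemmed.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Build a TFIDF document matrix.

    docs is a list of token lists (assumed already filtered and stemmed).
    Returns (vocab, matrix), where matrix[i][j] is the TFIDF score of
    vocab[j] in docs[i].
    """
    n_docs = len(docs)  # D: total number of documents

    # d: number of documents in which each term occurs at least once
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    vocab = sorted(doc_freq)
    matrix = []
    for tokens in docs:
        tf = Counter(tokens)  # TF: raw count of each term in this document
        # Natural log is an assumption; a term present in every document gets 0.
        row = [tf[t] * math.log(n_docs / doc_freq[t]) for t in vocab]
        matrix.append(row)
    return vocab, matrix


docs = [["appl", "fruit", "appl"], ["appl", "comput"]]
vocab, matrix = tfidf_matrix(docs)
```

In this toy input the stem "appl" occurs in both documents, so its IDF is log(2/2) = 0 and it carries no discriminating weight, while "fruit" and "comput" each receive a positive score in the one document that contains them.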
4.4 Result Representation

The classified results, in the user's preference order, are then sent to Carnival at the client end. The output of a query to InfoQuest consists of the web page of ordered results and a pop-up window showing the user's ranked preference classes.

5. USER TRACKING

The user of a search engine would prefer to get results that match his interests. Search engines normally order the query results by general criteria such as the frequency with which the web pages are accessed: the higher the access frequency, the higher the page appears in the query results. To learn the user's fields of interest, InfoQuest asks the user to create a profile at the beginning. Every class in the profile is assigned a preference value, and the user may specify the preference value along with the class when the profile is created. From then on, Carnival at the user end watches the user's web browsing patterns to track changes in his interests. InfoQuest follows a set of rules, designed around the user's psychology while browsing the web, to learn about the user's choices. The rules are based on:

1. The time the user spends on a web page.
2. The frequency with which the user visits web pages of a particular class and with particular page characteristics.
3. The page characteristics, such as background color, font size, font color and images, that appeal to the user.

The following sections discuss these rules in detail.

5.1 The time user spends on a web page

To track the user, a plug-in in Carnival keeps track of the user's activities. The user may select any URL on the currently opened web page or may open a URL of his choice. As soon as the selected web page appears in the browser window, the browser starts tracking the time the user gives to that page.
Care is taken to ensure that the browser does not count the time for which the user sits idle without doing effective work on the page. This is done by keeping track of the upper and lower limits of the user's reading speed for textual content, and of the upper and lower limits of the time the user takes to view graphical content. Carnival requests all the information about the web page, such as text size, number of images and web page class, from the server. This information is used to estimate the time the user would need to take in all the contents of the page in depth. The upper and lower limits of the user's speed are initialized with values typical of normal users. These limits are used to calculate the estimated maximum and minimum time the user can take to read the page.

If the user spends much more time on the page than the estimated maximum, it can safely be assumed that he is not reading it: the page has been ignored after being opened and the user is not working on it. If the user spends much less time than the estimated minimum, it can be assumed that he closed the page without reading it thoroughly. When the time taken lies between the estimated lower and upper limits, that time is used to update the limits of the user's reading speed. In this way the reading speed limits keep getting updated and gradually converge towards the user's actual reading speed limits.

If the user takes an appropriate amount of time (i.e. within the minimum and maximum limits of the expected time), the characteristics of that web page (its class and its look and feel) can be assumed to attract the user and to match his interests. The user's preference for such characteristics is increased and stored in Carnival for ranking the results of his later queries.

5.2 Frequency with which a user visits a web page of a particular class and other characteristics

When a user visits a web page and reads it effectively, giving it the expected amount of time, the page is considered relevant to the user. The browser checks whether the class of the page is in the profile. If it is not, the profile is updated by giving the class a small preference value. A small value is chosen because the user may not be much interested in a particular class of web documents and may visit them only occasionally. The preference value of a class increases by a substantial amount if the number of pages of that class read by the user exceeds a certain threshold. In this way, the frequency with which the user visits web pages of a particular class and with particular characteristics helps to track the user's interest in a given class of web pages.

5.3 The page characteristics such as background color, font size, font color, images that appeal the user

Page characteristics are an important criterion for determining the look and feel of web pages that may appeal to the user. These characteristics are specific to a particular page.
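The time-window rule of Section 5.1 and the preference update of Section 5.2 can be sketched together as follows. This is only an illustration under stated assumptions: the paper gives no numeric values, so every default below (reading speeds, per-image viewing times, step sizes and the threshold) is hypothetical, as are the names `Profile`, `time_window` and `record_visit`. For brevity the sketch treats any time outside the estimated window as "skimmed" or "idle", whereas the paper speaks of "much less" or "much more" time, and it omits the gradual update of the reading speed limits.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    # All defaults are hypothetical; the paper only says the limits are
    # initialized with values typical of normal users.
    min_wpm: float = 100.0           # slowest assumed reading speed (words/min)
    max_wpm: float = 400.0           # fastest assumed reading speed (words/min)
    min_sec_per_image: float = 1.0   # shortest assumed viewing time per image
    max_sec_per_image: float = 10.0  # longest assumed viewing time per image
    preference: dict = field(default_factory=dict)  # class -> preference value
    pages_read: dict = field(default_factory=dict)  # class -> pages read


def time_window(profile, n_words, n_images):
    """Return the (minimum, maximum) plausible viewing time in seconds."""
    shortest = n_words / profile.max_wpm * 60 + n_images * profile.min_sec_per_image
    longest = n_words / profile.min_wpm * 60 + n_images * profile.max_sec_per_image
    return shortest, longest


def record_visit(profile, page_class, n_words, n_images, dwell_sec,
                 small_step=0.1, large_step=1.0, threshold=5):
    """Classify one page visit and update the class preference if it was read."""
    shortest, longest = time_window(profile, n_words, n_images)
    if dwell_sec < shortest:
        return "skimmed"  # closed without reading thoroughly
    if dwell_sec > longest:
        return "idle"     # page left open but ignored
    count = profile.pages_read.get(page_class, 0) + 1
    profile.pages_read[page_class] = count
    # Small increment at first; larger once the class is read often enough.
    step = large_step if count > threshold else small_step
    profile.preference[page_class] = profile.preference.get(page_class, 0.0) + step
    return "read"


profile = Profile()
outcome = record_visit(profile, "Arts/Performing Arts", n_words=400, n_images=2, dwell_sec=120)
```

With these assumed defaults, a page of 400 words and 2 images has a window of 62 to 260 seconds, so a visit of 120 seconds counts as read and gives the class a small initial preference.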
Page characteristic preferences are not taken from the user when the profile is first created; as time proceeds, the system learns the user's taste in page characteristics and adds it to the profile. These characteristics can then be used to filter the results within the different classes and to rank them in order of the user's preferred page layout and design.

6. RESULTS

To get a rough estimate of the performance of InfoQuest, we conducted a very simple experiment with the help of a number of users of different age groups. The results obtained from the experiment with a school-going child are included in this paper. A school-going child was chosen because children normally have varied interests and are generally fascinated by the presentation of web pages, which makes for a good test of the system. The user was allowed to use the system for 4 days, and at the end of this period the query Apple was given to the system as a test query. The results are shown in the following snapshots. Figure 1 shows the pop-up window that appears with the search results; it lists the categories along with their ranks. Figure 2 shows the search results under the different categories in the user's preference order.

Fig. 1 Classes of the web results for query Apple ranked according to user's preference order.

The snapshot in Figure 1 shows that the results for the query Apple are categorized into the classes Arts/Performing Arts, Computers/Software and Home/Cooking. On studying the results, we found that the results related to the Big Apple Circus were in the Arts/Performing Arts category, the results related to the Apple company were in Computers/Software, and the results about apple as a fruit were in the Home/Cooking category. The categories are arranged in the order of the user's preferences, which is indicated by the number of stars in front of each category. The user can also click directly on a category to get the results in that category.
On studying the web log of the user maintained in the Carnival browser, we found that the user gave the following queries to the system: Mah Jongg and Pinball in the Computers/Software category, and Learning magic, Lance Burton, magic tricks, magic shops, latest movies + review, Filmfare awards, and dance + disco in the Arts/Performing Arts category. Apart from these queries to InfoQuest, the user visited a large number of web pages that were presented as search results for his queries. From the log we found that although the user did not give Arts/Performing Arts as a preferred category to the system, he issued queries relating to this category and spent a considerable amount of time on those web pages. From this web navigation pattern the system learned that the user is interested in this category, which is reflected in the results shown in Figure 2.

Fig. 2 A part of the results for query Apple showing the results of class Arts/Performing arts on the top as this class is on the top of user's preferences.

After looking at the pages in the web log, we found that the majority of the time was spent on pages with fascinating, colorful images and text. The first and second results in the snapshot in Figure 2 contain only an image. The third result has some text along with the images, which reduces the image/text ratio, and the fourth result has much more text compared to the number of images, giving a very small image/text ratio.

7. CONCLUSION

In this paper we have described InfoQuest, a meta search engine for user-friendly intelligent information retrieval from the web. It personalizes the search results by tracking the user as he navigates the web through an intelligent browser, Carnival, at the user end. Carnival studies the user's psychology regarding web page preferences and uses a novel rule set to capture his interests. The system architecture is modular, so changes and improvements can easily be incorporated in any module. The system has been tested on a number of users. The results for a school-going boy presented in this paper show the effectiveness of the system in personalizing the results using the user tracking method described here.

8. FUTURE WORK

Every system has scope for further improvements that could enhance its efficiency and capability, and the same is true for InfoQuest. The rule set for user tracking can be enhanced further to provide better personalization of search results.

9. REFERENCES

[1] Joachims, T. Transductive inference for text classification using support vector machines. In Proceedings of 16th International Conference on Machine Learning, 1999, San Francisco: Morgan Kaufmann, pp. 200-209.
[2] Hua-Jun Zeng, Xuan-Hui Wang, Zheng Chen, Hongjun Lu and Wei-Ying Ma. CBC: clustering based text classification requiring minimal labeled data. ICDM 2003, 19-22 Nov. 2003, pp. 443-450.
[3] Selberg, E. and Etzioni, O. "The MetaCrawler Architecture for Resource Aggregation on the Web." IEEE Expert, January/February 1997, Volume 12, number 1, pp. 8-14.
[4] http://www.kartoo.com/ : Kartoo meta search engine.
[5] Zamir, O. and Etzioni, O. "Grouper: a Dynamic Clustering Interface to Web Search Results". WWW-8, 1999.
[6] http://labs.google.com/personalized : Google personalized search engine.
[7] Beerud Sheth, 1994. "A Learning Approach to Personalized Information Filtering," Learning and Common Sense Section T. R. 94-01, MIT Media Laboratory, http://agents.www.media.mit.edu/groups/agents/papers.html
[8] M. Porter. An Algorithm for Suffix Stripping. Program: Automated Library and Information Systems, 14(3), 1980, pp. 130-137.
[9] http://dmoz.org/ : Open directory project developed by Netscape.