AN ADAPTIVE WEB SEARCH SYSTEM BASED ON WEB USAGES MINING


International Journal of Computer Engineering and Applications, Volume X, Issue I, Jan. 16 www.ijcea.

Sethi Shilpa 1, Dixit Ashutosh 2
1, 2 Department of Computer Engineering, YMCA University of Science & Technology, Faridabad, India

ABSTRACT: Search engines are information retrieval tools that act as mediators between the web and the user. When a user submits a query at the search engine interface, the engine retrieves pages matching the query terms from its database, which is populated from the web in advance. The retrieved pages are then ranked and presented to the user. Unfortunately, users seldom get satisfactory results on the first attempt and need to modify their queries. This problem arises because the search engine retrieves results based on query keywords only and pays no attention to the user's interests during the ranking process. This paper presents an adaptive search mechanism based on web mining that extracts useful patterns about the user so that relevant information can be served. The result analysis shows considerable improvement in the quality of the result set compared to existing search engines.

Keywords: Search engine, page rank, web usage mining, user profile, information retrieval.

[1] INTRODUCTION
The WWW is a large repository of interconnected web documents that contain text, images, multimedia and many other items of information, referred to as information resources [8]. Statistics from authoritative web sites show that the indexed web contained at least 4.78 billion pages as recorded on 27 July 2015, and many more lie in the hidden web. People use information retrieval tools such as search engines to get information from this huge collection of documents. A basic search engine has five main components: user interface, crawler (also known as spider), indexing module, query processing module and ranking module [12].
When the user submits an information need as a set of keywords, referred to as a query, at the user interface, the search engine takes a few seconds to retrieve the web pages and present the result list. Such a short retrieval time is possible because the documents are retrieved from the engine's own database, which the crawling and indexing modules build locally well before the actual need arises. The crawler is the program that traverses the web at specified intervals and downloads web documents from different web servers [13]. These documents are then parsed to extract text and hyperlinks, which are stored separately in different files. The hyperlinks are used by the crawler to download further web pages, and the text is stored in the repository. The indexing module takes the text from the repository and constructs the inverted index of the terms belonging to a document. The index is basically a list of terms in which each term is linked with multiple postings [16]. The number of postings equals the number of documents containing the term. Each document posting stores the doc ID, the number of incoming links, the number of outgoing links from the document, the depth, and the frequency of the term in the document. This list is further attached to a third list containing the exact position of every occurrence of the term in the document. The query processor executes the user query on this inverted index and retrieves the matching documents. This set of documents is then sorted by the ranking module using content and link mining mechanisms. Finally, the sorted list is presented to the user in response to the query.

In short, information retrieval is purely based on keyword matching. But the users of these search engines have varying internet skills, ranging from novice users to computer specialists. So the keywords entered by a user are sometimes insufficient to reflect the information need clearly, or too ambiguous to infer a distinct need. Moreover, different users use the same word to get different information. For example, for the query JAVA, some users may be interested in documents about the programming language Java whereas others may be looking for Java coffee. But traditional search engines provide the same ranked list to all users, whether they are interested in the programming language or the coffee. Hence, it becomes difficult for a novice user to get relevant information. Web usage mining can be considered as a solution for predicting such information needs. It can be defined as the collection of techniques that analyse the user's access patterns in order to infer the searching need. Many algorithms based on explicit user feedback forms, collaborative filtering [2,14,15], click history [4,9], session usage [7], etc. have been proposed in the past.
In order to mine the user interest, all the above-mentioned approaches require the involvement of the user to some extent. This paper proposes a novel, hassle-free user interest learning mechanism that dynamically evaluates the user interest factor in different domains; this factor can then be used in the ranking process to sort the results as per user expectations. The rest of the paper is structured as follows: Section 2 discusses the basic preliminaries and related work in this area. Section 3 describes the proposed user interest mining system in detail with an example illustration. In Section 4, a sample query set is analysed to verify that user profile information can be utilized for retrieving relevant pages from the search engine database. Section 5 compares the results of the proposed system with popular search engines. Section 6 concludes the paper.

[2] STATE OF THE ART
The field of web information retrieval focuses on providing information relevant to the user's need, and many researchers have contributed to this problem. The popular search engine Google sorts the retrieved documents based on the link structure of pages within the web [5]. It first retrieves a set of relevant pages based on factors such as title tags and keywords and then applies the PageRank algorithm so that relevant pages are placed at the top of the ranked list. PageRank divides the rank score of a web page equally among its outgoing links. An extension of the basic PageRank algorithm, known as the weighted PageRank algorithm, is proposed in [9]. Its authors point out that the outgoing links of a page cannot all have equal importance, so they assign the rank of a page based on the link popularity of its incoming and outgoing links.
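The equal division of a page's rank score among its outgoing links, as in the basic PageRank algorithm [5], can be illustrated with a short power-iteration sketch. The damping factor of 0.85, the iteration count, the handling of pages without out-links and the three-page toy graph are illustrative assumptions and are not taken from this paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Basic PageRank by power iteration.

    links maps each page to the list of pages it links to.
    damping and iterations are assumed values, not from the paper.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # The page's score is split equally among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no out-links spreads its score evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: A links to B and C, B links to C, C links to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
```

In this toy graph, page C receives half of A's score and all of B's score, so it ends up ranked above both. The weighted variant of [9] would replace the equal `share` with a share proportional to the popularity of each target link.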

The improved weighted page rank proposed in [7] assigns higher rank values to outgoing links that are visited more often by users and that receive higher popularity from their in-links. Ekstrand et al. [2] describe a collaborative filtering based recommendation system in which a user's interest is inferred by asking the user to rate the items in a given domain. The Pearson correlation coefficient is used to choose a set of users whose interests are similar to those of the active user, and the weighted aggregate of their ratings is used to generate predictions for the active user. The main drawback of this approach is that users are very reluctant to provide any type of feedback. Zhongbao [11] suggested that the user's search intention may be deduced by extracting the terms present in the documents previously clicked by the user and mapping these terms to a set of categories from the ODP taxonomy. When the user submits a query, the top three matched categories are offered to the user for selection. Only a small part of the category hierarchy is used, which cannot infer the user interest correctly. Moreover, the user has to select from these categories explicitly, a step in which the user is not always interested. A multi-agent ontology profile construction method is proposed in [3], where the user's short-term interest, based on a sliding time window, and long-term interest, based on a forgetting factor, are automatically evaluated for every user. The major shortcomings of this approach are the complex computation and the storage required for maintaining each user profile.
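The collaborative filtering step attributed to [2] above (pick users similar to the active user by Pearson correlation, then aggregate their ratings with similarity weights) can be sketched as follows. The mean-centred form of the weighted aggregate, the neighbourhood size `k`, and all user names and ratings are assumptions made for illustration; they are not data from any of the cited works.

```python
from math import sqrt


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the items both users have rated."""
    common = set(ratings_a) & set(ratings_b)
    if len(common) < 2:
        return 0.0
    mean_a = mean(ratings_a[i] for i in common)
    mean_b = mean(ratings_b[i] for i in common)
    num = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    den = sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in common)) * \
        sqrt(sum((ratings_b[i] - mean_b) ** 2 for i in common))
    return num / den if den else 0.0


def predict(ratings, active, item, k=2):
    """Predict the active user's rating of item from the k most similar users."""
    active_mean = mean(ratings[active].values())
    neighbours = []
    for user, user_ratings in ratings.items():
        if user == active or item not in user_ratings:
            continue
        sim = pearson(ratings[active], user_ratings)
        if sim > 0:  # assumption: only positively correlated users are used
            neighbours.append((sim, user))
    top = sorted(neighbours, reverse=True)[:k]
    if not top:
        return active_mean
    # Similarity-weighted aggregate of the neighbours' mean-centred ratings.
    num = sum(sim * (ratings[u][item] - mean(ratings[u].values())) for sim, u in top)
    den = sum(sim for sim, _ in top)
    return active_mean + num / den


# Hypothetical ratings on a 1 to 5 scale.
ratings = {
    "u1": {"a": 5, "b": 3, "c": 4},
    "u2": {"a": 4, "b": 2, "c": 3, "d": 5},
    "u3": {"a": 1, "b": 5, "c": 2, "d": 1},
}
predicted = predict(ratings, "u1", "d")
```

Here `u2` rates the shared items in the same pattern as `u1` and becomes the only neighbour, while `u3` is negatively correlated and is ignored. The sketch also makes the drawback noted above concrete: nothing can be predicted until users have supplied explicit ratings.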
Although existing search engines use sophisticated mining techniques to infer the user's search intention, they still fall short of user satisfaction for the following reasons:
1) Users are not interested in giving explicit feedback on search results, and learning the user interest implicitly is not easy.
2) To serve the needs of individuals, a bulky profile of every user is maintained, which requires a lot of memory space.
So an efficient automatic user interest learning mechanism needs to be built that can maintain each user profile in an optimized way. The proposed ranking mechanism, discussed in the next section, addresses these problems by considering user visits in page categories and recursively learning the user interest.

[3] PROPOSED SYSTEM
The major components of the proposed system, shown in Fig. 1, are: query interface, user profile module, query processor, page classifier, crawler and database. The detailed working of each component is given in the following subsections.

Figure 1: Proposed Architecture

3.1 Query interface
It is the interface where a new user is registered and an existing user is authenticated. When a user logs in to the system, it passes an authentication signal to the user profile module. After registration/authentication, the user can submit a search need in the form of a query. The query interface passes the query words to the query processor to find the results relevant to the query. After getting the sorted list of URLs from the query processor, it presents the results to the user.

3.2 User Profile module
After receiving the signal from the search engine interface, it registers the new user or authenticates the existing user by user id. It creates a profile for every user based on the degree of interest in a particular category of web pages. The proposed system classifies each web page into one of five categories, viz. entertainment, sports, education, fashion and shopping, and food & beverages (the aim is to evaluate the performance of the proposed technique on a small set of data; the set of categories can be extended further). The tasks performed by this module are as follows:
1) Creates and maintains the profile of every user and stores the information in the profile database.
2) Receives the user's click information on a particular URL from the query interface.
3) Extracts the category information related to the clicked URL from the search engine database.
4) Computes/updates the user interest weight in each category, represented by Weight_interest(u, C), using eqn (1) given below. Weight_interest(u, C) measures the extent to which user u is interested in category C with respect to all the categories in the search engine database.

$$\mathrm{Weight}_{interest}(u, C) = \frac{NP(u, C)}{\sum_{i=1}^{n} NP(u, C_i)} \qquad (1)$$

Where: NP(u, C) counts the number of pages accessed by user u in page category C, and the sum of NP(u, Ci) counts the total number of pages accessed by user u in all the categories (C1, C2, ..., Cn).
5) Finally, passes the interest weight of each user in the different categories to the query processor.

Example illustrating the working of the user profile module: To explain the working of the profile generator, let us consider a small set of users, U = {user1, user2, user3, user4}, and page categories, C = {C1, C2, C3, C4}. Initially the interest weight of each user in all categories is set to zero. Let user1 fire the query blackberry, which is found in pages Pm and Pn belonging to two different categories, namely shopping and fashion, and food and beverages, respectively. The query processor prepares the sorted list based on keyword weight and link weight, as the interest weight is initially 0. Suppose the user clicks the page Pn, belonging to the category food and beverages. Then, according to eqn (1), the interest weight of user1 is updated to 1/1 = 1 under the category food and beverages (the others remain 0). In this way the degree of user interest in a particular category keeps updating as the user accesses more and more pages of that category relative to the user's overall access. Table 1(b) shows an example of the interest weights of different users in different categories at some time t.

Table 1(a): Initial user interest weight in each category
Classes   C1   C2   C3   C4
User1     0    0    0    0
User2     0    0    0    0
User3     0    0    0    0
User4     0    0    0    0

Table 1(b): User interest weight in each category at time t
Classes   C1   C2   C3   C4
(rows User1 to User4; the cell values are not preserved in the source)

From Table 1(b), it may be observed that each user has a different degree of interest in different page categories. So the mechanism is successful in mapping the interests of different users and serving the ranked list as per the user's perspective.
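The interest weight of eqn (1) and the blackberry example above can be sketched in a few lines. The class name `UserProfile` and its methods are hypothetical names chosen for this illustration; the paper does not describe its implementation at this level.

```python
from collections import defaultdict


class UserProfile:
    """Per-user page-visit counts per category (hypothetical helper class)."""

    def __init__(self):
        # category -> NP(u, C), the number of pages the user accessed in C
        self.visits = defaultdict(int)

    def record_click(self, category):
        """Register one clicked page belonging to the given category."""
        self.visits[category] += 1

    def interest_weight(self, category):
        """Eqn (1): NP(u, C) divided by the sum of NP(u, Ci) over all categories."""
        total = sum(self.visits.values())
        if total == 0:
            return 0.0  # initial state: no clicks yet, weight is zero
        return self.visits.get(category, 0) / total


# Replaying the example: user1 first clicks page Pn (food and beverages).
user1 = UserProfile()
user1.record_click("food and beverages")
w_first = user1.interest_weight("food and beverages")   # 1/1

# A later click in another category lowers the relative weight to 1/2.
user1.record_click("fashion and shopping")
w_second = user1.interest_weight("food and beverages")
```

Because only one counter per category is stored for each user, the profile stays small, which matches the storage argument made for the proposed system.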
3.3 Query processor
The query processor receives the query terms from the query interface and prepares the sorted list of web documents for the user. It performs the following activities:
1) Remove the non-functional keywords (like in, what, that etc.) from the query using Porter's algorithm [8].
2) Find the synonyms of the functional keywords using WordNet.
3) Find the pages which contain the functional terms and/or their synonyms.
4) Find the number of occurrences and the position of each of the above-mentioned terms in the matched documents.
5) Calculate the document weight, Weight_doc, of all the matched documents using eqn (2) [equation not preserved in the source], where Wpos denotes the position weight and Wkw denotes the keyword weight, both discussed below.
6) Add the link weight as calculated by [7], the interest weight and the document weight to obtain the overall rank of a web page.
7) Prepare the sorted list of documents and pass it to the query interface.

Calculation of Wpos: The position of a query term plays an important role in computing the weight of a web document, as a document containing the query term in its title tag is more important than a document containing it only in the body text. The weights corresponding to different positions are listed in Table 2.

Table 2: Keyword position weight
Keyword position    Weight
<Title>             1
<H1><H2><H3>        0.75
<B><I><U>           0.5
<Body>              0.25

The rules for assigning the position weight are as follows:
Rule 1: If the query contains a single term and it occurs at different positions in the web document, then the highest position weight among all occurrences is considered.
Rule 2: If the query contains more than one functional term, then the sum of the highest position weights of all the terms is assigned to Wpos.

Calculation of Wkw: The frequency of a keyword in the document also reflects the relevancy of the document with respect to the query term. As different documents have different lengths, the frequency needs to be normalized:

$$W_{kw} = \frac{\sum_{i \in Q} n_i}{\sum_{k \in Doc} n_k}$$

Where: ni denotes the number of occurrences of each query term of Q, and nk denotes the number of occurrences of each keyword in the document, Doc.

3.4 Crawler
It traverses the web automatically by following hyperlinks and, depending upon the host protocol, downloads web documents from the web server.
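A minimal sketch of such a link-following crawl is given below, assuming a queue of pending URLs and a caller-supplied `fetch` function. To keep the example runnable without network access, `fetch` is backed by a small in-memory dictionary; the function names, the `max_pages` limit and the example URLs are hypothetical and do not come from the paper.

```python
from collections import deque


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl starting from seed_urls.

    fetch(url) must return (text, out_links) or None if the page is unavailable.
    """
    frontier = deque(seed_urls)   # queue of URLs still to be visited
    seen = set(seed_urls)         # avoids queueing the same URL twice
    repository = {}               # url -> stored page information
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:
            continue              # download failed, move on to the next URL
        text, out_links = page
        repository[url] = {"text": text, "out_links": list(out_links)}
        for link in out_links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository


# Hypothetical three-page web used in place of real HTTP downloads.
FAKE_WEB = {
    "http://a.example/": ("page a", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("page b", ["http://a.example/"]),
    "http://c.example/": ("page c", ["http://missing.example/"]),
}
pages = crawl(["http://a.example/"], FAKE_WEB.get)
```

A real crawler would replace `FAKE_WEB.get` with an HTTP download and an HTML link extractor, and would also record incoming-link counts, which this sketch omits.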
It starts the process of crawling by placing a set of seed URLs (in the proposed system the seed set contains URLs from five different domains) in a queue called the URL frontier. From this queue it picks a URL, downloads the page, separates the link information from the page and updates the URL frontier. Page information such as the number of incoming and outgoing links is placed in the page repository. This process is repeated, and the collected documents are further indexed by the page classifier into the appropriate class of the search engine database.

3.5 Page Classifier
In order to fulfil the user's need quickly, a search engine maintains its database with the help of a special module called the indexer. Here, the working of the indexer is slightly modified so that it classifies the pages into different classes as well as indexing them. The tasks performed by the page classifier module are listed below:
1) Construct the initial set of page categories, starting with the seed keywords of each category.
2) Extract the functional words (is, what, then etc. are ignored) along with their position information within the document.
3) Determine the category to which the page belongs by comparing the set of functional words of the page with the set of keywords of each page category. The page is placed in the category whose intersection with the page keywords is maximum. (Here, the intersection between the two sets must be above a minimum threshold value, taken as 0.20.)
4) Keep updating the set of keywords of each category by taking the union of the category keyword set with the keywords of each page whose intersection with the category keywords is above the threshold.
5) Store the page information in the qualified page category.
6) If the intersection of the keywords of a page P with all the categories is below 0.20, create a new page category, Cm, with its seed keyword set initialized to the keywords of page P.
The working of the page classifier module is depicted in Fig. 2.

Fig 2: Flow chart showing working of page classifier module.

[4] RESULT ANALYSIS
A dataset of 10,000 pages is classified into five classes, namely entertainment, education, sports, fashion & shopping, and food & beverages. The seed set of keywords defining each class is built. The pages browsed by different users in the various categories are analysed to identify their degree of interest, which is then used in the rank calculation mechanism. A user study with a group of graduate students was conducted. Users were expected to select the relevant URLs satisfying their information need. The experiment tracks the pages each user visited from 05 April 2015 to 19 April 2015 and analyses a batch of pages every 5 days. The number of pages visited by a volunteer group of 5 students in the first batch is shown in Table 3(a).

Table 3(a): No. of pages visited by each user in different page categories in the 1st batch
Columns: users U1, U2, U3, U4, U5. Rows: page categories 101 (entertainment), 102 (education), 103 (sports), 104 (fashion & shopping), 105 (food & beverages). (The cell values are not preserved in the source.)

The number of pages visited by each user in the second batch (April 10 to April 14, 2015) is shown in Table 3(b).

Table 3(b): No. of pages visited by each user in different page categories in the 2nd batch
Columns and rows as in Table 3(a). (The cell values are not preserved in the source.)

The number of pages browsed by each user in the third batch (April 15 to April 19, 2015) is shown in Table 3(c).

Table 3(c): No. of pages visited by each user in different page categories in the 3rd batch
Columns and rows as in Table 3(a). (The cell values are not preserved in the source.)

The interest weights of U1 after the three batches of analysis are shown in Table 4.

Table 4: Interest weight in different classes for UID1
Columns: Batch1, Batch2, Batch3. Rows: 101 (entertainment), 102 (education), 103 (sports), 104 (fashion & shopping), 105 (food & beverages). (The cell values are not preserved in the source.)

By analysing the browsing history of U1, it is observed that during the second batch of the experiment, interest weights in two new classes (104 and 105) are added, while the interest weights in classes 102 and 103 have dropped. Similarly, the interest weights of the rest of the users in different page categories are maintained by the profile generation module.

Calculating the document weight: The document weight is obtained by adding the keyword weight, link weight and user interest weight. The sorted result list will therefore be different for different users. When user 1 (U1) submitted the query Sony Ericson, the query processor matched 192 pages (far fewer than Google, but the concern here is to evaluate whether the technique produces satisfactory results). The top 10 results are shown to U1 on the first page. User satisfaction on a scale of 10 is compared for the proposed system and a popular existing search engine in Fig. 3.

Fig 3: Comparison of proposed system with existing search system

[6] CONCLUSION
An efficient page ranking mechanism based on user interest mining is proposed in this paper to retrieve quality data. The user interest is mined by tracking the number of pages visited by the user in the past, without any effort on the part of the user. The technique maintains the user profile using only a few attributes of the user's browsing history, thereby providing an optimized solution for personalizing the results. Short-term and long-term user interest is easily adjusted, as the number of pages visited by the user in different classes varies from time to time. The experiment with the volunteer group indicates that the proposed mechanism is effective compared to the existing search system.

REFERENCES
[1] N. Duhan, A. Sharma, Optimization of search results with duplicate page elimination using usage data. ACEEE Int. J. on Network Security, Vol. 02, No. 02 (2011)
[2] M.D. Ekstrand, J.T. Riedl, J.A. Konstan, Collaborative filtering recommender systems. Foundations and Trends in Human-Computer Interaction, Vol. 4, No. 2 (2010)
[3] Q. Gao, Y. Cho, A multi agent personalized ontology profile based user preference profile construction method. IEEE 44th International Symposium on Robotics, Pg 1-4 (2013)
[4] K.W.T. Leung, W. Lee, D.L. Ng, Personalised concept based clustering of search engine queries. IEEE Transactions on Knowledge and Data Engineering (2008)
[5] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank citation ranking: bringing order to the web. Technical report, Stanford Digital Libraries SIDL-WP (1999)
[6] Z. Sha, D. Xiaotie, C. Kang, Z. Weimin, Using Online Relevance Feedback to Build Effective Personalized Metasearch Engine. In Proceedings of the Second International Conference on Web Information Systems Engineering, Vol. 1 (2001)
[7] A. Sharma, N. Duhan, G. Kumar, A novel page ranking method based on link visits of web pages. Int. J. of Recent Trends in Engineering and Technology, Vol. 4, No. 1 (2010)
[8] S. Sethi, A. Dixit, Design of personalized search system based on user interest and query structuring. 2nd International Conference on Computing for Sustainable Global Development, INDIACOM-2015 (2015)
[9] W. Xing, A. Ghorbani, Weighted PageRank Algorithm. Proc. of the 2nd Annual Conference on Communication Networks & Services Research (2004)
[10] Hu Liang, Song Guohang, Xie Zhenzhen, Zhao Kuo, Personalized Recommendation Algorithm Based on Preference Features. Tsinghua Science and Technology, Volume 19, Number 3 (2014)
[11] L. Zhongbao, Research on personalized search engine based on user interest mining. International Conference on Intelligent Computing and Integrated Systems (2010)
[12] P. Mudgil, A. Sharma, P. Gupta, An improved indexing mechanism to index web document. Proc. of International Conference on Computational Intelligence and Communication Networks (2013)
[13] S. Sethi, A. Dixit, A crawling mechanism to maintain freshness of downloaded collection based on user perspective and page updation frequency. Journal of Network Communications and Emerging Technologies (JNCET), Volume 5, Special Issue 2, December (2015)
[14] Zhao Zhi-Dan, Shang Ming-Sheng, User-based Collaborative-Filtering Recommendation Algorithms on Hadoop. Third International Conference on Knowledge Discovery and Data Mining (2010)
[15] Mu Xiangwei, Yan Chen, Li Taoying, User-Based Collaborative Filtering Based on Improved Similarity Algorithm (2010)
[16] S. Mitra, M. Winslett, W. Windsor, K. Chen-Chuan, Trustworthy keyword search for compliance storage. VLDB J. 17(2) (2008)
