Finding Hubs and authorities using Information scent to improve the Information Retrieval precision


Suruchi Chawla 1, Dr Punam Bedi 2
1 Department of Computer Science, University of Delhi, Delhi, INDIA
2 Department of Computer Science, University of Delhi, Delhi, INDIA

Abstract - Improving the effectiveness of Information Retrieval on the Web is a research area in which efforts are ongoing to raise precision by satisfying the user's need effectively and efficiently. To improve the precision of Information Retrieval on the Web, it is necessary to understand the information need behind the query issued by the user and to retrieve good authority and hub web pages for that information need. Good authorities are pages that are rich sources of information for a specific information need. Good hubs are pages that contain links to good authorities for a given information need. This paper identifies good authority and hub pages for a particular information need using Information Scent in HITS with query sessions mining. Information Scent is used to identify the relevance of the clicked URLs in a user's query sessions with respect to the information need of those sessions. The proposed approach models the information need of query sessions using the information scent and content vectors of the clicked URLs, and clusters query sessions with similar information need. Each derived cluster is used in the HITS algorithm to generate good hub and authority pages, using information scent, for the information need represented by the cluster. The input query is then used to select the cluster that contains the high scent authority and hub pages for the information need associated with the input query.
The performance of the proposed approach is evaluated with an experimental study of query sessions mined from the web history of the "Google" search engine, and the experimental results show an improvement in Information Retrieval precision.

Keywords: Information scent, Information retrieval, clustering, search engine, authorities, hub.

1. Introduction

Current search tools retrieve too many documents for a given input query, of which only a few are relevant to the query [10]. Furthermore, some relevant documents that do not contain the keywords used in the query cannot be retrieved by keyword based search engines. These relevant documents are the authority and hub pages for a given query. Good authorities are pages that are the best sources of information on a given topic. Good hubs are pages that provide a collection of links to good authorities. One possible reason why these relevant documents cannot be retrieved using the keywords of the input query is that authority pages for a given topic are seldom self-descriptive: the topic that an authority page represents rarely appears as a term in the page itself. As a result, most keyword based search engines cannot retrieve these documents, because they do not contain a keyword that matches the keywords given in the query.

To overcome this problem and to improve the precision of information retrieval, the HITS algorithm was introduced in [4] to generate the hub and authority pages for a given input query. The HITS algorithm uses the initial set of documents retrieved for an input query by a standard IR system as the root set from which hubs and authorities are generated. The performance of the HITS algorithm depends on this initial set of retrieved documents, which are assumed to be relevant, whereas their actual relevance depends on the user's perception of each retrieved document with respect to his information need.
In addition, hub and authority scores are computed online, which affects the efficiency of the search engine in Information Retrieval. In view of these bottlenecks, which affect precision, the approach proposed in this paper uses Information Scent in HITS with query sessions mining. Information Scent is used to quantitatively assess the relevance of a clicked page with respect to the information need of the query session. Information Scent was used in [7] to improve Information Retrieval precision by improving the rank of those retrieved pages in the result set that were relevant to the information need associated with the input query.

The history of user interaction with the search engine is preprocessed to obtain the set of query sessions in the query log of the search engine. A query session contains those retrieved documents that the user clicked for the information need associated with the input query. The information need associated with a query session is modeled using the information scent and content of the clicked pages in the session. Query sessions with similar information need are clustered, so that each cluster contains a set of pages that satisfy a similar information need. In offline processing, the HITS algorithm uses the derived clusters as root sets to discover the authority and hub web pages for the unique information need associated with each root set. Information Scent is used to determine the relevance of a given page as an authority or hub for the specific information need associated with the root set. In online processing, the input query is used to select the cluster for a specific information need. The selected cluster returns the high Information Scent authority and hub web pages as good authorities and hubs for the given input query. The proposed approach thus improves information retrieval precision by returning high quality web documents for a given query using Information Scent in HITS with query sessions mining.

This paper is organized as follows: section 2 describes Information Scent, section 3 covers modeling the information need of query sessions using information scent, section 4 explains the use of Information Scent in the HITS algorithm with query sessions mining, section 5 gives the proposed approach for generating hubs and authorities to boost information retrieval precision, section 6 presents the experimental study, and section 7 concludes the paper.

2. Information Scent

On the web, users search for information by navigating from page to page along web links. Their actions are guided by their information need. Information scent is a measure of the perceived value and cost of accessing a page, based on perceptual cues, with respect to the information need of the user. The better a page satisfies the information need of the user, the higher the information scent associated with it. The interaction between user needs, user actions and the content of the web can be used to infer information need from a pattern of surfing [1][2][5][6].

In the proposed approach, information scent is used to derive a quantitative measure of the value of a clicked page in a query session with respect to the information need of the user associated with that session. High Information Scent URLs are those clicked URLs in the query sessions that are close to the information need associated with the sessions. High scent pages are uniquely clicked for a given information need. For a given sequence of clicked documents in a particular query session, the more unique a frequently clicked page is to that session relative to the entire set of query sessions in the data set, the more likely it is to be close to the information need of the current session, and the higher the information scent associated with it in determining the information need of the session.

Another parameter used in assessing the information scent of the clicked pages is the time spent on them. The reason for considering time is that a clicked page that consumes more of the user's attention is more likely to satisfy his information need than a page on which the user spends less time. Thus both parameters determine the relevance of pages when the information need associated with query sessions is derived using Information Scent.
The concept of Information Scent takes into account the fact that the documents clicked by the user in a particular query session are not all equally relevant to the user's information need.

3. Modeling Information Need of Query Sessions using Information Scent

The Inferring User Need by Information Scent (IUNIS) algorithm provides various combinations of parameters to quantify Information Scent [2][3]. The factors adopted and adapted for the proposed approach are the page access PF.IPF weight and TIME, which together quantify the information scent associated with a clicked page in a query session. In page access PF.IPF, PF is the access frequency of the clicked page in the given query session, and IPF is derived from the ratio of the total number of query sessions in the data set to the number of query sessions in which the page is clicked. This factor gives high weight to pages that are uniquely and frequently accessed in the query session and are relevant to the information need associated with the current session. The second factor is the time spent on a page in a given query session. Including time gives more weight to pages that consume more user attention. The information scent $s_{id}$ is calculated for each page $P_{id}$ in a given session $Q_i$ as follows.

$$s_{id} = \mathrm{PF.IPF}(P_{id}) \cdot \mathrm{Time}(P_{id}), \quad \forall d \in 1..n \qquad (1)$$

$$\mathrm{PF.IPF}(P_{id}) = \frac{f_{P_{id}}}{\max_{d \in 1..n} f_{P_{id}}} \cdot \log\left(\frac{M}{m_{P_{id}}}\right) \qquad (2)$$

where n is the number of distinct clicked pages in the query session $Q_i$. PF.IPF($P_{id}$) and Time($P_{id}$) are defined as follows.

PF.IPF($P_{id}$): PF corresponds to the normalized frequency $f_{P_{id}}$ of page $P_{id}$ in a given query session $Q_i$, and IPF corresponds to the ratio of the total number of query sessions M in the whole data set to the number of query sessions $m_{P_{id}}$ that contain the given page $P_{id}$.

Time($P_{id}$): the ratio of the time spent on page $P_{id}$ in a given session $Q_i$ to the total duration of session $Q_i$.

In this paper the weighted content vector of the page $P_{id}$ in the query session $Q_i$ is used:

$$P_{id} = \mathrm{Content}_{id}, \quad \forall d \in 1..n$$

Content$_{id}$: the content vector of a page $P_{id}$ is a keyword vector $(w_{1,id}, w_{2,id}, w_{3,id}, \ldots, w_{v,id})$, where v is the number of terms in the vocabulary set V. The vocabulary V is the set of distinct terms found in all distinct clicked pages in the whole data set relevant to a content feature. The Vector Model in [8] is used to represent the content of each page $P_{id}$ in all query sessions. The TF.IDF (term frequency * inverse document frequency) term weight is used to build the content vector of a given page $P_{id}$: the importance of each term of V in the page is the number of times the term appears in the page, weighted by the ratio of the number of all pages to the number of pages that contain the term.

The information scent associated with a clicked page $P_{id}$ is calculated from the two factors described above, PF.IPF page access and TIME. Each query session is constructed as a linear combination of the vectors of its pages $P_{id}$, each scaled by the weight $s_{id}$, which is the information scent associated with page $P_{id}$ in session $Q_i$.
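The scent computation of equations (1) and (2), and the scent-weighted combination of page vectors just described, can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation. The function and parameter names are hypothetical, the natural logarithm is assumed because the paper does not state the base, and the session duration is approximated by the sum of the times spent on the clicked pages.

```python
import math
from collections import Counter


def pf_ipf(clicks, page, total_sessions, sessions_with_page):
    """PF.IPF of `page` in one session, following equation (2)."""
    freq = Counter(clicks)
    # PF: click frequency normalised by the most-clicked page of the session.
    pf = freq[page] / max(freq.values())
    # IPF: log of (all sessions M / sessions containing the page); base assumed.
    ipf = math.log(total_sessions / sessions_with_page)
    return pf * ipf


def information_scent(clicks, time_on_page, total_sessions, sessions_with_page):
    """Scent s_id of every distinct clicked page, following equation (1)."""
    # Assumption: session duration = total time spent on the clicked pages.
    duration = sum(time_on_page.values())
    return {
        page: pf_ipf(clicks, page, total_sessions, sessions_with_page[page])
        * (time_on_page[page] / duration)
        for page in set(clicks)
    }


def session_vector(scent, content):
    """Scent-weighted sum of the content vectors of the clicked pages."""
    size = len(next(iter(content.values())))
    vec = [0.0] * size
    for page, s in scent.items():
        for j, w in enumerate(content[page]):
            vec[j] += s * w
    return vec


# Hypothetical session: page "a" clicked twice, page "b" once.
scent = information_scent(
    ["a", "a", "b"], {"a": 30.0, "b": 10.0}, 100, {"a": 10, "b": 50}
)
vector = session_vector(scent, {"a": [1.0, 0.0], "b": [0.0, 2.0]})
```

In this toy session, page "a" is clicked more often, occurs in fewer sessions overall, and holds the user's attention longer, so it dominates the session vector.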
Formally, the vector of query session $Q_i$ is

$$Q_i = \sum_{d=1}^{n} s_{id} P_{id} \qquad (3)$$

In the above formula, n is the number of distinct clicked pages in the session $Q_i$, and $s_{id}$ (the information scent) is calculated for each page $P_{id}$ in the session using (1) and (2). Each query session $Q_i$ is obtained as a weighted vector using formula (3). This vector models the information need associated with the query session $Q_i$.

Clustering Queries

Clustering is the process of grouping data into classes or clusters so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters. Dissimilarities are assessed based on the attribute values describing the objects. Query session vectors are clustered using the k-means algorithm because of its good performance for document clustering [9][11]. Query sessions in our approach are represented in the same way as vectors of web pages, so the techniques for clustering queries are the same as those for clustering pages. A score or criterion function measures the quality of the resulting clusters; it is used by the common vector space implementation of the k-means algorithm [12]. The function measures the average similarity between the vectors and the centroids of the clusters to which they are assigned. Let $C_p$ be a cluster found in a k-way clustering process ($p \in 1..k$) and let $c_p$ be the centroid of the p-th cluster. The criterion function I is defined as follows:

$$I = \frac{1}{M} \sum_{p=1}^{k} \sum_{v_i \in C_p} \mathrm{sim}(v_i, c_p) \qquad (4)$$

where M is the total number of query sessions in all clusters, $v_i$ is the vector representing a query session belonging to the cluster $C_p$, and the centroid $c_p$ of the cluster $C_p$ is defined as

$$c_p = \frac{\sum_{v_i \in C_p} v_i}{|C_p|} \qquad (5)$$

where $|C_p|$ denotes the number of query sessions in cluster $C_p$. $\mathrm{sim}(v_i, c_p)$ is calculated using the cosine measure.

4. Information Scent in HITS: Computing Hubs and Authorities with Query Sessions Mining

The HITS (Hyperlink-Induced Topic Search) algorithm in [4] computes the list of hubs and authorities for a particular web search topic. The search topic is specified by one or more query terms. The HITS algorithm applies two main steps:

1) A sampling component, which constructs a focused collection of several pages likely to be rich in relevant authorities.
2) A weight propagation component, which determines numerical estimates of hub and authority weights by an iterative procedure.

HITS returns the highest-weight hub and authority web pages for a given search topic. In the proposed approach, the HITS algorithm is modified to use the clusters obtained with Information Scent in query sessions mining as the root sets for the information needs associated with the clusters. A particular root set is a set of pages that satisfy a similar specific information need. The root set is expanded into the base set by including all the pages that root set pages link to and all the pages that link to a page in the root set. The web is viewed as a directed graph consisting of a set of nodes with directed edges between certain node pairs. Given any subset S of nodes, the nodes induce a subgraph containing all edges that connect two nodes in S. Good hub and authority web pages are discovered from the subgraph G induced by the base set for each unique information need using Information Scent.

Information Scent is used to identify the relevance of the pages in the base set with respect to the information need associated with the base set. High Information Scent pages are more relevant than low scent pages. The HITS algorithm is modified to use the Information Scent associated with each page as its initial authority and hub score, which is otherwise taken as a constant for all pages in the base set. Information Scent is favored because it measures the relevance of a clicked page for a specific information need, and using that relevance as the initial hub and authority score gives a more accurate approximation of the page as a hub or authority. This rests on the fact that the pages in the base set are not all equally relevant to the specific search topic.
The procedure for computing hub and authority pages using Information Scent in HITS, for each base set obtained as a result of query sessions mining, is given below.

Modified HITS Offline Processing

1) Use each cluster obtained from query sessions mining as the root set for the unique information need associated with it.
2) Expand each root set to the corresponding base set.
3) Use the information scent of each page p in the base set to initialize its authority weight $x_p$ and hub weight $y_p$ in the subgraph induced by the base set.
4) Update the authority weight $x_p$ and hub weight $y_p$ of each page p as follows until the scores of every page p reach a fixed point.

$$x_p = \sum_{q \,:\, q \to p} y_q \qquad (6)$$

$$y_p = \sum_{q \,:\, p \to q} x_q \qquad (7)$$

5) A page p with a high information scent associated with weight $x_p$ is viewed as a good authority, and a page p with a high information scent associated with weight $y_p$ is viewed as a good hub.
6) HITS outputs a short list consisting of the pages with high scent authority and hub scores for the specific information need associated with each base set, using the threshold value x of the Information Scent.

5. Proposed approach for improving the information retrieval precision using high scent hub and authority pages

The proposed approach improves information retrieval precision by using high scent authorities and hubs for the unique information need of the input query, obtained with Information Scent in HITS with query sessions mining. It uses the clusters of query sessions as root sets for the modified HITS to generate hubs and authorities for unique information needs using Information Scent. Each query session consists of a query along with the URLs clicked in its answer:

Query session = (query, (clicked URLs)+)

where the clicked URLs are those URLs that the user clicked before submitting another query. Any change in the content of the input query marks the beginning of a new query session.
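The weight propagation of the modified HITS procedure (steps 3 and 4, equations (6) and (7)) can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' code. The names `modified_hits`, `links`, `scent` and `default_scent` are hypothetical. The weights are normalised to unit length after each update, as in Kleinberg's original HITS, because the paper asks for a fixed point but does not spell out the normalisation. Pages of the base set that have no scent value, for example pages added during expansion, receive a small positive default, which is also an assumption.

```python
import math


def _normalise(weights):
    """Scale weights to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {p: w / norm for p, w in weights.items()}


def modified_hits(links, scent, default_scent=1e-3, max_iter=100, tol=1e-9):
    """links: page -> set of pages it links to, restricted to the base set.
    scent: page -> information scent, used as the initial weight (step 3)."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    # Step 3: initialise authority (x_p) and hub (y_p) weights from the scent.
    auth = _normalise({p: scent.get(p, default_scent) for p in pages})
    hub = dict(auth)
    for _ in range(max_iter):
        # Equation (6): x_p is the sum of y_q over pages q that link to p.
        new_auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_auth[p] += hub[q]
        new_auth = _normalise(new_auth)
        # Equation (7): y_p is the sum of x_q over pages q that p links to.
        new_hub = _normalise(
            {p: sum(new_auth[q] for q in links.get(p, ())) for p in pages}
        )
        delta = max(
            max(abs(new_auth[p] - auth[p]) for p in pages),
            max(abs(new_hub[p] - hub[p]) for p in pages),
        )
        auth, hub = new_auth, new_hub
        if delta < tol:  # scores have reached a fixed point
            break
    return auth, hub


# Hypothetical base set: two hub-like pages pointing at two authority-like pages.
links = {"h1": {"a1", "a2"}, "h2": {"a1"}}
scent = {"a1": 0.9, "a2": 0.4, "h1": 0.5, "h2": 0.3}
auth, hub = modified_hits(links, scent)
```

In this toy graph, "a1" is linked from both hub pages and ends up with the highest authority weight, while "h1" links to both authorities and ends up with the highest hub weight. A short list such as the one in step 6 could then be obtained by keeping the pages whose weight exceeds a chosen threshold.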

5.1. Proposed Approach

1. Offline preprocessing phase, run at regular periodic intervals:
1.1 Extract the queries and associated clicked URLs from the data set.
1.2 Preprocess the extracted queries to find the query sessions.
1.3 Model the information need associated with each query session using the information scent and weighted content vectors of the pages in the session, using equations (1), (2) and (3).
1.4 Cluster the query sessions by the information need associated with each session using k-means clustering.
1.5 Use each cluster C_j as the root set for the modified HITS algorithm to discover high scent authority and hub web pages. Create a list HA_j of hubs and authorities having high scent for hub and authority score, using the threshold value x for Information Scent.

2. Online processing phase:
2.1 Find the cluster C_j to which the input query q belongs.
2.2 If no cluster is found, find the cluster C_j that is most similar to the term weight vector of the input query q, subject to the threshold value set for the similarity measure.
2.3 Return the URLs in the set HA_j associated with the selected cluster C_j as high scent hub and authority pages in the top result page of the retrieved page set for the input query q.

6. Experimental Study

The experiment was performed on a data set collected from the Web History of the "Google" search engine, which stores the history of user interaction for all query topics issued on the search engine. Each entry in the Web History contains the following fields:

1. Time of the day
2. Query terms
3. Clicked URLs

On submission of an input query, the "Google" search engine returns a result page consisting of URLs with information about them. The URLs are ranked in order of relevance to the input query as determined by the "Google" internal relevance function. In the experiment, only those query sessions that had at least one click in their answers were selected.
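The session definition used here (a query together with the URLs clicked before the next query, keeping only sessions with at least one click) can be sketched as follows. The entry layout `(time, query, clicked_url)` mirrors the three fields listed above, but the exact log format, the function name and the use of `None` for an entry without a click are assumptions made for illustration.

```python
def build_query_sessions(log_entries):
    """Group time-ordered log entries of one user into query sessions.

    log_entries: iterable of (time, query, clicked_url) tuples, where
    clicked_url is None when the entry records a query without a click.
    Returns a list of (query, [clicked URLs]) with at least one click each.
    """
    sessions = []
    current_query, clicks = None, []
    for _, query, url in log_entries:
        # Any change in the query text starts a new session.
        if query != current_query:
            if current_query is not None and clicks:
                sessions.append((current_query, clicks))
            current_query, clicks = query, []
        if url:
            clicks.append(url)
    # Flush the last open session, again only if it has a click.
    if current_query is not None and clicks:
        sessions.append((current_query, clicks))
    return sessions


# Hypothetical log: the second query has no click and is dropped.
log = [
    (1, "sql tutorial", "u1"),
    (2, "sql tutorial", "u2"),
    (3, "arena football", None),
    (4, "kit car", "u3"),
]
sessions = build_query_sessions(log)
```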
The query sessions considered consist of query terms along with the clicked URLs. The clicked URLs were those URLs that the user clicked before submitting another query. The number of distinct URLs in the data set was found to be [...]. The data set was preprocessed to obtain 9088 query sessions. The information need of the query sessions was modeled using Information Scent and the content of the clicked URLs, and the sessions were clustered using the k-means algorithm. The k-means algorithm was executed several times for different values of k, and the criterion function was computed for each value of k. The criterion function had its maximum value at k=161. The similarity of vectors was measured using the cosine formula for weighted term vectors. The similarity threshold value was set to 0.5. The threshold value x for information scent was set to 0.25.

The experiment was performed on queries selected from three domains: entertainment, academics and sports. The data set was generated by users with expertise in the selected domains, who were asked to issue queries in these domains. A sample of the queries taken from the data set in each of these domains is given in Table 1.

Table 1. Sample of queries in selected domains

Domain          Queries
Entertainment   Free pics, online audio stores, Free download mp3, skies of arcadia pictures, vcd files, mpeg movies
Sports          Grand American road racing series, Arena football, South dakota wrestling, Major league baseball tryouts, kit car, Arena football
Academics       Cgi perl tutorial, sql tutorial, tutorial oracle, windows 2000 tutorial, macros, templates, Weblogs

The experiment was performed on a Pentium IV PC with 512 MB RAM under Windows XP using Java and an Oracle database. The WebSphinx crawler was used to fetch the clicked documents of the query sessions in the data set. Each query session was transformed into its vector representation using Information Scent and the content of the clicked URLs. The query sessions were clustered, and each cluster of query sessions was represented by the mean value of its term vectors.
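The model selection step described above (running k-means for several values of k and keeping the k with the largest criterion function) relies on the criterion function I of equation (4), which can be sketched in plain Python as below. The k-means runs themselves are not shown. The helper names and the toy clusterings are hypothetical, and each clustering is assumed to be given as a list of non-empty clusters of session vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of the vectors of one cluster, equation (5)."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def criterion(clusters):
    """Average cosine similarity of each vector to its cluster centroid,
    following equation (4). `clusters` is a list of non-empty vector lists."""
    total = sum(len(cluster) for cluster in clusters)
    score = 0.0
    for cluster in clusters:
        c_p = centroid(cluster)
        score += sum(cosine(v, c_p) for v in cluster)
    return score / total


# Two hypothetical clusterings of the same four session vectors.
tight = [[[1.0, 0.0], [2.0, 0.0]], [[0.0, 1.0], [0.0, 3.0]]]
mixed = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]]]
candidates = {"tight": tight, "mixed": mixed}
best = max(candidates, key=lambda name: criterion(candidates[name]))
```

The same `max(..., key=...)` pattern would apply when the candidates are clusterings produced for different values of k.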
Some of the good authority and hub pages discovered by the modified HITS, using the relevant pages in the cluster to which the input query "Search engine" belongs, are given in Table 2.

Table 2. Sample of authorities and hub pages discovered for the query "Search engine"

Query           Authorities and hub pages
Search engine   Searchenginecolossus.com, Google.com, yahoo.com, 123khoj.com, Searchturtle.com, alltheweb.com, lycos.com, msn.com, indiaspider.com, altavista.com, excite.com, searchenginewatch.com, askjeeves.com, search.netscape.com

6.1. Performance of proposed approach

The performance of the proposed approach, which recommends hubs and authorities using Information Scent in HITS with query sessions mining, was evaluated by anonymous users with knowledge of the domains from which the queries were selected. The performance was evaluated on trained and untrained sets of queries belonging to each of the domains considered. Trained queries were those that had sessions associated with them in the data set, and untrained queries were those that did not. The experiment was performed on the trained and untrained sets separately. The performance was evaluated using the average precision of the queries in each domain. The experiment was performed on 25 trained queries and 35 untrained queries. The average precision was calculated over the top 10 retrieved URLs, with users marking the relevant documents within the list of top 10 URLs retrieved for a given query both with and without the proposed approach.

[Chart: average precision by domain (entertainment, academics, sports) for untrained queries; series: without proposed approach, proposed approach]

[Chart: average precision by domain (entertainment, academics, sports) for trained queries; series: without proposed approach, proposed approach]

Fig 2. Average precision without and with the proposed approach on trained queries.

Fig 1 and Fig 2 show the average precision of the search results of the Google search engine without and with the proposed approach presented in this paper, for both the trained and untrained sets of queries.
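The evaluation measure described above can be sketched as follows, reading "average precision" as the mean, over the queries of a domain, of the fraction of the top 10 returned URLs that users marked relevant. That reading, the function names and the sample data are assumptions made for illustration, and no figures from the paper are reproduced.

```python
def precision_at_k(retrieved, relevant, k=10):
    """Fraction of the top-k retrieved URLs that are marked relevant.

    If fewer than k URLs were retrieved, the fraction is taken over the
    URLs actually returned (an assumption, the paper does not say)."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for url in top if url in relevant) / len(top)


def average_precision_over_queries(results, judgments, k=10):
    """Mean precision at k over all queries of one domain.

    results: query -> ranked list of URLs; judgments: query -> set of
    URLs that users marked as relevant."""
    scores = [
        precision_at_k(urls, judgments.get(query, set()), k)
        for query, urls in results.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical judgments for two queries of one domain.
results = {"q1": ["a", "b", "c", "d"], "q2": ["e", "f"]}
judgments = {"q1": {"a", "c"}, "q2": {"e", "f"}}
domain_score = average_precision_over_queries(results, judgments)
```

Running the same function on the result lists obtained with and without the proposed approach would give the two values compared per domain.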
The above experiment shows that information retrieval precision is improved for both the trained and the untrained sets of queries using the proposed approach. The improvement in average precision confirms the effectiveness of the proposed approach in satisfying the information need of the user through the efficient recommendation of high scent authorities and hubs for a given input query using Information Scent in HITS with query sessions mining.

7. Conclusion

In this paper, efforts have been made to improve information retrieval precision by recommending good hub and authority pages based on the information need of the input query. Information Scent is used in HITS with query sessions mining to compute the hubs and authorities for each unique information need identified in query sessions mining. The clusters of query sessions with similar information need are used in HITS to compute the high scent hubs and authorities for the information need associated with each cluster. The effectiveness and efficiency of information retrieval are improved by introducing Information Scent in HITS with query sessions mining to generate high information scent hubs and authorities for each specific information need represented by the clusters of query sessions. The experimental results confirm the improvement in the precision of information retrieval using the proposed approach.

Fig 1. Average precision without and with the proposed approach on untrained queries.

8. References

[1] E. Agichtein, E. Brill, S. Dumais. Improving Web Search Ranking by Incorporating User Behaviour. In Proceedings of the ACM Conference on Research and Development on Information Retrieval (SIGIR).
[2] E. H. Chi, P. Pirolli, K. Chen and J. Pitkow. Using Information Scent to Model User Information Needs and Actions on the Web. In Proc. ACM CHI 2001 Conference on Human Factors in Computing Systems, 2001.
[3] J. Heer and E. H. Chi. Identification of Web User Traffic Composition using Multi-Modal Clustering and Information Scent. In Proc. of Workshop on Web Mining, SIAM Conference on Data Mining, 2001.
[4] Jon M. Kleinberg. Authoritative Sources in a Hyperlinked Environment. J. ACM, 46(5), 1999.
[5] P. Pirolli. Computational models of information scent-following in a very large browsable text collection. In Proc. ACM CHI 97 Conference on Human Factors in Computing Systems, pp. 3-10.
[6] P. Pirolli. The use of proximal information scent to forage for distal content on the world wide web. In Working with Technology in Mind: Brunswikian Resources for Cognitive Science and Engineering, Oxford University Press.
[7] Punam Bedi and Suruchi Chawla. Improving Information Retrieval Precision using Query Log Mining and Information Scent. Information Technology Journal, 6(4), Asian Network for Scientific Information.
[8] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley.
[9] J.-R. Wen, J.-Y. Nie, H.-J. Zhang. Query Clustering Using User Logs. ACM Transactions on Information Systems, vol. 20, no. 1, 2002.
[10] V. N. Gudivada, V. V. Raghavan, W. Grosky, and R. Kasanagottu. Information Retrieval on World Wide Web. IEEE Expert, 1997.
[11] Y. Zhao and G. Karypis. Comparison of agglomerative and partitional document clustering algorithms. In SIAM Workshop on Clustering High-dimensional Data and its Applications.
[12] Y. Zhao and G. Karypis. Criterion functions for document clustering. Technical report, University of Minnesota, Minneapolis, MN, 2002.


More information

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.

More information

Web Structure Mining using Link Analysis Algorithms

Web Structure Mining using Link Analysis Algorithms Web Structure Mining using Link Analysis Algorithms Ronak Jain Aditya Chavan Sindhu Nair Assistant Professor Abstract- The World Wide Web is a huge repository of data which includes audio, text and video.

More information

Authoritative Sources in a Hyperlinked Environment

Authoritative Sources in a Hyperlinked Environment Authoritative Sources in a Hyperlinked Environment Journal of the ACM 46(1999) Jon Kleinberg, Dept. of Computer Science, Cornell University Introduction Searching on the web is defined as the process of

More information

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _ COURSE DELIVERY PLAN - THEORY Page 1 of 6 Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _ LP: CS6007 Rev. No: 01 Date: 27/06/2017 Sub.

More information

COMP5331: Knowledge Discovery and Data Mining

COMP5331: Knowledge Discovery and Data Mining COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, Jon M. Kleinberg 1 1 PageRank

More information

Web. The Discovery Method of Multiple Web Communities with Markov Cluster Algorithm

Web. The Discovery Method of Multiple Web Communities with Markov Cluster Algorithm Markov Cluster Algorithm Web Web Web Kleinberg HITS Web Web HITS Web Markov Cluster Algorithm ( ) Web The Discovery Method of Multiple Web Communities with Markov Cluster Algorithm Kazutami KATO and Hiroshi

More information

Learn from Web Search Logs to Organize Search Results

Learn from Web Search Logs to Organize Search Results Learn from Web Search Logs to Organize Search Results Xuanhui Wang Department of Computer Science University of Illinois at Urbana-Champaign Urbana, IL 61801 xwang20@cs.uiuc.edu ChengXiang Zhai Department

More information

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far

More information

Retrieval of Web Documents Using a Fuzzy Hierarchical Clustering

Retrieval of Web Documents Using a Fuzzy Hierarchical Clustering International Journal of Computer Applications (97 8887) Volume No., August 2 Retrieval of Documents Using a Fuzzy Hierarchical Clustering Deepti Gupta Lecturer School of Computer Science and Information

More information

Path Analysis References: Ch.10, Data Mining Techniques By M.Berry, andg.linoff Dr Ahmed Rafea

Path Analysis References: Ch.10, Data Mining Techniques By M.Berry, andg.linoff  Dr Ahmed Rafea Path Analysis References: Ch.10, Data Mining Techniques By M.Berry, andg.linoff http://www9.org/w9cdrom/68/68.html Dr Ahmed Rafea Outline Introduction Link Analysis Path Analysis Using Markov Chains Applications

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Selecting Topics for Web Resource Discovery: Efficiency Issues in a Database Approach +

Selecting Topics for Web Resource Discovery: Efficiency Issues in a Database Approach + Selecting Topics for Web Resource Discovery: Efficiency Issues in a Database Approach + Abdullah Al-Hamdani, Gultekin Ozsoyoglu Electrical Engineering and Computer Science Dept, Case Western Reserve University,

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

User Intent Discovery using Analysis of Browsing History

User Intent Discovery using Analysis of Browsing History User Intent Discovery using Analysis of Browsing History Wael K. Abdallah Information Systems Dept Computers & Information Faculty Mansoura University Mansoura, Egypt Dr. / Aziza S. Asem Information Systems

More information

Similarity Matrix Based Session Clustering by Sequence Alignment Using Dynamic Programming

Similarity Matrix Based Session Clustering by Sequence Alignment Using Dynamic Programming Similarity Matrix Based Session Clustering by Sequence Alignment Using Dynamic Programming Dr.K.Duraiswamy Dean, Academic K.S.Rangasamy College of Technology Tiruchengode, India V. Valli Mayil (Corresponding

More information

Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts

Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts Kwangcheol Shin 1, Sang-Yong Han 1, and Alexander Gelbukh 1,2 1 Computer Science and Engineering Department, Chung-Ang University,

More information

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page International Journal of Soft Computing and Engineering (IJSCE) ISSN: 31-307, Volume-, Issue-3, July 01 Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page Neelam Tyagi, Simple

More information

A NEW CLUSTER MERGING ALGORITHM OF SUFFIX TREE CLUSTERING

A NEW CLUSTER MERGING ALGORITHM OF SUFFIX TREE CLUSTERING A NEW CLUSTER MERGING ALGORITHM OF SUFFIX TREE CLUSTERING Jianhua Wang, Ruixu Li Computer Science Department, Yantai University, Yantai, Shandong, China Abstract: Key words: Document clustering methods

More information

A Metric for Inferring User Search Goals in Search Engines

A Metric for Inferring User Search Goals in Search Engines International Journal of Engineering and Technical Research (IJETR) A Metric for Inferring User Search Goals in Search Engines M. Monika, N. Rajesh, K.Rameshbabu Abstract For a broad topic, different users

More information

Searching the Web What is this Page Known for? Luis De Alba

Searching the Web What is this Page Known for? Luis De Alba Searching the Web What is this Page Known for? Luis De Alba ldealbar@cc.hut.fi Searching the Web Arasu, Cho, Garcia-Molina, Paepcke, Raghavan August, 2001. Stanford University Introduction People browse

More information

A Novel Approach for Weighted Clustering

A Novel Approach for Weighted Clustering A Novel Approach for Weighted Clustering CHANDRA B. Indian Institute of Technology, Delhi Hauz Khas, New Delhi, India 110 016. Email: bchandra104@yahoo.co.in Abstract: - In majority of the real life datasets,

More information

IMAGE RETRIEVAL SYSTEM: BASED ON USER REQUIREMENT AND INFERRING ANALYSIS TROUGH FEEDBACK

IMAGE RETRIEVAL SYSTEM: BASED ON USER REQUIREMENT AND INFERRING ANALYSIS TROUGH FEEDBACK IMAGE RETRIEVAL SYSTEM: BASED ON USER REQUIREMENT AND INFERRING ANALYSIS TROUGH FEEDBACK 1 Mount Steffi Varish.C, 2 Guru Rama SenthilVel Abstract - Image Mining is a recent trended approach enveloped in

More information

Information Retrieval. Lecture 11 - Link analysis

Information Retrieval. Lecture 11 - Link analysis Information Retrieval Lecture 11 - Link analysis Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 35 Introduction Link analysis: using hyperlinks

More information

Keyword Extraction by KNN considering Similarity among Features

Keyword Extraction by KNN considering Similarity among Features 64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,

More information

Social Information Retrieval

Social Information Retrieval Social Information Retrieval Sebastian Marius Kirsch kirschs@informatik.uni-bonn.de th November 00 Format of this talk about my diploma thesis advised by Prof. Dr. Armin B. Cremers inspired by research

More information

Web page recommendation using a stochastic process model

Web page recommendation using a stochastic process model Data Mining VII: Data, Text and Web Mining and their Business Applications 233 Web page recommendation using a stochastic process model B. J. Park 1, W. Choi 1 & S. H. Noh 2 1 Computer Science Department,

More information

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem I J C T A, 9(41) 2016, pp. 1235-1239 International Science Press Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem Hema Dubey *, Nilay Khare *, Alind Khare **

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

ihits: Extending HITS for Personal Interests Profiling

ihits: Extending HITS for Personal Interests Profiling ihits: Extending HITS for Personal Interests Profiling Ziming Zhuang School of Information Sciences and Technology The Pennsylvania State University zzhuang@ist.psu.edu Abstract Ever since the boom of

More information

International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Volume 1, Issue 2, July 2014.

International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Volume 1, Issue 2, July 2014. A B S T R A C T International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Information Retrieval Models and Searching Methodologies: Survey Balwinder Saini*,Vikram Singh,Satish

More information

CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL

CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL STUDIA UNIV. BABEŞ BOLYAI, INFORMATICA, Volume LVII, Number 4, 2012 CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL IOAN BADARINZA AND ADRIAN STERCA Abstract. In this paper

More information

Link Analysis in Web Information Retrieval

Link Analysis in Web Information Retrieval Link Analysis in Web Information Retrieval Monika Henzinger Google Incorporated Mountain View, California monika@google.com Abstract The analysis of the hyperlink structure of the web has led to significant

More information

Enabling Users to Visually Evaluate the Effectiveness of Different Search Queries or Engines

Enabling Users to Visually Evaluate the Effectiveness of Different Search Queries or Engines Appears in WWW 04 Workshop: Measuring Web Effectiveness: The User Perspective, New York, NY, May 18, 2004 Enabling Users to Visually Evaluate the Effectiveness of Different Search Queries or Engines Anselm

More information

WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW

WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW ISSN: 9 694 (ONLINE) ICTACT JOURNAL ON COMMUNICATION TECHNOLOGY, MARCH, VOL:, ISSUE: WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW V Lakshmi Praba and T Vasantha Department of Computer

More information

Template Extraction from Heterogeneous Web Pages

Template Extraction from Heterogeneous Web Pages Template Extraction from Heterogeneous Web Pages 1 Mrs. Harshal H. Kulkarni, 2 Mrs. Manasi k. Kulkarni Asst. Professor, Pune University, (PESMCOE, Pune), Pune, India Abstract: Templates are used by many

More information

7. Mining Text and Web Data

7. Mining Text and Web Data 7. Mining Text and Web Data Contents of this Chapter 7.1 Introduction 7.2 Data Preprocessing 7.3 Text and Web Clustering 7.4 Text and Web Classification 7.5 References [Han & Kamber 2006, Sections 10.4

More information

Semantic Clickstream Mining

Semantic Clickstream Mining Semantic Clickstream Mining Mehrdad Jalali 1, and Norwati Mustapha 2 1 Department of Software Engineering, Mashhad Branch, Islamic Azad University, Mashhad, Iran 2 Department of Computer Science, Universiti

More information

Research Article QOS Based Web Service Ranking Using Fuzzy C-means Clusters

Research Article QOS Based Web Service Ranking Using Fuzzy C-means Clusters Research Journal of Applied Sciences, Engineering and Technology 10(9): 1045-1050, 2015 DOI: 10.19026/rjaset.10.1873 ISSN: 2040-7459; e-issn: 2040-7467 2015 Maxwell Scientific Publication Corp. Submitted:

More information

Link Based Clustering of Web Search Results

Link Based Clustering of Web Search Results Link Based Clustering of Web Search Results Yitong Wang and Masaru Kitsuregawa stitute of dustrial Science, The University of Tokyo {ytwang, kitsure}@tkl.iis.u-tokyo.ac.jp Abstract. With information proliferation

More information

QUERY RECOMMENDATION SYSTEM USING USERS QUERYING BEHAVIOR

QUERY RECOMMENDATION SYSTEM USING USERS QUERYING BEHAVIOR International Journal of Emerging Technology and Innovative Engineering QUERY RECOMMENDATION SYSTEM USING USERS QUERYING BEHAVIOR V.Megha Dept of Computer science and Engineering College Of Engineering

More information

On the Effectiveness of Web Usage Mining for Page Recommendation and Restructuring

On the Effectiveness of Web Usage Mining for Page Recommendation and Restructuring On the Effectiveness of Web Usage Mining for Recommendation and Restructuring Hiroshi Ishikawa, Manabu Ohta, Shohei Yokoyama, Junya Nakayama, and Kaoru Katayama Tokyo Metropolitan University Abstract.

More information

IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM

IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM Myomyo Thannaing 1, Ayenandar Hlaing 2 1,2 University of Technology (Yadanarpon Cyber City), near Pyin Oo Lwin, Myanmar ABSTRACT

More information

Finding Neighbor Communities in the Web using Inter-Site Graph

Finding Neighbor Communities in the Web using Inter-Site Graph Finding Neighbor Communities in the Web using Inter-Site Graph Yasuhito Asano 1, Hiroshi Imai 2, Masashi Toyoda 3, and Masaru Kitsuregawa 3 1 Graduate School of Information Sciences, Tohoku University

More information

Lecture 17 November 7

Lecture 17 November 7 CS 559: Algorithmic Aspects of Computer Networks Fall 2007 Lecture 17 November 7 Lecturer: John Byers BOSTON UNIVERSITY Scribe: Flavio Esposito In this lecture, the last part of the PageRank paper has

More information

Analytical survey of Web Page Rank Algorithm

Analytical survey of Web Page Rank Algorithm Analytical survey of Web Page Rank Algorithm Mrs.M.Usha 1, Dr.N.Nagadeepa 2 Research Scholar, Bharathiyar University,Coimbatore 1 Associate Professor, Jairams Arts and Science College, Karur 2 ABSTRACT

More information

A Novel PAT-Tree Approach to Chinese Document Clustering

A Novel PAT-Tree Approach to Chinese Document Clustering A Novel PAT-Tree Approach to Chinese Document Clustering Kenny Kwok, Michael R. Lyu, Irwin King Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin, N.T., Hong Kong

More information

A Model for Interactive Web Information Retrieval

A Model for Interactive Web Information Retrieval A Model for Interactive Web Information Retrieval Orland Hoeber and Xue Dong Yang University of Regina, Regina, SK S4S 0A2, Canada {hoeber, yang}@uregina.ca Abstract. The interaction model supported by

More information

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Proximity Prestige using Incremental Iteration in Page Rank Algorithm Indian Journal of Science and Technology, Vol 9(48), DOI: 10.17485/ijst/2016/v9i48/107962, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Proximity Prestige using Incremental Iteration

More information

ANALYSIS COMPUTER SCIENCE Discovery Science, Volume 9, Number 20, April 3, Comparative Study of Classification Algorithms Using Data Mining

ANALYSIS COMPUTER SCIENCE Discovery Science, Volume 9, Number 20, April 3, Comparative Study of Classification Algorithms Using Data Mining ANALYSIS COMPUTER SCIENCE Discovery Science, Volume 9, Number 20, April 3, 2014 ISSN 2278 5485 EISSN 2278 5477 discovery Science Comparative Study of Classification Algorithms Using Data Mining Akhila

More information

AN EFFICIENT COLLECTION METHOD OF OFFICIAL WEBSITES BY ROBOT PROGRAM

AN EFFICIENT COLLECTION METHOD OF OFFICIAL WEBSITES BY ROBOT PROGRAM AN EFFICIENT COLLECTION METHOD OF OFFICIAL WEBSITES BY ROBOT PROGRAM Masahito Yamamoto, Hidenori Kawamura and Azuma Ohuchi Graduate School of Information Science and Technology, Hokkaido University, Japan

More information

Document Clustering: Comparison of Similarity Measures

Document Clustering: Comparison of Similarity Measures Document Clustering: Comparison of Similarity Measures Shouvik Sachdeva Bhupendra Kastore Indian Institute of Technology, Kanpur CS365 Project, 2014 Outline 1 Introduction The Problem and the Motivation

More information

Analysis of Dendrogram Tree for Identifying and Visualizing Trends in Multi-attribute Transactional Data

Analysis of Dendrogram Tree for Identifying and Visualizing Trends in Multi-attribute Transactional Data Analysis of Dendrogram Tree for Identifying and Visualizing Trends in Multi-attribute Transactional Data D.Radha Rani 1, A.Vini Bharati 2, P.Lakshmi Durga Madhuri 3, M.Phaneendra Babu 4, A.Sravani 5 Department

More information

Personalized Information Retrieval

Personalized Information Retrieval Personalized Information Retrieval Shihn Yuarn Chen Traditional Information Retrieval Content based approaches Statistical and natural language techniques Results that contain a specific set of words or

More information

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Abstract - The primary goal of the web site is to provide the

More information

The application of Randomized HITS algorithm in the fund trading network

The application of Randomized HITS algorithm in the fund trading network The application of Randomized HITS algorithm in the fund trading network Xingyu Xu 1, Zhen Wang 1,Chunhe Tao 1,Haifeng He 1 1 The Third Research Institute of Ministry of Public Security,China Abstract.

More information

Keywords: clustering algorithms, unsupervised learning, cluster validity

Keywords: clustering algorithms, unsupervised learning, cluster validity Volume 6, Issue 1, January 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Clustering Based

More information

Feature Selecting Model in Automatic Text Categorization of Chinese Financial Industrial News

Feature Selecting Model in Automatic Text Categorization of Chinese Financial Industrial News Selecting Model in Automatic Text Categorization of Chinese Industrial 1) HUEY-MING LEE 1 ), PIN-JEN CHEN 1 ), TSUNG-YEN LEE 2) Department of Information Management, Chinese Culture University 55, Hwa-Kung

More information

International Journal of Advance Engineering and Research Development. Survey of Web Usage Mining Techniques for Web-based Recommendations

International Journal of Advance Engineering and Research Development. Survey of Web Usage Mining Techniques for Web-based Recommendations Scientific Journal of Impact Factor (SJIF): 5.71 International Journal of Advance Engineering and Research Development Volume 5, Issue 02, February -2018 e-issn (O): 2348-4470 p-issn (P): 2348-6406 Survey

More information

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY IJCSNS International Journal of Computer Science and Network Security, VOL.9 No.4, April 2009 349 WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY Mohammed M. Sakre Mohammed M. Kouta Ali M. N. Allam Al Shorouk

More information

A Hybrid Recommender System for Dynamic Web Users

A Hybrid Recommender System for Dynamic Web Users A Hybrid Recommender System for Dynamic Web Users Shiva Nadi Department of Computer Engineering, Islamic Azad University of Najafabad Isfahan, Iran Mohammad Hossein Saraee Department of Electrical and

More information

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group Information Retrieval Lecture 4: Web Search Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk (Lecture Notes after Stephen Clark)

More information

Today s topic CS347. Results list clustering example. Why cluster documents. Clustering documents. Lecture 8 May 7, 2001 Prabhakar Raghavan

Today s topic CS347. Results list clustering example. Why cluster documents. Clustering documents. Lecture 8 May 7, 2001 Prabhakar Raghavan Today s topic CS347 Clustering documents Lecture 8 May 7, 2001 Prabhakar Raghavan Why cluster documents Given a corpus, partition it into groups of related docs Recursively, can induce a tree of topics

More information

Research Article Apriori Association Rule Algorithms using VMware Environment

Research Article Apriori Association Rule Algorithms using VMware Environment Research Journal of Applied Sciences, Engineering and Technology 8(2): 16-166, 214 DOI:1.1926/rjaset.8.955 ISSN: 24-7459; e-issn: 24-7467 214 Maxwell Scientific Publication Corp. Submitted: January 2,

More information

Query Sugges*ons. Debapriyo Majumdar Information Retrieval Spring 2015 Indian Statistical Institute Kolkata

Query Sugges*ons. Debapriyo Majumdar Information Retrieval Spring 2015 Indian Statistical Institute Kolkata Query Sugges*ons Debapriyo Majumdar Information Retrieval Spring 2015 Indian Statistical Institute Kolkata Search engines User needs some information search engine tries to bridge this gap ssumption: the

More information

Kohei Arai 1 Graduate School of Science and Engineering Saga University Saga City, Japan

Kohei Arai 1 Graduate School of Science and Engineering Saga University Saga City, Japan Numerical Representation of Web Sites of Remote Sensing Satellite Data Providers and Its Application to Knowledge Based Information Retrievals with Natural Language Kohei Arai 1 Graduate School of Science

More information

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science Lecture 9: I: Web Retrieval II: Webology Johan Bollen Old Dominion University Department of Computer Science jbollen@cs.odu.edu http://www.cs.odu.edu/ jbollen April 10, 2003 Page 1 WWW retrieval Two approaches

More information

INCORPORATING SYNONYMS INTO SNIPPET BASED QUERY RECOMMENDATION SYSTEM

INCORPORATING SYNONYMS INTO SNIPPET BASED QUERY RECOMMENDATION SYSTEM INCORPORATING SYNONYMS INTO SNIPPET BASED QUERY RECOMMENDATION SYSTEM Megha R. Sisode and Ujwala M. Patil Department of Computer Engineering, R. C. Patel Institute of Technology, Shirpur, Maharashtra,

More information

Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure

Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure Neelam Singh neelamjain.jain@gmail.com Neha Garg nehagarg.february@gmail.com Janmejay Pant geujay2010@gmail.com

More information

MAINTAIN TOP-K RESULTS USING SIMILARITY CLUSTERING IN RELATIONAL DATABASE

MAINTAIN TOP-K RESULTS USING SIMILARITY CLUSTERING IN RELATIONAL DATABASE INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS ISSN 2320-7345 MAINTAIN TOP-K RESULTS USING SIMILARITY CLUSTERING IN RELATIONAL DATABASE Syamily K.R 1, Belfin R.V 2 1 PG student,

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,

More information

Making Retrieval Faster Through Document Clustering

Making Retrieval Faster Through Document Clustering R E S E A R C H R E P O R T I D I A P Making Retrieval Faster Through Document Clustering David Grangier 1 Alessandro Vinciarelli 2 IDIAP RR 04-02 January 23, 2004 D a l l e M o l l e I n s t i t u t e

More information

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine International Journal of Scientific & Engineering Research Volume 2, Issue 12, December-2011 1 Web Search Engine G.Hanumantha Rao*, G.NarenderΨ, B.Srinivasa Rao+, M.Srilatha* Abstract This paper explains

More information

Segmentation of User Task Behaviour by Using Neural Network

Segmentation of User Task Behaviour by Using Neural Network Segmentation of User Task Behaviour by Using Neural Network Arti Dwivedi(Mtech scholar), Asst. Prof. Umesh Lilhore Abstract This paper, introduces Segmentation of User s Task to understand user search

More information

Fuzzy Cognitive Maps application for Webmining

Fuzzy Cognitive Maps application for Webmining Fuzzy Cognitive Maps application for Webmining Andreas Kakolyris Dept. Computer Science, University of Ioannina Greece, csst9942@otenet.gr George Stylios Dept. of Communications, Informatics and Management,

More information

Popularity Weighted Ranking for Academic Digital Libraries

Popularity Weighted Ranking for Academic Digital Libraries Popularity Weighted Ranking for Academic Digital Libraries Yang Sun and C. Lee Giles Information Sciences and Technology The Pennsylvania State University University Park, PA, 16801, USA Abstract. We propose

More information

Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms

Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms Engineering, Technology & Applied Science Research Vol. 8, No. 1, 2018, 2562-2567 2562 Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms Mrunal S. Bewoor Department

More information

CHAPTER 5 OPTIMAL CLUSTER-BASED RETRIEVAL

CHAPTER 5 OPTIMAL CLUSTER-BASED RETRIEVAL 85 CHAPTER 5 OPTIMAL CLUSTER-BASED RETRIEVAL 5.1 INTRODUCTION Document clustering can be applied to improve the retrieval process. Fast and high quality document clustering algorithms play an important

More information

Association-Rules-Based Recommender System for Personalization in Adaptive Web-Based Applications

Association-Rules-Based Recommender System for Personalization in Adaptive Web-Based Applications Association-Rules-Based Recommender System for Personalization in Adaptive Web-Based Applications Daniel Mican, Nicolae Tomai Babes-Bolyai University, Dept. of Business Information Systems, Str. Theodor

More information

CS290N Summary Tao Yang

CS290N Summary Tao Yang CS290N Summary 2015 Tao Yang Text books [CMS] Bruce Croft, Donald Metzler, Trevor Strohman, Search Engines: Information Retrieval in Practice, Publisher: Addison-Wesley, 2010. Book website. [MRS] Christopher

More information