Evaluation of Integration Algorithms for Meta-Search Engine


Evaluation of Integration Algorithms for Meta-Search Engine

Bhavsingh Maloth
Department of Information Technology, Guru Nanak Engineering College (Approved by A.I.C.T.E., New Delhi; Affiliated to Jawaharlal Nehru Technological University), Kukatpally, Hyderabad-500 085, Andhra Pradesh, India
bhavsingh.mtech@gmail.com

ABSTRACT: Many people use search engines every day to retrieve documents from the Web. Although the social influence of search engine rankings has become significant, ranking algorithms are not disclosed. In this paper, we investigate three major search engine rankings by analyzing two kinds of data. One is the weekly ranking snapshots of the top 250 Web pages, which we collected for almost one year by submitting 1,000 pre-selected queries; the other comprises back-linked Web pages gathered by our own Web crawling. As a result, we have confirmed that (1) the top-10 rankings are mutually similar, while the Web pages ranked below them are almost entirely different, (2) ranking transitions have their own characteristics, and (3) each search engine's ranking has its own correlation with the number of back-linked Web pages.

I. INTRODUCTION

In recent years, since the number of Web pages has been increasing rapidly, it has become almost impossible to obtain information from the Web without using search engines. Since most people refer only to high-ranked Web pages, such as the top 10 results, the information that we obtain from the Web might be biased according to search engine rankings. For example, if the Web pages that support one side's opinion about a topic were ranked high by search engines and the others were not, the information we obtain from the Web might be biased toward the former opinion. If this happens frequently, then it can be said that our knowledge is biased according to search engine rankings. Since search engine rankings have a great influence, it is important to understand how search engines rank Web pages and to check whether their rankings are biased.
To investigate the bias of search engines, we focused on the features of search engine rankings. In this study, we investigated the features of two major search engine rankings, MSN and Yahoo, by comparing ranking bias and the correlation between rankings and back-links. When retrieved items are returned by the underlying search engines, the meta-search engine further processes these items and presents relevant items to the user based on a result integration algorithm. J.P. Callan proposed four typical synthesis algorithms to handle different situations. Kirsch provided another typical method, which required the participant search engines to return some additional information; the meta-search engine would use this information to recalculate the relevance of documents on the client side. MetaCrawler introduced the concept of credibility to determine the relevance between documents and queries. ProFusion's algorithm was one of the normalized-score methods. Inquirus adopted a merge strategy that also recalculated the relevance on the client side. In this project, I evaluate and experimentally compare four merge algorithms:

1. Simple merge algorithm
2. Abstract merge algorithm
3. Position merge algorithm
4. Abstract/position merge algorithm

A. Simple merge algorithm

This approach is similar to a single search engine: it simply arranges the results from the participant search engines in turn, using a multi-way merge method. In practice, users are interested with high probability in the search results of the first three pages. The results after three pages can only enhance the completeness of the search results; they cannot improve the user experience any further.
We first give a formal description of the simple merge algorithm. Assume there are n participant search engines in the meta-search engine. Let S = {S_i}, i = 1, 2, ..., n, be the set of participant search engines, where S_i is the i-th participant search engine; let q be a query submitted to the meta-search engine; let R_i = {r_ij}, i = 1, 2, ..., n, j = 1, 2, ..., m, be the result list of query q returned by the i-th participant search engine; and let R_final = {a_k}, k = 1, 2, ..., n*m, be the final merged result of the simple merge algorithm. For any k, k = 1, 2, ..., n*m, we have

a_k = r_{x,y}, where x = k - [k/m] * m and y = [k/m].

In order to show the effect of cross-merging, the simple merge algorithm does not remove duplicates. In contrast, the algorithms that follow do remove them.
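The multi-way merge above can be sketched as a round-robin interleaving of the participant engines' result lists. The engine names and result strings below are made-up illustration data, not from the paper:

```python
from itertools import chain, zip_longest

def simple_merge(result_lists):
    """Round-robin (multi-way) merge of the participant engines' result
    lists: the first result of every engine, then the second result of
    every engine, and so on. Duplicates are deliberately kept, as in the
    simple merge algorithm described above."""
    interleaved = chain.from_iterable(zip_longest(*result_lists))
    # zip_longest pads shorter lists with None; drop the padding
    return [r for r in interleaved if r is not None]

# Hypothetical result lists from two participant engines
google = ["g1", "g2", "g3"]
yahoo = ["y1", "y2", "y3"]
merged = simple_merge([google, yahoo])  # ['g1', 'y1', 'g2', 'y2', 'g3', 'y3']
```

Because duplicates are not removed at this stage, the same URL returned by both engines appears twice in the merged list, which is exactly the cross-merge effect the paper wants to expose.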

Fig 1: Simple merge algorithm

B. Abstract merge algorithm

The main idea of the abstract merge algorithm is to rank search results by the relevance between the query and the abstract (snippet) information of the results. First, we extract the terms from the query and calculate the relevance between each term and the abstract. Second, we calculate the relevance between the query and each page. Finally, the results are returned to users in order of their relevance. We use the following method to calculate the relevance between the query and the abstract of one page:

Rank(term_j, abstract) = Sum_i ln(Length(abstract) / Location(term_j, i, abstract)), if Occurrence(term_j, abstract) > 0; 0, if Occurrence(term_j, abstract) = 0. (1)

where Length(abstract) is the length of the abstract; Occurrence(term_j, abstract) indicates the frequency of term_j in the abstract; and Location(term_j, i, abstract) indicates the i-th position of term_j in the abstract. Thus, the relevance between the query and the abstract can be calculated by the following formula:

abstract_rank(query, abstract) = Sum_i Rank(term_i, abstract) (2)

By calculating the relevance between the query and each result record through formula (2), we can arrange the results in descending order of relevance and return the ranked list to users. For any two records R1, R2 in R_base, where R_base = R_1 ∪ R_2 ∪ ... ∪ R_n, if R1.hyperlink equals R2.hyperlink, we say that record R1 equals record R2, written R1 ≈ R2. For any records R1 and R2 in R_base with R1 ≈ R2, if abstract_rank(query, R1.abstract) > abstract_rank(query, R2.abstract), then R2 is deleted from R_base; otherwise R1 is deleted.

C. Position merge algorithm

The basic idea of the position merge algorithm is to make full use of the original position information from each single search engine. For the same query, some pages will occur in the result lists of several different participant search engines.
But their positions in the different result lists may not be the same. To reconcile this, we take the position in each participant's list into account. Assume the number of participant search engines is T and the priority of search engine S_i is X_i. To measure whether a record is present in the results of search engine S_i, the method uses the coefficient xis(record, i):

xis(record, i) = 1 / place(record, S_i.results), if record ∈ S_i.results; 0, otherwise. (3)

where place(record, S_i.results) is the position of the record in S_i.results. We can then estimate the position relevance place_rank(record) by the following formula:

place_rank(record) = Sum_{i=1}^{T} xis(record, i) * X_{T-i+1} (4)

By calculating the relevance of each result record through formula (4), the results are arranged in descending order, and the system then returns the de-duplicated results to users in HTML form.

Determining the priority: in the above position algorithm, the priorities of the participant search engines need to be determined in advance. In our experiment, we use two main search engines, Google and Yahoo, as the participants, so we must estimate the two parameters Xg and Xy. In reference [8], the author did not give a method to estimate the priority of a participant search engine, so in this paper we design an experimental method to estimate these parameters. In the experiment, we use "JNTU University" as the query keyword and take one page as a unit. We predefine the parameters Xg and Xy in nine groups:

G1: Xg = 0.9, Xy = 0.1; G2: Xg = 0.8, Xy = 0.2; G3: Xg = 0.7, Xy = 0.3; G4: Xg = 0.6, Xy = 0.4; G5: Xg = 0.5, Xy = 0.5; G6: Xg = 0.4, Xy = 0.6; G7: Xg = 0.3, Xy = 0.7; G8: Xg = 0.2, Xy = 0.8; G9: Xg = 0.1, Xy = 0.9,

keeping Xg + Xy = 1 throughout.
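The two scoring functions above can be sketched as follows. The whitespace tokenization, the record representation as plain strings, and the priority indexing X_{T-i+1} (taken from the reconstructed formula (4), whose original layout is garbled in this transcription) are assumptions, not details confirmed by the paper:

```python
import math

def abstract_rank(query, abstract):
    """Formulas (1)-(2): sum, over every query term and every occurrence
    of that term in the abstract, of ln(length(abstract) / position)."""
    words = abstract.lower().split()            # naive tokenization (assumption)
    length = len(words)
    score = 0.0
    for term in query.lower().split():
        for i, w in enumerate(words, start=1):  # 1-based positions
            if w == term:
                score += math.log(length / i)
    return score

def place_rank(record, result_lists, priorities):
    """Formulas (3)-(4): xis(record, i) = 1/position if the record occurs
    in engine i's list, else 0; the weighted sum uses priority X_{T-i+1}."""
    T = len(result_lists)
    total = 0.0
    for i in range(1, T + 1):
        results = result_lists[i - 1]
        xis = 1.0 / (results.index(record) + 1) if record in results else 0.0
        total += xis * priorities[T - i]        # X_{T-i+1}, 0-based list
    return total

# Hypothetical lists: "b" is rank 2 on engine 1 and rank 1 on engine 2
score = place_rank("b", [["a", "b"], ["b", "c"]], [0.6, 0.4])  # 0.5*0.4 + 1.0*0.6 = 0.8
```

A record that appears near the top of several engines' lists accumulates weight from each occurrence, which is the intended reconciliation of conflicting positions.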
For each group, we calculate the relevance and the locations of the results on the first three pages, which include about 53 records (the first page includes 14 records, the second 19 records, and the third 20 records). The experimental result is shown in Table 1. The first column indicates the sequence number of the extracted result; the second column indicates whether each record is related to the query, where 1 denotes relevant and 0 denotes irrelevant. The remaining columns show the location of each result in the corresponding page for Xg = 0.9, Xy = 0.1; Xg = 0.8, Xy = 0.2; and so on. The symbol "-" means that the record does not occur in that result page.

Table 1: The position of each record under the different Xg and Xy groups

D. Abstract/position merge algorithm

The abstract and the position are both very important pieces of information. The abstract/position merge algorithm considers these two factors together so that the integrated results better meet users' needs. The relevance between a result and the query is calculated by the following formula:

allrank(record) = A * Max{abstract_rank(query, rec.abstract) : rec ≈ record, rec ∈ R_base} + B * place_rank(record) (5)

where A = 1 / (n * Sum_{i=1}^{n} 1/i) and B = 1 / Sum_{i=1}^{T} X_{T-i+1}. The results returned from the participant search engines are re-ranked by the meta-search engine based on formula (5). Finally, the meta-search engine returns the results to the users after removing duplicates.

1) The evaluation of the results from the different merge algorithms

In this section, we compare the results of the abstract merge algorithm, the position merge algorithm and the abstract/position merge algorithm, and assess which result is better and more relevant to the user's query intent. We use two main measures: accuracy, namely the ratio of related results in the whole result list, and weight, which reflects the locations of the related results in the whole list. We designed the following evaluation plan. First, we set up 10 keywords that are broad, representative and consistent with typical user queries, such as Indo Games, JNTU University 2007, Dell computer, Kakatiya University, Chiranjeevi movie, the IPL Cricket, Matrix introduction, The Da Vinci Code, Slumdog, and Thinking in Java. Second, for each query, we take the top 30 results of each of the three algorithms as the base data set for analysis. Every result is judged by a human as related to the keywords or not, based on its content.
If a result is relevant, we mark the location x at which it occurs within the 30 results and calculate its weight with the weight function Y = 1/x^1.5 [9, 10], where Y is the probability that a human accesses the web page at position x. Because more than one of the 30 search results is relevant, we accumulate all the weights and take the average value. The final step is to calculate the accuracy, namely (the number of related results) / 30. For each algorithm, the weights and accuracy of all 10 keywords are calculated, and we then compute the average weight and average accuracy respectively. Table 2 shows the final weight and accuracy for the above three merge algorithms. From the table, we can see that the accuracies of the three merge algorithms are very close. The abstract/position algorithm is better, because it combines the strengths of abstract ranking and position ranking and therefore retrieves more related results. The other algorithms each focus on a different point of view: the abstract algorithm ranks the results based on the frequency of the query keywords in the abstract and the location coefficient, while the position merge algorithm calculates the location relevance using the locations of the related results.

Fig 2: Architecture of the meta-search engine

E. Comparing the meta-search engine with Google and Yahoo

We established a meta-search engine that takes Google and Yahoo as the participant search engines and uses the above four merge algorithms to fuse the results from the participants. To evaluate whether the meta-search engine could improve the user experience, we ran a test. The testing process is similar to the above experiment. We chose five queries as the testing set, including Olympic Games, JNTU University 2007, Dell computer and Cricket. First, we fetch the top 30 results from Google and Yahoo respectively for every query.
Second, we use the same method to estimate the relevance of these results and mark the positions, then calculate the weight for every result, accumulate the weights, and take the mean value. We can then compute the accuracy, (the number of related results) / 30, along with the experimental results for the three merge algorithms of the meta-search engine. From the analysis of the data, we can draw the following conclusion: the accuracy and weight of both Google and Yahoo are lower than those of the three meta-search merge algorithms. This shows that a meta-search engine is able to synthesize the advantages of each member search engine, and that the merge algorithms can improve the quality of the search results.
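The weight and accuracy measures described above can be sketched as follows, assuming a hand-labelled relevance list for the top 30 results; the 0/1 labels below are made-up illustration data, not the paper's judgments:

```python
def weight_and_accuracy(relevance):
    """relevance: list of 0/1 human judgments for the top 30 results,
    index 0 = rank 1. The weight of a relevant result at position x is
    Y = 1 / x**1.5; we average the weights over the relevant results and
    compute accuracy = (number of related results) / list length."""
    relevant_positions = [x for x, rel in enumerate(relevance, start=1) if rel]
    accuracy = len(relevant_positions) / len(relevance)
    if not relevant_positions:
        return 0.0, accuracy
    weights = [1 / x ** 1.5 for x in relevant_positions]
    return sum(weights) / len(weights), accuracy

# Made-up judgments: ranks 1, 2 and 4 relevant, the other 27 not
labels = [1, 1, 0, 1] + [0] * 26
avg_weight, accuracy = weight_and_accuracy(labels)  # accuracy = 3/30 = 0.1
```

Because 1/x^1.5 decays quickly, a relevant result at rank 1 contributes far more weight than one at rank 30, so the weight measure rewards algorithms that place relevant results near the top.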

Google and Yahoo's weight and accuracy are shown in Table 3, together with the experimental results for the three merge algorithms of the meta-search engine.

Fig 3: Comparison results

III. CONCLUSION

In this paper, we design an experiment to evaluate the abstract merge algorithm, the position merge algorithm and the abstract/position merge algorithm of a meta-search engine, and compare the meta-search engine with general search engines such as Google and Yahoo. From the experimental results, we can conclude that a better merge algorithm can improve the quality of searching. On average, a meta-search engine can obtain more high-quality results than a general search engine, but this does not mean that the meta-search engine will return better results than a general search engine for every query. In future work, we will adopt more effective result-ranking technology, such as classification algorithms based on neural networks, deep Web content mining, and so on.

IV. REFERENCES

[1] http://www.se-express.com/about/about.htm
[2] http://www.sowang.com/internet/2002430.htm
[3] Callan, J.P., Lu, Z., Croft, W.B. Searching distributed collections with inference networks. In: Fox, E.A., Ingwersen, P., Fidel, R., eds. Proceedings of the 18th International Conference on Research and Development in Information Retrieval. ACM Press, 1995: 21~28.
[4] Kirsch, S.T. Document retrieval over networks wherein ranking and relevance scores are computed at the client for multiple database documents. United States Patent #5,659,732, 1997.
[5] Selberg, E.W. Towards comprehensive web search [Ph.D. Thesis]. University of Washington, 1999.
[6] Gauch, S., Wang, G., Gomez, M. ProFusion: intelligent fusion from multiple, distributed search engines. Journal of Universal Computer Science, 1996, 2(9): 637~649.
[7] Howe, A.E., Dreilinger, D. SavvySearch: a metasearch engine that learns which search engines to query. ACM Transactions on Information Systems, 1997, 13(5): 195~222.
[8] Zhang Wei-feng, Xu Bao-wen, Zhou Xiao-yu, Xu Lei, Li Dong. Results generating of meta search engines. Journal of Mini-Micro Systems, Vol. 24, No. 1, 34-37, Jan 2003.
[9] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002: 133-142.
[10] R. Lempel, S. Moran. Predictive caching and prefetching of query results in search engines. In Proceedings of the Twelfth International World Wide Web Conference, 2003: 19-28.