EFFECT OF QUERY FORMATION ON WEB SEARCH ENGINE RESULTS

International Journal on Natural Language Computing (IJNLC) Vol. 2, No. 1, February 2013

Raj Kishor Bisht (1) and Ila Pant Bisht (2)
(1) Department of Computer Science & Applications, Amrapali Institute, Haldwani (Uttarakhand), India, bishtrk@gmail.com
(2) Dept. of Economics & Statistics, Govt. of Uttarakhand, Divisional Office, Haldwani (Uttarakhand), India, Pat_ila@rediffmail.com

ABSTRACT

A query in a search engine is generally expressed in natural language. A query can be expressed in more than one way without changing its meaning, since its wording depends on the thinking of the searcher at a particular moment. The aim of the searcher is to get the most relevant results irrespective of how the query has been expressed. In the present paper, we examine search engine results for changes in coverage and in the similarity of the first few results when a query is entered in two semantically identical but differently worded forms. Searching was carried out with the Google search engine. Fifteen pairs of queries were chosen for the study. The t-test was used, and the results were compared on the basis of the total number of documents found and the similarity of the first five and first ten documents returned for a query entered in two different forms. It was found that the total coverage is the same but the first few results are significantly different.

KEYWORDS

Search engine, Google, query, rank, t-test.

1. INTRODUCTION

A web query is a single word or a set of words that a searcher enters into a web search engine to obtain information. Web search queries are unstructured and differ from standard query languages. A typical searcher phrases a query according to his or her own way of communicating. For example, to find out about the economy of India, either of the two queries "Economy of India" and "Indian Economy" can be used. Although both queries are semantically the same, their syntax differs slightly.
As far as keywords are concerned, after removing stop words and stemming, both queries have the same content words, "India" and "Economy". The searcher expects the same results in both cases, as both queries are semantically the same and contain the same content words. In general, however, it is observed that the search engine does not return the same results for a query entered in two different forms, although some documents are common to the two result lists. In this paper, we study the effect of query formation on web search engine results in terms of the coverage of documents and the similarity of the first five and first ten documents. We selected the Google search engine for our experiment because of its popularity.

Many researchers have investigated the behaviour of web search results and the effect of query formation on them. Some interesting characteristics of web search were shown in [7] by analyzing queries from the Excite search engine: the average length of a search query was 2.4 terms; about half of the users entered a single query, while a little less than a third entered three or more unique queries; close to half of the users examined only the first one or two pages of results (10 results per page); and fewer than 5% of users used advanced search features (e.g., Boolean operators such as AND, OR, and NOT). One study shows that librarians may not routinely be teaching queries as a strategy for selecting and using search tools on the Web [1]. Karlgren, Sahlgren and Cöster [5] investigated topical dependencies between query terms by analyzing the distributional character of query terms. Topi and Lucas [8] examined the effects of the search interface and Boolean logic training on user search performance and satisfaction. Topi and Lucas [9] presented a detailed analysis of the structure and components of queries written by experimental participants in a study that manipulated two factors found to affect end-user information retrieval performance: training in Boolean logic and the type of search interface. Vechtomova and Karamuftuoglu [10] demonstrated effective new methods of document ranking based on lexical cohesive relationships between query terms. Eastman and Jansen [2] analyzed the impact of query operators on web search engine results. A detailed account of information retrieval technology can be found in the book by Manning, Raghavan, and Schütze [6].

The structure of the paper is as follows: Section 2 describes the research design and methodology, Section 3 gives the experimental results, and Section 4 presents the conclusions of the study.

DOI: 10.5121/ijnlc.2013.2104

2. RESEARCH METHODOLOGY

This section describes the specific research questions and the methodology used for the study.

2.1. Research Questions

The present study investigates the following research questions:

1) Is there any change in the coverage (total number of documents found) of the results retrieved by the Google search engine in response to two semantically identical but different forms of a query?

Here the objective is to check the difference in the number of documents retrieved in response to the two forms of a query. The Google search engine reports the total number of results found for a query. Since a searcher may look for information in any of the documents, it is important to know whether the coverage of the two results is the same. The null and alternative hypotheses are as follows:

Null hypothesis: There is no difference in the coverage.
Alternative hypothesis: The coverage of the two results is significantly different.

2) Are the first few documents (5 or 10) the same in the two results retrieved by the Google search engine in response to two semantically identical but different forms of a query?

Studies show that approximately 80% of web searchers never view more than the first 10 documents in the result list [3,4]. Based on this evidence of web searcher behaviour, we used only the first 5 and first 10 documents in the result of each query. For each sample query pair, we counted the number of documents common to the two results. Assuming that the first five and first ten documents are the same in the two results, the population mean can be taken as five and ten respectively. The null and alternative hypotheses are as follows:

Null hypothesis: The first 5 and first 10 documents are the same in the two results, that is, the sample mean is equal to the population mean.
Alternative hypothesis: The first 5 and first 10 documents are significantly different in the two results, that is, the sample mean is significantly different from the population mean.

We choose the 5% level of significance for inference.

2.2. Methodology

For the first problem, we use the paired t-test, as it can be assumed that the differences between the paired observations are normally distributed. Let $D_i$ denote the difference of the two observations of the $i$-th pair. Under the null hypothesis $H_0$ that there is no significant difference between the two observations, the paired t-test with $n-1$ degrees of freedom uses the test statistic

$$ t = \frac{\bar{D}}{S/\sqrt{n}} \qquad (1) $$

where $\bar{D} = \frac{1}{n}\sum_{i=1}^{n} D_i$, $S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (D_i - \bar{D})^2$, and $n$ is the number of observations. For the first problem, the Google search engine shows the number of documents retrieved in response to a query. Let $x_i$ and $y_i$ be the numbers of documents retrieved for the two forms of the $i$-th query. In this case $D_i$ is the difference of $x_i$ and $y_i$.

For the second problem, let $\bar{x}$ be the mean of a sample of size $n$, $\mu$ the population mean, and $S^2$ the unbiased estimate of the population variance. To test the null hypothesis that the sample comes from a population with mean $\mu$, Student's t-test with $n-1$ degrees of freedom is defined by the statistic

$$ t = \frac{\bar{x} - \mu}{S/\sqrt{n}} \qquad (2) $$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$.

3. EXPERIMENTAL RESULTS

Fifteen pairs of queries were framed on a general basis (see Appendix A). The queries were submitted to the search engine from 10th May 2012 to 19th May 2012. The results of every pair of queries were noted. For each query, it was observed that the retrieved documents were not all the same in the two forms, and that the order of the common retrieved documents also differed between the two results. Table 1 shows the coverage of documents in the two forms of a query. Table 2 shows the number of common documents in the first five and first ten results respectively.
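The source does not say how the common documents behind Table 2 were counted. A minimal sketch, assuming each result list is an ordered list of document identifiers and that order within the first k results is ignored, might look like this. The function name `common_in_top_k` and the identifiers `doc1`, `doc2`, ... are hypothetical and used only for illustration.

```python
def common_in_top_k(results_a, results_b, k):
    """Count documents that occur in the first k entries of both ranked lists."""
    return len(set(results_a[:k]) & set(results_b[:k]))


# Hypothetical ranked results for the two forms of one query (not real data).
form_a = ["doc1", "doc2", "doc3", "doc4", "doc5",
          "doc6", "doc7", "doc8", "doc9", "doc10"]
form_b = ["doc2", "doc11", "doc1", "doc12", "doc13",
          "doc3", "doc14", "doc15", "doc4", "doc16"]

d5 = common_in_top_k(form_a, form_b, 5)    # overlap within the first five
d10 = common_in_top_k(form_a, form_b, 10)  # overlap within the first ten
print(d5, d10)
```

Using sets means a document counts as common even when it appears at different ranks in the two lists, which matches the observation that the order of common documents differed.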

For the data given in Table 1, the paired t-test was applied; the calculated value of the t statistic is 0.385, which is less than the tabulated value 1.76 for 14 degrees of freedom. Thus the null hypothesis is accepted at the 5% significance level, that is, there is no significant difference between the coverage of the two results.

Table 1. Number of documents retrieved in two forms of a query

Query pair no.    x_i               y_i
Q1                83,000,000        20,000,000
Q2                67,00,000         372,000,000
Q3                34,000,000        42,400,000
Q4                ,080,000,000      2,450,000,000
Q5                7,00,000          224,000,000
Q6                36,800,000        37,000,000
Q7                575,000,000       405,000,000
Q8                22,400,000        20,500,000
Q9                227,000           74,000
Q10               5,000,000         4,600,000
Q11               75,600,000        75,700,000
Q12               9,700,000         ,200,000
Q13               5,00,000          9,600,000
Q14               ,400,000          8,680,000
Q15               ,400,000,000      758,000,000

Table 2. Number of common documents in the first five (D5) and first ten (D10) retrieved documents

Query pair no.    D5    D10
Q1                3     3
Q2                2     4
Q3                4     5
Q4                2     2
Q5                3     7
Q6                3     6
Q7                4     5
Q8                4     6
Q9                3     8
Q10               2     3
Q11               3     4
Q12               2     5
Q13               4     7
Q14               4     8
Q15               4     5
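The two statistics described in Section 2.2 can be sketched in Python with the standard library alone. This is an illustration under stated assumptions, not the authors' actual computation: the function names `paired_t` and `one_sample_t` are hypothetical, the sample standard deviation uses the n-1 divisor, and the paired example uses made-up counts. The D5 and D10 values are copied from Table 2. The figures this sketch yields need not agree to the last digit with the reported ones, since the source does not show its intermediate arithmetic.

```python
import math
import statistics


def paired_t(x, y):
    """Paired t statistic: mean difference over its standard error (n-1 d.f.)."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(n))


def one_sample_t(sample, mu):
    """One-sample t statistic against a hypothesised population mean mu."""
    n = len(sample)
    return (statistics.mean(sample) - mu) / (statistics.stdev(sample) / math.sqrt(n))


# Common-document counts copied from Table 2.
d5 = [3, 2, 4, 2, 3, 3, 4, 4, 3, 2, 3, 2, 4, 4, 4]
d10 = [3, 4, 5, 2, 7, 6, 5, 6, 8, 3, 4, 5, 7, 8, 5]

# Under the null hypothesis the population means are 5 and 10 respectively.
t5 = one_sample_t(d5, 5)
t10 = one_sample_t(d10, 10)
print(f"t for first five: {t5:.2f}, t for first ten: {t10:.2f}")

# Hypothetical result counts for two query forms (illustration only).
x = [120, 95, 300, 410, 56]
y = [110, 99, 280, 430, 50]
t_paired = paired_t(x, y)
print(f"paired t on hypothetical counts: {t_paired:.2f}")
```

The magnitude of each one-sample statistic is what gets compared with the tabulated value for 14 degrees of freedom; the sign is negative only because the observed means fall below the hypothesised 5 and 10.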

For the data given in column 2 of Table 2, we applied the t-test for the sample mean; the calculated value of the t statistic is 8.37, which is greater than the tabulated value 1.76 for 14 degrees of freedom. Thus the null hypothesis is rejected at the 5% significance level, that is, there is a significant difference between the sample mean and the population mean. Thus, the first five documents in the two results are significantly different.

For the data given in column 3 of Table 2, we again applied the t-test for the sample mean; the calculated value of the t statistic is 9.86, which is greater than the tabulated value 1.76 for 14 degrees of freedom. Thus the null hypothesis is rejected at the 5% significance level, that is, there is a significant difference between the sample mean and the population mean. Thus, the first ten documents in the two results are significantly different.

4. CONCLUSIONS

The experiment on Google search results was performed to check the ability of the search engine to respond to a pair of semantically identical but structurally different queries. In this work, we checked whether a typical user gets the same results for a query asked in two different ways. According to our experiment, there is no significant difference between the coverage of the two results, which shows that the search engine returns almost the same number of results for a query asked in either form, but the first five and first ten results of the two queries are significantly different. Previous research has observed that most users check only the first page; hence it can be concluded that a typical user does not get the same results for a query when it is asked in different ways. To get the best results, one should rephrase one's query in every possible way, because every modification provides a chance to get new results. The finding also indicates the inability of the search engine to provide results based on the semantic structure of a sentence, which can open a new dimension for researchers in this field.

REFERENCES

[1] Cohen, L. B., (2005) "A query-based approach in web search instruction: An assessment of current practice", Research Strategies, Vol. 20, pp 442-457.
[2] Eastman, C. M. and Jansen, B. J., (2003) "Coverage, Relevance, and Ranking: The impact of query operators on web search engine results", ACM Transactions on Information Systems, Vol. 21(4), pp 383-411.
[3] Hölscher, C. and Strube, G., (2000) "Web search behavior of Internet experts and newbies", International Journal of Computer and Telecommunications Networking, Vol. 33(1-6), pp 337-346.
[4] Jansen, B. J., Spink, A. and Saracevic, T., (2000) "Real life, real users, and real needs: A study and analysis of user queries on the Web", Information Processing and Management, Vol. 36(2), pp 207-227.
[5] Karlgren, J., Sahlgren, M. and Cöster, R., (2006) "Weighting Query Terms Based on Distributional Statistics", Lecture Notes in Computer Science, Vol. 4022, pp 208-211.
[6] Manning, C. D., Raghavan, P. and Schütze, H., (2008) Introduction to Information Retrieval, Cambridge University Press, Cambridge, New York.
[7] Spink, A., Wolfram, D., Jansen, M. B. J. and Saracevic, T., (2001) "Searching the web: The public and their queries", Journal of the American Society for Information Science and Technology, Vol. 52(3), pp 226-234.
[8] Topi, H. and Lucas, W., (2005a) "Searching the Web: operator assistance required", Information Processing and Management, Vol. 41(2), pp 383-403.
[9] Topi, H. and Lucas, W., (2005b) "Mix and match: combining terms and operators for successful Web searches", Information Processing and Management, Vol. 41(4), pp 801-817.
[10] Vechtomova, O. and Karamuftuoglu, M., (2008) "Lexical cohesion and term proximity in document ranking", Information Processing and Management, Vol. 44(4), pp 1485-1502.

Appendix A. List of pairs of queries

Q.1 Indian Economy / Economy of India
Q.2 Car Accident / Accident of car
Q.3 Diabetes Diet / Diet for Diabetes
Q.4 Office Management / Management in Office
Q.5 Finance Project Report / Project Report on finance
Q.6 Kids fun games / Fun games for kids
Q.7 Statistics Books / Books on Statistics
Q.8 Income tax return filing procedure / Procedure for income tax return filing
Q.9 Kumaon Himalayas / Himalayas of Kumaon
Q.10 Human behaviour Analysis / Analysis of human behaviour
Q.11 Wildlife survey / Survey on wildlife
Q.12 Ancient Indian History / History of Ancient India
Q.13 Moral Values stories / Stories on moral values
Q.14 Financial sector reforms in India / Reforms in financial sector in India
Q.15 Health care policy issues / Policy issues in health care

Authors

Raj Kishor Bisht received his M.Sc. and Ph.D. in Mathematics from Kumaun University, Nainital (Uttarakhand), India. He also qualified the National Eligibility Test conducted by the Council of Scientific and Industrial Research, India. He is currently working as an Associate Professor in the Department of Computer Science & Applications, Amrapali Institute of Management and Computer Applications, Haldwani (Uttarakhand), India. His research interests include mathematical models in NLP, formal languages and automata theory.

Ila Pant Bisht received her M.Sc. and Ph.D. in Statistics from Kumaun University, Nainital (Uttarakhand), India. She is currently working as a Statistical Officer in the Department of Economics & Statistics, Govt. of Uttarakhand, Divisional Office, Haldwani (Uttarakhand), India. Her research interests include sampling, operations research and applied statistics.