International Journal on Natural Language Computing (IJNLC) Vol. 2, No. 1, February 2013
DOI: 10.5121/ijnlc.2013.2104

EFFECT OF QUERY FORMATION ON WEB SEARCH ENGINE RESULTS

Raj Kishor Bisht¹ and Ila Pant Bisht²

¹ Department of Computer Science & Applications, Amrapali Institute, Haldwani (Uttarakhand), India
bishtrk@gmail.com
² Dept. of Economics & Statistics, Govt. of Uttarakhand, Divisional Office, Haldwani (Uttarakhand), India
Pat_ila@rediffmail.com

ABSTRACT

Queries submitted to a search engine are generally expressed in natural language. A query can be phrased in more than one way without changing its meaning, since the phrasing depends on how a person happens to think at a particular moment. The searcher's aim is to get the most relevant results irrespective of how the query is expressed. In the present paper, we examine search engine results for changes in coverage and in the similarity of the first few results when a query is entered in two semantically identical but differently worded forms. Searching was carried out with the Google search engine. Fifteen pairs of queries were chosen for the study. The t-test was used, and the results were compared on the basis of the total number of documents found and the similarity of the first five and first ten documents returned for a query entered in the two forms. We found that the total coverage is the same but the first few results are significantly different.

KEYWORDS

Search engine, Google, query, rank, t-test

1. INTRODUCTION

A web query is a single word or a set of words that a searcher enters into a web search engine to get information matching his or her requirement. Web search queries are unstructured and differ from standard query languages. An ordinary searcher enters a query into a web search engine in his or her own way of communicating. For example, to learn about the economy of India, either of the two queries "Economy of India" and "India Economy" can be entered. Both queries are semantically the same, but their syntax differs slightly.

As far as keywords are concerned, after removing stop words and stemming, both queries have the same content words, "India" and "Economy". The searcher expects the same results in both cases, since the queries are semantically the same and contain the same content words. In general, however, the search engine does not return the same results for a query entered in two different forms, although some documents are common to the two result lists. In this paper, we study the effect of query formation on web search engine results in terms of the coverage of documents and the similarity of the first five and first ten documents. We selected the Google search engine for our experiment because of its popularity.

Many researchers have investigated the behaviour of web search results and the effect of query formation on them. Some interesting characteristics of web search have been shown [7] by analyzing queries from the Excite search engine: the average length of a search query was 2.4 terms; about half of the users entered a single query, while a little less than a third entered three or more unique queries; close to half of the users examined only the first one or two pages of results (10 results per page); and less than 5% of users used advanced search features (e.g., Boolean operators such as AND, OR, and NOT). One study shows that librarians may not routinely be teaching queries as a strategy for selecting and using search tools on the Web [1]. Karlgren, Sahlgren and Cöster [5] investigated topical dependencies between query terms by analyzing the distributional character of query terms. Topi and Lucas [8] examined the effects of the search interface and Boolean logic training on user search performance and satisfaction. Topi and Lucas [9] presented a detailed analysis of the structure and components of queries written by experimental participants in a study that manipulated two factors found to affect end-user information retrieval performance: training in Boolean logic and the type of search interface. Vechtomova and Karamuftuoglu [10] demonstrated effective new methods of document ranking based on lexical cohesive relationships between query terms. Eastman and Jansen [2] analyzed the impact of query operators on web search engine results. Details of information retrieval technology can be found in the book by Manning, Raghavan, and Schütze [6].

The structure of the paper is as follows: Section 2 describes the research design and methodology. Section 3 gives the experimental results, and Section 4 presents the conclusions of the study.

2. RESEARCH METHODOLOGY

This section describes the specific research questions and the methodology used for the study.

2.1. Research Question

The present study investigates the following research questions:

1) Is there any change in the coverage (total number of documents found) of the results retrieved by the Google search engine in response to two semantically identical but differently worded forms of a query?

Here the objective is to check the difference in the number of documents retrieved in response to the two forms of a query. The Google search engine reports the total number of results found for a query. Since a searcher may look for information in any of the documents, it is important to know whether the coverage of the two results is the same. The null and alternative hypotheses are as follows:

Null hypothesis: There is no difference in the coverage.
Alternative hypothesis: The coverage of the two results is significantly different.

2) Are the first few documents (5 or 10) the same in the two results retrieved by the Google search engine in response to two semantically identical but differently worded forms of a query?

Studies show that approximately 80% of web searchers never view more than the first 10 documents in the result list [3,4]. Based on this evidence of web searcher behaviour, we used only the first 5 and first 10 documents in the result of each query. We counted the number of documents common to the two results for each sample query pair. Assuming that the first five and first ten documents are the same in the two results, the population mean can be taken as five and ten respectively. The null and alternative hypotheses are as follows:
Null hypothesis: The first 5 and first 10 documents are the same in the two results, that is, the sample mean is equal to the population mean.
Alternative hypothesis: The first 5 and first 10 documents are significantly different in the two results, that is, the sample mean is significantly different from the population mean.

We choose the 5% level of significance for inference.

2.2. Methodology

For the first problem, we use the paired t-test, as the differences between the paired observations can be assumed to be normally distributed. Let $D_i$ denote the difference between the two observations of the $i$-th pair. Under the null hypothesis $H_0$ that there is no significant difference between the two observations, the paired t-test with $n-1$ degrees of freedom uses the test statistic

$$t = \frac{\bar{D}}{S/\sqrt{n}} \qquad (1)$$

where $\bar{D} = \frac{1}{n}\sum_{i=1}^{n} D_i$, $S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (D_i - \bar{D})^2$ and $n$ is the number of observations taken. For the first problem, the Google search engine shows the number of documents retrieved in response to a query. Let $x_i$ and $y_i$ be the numbers of documents retrieved for the two forms of the $i$-th query. In this case $D_i$ is the difference of $x_i$ and $y_i$.

For the second problem, let $\bar{x}$ be the mean of a sample of size $n$, $\mu$ the population mean, and $S^2$ the unbiased estimate of the population variance. To test the null hypothesis that the sample comes from a population with mean $\mu$, Student's t-test with $n-1$ degrees of freedom is defined by the statistic

$$t = \frac{\bar{x} - \mu}{S/\sqrt{n}} \qquad (2)$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$.

3. EXPERIMENTAL RESULTS

Fifteen pairs of queries were framed on a general basis (see Appendix A). The queries were submitted to the search engine from 10th May 2012 to 19th May 2012, and the results of every query pair were noted. For each query pair, the retrieved documents were not all the same for the two forms, and the order of the common retrieved documents also differed between the two results. Table 1 shows the coverage of documents for the two forms of each query. Table 2 shows the number of common documents in the first five and first ten results respectively.
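The two statistics in equations (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' own computation; the function names `one_sample_t` and `paired_t` are hypothetical, and the sketch assumes the unbiased (n - 1) variance estimate described in the methodology.

```python
import math
from statistics import mean, stdev


def one_sample_t(sample, mu):
    """t statistic of equation (2): (mean - mu) / (S / sqrt(n)).

    S is the sample standard deviation with the n - 1 divisor
    (statistics.stdev). The statistic has n - 1 degrees of freedom.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    s = stdev(sample)
    if s == 0:
        raise ValueError("all observations are identical; t is undefined")
    return (mean(sample) - mu) / (s / math.sqrt(n))


def paired_t(x, y):
    """t statistic of equation (1) for paired samples x and y.

    Equivalent to a one-sample test of the differences D_i = x_i - y_i
    against a hypothesised mean difference of zero.
    """
    if len(x) != len(y):
        raise ValueError("paired samples must have equal length")
    differences = [a - b for a, b in zip(x, y)]
    return one_sample_t(differences, 0.0)


if __name__ == "__main__":
    # Toy numbers for illustration only, not data from the study.
    print(paired_t([10, 12, 9, 14], [8, 11, 10, 10]))
    print(one_sample_t([3, 4, 5], 5))
```

The absolute value of the statistic is then compared with the tabulated critical value for n - 1 degrees of freedom at the chosen significance level.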
For the data given in Table 1, the paired t-test was applied. The calculated value of the t statistic is 0.385, which is less than the tabulated value 1.76 for 14 degrees of freedom. Thus the null hypothesis cannot be rejected at the 5% significance level, that is, there is no significant difference between the coverage of the two results.

Table 1. Number of documents retrieved in two forms of a query

Query pair no.    x_i               y_i
Q1                83,000,000        20,000,000
Q2                67,00,000         372,000,000
Q3                34,000,000        42,400,000
Q4                ,080,000,000      2,450,000,000
Q5                7,00,000          224,000,000
Q6                36,800,000        37,000,000
Q7                575,000,000       405,000,000
Q8                22,400,000        20,500,000
Q9                227,000           74,000
Q10               5,000,000         4,600,000
Q11               75,600,000        75,700,000
Q12               9,700,000         ,200,000
Q13               5,00,000          9,600,000
Q14               ,400,000          8,680,000
Q15               ,400,000,000      758,000,000

Table 2. Number of common documents in the first five (D_5) and first ten (D_10) retrieved documents

Query pair no.    D_5    D_10
Q1                3      3
Q2                2      4
Q3                4      5
Q4                2      2
Q5                3      7
Q6                3      6
Q7                4      5
Q8                4      6
Q9                3      8
Q10               2      3
Q11               3      4
Q12               2      5
Q13               4      7
Q14               4      8
Q15               4      5
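The counts D_5 and D_10 in Table 2 are overlaps between the top-ranked documents of two result lists. A minimal sketch of that count is given below, assuming each result list is an ordered list of document identifiers such as URLs. The function name `common_in_top_k` and the sample identifiers are hypothetical and are not taken from the study.

```python
def common_in_top_k(results_a, results_b, k):
    """Count documents that appear in the first k entries of both lists.

    Rank order inside the top k is ignored; only membership matters.
    """
    return len(set(results_a[:k]) & set(results_b[:k]))


if __name__ == "__main__":
    # Hypothetical ranked results for the two forms of one query.
    form_1 = ["d1", "d2", "d3", "d4", "d5", "d6"]
    form_2 = ["d3", "d9", "d1", "d8", "d7", "d2"]
    print(common_in_top_k(form_1, form_2, 5))
    print(common_in_top_k(form_1, form_2, 6))
```

Because the count ignores order, two lists that contain the same documents in a different ranking still score the maximum value k.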
For the data given in column 2 of Table 2, we applied the t-test for the sample mean. The calculated value of the t statistic is 8.37, which is greater than the tabulated value 1.76 for 14 degrees of freedom. Thus the null hypothesis is rejected at the 5% significance level, that is, there is a significant difference between the sample mean and the population mean. The first five documents in the two results are therefore significantly different.

For the data given in column 3 of Table 2, we again applied the t-test for the sample mean. The calculated value of the t statistic is 9.86, which is greater than the tabulated value 1.76 for 14 degrees of freedom. Thus the null hypothesis is rejected at the 5% significance level, that is, there is a significant difference between the sample mean and the population mean. The first ten documents in the two results are therefore significantly different.

4. CONCLUSIONS

The experiment on Google search results was performed to check how the search engine responds to a pair of queries that are semantically the same but structurally different. In this work, we checked whether an ordinary user gets the same results for a query asked in two different ways. According to our experiment, there is no significant difference between the coverage of the two results, which shows that the search engine returns almost the same number of results for a query asked in either form, but the first five and first ten results of the two queries are significantly different. Since previous research has observed that most users check only the first page, it can be concluded that an ordinary user does not get the same results for a query when it is asked in different ways. To get the best results, one should rephrase one's query in every possible way, because every modification provides a chance to get new results. The finding also indicates that the search engine is unable to provide results based on the semantic structure of a sentence, which can open a new dimension for researchers in this field.

REFERENCES

[1] Cohen, L. B., (2005) A query-based approach in web search instruction: An assessment of current practice, Research Strategies, Vol. 20, pp 442-457.
[2] Eastman, C. M. and Jansen, B. J., (2003) Coverage, Relevance, and Ranking: The impact of query operators on web search engine results, ACM Transactions on Information Systems, Vol. 21(4), pp 383-411.
[3] Hölscher, C. and Strube, G., (2000) Web search behavior of Internet experts and newbies, International Journal of Computer and Telecommunications Networking, Vol. 33(1-6), pp 337-346.
[4] Jansen, B. J., Spink, A. and Saracevic, T., (2000) Real life, real users, and real needs: A study and analysis of user queries on the Web, Information Processing and Management, Vol. 36(2), pp 207-227.
[5] Karlgren, J., Sahlgren, M. and Cöster, R., (2006) Weighting Query Terms Based on Distributional Statistics, Lecture Notes in Computer Science, Vol. 4022, pp 208-211.
[6] Manning, C. D., Raghavan, P. and Schütze, H., (2008) Introduction to Information Retrieval, Cambridge University Press, Cambridge, New York.
[7] Spink, A., Wolfram, D., Jansen, M. B. J. and Saracevic, T., (2001) Searching the web: The public and their queries, Journal of the American Society for Information Science and Technology, Vol. 52(3), pp 226-234.
[8] Topi, H. and Lucas, W., (2005a) Searching the Web: operator assistance required, Information Processing and Management, Vol. 41(2), pp 383-403.
[9] Topi, H. and Lucas, W., (2005b) Mix and match: combining terms and operators for successful Web searches, Information Processing and Management, Vol. 41(4), pp 801-817.
[10] Vechtomova, O. and Karamuftuoglu, M., (2008) Lexical cohesion and term proximity in document ranking, Information Processing and Management, Vol. 44(4), pp 1485-1502.

Appendix A. List of pairs of Queries

Q.1 India Economy / Economy of India
Q.2 Car Accident / Accident of car
Q.3 Diabetes Diet / Diet for Diabetes
Q.4 Office Management / Management in Office
Q.5 Finance Project Report / Project Report on finance
Q.6 Kids fun games / Fun games for kids
Q.7 Statistics Books / Books on Statistics
Q.8 Income tax return filing procedure / Procedure for income tax return filing
Q.9 Kumaon Himalayas / Himalayas of Kumaon
Q.10 Human behaviour Analysis / Analysis of human behaviour
Q.11 Wildlife survey / Survey on wildlife
Q.12 Ancient India History / History of Ancient India
Q.13 Moral Values stories / Stories on moral values
Q.14 Financial sector reforms in India / Reforms in financial sector in India
Q.15 Health care policy issues / Policy issues in health care

Authors

Raj Kishor Bisht received his M.Sc. and Ph.D. in Mathematics from Kumaun University, Nainital (Uttarakhand), India. He also qualified the National Eligibility Test conducted by the Council of Scientific and Industrial Research, India. He is currently working as an Associate Professor in the Department of Computer Science & Applications, Amrapali Institute of Management and Computer Applications, Haldwani (Uttarakhand), India. His research interests include mathematical models in NLP, formal languages and automata theory.

Ila Pant Bisht received her M.Sc. and Ph.D. in Statistics from Kumaun University, Nainital (Uttarakhand), India. She is currently working as a Statistical Officer in the Department of Economics & Statistics, Govt. of Uttarakhand, Divisional Office, Haldwani (Uttarakhand), India. Her research interests include sampling, operations research and applied statistics.