Recent Researches in Ranking Bias of Modern Search Engines


DEWS2007 C7-7

Yu HIRATE, Yasuaki YOSHIDA, and Hayato YAMANA

Media Network Center, Waseda University, Totsuka-cho, Shinjuku-ku, Tokyo, Japan
Science and Engineering, Waseda University, Okubo, Shinjuku-ku, Tokyo, Japan
National Institute of Informatics, Hitotsubashi, Chiyoda-ku, Tokyo, Japan

E-mail: {hirate,yoshida,yamana}@yama.info.waseda.ac.jp

Abstract  Modern web search engines are now recognized as fundamental tools for retrieving information from the web. This means that the information we are able to obtain may depend on search engine rankings. If search engines return unfair rankings, a question arises: is the information that we obtain from the web biased by those rankings? From this point of view, many researchers and politicians have investigated search engine rankings. We define such unfairness as ranking bias and survey recent research on it. In this paper, we report this research in two categories: comparison of index coverage and analysis of ranking algorithms. We also report research that compares the result sets or rankings of search engines.

Key words  Search Engine, ranking, bias, index, ranking algorithm

[1] Web [2] Web Web [3] [4] Web

2 [5] CNET Olsen 2002 Google [6] 2005 Dogpile.com [7] Google [8], Yahoo! [9], MSN [10], Ask Jeeves [11] % [12] 2 CS CS 2 rich-get-richer Googlearchy , rich-get-richer [13] rich-get-richer [13] Top10 Top10 Top10 Top10 Top10 Top10 [14] 2. 2 rich-get-richer UCLA Cho [13] Web rich-get-richer Web Open Directory [15] PageRank [16] PageRank 4 2 PageRank


3 3. 1 ( [13] ) 2 PageRank ( [13] ) PageRank PageRank PageRank rich-get-richer 1 PageRank PageRank 2 rich-get-richer 2. 3 Googlearchy [17] [18] rich-and-richer Web [17] [19] [20] Google Googlearchy [17] [18] [13] Googlocracy [18] PageRank [16] PageRank PageRank 2.1 rich-and-richer Web PageRank PageRank 3. 3 Web Web [21] [22] [23] [24]


3.2 [13] [14] [31] [32] [35] [36] [37]

98 Bharat [21], 99 Henzinger [22], 2004 Vaughan [23], 2006 Bar-Yossef [24]

DEC Bharat 4 97 [21] AltaVista [25], Excite [26], HotBot [27], Infoseek 5 [28] [21]

Compaq Henzinger 6 98 [22] Web AltaVista [25], HotBot [27], Excite [26], Infoseek [28], Google [8], Lycos [29]

Western Ontario Vaughan [23] Google [8] AltaVista [25] AlltheWeb [30] Web Google Google

5 Go.com Yahoo! 6 Google 7 Google site: AltaVista host: AlltheWeb

3 4 ( [23] )

Table 1: Google, MSN, Yahoo! [24]

                  Page from
    Indexed by    Google    MSN    Yahoo!
    Google          -       46%    45%
    MSN            55%       -     51%
    Yahoo!         44%      22%     -

4 Google, MSN, Yahoo! [24] AltaVista

Technion Bar-Yossef 2006 [24] Google [8] Yahoo! [9] MSN [10] [24] Query-Pool Query-Pool URL Yahoo! [24] PageRank [13] [14] [31] [32] V R P
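The percentages reported from [24] pair a source engine ("Page from") with a second engine ("Indexed by"). A minimal sketch of how such a ratio can be computed, assuming a sample of URLs drawn from one engine and a membership test against another; the function name `relative_overlap`, the toy URLs, and the set-based index are all hypothetical and are not the sampling procedure of [24]:

```python
from typing import Callable, Iterable


def relative_overlap(sample: Iterable[str], is_indexed: Callable[[str], bool]) -> float:
    """Fraction of URLs sampled from one engine that a second engine also indexes."""
    urls = list(sample)
    if not urls:
        raise ValueError("sample must not be empty")
    hits = sum(1 for url in urls if is_indexed(url))
    return hits / len(urls)


# Toy data (assumption): four pages sampled from engine A, and engine B's index as a set.
sample_from_a = ["u1", "u2", "u3", "u4"]
index_b = {"u2", "u4", "u9"}

overlap_a_in_b = relative_overlap(sample_from_a, index_b.__contains__)
print(f"{overlap_a_in_b:.0%} of the pages sampled from A are indexed by B")
```

The ratio is not symmetric: swapping the roles of the two engines generally gives a different value, which is why each ordered pair of engines has its own percentage.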

[13] [31] [14] [32] 2.1 rich-get-richer rich-get-richer [13] [31]

UCLA Cho [13] 60 [31] t p V(p, t) t p P(p, t) [13] (1) t

    V(p, t) \propto P(p, t)    (1)

P(p, t) t P(p, t) 3: (0 \le t < 13), (13 \le t < 25), (t \ge 25) Q(p)

8 [14] Q(p) (0 \le Q(p) \le 1) 1

[Figure 5 (from [14])]
[Figure 6 (from [14])]

    V(p, t) \propto P(p, t)^{2.25}    (2)

[4] V(p, t) R(p, t)

    V(p, t) \propto R(p, t)^{-1.5}    (3)

Stanford WebBase [33] [34] R(p, t) P(p, t)

    R(p, t) \propto P(p, t)^{-1.5}    (4)

P(p, t) [13] [31]

Indiana Fortunato 2006 [31] 2 2 [13] V(p, t) \propto P(p, t)^{2.25}
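The exponents 2.25 and 1.5 attached to relations (2), (3), and (4) fit together by substitution. As a check, and assuming that (3) and (4) are power laws with negative exponents (rank 1 is the top position, so visits fall as the rank number grows, and the rank number falls as popularity grows), substituting (4) into (3) yields (2):

```latex
\begin{align*}
V(p,t) &\propto R(p,t)^{-1.5}, \qquad R(p,t) \propto P(p,t)^{-1.5} \\
V(p,t) &\propto \left( P(p,t)^{-1.5} \right)^{-1.5} = P(p,t)^{(-1.5)(-1.5)} = P(p,t)^{2.25}
\end{align*}
```

Compared with the linear relation (1), the superlinear exponent 2.25 is what makes already popular pages gain visits disproportionately, which is the rich-get-richer effect discussed for [13].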

7. 1 [35] [36] Mowshowitz 9 [35] [36] [35] [36] [35] [36]

[Figure 7 (from [31])]

[31] V(p, t) P(p, t) 1.8 V(p, t) P(p, t) 7 7 [31] [13] [31] V(p, t) P(p, t) Googlearchy Googlocracy

6. 2 [14] UCLA Cho [14] PageRank [16] 3.2 PageRank [14] Q(p) p Q(p) P(p, t)

6. 3 [32] CMU Pandey [32] Randomized Rank Promotion Randomized Rank Promotion 5 P(p, t) [32]

7. [35] [36] [37] [35] [36] [37]

For a search engine E, B(E) denotes its bias, and Sim(E) denotes the similarity between E and a collection C:

    B(E) = 1 - \mathrm{Sim}(E)    (5)

For example, for engines E_1, E_2, E_3 and a query q, let the returned URLs be

    E_1: (u_1, u_2, u_3)
    E_2: (u_4, u_3, u_5)
    E_3: (u_3, u_1, u_4)

The corresponding vectors over (u_1, ..., u_5) and their sum S are

    S_1 = (1, 1, 1, 0, 0), S_2 = (0, 0, 1, 1, 1), S_3 = (1, 0, 1, 1, 0)

    S = (2, 1, 3, 2, 1)    (6)

Sim(E_i) is the cosine similarity between S and S_i, so B(E_1), B(E_2), B(E_3) for E_1, E_2, E_3 are

    B(E_1) = 1 - \frac{2 + 1 + 3}{\{3 (2^2 + 1^2 + 3^2 + 2^2 + 1^2)\}^{0.5}} \approx 0.205    (7)

    B(E_2) = 1 - \frac{3 + 2 + 1}{\{3 (2^2 + 1^2 + 3^2 + 2^2 + 1^2)\}^{0.5}} \approx 0.205    (8)

    B(E_3) = 1 - \frac{2 + 3 + 2}{\{3 (2^2 + 1^2 + 3^2 + 2^2 + 1^2)\}^{0.5}} \approx 0.073    (9)

[36] About, Ah-ha 10, AltaVista [25], AOL Search, FastSearch, Google [8], LookSmart, Lycos [29], MSN [10], Netscape, Overture, Sprinks, Teoma, TrueSearch, WiseNut, Yahoo! [9] 16 [38] [39] 5 25 [38] [35] [36]

10 Enhance Interactive
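The worked example with E 1, E 2, E 3 can be reproduced numerically. The following is a minimal sketch, assuming that Sim(E i) is the cosine similarity between the summed vector S and the engine's own vector S i, as the {3 ( )} 0.5 denominators in (7) to (9) suggest; the function name `bias_scores` and the dictionary layout are hypothetical, not taken from [35] [36]:

```python
import math
from typing import Dict, List, Sequence


def bias_scores(results: Dict[str, Sequence[str]]) -> Dict[str, float]:
    """Return B(E) = 1 - Sim(E) for each engine, with Sim as cosine similarity
    between the engine's 0/1 URL vector and the sum of all engines' vectors.
    Assumes every engine returns at least one URL."""
    # Universe of URLs, in first-seen order.
    urls: List[str] = []
    for ranked in results.values():
        for url in ranked:
            if url not in urls:
                urls.append(url)

    # One 0/1 vector per engine, plus the summed vector S.
    vectors = {
        engine: [1 if url in ranked else 0 for url in urls]
        for engine, ranked in results.items()
    }
    total = [sum(column) for column in zip(*vectors.values())]

    def cosine(a: Sequence[float], b: Sequence[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    return {engine: 1.0 - cosine(total, vec) for engine, vec in vectors.items()}


example = {
    "E1": ["u1", "u2", "u3"],
    "E2": ["u4", "u3", "u5"],
    "E3": ["u3", "u1", "u4"],
}
scores = bias_scores(example)
for engine, bias in scores.items():
    print(engine, round(bias, 3))
```

Under these assumptions E 3 gets the lowest bias, because all three of its URLs are also returned by at least one other engine, whereas E 1 and E 2 each return one URL that no other engine returns.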


8 [38] [36] 10 DNA evidence Google 10 [37] 11 Bondi beach Yahoo! 10 [37] 9 [39] [36]

8 [39] AOL Search, Google, Netscape, Yahoo! Ah-ha, MSN, Sprinks, Teoma, TrueSearch [36]

Bar-Ilan Bar-Ilan [37] [37] Google, Yahoo!, Teoma; Google Image, Yahoo! Image, Picsearch [40] 2 13 Spearman's footrule [41] [42], Fagin's measure [43], M measure [37] 4

11 Teoma Ask.com [11]
12 US elections 2004, organic food, DNA evidence
13 Twin towers, Bondi beach

Google DNA evidence 11 Yahoo! Bondi beach 10 Google DNA evidence [37] 11 Yahoo! Bondi beach 10 Google Google Yahoo! [37] Picsearch [40]
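Of the measures named above, Spearman's footrule [41] is the simplest to state: it sums, over all items, the absolute difference between an item's positions in the two rankings. A minimal sketch follows, assuming both rankings contain exactly the same items; search engine result lists usually share only some URLs, so a real comparison needs a rule for non-shared items, which this sketch leaves out. The function name `spearman_footrule` and the normalization by the maximum distance are illustrative choices, not the procedure of [37]:

```python
from typing import Sequence


def spearman_footrule(a: Sequence[str], b: Sequence[str]) -> float:
    """Normalized Spearman's footrule distance between two rankings of the same items.
    0.0 means identical order; 1.0 means maximally different (for example, reversed)."""
    if len(set(a)) != len(a) or len(set(b)) != len(b) or set(a) != set(b):
        raise ValueError("rankings must be duplicate-free orderings of the same items")
    n = len(a)
    if n < 2:
        return 0.0
    position_in_b = {item: index for index, item in enumerate(b)}
    distance = sum(abs(index - position_in_b[item]) for index, item in enumerate(a))
    # The maximum footrule distance over n items is floor(n^2 / 2).
    return distance / (n * n // 2)


ranking_x = ["a", "b", "c", "d"]
ranking_y = ["b", "a", "c", "d"]
print(spearman_footrule(ranking_x, ranking_y))
```

Normalizing by the maximum makes values comparable across list lengths, at the cost of hiding the raw displacement count.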

API

[1] 2006, R D.
[2] Y. Hirate, S. Kato, and H. Yamana, Web Structure in, Proc. of WAW2006.
[3] T. Joachims, Optimizing Search Engines using Clickthrough Data, Proc. of ACM SIGKDD2002.
[4] R. Lempel and S. Moran, Predictive Caching and Prefetching of Query Results in Search Engines, Proc. of WWW2003.
[5]
[6] S. Olsen, Does search engine's power threaten Web's independence?
[7] Dogpile Web Search Home Page.
[8] Google.
[9] Yahoo!.
[10] MSN.
[11] Ask.com Search Engine.
[12] Dogpile.com, Different Engines, Different Results.
[13] J. Cho and S. Roy, Impact of Search Engines on Page Popularity, Proc. of WWW2004.
[14] J. Cho, S. Roy, and R. E. Adams, Page Quality: In Search of an Unbiased Web Ranking, Proc. of ACM SIGMOD2005.
[15] Open Directory.
[16] L. Page, S. Brin, R. Motwani, and T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, Technical Report, Stanford University.
[17] M. Hindman, K. Tsioutsiouliklis, and J. A. Johnson, Googlearchy: How a Few Heavily-Linked Sites Dominate Politics on the Web, Annual Meeting of the Midwest Political Science Association.
[18] F. Menczer, S. Fortunato, A. Flammini, and A. Vespignani, Googlearchy or Googlocracy?, IEEE Spectrum.
[19] R. Albert, A.-L. Barabasi, and H. Jeong, Diameter of the World Wide Web, Nature, Vol. 401(6749).
[20] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener, Graph Structure in the Web, The International Journal of Computer and Telecommunication Networking, Vol. 33.
[21] K. Bharat and A. Broder, A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines, Computer Networks and ISDN Systems, Vol. 30, Issue 1-7.
[22] M. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork, Measuring Index Quality using Random Walks on the Web, Proc. of WWW1999.
[23] L. Vaughan and M. Thelwall, Search Engine Coverage Bias: Evidence and Possible Causes, Information Processing and Management, Vol. 40, No. 4.
[24] Z. Bar-Yossef and M. Gurevich, Random Sampling from a Search Engine's Index, Proc. of WWW2006.
[25] AltaVista.
[26] Excite.
[27] HotBot.
[28] Go.com.
[29] Lycos.
[30] AlltheWeb.
[31] S. Fortunato, A. Flammini, F. Menczer, and A. Vespignani, The Egalitarian Effect of Search Engines, eprint arXiv:cs.cy/
[32] S. Pandey, S. Roy, C. Olston, J. Cho, and S. Chakrabarti, Shuffling a Stacked Deck: The Case for Partially Randomized Ranking of Search Engine Results, Proc. of VLDB2005.
[33] H. Hirai, S. Raghavan, H. Garcia-Molina, and A. Paepcke, WebBase: A repository of the Web, Proc. of WWW2000.
[34] The Stanford WebBase Project, testbed/doc2/webbase/
[35] A. Mowshowitz and A. Kawaguchi, Bias on the Web, Communications of the ACM, Vol. 45, No. 9.
[36] A. Mowshowitz and A. Kawaguchi, Measuring Search Engine Bias, Information Processing and Management, Vol. 41.
[37] J. Bar-Ilan, M. Mat-Hassan, and M. Levene, Methods for Comparing Rankings of Search Engine Results, eprint arXiv:cs.ir/050539.
[38] West Publishing, West's analysis of American law, Eagan, MN: Thomson-West.
[39] Library of Congress Classification System.
[40] Picsearch - Image Search for Pictures and Images.
[41] P. Diaconis and R. L. Graham, Spearman's Footrule as a Measure of Disarray, Journal of the Royal Statistical Society, Series B (Methodological), Vol. 39.
[42] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar, Rank Aggregation Methods for the Web, Proc. of WWW2001.
[43] R. Fagin, R. Kumar, and D. Sivakumar, Comparing Top k Lists, SIAM Journal on Discrete Mathematics, Vol. 17, No. 1, 2003.
