Search Engines Considered Harmful In Search of an Unbiased Web Ranking

Similar documents
Search Engines Considered Harmful In Search of an Unbiased Web Ranking

Impact of Search Engines on Page Popularity

Impact of Search Engines on Page Popularity

Promotional Ranking of Search Engine Results: Giving New Web Pages a Chance to Prove Their Values

Effective Page Refresh Policies for Web Crawlers

Recent Researches in Ranking Bias of Modern Search Engines

Time-aware Authority Ranking

Nearest Neighbor Classification

RankMass Crawler: A Crawler with High Personalized PageRank Coverage Guarantee

How to Crawl the Web. Hector Garcia-Molina Stanford University. Joint work with Junghoo Cho

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Automatic Identification of User Interest For Personalized Search

Personalized Information Retrieval

A STUDY ON THE EVOLUTION OF THE WEB

arxiv:cs/ v1 [cs.ir] 4 Mar 2005

Hyperplane Ranking in. Simple Genetic Algorithms. D. Whitley, K. Mathias, and L. Pyeatt. Department of Computer Science. Colorado State University

Information Search in Web Archives

Some Interesting Applications of Theory. PageRank Minhashing Locality-Sensitive Hashing

Automatic Identification of User Goals in Web Search [WWW 05]

Empirical risk minimization (ERM) A first model of learning. The excess risk. Getting a uniform guarantee

Delay Injection for. Service Dependency Detection

Downloading Hidden Web Content

Part 1: Link Analysis & Page Rank

by Jimmy's Value World Ashish H Thakkar

Lecture 8: Linkage algorithms and web search

World Wide Web has specific challenges and opportunities

Brief (non-technical) history

From Correlation to Causation: Active Delay Injection for Service Dependency Detection

Myths about Links, Links and More Links:

Lecture 1: Perfect Security


Document version Cohorts

Breadth-First Search Crawling Yields High-Quality Pages

Downloading Textual Hidden Web Content Through Keyword Queries

Learn SEO Copywriting

Distributed Data Store

Evaluation Metrics. (Classifiers) CS229 Section Anand Avati

Administrative. Web crawlers. Web Crawlers and Link Analysis!

Mathematical Analysis of Google PageRank

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

Let be a function. We say, is a plane curve given by the. Let a curve be given by function where is differentiable with continuous.

Introduction to Information Retrieval

Lecture on Modeling Tools for Clustering & Regression

Are people biased in their use of search engines? Keane, Mark T.; O'Brien, Maeve; Smyth, Barry. Communications of the ACM, 51 (2): 49-52

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

CPSC 340: Machine Learning and Data Mining

A Probabilistic Approach to Metasearching with Adaptive Probing

Introduction to Data Mining

Model Selection and Assessment

FUNCTIONS, ALGEBRA, & DATA ANALYSIS CURRICULUM GUIDE Overview and Scope & Sequence

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS

Parallel Crawlers. Junghoo Cho University of California, Los Angeles. Hector Garcia-Molina Stanford University.

1 GOOGLE NOW ALLOWS EDITING BUSINESS LISTINGS

A Survey of Google's PageRank

Multiple Pivot Sort Algorithm is Faster than Quick Sort Algorithms: An Empirical Study

8 th Grade Pre Algebra Pacing Guide 1 st Nine Weeks

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search

Big picture. Definitions. Internal sorting. Exchange sorts. Insertion sort Bubble sort Selection sort Comparison. Comp Sci 1575 Data Structures

The Evolution of Web Content and Search Engines

Link Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld

Media Pack. Home About User Benefits Advertiser Benefits Advertising Packages

Contractors Guide to Search Engine Optimization

Genealogical Trees on the Web: A Search Engine User Perspective

Parallel Crawlers. 1 Introduction. Junghoo Cho, Hector Garcia-Molina Stanford University {cho,

The PageRank Citation Ranking: Bringing Order to the Web

Link Analysis and Web Search

CS 526 Advanced Topics in Compiler Construction. 1 of 12

Predict Topic Trend in Blogosphere

CS 6604: Data Mining Large Networks and Time-Series

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks

Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee

Cost-Effective Conceptual Design. over Taxonomies. Yodsawalai Chodpathumwan. University of Illinois at Urbana-Champaign.

Author(s): Rahul Sami, 2009

Computational Methods. Randomness and Monte Carlo Methods

Below, we will walk through the three main elements of the algorithm, which include Domain Attributes, On-Page and Off-Page factors.

How to organize the Web?

Knowing something about how to create this optimization to harness the best benefits will definitely be advantageous.

Search Engine Optimization

Exploring both Content and Link Quality for Anti-Spamming

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

A P2P-based Incremental Web Ranking Algorithm

CS144: Sessions. Cookie : CS144: Web Applications

ELEMENTARY NUMBER THEORY AND METHODS OF PROOF

The Evolution of the Web and Implications for an Incremental Crawler

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing

ELEMENTARY NUMBER THEORY AND METHODS OF PROOF

Information Retrieval

Big Data Analytics CSCI 4030

RCA Business & Technical Conference

Unit VIII. Chapter 9. Link Analysis

Why is the Internet so important? Reaching clients on the Internet

Graph and Link Mining

Optimizing Search Engines using Click-through Data

Big Data Analytics CSCI 4030

BUBBLE RAP: Social-Based Forwarding in Delay-Tolerant Networks

Modern Retrieval Evaluations. Hongning Wang

Empirical Characterization of P2P Systems

Transcription:

Search Engines Considered Harmful In Search of an Unbiased Web Ranking Junghoo John Cho cho@cs.ucla.edu UCLA Search Engines Considered Harmful Junghoo John Cho 1/38

Motivation If you are not indexed by Google, you do not exist on the Web News.com article, 10/23/2002 People discover pages through search engines Top results: many users Bottom results: no new users Are we biased by search engines? Search Engines Considered Harmful Junghoo John Cho 2/38

Outline Are the rich getting richer? Web popularity-evolution experiment How much bias do search engines introduce? Impact of search engines Can we avoid search-engine bias? New ranking metric Search Engines Considered Harmful Junghoo John Cho 3/38

Web Evolution Experiment Collect Web history data Is rich-get-richer happening? From Oct. 2002 until Oct. 2003 154 sites monitored Top sites from each category of Open Directory Pages downloaded every week All pages in each site A total of average 4M pages every week (65GB) Search Engines Considered Harmful Junghoo John Cho 4/38

Rich-Get-Richer Problem Construct weekly Web-link graph From the downloaded data Partition pages into 10 groups Based on initial link popularity Top 10% group, 10%-20% group, etc. How many new links to each group after a month? Rich-get-richer More new links to top groups Search Engines Considered Harmful Junghoo John Cho 5/38

Result: Simple Link Count Increase in number of in links 4 10 6 3 10 6 2 10 6 1 10 6 20 40 60 80 100 Popularity After 7 months 70% of new links to top 20% pages No new links to bottom 60% pages Search Engines Considered Harmful Junghoo John Cho 6/38

Result: PageRank Increase in PageRank 0.0020 0.0015 0.0010 0.0005 0.0005 0.0010 20 40 60 80 100 Popularity After 7 months Decrease in PageRank for bottom 50% pages Due to normalization of PageRank Search Engines Considered Harmful Junghoo John Cho 7/38

Outline Web popularity-evolution experiment Rich-get-richer is indeed happening Unpopular pages get no attention Impact of search engines How much bias do search engines introduce? New ranking metric Page quality Search Engines Considered Harmful Junghoo John Cho 8/38

Outline Web popularity-evolution experiment Rich-get-richer is indeed happening Unpopular pages get no attention Impact of search engines How much bias do search engines introduce? New ranking metric Page quality Search Engines Considered Harmful Junghoo John Cho 8/38

Search Engine Bias What we mean by bias? Search Engines Considered Harmful Junghoo John Cho 9/38

Search Engine Bias What we mean by bias? What is the ideal ranking? How do search engines rank pages? Search Engines Considered Harmful Junghoo John Cho 9/38

What is the Ideal Ranking? Rank by intrinsic quality of a page? Search Engines Considered Harmful Junghoo John Cho 10/38

What is the Ideal Ranking? Rank by intrinsic quality of a page? Very subjective notion Different quality judgment on the same page Can there be an objective definition? Search Engines Considered Harmful Junghoo John Cho 10/38

Page Quality Q(p) Definition The probability that an average Web user will like page p enough to create a link to it if he looks at it Search Engines Considered Harmful Junghoo John Cho 11/38

Page Quality Q(p) Definition The probability that an average Web user will like page p enough to create a link to it if he looks at it In principle, we can measure Q(p) by 1. showing p to all Web users and 2. counting how many people like it p 1 : 10,000 people, 8,000 liked it, Q(p 1 ) = 0.8 p 2 : 10,000 people, 2,000 liked it, Q(p 2 ) = 0.2 Search Engines Considered Harmful Junghoo John Cho 11/38

Page Quality Q(p) Definition The probability that an average Web user will like page p enough to create a link to it if he looks at it In principle, we can measure Q(p) by 1. showing p to all Web users and 2. counting how many people like it p 1 : 10,000 people, 8,000 liked it, Q(p 1 ) = 0.8 p 2 : 10,000 people, 2,000 liked it, Q(p 2 ) = 0.2 Democratic measure of quality When consensus is hard to reach, pick the one that more people like Search Engines Considered Harmful Junghoo John Cho 11/38

PageRank: Intuition A page is important if many pages link to it Not every link is equal A link from an important page matters more than others e.g. Link from Yahoo vs Link from a random home page P R(p i ) = (1 d) + d [P R(p 1 )/c 1 + + P R(p m )/c m ] Random-Surfer Model When users follow links randomly, P R(p i ) is the probability to reach p i Search Engines Considered Harmful Junghoo John Cho 12/38

Page Quality vs PageRank PageRank Page quality if everyone is given equal chance High PageRank high quality To obtain high PageRank, many people should look at the page and like it. Low PageRank low quality? PageRank is biased against new pages How much bias for low PageRank pages? Search Engines Considered Harmful Junghoo John Cho 13/38

Measuring Search-Engine Bias Ideal experiment: Divide the world into two groups The users who do not use search engines The users who use search engines very heavily Compare popularity evolution Search Engines Considered Harmful Junghoo John Cho 14/38

Measuring Search-Engine Bias Ideal experiment: Divide the world into two groups The users who do not use search engines The users who use search engines very heavily Compare popularity evolution Problem: Difficult to conduct in practice Search Engines Considered Harmful Junghoo John Cho 14/38

Theoretical Web-User Models Let us do theoretical experiments! Random-surfer model Users follow links randomly Never use search engines Search-dominant model Users always start with a search engine Only visit pages returned by the search engine Compare popularity evolution Search Engines Considered Harmful Junghoo John Cho 15/38

Basic Definitions for the Models (Simple) Popularity P(p, t) Fraction of Web users that like p at time t E.g, 100,000 users, 10,000 like p, P(p, t) = 0.1 Visit Popularity V(p, t) Number of users that visit p in a unit time Awareness A(p, t) Fraction of Web users who are aware of p E.g., 100,000 users, 30,000 aware of p, A(p, t) = 0.3 Search Engines Considered Harmful Junghoo John Cho 16/38

Basic Definitions for the Models (Simple) Popularity P(p, t) Fraction of Web users that like p at time t E.g, 100,000 users, 10,000 like p, P(p, t) = 0.1 Visit Popularity V(p, t) Number of users that visit p in a unit time Awareness A(p, t) Fraction of Web users who are aware of p E.g., 100,000 users, 30,000 aware of p, A(p, t) = 0.3 P(p, t) = Q(p) A(p, t) Search Engines Considered Harmful Junghoo John Cho 16/38

Random-Surfer Model Popularity-Equivalence Hypothesis V(p, t) = r P(p, t) (or V(p, t) P(p, t)) PageRank is visit probability under random-surfer model Higher popularity More visitors Random-Visit Hypothesis A visit is done by any user with equal probability Search Engines Considered Harmful Junghoo John Cho 17/38

Random-Surfer Model: Analysis Current popularity P(p, t) Number of visitors from V(p, t) = r P(p, t) Awareness increase A(p, t) Popularity increase P(p, t) New popularity P(p, t + 1) Search Engines Considered Harmful Junghoo John Cho 18/38

Random-Surfer Model: Analysis Current popularity P(p, t) Number of visitors from V(p, t) = r P(p, t) Awareness increase A(p, t) Popularity increase P(p, t) New popularity P(p, t + 1) Formal Analysis: Differential Equation [ P(p, t) = 1 e r t ] n 0 P(p,t)dt Q(p) Search Engines Considered Harmful Junghoo John Cho 18/38

Random-Surfer Model: Result Theorem The popularity of page p evolves over time through the following formula: P(p, t) = Q(p) 1 + [ Q(p) P(p,0) 1] e [ r n Q(p) ]t Q(p): quality of p P(p, 0): initial popularity of p at time zero n: total number of Web users. r: normalization constant in V(p, t) = r P(p, t) Search Engines Considered Harmful Junghoo John Cho 19/38

Random-Surfer Model: Popularity Graph Popularity 1.0 0.8 0.6 Infant Expansion Maturity 0.4 0.2 5 10 15 20 25 30 35 Q(p) = 1, P(p, 0) = 10 8 r, n = 1 Time Search Engines Considered Harmful Junghoo John Cho 20/38

Search-Dominant Model V(p, t) P(p, t)? Search Engines Considered Harmful Junghoo John Cho 21/38

Search-Dominant Model V(p, t) P(p, t)? For ith result, how many clicks? For PageRank P(p, t), what ranking? Search Engines Considered Harmful Junghoo John Cho 21/38

Search-Dominant Model V(p, t) P(p, t)? For ith result, how many clicks? For PageRank P(p, t), what ranking? Empirical measurement by Lempel et al. and us Search Engines Considered Harmful Junghoo John Cho 21/38

Search-Dominant Model V(p, t) P(p, t)? For ith result, how many clicks? For PageRank P(p, t), what ranking? Empirical measurement by Lempel et al. and us New Visit-Popularity Hypothesis V(p, t) = r P(p, t) 9 4 Search Engines Considered Harmful Junghoo John Cho 21/38

Search-Dominant Model V(p, t) P(p, t)? For ith result, how many clicks? For PageRank P(p, t), what ranking? Empirical measurement by Lempel et al. and us New Visit-Popularity Hypothesis V(p, t) = r P(p, t) 9 4 Random-Visit Hypothesis A visit is done by any user with equal probability Search Engines Considered Harmful Junghoo John Cho 21/38

Search-Dominant Model: Result Popularity 1.0 0.8 0.6 0.4 0.2 1500 1550 1600 1650 Time [P(p, t)] (i 9 4 ) [P(p, 0)] (i 9 4 ( ) ) i 9 = r t (same parameters as before) 4 Q(p) i n i=1 Search Engines Considered Harmful Junghoo John Cho 22/38

Comparison of Two Models Time to final popularity Random surfer: 25 time units Search dominant: 1650 time units 66 times increases! Expansion stage Random surfer: 12 time units Search dominant: non existent Popularity 1.0 0.8 0.6 0.4 0.2 Infant Expansion Maturity Popularity 1.0 0.8 0.6 0.4 0.2 5 10 15 20 25 30 35 Time 1500 1550 1600 1650 Time Search Engines Considered Harmful Junghoo John Cho 23/38

Outline Web popularity-evolution experiment Is rich-get-richer happening? Impact of search engines Random-surfer model Search-dominant model New ranking metric How to measure page quality? Search Engines Considered Harmful Junghoo John Cho 24/38

Measuring Quality: Basic Idea Quality: probability of link creation by a new visitor Search Engines Considered Harmful Junghoo John Cho 25/38

Measuring Quality: Basic Idea Quality: probability of link creation by a new visitor Assuming the same number of visitors Q(p) Number of new links (or popularity increase) Search Engines Considered Harmful Junghoo John Cho 25/38

Measuring Quality: Basic Idea Quality: probability of link creation by a new visitor Assuming the same number of visitors Q(p) Number of new links (or popularity increase) Quality Estimator ˆQ(p) = P(p) Search Engines Considered Harmful Junghoo John Cho 25/38

Measuring Quality: Problem 1 Different number of visitors to each page More visitors to more popular page How to account for number of visitors? Quality Estimator ˆQ(p) = P(p) Search Engines Considered Harmful Junghoo John Cho 26/38

Measuring Quality: Problem 1 Different number of visitors to each page More visitors to more popular page How to account for number of visitors? Idea: PageRank = visit probability Quality Estimator ˆQ(p) = P(p) Search Engines Considered Harmful Junghoo John Cho 26/38

Measuring Quality: Problem 1 Different number of visitors to each page More visitors to more popular page How to account for number of visitors? Idea: PageRank = visit probability Quality Estimator ˆQ(p) = P(p)/P(p) Search Engines Considered Harmful Junghoo John Cho 26/38

Measuring Quality: Problem 2 No more new links to very popular pages Everyone already knows them P(p)/P(p) 0 for well-known pages How to account for well-known pages? Quality Estimator ˆQ(p) = P(p)/P(p) Search Engines Considered Harmful Junghoo John Cho 27/38

Measuring Quality: Problem 2 No more new links to very popular pages Everyone already knows them P(p)/P(p) 0 for well-known pages How to account for well-known pages? Idea: P(p) = Q(p) when everyone knows p Use P(p) to measure Q(p) for well-known pages Quality Estimator ˆQ(p) = P(p)/P(p) Search Engines Considered Harmful Junghoo John Cho 27/38

Measuring Quality: Problem 2 No more new links to very popular pages Everyone already knows them P(p)/P(p) 0 for well-known pages How to account for well-known pages? Idea: P(p) = Q(p) when everyone knows p Use P(p) to measure Q(p) for well-known pages Quality C: relative Estimator weight given to popularity increase ˆQ(p) = C P(p)/P(p) + P(p) C: weight given to popularity increase Search Engines Considered Harmful Junghoo John Cho 27/38

Measuring Quality: Theoretical Proof Theorem Under the random-surfer model, the quality of page p, Q(p), always satisfies the following equation: ( n Q(p) = r ) ( dp(p, t)/dt P(p, t) ) + P(p, t) Compare it with ˆQ(p) = C P(p) P(p) + P(p) Search Engines Considered Harmful Junghoo John Cho 28/38

Is Page Quality Effective? How to measure its effectiveness? Implement it to a major search engine? Any other alternatives? Search Engines Considered Harmful Junghoo John Cho 29/38

Is Page Quality Effective? How to measure its effectiveness? Implement it to a major search engine? Any other alternatives? Idea: Pages eventually obtain deserved popularity (however long it may take...) Future PageRank Q(p) Search Engines Considered Harmful Junghoo John Cho 29/38

Page Quality: Evaluation (1) ˆQ(p) as a predictor of future PageRank Compare the correlations of current ˆQ(p) with future PageRank current PageRank with future PageRank ˆQ(p) predicts future PageRank better? Search Engines Considered Harmful Junghoo John Cho 30/38

Page Quality: Evaluation (1) ˆQ(p) as a predictor of future PageRank Compare the correlations of current ˆQ(p) with future PageRank current PageRank with future PageRank ˆQ(p) predicts future PageRank better? Download the Web multiple times with long intervals 1 month 4 months t 1 t 2 t 3 t 4 Search Engines Considered Harmful Junghoo John Cho 30/38

Page Quality: Evaluation (1) ˆQ(p) as a predictor of future PageRank Compare the correlations of current ˆQ(p) with future PageRank current PageRank with future PageRank ˆQ(p) predicts future PageRank better? Download the Web multiple times with long intervals 1 month 4 months t 1 t 2 t 3 t 4 P R(p, t 3 ) P R(p, t 4 )? Search Engines Considered Harmful Junghoo John Cho 30/38

Page Quality: Evaluation (1) ˆQ(p) as a predictor of future PageRank Compare the correlations of current ˆQ(p) with future PageRank current PageRank with future PageRank ˆQ(p) predicts future PageRank better? Download the Web multiple times with long intervals 1 month 4 months t 1 t 2 t 3 t 4 ˆQ(p, t 3 ) P R(p, t 4 )? Search Engines Considered Harmful Junghoo John Cho 30/38

Page Quality: Evaluation (2) Compare the average relative error P R(p,t 4 ) err(p) = P R(p,t 4) ˆQ(p,t 3 ) P R(p,t 4) P R(p,t 3 ) P R(p,t 4 ) For the pages whose PageRank consistently increased/decreased from t 1 through t 3. Search Engines Considered Harmful Junghoo John Cho 31/38

Page Quality: Evaluation (2) Compare the average relative error P R(p,t 4 ) err(p) = Result P R(p,t 4) ˆQ(p,t 3 ) P R(p,t 4) P R(p,t 3 ) P R(p,t 4 ) For ˆQ(p, t 3 ): average err = 0.32 For P R(p, t 3 ): average err = 0.78 ˆQ(p, t 3 ) twice as accurate. For the pages whose PageRank consistently increased/decreased from t 1 through t 3. Search Engines Considered Harmful Junghoo John Cho 31/38

Quality Evaluation: More Detail Fraction of pages 0.6 0.5 0.4 Q(p) PR(p) 0.3 0.2 0.1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. err(p) Search Engines Considered Harmful Junghoo John Cho 32/38

Summary Web popularity-evolution experiment Rich-get-richer is indeed happening Impact of search engines Random-surfer model Search-dominant model Search engines have worrisome impact New ranking metric Page quality Search Engines Considered Harmful Junghoo John Cho 33/38

Page Quality Improved version of PageRank Accounts for different levels of exposure Third-generation ranking principle 1 st : Term frequency 2 nd : Link structure 3 rd : Link evolution Potential to reduce Web inequality Identify high-quality pages early on Search Engines Considered Harmful Junghoo John Cho 34/38

Thank You For more details, see A. Ntoulas, J. Cho and C. Olston. What s New on the Web? In WWW Conference, 2004. J. Cho and S. Roy Impact of Web Search Engines on Page Popularity In WWW Conference, 2004. J. Cho and R. Adams. Page Quality: In Search of an Unbiased Web Ranking UCLA CS Department, Nov. 2003. Any Questions? Search Engines Considered Harmful Junghoo John Cho 35/38

Popularity Increase: Relative Link Count Relative increase in number of in inks 1 0.8 0.6 0.4 0.2 20 40 60 80 100 Popularity Search Engines Considered Harmful Junghoo John Cho 36/38

Popularity Increase: Relative PageRank Relative increase in PageRank 0.005 0.005 0.01 0.015 20 40 60 80 100 Popularity Search Engines Considered Harmful Junghoo John Cho 37/38

Search-Dominant Model: Result Popularity 1 0.8 0.6 0.4 0.2 1650.24336 1650.24337 1650.24338 Time Search Engines Considered Harmful Junghoo John Cho 38/38