Brief (non-technical) history

Similar documents
Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 21 Link analysis

Pagerank Scoring. Imagine a browser doing a random walk on web pages:

F. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google

Lecture 8: Linkage algorithms and web search

Link Analysis SEEM5680. Taken from Introduction to Information Retrieval by C. Manning, P. Raghavan, and H. Schutze, Cambridge University Press.

CS60092: Informa0on Retrieval

Multimedia Content Management: Link Analysis. Ralf Moeller Hamburg Univ. of Technology

Introduction to Information Retrieval

TODAY S LECTURE HYPERTEXT AND

Administrative. Web crawlers. Web Crawlers and Link Analysis!

Link analysis. Query-independent ordering. Query processing. Spamming simple popularity

Web Search Basics Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson

Information Retrieval and Web Search

Information retrieval

The changing face of web search. Prabhakar Raghavan Yahoo! Research

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

CS6200 Information Retreival. The WebGraph. July 13, 2015

Link analysis in web IR CE-324: Modern Information Retrieval Sharif University of Technology

Information Retrieval. Lecture 11 - Link analysis

This lecture. CS276B Text Retrieval and Mining Winter Pagerank: Issues and Variants. Influencing PageRank ( Personalization )

Link analysis in web IR CE-324: Modern Information Retrieval Sharif University of Technology

CS6322: Information Retrieval Sanda Harabagiu. Lecture 10: Link analysis

Information Retrieval

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

Lecture 8: Linkage algorithms and web search

Lec 8: Adaptive Information Retrieval 2

Text Technologies for Data Science INFR11145 Web Search Walid Magdy Lecture Objectives

Link Analysis and Web Search

COMP Page Rank

PV211: Introduction to Information Retrieval

Link Structure Analysis

Part 1: Link Analysis & Page Rank

Big Data Analytics CSCI 4030

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Mining The Web. Anwar Alhenshiri (PhD)

Big Data Analytics CSCI 4030

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Introduction to Data Mining

How to organize the Web?

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

PageRank for Product Image Search. Research Paper By: Shumeet Baluja, Yushi Jing

Web Search. Lecture Objectives. Text Technologies for Data Science INFR Learn about: 11/14/2017. Instructor: Walid Magdy

Mining Web Data. Lijun Zhang

Web Search Basics. Berlin Chen Department t of Computer Science & Information Engineering National Taiwan Normal University

Web Search Basics. Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University

Mining Web Data. Lijun Zhang

Informa(on Retrieval

Today s lecture hypertext and links

Information Networks: PageRank

Information Retrieval

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank

Information retrieval. Lecture 9

Introduction In to Info Inf rmation o Ret Re r t ie i v e a v l a LINK ANALYSIS 1

Slides based on those in:

A Survey of Google's PageRank

World Wide Web has specific challenges and opportunities

A Survey on Web Information Retrieval Technologies

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases

Information Retrieval and Web Search Engines

CS276A Text Information Retrieval, Mining, and Exploitation. Lecture November, 2002

Unit VIII. Chapter 9. Link Analysis

CS425: Algorithms for Web Scale Data

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Informa(on Retrieval

Link Analysis. Hongning Wang

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE

Information Retrieval. Lecture 9 - Web search basics

NBA 600: Day 15 Online Search 116 March Daniel Huttenlocher

Social Networks 2015 Lecture 10: The structure of the web and link analysis

CENTRALITIES. Carlo PICCARDI. DEIB - Department of Electronics, Information and Bioengineering Politecnico di Milano, Italy

Lecture 27: Learning from relational data

Bruno Martins. 1 st Semester 2012/2013

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

Recent Researches on Web Page Ranking

Introduction to Data Mining

COMP 4601 Hubs and Authorities

Link Analysis in Web Mining

Agenda. Math Google PageRank algorithm. 2 Developing a formula for ranking web pages. 3 Interpretation. 4 Computing the score of each page

Crawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University

~ Ian Hunneybell: WWWT Revision Notes (15/06/2006) ~

CSI 445/660 Part 10 (Link Analysis and Web Search)

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

Web-Mining Agents Community Analysis. Tanya Braun Universität zu Lübeck Institut für Informationssysteme

A New Technique for Ranking Web Pages and Adwords

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS

PageRank Algorithm Abstract: Keywords: I. Introduction II. Text Ranking Vs. Page Ranking

Learning to Rank Networked Entities

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

1.6 Case Study: Random Surfer

SEARCH ENGINE INSIDE OUT

Transcription:

Web Data Management Part 2 Advanced Topics in Database Management (INFSCI 2711) Textbooks: Database System Concepts - 2010 Introduction to Information Retrieval - 2008 Vladimir Zadorozhny, DINS, SCI, University of Pittsburgh Brief (non-technical) history Early keyword-based engines Altavista, Excite, Infoseek, Inktomi, ca. 1995-1997 Sponsored search ranking: Goto.com (morphed into Overture.com Yahoo!) Your search ranking depended on how much you paid Auction for keywords: casino was expensive! 1

The trouble with sponsored search It costs money. What s the alternative? Search Engine Optimization: Tuning your web page to rank highly in the algorithmic search results for select keywords Alternative to paying for placement Thus, intrinsically a marketing function Performed by companies, webmasters and consultants ( Search engine optimizers ) for their clients Some perfectly legitimate, some very shady Simplest forms First generation engines relied heavily on tf/idf The top-ranked pages for the query maui resort were the ones containing the most maui s and resort s SEOs responded with dense repetitions of chosen terms e.g., maui resort maui resort maui resort Often, the repetitions would be in the same color as the background of the web page 4 Repeated terms got indexed by crawlers 4 But not visible to humans on browsers Pure word density cannot be trusted as an IR signal 2

Link-based Ranking 1998+: Link-based ranking pioneered by Google Blew away early engines Great user experience in search of a business model Meanwhile Goto/Overture s annual revenues were nearing $1 billion Result: Google added paid-placement ads to the side, independent of search results Yahoo followed suit, acquiring Overture (for paid placement) and Inktomi (for search) Ads Algorithmic results. 3

The Web as a Directed Graph Page A Anchor hyperlink Page B Assumption 1: A hyperlink between pages denotes author perceived relevance (quality signal) Assumption 2: The anchor of the hyperlink describes the target page (textual context) Anchor Text For ibm how to distinguish between: IBM s home page (mostly graphical) IBM s copyright page (high term freq. for ibm ) Rival s spam page (arbitrarily high term freq.) ibm ibm.com IBM home page A million pieces of anchor text with ibm send a strong signal www.ibm.com 4

Indexing anchor text When indexing a document D, include anchor text from links pointing to D. Armonk, NY-based computer giant IBM announced today www.ibm.com Joe s computer hardware links Sun HP IBM Big Blue today announced record profits for the quarter Simple Link Popularity First generation: using link counts as simple measures of popularity. Two basic suggestions: Undirected popularity: 4 Each page gets a score = the number of in-links plus the number of out-links (3+2=5). Directed popularity: 4 Score of a page = number of its in-links (3). 5

Query processing First retrieve all pages meeting the text query (say venture capital). Order these by their link popularity (either variant on the previous page). Can be easily spammed! What is PageRank Reference: http://infolab.stanford.edu/pub/papers/google.pdf PageRank is a vote, by all the other pages on the Web, about how important a page is. A link to a page: a vote of support. No link: no support (abstention not a vote against). PageRank of those who vote maters (e.g, 1 vote from a highly ranked page vs 100 votes from low rank pages). PageRank of a page should be calculated without knowing the final value of the PageRank of the other pages. 6

What is PageRank (cont) More formally: Page A has pages T1 Tn which point to it. C(A) is the number of links going out of page A. Then PageRank (PR) of a page A: PR(A) = (1-d) + d (PR(T1)/C(T1) + + PR(Tn)/C(Tn)) d is a damping factor which can be set between 0 and 1 (usually 0.85). PRs form a probability distribution over web pages, so (normalized) sum of all web pages PRs will be 1. PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web (we will talk about it later). Simple Example of PR Calculation Network: two pages, each pointing to the other: Each page has one outgoing link (C(A) = 1 and C(B) = 1). PR(A)= 0.15 + 0.85 * 0 = 0.15 PR(B)= 0.15 + 0.85 * 0.15 = 0.2775 PR(A)= 0.15 + 0.85 * 0.2775 = 0.385875 PR(B)= 0.15 + 0.85 * 0.385875 = 0.47799375 PR(A)= 0.15 + 0.85 * 0.47799375 = 0.5562946875 PR(B)= 0.15 + 0.85 * 0.5562946875 = 0.622850484375. 7

Simple Hierarchy Looping 8

Full Mesh Impact of Non-Voters 9

Dynamic View of PageRank Scoring Imagine a browser doing a random walk on web pages: Start at a random page At each step, go out of the current page along one of the links on that page, equiprobably In the steady state each page has a long-term visit rate - use this as the page s score. 1/3 1/3 1/3 Idea: pages visited more often are more important. Not quite enough The web is full of dead-ends. Random walk can get stuck in dead-ends. Makes no sense to talk about long-term visit rates.?? 10

Teleporting At a dead end, jump to a random web page. At any non-dead end, with probability 10%, jump to a random web page (teleport). With remaining probability (90%), go out on a random link. 10% - a parameter. Now cannot get stuck locally. There is a long-term rate at which any page is visited. How do we compute this visit rate? Markov chains A Markov chain consists of n states, plus an n n transition probability matrix P. At each step, we are in exactly one of the states. For 1 i,j n, the matrix entry P ij tells us the probability of j being the next state, given we are currently in state i. i P ij j For all i: Markov chains are abstractions of random walks. 11

Ergodic Markov chains A Markov chain is ergodic if you have a path from any state to any other For any start state, after a finite transient time T 0, the probability of being in any state at a fixed time T>T 0 is nonzero. For any ergodic Markov chain, there is a unique long-term visit rate for each state. Steady-state probability distribution. Over a long time-period, we visit each state in proportion to this rate. It doesn t matter where we start. Not ergodic Probability vectors A probability (row) vector x = (x 1, x n ) tells us where the walk is at any point. E.g., (000 1 000) means we re in state i. 1 i n More generally, the vector x = (x 1, x n ) means the walk is in state i with probability x i. 12

Change in probability vector If the probability vector is x = (x 1, x n ) at this step, what is it at the next step? Recall that row i of the transition prob. matrix P tells us where we go next from state i. So from x, our next state is distributed as xp. Steady state example The steady state looks like a vector of probabilities a = (a 1, a n ): a i is the probability that we are in state i. 1/4 3/4 1 2 1/4 3/4 For this example, a 1 =1/4 and a 2 =3/4. 13

How do we compute this vector? Let a = (a 1, a n ) denote the row vector of steady-state probabilities. If our current position is described by a, then the next step is distributed as ap. But a is the steady state, so a=ap. Solving this matrix equation gives us a. One way of computing a Recall, regardless of where we start, we eventually reach the steady state a. Start with any distribution (say x=(10 0)). After one step, we re at xp; after two steps at xp 2, then xp 3 and so on. Eventually means for large k, xp k = a. Algorithm: multiply x by increasing powers of P until the product looks stable. 14

PageRank Summary Preprocessing: Given graph of links, build matrix P. From it compute a. The entry a i is a number between 0 and 1: the pagerank of page i. Query processing: Retrieve pages meeting query. Rank them by their pagerank. Order is query-independent. PageRank: Issues and Variants How realistic is the random surfer model? What if we modeled the back button? Surfer behavior sharply skewed towards short paths Search engines, bookmarks & directories make jumps nonrandom. Biased Surfer Models Weight edge traversal probabilities based on match with topic/query (non-uniform edge selection) Bias jumps to pages on topic (e.g., based on personal bookmarks & categories of interest) 15

Topic Specific Pagerank Conceptually, we use a random surfer who teleports, with say 10% probability, using the following rule: 4 Selects a category (say, one of the 16 top level ODP categories) based on a query & user -specific distribution over the categories 4 Teleport to a page uniformly at random within the chosen category Sounds hard to implement: can t compute PageRank at query time! Topic Specific PageRank Offline:Compute pagerank for individual categories Query independent as before Each page has multiple pagerank scores one for each category, with teleportation only to that category Online: Distribution of weights over categories computed by query context classification Generate a dynamic pagerank score for each page - weighted sum of categoryspecific pageranks Can be generalized in terms of Data Linkage 16