Web Data Management, Part 2
Advanced Topics in Database Management (INFSCI 2711)
Textbooks: Database System Concepts (2010); Introduction to Information Retrieval (2008)
Vladimir Zadorozhny, DINS, SCI, University of Pittsburgh

Brief (non-technical) history
- Early keyword-based engines: AltaVista, Excite, Infoseek, Inktomi, ca. 1995-1997
- Sponsored search ranking: Goto.com (morphed into Overture.com, later acquired by Yahoo!)
  - Your search ranking depended on how much you paid
  - Auction for keywords: "casino" was expensive!
The trouble with sponsored search
- It costs money. What's the alternative?
- Search Engine Optimization (SEO): tuning your web page to rank highly in the algorithmic search results for selected keywords
  - An alternative to paying for placement; thus, intrinsically a marketing function
  - Performed by companies, webmasters and consultants ("search engine optimizers") for their clients
  - Some perfectly legitimate, some very shady

Simplest forms
- First-generation engines relied heavily on tf-idf
  - The top-ranked pages for the query "maui resort" were the ones containing the most occurrences of "maui" and "resort"
- SEOs responded with dense repetitions of chosen terms, e.g., "maui resort maui resort maui resort"
  - Often the repetitions were in the same color as the background of the web page: repeated terms got indexed by crawlers, but were not visible to humans in browsers
- Pure word density cannot be trusted as an IR signal
Link-based Ranking
- 1998+: link-based ranking pioneered by Google
  - Blew away the early engines
  - Great user experience in search of a business model
- Meanwhile, Goto/Overture's annual revenues were nearing $1 billion
- Result: Google added paid-placement ads to the side, independent of the search results
  - Yahoo followed suit, acquiring Overture (for paid placement) and Inktomi (for search)
  - Ads are shown alongside, but separately from, the algorithmic results
The Web as a Directed Graph
- A hyperlink with anchor text on Page A points to Page B
- Assumption 1: a hyperlink between pages denotes author-perceived relevance (quality signal)
- Assumption 2: the anchor of the hyperlink describes the target page (textual context)

Anchor Text
- For the query "ibm", how do we distinguish between:
  - IBM's home page (mostly graphical)
  - IBM's copyright page (high term frequency for "ibm")
  - a rival's spam page (arbitrarily high term frequency)
- A million pieces of anchor text containing "ibm" and pointing to www.ibm.com send a strong signal that www.ibm.com is the IBM home page
Indexing anchor text
- When indexing a document D, include anchor text from links pointing to D
  - e.g., "Armonk, NY-based computer giant IBM announced today...", "Joe's computer hardware links: Sun, HP, IBM", and "Big Blue today announced record profits for the quarter" all contain links to www.ibm.com

Simple Link Popularity
- First generation: use link counts as simple measures of popularity
- Two basic suggestions (a small sketch of both follows below):
  - Undirected popularity: each page gets a score equal to the number of its in-links plus the number of its out-links (e.g., 3 + 2 = 5)
  - Directed popularity: score of a page = the number of its in-links (e.g., 3)
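A minimal sketch of these two measures, assuming the link graph is given as a set of (source, target) hyperlink pairs; the function name and graph representation are illustrative:

```python
def popularity_scores(links):
    """Compute the two link-count popularity measures from the slide.

    `links` is a set of (source, target) pairs, one per hyperlink.
    """
    pages = {p for edge in links for p in edge}
    in_deg = {p: 0 for p in pages}
    out_deg = {p: 0 for p in pages}
    for src, dst in links:
        out_deg[src] += 1
        in_deg[dst] += 1
    undirected = {p: in_deg[p] + out_deg[p] for p in pages}  # in-links + out-links
    directed = dict(in_deg)                                  # in-links only
    return undirected, directed

# Toy graph: page "a" has 3 in-links and 2 out-links -> scores 5 and 3
links = {("b", "a"), ("c", "a"), ("d", "a"), ("a", "b"), ("a", "c")}
undirected, directed = popularity_scores(links)
print(undirected["a"], directed["a"])   # 5 3
```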
Query processing
- First retrieve all pages matching the text query (say, "venture capital")
- Order these by their link popularity (either variant from the previous slide)
- Can be easily spammed!

What is PageRank
- Reference: http://infolab.stanford.edu/pub/papers/google.pdf
- PageRank is a vote, by all the other pages on the Web, about how important a page is
  - A link to a page: a vote of support
  - No link: no support (an abstention, not a vote against)
- The PageRank of those who vote matters (e.g., 1 vote from a highly ranked page vs. 100 votes from low-ranked pages)
- The PageRank of a page must be calculated without knowing the final PageRank values of the other pages
What is PageRank (cont)
- More formally: suppose pages T1, ..., Tn point to page A, and C(A) is the number of links going out of page A. Then the PageRank (PR) of page A is:

  PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

- d is a damping factor which can be set between 0 and 1 (usually 0.85)
- PRs form a probability distribution over web pages, so the (normalized) sum of all web pages' PRs will be 1
- PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web (we will talk about it later)

Simple Example of PR Calculation
- Network: two pages, each pointing to the other
- Each page has one outgoing link (C(A) = 1 and C(B) = 1); starting from PR = 0 and iterating:
  PR(A) = 0.15 + 0.85 * 0            = 0.15
  PR(B) = 0.15 + 0.85 * 0.15         = 0.2775
  PR(A) = 0.15 + 0.85 * 0.2775       = 0.385875
  PR(B) = 0.15 + 0.85 * 0.385875     = 0.47799375
  PR(A) = 0.15 + 0.85 * 0.47799375   = 0.5562946875
  PR(B) = 0.15 + 0.85 * 0.5562946875 = 0.622850484375
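A small sketch of this iterative calculation, reproducing the two-page trace above; the alternating A/B update order and the zero starting values follow the slide, while the variable names are illustrative:

```python
# Iterative PageRank update for the two-page example (A <-> B), d = 0.85
d = 0.85
pr = {"A": 0.0, "B": 0.0}
out_links = {"A": ["B"], "B": ["A"]}   # each page has exactly one out-link
in_links = {"A": ["B"], "B": ["A"]}

for step in range(6):
    # Update one page at a time, alternating A and B as in the slide's trace
    page = "A" if step % 2 == 0 else "B"
    pr[page] = (1 - d) + d * sum(pr[t] / len(out_links[t]) for t in in_links[page])
    print(page, pr[page])
```

Running it prints the same sequence of values as the slide; the iteration converges towards PR(A) = PR(B) = 1.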
Simple Hierarchy
[Figure: example PR calculation on a simple hierarchical link structure]

Looping
[Figure: example PR calculation on a link structure containing a loop]
Full Mesh
[Figure: example PR calculation on a fully connected link structure]

Impact of Non-Voters
[Figure: example PR calculation showing the effect of pages that do not link out]
Dynamic View of PageRank Scoring
- Imagine a browser doing a random walk on web pages:
  - Start at a random page
  - At each step, go out of the current page along one of the links on that page, equiprobably (e.g., with probability 1/3 each if the page has three out-links)
- In the steady state each page has a long-term visit rate; use this as the page's score
- Idea: pages visited more often are more important

Not quite enough
- The web is full of dead ends (pages with no out-links)
- A random walk can get stuck in dead ends
- Then it makes no sense to talk about long-term visit rates
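A toy simulation of this random-surfer view, under the assumption that every page has at least one out-link (exactly the dead-end problem just noted, which teleporting fixes on the next slide); the graph and page numbering are illustrative:

```python
import random

def visit_rates(out_links, steps=100_000, start=0):
    """Estimate long-term visit rates by simulating the random surfer.

    out_links[i] is the list of pages that page i links to; every page is
    assumed to have at least one out-link.
    """
    counts = [0] * len(out_links)
    page = start
    for _ in range(steps):
        counts[page] += 1
        page = random.choice(out_links[page])   # follow a random out-link
    return [c / steps for c in counts]

# Tiny 3-page example: 0 -> {1, 2}, 1 -> {0}, 2 -> {0, 1}
print(visit_rates([[1, 2], [0], [0, 1]]))
```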
Teleporting
- At a dead end, jump to a random web page
- At any non-dead end, with probability 10%, jump to a random web page (teleport)
- With the remaining probability (90%), go out on a random link
- The 10% is a parameter
- Now the walk cannot get stuck locally
- There is a long-term rate at which any page is visited
- How do we compute this visit rate?

Markov chains
- A Markov chain consists of n states, plus an n x n transition probability matrix P
- At each step, we are in exactly one of the states
- For 1 <= i, j <= n, the matrix entry P_ij tells us the probability of j being the next state, given we are currently in state i
- For all i: sum_j P_ij = 1 (each row of P sums to 1)
- Markov chains are abstractions of random walks
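A sketch of how the teleporting rule turns a link graph into the transition matrix P of such a Markov chain; the 10% teleport probability and the uniform jump at dead ends follow the slides, while the adjacency-matrix representation and function name are assumptions:

```python
import numpy as np

def transition_matrix(adj, teleport=0.10):
    """Build the random-surfer transition matrix P with teleporting.

    adj[i][j] == 1 means page i links to page j. At a dead end the surfer
    always teleports; otherwise it teleports with probability `teleport`
    and follows a random out-link with probability 1 - teleport.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    uniform = np.full(n, 1.0 / n)
    out_deg = adj.sum(axis=1)

    P = np.empty((n, n))
    for i in range(n):
        if out_deg[i] == 0:                  # dead end: teleport uniformly
            P[i] = uniform
        else:                                # mix teleporting and link-following
            P[i] = teleport * uniform + (1 - teleport) * adj[i] / out_deg[i]
    return P

# Example: 3 pages, page 2 is a dead end; every row of P sums to 1
P = transition_matrix([[0, 1, 1],
                       [1, 0, 0],
                       [0, 0, 0]])
print(P.round(3))
```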
Ergodic Markov chains
- A Markov chain is ergodic if:
  - there is a path from any state to any other state, and
  - for any start state, after a finite transient time T0, the probability of being in any state at a fixed time T > T0 is nonzero
- For any ergodic Markov chain, there is a unique long-term visit rate for each state: the steady-state probability distribution
  - Over a long time period, we visit each state in proportion to this rate
  - It doesn't matter where we start
- [Figure: example of a chain that is not ergodic]

Probability vectors
- A probability (row) vector x = (x_1, ..., x_n) tells us where the walk is at any point
- E.g., (0 0 ... 0 1 0 ... 0), with the 1 in position i, means we are in state i
- More generally, the vector x = (x_1, ..., x_n) means the walk is in state i with probability x_i
Change in probability vector
- If the probability vector is x = (x_1, ..., x_n) at this step, what is it at the next step?
- Recall that row i of the transition probability matrix P tells us where we go next from state i
- So from x, our next state is distributed as xP

Steady state example
- The steady state is itself a vector of probabilities a = (a_1, ..., a_n), where a_i is the probability that we are in state i
- Example: a two-state chain in which, from either state, the walk moves to state 1 with probability 1/4 and to state 2 with probability 3/4
- For this example, a_1 = 1/4 and a_2 = 3/4
How do we compute this vector?
- Let a = (a_1, ..., a_n) denote the row vector of steady-state probabilities
- If our current position is described by a, then the next step is distributed as aP
- But a is the steady state, so a = aP
- Solving this matrix equation gives us a

One way of computing a
- Recall that, regardless of where we start, we eventually reach the steady state a
- Start with any distribution, say x = (1 0 ... 0)
- After one step we are at xP; after two steps at xP^2, then xP^3, and so on
- "Eventually" means that, for large k, xP^k = a
- Algorithm: multiply x by increasing powers of P until the product looks stable (see the sketch below)
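A minimal power-iteration sketch of this algorithm, applied to the two-state example from the previous slide; the tolerance and iteration limit are illustrative choices:

```python
import numpy as np

def steady_state(P, tol=1e-10, max_iter=1000):
    """Multiply x by P until the distribution stops changing.

    P is any row-stochastic transition matrix (e.g., one built by the
    transition_matrix sketch above).
    """
    n = P.shape[0]
    x = np.zeros(n)
    x[0] = 1.0                       # start with all probability mass on state 0
    for _ in range(max_iter):
        x_next = x @ P               # one step of the walk: x -> xP
        if np.abs(x_next - x).sum() < tol:
            break
        x = x_next
    return x_next

# Two-state example: from either state, go to state 1 w.p. 1/4, state 2 w.p. 3/4
P = np.array([[0.25, 0.75],
              [0.25, 0.75]])
print(steady_state(P))               # approximately [0.25, 0.75]
```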
PageRank Summary
- Preprocessing:
  - Given the graph of links, build the matrix P
  - From it compute a; the entry a_i is a number between 0 and 1: the PageRank of page i
- Query processing:
  - Retrieve pages meeting the query
  - Rank them by their PageRank; the order is query-independent (a small sketch appears after this slide)

PageRank: Issues and Variants
- How realistic is the random surfer model?
  - What if we modeled the back button?
  - Surfer behavior is sharply skewed towards short paths
  - Search engines, bookmarks and directories make jumps non-random
- Biased surfer models:
  - Weight edge-traversal probabilities based on match with topic/query (non-uniform edge selection)
  - Bias jumps towards pages on topic (e.g., based on personal bookmarks and categories of interest)
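A sketch of the query-processing step summarized above, assuming an inverted index and the precomputed PageRank vector a are available; the data structures and all names here are illustrative:

```python
def ranked_results(query_terms, inverted_index, pagerank):
    """Retrieve pages matching all query terms, then rank them by PageRank.

    `inverted_index` maps a term to the set of pages containing it, and
    `pagerank` maps a page to its precomputed score a_i.
    """
    # Pages that contain every query term
    matches = set.intersection(*(inverted_index[t] for t in query_terms))
    # Query-independent order: highest PageRank first
    return sorted(matches, key=lambda page: pagerank[page], reverse=True)

# Toy example
index = {"venture": {"p1", "p2", "p3"}, "capital": {"p2", "p3", "p4"}}
pr = {"p1": 0.05, "p2": 0.40, "p3": 0.15, "p4": 0.10}
print(ranked_results(["venture", "capital"], index, pr))   # ['p2', 'p3']
```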
Topic-Specific PageRank
- Conceptually, we use a random surfer who teleports, with say 10% probability, using the following rule:
  - Select a category (say, one of the 16 top-level ODP categories) based on a query- and user-specific distribution over the categories
  - Teleport to a page chosen uniformly at random within that category
- Sounds hard to implement: we can't compute PageRank at query time!

Topic-Specific PageRank (cont)
- Offline: compute a PageRank for each individual category
  - Query-independent, as before
  - Each page has multiple PageRank scores, one for each category, with teleportation only to that category
- Online: a distribution of weights over the categories is computed by query-context classification
  - Generate a dynamic PageRank score for each page as a weighted sum of the category-specific PageRanks (see the sketch below)
- Can be generalized in terms of Data Linkage
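A sketch of this online weighting step, assuming the per-category PageRanks have already been computed offline; the data structures, names and numbers are illustrative:

```python
def topic_sensitive_score(page, topic_pagerank, topic_weights):
    """Blend category-specific PageRanks into one query-time score.

    `topic_pagerank[c][page]` is the offline PageRank of the page computed
    with teleportation only to category c, and `topic_weights[c]` is the
    weight of category c from query-context classification.
    """
    return sum(topic_weights[c] * topic_pagerank[c][page] for c in topic_weights)

# Toy example with two categories
topic_pr = {"Sports":  {"p1": 0.30, "p2": 0.05},
            "Finance": {"p1": 0.02, "p2": 0.25}}
weights = {"Sports": 0.1, "Finance": 0.9}     # from query-context classification
print(topic_sensitive_score("p2", topic_pr, weights))   # 0.23
```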