the math behind Google, Sat 25 March 2006
A brief history of Google 1995-7: The Stanford days (aka Backrub(!?)). 1998: Yahoo! wouldn't buy (but they might invest...). 1999: Finally out of beta! [Photos: Sergey Brin, Larry Page]
The beta days
The big picture Q: What happens when you type in a search query? A: Thousands of trained monkeys type the results very very quickly (?) Query: How to multiply matrices that's easy
The big picture Well, it's a little more complicated... Your words → word IDs (lexicon) → looked up in reverse index → intersection → sort by relevance → report back to you!
The big picture Each word in the query is converted to a corresponding word ID by the lexicon. Each word ID is mapped to a list of docids (a docid = a unique # associated with each web page). Take the intersection of these lists. Often there are many, many results, but the user only cares about one or two! So put the best ones at the top! (but how?)
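The lookup steps above can be sketched in a few lines of Python. This is only a toy illustration: the three pages, their docids, and the relevance scores are made up, and names such as `search` and `reverse_index` are hypothetical, not anything from Google's code.

```python
# Hypothetical toy data: three tiny "web pages", keyed by docid.
pages = {
    1: "how to multiply matrices",
    2: "matrices and monkeys",
    3: "how monkeys type",
}

# Lexicon: word -> word ID (IDs assigned in order of first appearance).
lexicon = {}
# Reverse index: word ID -> set of docids whose page contains that word.
reverse_index = {}

for docid, text in pages.items():
    for word in text.split():
        word_id = lexicon.setdefault(word, len(lexicon))
        reverse_index.setdefault(word_id, set()).add(docid)


def search(query, relevance):
    """Return docids containing every query word, best-scoring first."""
    doc_lists = []
    for word in query.lower().split():
        if word not in lexicon:
            return []  # an unknown word matches no page
        doc_lists.append(reverse_index[lexicon[word]])
    if not doc_lists:
        return []
    matches = set.intersection(*doc_lists)
    return sorted(matches, key=lambda docid: relevance[docid], reverse=True)


# Placeholder relevance scores; computing good ones is the PageRank problem.
relevance = {1: 0.5, 2: 0.3, 3: 0.2}
print(search("matrices", relevance))     # pages containing "matrices"
print(search("how monkeys", relevance))  # pages containing both words
```

The only interesting design question left is where `relevance` comes from, which is what the rest of the talk is about.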
The PageRank Problem But how does a computer program (and/or a trained monkey) know which page out of thousands is the best? Algorithm needs to be: as objective as possible, and hard to hack by advertisers! Before Google, most search engines did a Bad Job at answering this question
The web as a graph To build an algorithm, we need a mathematical way to think of the internet. We'll use the idea of a graph: a set of vertices connected by edges. Some examples: [Figures: an undirected graph; a directed graph; just a bunch of points and lines (not a graph)]
The web as a graph So how do we turn the web into a graph? Which objects become the vertices? Which objects become the edges? Directed or undirected?
The web as a graph A page is a vertex, and each hyperlink is a directed edge! [Figure: pages with hyperlinks, and the graph representing those pages]
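One simple way to hold such a graph in a program is a dictionary from each page to the pages it links to. The sketch below uses a made-up three-page web; the helper names `edges` and `backlinks` are hypothetical.

```python
# Hypothetical mini-web: each page (vertex) maps to the pages it links to.
# The entry "A": ["B", "C"] means page A has hyperlinks to B and to C.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}


def edges(links):
    """List every hyperlink as a directed edge (source, target)."""
    return [(src, dst) for src, targets in links.items() for dst in targets]


def backlinks(links, page):
    """Pages that link *to* the given page (its incoming edges)."""
    return sorted(src for src, targets in links.items() if page in targets)


print(edges(links))
print(backlinks(links, "C"))
```

Direction matters here: ("A", "B") is an edge, but ("B", "A") is not, because B has no hyperlink back to A.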
Idea One: Links from good pages lead to other good pages
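Idea One: Links from good pages lead to other good pages How can we turn this into an equation to solve? Let $R_i$ = the rank or number of coolness points for page $i$, and let $N_j$ = the number of links out of page $j$. Then, if pages $j_1, j_2, \dots, j_k$ are the ones that link to page $i$, we want: $R_i = \frac{R_{j_1}}{N_{j_1}} + \frac{R_{j_2}}{N_{j_2}} + \cdots + \frac{R_{j_k}}{N_{j_k}}$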
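Idea One: Links from good pages lead to other good pages We can write this in summation notation: $R_i = \sum_{j \to i} \frac{R_j}{N_j}$, where the sum runs over all pages $j$ that link to page $i$, and $N_j$ is the number of links out of page $j$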
Idea One: Links from good pages lead to other good pages Difficulty: all the ranks $R_i$ depend on each other. How to solve for all of them at once??
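To see the circularity concretely, here is Idea One written out for the three-page web used later in this talk, where A links to B and C, B links to C, and C links to A. It assumes each page splits its rank evenly among its outgoing links, which is the convention the incidence matrix later in the talk follows.

```latex
\begin{aligned}
R_A &= R_C \\
R_B &= \tfrac{1}{2} R_A \\
R_C &= \tfrac{1}{2} R_A + R_B
\end{aligned}
```

Every rank shows up on the right-hand side of some other page's equation, so there is no page whose rank can be computed first.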
Idea Two: The drunken web surfer
Compare the ideas The drunken web surfer is an easy algorithm, but how good is the answer? Idea One (good links → good pages) seems to give a better answer, but may be harder to turn into an algorithm. Best of both worlds? Key of PageRank: they give the same answer!
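To show how easy the drunken surfer really is, here is a minimal simulation sketch. It uses the three-page example from later in the talk (A links to B and C, B links to C, C links to A); the function name `drunken_surf`, the seed, and the number of clicks are arbitrary choices for illustration.

```python
import random

# Example web: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}


def drunken_surf(links, start, clicks, rng):
    """Follow `clicks` random links from `start`; count visits to each page."""
    visits = {page: 0 for page in links}
    page = start
    for _ in range(clicks):
        page = rng.choice(links[page])  # click one outgoing link at random
        visits[page] += 1
    return visits


rng = random.Random(0)  # fixed seed so the run is repeatable
clicks = 100_000
visits = drunken_surf(links, "A", clicks, rng)
for page in sorted(visits):
    print(page, round(visits[page] / clicks, 3))
```

For this particular web the surfer ends up spending roughly 40% of the time on A, 20% on B, and 40% on C, no matter which page the walk starts from.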
Idea of a (weighted) incidence matrix How to write a drunken surfer algorithm? We'll define a matrix $A$ based on our graph. Define each term $a_{ij}$ in the matrix, so $a_{ij}$ represents the entry in row $i$ and column $j$
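Idea of a (weighted) incidence matrix Define each term $a_{ij}$ in the matrix: $a_{ij} = 1/N_j$ if there is a link from page $j$ to page $i$, and $a_{ij} = 0$ otherwise, where $N_j$ = the number of links out of page $j$ (we'll see why this makes sense in a moment)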
Idea of a (weighted) incidence matrix An example: [Figure: the internet (circa ~1975), three pages A, B, and C, next to its graph version: A links to B and C, B links to C, C links to A]
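Idea of a (weighted) incidence matrix An example: our graph and the corresponding matrix, with columns labeled "from" and rows labeled "to" (both in the order A, B, C): $A = \begin{pmatrix} 0 & 0 & 1 \\ \tfrac{1}{2} & 0 & 0 \\ \tfrac{1}{2} & 1 & 0 \end{pmatrix}$ Notice that each column adds up to exactly 1 here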
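Building this matrix from the graph can be sketched as follows. The function name `incidence_matrix` is hypothetical, exact fractions are used so the column sums come out as exactly 1, and the sketch assumes every page has at least one outgoing link.

```python
from fractions import Fraction

# The example graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = sorted(links)  # fixed page order: A, B, C


def incidence_matrix(links, pages):
    """Entry [i][j] is 1/(links out of page j) if j links to i, else 0."""
    matrix = [[Fraction(0) for _ in pages] for _ in pages]
    for j, src in enumerate(pages):
        for dst in links[src]:
            i = pages.index(dst)
            matrix[i][j] = Fraction(1, len(links[src]))
    return matrix


A = incidence_matrix(links, pages)
column_sums = [sum(A[i][j] for i in range(len(pages))) for j in range(len(pages))]

for row in A:
    print([str(entry) for entry in row])
print([str(total) for total in column_sums])
```

Each column describes where the surfers on one page go next, which is why every column has to add up to 1.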
Simulating drunken surfers Suppose there are 2 drunken surfers at each of these three webpages. They click on a link at random. How many surfers (on average) do we now expect at each webpage? Before: A(2) B(2) C(2). After: A(?) B(?) C(?)
Simulating drunken surfers Everyone from B goes to C (so C gets 2). Everyone from C goes to A (so A gets 2). Half from A go to B (B gets 1), other half go to C (C gets 1). Before: A(2) B(2) C(2). After: A(2) B(1) C(3)
Simulating drunken surfers What happens at the next step? A(2) B(1) C(3) becomes A(3) B(1) C(2), and then A(3) B(1) C(2) becomes A(2) B(1.5) C(2.5)
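The bookkeeping in these steps can be checked with a short sketch that moves the expected number of surfers along each link. The function name `step` is hypothetical, and fractions are used so that values like 1.5 stay exact.

```python
from fractions import Fraction

# The example graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}


def step(surfers, links):
    """Expected surfers per page after everyone clicks one random link."""
    after = {page: Fraction(0) for page in links}
    for page, count in surfers.items():
        share = Fraction(count) / len(links[page])  # split evenly over links
        for target in links[page]:
            after[target] += share
    return after


surfers = {"A": Fraction(2), "B": Fraction(2), "C": Fraction(2)}
history = [surfers]
for _ in range(3):
    surfers = step(surfers, links)
    history.append(surfers)

for state in history:
    print({page: float(n) for page, n in state.items()})
```

The total number of surfers stays at 6 throughout; only where they are sitting changes.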
Simulating drunken surfers Can we write this process as an equation? Let x = vector with avg #surfers at each page at time 1, and y = vector with avg #surfers/page at time 2. Then: $y = Ax$, where $A$ is the incidence matrix
Oh, yea, matrix multiplication Review of what the equation means: (Let's take a look at a helpful webpage) oh yea, I think I learned that once...
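Simulating drunken surfers Why does this equation work? Row $i$ of $y = Ax$ says $y_i = \sum_j a_{ij} x_j$: from each page $j$, the fraction $a_{ij}$ of its $x_j$ surfers clicks through to page $i$. x = vector with avg #surfers at each page at time 1, y = vector with avg #surfers/page at time 2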
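Written as code, the matrix-vector product is one sum per row. This sketch types in the example matrix by hand (rows are "to" pages, columns are "from" pages, in the order A, B, C); `mat_vec` is a hypothetical helper name.

```python
from fractions import Fraction

half = Fraction(1, 2)
# Rows are "to" pages, columns are "from" pages, in the order A, B, C.
A = [
    [0,    0, 1],   # into A: everyone from C
    [half, 0, 0],   # into B: half of A's surfers
    [half, 1, 0],   # into C: half of A's surfers plus everyone from B
]


def mat_vec(matrix, vector):
    """Return y = matrix * vector, computing y_i = sum_j a_ij * x_j."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, vector)) for row in matrix]


x = [2, 2, 2]      # surfers at A, B, C at time 1
y = mat_vec(A, x)  # expected surfers at A, B, C at time 2
print([float(v) for v in y])
```

Applying `mat_vec` again to `y` gives the expected counts at time 3, and so on.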
Walking around with matrices Compare with our previous example. Graph version: x = A(2) B(2) C(2) becomes y = A(2) B(1) C(3). Matrix version: $y = Ax$
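Walking around with matrices Compare with our previous example. Matrix equation $y = Ax$: $\begin{pmatrix} 2 \\ 1 \\ 3 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 1 \\ \tfrac{1}{2} & 0 & 0 \\ \tfrac{1}{2} & 1 & 0 \end{pmatrix} \begin{pmatrix} 2 \\ 2 \\ 2 \end{pmatrix}$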
Walking around with matrices Q: So when do we stop? A: When each step becomes almost the same as it was before: the vector x becomes stable. Let's test that out! (using Matlab)
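The talk's demo used Matlab; the sketch below is a Python stand-in for the same experiment, not the original demo. The stopping rule (no entry changes by more than `tol`) and the names `iterate_until_stable`, `tol`, and `max_steps` are assumptions made for this illustration.

```python
def mat_vec(matrix, vector):
    """Return matrix * vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def iterate_until_stable(matrix, x, tol=1e-9, max_steps=10_000):
    """Repeat x <- matrix * x until no entry changes by more than `tol`."""
    for steps in range(1, max_steps + 1):
        y = mat_vec(matrix, x)
        if max(abs(y_i - x_i) for y_i, x_i in zip(y, x)) < tol:
            return y, steps
        x = y
    raise RuntimeError("did not converge within max_steps")


# Example matrix for pages A, B, C (columns = "from", rows = "to").
A = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
stable, steps = iterate_until_stable(A, [2.0, 2.0, 2.0])
print([round(v, 4) for v in stable], "after", steps, "steps")
```

Starting from 2 surfers per page, the counts settle at about 2.4, 1.2, and 2.4 for A, B, and C.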
Walking around with matrices What is the mathematical meaning of this convergence? The process $y = Ax$ has converged when $y \approx x$. Let's rename the distribution vector (used to be called x or y) as R for rank
The meaning of convergence Intuitively: the number of drunken surfers at each page, on average, stays the same. Mathematically: $y = Ax$ becomes $R = AR$, which is the same as $R_i = \sum_j a_{ij} R_j$ for every page $i$
The meaning of convergence In other words: Vector R, the convergence point (the limit) of this random drunken walk on the graph (calculated with matrices), is the same answer for both Idea One (good pages link to good pages) and Idea Two (random clicks eventually lead to good pages)
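For the three-page example this claim can be checked exactly. The sketch below takes the vector R = (2.4, 1.2, 2.4) that 6 surfers settle into and tests it against both ideas; the variable names are just for illustration.

```python
from fractions import Fraction

half = Fraction(1, 2)
# Example matrix for pages A, B, C (columns = "from", rows = "to").
A = [
    [0,    0, 1],
    [half, 0, 0],
    [half, 1, 0],
]
# Candidate rank vector for A, B, C: 2.4, 1.2, 2.4 written as exact fractions.
R = [Fraction(12, 5), Fraction(6, 5), Fraction(12, 5)]

# Idea Two: one more random click leaves the distribution unchanged (R = A R).
AR = [sum(a * r for a, r in zip(row, R)) for row in A]

# Idea One: each page's rank is the sum of the rank shares sent to it
# by the pages linking to it.
links = {0: [1, 2], 1: [2], 2: [0]}  # A -> B, C;  B -> C;  C -> A
idea_one = [Fraction(0)] * 3
for src, targets in links.items():
    for dst in targets:
        idea_one[dst] += R[src] / len(targets)

print(AR == R, idea_one == R)
```

Both checks pass for the same R, which is the "same answer" the slide is talking about.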
A note on the Page, Brin, Motwani, Winograd paper They think of R as a probability distribution (percentages of total # of surfers) They also deal with a problem called a rank sink An example from the PBMW paper: the ranks add up exactly to 1, since it is thought of as a probability distribution
The Credits Several graphics and the main ideas are in these two papers: "The PageRank Citation Ranking: Bringing Order to the Web" by Larry Page, Sergey Brin, R. Motwani, and T. Winograd (1998), available by scholar.google search for PageRank; and "The Anatomy of a Large-Scale Hypertextual Web Search Engine" by Sergey Brin and Lawrence Page (1998), available by scholar.google search for Web Search Engine. This talk also has a webpage! cims.nyu.edu/~neylon/googlemath/
Thank you!...any questions / ideas? (there's a little more, if we have extra time...)
A potential problem! An inescapable cycle of hyperlinks is called a rank sink. It artificially increases these pages' rank
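A small experiment shows what a rank sink does to the drunken surfers. The four-page web below is made up for this sketch: C and D link only to each other, and B is the only way in.

```python
# Hypothetical web with a rank sink.
links = {
    "A": ["B"],
    "B": ["A", "C"],  # the only way into the cycle
    "C": ["D"],
    "D": ["C"],       # C and D link only to each other: no way out
}


def step(surfers, links):
    """Expected surfers per page after everyone clicks one random link."""
    after = {page: 0.0 for page in links}
    for page, count in surfers.items():
        for target in links[page]:
            after[target] += count / len(links[page])
    return after


surfers = {page: 1.0 for page in links}  # start with one surfer per page
for _ in range(100):
    surfers = step(surfers, links)

outside = surfers["A"] + surfers["B"]
inside = surfers["C"] + surfers["D"]
print(round(outside, 6), round(inside, 6))
```

After enough clicks essentially all 4 surfers are stuck in the C-D cycle, so A and B end up with almost no rank even though they are linked to.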
Addressing rank sinks Intuitive idea: at any point, the drunken surfer may jump to a completely arbitrary other webpage, even without a hyperlink to it. Mathematically: we basically replace all zeros in the incidence matrix by a small value. But adjust columns to keep them adding up to 1!
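Addressing rank sinks Mathematically: we basically replace all zeros in the incidence matrix by a small value. Define $E$ = the $n \times n$ matrix whose entries are all $1/n$ (where $n$ is the number of pages) and $A' = (1-d)A + dE$, where d is a small number, such as d=0.1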
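One common way to carry out this fix is sketched below. It assumes the adjusted matrix has entries (1 - d) * a_ij + d / n, which replaces every zero by d / n and keeps each column adding up to 1; the exact formula on the original slide may have differed. The four-page web and the name `damp` are hypothetical.

```python
def damp(matrix, d):
    """Mix the matrix with uniform random jumps: (1 - d) * a_ij + d / n."""
    n = len(matrix)
    return [[(1 - d) * a + d / n for a in row] for row in matrix]


def mat_vec(matrix, vector):
    """Return matrix * vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


# Hypothetical web with a rank sink, pages in the order A, B, C, D
# (columns = "from", rows = "to"): A -> B;  B -> A, C;  C -> D;  D -> C.
A = [
    [0.0, 0.5, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
A_damped = damp(A, d=0.1)
column_sums = [sum(row[j] for row in A_damped) for j in range(4)]

R = [0.25, 0.25, 0.25, 0.25]  # start from a uniform probability distribution
for _ in range(200):
    R = mat_vec(A_damped, R)

print([round(s, 6) for s in column_sums])
print([round(r, 4) for r in R])
```

With the random jumps in place, C and D still collect most of the rank, but A and B keep a positive share instead of draining to zero.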