A brief history of Google

The math behind Google. Sat 25 March 2006

A brief history of Google 1995-7: The Stanford days (aka Backrub(!?)). 1998: Yahoo! wouldn't buy (but they might invest...). 1999: Finally out of beta! (Pictured: Sergey Brin, Larry Page.)

The beta days

The big picture Q: What happens when you type in a search query? A: Thousands of trained monkeys type the results very very quickly (?) Query: "How to multiply matrices". "That's easy!"

The big picture Well, it's a little more complicated... Your words -> word IDs (lexicon) -> looked up in reverse (inverted) index -> intersection -> sort by relevance -> report back to you!

The big picture Each word in the query is converted to a corresponding word ID by the lexicon. Each word ID is mapped to a list of docIDs (a docID is a unique number associated with each web page). Take the intersection of these lists. Often there are many, many results, but the user only cares about one or two! So put the best ones at the top! (But how?)
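The lookup steps above can be sketched in a few lines of Python. This is only a toy illustration: the `lexicon` and `index` dictionaries, their contents, and the function name `matching_docs` are all made up for the example, and a real search engine's data structures are far more elaborate.

```python
# Toy lexicon: word -> word ID (hypothetical data for illustration).
lexicon = {"how": 0, "to": 1, "multiply": 2, "matrices": 3}

# Toy reverse index: word ID -> set of docIDs of pages containing that word.
index = {
    0: {10, 11, 12, 13},
    1: {10, 11, 13},
    2: {11, 13, 14},
    3: {11, 12, 13},
}


def matching_docs(query):
    """Return the docIDs of pages that contain every word of the query."""
    doc_lists = []
    for word in query.lower().split():
        if word not in lexicon:
            return set()  # an unknown word matches no page
        doc_lists.append(index[lexicon[word]])
    # Intersect the per-word lists; an empty query matches nothing.
    return set.intersection(*doc_lists) if doc_lists else set()


results = matching_docs("How to multiply matrices")
print(sorted(results))
```

The sketch stops at the intersection. Deciding which of the matching pages goes on top is the ranking question that the rest of the talk addresses.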

The PageRank Problem But how does a computer program (and/or a trained monkey) know which page out of thousands is the best? Algorithm needs to be: as objective as possible, and hard to hack by advertisers! Before Google, most search engines did a Bad Job at answering this question

The web as a graph To build an algorithm, we need a mathematical way to think of the internet. We'll use the idea of a graph: a set of vertices connected by edges. Some examples (pictured): an undirected graph; a directed graph; just a bunch of points and lines (not a graph).

The web as a graph So how do we turn the web into a graph? Which objects become the vertices? Which objects become the edges? Directed or undirected?

The web as a graph A page is a vertex, and each hyperlink is a directed edge! (Figure: pages with hyperlinks, and the graph representing those pages.)

Idea One: Links from good pages lead to other good pages

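Idea One: Links from good pages lead to other good pages How can we turn this into an equation to solve? Let $R_i$ = the rank, or number of coolness points, for page $i$. Then we want: $R_i = \frac{R_{j_1}}{N_{j_1}} + \frac{R_{j_2}}{N_{j_2}} + \cdots$, where $j_1, j_2, \ldots$ are the pages that link to page $i$ and $N_j$ is the number of links out of page $j$.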

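Idea One: Links from good pages lead to other good pages We can write this in summation notation: $R_i = \sum_{j \to i} \frac{R_j}{N_j}$, summing over the pages $j$ that link to page $i$, where $N_j$ is the number of links out of page $j$.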

Idea One: Links from good pages lead to other good pages Difficulty: all the ranks $R_i$ depend on each other. How to solve for all of them at once?

Idea Two: The drunken web surfer

Compare the ideas The drunken web surfer is an easy algorithm, but how good is the answer? Idea One (good links -> good pages) seems to give a better answer, although it may be harder to write as an algorithm. Best of both worlds? The key to PageRank: they give the same answer!

Idea of a (weighted) incidence matrix How to write a drunken surfer algorithm? We'll define a matrix $A$ based on our graph. Define each term $a_{ij}$ in the matrix, so $a_{ij}$ represents the entry in row $i$ and column $j$.

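Idea of a (weighted) incidence matrix Define each term $a_{ij}$ in the matrix: $a_{ij} = 1/N_j$ if page $j$ links to page $i$, and $a_{ij} = 0$ otherwise, where $N_j$ is the number of links out of page $j$ (we'll see why this makes sense in a moment).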

Idea of a (weighted) incidence matrix An example: (Figure: the internet, circa ~1975, with three pages A, B, and C, next to its graph version.)

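Idea of a (weighted) incidence matrix An example: our graph has the links A -> B, A -> C, B -> C, and C -> A. The corresponding matrix, with columns "from A, B, C" and rows "to A, B, C", is $\begin{pmatrix} 0 & 0 & 1 \\ 1/2 & 0 & 0 \\ 1/2 & 1 & 0 \end{pmatrix}$. Notice that each column adds up to exactly 1 here.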
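Building this matrix from a list of links can be sketched as follows. It assumes the rule the example follows (a page with N outgoing links sends 1/N of its surfers along each link), that no page lists the same target twice, and the hypothetical names `links` and `incidence_matrix`, which are not from the talk.

```python
from fractions import Fraction

# The example graph: each page maps to the pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = sorted(links)  # fixes the row/column order: A, B, C


def incidence_matrix(links, pages):
    """Entry [i][j] is 1/N_j if page j links to page i, else 0."""
    position = {page: k for k, page in enumerate(pages)}
    n = len(pages)
    matrix = [[Fraction(0)] * n for _ in range(n)]
    for source, targets in links.items():
        j = position[source]          # column: the page we come from
        for target in targets:
            i = position[target]      # row: the page we go to
            matrix[i][j] = Fraction(1, len(targets))
    return matrix


A = incidence_matrix(links, pages)
for row in A:
    print([str(entry) for entry in row])

# Each column should add up to exactly 1.
column_sums = [sum(A[i][j] for i in range(len(pages))) for j in range(len(pages))]
print([str(total) for total in column_sums])
```

Exact fractions are used so that the column sums come out as exactly 1 rather than a rounded float. A page with no outgoing links would produce an all-zero column, which this sketch does not try to handle.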

Simulating drunken surfers Suppose there are 2 drunken surfers at each of these three webpages. They click on a link at random. How many surfers (on average) do we now expect at each webpage? Before: A(2), B(2), C(2). After: A(?), B(?), C(?).

Simulating drunken surfers Everyone from B goes to C (so C gets 2). Everyone from C goes to A (so A gets 2). Half from A go to B (B gets 1), the other half go to C (C gets 1). Before: A(2), B(2), C(2). After: A(2), B(1), C(3).

Simulating drunken surfers What happens at the next step? A(2), B(1), C(3) becomes A(3), B(1), C(2). Then A(3), B(1), C(2) becomes A(2), B(1.5), C(2.5).

Simulating drunken surfers Can we write this process as an equation? Let x = vector with avg # of surfers at each page at time 1, and y = vector with avg # of surfers at each page at time 2. Then: $y = Ax$, where $A$ is the incidence matrix.
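As a quick check, here is a minimal sketch of one click written as a matrix-vector product, using the three-page example. The helper name `step` is made up for this illustration, and the matrix is typed in by hand from the example (columns "from A, B, C", rows "to A, B, C").

```python
from fractions import Fraction

half = Fraction(1, 2)

# Columns are "from A, B, C"; rows are "to A, B, C".
A = [
    [0,    0, 1],
    [half, 0, 0],
    [half, 1, 0],
]


def step(matrix, x):
    """Return y = matrix * x: the expected surfers per page after one click."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in matrix]


x = [2, 2, 2]          # two surfers on each page at time 1
y = step(A, x)         # time 2
z = step(A, y)         # time 3
w = step(A, z)         # time 4

for name, vector in [("time 2", y), ("time 3", z), ("time 4", w)]:
    print(name, [str(v) for v in vector])
```

The three printed vectors reproduce the surfer counts worked out by hand on the earlier slides: (2, 1, 3), then (3, 1, 2), then (2, 3/2, 5/2).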

Oh, yea, matrix multiplication Review of what the equation means: (Let's take a look at a helpful webpage) oh yea, I think I learned that once...

Simulating drunken surfers Why does this equation, $y = Ax$, work? x = vector with avg # of surfers at each page at time 1; y = vector with avg # of surfers at each page at time 2.
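One way to see it, written out for the three-page example: take $a_{ij}$ to be the fraction of the surfers at page $j$ who click through to page $i$. Then the surfers arriving at page $i$ are collected from every page $j$, each contributing its share. The worked line below is an illustration for page C only, assuming the example's links (A -> B, A -> C, B -> C, C -> A) and 2 surfers on each page.

```latex
y_i = \sum_{j} a_{ij}\, x_j
\qquad\text{for example}\qquad
y_C = a_{CA}\, x_A + a_{CB}\, x_B + a_{CC}\, x_C
    = \tfrac{1}{2}\cdot 2 + 1\cdot 2 + 0\cdot 2 = 3
```

This matches the count from the graph version: half of A's surfers plus all of B's surfers end up at C.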

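Walking around with matrices Compare with our previous example. Graph version: x = A(2), B(2), C(2) becomes y = A(2), B(1), C(3). Matrix equation $y = Ax$: $\begin{pmatrix} 2 \\ 1 \\ 3 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 1 \\ 1/2 & 0 & 0 \\ 1/2 & 1 & 0 \end{pmatrix} \begin{pmatrix} 2 \\ 2 \\ 2 \end{pmatrix}$

Walking around with matrices Compare with our previous example. Matrix equation: $y = Ax$, with $x = (2, 2, 2)$ and $y = (2, 1, 3)$.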

Walking around with matrices Q: So when do we stop? A: When each step becomes almost the same as it was before. The vector x becomes stable Let's test that out! (using Matlab)
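The talk runs this test in Matlab. A rough Python equivalent, offered only as a sketch, might look like the following. The names `iterate_until_stable`, `tolerance`, and `max_steps` and their default values are assumptions chosen for the illustration.

```python
def step(matrix, x):
    """Return matrix * x: the expected surfers per page after one click."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in matrix]


def iterate_until_stable(matrix, x, tolerance=1e-10, max_steps=1000):
    """Apply x -> matrix * x until no entry changes by more than tolerance."""
    for steps in range(1, max_steps + 1):
        y = step(matrix, x)
        if max(abs(yi - xi) for yi, xi in zip(y, x)) < tolerance:
            return y, steps
        x = y
    raise RuntimeError("the vector did not settle within max_steps")


# The three-page example: columns "from A, B, C", rows "to A, B, C".
A = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]

R, steps = iterate_until_stable(A, [2.0, 2.0, 2.0])
print([round(r, 4) for r in R], "after", steps, "steps")
```

For this example the surfers settle at about 2.4 on A, 1.2 on B, and 2.4 on C. The total stays at 6 throughout because every column of the matrix adds up to 1.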

Walking around with matrices What is the mathematical meaning of this convergence? $y = Ax$ has converged when $y = x$. Let's rename the distribution vector (used to be called x or y) as $R$, for rank.

The meaning of convergence Intuitively: the number of drunken surfers at each page, on average, stays the same. Mathematically: $y = Ax$ becomes $R = AR$, which is the same as $R_i = \sum_j a_{ij} R_j$ for every page $i$.

The meaning of convergence In other words: Vector R, the convergence point (the limit) of this random drunken walk on the graph (calculated with matrices), is the same answer for both Idea One (good pages link to good pages) and Idea Two (random clicks eventually lead to good pages)
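To make the link between the two ideas concrete, here is a short derivation. It assumes the entries are $a_{ij} = 1/N_j$ when page $j$ links to page $i$ (with $N_j$ the number of links out of page $j$) and $0$ otherwise, which is the rule the three-page example follows.

```latex
R = AR
\;\Longleftrightarrow\;
R_i = \sum_{j} a_{ij}\, R_j
    = \sum_{j \to i} \frac{R_j}{N_j}
\quad\text{for every page } i
```

The left side is the drunken-surfer condition (one more click changes nothing) and the right side is Idea One (a page's rank is collected from the pages linking to it). For the three-page example, $R = (2.4,\ 1.2,\ 2.4)$ satisfies both: $R_A = R_C = 2.4$, $R_B = R_A/2 = 1.2$, and $R_C = R_A/2 + R_B = 2.4$.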

A note on the Page, Brin, Motwani, Winograd paper They think of R as a probability distribution (percentages of the total # of surfers). They also deal with a problem called a rank sink. An example from the PBMW paper: the ranks add up exactly to 1, since R is thought of as a probability distribution.

The Credits Several graphics and the main ideas are in these two papers: "The PageRank Citation Ranking: Bringing Order to the Web" by Larry Page, Sergey Brin, R. Motwani, and T. Winograd (1998), available by scholar.google search for PageRank; and "The Anatomy of a Large-Scale Hypertextual Web Search Engine" by Sergey Brin and Lawrence Page (1998), available by scholar.google search for Web Search Engine. This talk also has a webpage! cims.nyu.edu/~neylon/googlemath/

Thank you!...any questions / ideas? (there's a little more, if we have extra time...)

A potential problem! An inescapable cycle of hyperlinks is called a rank sink. It artificially increases the rank of the pages in the cycle.

Addressing rank sinks Intuitive idea: at any point, the drunken surfer may jump to a completely arbitrary other webpage, even without a hyperlink to it Mathematically: we basically replace all zeros in the incidence matrix by a small value But adjust columns to keep them adding up to 1!

Addressing rank sinks Mathematically: we basically replace all zeros in the incidence matrix by a small value Define and where d is a small number, such as d=0.1
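The slide's own definitions are not reproduced here, so the following is a sketch of one common way to carry out the idea, not necessarily the talk's exact formula: every entry $a_{ij}$ is replaced by $(1 - d)\,a_{ij} + d/n$, where $n$ is the number of pages. The zeros become the small value $d/n$ and each column still adds up to 1. The function name `add_random_jumps` and the three-page rank-sink graph (A -> B, B -> C, C -> B) are made up for this illustration.

```python
def add_random_jumps(matrix, d=0.1):
    """Blend every entry with a uniform random jump: (1 - d) * a_ij + d / n."""
    n = len(matrix)
    return [[(1 - d) * a + d / n for a in row] for row in matrix]


def step(matrix, x):
    """Return matrix * x: the expected share of surfers after one click."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in matrix]


# Hypothetical rank sink: A -> B, B -> C, C -> B (B and C trap every surfer).
# Columns are "from A, B, C"; rows are "to A, B, C".
A = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]

G = add_random_jumps(A, d=0.1)
column_sums = [sum(G[i][j] for i in range(3)) for j in range(3)]

# Start from an even split and let the surfers walk for a while.
R = [1 / 3, 1 / 3, 1 / 3]
for _ in range(500):
    R = step(G, R)

print("column sums:", [round(total, 6) for total in column_sums])
print("ranks:", [round(r, 4) for r in R])
```

Without the random jumps, the surfers in this graph bounce between B and C forever and page A ends up with nothing. With them, A keeps a small share and the ranks settle to a single answer.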

this is the last slide!

this is the slide after the last slide! you've gone too far!!!