PAGE RANK ON MAP-REDUCE PARADIGM
Group 24
Nagaraju Y Thulasi Ram Naidu P Dhanush Chalasani
Agenda
- Page Rank: introduction
- An example
- Page Rank in Map-reduce framework
- Dataset description
- Work flow
- Modules
- Experiments
- References
Page Rank
- We need an algorithm that efficiently ranks web pages by importance.
- The PageRank patent is assigned to Stanford University.
- Page rank as per Google: PageRank is a link analysis algorithm that assigns a numerical weighting to each element of a hyperlinked set of documents, with the purpose of measuring its relative importance within the set.
- Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important".
- Page Rank redefined: PageRank is a probability distribution used to represent the likelihood that a person who is just randomly clicking on links will arrive at any particular page.
Contd.
Consider:
- B(u) denotes the set of all the pages linking to u.
- L(v) denotes the number of outgoing links from page v.
Page Rank of a page u is
$$PR(u) = \sum_{v \in B(u)} \frac{PR(v)}{L(v)}$$
Damping factor: The PageRank theory holds that even an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability, at any step, that the person will continue is the damping factor d. The damping factor is commonly set to around 0.85.
New page rank of the page u, in the form given by Brin and Page, is
$$PR(u) = (1 - d) + d \sum_{v \in B(u)} \frac{PR(v)}{L(v)}$$
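The damped update can be sketched for a single page as follows. This is a minimal illustration, assuming the form PR(u) = (1 - d) + d * sum of PR(v)/L(v) over the pages v linking to u. The function name, the dictionary layout, and the three-page graph are hypothetical choices made for the sketch.

```python
def page_rank_update(u, in_links, out_degree, rank, d=0.85):
    """Return the damped PageRank of page u from the previous ranks."""
    # Each in-linking page v passes PR(v) / L(v) to u.
    incoming = sum(rank[v] / out_degree[v] for v in in_links.get(u, ()))
    # Assumed damped form: (1 - d) + d * sum.
    return (1.0 - d) + d * incoming


# Hypothetical three-page graph: A -> B, C; B -> A; C -> A, B.
in_links = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]}
out_degree = {"A": 2, "B": 1, "C": 2}
rank = {"A": 1.0, "B": 1.0, "C": 1.0}

new_a = page_rank_update("A", in_links, out_degree, rank)
```

Setting d=1.0 removes the damping term and reduces the update to the plain sum over in-links.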
An example:
[Figure: three pages. Page A links to B and C, Page B links to A, Page C links to A and B.]
PR(A)=PR(B)/1 + PR(C)/2
PR(B)=PR(A)/2 + PR(C)/2
PR(C)=PR(A)/2
Initial condition: PR(A)=1 PR(B)=1 PR(C)=1
Iteration 1 (from PR(A)=1, PR(B)=1, PR(C)=1):
PR(A)=1.5 PR(B)=1 PR(C)=0.5
Iteration 2 (from PR(A)=1.5, PR(B)=1, PR(C)=0.5):
PR(A)=1.25 PR(B)=1 PR(C)=0.75
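The two iterations of this example can be reproduced with a short undamped sketch. The graph is read off the three equations (A links to B and C, B links to A, C links to A and B); the function and variable names are illustrative only.

```python
def iterate(out_links, rank):
    """One undamped step: each page splits its rank evenly over its out-links."""
    new_rank = {page: 0.0 for page in out_links}
    for page, targets in out_links.items():
        share = rank[page] / len(targets)
        for target in targets:
            new_rank[target] += share
    return new_rank


out_links = {"A": ["B", "C"], "B": ["A"], "C": ["A", "B"]}
rank = {"A": 1.0, "B": 1.0, "C": 1.0}

history = []
for _ in range(2):
    rank = iterate(out_links, rank)
    history.append(dict(rank))
```

After the loop, `history[0]` holds the ranks after iteration 1 and `history[1]` those after iteration 2.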
Problems:
- The Internet is huge: Google has found over 1 trillion unique URLs.
- Assuming each URL takes 0.5 KB, we need about 500 TB just to store the links.
- Calculating page rank for all pages takes a long time.
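The storage estimate is simple arithmetic. The sketch below works it out in both decimal and binary units, assuming "0.5k" means either 500 or 512 bytes per URL; both readings are assumptions.

```python
urls = 10**12  # one trillion unique URLs

# Decimal reading: 500 bytes per URL, 1 TB = 10**12 bytes.
decimal_tb = urls * 500 / 10**12

# Binary reading: 512 bytes per URL, 1 TiB = 2**40 bytes.
binary_tib = urls * 512 / 2**40
```

Either reading lands well above 400 TB for the URL strings alone, before any link structure is stored.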
PR in map-reduce paradigm:
- We need a framework that allows page rank to be implemented in a distributed and highly scalable way.
- The steps are independent.
- The page rank of a page depends only on the previous page ranks of the pages linking to it (its in-links), so each page can be updated separately.
Dataset:
Datasets: Movie dataset and Genetic web pages, from http://www.cs.toronto.edu/~tsap/experiments/datasets/index.html
Data set: <link>: <outlinks>
22: 0 991 992 993 994 995 996 997 889 -1
29: 1169 1172 1183 1186 1202 -1
34: 1355 1358 -1
Preprocessing:
- Dangling pages (pages with no outlinks) are removed.
- The initial page rank of every page is set to 1.
Data set: <id> <initialpr> <outlinks>
8 1 534 535 536 537 538 539 540 541 542 543 -1
9 1 572 576 578 579 581 582 584 585 586 590 -1
10 1 597 598 602 603 -1
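A minimal sketch of this preprocessing step follows. It assumes that each input line ends with a separate `-1` token marking the end of the out-link list, and that a page counts as dangling when nothing precedes that token. The function name `preprocess` is hypothetical.

```python
def preprocess(lines, initial_rank=1):
    """Convert '<link>: <outlinks>' lines to '<id> <initialpr> <outlinks>' lines."""
    records = []
    for line in lines:
        page_id, _, rest = line.partition(":")
        # Assumption: a trailing -1 token terminates each out-link list.
        out_links = [tok for tok in rest.split() if tok != "-1"]
        if not out_links:
            continue  # dangling page (no out-links): drop it
        # Keep the -1 terminator so the output matches the sample format.
        records.append(f"{page_id.strip()} {initial_rank} {' '.join(out_links)} -1")
    return records


sample = ["34: 1355 1358 -1", "35: -1"]
cleaned = preprocess(sample)
```

Here page 35 is a made-up dangling page added to show the filter; it is not taken from the dataset.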
High level Work flow:
1. Module 1: Calculate page rank.
2. Module 2: Calculate outlinks.
3. Iter < 15? Yes: go back to Module 1. No: continue to Module 3.
4. Module 3: Add dangling links. Sort results.
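The control flow above can be sketched as a small driver. The three modules are passed in as plain functions, which stand in for the actual map-reduce jobs; the name `run_pipeline` and the reading of "Iter < 15" as exactly 15 passes are assumptions.

```python
def run_pipeline(state, module1, module2, module3, max_iters=15):
    """Alternate Module 1 and Module 2 for max_iters passes, then run Module 3."""
    for _ in range(max_iters):
        state = module1(state)  # calculate page rank
        state = module2(state)  # rebuild out-links for the next pass
    return module3(state)       # add dangling links and sort


# Stub modules that only count passes, to show the loop shape.
result = run_pipeline(
    0,
    module1=lambda s: s + 1,
    module2=lambda s: s,
    module3=lambda s: ("done", s),
)
```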
Module 1:
Map:
Start with the initial pagerank and outlinks of a document.
- Input:
  key: 1
  value: <pagerank> 2 3
- Output:
  key: 2
  value: 1 <pagerank> <2>
  value: 3 <pagerank> <5>
For each outlink, the output is the docid of the inlink, its PageRank, and its total number of outlinks.
Reduce:
Now the reducer has a document id, all the inlinks to that document, and their corresponding PageRanks and numbers of outlinks.
- Input:
  key: 2
  value: 1 pagerank 2
  value: 3 pagerank 5
  value: ...
- Output:
  key: 2
  value: <new pagerank> 2 1 3...
The reducer computes the new PageRank. The key is the url id and the value is its rank and set of inlinks.
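Module 1 can be simulated locally as below. This is a sketch rather than the actual Hadoop job: a dictionary stands in for the shuffle phase, the damped form (1 - d) + d * sum is assumed, and the record layout and function names are illustrative. Any `-1` terminator is assumed to have been stripped from the values.

```python
from collections import defaultdict


def module1_map(doc_id, value):
    """value is '<pagerank> <outlink> <outlink> ...'."""
    rank, *out_links = value.split()
    for target in out_links:
        # Emit: target page -> (source id, source rank, source out-link count).
        yield target, (doc_id, float(rank), len(out_links))


def module1_reduce(doc_id, values, d=0.85):
    """Combine the contributions of all in-links into a new rank."""
    in_links = [src for src, _, _ in values]
    total = sum(rank / n_out for _, rank, n_out in values)
    new_rank = (1.0 - d) + d * total  # assumed damped form
    return doc_id, (new_rank, in_links)


def run_module1(records, d=0.85):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for doc_id, value in records.items():
        for key, val in module1_map(doc_id, value):
            grouped[key].append(val)
    return dict(module1_reduce(k, v, d) for k, v in grouped.items())


# Page 1 has 2 out-links and page 3 has 5; both link to page 2.
records = {"1": "1 2 3", "3": "1 2 4 5 6 7"}
result = run_module1(records, d=1.0)
```

With d=1.0, page 2 receives 1/2 from page 1 and 1/5 from page 3.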
Module 2:
Map:
Start with the pagerank and inlinks of a document.
- Input:
  key: 2
  value: <pagerank> 2 1 3...
- Output:
  key: 2
  value: 5 <pagerank>
  value: 2 <pagerank>
  value: 4 <pagerank>
For each inlink, the output is the docid of its outlink and its pagerank.
Reduce:
Now the reducer has a document id and all the outlinks from that document.
- Input:
  key: 2
  value: 5 <pagerank>
  value: 2 <pagerank>
  value: 4 <pagerank>
- Output:
  key: 2
  value: <pagerank> 4 5...
The output is the outlinks of a page. The key is the url id and the value is its rank and set of outlinks.
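One way to invert in-links back into out-links is sketched below. The slides do not spell out how the reducer learns each page's own rank, so this sketch assumes the mapper emits an extra tagged ("rank", value) record per page. The tags, function names, and `default_rank` placeholder are all assumptions.

```python
from collections import defaultdict


def module2_map(doc_id, rank, in_links):
    # Pass the page's own rank through so the reducer can attach it.
    yield doc_id, ("rank", rank)
    for src in in_links:
        # src links to doc_id, so doc_id is one of src's out-links.
        yield src, ("out", doc_id)


def module2_reduce(doc_id, values, default_rank=1.0):
    rank = default_rank  # placeholder for pages that had no in-links
    out_links = []
    for tag, payload in values:
        if tag == "rank":
            rank = payload
        else:
            out_links.append(payload)
    return doc_id, (rank, sorted(out_links))


def run_module2(records):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for doc_id, (rank, in_links) in records.items():
        for key, val in module2_map(doc_id, rank, in_links):
            grouped[key].append(val)
    return dict(module2_reduce(k, v) for k, v in grouped.items())


# Ranks and in-links for a three-page graph: A -> B, C; B -> A; C -> A, B.
records = {"A": (1.5, ["B", "C"]), "B": (1.0, ["A", "C"]), "C": (0.5, ["A"])}
result = run_module2(records)
```

The result maps each page to its rank and its out-links, which is the input shape Module 1 expects on the next pass.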
Module 3:
After convergence, add the dangling pages back, run one more iteration, and sort the URLs by their PageRank.
Map:
- Input:
  key: URL
  value: <rank> outlinks
- Output:
  key: rank
  value: URL
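The sorting step can be sketched by swapping key and value, as the mapper above does, and then ordering by the new key. In a real job the framework's shuffle would sort by key; here `sorted` stands in for it. Descending order and the function names are assumptions.

```python
def module3_map(url, value):
    """value is '<rank> <outlinks...>'; emit (rank, url) so rank becomes the key."""
    rank = float(value.split()[0])
    return rank, url


def sort_by_rank(records):
    pairs = [module3_map(url, value) for url, value in records.items()]
    # Highest rank first (assumed ordering).
    return sorted(pairs, reverse=True)


ranked = sort_by_rank({"A": "1.25 B C", "B": "1.0 A", "C": "0.75 A B"})
```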
Experiments Fig: Runtimes (in secs) Vs Number of iterations
References:
- The Anatomy of a Large-Scale Hypertextual Web Search Engine, by Sergey Brin and Lawrence Page
- http://www.cs.toronto.edu/~tsap/experiments/datasets/index.html
- The PageRank Citation Ranking: Bringing Order to the Web, by Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd
- http://www.webworkshop.net/pagerank.html
Thank you.