Distributed Pagerank for P2P Systems

Similar documents
Distributed Pagerank for P2P Systems

Pagerank Computation and Keyword Search on Distributed Systems and P2P Networks

COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION

CS 640 Introduction to Computer Networks. Today s lecture. What is P2P? Lecture30. Peer to peer applications

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval II. Aidan Hogan

Making Gnutella-like P2P Systems Scalable

Mining Web Data. Lijun Zhang

Overview Computer Networking Lecture 16: Delivering Content: Peer to Peer and CDNs Peter Steenkiste

CPSC 426/526. P2P Lookup Service. Ennan Zhai. Computer Science Department Yale University

Link Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld

Scalable overlay Networks

A Survey of Peer-to-Peer Content Distribution Technologies

Flooded Queries (Gnutella) Centralized Lookup (Napster) Routed Queries (Freenet, Chord, etc.) Overview N 2 N 1 N 3 N 4 N 8 N 9 N N 7 N 6 N 9

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing

Content Overlays. Nick Feamster CS 7260 March 12, 2007

University of Maryland. Tuesday, March 2, 2010

Advanced Computer Networks

CPSC 426/526. P2P Lookup Service. Ennan Zhai. Computer Science Department Yale University

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

Mining Web Data. Lijun Zhang

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Advanced Computer Architecture: A Google Search Engine

Distributed Knowledge Organization and Peer-to-Peer Networks

Calculating Web Page Authority Using the PageRank Algorithm. Math 45, Fall 2005 Levi Gill and Jacob Miles Prystowsky

Jordan Boyd-Graber University of Maryland. Thursday, March 3, 2011

Scalable P2P architectures

March 10, Distributed Hash-based Lookup. for Peer-to-Peer Systems. Sandeep Shelke Shrirang Shirodkar MTech I CSE

Distributed Web Crawling over DHTs. Boon Thau Loo, Owen Cooper, Sailesh Krishnamurthy CS294-4

Scale Free Network Growth By Ranking. Santo Fortunato, Alessandro Flammini, and Filippo Menczer

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

Web Structure Mining using Link Analysis Algorithms

Data-Intensive Computing with MapReduce

Scaling Problem Computer Networking. Lecture 23: Peer-Peer Systems. Fall P2P System. Why p2p?

An Overview of Search Engine. Hai-Yang Xu Dev Lead of Search Technology Center Microsoft Research Asia

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Peer-to-Peer Internet Applications: A Review

Administrative. Web crawlers. Web Crawlers and Link Analysis!

COMP Page Rank

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

COMP5331: Knowledge Discovery and Data Mining

Peer-peer and Application-level Networking. CS 218 Fall 2003

Fast Iterative Solvers for Markov Chains, with Application to Google's PageRank. Hans De Sterck

PageRank Algorithm Abstract: Keywords: I. Introduction II. Text Ranking Vs. Page Ranking

Information Leak in the Chord Lookup Protocol

An Algorithm to Reduce the Communication Traffic for Multi-Word Searches in a Distributed Hash Table

EECS 395/495 Lecture 5: Web Crawlers. Doug Downey

Query Processing and Alternative Search Structures. Indexing common words

On the Feasibility of Peer-to-Peer Web Indexing and Search

CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER

Lecture 21 P2P. Napster. Centralized Index. Napster. Gnutella. Peer-to-Peer Model March 16, Overview:

CS6200 Information Retreival. The WebGraph. July 13, 2015

Distributed computing: index building and use

Searching the Web [Arasu 01]

PAGE RANK ON MAP- REDUCE PARADIGM

Early Measurements of a Cluster-based Architecture for P2P Systems

DATA MINING - 1DL105, 1DL111

Architectures for Distributed Systems

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page

Telematics Chapter 9: Peer-to-Peer Networks

Graph Algorithms. Revised based on the slides by Ruoming Kent State

Adaptive methods for the computation of PageRank

Information Retrieval Spring Web retrieval

Peer-to-Peer (P2P) Distributed Storage. Dennis Kafura CS5204 Operating Systems

Information Networks: PageRank

Overlay and P2P Networks. Unstructured networks. Prof. Sasu Tarkoma

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

A Reordering for the PageRank problem

Peer-to-Peer Networks

Google Scale Data Management

Reputation Management in P2P Systems

How to organize the Web?

Overview of this week

INF5070 media storage and distribution systems. to-peer Systems 10/

Kademlia: A P2P Informa2on System Based on the XOR Metric

Simulations of Chord and Freenet Peer-to-Peer Networking Protocols Mid-Term Report

Searching for Shared Resources: DHT in General

Personalizing PageRank Based on Domain Profiles

Opportunistic Application Flows in Sensor-based Pervasive Environments

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Searching for Shared Resources: DHT in General

Telecommunication Services Engineering Lab. Roch H. Glitho

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search

CLOUD COMPUTING PROJECT. By: - Manish Motwani - Devendra Singh Parmar - Ashish Sharma

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable

CS6220: DATA MINING TECHNIQUES

Peer-to-Peer Protocols and Systems. TA: David Murray Spring /19/2006

A Scalable Parallel HITS Algorithm for Page Ranking

Reading Time: A Method for Improving the Ranking Scores of Web Pages

OnCrawl Metrics. What SEO indicators do we analyze for you? Dig into our board of metrics to find the one you are looking for.

Relevant?!? Algoritmi per IR. Goal of a Search Engine. Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Web Search

The Topic Specific Search Engine

Diploma Thesis Collaborative Data Processing on Mobile Handsets Jan Kettner. Examiner: Prof. Dr. Mesut Günes Tutor: Georg Wittenburg, M. Sc.

Spyglass: Fast, Scalable Metadata Search for Large-Scale Storage Systems

Web Analysis in 4 Easy Steps. Rosaria Silipo, Bernd Wiswedel and Tobias Kötter

Chapter 10: Peer-to-Peer Systems

Transcription:

Distributed Pagerank for P2P Systems Karthikeyan Sankarlingam, Simha Sethumadhavan, and James C. Browne The University of Texas at Austin Department of Computer Sciences 9/1/2005 1

Contributions Distributed computation of Pageranks based on asynchronous iteration Application in P2P systems Application on Internet scale Practical keyword search for P2P systems Very large scale asynchronous iteration computation 9/1/2005 2

Overview Motivation: Keyword search for P2P systems P2P system overview State of art in keyword search Approach and Solution Pageranks for P2P systems Distributed computation of pageranks Incremental retrieval of documents Performance results Distributed computation of pageranks on the Internet 9/1/2005 3

Peer to Peer (P2P) Systems P2P systems can be effective distributed storage systems Efficient retrieval Efficient search Retrieval Distributed Hash Tables: Chord, CAN, Pastry, Freenet Unstructured P2P systems: Gnutella, Morpheus, Kazaa Characteristics Distributed storage, no centralized server Peer-to-peer communication Dynamic effects peers enter and leave frequently 9/1/2005 4

P2P Systems Hash 0x8134 Key space (16 bit keys) 0x0000 0x1FFF 0x2000-0x3FFF 0x4000-0x5FFF 0x6000-0x7FFF 0x8000-0x9FFF 0xA000-0xBFFF 0xC000-0xDFFF 0xE000-0xFFFF Peers Distributed hash tables Routing 9/1/2005 5

P2P systems: Retrieval Key space 0x0000 0x1FFF 0x2000-0x3FFF 0x4000-0x5FFF 0x6000-0x7FFF 0x8000-0x9FFF 0xA000-0xBFFF 0xC000-0xDFFF 0xE000-0xFFFF Peers Fetch 0x8134 9/1/2005 6

P2P systems: Search State of the art Index based keyword search [Reynolds and Vahdat, Gnawali] Document vectors [Kronofol] Combinations based on these Problem Retrieval too many responses No easy way to estimate relevance 9/1/2005 7

Index based keyword search Centralized Index D0,D1 Keyword List of Doc Ids (keys) D2,D3 D4 tree oak D0,D1,D2,D3 D0,D1,D9 D5,D6,D7 spider D12,D11 D9 D8 D12 linux D12,D11 D10,D11 9/1/2005 8

Index based keyword search D0,D1 K8 Hashed Keyword List of Doc Ids (keys) D5,D6,D7 D2,D3 K0 D4 K0(tree) K1(oak) K2(spider) D0,D1,D2,D3 D0,D1,D9 D12,D11 D8 K2 D9 K1 D12 K8(linux) D12,D11 D10,D11 Hash, distribute and embed the index in P2P system! 9/1/2005 9

P2P Systems: Search State of art Index based keyword search [Reynolds and Vahdat, Gnawali] Document vectors [Kronofol] Combinations based on these Problem Retrieval too many responses No easy way to estimate relevance 9/1/2005 10

Solution Google s Pagerank! Apply Pagerank in a P2P environment Give every document in the P2P system a rank Use link structure Incremental retrieval based on pageranks 9/1/2005 11

Google s Computation of Pagerank Centralized solution Uses a centralized crawler updated every 4 weeks Computation farm solving a 3 billion order matrix problem Computation time of 6 to 7 days Methods to accelerate this have been proposed [Kamvar et. al] Challenge for a P2P implementation Files are distributed No crawler on P2P systems No centralized computation possible Peers keep entering and leaving 9/1/2005 12

Pagerank Assign a numeric rank to every page D Document link structure is the key A C E B Inlink F Outlink (with respect to C) 9/1/2005 13

Pagerank contd. 1.0 1.0 1.0 1.0 0.67 0.67 2.0 0.67 0.67 0.67 Every page contributes equally to all its outlinks Pagerank of a page = sum of inlink weights Web graph has backedges Pagerank is computed iteratively 0.67 Mathematical formulation: R = AR i+1 i 9/1/2005 14

Distributed Pagerank Compute pagerank locally at each peer node Send pagerank updates to linked documents (on other peers) A C D E Stop when each local pagerank converges B F 3 peers Documents (nodes) 9/1/2005 15

Why does this process work? Asynchronous Iterations Pagerank is an eigenvalue computation problem [Page et. al, Haveliwala] Link matrix is sparse and diagonally dominant Asychronous Iterations [Chazan & Miranker, and others] Peers act as simple state machines exchanging messages 9/1/2005 16

Integration with P2P systems Storage: Augment P2P system to store a rank for every document Computation: Peers must execute the distributed pagerank computation algorithm Communication: Pagerank update messages are routed based on linked document s key Caching: Optimization to save routed traffic Route first message using P2P layer Cache IP address for that key at sender Deliver subsequent messages point to point 9/1/2005 17

Dynamic systems Peer joins and leaves Use transport layer to detect if peer unavailable Buffer update messages if peer unavailable Periodically retry until peer comes back Document insertion and deletion New documents are initialized with a pagerank Deleted documents send pagerank update messages with negative pagerank Incremental and continuously updated pageranks 9/1/2005 18

Integration with P2P search DHT systems Augment index with a pagerank field Return results sorted by pagerank Nodes update index with pagerank when they converge Hashed Keyword K0(tree) K1(oak) K2(spider) List of Doc Ids (keys) D0{R0},D1{R1},D2{R2},D3{R3} D0{R0},D1{R1},D9{R9} D12{R12},D11{R11} FASD/Freenet systems Forward based on document closeness and pagerank K8(linux) {Rxx} - Pageranks D12{R12},D11{R11} 9/1/2005 19

Multi-word search tree 0 1 2 1000 1000 keys oak 0 1 2 1000 User pin 0 1 2 1000 500 keys Example Query: tree & oak & pin Many keys transferred, leading to high network traffic No ranking scheme 250 keys 9/1/2005 20

Incremental search tree 0 1 2 1000 200 keys oak 0 1 2 1000 User pin 0 1 2 1000 40 keys Example Query: tree & oak & pin Traffic reduction: Incremental forwarding Quality of hits: Relevance sorting 20 keys 9/1/2005 21

Results Modeling 10K, 100K, 500K, 5M document sets 500 peer network Simple network transfer model Power law distribution for link structure: # nodes with degree i α 1 / i k [Broder et. al] Evaluation parameters Convergence: How many passes? Quality of pagerank: Error relative to a centralized scheme Message traffic: Number of pagerank update messages Execution time and Scalability 9/1/2005 22

Results Convergence 1. Fast convergence: ~ 100 iterations 2. 99% of documents converge to within 1% in 10 iterations Quality of Pagerank Very high, Over 99% have very small errors, max error typically < 0.1% Message traffic 1. 30 msgs/doc for a 0.2 error threshold 2. 100 msgs/doc for a 10-6 error threshold 3. Msgs/doc. independent of # docs 4. Traffic grows logarithmically with error threshold 9/1/2005 23

Results Execution Time Dominated by network speed Error threshold Slow n/w(32 Kb/sec) Fast n/w (200 Kb/sec) 0.2 10-3 10-6 33.7 hrs 87.9 hrs 117 hrs 5.4 hrs 14.1 hrs 18.7 hrs Scalability 1. Convergence, quality and messages/doc independent of size 2. Execution times grows logarithmically with size 9/1/2005 24

Results: Incremental Search We built our own document set 2-word and 3-word queries synthesized using frequent terms 10X reduction in network traffic for 2- word queries 6X reduction in network traffic for 3- word queries 9/1/2005 25

Conclusions Distributed computation of Google Pagerank First document ranking scheme for P2P networks Major benefits for keyword search Performance and Scalability demonstrated for P2P systems 9/1/2005 26

P2P Internet search engine? P2P computation of Pagerank of Internet documents Web servers acts as peers, exchange messages and compute pagerank Pagerank becomes a free public commoditiy Will this work? With a T3 link between web space providers, 3 billion node graph can be computed in 35 days. No re-crawls required! Document inserts and deletes are automatically handled How to build a distributed Internet scale keyword index? Web server implementation? 9/1/2005 27

Future Work Implement Pagerank on a P2P system Use link structure to map documents Peer-to-peer chaotic iterations solutions should work in other domains Explore Internet scale application 9/1/2005 28

Questions 9/1/2005 29

Distributed Pagerank A C D E Time = 0 Set A,B = 1.0 Set C,D,E = 1.0 Set F = 1.0 B F Send updates: From A, B to C From C to F 9/1/2005 30

Distributed Pagerank A C D E Time = 5 C,F receive updates Recompute page ranks Send updates: From C to F B F Time = 17 F receives updates Recompute page ranks No more updates. STOP. 9/1/2005 31

Integration with P2P search DHT systems like Chord, CAN, Pastry Augment index with a pagerank field Return results sorted by pagerank FASD/Freenet systems Forward based on document closeness and pagerank Keyword0 Keyword1 Keyword2 Keys Pagerank 9/1/2005 32

Convergence 1. Fast convergence: ~ 100 iterations 2. 99% of documents converge to within 1% in 10 iterations Quality of Pagerank Very high, Over 99% have very small errors, max error typically < 0.1% Message traffic 1. Low: 30 (0.2 error) to 100 (10-6 error) msgs/doc 2. Msgs/doc. independent of # docs 3. Traffic grows logarithmically with error threshold Execution Time 1. Low: 14 to 90 hrs based on n/w speed 2. Dominated by network speed Scalability 1. Convergence, quality and messages/doc independent of size 2. Execution times grows logarithmically with size 9/1/2005 33