MAE 298, Lecture 9 April 30, Web search and decentralized search on small-worlds

Similar documents
ECS 289 / MAE 298, Lecture 9 April 29, Web search and decentralized search on small-worlds

ECS 253 / MAE 253, Lecture 8 April 21, Web search and decentralized search on small-world networks

ECS 253 / MAE 253, Lecture 10 May 3, Web search and decentralized search on small-world networks

CSE 190 Lecture 16. Data Mining and Predictive Analytics. Small-world phenomena

A brief history of Google

Information Retrieval. Lecture 11 - Link analysis

PageRank and related algorithms

beyond social networks

How to explore big networks? Question: Perform a random walk on G. What is the average node degree among visited nodes, if avg degree in G is 200?

The Structure of Information Networks. Jon Kleinberg. Cornell University

M.E.J. Newman: Models of the Small World

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Overlay (and P2P) Networks

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Degree Distribution: The case of Citation Networks

Link Analysis and Web Search

arxiv:cs/ v1 [cs.ir] 26 Apr 2002

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

COMP5331: Knowledge Discovery and Data Mining

Lecture 8: Linkage algorithms and web search

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

The Role of Homophily and Popularity in Informed Decentralized Search

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

Motivation. Motivation

Data mining --- mining graphs

Lecture 17 November 7

COMP Page Rank

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies

TELCOM2125: Network Science and Analysis

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

How to organize the Web?

CPSC 532L Project Development and Axiomatization of a Ranking System

CSE 158 Lecture 11. Web Mining and Recommender Systems. Triadic closure; strong & weak ties

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Volume 2, Issue 11, November 2014 International Journal of Advance Research in Computer Science and Management Studies

γ : constant Goett 2 P(k) = k γ k : degree

Small-World Models and Network Growth Models. Anastassia Semjonova Roman Tekhov

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS

Network Thinking. Complexity: A Guided Tour, Chapters 15-16

Big Data Analytics CSCI 4030

Examples of Complex Networks

World Wide Web has specific challenges and opportunities

Extracting Information from Complex Networks

Big Data Analytics CSCI 4030

Link Analysis. Hongning Wang

On Finding Power Method in Spreading Activation Search

COMP 4601 Hubs and Authorities

Part 1: Link Analysis & Page Rank

CSE 255 Lecture 13. Data Mining and Predictive Analytics. Triadic closure; strong & weak ties

Information Retrieval and Web Search Engines

CSE 158 Lecture 13. Web Mining and Recommender Systems. Triadic closure; strong & weak ties

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Summary: What We Have Learned So Far

Social, Information, and Routing Networks: Models, Algorithms, and Strategic Behavior

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Math 443/543 Graph Theory Notes 10: Small world phenomenon and decentralized search

Big Data Analytics CSCI 4030

Introduction To Graphs and Networks. Fall 2013 Carola Wenk

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Recent Researches on Web Page Ranking

Fast Iterative Solvers for Markov Chains, with Application to Google's PageRank. Hans De Sterck

An Exploratory Journey Into Network Analysis A Gentle Introduction to Network Science and Graph Visualization

Complex Networks. Structure and Dynamics

CS 322: (Social and Information) Network Analysis Jure Leskovec Stanford University

Structural Analysis of Paper Citation and Co-Authorship Networks using Network Analysis Techniques

The application of Randomized HITS algorithm in the fund trading network

Searching the Web [Arasu 01]

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE

Social Network Analysis

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Lec 8: Adaptive Information Retrieval 2

Agenda. Math Google PageRank algorithm. 2 Developing a formula for ranking web pages. 3 Interpretation. 4 Computing the score of each page

Introduction to Data Mining

Information Networks: PageRank

Pagerank Scoring. Imagine a browser doing a random walk on web pages:

A Generating Function Approach to Analyze Random Graphs

Slides based on those in:

Structure of Social Networks

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)

Small-world networks

Topic mash II: assortativity, resilience, link prediction CS224W

An Improved Computation of the PageRank Algorithm 1

CSCI5070 Advanced Topics in Social Computing

Learning to Rank Networked Entities

Social and Technological Network Data Analytics. Lecture 5: Structure of the Web, Search and Power Laws. Prof Cecilia Mascolo

Nick Hamilton Institute for Molecular Bioscience. Essential Graph Theory for Biologists. Image: Matt Moores, The Visible Cell

Introduction to Data Mining

CS-C Data Science Chapter 9: Searching for relevant pages on the Web: Random walks on the Web. Jaakko Hollmén, Department of Computer Science

The Complex Network Phenomena. and Their Origin

CS6200 Information Retreival. The WebGraph. July 13, 2015

Switching for a Small World. Vilhelm Verendel. Master s Thesis in Complex Adaptive Systems

Jordan Boyd-Graber University of Maryland. Thursday, March 3, 2011

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

Navigation in Networks. Networked Life NETS 112 Fall 2017 Prof. Michael Kearns

PageRank. CS16: Introduction to Data Structures & Algorithms Spring 2018

Transcription:

MAE 298, Lecture 9 April 30, 2007 Web search and decentralized search on small-worlds

Search for information Assume some resource of interest is stored at the vertices of a network: Web pages Files in a file-sharing network Would like to determine rapidly where in the network a particular item of interest can be found.

To warehouse data or search on demand? Centralized : Catalogue data in one central place. Makes most sense when high cost to search network in real time. Requires resources for learning the data and storing it. Decentralized : Data is spread out in a distributed data base. Can be a very slow process to search. But dependent on network topology may be able to devise quick algorithms.

Web search Centralized warehousing of information (need results as quickly as possible) Key: Use information contained in the edges as well as the vertices! (Assumes edges contain information about relevance). Process: Query arrives, select subset of pages which match, order that subset by ranking based on link structure.

Typical web domain M. E. J. Newman

Ranking pages in a connected component Each site starts with unit rank (i.e., weight). Transfers fraction of this rank equally to each connected site. So the rank of vertex i, r i : r i j A ij r j, where A is the adjacency matrix. Using the random walk analogue, the occupancy probabilities in steady-state (i.e., the vector corresponding the λ = 1): r = M r, where M is state transition matrix.

The Web as a whole (This is an old picture, circa 2000, currently Google 8 billion pages served..., Yahoo claims 20 billion... Is there really that much valid content out there?)

Disconnected components We understand how to deal with each components. How do we deal with getting a consistent rank across the whole web?

The Random Surfer model [Brin and Page, The anatomy of a large-scale hypertextual Web search engine, Computer Networks, 30 (1998)] [L. Page, S. Brin, R. Motwani, and T. Winograd, The Pagerank Citation Ranking: Bringing Order to the web, technical report, Stanford University, Stanford, CA, 1998. ] With probability ɛ follow an out-link of current page. With probability [1 ɛ] jump at random to some other web page. (Usually assume jump is random, so land at any site with prob 1/N).

The Random Surfer model The weight of a page j is the sum over all the in-links pointing to it, including those gained by the random jump: r j = i j {r i ɛ(i) + [1 ɛ(i)]j ij }. (Note, this also makes the matrix positive, recall Perron-Frobenius). Rules of thumb: ɛ(i) = ɛ J ij = 1/N ɛ = 0.8.

Real-world complications: Page Rank Newer pages have less rank, even though they may be extremely relevant. Calculate eigenvalues of 10 9 by 10 9 matrix!!! spam, spam, spam Search engine optimizers (reverse engineer search engines) Link farms (both invisible and visible) Selling highly ranked domain names delisting web sites

Real-world complications in general Stop words typically removed: and, of, the to, be, or. So how to handle query to be or not to be? Dealing with complications of multiple languages.

Alternate approaches topology based [Kleinberg and Lawrence, The structure of the Web, Science, 294 (2001)] [Kleinberg, Authoritative sources in a hyperlinked environment, J. ACM, 46 (1999)] Slightly more sophisticated. Kleinberg proposes to use in-links and out-links. Google assumes a page is important if other important pages point to it. Kleinberg identifies two kinds of importance: authorities hubs and

Hubs and authorities A page pointed to by highly ranked pages in an authority A page that points to highly ranked pages is a hub. (May not contain the information, but will tell you where to find it). In use: Teoma search engine (http://search.ask.com/) Citeseer literature search engine.

Usage/click based. Alternate approaches Negative edge weights: (Penalize spam linking to you.) Genetic algorithms (Multiple agents searching simultaneously. The least-fit killed-off and the most fit duplicated. Relies on assumption that pages on a particular topic clustered together). [ Menczer and Belew, Adaptive retrieval agents: Internalizing local context and scaling up to the Web, Machine Learning, 39 (2000)] Clustering results by subject, e.g., http://clusty.com/...

Distributed search Some resource of interest is stored at the vertices of a network: i.e., Files in a file-sharing network Depth-first Breadth-first Search on arbitrary networks: O(N) Search on power-law random graphs Breadth-first passing always to highest-degree node possible. [Adamic, R. M. Lukose, A. R. Puniyani, and B. A. Huberman, Search in power-law networks, Phys. Rev. E, 64 2001], find between O(N 2/3 ) and O(N 1/2 ).

Decentralized search on small-world networks Consider a network with small diameter. Will this help with decentralized search? (i.e. Find a local algorithm with performance less than O(N)?)

What is a small-world [Watts and Strogatz, Collective dynamics of small-world networks, Nature, 393 (1998)] Start with regular 1D lattice, add links uniformly at random:

Watts-Strogatz small-world model Together with Barabasi-Albert launched the flurry of activity on networks. Watts and Strogatz showed that networks from both the natural and manmade world, such as the neural network of C. elegans and power grids, exhibit the small-world property. Originally they wanted to understand the synchronization of cricket chirps. Introduced the mathematical formalism, which interpolates between lattices and networks.

A new paradigm I think I ve been contacted by someone from just about every field outside of English literature. I ve had letters from mathematicians, physicists, biochemists, neurophysiologists, epidemiologists, economists, sociologists; from people in marketing, information systems, civil engineering, and from a business enterprise that uses the concept of the small world for networking purposes on the Internet. Duncan Watts

Navigation Clearly if central coordination, can use short paths to deliver info quickly. But, can someone living in a small world actually make use of this info and do efficient decentralized routing? Instead of designing search algorithms, given a local greedy algorithm, are there any topologies that enable O(log N) delivery times?

Precise topologies required [J. M. Kleinberg, Navigation in a small world, Nature, 406 (2000)] Start with a regular 2D square lattice (consider vertices and edges). Add random long links, with bias proportional to distance between two nodes: p(e ij ) 1/d α ij

Find mean delivery time t N β, with β > 1 unless α = 2. Only for α = 2 will decentralized routing work, and packet can go from source to destination in O(log N) steps. For d-dimensional lattice need α = d. But we know greedy decentralized routing works for human networks (c.f. Milgram s experiments six-degrees of separation [ S. Milgram, The small world problem, Psych. Today, 2, 1967.] So how do we get beyond a lattice model?

Navigating social networks [Watts, P. S. Dodds, and M. E. J. Newman, Identity and search in social networks, Science, 296 (2002)] [Kleinberg, Small world phenomena and the dynamics of information, in Proceedings of NIPS 2001]. Premise: people navigate social networks by looking for common features between there acquaintances and the targets (occupation, city inhabited, age,...) Brings in DATA!

Hierarchical social distance tree Individuals are grouped into categories along many attributes. One tree for each attribute. Trees are not the network, but complementary mental constructs believed to be at work. Assume likelyhood of acquaintance falls of exponentially with social distance.

Summary Web search Decentralized search Next time: Epidemiology on networks. (Flow of diseases).