Social and Technological Network Analysis. Lecture 5: Web Search and Random Walks. Dr. Cecilia Mascolo

In This Lecture
We describe the concept of search in a network.
We describe powerful techniques to enhance these searches.

Search
When searching for "Computer Laboratory" on Google, the first link is the department's page. How does Google know this is the best answer?
Information retrieval problems: synonyms (jump/leap), polysemy (Leopard), etc.
Now with the web: diversity in authoring makes it hard to find common criteria for ranking documents.
The web offers an abundance of information: whom do we trust as a source?
Still one issue: static content versus real time, e.g. a "World Trade Center" query on 11/9/01.
Twitter helps solve these issues these days.

Automate the Search
We could collect a large sample of pages relevant to "computer laboratory" and collect their votes through their links.
The pages receiving more in-links are ranked first.
But if we use the network structure more deeply we can improve results.

Authorities
Example: query "newspaper".
Links are seen as votes.
Authorities are established: the highly endorsed pages.

A Refinement: Hubs
Numbers are reported back on the source page and aggregated.
Hubs are high-value lists.

Principle of Repeated Improvement
And we are now reweighting the authorities.
When do we stop?

Repeating and Normalizing
The process can be repeated.
Normalization: each authority score is divided by the sum of all authority scores; each hub score is divided by the sum of all hub scores.

More Formally: does the process converge?
Each page $i$ has an authority score $a_i$ and a hub score $h_i$.
Initially $a_i = h_i = 1$.
At each step: $a_i = \sum_{j \to i} h_j$ and $h_i = \sum_{i \to j} a_j$.
Normalize: $\sum_i a_i = 1$ and $\sum_j h_j = 1$.
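The hub and authority updates can be sketched in a few lines of Python. This is a minimal sketch under assumptions, not the lecture's own code: the function name `hits`, the dictionary representation of links, and the example pages are all hypothetical, and the authority update is applied before the hub update within each step.

```python
def hits(links, steps):
    """Run `steps` rounds of hub/authority updates with normalization.

    `links` maps each page to the list of pages it points to
    (a hypothetical representation chosen for this sketch).
    """
    pages = sorted(set(links) | {p for targets in links.values() for p in targets})
    auth = {p: 1.0 for p in pages}  # a_i = 1 initially
    hub = {p: 1.0 for p in pages}   # h_i = 1 initially

    for _ in range(steps):
        # Authority update: a_i is the sum of h_j over pages j linking to i.
        new_auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                new_auth[target] += hub[source]

        # Hub update: h_i is the sum of a_j over pages j that i links to.
        new_hub = {p: sum(new_auth[t] for t in links.get(p, [])) for p in pages}

        # Normalize so that each set of scores sums to 1.
        auth_total = sum(new_auth.values())
        hub_total = sum(new_hub.values())
        auth = {p: score / auth_total for p, score in new_auth.items()}
        hub = {p: score / hub_total for p, score in new_hub.items()}

    return auth, hub


# Hypothetical example: three list pages pointing at three candidate pages.
example_links = {
    "list1": ["paperA", "paperB"],
    "list2": ["paperA"],
    "list3": ["paperA", "paperB", "paperC"],
}
auth, hub = hits(example_links, steps=20)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

In this toy graph the page linked by every list ends up as the top authority, and the list pointing to the most pages ends up as the top hub. The sketch assumes the graph has at least one link, otherwise the normalization would divide by zero.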

The process converges

Let's prove it
Graph seen as a matrix $M$ [where $M_{ij} = 1$ if $i \to j$, and 0 otherwise].
Vectors $a = (a_1, a_2, \dots, a_n)$, $h = (h_1, h_2, \dots, h_n)$.
$h_i = \sum_{i \to j} a_j \iff h_i = \sum_j M_{ij} a_j$
So $h = Ma$ and $a = M^T h$.

Algorithm
Initially $a = h = \mathbf{1}^n$ (the all-ones vector).
Step: $h = Ma$, $a = M^T h$.
Normalize.
Then $a = M^T(Ma) = (M^T M)a$ and $h = M(M^T h) = (M M^T)h$.
In $2k$ steps $a = (M^T M)^k a$ and $h = (M M^T)^k h$.
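The matrix identity for the authority vector can be checked numerically. The sketch below is illustrative only: the 3-page matrix `M` and the helper names are hypothetical, and normalization is left out because it only rescales the vectors.

```python
def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]


def mat_vec(m, v):
    """Multiply matrix m by column vector v."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def mat_mul(a, b):
    """Multiply matrix a by matrix b."""
    b_t = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_t] for row in a]


# Hypothetical 3-page graph: M[i][j] = 1 if page i links to page j.
M = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
M_T = transpose(M)
MTM = mat_mul(M_T, M)

k = 3

# Step-by-step version: h = M a, then a = M^T h, repeated k times.
a_stepwise = [1, 1, 1]
for _ in range(k):
    h = mat_vec(M, a_stepwise)
    a_stepwise = mat_vec(M_T, h)

# Closed form: apply (M^T M) to the initial vector k times.
a_closed = [1, 1, 1]
for _ in range(k):
    a_closed = mat_vec(MTM, a_closed)
```

Both computations give the same authority vector, and `MTM` equals its own transpose, which is the symmetry the convergence proof relies on.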

Eigenvalues and eigenvectors
Definition: let $Ax = \lambda x$ for a scalar $\lambda$, a nonzero vector $x$ and a matrix $A$. Then $x$ is an eigenvector and $\lambda$ is its eigenvalue.
If $A$ is symmetric ($A^T = A$) then $A$ has $n$ orthogonal unit eigenvectors $w_1, \dots, w_n$ that form a basis (i.e. they are linearly independent), with eigenvalues $\lambda_1, \dots, \lambda_n$ ($|\lambda_i| \ge |\lambda_{i+1}|$).
Tip: $M^T M$ and $M M^T$ are symmetric.

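So
Given the linear independence of the $w_i$ we can write $x = \sum_i p_i w_i$, so $x$ has coordinates $[p_1, \dots, p_n]$.
With $\lambda_1, \dots, \lambda_n$ ordered so that $|\lambda_i| \ge |\lambda_{i+1}|$:
$A^k x = \lambda_1^k p_1 w_1 + \lambda_2^k p_2 w_2 + \dots + \lambda_n^k p_n w_n = \sum_i \lambda_i^k p_i w_i$
Let's divide by $\lambda_1^k$:
$A^k x / \lambda_1^k = \lambda_1^k p_1 w_1 / \lambda_1^k + \lambda_2^k p_2 w_2 / \lambda_1^k + \dots + \lambda_n^k p_n w_n / \lambda_1^k$
When $k \to \infty$, $A^k x / \lambda_1^k \to p_1 w_1$ (assuming $|\lambda_1| > |\lambda_2|$, so that every ratio $(\lambda_i / \lambda_1)^k$ with $i \ge 2$ tends to 0).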
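A small numerical check makes the limit concrete. This is a sketch under assumptions: the 2x2 symmetric matrix, the starting vector and all names are hypothetical choices for illustration. The matrix [[2, 1], [1, 2]] has eigenvalues 3 and 1, with unit eigenvector (1, 1)/sqrt(2) for the eigenvalue 3.

```python
import math


def mat_vec(m, v):
    """Multiply matrix m by column vector v."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


# Hypothetical symmetric matrix with eigenvalues 3 and 1.
A = [[2.0, 1.0], [1.0, 2.0]]
lambda_1 = 3.0
w_1 = [1 / math.sqrt(2), 1 / math.sqrt(2)]  # unit eigenvector for lambda_1

x = [1.0, 0.0]
# Coordinate of x along w_1 (the eigenvectors are orthonormal).
p_1 = sum(xi * wi for xi, wi in zip(x, w_1))

k = 30
v = x
for _ in range(k):
    # Dividing by lambda_1 at every step gives A^k x / lambda_1^k overall.
    v = [c / lambda_1 for c in mat_vec(A, v)]

limit = [p_1 * c for c in w_1]  # p_1 * w_1
```

After 30 steps `v` agrees with `p_1 * w_1` to within floating-point error, because the component along the second eigenvector has been multiplied by (1/3)^30.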

Convergence
$(M^T M)^k a$ and $(M M^T)^k h$
So for $k \to \infty$ these sequences, once normalized, converge.

PageRank
We have seen hubs and authorities.
Hubs can collect links to important authorities that do not point to each other.
There are other models, better suited to the web, where one prominent page can endorse another.
The PageRank model is based on transferable importance.

PageRank Concepts
Pages pass endorsements on outgoing links as fractions that depend on out-degree.
Initial PageRank value of each node in a network of n nodes: 1/n.
Choose a number of steps k.
[Basic] Update rule: each page divides its PageRank equally over its outgoing links and passes an equal share to each page it points to. Each page's new rank is the sum of the shares it receives.

Example
All pages start with PageRank = 1/8.
A becomes important, and B and C benefit too at step 2.
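The basic update rule can be sketched as follows. The slide's figure is not reproduced in this transcript, so the 8-page link structure below is an assumption chosen to be consistent with the text (eight pages, A gaining importance first, then B and C); the names `LINKS` and `basic_update` are likewise hypothetical. Exact fractions are used so the values can be read off directly.

```python
from fractions import Fraction

# Assumed link structure: page -> pages it links to (not taken from the slide).
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F", "G"],
    "D": ["A", "H"],
    "E": ["A", "H"],
    "F": ["A"],
    "G": ["A"],
    "H": ["A"],
}


def basic_update(rank, links):
    """Apply one Basic PageRank Update to the `rank` dictionary."""
    new_rank = {page: Fraction(0) for page in links}
    for page, targets in links.items():
        # Each page splits its current PageRank equally over its out-links.
        share = rank[page] / len(targets)
        for target in targets:
            new_rank[target] += share
    return new_rank


# Every page starts with 1/n = 1/8.
rank = {page: Fraction(1, len(LINKS)) for page in LINKS}
history = [rank]
for _ in range(2):
    rank = basic_update(rank, LINKS)
    history.append(rank)
```

On this assumed graph, A holds 1/2 after the first step, and after the second step B and C each hold 1/4 while A holds 5/16. The sketch requires every page to have at least one outgoing link.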

Convergence
Except for some special cases, the PageRank values of all nodes converge to limiting values as the number of steps goes to infinity.
At convergence, the PageRank of each page no longer changes, i.e., the values regenerate themselves under the update rule.

Example of Equilibrium
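An equilibrium can be checked by applying one more update and confirming that nothing changes. The sketch below is an illustration under assumptions: the 8-page graph is a hypothetical link structure (the slide's figure is not available in this transcript), and the candidate equilibrium values were worked out for that assumed graph rather than read from the slide.

```python
from fractions import Fraction

# Assumed link structure: page -> pages it links to (not taken from the slide).
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F", "G"],
    "D": ["A", "H"],
    "E": ["A", "H"],
    "F": ["A"],
    "G": ["A"],
    "H": ["A"],
}


def basic_update(rank, links):
    """Apply one Basic PageRank Update to the `rank` dictionary."""
    new_rank = {page: Fraction(0) for page in links}
    for page, targets in links.items():
        share = rank[page] / len(targets)
        for target in targets:
            new_rank[target] += share
    return new_rank


# Candidate equilibrium for the assumed graph.
equilibrium = {
    "A": Fraction(4, 13),
    "B": Fraction(2, 13),
    "C": Fraction(2, 13),
    "D": Fraction(1, 13),
    "E": Fraction(1, 13),
    "F": Fraction(1, 13),
    "G": Fraction(1, 13),
    "H": Fraction(1, 13),
}
after_update = basic_update(equilibrium, LINKS)
is_equilibrium = after_update == equilibrium
```

Here `is_equilibrium` is true: each page receives exactly the PageRank it gives away, so the values regenerate themselves.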

Problems with the basic PageRank
Dead ends: F and G converge to ½ and all the other nodes to 0.
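The drain of PageRank into F and G can be reproduced on a small graph. The link structure below is an assumption (the slide's figure is not in this transcript): it is a hypothetical 8-page graph in which F and G point only at each other, so any PageRank that reaches them never leaves.

```python
# Assumed link structure in which F and G link only to each other.
TRAP_LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F", "G"],
    "D": ["A", "H"],
    "E": ["A", "H"],
    "F": ["G"],
    "G": ["F"],
    "H": ["A"],
}


def basic_update(rank, links):
    """Apply one Basic PageRank Update to the `rank` dictionary."""
    new_rank = {page: 0.0 for page in links}
    for page, targets in links.items():
        share = rank[page] / len(targets)
        for target in targets:
            new_rank[target] += share
    return new_rank


rank = {page: 1.0 / len(TRAP_LINKS) for page in TRAP_LINKS}
for _ in range(200):
    rank = basic_update(rank, TRAP_LINKS)

trapped = rank["F"] + rank["G"]
```

After 200 updates essentially all of the PageRank sits on F and G, one half each, and every other page is left with a value indistinguishable from 0.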

Solution: The REAL PageRank
[Scaled] Update Rule:
Apply the basic update rule.
Then scale down all values by a scaling factor s [chosen between 0 and 1]. [Total network PageRank value changes from 1 to s.]
Divide the 1 - s residual units of PageRank equally over all nodes: (1 - s)/n each.
It can be proven that the values converge again.
The scaling factor is usually chosen between 0.8 and 0.9.
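The scaled rule adds one line to the basic update. This is a minimal sketch under assumptions: the graph is a hypothetical 8-page structure in which F and G link only to each other, the function name `scaled_update` is illustrative, and s = 0.85 is simply one value inside the 0.8 to 0.9 range mentioned above.

```python
# Assumed link structure in which F and G link only to each other.
TRAP_LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F", "G"],
    "D": ["A", "H"],
    "E": ["A", "H"],
    "F": ["G"],
    "G": ["F"],
    "H": ["A"],
}


def scaled_update(rank, links, s):
    """Apply one Scaled PageRank Update with scaling factor s."""
    n = len(links)
    new_rank = {page: 0.0 for page in links}
    # Basic update: each page splits its PageRank over its out-links.
    for page, targets in links.items():
        share = rank[page] / len(targets)
        for target in targets:
            new_rank[target] += share
    # Scale everything down by s, then give every node (1 - s) / n.
    return {page: s * value + (1 - s) / n for page, value in new_rank.items()}


s = 0.85
n = len(TRAP_LINKS)
rank = {page: 1.0 / n for page in TRAP_LINKS}
for _ in range(200):
    rank = scaled_update(rank, TRAP_LINKS, s)

total = sum(rank.values())
smallest = min(rank.values())
```

The total stays at 1, every page keeps at least (1 - s)/n, and F and G no longer absorb all of the PageRank, although they still rank highly in this assumed graph.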

Search Ranking is very important to business
A change in the results on the search pages might mean loss of business, i.e., no longer appearing on the first page.
Ranking algorithms are kept very secret and changed continuously.

Examples of Google Bombs

Random Walks
Starting from a node, follow one outgoing link chosen at random, each with equal probability.

PageRank as Random Walk
The probability of being at a page X after k steps of a random walk started at a uniformly random page is precisely the PageRank of X after k applications of the Basic PageRank Update Rule.
Scaled Update Rule equivalent: follow a random outgoing link with probability s, while with probability 1 - s jump to a random node in the network.
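The equivalence can be illustrated by simulation. The sketch below is an approximation under assumptions: the 8-page graph is hypothetical, walks start at a uniformly random page to match the initial value 1/n, and the function names, the number of walks and the seed are arbitrary choices made for this example.

```python
import random

# Assumed link structure: page -> pages it links to (not taken from the slides).
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F", "G"],
    "D": ["A", "H"],
    "E": ["A", "H"],
    "F": ["A"],
    "G": ["A"],
    "H": ["A"],
}


def basic_pagerank(links, k):
    """Return PageRank values after k Basic PageRank Updates."""
    rank = {page: 1.0 / len(links) for page in links}
    for _ in range(k):
        new_rank = {page: 0.0 for page in links}
        for page, targets in links.items():
            share = rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


def simulate_walks(links, k, walks, seed):
    """Estimate where a k-step random walk ends, from `walks` simulated walks."""
    rng = random.Random(seed)
    pages = sorted(links)
    counts = {page: 0 for page in pages}
    for _ in range(walks):
        page = rng.choice(pages)  # uniform start, matching the initial 1/n
        for _ in range(k):
            page = rng.choice(links[page])  # follow a random outgoing link
        counts[page] += 1
    return {page: count / walks for page, count in counts.items()}


k = 4
exact = basic_pagerank(LINKS, k)
estimate = simulate_walks(LINKS, k, walks=20000, seed=0)
max_gap = max(abs(exact[page] - estimate[page]) for page in LINKS)
```

With 20,000 walks the empirical frequencies should match the k-step PageRank values to within a couple of percentage points; the remaining gap is sampling noise and shrinks as the number of walks grows.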

Chapter 14 References