Social Networks 2015 Lecture 10: The structure of the web and link analysis


The structure of the web

Information networks Nodes: pieces of information Links: different relations between information Key example: World Wide Web

Other information networks Citation networks Difference from the web: an implicit timeline (a paper can only cite papers published before it)

Information networks and short paths

World Wide Web Key features of the early web Distributed information system: different pages stored on different computers Protocols for accessing this information using a browser Information is represented using hypertext This makes the web into a network, where nodes are pages and (directed) links are hyperlinks

The modern web Two types of links: Navigational links: traditional hyperlinks. Clicking on one shows a new page in the browser Transactional links: have side-effects. Clicking on one triggers a program that can cause effects other than showing a new page in the browser We will focus on the information network spanned by navigational links

Using graph theory to analyse the structure of the web By representing the web as a graph, we can employ the same techniques used to analyse social networks Important difference: links are now directed The network concepts (connectivity, components, etc.) we defined for undirected graphs can be generalised to directed graphs, but become slightly more complicated

Concepts for directed graphs: paths and connectivity A path from A to B in a directed graph is a sequence of nodes, beginning with A and ending with B, such that every two consecutive nodes are connected by a forward edge A directed graph is strongly connected if there is a path from every node to every other node

Concepts for directed graphs: strongly connected components We say that a strongly connected component (SCC) in a directed graph is a subset of the nodes such that: (i) every node in the subset has a path to every other; and (ii) the subset is not part of some larger set with the property that every node can reach every other.
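The definition above can be turned into a short program. Kosaraju's two-pass depth-first search is one standard way to compute strongly connected components; the lecture does not prescribe an algorithm, so the sketch below (including its graph representation) is only one possible illustration:

```python
# A sketch of Kosaraju's algorithm for strongly connected components.
# The dict-of-successor-lists graph representation is an assumption.
def strongly_connected_components(graph):
    """graph: dict mapping each node to a list of its successor nodes."""
    visited = set()

    def dfs(start, g, out):
        # Iterative depth-first search; appends nodes to `out` in post-order.
        stack = [(start, iter(g[start]))]
        visited.add(start)
        while stack:
            node, successors = stack[-1]
            advanced = False
            for w in successors:
                if w not in visited:
                    visited.add(w)
                    stack.append((w, iter(g[w])))
                    advanced = True
                    break
            if not advanced:
                stack.pop()
                out.append(node)

    # First pass: record nodes in order of DFS completion.
    order = []
    for v in graph:
        if v not in visited:
            dfs(v, graph, order)

    # Second pass: DFS on the reversed graph, in reverse finishing order;
    # each tree found is exactly one strongly connected component.
    reverse = {v: [] for v in graph}
    for v, succs in graph.items():
        for w in succs:
            reverse[w].append(v)
    visited = set()
    components = []
    for v in reversed(order):
        if v not in visited:
            comp = []
            dfs(v, reverse, comp)
            components.append(sorted(comp))
    return components
```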

Example


The bow-tie structure of the web What does the web look like? What are its strongly connected components? Early influential study: Broder et al. (1999); findings later confirmed by others

The bow-tie structure of the web One giant strongly connected component. IN-nodes: can reach the giant SCC but cannot be reached from it. OUT-nodes: can be reached from the giant SCC, but cannot reach it. The bow-tie structure: three very large components

Web 2.0 Three main principles: towards shared content and collective creation personal data in the cloud mechanisms for maintaining social connections between people Web 2.0 applications: Wikipedia, Facebook, Twitter, Gmail,... Many of the general concepts in this course are extremely relevant for analysing Web 2.0!

Link analysis and web search

Page ranking Web search is keyword-based Key web search problem: almost always too many results! How can the results be ranked, so that we get the most important ones first? Difficult, because: Queries have low expressivity Synonymy: several words mean the same thing Polysemy: the same word has several meanings It is difficult to say anything about the importance of a web page based only on the keywords present there Different people expect different results from the same query

Link analysis Key idea: rank pages not (only) according to their local content, but (also) look at their links Idea 1: if a page is linked to by many of the relevant pages (pages with the keywords) then that page is important

Example

List pages Problem: ranking by in-links only can give results that have many in-links in general but are not relevant Idea 2: not all links are equally important. Links from pages that link to many of the pages with many votes are likely to be more important. We call such pages list pages. Two steps now: (1) find how good a page is as a list (2) let links from good list pages count higher

Example: (1) a page's value as a list (figure: the pages receive list values 8, 11, 6, 7, 5 and 6)

Example: (2) let links from good list pages count higher (figure: the pages receive new scores 19, 19, 24, 15, 12, 5 and 1)

Why stop here? We can repeat this process: recompute the list values using the new scores, then compute new scores from those. Again and again and again...

Ranking algorithm: hubs and authorities 1. For the query, find all the hubs (pages with the keywords; these will be used as potential lists, and the hub score for each hub is initially 1) and the authorities (the pages linked to by the hubs; these are the pages we rank). 2. For each authority, update its authority score to be the sum of the hub scores of all the hubs pointing to it. 3. For each hub, update its hub score to be the sum of the authority scores of all the authorities it points to. 4. Repeat steps 2 and 3 a given number of times

Ranking algorithm: hubs and authorities Normalisation: divide each hub score by the sum of all hub scores, and each authority score by the sum of all authority scores The normalised values converge: for each iteration, the change is smaller and smaller And they converge to the same values, no matter which initial authority and hub scores we used! (advanced material)
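The update and normalisation steps above can be sketched in a few lines of code. This is a minimal illustration, not the lecture's implementation; the edge-list representation, node names and iteration count are assumptions:

```python
# A minimal sketch of the hubs-and-authorities updates, including the
# normalisation step that makes the values converge.
def hits(edges, hubs, iterations=50):
    """edges: list of (hub, authority) links; hubs: the candidate list pages."""
    hub_score = {h: 1.0 for h in hubs}          # hub scores start at 1
    auth_score = {a: 0.0 for (_, a) in edges}
    for _ in range(iterations):
        # Authority update: sum of the hub scores of all hubs pointing to it.
        for a in auth_score:
            auth_score[a] = sum(hub_score[h] for (h, t) in edges if t == a)
        # Hub update: sum of the authority scores of all pages it points to.
        for h in hub_score:
            hub_score[h] = sum(auth_score[a] for (src, a) in edges if src == h)
        # Normalise so each set of scores sums to 1.
        h_total = sum(hub_score.values())
        a_total = sum(auth_score.values())
        hub_score = {h: v / h_total for h, v in hub_score.items()}
        auth_score = {a: v / a_total for a, v in auth_score.items()}
    return hub_score, auth_score
```

With repeated iterations the normalised scores settle down, which is the convergence behaviour the slide describes.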

PageRank In the hubs and authorities algorithm, pages have different roles. A page can cast many votes without itself being a relevant result PageRank is an alternative algorithm, where a page is considered to be important if it is linked to by other pages that also are important

Ranking algorithm: PageRank Let n be the number of nodes in the network. Let the initial PageRank value for each node be 1/n Update the PageRank value for each node as follows: each page divides its current PageRank value equally across its outgoing links. Each page updates its PageRank value to be the sum of the incoming values. Repeat the update step k times
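The update rule above can be sketched directly in code. The graph representation and node names are illustrative assumptions; note that a node with no outgoing links would break this basic version, which previews the problem discussed on the later slides:

```python
# A minimal sketch of the basic PageRank update described above.
def pagerank_basic(out_links, k=50):
    """out_links: dict mapping each node to the list of nodes it links to."""
    n = len(out_links)
    rank = {v: 1.0 / n for v in out_links}      # initial value 1/n per node
    for _ in range(k):
        new_rank = {v: 0.0 for v in out_links}
        for v, targets in out_links.items():
            share = rank[v] / len(targets)      # split rank over outgoing links
            for t in targets:
                new_rank[t] += share            # sum up the incoming values
        rank = new_rank
    return rank
```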

PageRank: example (whiteboard)

PageRank: analysis The PageRank values converge towards final limiting values as the number of iterations increases We say that an assignment of PageRank values are equilibrium values if they are not changed by the update rule Limiting values are equilibrium values In strongly connected networks, equilibrium values are limiting values

PageRank: a problem Limiting values? All PageRank will end up here

PageRank: a problem

PageRank: a problem In all real large networks, this is a real problem Solution: the update rule is modified by using a scaling factor This is the version of PageRank that is used in practice

PageRank with scaling factor Let n be the number of nodes in the network. Let the initial PageRank value for each node be 1/n. Let s be a scaling factor between 0 and 1 (typically 0.8-0.9). Update: 1. first apply the basic PageRank update rule; 2. scale down all PageRank values by a factor of s; 3. add the value (1-s)/n to the PageRank of each node. Repeat the update step k times

PageRank with scaling factor: analysis The PageRank values still converge towards final limiting values as the number of iterations increases Limiting values depend on s In any network, limiting values are unique equilibrium values

PageRank as random walks

PageRank as random walks Consider the following definition of a random walk of a web graph: Pick a web page at random (uniform probability) Follow a random link from that web page (again, uniform probability) Repeat the previous step k times Theorem: the probability of being at page X after k steps is equal to the PageRank value after k steps (unscaled)
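The theorem can be made concrete by propagating the walk's probability distribution exactly, rather than simulating it. The toy graph below is an assumption, not the lecture's example; notice that the loop body is, term for term, the same computation as the basic PageRank update, which is why the two value sets coincide:

```python
# A sketch of the exact k-step distribution of the random walk:
# the probability of the surfer being at each node after k steps.
def walk_distribution(out_links, k):
    """out_links: dict mapping each node to the list of nodes it links to."""
    n = len(out_links)
    prob = {v: 1.0 / n for v in out_links}      # uniformly random start page
    for _ in range(k):
        nxt = {v: 0.0 for v in out_links}
        for v, targets in out_links.items():
            for t in targets:                   # each link is equally likely
                nxt[t] += prob[v] / len(targets)
        prob = nxt
    return prob
```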

Link analysis and modern web search PageRank was developed by Google, and was used to rank search results for many years Today, both Google and others use ranking methods that are extremely complex, extremely secret, and always changing

Examples and analysis Blackboard