Graph and Link Mining

Graphs: Basics
A graph is a powerful abstraction for modeling entities and their pairwise relationships.
$G = (V, E)$
Set of nodes: $V = \{v_1, \ldots, v_5\}$
Set of edges: $E = \{(v_1, v_2), \ldots, (v_4, v_5)\}$
Examples: social networks, Twitter followers, the Web, collaboration graphs.
[Figure: example graph on the nodes $v_1, \ldots, v_5$.]
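A minimal sketch of how such a graph could be stored in code, assuming an adjacency-set representation. The slide only names the first and last edge, so the edge list below is hypothetical and chosen purely for illustration, as is the helper name `build_adjacency`.

```python
# Hypothetical node and edge lists; the slide does not spell out the full edge set.
nodes = ["v1", "v2", "v3", "v4", "v5"]
edges = [("v1", "v2"), ("v1", "v3"), ("v2", "v4"), ("v4", "v5")]


def build_adjacency(nodes, edges):
    """Return a dict mapping every node to the set of its neighbors (undirected)."""
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        adjacency[u].add(v)  # an undirected edge is stored in both directions
        adjacency[v].add(u)
    return adjacency


adjacency = build_adjacency(nodes, edges)
```

With this representation, looking up the neighbors of a node is a single dictionary access, which is what the degree and path computations on later slides rely on.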

Undirected Graphs
Undirected graph: the edges are unordered pairs; they can be traversed in either direction.
Degree of a node: the number of edges incident on the node.
Path: a sequence of edges leading from one node to another.
Connected component: a maximal set of nodes such that there is a path between any two nodes in the set.
[Figure: undirected graph on the nodes $v_1, \ldots, v_5$.]
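The definitions of degree and connected component can be sketched as follows. The adjacency dictionary is a hypothetical example, not the graph from the slide's figure, and the function names are chosen here for illustration.

```python
from collections import deque

# Hypothetical undirected graph with two components: {1, 2, 3} and {4, 5}.
adjacency = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {5}, 5: {4}}


def degree(adjacency, node):
    """Degree = number of edges incident on the node."""
    return len(adjacency[node])


def connected_components(adjacency):
    """Group nodes into components with a breadth-first search from each unseen node."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in seen:
                    seen.add(v)
                    component.add(v)
                    queue.append(v)
        components.append(component)
    return components


components = connected_components(adjacency)
```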

Directed Graphs
Directed graph: the edges are ordered pairs; they can be traversed only in the direction from the first node to the second.
In-degree and out-degree of a node: the number of incoming and outgoing edges, respectively.
Path: a sequence of directed edges leading from one node to another.
Strongly connected component: a maximal set of nodes such that there is a directed path between any two nodes in the set.
[Figure: directed graph on the nodes $v_1, \ldots, v_5$.]
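A small sketch of the directed-graph notions, assuming the graph is stored as out-link lists in which every node appears as a key. The graph is hypothetical. The strongly connected component of a node is computed here as the intersection of what it can reach and what can reach it, which is simple rather than the most efficient approach.

```python
# Hypothetical directed graph: a -> b -> c -> a forms a cycle, and c also links to d.
out_links = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}


def in_degrees(out_links):
    """Count incoming edges per node."""
    counts = {node: 0 for node in out_links}
    for targets in out_links.values():
        for target in targets:
            counts[target] += 1
    return counts


def reverse(out_links):
    """Flip every edge, giving the in-link lists."""
    in_links = {node: [] for node in out_links}
    for source, targets in out_links.items():
        for target in targets:
            in_links[target].append(source)
    return in_links


def reachable(start, links):
    """All nodes reachable from start by following the given links."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in links[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def strongly_connected_component(node, out_links):
    """Nodes that the given node can reach AND that can reach it back."""
    return reachable(node, out_links) & reachable(node, reverse(out_links))
```

In this example, `d` is reachable from the cycle but cannot get back to it, so it forms a component of its own.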

Examples of Graphs We Might Mine
Airline route maps: the route information can tell you about both history and politics.
Call detail records: they tell us about relationships between people. Who got in trouble about a decade ago for using this information?
The Web is based on (hyper)links between documents.
Social networks form graphs.
Link analysis is the data mining technique that addresses relationships and connections.

6 Degrees of Separation
Claim: there are at most 6 degrees of separation between any two people.
This is important in social networks: LinkedIn tells you how you connect to others, and your network expands with each link.
Stanley Milgram wasn't the first to note the small-world effect, but he popularized it with a famous experiment: how close are two random people?
He picked people in Omaha, Nebraska or Wichita, Kansas, and a target person in Boston.
Each source person was asked to send a letter to the target person or, if they did not know the target, to someone more likely to know them.
The average path length was 5.5 or 6.
But only 64 of 296 letters arrived (this is often not highlighted).
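The "degrees of separation" between two people is the length of a shortest path between them in the acquaintance graph. A minimal sketch using breadth-first search follows; the acquaintance graph and the names in it are hypothetical and have nothing to do with Milgram's data.

```python
from collections import deque

# Hypothetical acquaintance graph (undirected, stored in both directions).
acquaintances = {
    "omaha_sender": {"neighbor"},
    "neighbor": {"omaha_sender", "colleague"},
    "colleague": {"neighbor", "boston_target"},
    "boston_target": {"colleague"},
    "hermit": set(),
}


def degrees_of_separation(adjacency, source, target):
    """Length of a shortest path from source to target, or None if unreachable."""
    if source == target:
        return 0
    distance = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in distance:
                distance[v] = distance[u] + 1
                if v == target:
                    return distance[v]
                queue.append(v)
    return None


hops = degrees_of_separation(acquaintances, "omaha_sender", "boston_target")
```

Note that participants in the experiment forwarded letters using only local knowledge, so the chains they found need not be the shortest ones that a search over the full graph would return.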

Examples of Applications
Identifying authoritative sources of information on the WWW by analyzing page links: Google and PageRank (we will come back to this).
Understanding physician referral patterns.
Analyzing telephone call patterns: MCI Friends and Family. You call Mary Smith, also on MCI, so ask her to join MCI. But your wife does not know Mary Smith! Oops! Far-fetched? Facebook does it all of the time!
Identifying fraud: in the past, one would purchase several stolen calling cards and use them to call the same person. That is a clue.

Mining the Graph Structure
A graph is a combinatorial object with a certain structure.
Mining the structure of the graph reveals information about the entities in the graph.
E.g., if in the Facebook graph I find that there are people who are all linked to each other, then these people are likely to be a community. This is the community discovery problem.
By measuring the number of friends in the Facebook graph, I can find the most important nodes. This is the node importance problem.

Importance Problem
What are the most important nodes in the graph?
What are the most authoritative pages on the web?
Who are the important users in Facebook?
What are the most influential Twitter accounts?

Link Analysis
First-generation search engines viewed documents as flat text files and could not cope with size, spamming, or user needs.
Second-generation search engines: ranking becomes critical, with a shift from relevance to authoritativeness.
Authoritativeness: the static importance of the page.
A success story for network analysis and a huge commercial success.
It all started with two graduate students at Stanford. Everyone knows the company, right?

Link Analysis: Intuition
A link from page p to page q denotes endorsement: page p considers page q an authority on a subject.
Use the graph of recommendations to assign an authority value to every page.
The same idea applies to other graphs as well, e.g., the Twitter graph, where user p follows user q.

Constructing the Graph
[Figure: a graph with an authority weight w on each node.]
Goal: output an authority weight for each node, also known as its centrality or importance.

Rank by Popularity
Rank pages according to the number of incoming edges (in-degree, degree centrality).
1. Red Page
2. Yellow Page
3. Blue Page
4. Purple Page
5. Green Page
[Figure: five pages labeled with their in-degrees: w = 2, w = 3, w = 2, w = 1, w = 1.]
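Ranking by in-degree can be sketched in a few lines. The link list below is hypothetical: the slide's figure is not recoverable, so the links were chosen only so that the in-degrees come out as 3, 2, 2, 1, 1, matching the figure labels.

```python
# Pages in an arbitrary fixed order; ties in in-degree keep this order.
pages = ["Red", "Yellow", "Blue", "Purple", "Green"]

# Hypothetical (source, target) links, invented for illustration.
links = [
    ("Yellow", "Red"), ("Blue", "Red"), ("Purple", "Red"),
    ("Red", "Yellow"), ("Green", "Yellow"),
    ("Red", "Blue"), ("Yellow", "Blue"),
    ("Blue", "Purple"),
    ("Purple", "Green"),
]

# Count incoming links per page.
in_degree = {page: 0 for page in pages}
for _source, target in links:
    in_degree[target] += 1

# Sort by decreasing in-degree; Python's sort is stable, so ties keep list order.
ranking = sorted(pages, key=lambda page: -in_degree[page])
```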

Popularity
What matters is not only how many pages link to you, but also how important those pages are.
Good authorities are pointed to by good authorities.
This is a recursive definition of importance.

PageRank
Good authorities are pointed to by good authorities: the value of a page is determined by the value of the pages that link to it.
How do we implement that?
Each node distributes its authority value equally among the nodes it points to.
The authority value of each node is the sum of the authority fractions it collects from the nodes that point to it.
$w_1 + w_2 + w_3 = 1$
$w_1 = w_2 + w_3$
$w_2 = \frac{1}{2} w_1$
$w_3 = \frac{1}{2} w_1$
Solving the system of equations, we get the authority values for the nodes: $w_1 = \frac{1}{2}$, $w_2 = \frac{1}{4}$, $w_3 = \frac{1}{4}$.
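The "distribute your authority equally" rule can be checked directly on the three-node example. This sketch assumes the graph implied by the equations: node 1 links to nodes 2 and 3, and each of those links back to node 1. The node numbering and the function name `distribute` are assumptions made here for illustration.

```python
from fractions import Fraction

# Assumed out-links for the three-node example (node -> nodes it points to).
out_links = {1: [2, 3], 2: [1], 3: [1]}


def distribute(weights, out_links):
    """One round: every node splits its weight equally among the nodes it points to."""
    collected = {node: Fraction(0) for node in out_links}
    for node, targets in out_links.items():
        share = weights[node] / len(targets)
        for target in targets:
            collected[target] += share
    return collected


# The solution stated on the slide.
solution = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4)}

# The solution is a fixed point: redistributing it reproduces the same weights.
is_fixed_point = distribute(solution, out_links) == solution
total = sum(solution.values())
```

Exact fractions are used so that the fixed-point check is an equality test rather than a floating-point comparison.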

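A More Complex Example
$w_1 = \frac{1}{3} w_4 + \frac{1}{2} w_5$
$w_2 = \frac{1}{2} w_1 + w_3 + \frac{1}{3} w_4$
$w_3 = \frac{1}{2} w_1 + \frac{1}{3} w_4$
$w_4 = \frac{1}{2} w_5$
$w_5 = w_2$
[Figure: directed graph on the nodes $v_1, \ldots, v_5$.]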
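As a hedged illustration, the five equations can be solved exactly. The out-link lists below are read off the coefficients of the equations (for example, the coefficient 1/3 on $w_4$ means node 4 has three out-links) and are an assumption about the figure. The normalization that the weights sum to 1 is carried over from the three-node example, and the function name `stationary_weights` is hypothetical.

```python
from fractions import Fraction

# Out-links inferred from the equations (an assumption about the figure).
out_links = {1: [2, 3], 2: [5], 3: [2], 4: [1, 2, 3], 5: [1, 4]}


def stationary_weights(out_links):
    """Solve w = (what w collects) together with sum(w) = 1 by Gauss-Jordan elimination."""
    nodes = sorted(out_links)
    n = len(nodes)
    index = {node: i for i, node in enumerate(nodes)}
    # Augmented rows for "collected weight minus own weight = 0".
    rows = [[Fraction(0)] * (n + 1) for _ in range(n)]
    for node, targets in out_links.items():
        for target in targets:
            rows[index[target]][index[node]] += Fraction(1, len(targets))
    for i in range(n):
        rows[i][i] -= 1
    # One balance equation is redundant; replace it with the normalization sum(w) = 1.
    rows[-1] = [Fraction(1)] * (n + 1)
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [x / pivot_value for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return {node: rows[index[node]][n] for node in nodes}


weights = stationary_weights(out_links)
```

Under these assumptions the weights come out as 2/11, 3/11, 3/22, 3/22, 3/11 for nodes 1 through 5, so nodes 2 and 5 are the most authoritative.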

Random Walks on Graphs
What we described is equivalent to a random walk on the graph.
Random walk: start from a node chosen uniformly at random, pick one of its outgoing edges uniformly at random and follow it, then repeat.
Some nodes will be visited more often than others. Those are the more important nodes.
Importance is based not only on the number of incoming links, but also on how often the predecessor nodes are visited.
A value like Google's PageRank indicates how often a node would be visited.
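The random walk can be simulated directly. This is a sketch under assumptions: it uses the five-node graph whose out-links are inferred from the equations of the "more complex example", a fixed seed for reproducibility, and a hypothetical function name.

```python
import random

# Out-links inferred from the five-node example's equations (an assumption).
out_links = {1: [2, 3], 2: [5], 3: [2], 4: [1, 2, 3], 5: [1, 4]}


def random_walk_visit_frequencies(out_links, steps, seed=0):
    """Walk for the given number of steps and return the fraction of visits per node."""
    rng = random.Random(seed)
    node = rng.choice(sorted(out_links))  # start node chosen uniformly at random
    visits = {v: 0 for v in out_links}
    for _ in range(steps):
        node = rng.choice(out_links[node])  # follow a uniformly random outgoing edge
        visits[node] += 1
    return {v: count / steps for v, count in visits.items()}


frequencies = random_walk_visit_frequencies(out_links, steps=200_000)
```

For a long walk on this graph, the visit frequencies should approach the authority weights obtained by solving the equations, with nodes 2 and 5 visited most often.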

Random Walks on Graphs
Question: what is the probability of being at a specific node?
$p_i$: probability of being at node $i$ at this step
$p'_i$: probability of being at node $i$ in the next step
$p'_1 = \frac{1}{3} p_4 + \frac{1}{2} p_5$
$p'_2 = \frac{1}{2} p_1 + p_3 + \frac{1}{3} p_4$
$p'_3 = \frac{1}{2} p_1 + \frac{1}{3} p_4$
$p'_4 = \frac{1}{2} p_5$
$p'_5 = p_2$
After many steps the probabilities converge to the stationary distribution of the random walk.
[Figure: directed graph on the nodes $v_1, \ldots, v_5$.]
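Applying the update equations repeatedly gives a simple iterative computation of the stationary distribution. The sketch below assumes the five-node graph inferred from the equations and a uniform starting distribution. The tolerance, step limit, and function names are illustrative choices, not part of the original slides.

```python
# Out-links inferred from the update equations (an assumption about the figure).
out_links = {1: [2, 3], 2: [5], 3: [2], 4: [1, 2, 3], 5: [1, 4]}


def next_step(p, out_links):
    """Probabilities at the next step, given the probabilities p at this step."""
    nxt = {v: 0.0 for v in out_links}
    for v, targets in out_links.items():
        for t in targets:
            nxt[t] += p[v] / len(targets)  # v passes an equal share to each out-neighbor
    return nxt


def stationary_distribution(out_links, tol=1e-12, max_steps=10_000):
    """Iterate from the uniform distribution until the probabilities stop changing."""
    p = {v: 1.0 / len(out_links) for v in out_links}
    for _ in range(max_steps):
        nxt = next_step(p, out_links)
        if max(abs(nxt[v] - p[v]) for v in p) < tol:
            return nxt
        p = nxt
    return p


p = stationary_distribution(out_links)
```

Convergence here relies on this particular graph being strongly connected and aperiodic; on other graphs the plain iteration may oscillate or lose probability mass.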

How Does PageRank Work?
Arbitrarily initialize all pages to a PageRank of 1.
Repeatedly perform the calculations for each page.
Eventually the values will converge.
PageRank is what caused Google to succeed. Prior to that, only content mattered, not link structure.

Benefits of PageRank
It is not trivial to fool PageRank.
You can create dummy pages that point to your page, but since no one is pointing to those pages, they will have low PageRank and will not help much.
You can create dummy pages that also point to one another, but without being pointed to by an outside authority, the impact will be limited.
But it is clear that Google must have many tweaks to catch cases like this (link spam, or link farms).

Social Network Analysis
Social Network Analysis Overview: https://www.youtube.com/watch?v=fgr_gq2ika (5 minutes)
What is Social Network Analysis: https://www.youtube.com/watch?v=xt3epf2esbq (4 minutes)