Graphs / Networks. CSE 6242/ CX 4242 Feb 18, Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

Similar documents
Graphs / Networks CSE 6242/ CX Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

Graphs / Networks CSE 6242/ CX Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

Graphs / Networks CSE 6242/ CX Interactive applications. Duen Horng (Polo) Chau Georgia Tech

Graphs III. CSE 6242 A / CS 4803 DVA Feb 26, Duen Horng (Polo) Chau Georgia Tech. (Interactive) Applications

Analytics Building Blocks

Text Analytics (Text Mining)

Text Analytics (Text Mining)

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Lecture 27: Learning from relational data

Data Integration. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Graphs / Networks CSE 6242/ CX Basics, how to build & store graphs, laws, etc. Centrality, and algorithms you should know

Data Collection, Simple Storage (SQLite) & Cleaning

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

Social Network Analysis

Scaling Up HBase. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Social-Network Graphs

Basics, how to build & store graphs, laws, etc. Centrality, and algorithms you should know

COMP Page Rank

Basics, how to build & store graphs, laws, etc. Centrality, and algorithms you should know

Classification Key Concepts

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS

Graph and Link Mining

COMP5331: Knowledge Discovery and Data Mining

Data Mining Concepts & Tasks

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank

Scaling Up 1 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. Hadoop, Pig

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies

Data Mining Concepts & Tasks

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)

Degree Distribution: The case of Citation Networks

Agenda. Math Google PageRank algorithm. 2 Developing a formula for ranking web pages. 3 Interpretation. 4 Computing the score of each page

Graph Theory and Network Measurment

Detecting and Analyzing Communities in Social Network Graphs for Targeted Marketing

Social Network Analysis

Scaling Up Pig. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Data Mining Concepts. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech

modern database systems lecture 10 : large-scale graph processing

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

Link Analysis and Web Search

Social Networks 2015 Lecture 10: The structure of the web and link analysis

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Scaling Up Pig. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Complex-Network Modelling and Inference

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Motivation. Motivation

Graphs (Part II) Shannon Quinn

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CSE 190 Lecture 16. Data Mining and Predictive Analytics. Small-world phenomena

Information Networks: PageRank

Data mining --- mining graphs

Big Data Analytics CSCI 4030

Link Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld

Graph Theory. Network Science: Graph theory. Graph theory Terminology and notation. Graph theory Graph visualization

A Survey of Google's PageRank

Algorithmic and Economic Aspects of Networks. Nicole Immorlica

Apache Giraph. for applications in Machine Learning & Recommendation Systems. Maria Novartis

CS6220: DATA MINING TECHNIQUES

Extracting Information from Complex Networks

Big Data Analytics CSCI 4030

Computing Yi Fang, PhD

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Analysis of Biological Networks. 1. Clustering 2. Random Walks 3. Finding paths

The PageRank Computation in Google, Randomized Algorithms and Consensus of Multi-Agent Systems

Part 1: Link Analysis & Page Rank

The PageRank Computation in Google, Randomized Algorithms and Consensus of Multi-Agent Systems

Introduction to Data Mining

Lecture 8: Linkage algorithms and web search

Overview of this week

CENTRALITIES. Carlo PICCARDI. DEIB - Department of Electronics, Information and Bioengineering Politecnico di Milano, Italy

Using! to Teach Graph Theory

CS/ENGRD 2110 Object-Oriented Programming and Data Structures Spring 2012 Thorsten Joachims

Graph-Parallel Problems. ML in the Context of Parallel Architectures

A brief history of Google

How to organize the Web?

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

Reading Time: A Method for Improving the Ranking Scores of Web Pages

Social Networks, Social Interaction, and Outcomes

Sergei Silvestrov, Christopher Engström. January 29, 2013

Speaker Pages For CoMeT System

CS-C Data Science Chapter 9: Searching for relevant pages on the Web: Random walks on the Web. Jaakko Hollmén, Department of Computer Science

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Structure of Social Networks

DSCI 575: Advanced Machine Learning. PageRank Winter 2018

Introduction to Machine Learning. Duen Horng (Polo) Chau Associate Director, MS Analytics Associate Professor, CSE, College of Computing Georgia Tech

Week 5 Video 5. Relationship Mining Network Analysis

Recap! CMSC 498J: Social Media Computing. Department of Computer Science University of Maryland Spring Hadi Amiri

MAE 298, Lecture 9 April 30, Web search and decentralized search on small-worlds

Network Centrality. Saptarshi Ghosh Department of CSE, IIT Kharagpur Social Computing course, CS60017

Announcements. Assignment 6 due right now. Assignment 7 (FacePamphlet) out, due next Friday, March 16 at 3:15PM

Jure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah

Mining Social Network Graphs

Search Engine Architecture. Hongning Wang

CS-E5740. Complex Networks. Network analysis: key measures and characteristics

Demystifying movie ratings 224W Project Report. Amritha Raghunath Vignesh Ganapathi Subramanian

Advanced Computer Architecture: A Google Search Engine

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

Homework 4: Clustering, Recommenders, Dim. Reduction, ML and Graph Mining (due November 19 th, 2014, 2:30pm, in class hard-copy please)

V2: Measures and Metrics (II)

Transcription:

CSE 6242/ CX 4242 Feb 18, 2014 Graphs / Networks Centrality measures, algorithms, interactive applications Duen Horng (Polo) Chau Georgia Tech Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Le Song

Joint IC/CSE Seminar Announcement Thursday (2/20), 10-11am, TSRB Banquet Hall Design Techniques for Crowdsourcing Complex Tasks" Edith Law Harvard University (Graduated from CMU) Edith is a faculty candidate; she ll talk about some of her best work 0.5% bonus point for attending. Talk may be recorded 2

Recap Last time: Basics, how to build graph, store graph, laws, etc.# Today: Centrality measures, algorithms, interactive applications for visualization and recommendation 3

Centrality = Importance

Why Node Centrality? What can we do if we can rank all the nodes in a graph (e.g., Facebook, LinkedIn, Twitter)?# Find celebrities or influential people in a social network (Twitter)# Find gatekeepers who connect communities (headhunters love to find them on LinkedIn)# What else?# 5

More generally Helps graph analysis, visualization, understanding, e.g.,# let us rank nodes, group or study them by centrality# only show subgraph formed by the top 100 nodes, out of the millions in the full graph# similar to google search results (ranked, and they only show you 10 per page)# Most graph analysis packages already have centrality algorithms implemented. Use them!# Can also compute edge centrality. Here we focus on node centrality. 6

Degree Centrality (easiest) Degree = number of neighbors" For directed graphs# in degree = # incoming edges# out degree = # outgoing edges# Algorithms?# Sequential scan through edge list# What about for a graph stored in SQLite? 7

Computing degrees using SQL Recall simplest way to store a graph in SQLite:# edges(source_id, target_id)! 1. Create index for each column# 2. Use group by statement to find node degrees# select count(*) from edges group by source_id; 8

Betweenness Centrality High betweenness # = important gatekeeper or liaison# Betweenness of a node v# = # Number of shortest paths between s and t that involves v Number of shortest paths between s and t = how often a node serves as the bridge that connects two other nodes. 9

Clustering Coefficient A node s clustering coefficient is a measure of how close the node s neighbors are from forming a clique.# 1 = neighbors form a clique# 0 = No edges among neighbors# (Assuming undirected graph) 10

Computing Clustering Coefficient... Requires triangle counting# Real social networks have a lot of triangles# Friends of friends are friends # But: triangles are expensive to compute# # (3-way join; several approx. algos)# Can we do that quickly? 11

Super Fast Triangle Counting [Tsourakakis ICDM 2008] details But: triangles are expensive to compute (3-way join; several approx. algos) Q: Can we do that quickly? A: Yes! #triangles = 1/6 Sum ( (λi)^3 ) (and, because of skewness, we only need the top few eigenvalues! 12

1000x+ speed-up, >90% accuracy 13

PageRank (Google) Larry Page Sergey Brin Brin, Sergey and Lawrence Page (1998). Anatomy of a Large-Scale Hypertextual Web Search Engine. 7th Intl World Wide Web Conf.

PageRank: Problem Given a directed graph, find its most interesting/central node A node is important, if it is connected with important nodes (recursive, but OK!)

PageRank: Solution Given a directed graph, find its most interesting/central node# Proposed solution: use random walk; spot most popular node (-> steady state probability (ssp)) state = webpage A node has high ssp, if it is connected with high ssp nodes (recursive, but OK!)

(Simplified) PageRank Let B be the transition matrix: transposed, column-normalized To From B 1 2 3 = 4 5

(Simplified) PageRank B p = p B p = p 1 2 3 = 4 5

(Simplified) PageRank B p = 1 * p thus, p is the eigenvector that corresponds to the highest eigenvalue (=1, since the matrix is column-normalized)# Why does such a p exist? # p exists if B is nxn, nonnegative, irreducible [Perron Frobenius theorem]

(Simplified) PageRank In short: imagine a particle randomly moving along the edges# compute its steady-state probability (ssp)# # Full version of algorithm: with occasional random jumps# Why? To make the matrix irreducible

Full Algorithm With probability 1-c, fly-out to a random node# Then, we have# p = c B p + (1-c)/n 1 => p = (1-c)/n [I - c B] -1 1

Full Algorithm With probability 1-c, fly-out to a random node# Then, we have# p = c B p + (1-c)/n 1 => p = (1-c)/n [I - c B] -1 1

PageRank for graphs (generally) You can compute PageRank for any graphs# Should be in your algorithm toolbox # Better than simple centrality measure (e.g., degree) # Fast to compute for large graphs (O(E))# But can be misled (Google Bomb)# How? 23

Personalized PageRank Make one small variation of PageRank# Intuition: not all pages are equal, some more relevant to a person s specific needs# How? 24

Personalizing PageRank With probability 1-c, fly-out to a random node some preferred nodes# Then, we have# p = c B p + (1-c)/n 1 => p = (1-c)/n [I - c B] -1 1

Why learn Personalized PageRank? Can be used for recommendation, e.g.,# If I like this webpage, what would I also be interested?# If I like this product, what other products I also like? (in a user-product bipartite graph)# Also helps with visualizing large graphs# Instead of visualizing every single nodes, visualize the most important ones# Again, very flexible. Can be run on any graph. 26

Building an interactive application Will show you an example application (Apolo) that uses a diffusion-based algorithm to perform recommendation on a large graph# Personalized PageRank (= Random Walk with Restart)" Belief Propagation (powerful inference algorithm, for fraud detection, image segmentation, error-correcting codes, etc.)# Spreading activation or degree of interest in Human- Computer Interaction (HCI)# Guilt-by-association techniques 27

Building an interactive application Why diffusion-based algorithms are widely used? # Intuitive to interpret uses network effect, homophily, etc.# Easy to implement Math is relatively simple# Fast run time linear to #edges, or better# Probabilistic meaning 28

Human-In-The-Loop Graph Mining Apolo: Machine Learning + Visualization CHI 2011 Apolo: Making Sense of Large Network Data by Combining Rich User Interaction and Machine Learning 29

Finding More Relevant Nodes HCI# Paper Citation network Data Mining Paper 30

Finding More Relevant Nodes HCI# Paper Citation network Data Mining Paper 30

Finding More Relevant Nodes HCI# Paper Citation network Data Mining Paper Apolo uses guilt-by-association (Belief Propagation, similar to personalized PageRank) 30

Demo: Mapping the Sensemaking Literature Nodes: 80k papers from Google Scholar (node size: #citation) Edges: 150k citations# 31

Key Ideas (Recap) Specify exemplars# Find other relevant nodes (BP) 33

Apolo s Contributions 1 Human + Machine It was like having a partnership with the machine. Apolo User 2 Personalized Landscape 34

Apolo 2009 35

Apolo 2010 36

Apolo 2011 22,000 lines of code. Java 1.6. Swing. Uses SQLite3 to store graph on disk 37

User Study Used citation network# Task: Find related papers for 2 sections in a survey paper on user interface# Model-based generation of UI# Rapid prototyping tools 38

Between subjects design# Participants: grad student or research staff 39

39

39

Judges Scores 16 Apolo Scholar Score 8 0 Model-# based *Prototyping# *Average# Higher is better.# Apolo wins. * Statistically significant, by two-tailed t test, p <0.05 40

Apolo: Recap A mixed-initiative approach for exploring and creating personalized landscape for large network data# Apolo = ML + Visualization + Interaction 41

Practitioners guide to building (interactive) applications? Important that you pick a good problem!# Otherwise, you solve a non-problem, and nobody cares# Think about scalability early# e.g., picking a scalable algorithm early on# When building interactive applications, use iterative design approach (as in Apolo)# Why? It s hard to get it right the first time# Create prototype, evaluate, modify prototype, evaluate,...# Quick evaluation helps you identify important fixes early (can save you a lot of time) 42

Practitioners guide to building (interactive) applications? How to do iterative design?# What kinds of prototypes? # Paper prototype, lo-fi prototype, high-fi prototype# What kinds of evaluation? # Recruit your friends to try your tools# Lab study (controlled, as in Apolo) # Longitudinal study (usage over months)# Deploy it and see the world s reaction!# To learn more:# CS 6750 Human-Computer Interaction# CS 6455 User Interface Design and Evaluation 43