Basics, how to build & store graphs, laws, etc. Centrality, and algorithms you should know

CSE6242 / CX4242: Data & Visual Analytics
http://poloclub.gatech.edu/cse6242
Graphs / Networks
Basics, how to build & store graphs, laws, etc. Centrality, and algorithms you should know
Duen Horng (Polo) Chau
Assistant Professor; Associate Director, MS Analytics; Georgia Tech
Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Parishit Ram (GT PhD alum; SkyTree), Alex Gray

Internet: 50 billion web pages (www.worldwidewebsize.com, www.opte.org)

Facebook: 1.2 billion users (image modified from Marc_Smith, flickr)

Citation network: 250 million articles (www.scirus.com/press/html/feb_2006.html#2; image modified from well-formed.eigenfactor.org)

Many more
Twitter: who-follows-whom (288 million users)
Who-buys-what (120 million users)
Cellphone network: who-calls-whom (100 million users)
Protein-protein interactions: 200 million possible interactions in the human genome
Sources: www.selectscience.net, www.phonedog.com, www.mediabistro.com, www.practicalecommerce.com/

Large graphs I analyzed
  Graph                         Nodes         Edges
  YahooWeb                      1.4 billion   6 billion
  Symantec machine-file graph   1 billion     37 billion
  Twitter                       104 million   3.7 billion
  Phone call network            30 million    260 million

How to represent a graph? Conceptually. Visually. Programmatically.

How to represent a graph?
Visually: node-link drawing of nodes 1, 2, 3, 4 (figure)
Adjacency matrix (rows = source node, columns = target node):
       1  2  3  4
    1  0  1  3  0
    2  0  0  0  2
    3  0  1  0  0
    4  0  0  0  0
Adjacency list:
    1: 2, 3
    2: 4
    3: 2
Edge list:
    1, 2, 1
    1, 3, 3
    2, 4, 2
    3, 2, 1
The edge list is the most common distribution format. It is sometimes painful to parse when edges/nodes have many columns (some are text with double/single quotes, some are integers, some decimals, ...).

How to represent a graph?
Each node is often identified by a numeric ID. Why?
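
The adjacency list and adjacency matrix can both be derived from the edge list. Here is a minimal Python sketch, assuming the three edge-list columns are (source, target, weight), which is how the example matrix reads. The function names are illustrative and not from the lecture.

```python
# Weighted, directed edge list from the example: (source, target, weight).
edges = [(1, 2, 1), (1, 3, 3), (2, 4, 2), (3, 2, 1)]
nodes = [1, 2, 3, 4]

def to_adjacency_list(edges, nodes):
    """Map each source node to the list of its target nodes."""
    adj = {n: [] for n in nodes}
    for source, target, _weight in edges:
        adj[source].append(target)
    return adj

def to_adjacency_matrix(edges, nodes):
    """Rows are source nodes, columns are target nodes, cells hold weights."""
    index = {n: i for i, n in enumerate(nodes)}  # node ID -> row/column position
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target, weight in edges:
        matrix[index[source]][index[target]] = weight
    return matrix

adjacency_list = to_adjacency_list(edges, nodes)
adjacency_matrix = to_adjacency_matrix(edges, nodes)
```

Small consecutive integer IDs make the `index` lookup trivial, which is one practical reason to prefer them over names.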

Assigning an ID to a node
Use a map (Java) / dictionary (Python) / SQLite
Same concept: given an entity/node (e.g., Tom) not seen before, assign a number to it
Example of using SQLite to map names to IDs (rowid is a hidden column that SQLite automatically creates for you):
    rowid  name
    1      Tom
    2      Sandy
    3      Richard
    4      Polo
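
The dictionary variant of this idea can be sketched in a few lines of Python. The function name `assign_ids` is hypothetical, and starting the IDs at 1 is an assumption made only to mirror the rowid values in the SQLite example.

```python
def assign_ids(names):
    """Assign consecutive integer IDs (starting at 1) to names in first-seen order."""
    ids = {}
    for name in names:
        if name not in ids:            # entity not seen before
            ids[name] = len(ids) + 1   # next unused number
    return ids

# "Tom" appears twice but keeps the ID from its first appearance.
node_ids = assign_ids(["Tom", "Sandy", "Richard", "Polo", "Tom"])
```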

How to use the node IDs?
Create an index for name. Then write a join query.
Names:
    rowid  name
    1      Tom
    2      Sandy
    3      Richard
    4      Polo
Edges by name:
    source  target
    Tom     Sandy
    Polo    Richard
Edges by ID:
    source  target
    1       2
    4       3
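
One way the index and the join query might look, using Python's built-in sqlite3 module. The table names (`names`, `edges_by_name`) and the index name are assumptions for illustration; the slide does not name them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE names (name TEXT);              -- rowid is created automatically
    CREATE TABLE edges_by_name (source TEXT, target TEXT);
    CREATE INDEX idx_names_name ON names(name);  -- speeds up the lookups in the join
""")
conn.executemany("INSERT INTO names (name) VALUES (?)",
                 [("Tom",), ("Sandy",), ("Richard",), ("Polo",)])
conn.executemany("INSERT INTO edges_by_name VALUES (?, ?)",
                 [("Tom", "Sandy"), ("Polo", "Richard")])

# Join the name table twice: once for the source name, once for the target name.
edges_by_id = conn.execute("""
    SELECT s.rowid, t.rowid
    FROM edges_by_name AS e
    JOIN names AS s ON s.name = e.source
    JOIN names AS t ON t.name = e.target
    ORDER BY s.rowid
""").fetchall()
```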

How to store large graphs?

How large is large? What do you think? In what units? Thousands? Millions? How do you measure a graph's size? By... (Hint: highly subjective. And domain specific.)

Storing large graphs...
On your laptop computer
    SQLite
    Neo4j (GPL license) http://neo4j.com/licensing/
On a server
    MySQL, PostgreSQL, etc.
    Neo4j (?)

GPL
In purely private (or internal) use, with no sales and no distribution, the software code may be modified and parts reused without requiring the source code to be released. For sales or distribution, the entire source code needs to be made available to end users, including any code changes and additions; in that case, copyleft is applied to ensure that end users retain the freedoms defined above.
https://en.wikipedia.org/wiki/gnu_general_public_license

Storing large graphs...
With a cluster (more details a few lectures down)
    Titan (on top of HBase), S2Graph: if you need real-time read and write
    Hadoop (generic framework): if batch processing is fine
    Hama, Giraph: inspired by Google's Pregel
    FlockDB, by Twitter
    Turi (Apple) / Dato / GraphLab

Storing large graphs on your computer
I like to use SQLite. Why? Good enough for my use.
    Easily handles up to gigabytes: roughly tens of millions of nodes/edges (perhaps up to billions?). Very good for today's standard!
    Very easy to maintain: one cross-platform file
    Has programming wrappers in numerous languages: C++, Java (Android), Python, Objective-C (iOS), ...
    Queries are so easy! e.g., finding all nodes' degrees takes 1 SQL statement
    Bonus: SQLite even supports full-text search
    Offline application support (iPad)
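
As an illustration of the "one SQL statement" claim, here is a hedged sketch that computes every node's degree with Python's sqlite3 module. It assumes a two-column `edges(source_id, target_id)` table and counts in- and out-edges together; the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (source_id INTEGER, target_id INTEGER)")
conn.executemany("INSERT INTO edges VALUES (?, ?)",
                 [(1, 2), (1, 3), (2, 4), (3, 2)])

# One statement: stack both endpoints of every edge, then count rows per node.
degrees = conn.execute("""
    SELECT node_id, COUNT(*) AS degree
    FROM (SELECT source_id AS node_id FROM edges
          UNION ALL
          SELECT target_id AS node_id FROM edges)
    GROUP BY node_id
    ORDER BY node_id
""").fetchall()
```

Dropping one branch of the `UNION ALL` gives out-degree or in-degree only.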

SQLite graph database schema
Simplest schema: edges(source_id, target_id)
More sophisticated (flexible; lets you store more things):
    CREATE TABLE nodes (
        id INTEGER PRIMARY KEY,
        type INTEGER DEFAULT 0,
        name VARCHAR DEFAULT '');
    CREATE TABLE edges (
        source_id INTEGER,
        target_id INTEGER,
        type INTEGER DEFAULT 0,
        weight FLOAT DEFAULT 1,
        timestamp INTEGER DEFAULT 0,
        PRIMARY KEY(source_id, target_id, timestamp));
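
The sketch below loads the more sophisticated schema through Python's sqlite3 module and shows one consequence of its composite primary key: the same node pair can have several edges as long as the timestamps differ. The inserted rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (
        id INTEGER PRIMARY KEY,
        type INTEGER DEFAULT 0,
        name VARCHAR DEFAULT '');
    CREATE TABLE edges (
        source_id INTEGER,
        target_id INTEGER,
        type INTEGER DEFAULT 0,
        weight FLOAT DEFAULT 1,
        timestamp INTEGER DEFAULT 0,
        PRIMARY KEY(source_id, target_id, timestamp));
""")
conn.executemany("INSERT INTO nodes (id, name) VALUES (?, ?)",
                 [(1, "Tom"), (2, "Sandy")])

# Same node pair at two different timestamps: both rows are accepted.
conn.execute("INSERT INTO edges (source_id, target_id, timestamp) VALUES (1, 2, 100)")
conn.execute("INSERT INTO edges (source_id, target_id, weight, timestamp) "
             "VALUES (1, 2, 2.5, 200)")

# Same node pair at an already-used timestamp: rejected by the primary key.
try:
    conn.execute("INSERT INTO edges (source_id, target_id, timestamp) VALUES (1, 2, 100)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

rows = conn.execute(
    "SELECT source_id, target_id, type, weight, timestamp FROM edges ORDER BY timestamp"
).fetchall()
```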

[Side note: you have already done this in HW1]
Full-Text Search (FTS) on SQLite
http://www.sqlite.org/fts3.html
Very simple. Built-in. Only needs 3 lines of commands.
Create FTS table (index):
    CREATE VIRTUAL TABLE critics_consensus USING fts4(consensus);
Insert text into FTS table:
    INSERT INTO critics_consensus SELECT critics_consensus FROM movies;
Query using the MATCH keyword:
    SELECT * FROM critics_consensus WHERE consensus MATCH 'funny OR horror';
Originally developed by Google engineers
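
The three FTS commands can be tried end to end from Python's sqlite3 module. This sketch assumes the SQLite library bundled with your Python was compiled with FTS4 enabled (common, but not guaranteed), and the `movies` rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical source table standing in for the HW1 movies table.
conn.execute("CREATE TABLE movies (title TEXT, critics_consensus TEXT)")
conn.executemany("INSERT INTO movies VALUES (?, ?)", [
    ("Movie A", "A funny and warm comedy"),
    ("Movie B", "A tense horror film"),
    ("Movie C", "A slow period drama"),
])

# 1. Create the FTS table (index).
conn.execute("CREATE VIRTUAL TABLE critics_consensus USING fts4(consensus)")
# 2. Insert the text into the FTS table.
conn.execute("INSERT INTO critics_consensus SELECT critics_consensus FROM movies")
# 3. Query with MATCH; sorted here only to make the result order deterministic.
matches = sorted(row[0] for row in conn.execute(
    "SELECT consensus FROM critics_consensus WHERE consensus MATCH 'funny OR horror'"))
```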

I have a graph dataset. Now what?
Analyze it! Do data mining or graph mining.
What does it look like? Visualize it if it's small.
Does it follow any expected patterns? Or does it *not* follow some patterns (outliers)? Yuck.
Why does this matter?
    If we know the patterns (models), we can do prediction, recommendation, etc., e.g., is Alice going to friend Bob on Facebook? People often buy beer and diapers together.
    Outliers often give us new insights, e.g., a telemarketer's friends don't know each other.

Finding patterns & outliers in graphs
Outlier/anomaly detection
    To spot them, we need to find patterns first
    Anomalies = things that do not fit the patterns
To do this effectively, we need large datasets
    Patterns and anomalies don't show up well in small datasets

Are real graphs random?
Random graph (Erdős–Rényi): 100 nodes, avg degree = 2
http://en.wikipedia.org/wiki/Erdős–Rényi_model
Figures: before layout, after layout. No obvious patterns.
Graph and layout generated with Pajek http://vlado.fmf.uni-lj.si/pub/networks/pajek/
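
A random graph with these parameters is easy to generate yourself. The sketch below is not how the slide's figure was produced (that used Pajek). It uses the fixed-edge-count variant of the Erdős–Rényi model, G(n, M), picking M = 100 edges uniformly at random so that the average degree 2M/n is exactly 2. The function name and seed are arbitrary.

```python
import random
from itertools import combinations

def random_graph(num_nodes, avg_degree, seed=0):
    """Pick undirected edges uniformly at random so the average degree is avg_degree."""
    num_edges = round(num_nodes * avg_degree / 2)  # each edge adds 2 to the degree sum
    rng = random.Random(seed)
    all_pairs = list(combinations(range(num_nodes), 2))  # every possible edge, once
    return rng.sample(all_pairs, num_edges)              # sample without replacement

edges = random_graph(100, 2)

degree = [0] * 100
for u, v in edges:
    degree[u] += 1
    degree[v] += 1
average_degree = sum(degree) / len(degree)
```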

Laws and patterns
Are real graphs random? A: NO!!!
    Diameter (longest shortest path)
    In- and out-degree distributions
    Other (surprising) patterns
So, let's look at the data
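
Both quantities are straightforward to compute on a small graph. The following is a minimal sketch, assuming the graph is a dict that maps every node to its list of out-neighbors and that every target also appears as a key. Breadth-first search from each node is fine for small graphs but far too slow for billion-edge ones.

```python
from collections import Counter, deque

def bfs_distances(adj, start):
    """Hop counts from start to every node reachable from it."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def diameter(adj):
    """Longest shortest path over all node pairs where one can reach the other."""
    return max(max(bfs_distances(adj, s).values()) for s in adj)

def out_degree_distribution(adj):
    """Counter mapping out-degree -> number of nodes with that out-degree."""
    return Counter(len(targets) for targets in adj.values())

def in_degree_distribution(adj):
    """Counter mapping in-degree -> number of nodes with that in-degree."""
    in_degree = {node: 0 for node in adj}
    for targets in adj.values():
        for target in targets:
            in_degree[target] += 1
    return Counter(in_degree.values())

# The four-node example graph used earlier in the lecture.
adj = {1: [2, 3], 2: [4], 3: [2], 4: []}
graph_diameter = diameter(adj)
out_dist = out_degree_distribution(adj)
in_dist = in_degree_distribution(adj)
```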

Power law in degree distribution
Faloutsos, Faloutsos, Faloutsos [SIGCOMM99]. Seminal paper. Must read! (Christos was Polo's advisor.)
Plot: internet domains, log(degree) vs. log(rank), slope -0.82; att.com and ibm.com are labeled
Zipf's law: the frequency of any item is inversely proportional to the item's rank (when ranked by decreasing frequency)
http://en.wikipedia.org/wiki/zipf%27s_law
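
A slope like the one on this plot can be estimated by fitting a straight line to log(degree) against log(rank). The sketch below uses ordinary least squares on synthetic degrees that follow rank^(-0.82) exactly. It is not the paper's data or fitting procedure, and a least-squares fit on log-log axes is only a rough estimator for real, noisy data.

```python
import math

def loglog_slope(values):
    """Least-squares slope of log(value) against log(rank); rank 1 is the largest value.
    All values must be positive."""
    ranked = sorted(values, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(v) for v in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance

# Synthetic degrees following degree = 1000 * rank^(-0.82) exactly (not real data).
synthetic_degrees = [1000 * rank ** -0.82 for rank in range(1, 201)]
slope = loglog_slope(synthetic_degrees)
```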

Power law in eigenvalues of adjacency matrix
Plot: eigenvalue vs. rank of decreasing eigenvalue; eigen exponent = slope = -0.48

How about graphs from other domains?

More power laws: web hit counts [Alan L. Montgomery and Christos Faloutsos]
Plot: web site traffic, log(#website) vs. log(#website visit); labels: ebay, users, sites

epinions.com: who-trusts-whom [Richardson + Domingos, KDD 2001]
Plot: count vs. (out) degree; annotation: trusts-2000-people user

And numerous more
    # of sexual contacts
    Income [Pareto]: 80-20 distribution
    Duration of downloads [Bestavros+]
    Duration of UNIX jobs
    File sizes

Any other laws? Yes!
Small diameter (~ constant!)
    Six degrees of separation / Kevin Bacon
    Small worlds [Watts and Strogatz]

Problem: Time evolution
Jure Leskovec (CMU -> Stanford), Jon Kleinberg (Cornell), Christos Faloutsos (CMU)

Evolution of the diameter
Prior work on power law graphs hints at slowly growing diameter:
    diameter ~ O(log N)
    diameter ~ O(log log N)
What is happening in real data? Diameter shrinks over time.

Diameter: patent citation network
25 years of data; @1999: 2.9 M nodes, 16.5 M edges
Plot: effective diameter vs. time (year)

Why effective diameter?
The maximum diameter is susceptible to outliers. So, we use effective diameter instead, defined as the minimum number of hops in which 90% of connected node pairs can reach each other.
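
The definition translates almost directly into code. Here is a minimal sketch, assuming an adjacency-list dict, at least one connected pair, and counting ordered pairs of distinct nodes. Large-graph tools approximate this rather than running a breadth-first search from every node.

```python
import math
from collections import deque

def bfs_distances(adj, start):
    """Hop counts from start to every node reachable from it."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def effective_diameter(adj, fraction=0.9):
    """Smallest hop count h such that at least `fraction` of connected
    (ordered, distinct) node pairs are within h hops."""
    distances = sorted(
        d for s in adj for t, d in bfs_distances(adj, s).items() if t != s)
    needed = max(1, math.ceil(fraction * len(distances)))
    return distances[needed - 1]

# Undirected path 0-1-2-3-4-5: the diameter is 5, set by the two end nodes alone.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
path_effective_diameter = effective_diameter(path)
```

On this path, 28 of the 30 ordered pairs are within 4 hops, so the effective diameter is 4 even though the maximum is 5.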

Evolution of #nodes and #edges
N(t): nodes at time t
E(t): edges at time t
Suppose that N(t+1) = 2 * N(t)
Q: what is your guess for E(t+1) =? 2 * E(t)
A: over-doubled! But obeying the Densification Power Law

Densification: patent citations
Citations among patents granted; @1999: 2.9 M nodes, 16.5 M edges
Each year is a datapoint
Plot: E(t) vs. N(t), slope 1.66
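
To see what "over-doubled" means numerically, assume the Densification Power Law has the form E(t) ∝ N(t)^a and that the 1.66 on the plot is the exponent a. Both are assumptions about how to read the slide. Doubling the nodes then multiplies the edges by 2^1.66, roughly 3.16, instead of 2. The helper name below is illustrative.

```python
def edges_after_growth(edges_now, node_growth_factor, exponent):
    """Edges predicted by E proportional to N**exponent when N grows by a given factor."""
    return edges_now * node_growth_factor ** exponent

# Exponent 1.66 is taken from the patent-citation plot (assumed to be the fitted slope).
edge_growth = edges_after_growth(1.0, 2.0, 1.66)  # effect of doubling the node count
```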

So many laws! There will be more to come...
To date, there are 11 (or more) laws
RTG: A Recursive Realistic Graph Generator using Random Typing [Akoglu, Faloutsos]

So many laws! What should you do?
Try as many distributions as possible and see if your graph fits them. If it doesn't, find out the reasons. Sometimes it's due to errors/problems in the data; sometimes, it signifies some new patterns!

What might be the reasons for the hills?
Polonium: Tera-Scale Graph Mining and Inference for Malware Detection [Chau, et al]