Scalable Network Analysis

Similar documents
Scalable Clustering of Signed Networks Using Balance Normalized Cut

Social and Information Network Analysis Positive and Negative Relationships

Positive and Negative Relations

Link Prediction for Social Network

Link Sign Prediction and Ranking in Signed Directed Social Networks

Data Mining Techniques

Positive and Negative Links

CS224W Final Report Emergence of Global Status Hierarchy in Social Networks

ECS289: Scalable Machine Learning

Data Mining Techniques

Distributed stochastic gradient descent for link prediction in signed social networks

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Scalable Clustering of Signed Networks Using Balance Normalized Cut

Graph Mining: Overview of different graph models

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks

CS249: ADVANCED DATA MINING

IMP: A Message-Passing Algorithm for Matrix Completion

V2: Measures and Metrics (II)

Supervised Random Walks

Analyzing Stochastic Gradient Descent for Some Non- Convex Problems

AspEm: Embedding Learning by Aspects in Heterogeneous Information Networks

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Practice Questions for Midterm

QUINT: On Query-Specific Optimal Networks

Graph Data Management

Algorithms and Applications in Social Networks. 2017/2018, Semester B Slava Novgorodov

Databases and Information Retrieval Integration TIETS42. Kostas Stefanidis Autumn 2016

Graph Data Management Systems in New Applications Domains. Mikko Halin

Conflict Graphs for Parallel Stochastic Gradient Descent

Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets

Machine Learning Basics: Stochastic Gradient Descent. Sargur N. Srihari

CSE 258. Web Mining and Recommender Systems. Advanced Recommender Systems

Massive Data Analysis

WHITE PAPER NGINX An Open Source Platform of Choice for Enterprise Website Architectures

DeepWalk: Online Learning of Social Representations

Resource and Performance Distribution Prediction for Large Scale Analytics Queries

Spectral Clustering and Community Detection in Labeled Graphs

CSE 158 Lecture 8. Web Mining and Recommender Systems. Extensions of latent-factor models, (and more on the Netflix prize)

Learning Low-rank Transformations: Algorithms and Applications. Qiang Qiu Guillermo Sapiro

Batch-Incremental vs. Instance-Incremental Learning in Dynamic and Evolving Data

Recommender System. What is it? How to build it? Challenges. R package: recommenderlab

Harp-DAAL for High Performance Big Data Computing

Graph-Parallel Problems. ML in the Context of Parallel Architectures

CS 5220: Parallel Graph Algorithms. David Bindel

Parallelism. CS6787 Lecture 8 Fall 2017

ECS289: Scalable Machine Learning

Performance and Scalability

Non-exhaustive, Overlapping k-means

DeepFace: Closing the Gap to Human-Level Performance in Face Verification

Machine Learning Basics. Sargur N. Srihari

CPSC 340: Machine Learning and Data Mining. Recommender Systems Fall 2017

modern database systems lecture 10 : large-scale graph processing

Constrained Convolutional Neural Networks for Weakly Supervised Segmentation. Deepak Pathak, Philipp Krähenbühl and Trevor Darrell

Mining Web Data. Lijun Zhang

Modularity CMSC 858L

MIND: Machine Learning based Network Dynamics. Dr. Yanhui Geng Huawei Noah s Ark Lab, Hong Kong

Uncovering the Formation of Triadic Closure in Social Networks. Zhanpeng Fang and Jie Tang Tsinghua University

Structure of Social Networks

Challenges in Multiresolution Methods for Graph-based Learning

Decentralized and Distributed Machine Learning Model Training with Actors

Why do we need graph processing?

Parallel Stochastic Gradient Descent: The case for native GPU-side GPI

TGNet: Learning to Rank Nodes in Temporal Graphs. Qi Song 1 Bo Zong 2 Yinghui Wu 1,3 Lu-An Tang 2 Hui Zhang 2 Guofei Jiang 2 Haifeng Chen 2

Complex-Network Modelling and Inference

Paradigm Shift of Database

Convex Optimization MLSS 2015

Balanced and partitionable signed graphs

COMP6237 Data Mining Data Mining & Machine Learning with Big Data. Jonathon Hare

Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds

An Exploratory Journey Into Network Analysis A Gentle Introduction to Network Science and Graph Visualization

CS224W Project: Recommendation System Models in Product Rating Predictions

Extracting Communities from Networks

Metric Learning for Large-Scale Image Classification:

A Brief Look at Optimization

ActiveClean: Interactive Data Cleaning For Statistical Modeling. Safkat Islam Carolyn Zhang CS 590

FastText. Jon Koss, Abhishek Jindal

Apache Giraph. for applications in Machine Learning & Recommendation Systems. Maria Novartis

Networks in economics and finance. Lecture 1 - Measuring networks

Case Study 4: Collaborative Filtering. GraphLab

Local Community Detection in Dynamic Graphs Using Personalized Centrality

Epilog: Further Topics

Web Personalization & Recommender Systems

Introduction to Data Management. Lecture #1 (Course Trailer )

Latent Space Model for Road Networks to Predict Time-Varying Traffic. Presented by: Rob Fitzgerald Spring 2017

CSE 190 Lecture 16. Data Mining and Predictive Analytics. Small-world phenomena

over Multi Label Images

Phylogenetics on CUDA (Parallel) Architectures Bradly Alicea

Algorithmic and Economic Aspects of Networks. Nicole Immorlica

Dynamic and Historical Shortest-Path Distance Queries on Large Evolving Networks by Pruned Landmark Labeling

Mining Data that Changes. 17 July 2015

Graph based machine learning with applications to media analytics

JAVASCRIPT CHARTING. Scaling for the Enterprise with Metric Insights Copyright Metric insights, Inc.

Progress Report: Collaborative Filtering Using Bregman Co-clustering

LDA for Big Data - Outline

Know your neighbours: Machine Learning on Graphs

Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models

arxiv: v1 [cs.si] 8 Jun 2012

STREAMING RANKING BASED RECOMMENDER SYSTEMS

Survey on Process in Scalable Big Data Management Using Data Driven Model Frame Work

Databases 2 (VU) ( / )

Transcription:

Inderjit S. Dhillon University of Texas at Austin COMAD, Ahmedabad, India Dec 20, 2013

Outline Unstructured Data - Scale & Diversity Evolving Networks Machine Learning Problems arising in Networks Recommender Systems Link Prediction Sign Prediction Formulation as Missing Value Estimation Scalable Algorithms NOMAD: Distributed matrix completion algorithm Results on Applications Conclusions 2

Structured Data Data organized into fields: Relational databases, spreadsheets, XML Highly optimized for storage & retrieval (e.g. using SQL) 3

Structured Data Focus is on data format, efficient storage & search Less or no uncertainty in semantics: e.g. businesses know the fields of the data 4

Unstructured Data Modern data is unstructured and diverse Networks, Graphs Text Images, Videos 5

Unstructured Data Much greater growth rate 6

Unstructured Data Dynamic aspects of unstructured data: Constantly evolving Uncertainties abound: What should I ask of the data? Seek insights Heterogeneity renders traditional database models inadequate 7

Unstructured Data Buzzwords - Big Data & Data Science Machine Learning: Predictive models for data Engineering perspective: Scale matters 8

Network Graphs Social networks (Friendship) APQ6 AQP1 AQP5 Bipartite networks (Membership, Ratings, etc.) MIP AQP2 GK Gene networks (Functional interaction) 9

Graph Evolution Social networks are highly dynamic Constantly grow, change quickly over time Users arrive/leave, relationships form/dissolve Understanding graph evolution is important 10

Graphs meet Machine Learning Network analysis: Understanding structure & evolution of networks Formulate predictive problems on the adjacency matrix of the graph Confluence of graph theory & machine learning 11

Users Recommender Systems Movies Link Prediction Netflix problem: 100M ratings, 0.5M users, 20K movies A Toy Problem In Comparison! Facebook growth 12

Users Recommender Systems Movies 13

Link Prediction in Social Networks?? Network at time T Network at time T + 1 Problem: Infer missing relationships from a given snapshot of the network 14

Predicting gene-disease links? 15

Signed Social Networks Like Dislike Like / Dislike? Sign Prediction Problem: Given a snapshot of the signed social network, predict the signs of missing edges 16

Formulation as Missing Value Estimation

Users Low-rank Matrix Completion Movies 18

Users Low-rank Matrix Completion Movies? 19

Low-rank Matrix Completion 20

Low-rank Matrix Completion min W 2R m k,h2r n k X (i,j)2 (A ij W i H T j ) 2 + (kw k 2 F + khk 2 F ) 21

Low-rank Matrix Completion? min W 2R m k,h2r n k X (i,j)2 (A ij W i H T j ) 2 + (kw k 2 F + khk 2 F ) 22

Low-rank Matrix Completion 3.44 min W 2R m k,h2r n k X (i,j)2 (A ij W i H T j ) 2 + (kw k 2 F + khk 2 F ) 23

Link Prediction Can be posed as matrix completion problem A (t) = 2 3. 1 1.. 1. 1 1 1 61 1.. 1 7 4. 1... 5. 1 1.. 2 3 1 1 61 7 1 1 1 1 1 415 1 Issue: Only positive relationships are observed Test Link Score (4,5) 1 (1,4) 1 (3,4) 1 (1,5) 1 24

Link Prediction Formulate Biased Matrix Completion Problem: min W 2R n k,h2r n k X (i,j)2 (A ij W i H T j ) 2 + X (i,j) /2 (A ij W i H T j ) 2 + (kw k 2 F + khk 2 F ) A (t) = 2 3. 1 1.. 1. 1 1 1 61 1.. 1 7 4. 1... 5. 1 1.. 2 3 0.85 1.09 60.97 7 0.85 1.09 0.97 0.69 0.85 40.695 0.85 Test Link Score (4,5) 0.59 (1,4) 0.59 (3,4) 0.67 (1,5) 0.72 25

Signed Social Networks Balanced Not balanced Friend of a friend is a friend Enemy of an enemy is a friend Social Balance [Harary,1953]: In real-world signed networks, triangles tend to be balanced 26

Signed Social Networks Theorem: All triangles in a network are balanced if and only if there exist two antagonistic groups. 27

Signed Social Networks Weakly Balanced Not balanced Relaxation: Weak balance Allow triangles with all negative edges 28

Signed Social Networks Theorem: All triangles in a network are weakly balanced if and only if there exist k antagonistic groups. 29

Sign Prediction Theorem: A k-weakly balanced signed network has rank at most k. Sign inference can be posed as low-rank matrix completion Theorem: If there are no small groups, the underlying network can be exactly recovered, under certain conditions. K. Chiang et al. Prediction and Clustering in Signed Networks: A Local to Global Perspective. To appear in JMLR. 30

Scalable Algorithms

Stochastic Gradient Sample random index (i,j) and update corresponding factors: w i w i ((A ij w T i h j )h j + w i ) h j h j ((A ij w T i h j )w i + h j ) Time per update O(k) Effective for very large-scale problems 32

Distributed Stochastic Gradient Descent (DSGD) [Gemulla et al. KDD 2011] Decoupled updates Easy to parallelize But communication & computation are interleaved 33

DSGD 34

DSGD Curse of the last reducer 35

DSGD 36

NOMAD Goal: Keep CPU & network simultaneously busy. Non-locking stochastic Multi-machine algorithm for Asynchronous & Decentralized matrix factorization Asynchronous distributed solution. 37

NOMAD Nomadic variables queue h 1 h 5 h 4 h 3 Native variables w 1 w 2 w 3 Workers w 4 h 2 38

NOMAD Nomadic variables queue h 1 h 5 h 3 Native variables Push h 4 w 1 w 2 w 3 Workers w 4 h 2 h 4 39

NOMAD 40

NOMAD 41

NOMAD 42

NOMAD 43

NOMAD 44

NOMAD 45

NOMAD 46

NOMAD 47

NOMAD 48

NOMAD 49

NOMAD 50

NOMAD Algorithm 1. Initialize: Randomly assign h j columns to worker queues 2. Parallel Foreach q in {1,2,...,p} 3. If queue[q] not empty then 4. (j, h j ) queue[q].pop() 5. for ( i, j) in (q) do j 6. Do SGD updates 7. end for 8. Sample q uniformly from {1,2,...,p} 9. queue[q ].push( (j, h j ) 10. end if 11. Parallel End 51

NOMAD Algorithm 1. Initialize: Randomly assign h j columns to worker queues 2. Parallel Foreach q in {1,2,...,p} 3. If queue[q] not empty then 4. (j, h j ) queue[q].pop() 5. for ( i, j) in (q) do j 6. Do SGD updates Distributed setting: 7. end for Write over the network 8. Sample q uniformly from {1,2,...,p} 9. queue[q ].push( (j, h j ) 10. end if 11. Parallel End Concurrent object 52

Algorithm Complexity Average space required per worker: O(mk/p + nk/p + /p) Average time for one sweep: O( k/p) 53

Results on Applications

Recommender Systems Netflix dataset: 2,649,429 users, 17,770 movies, ~100M ratings Multicore Distributed 55

Recommender Systems Synthetic dataset: ~85M users, 17,770 items, ~8.5B observations 56

Link Prediction Flickr dataset: 1.9M users & 42M links. Test set: sampled 5K users. 57

Predicting Gene-Disease Links 0.16 0.14 CATAPULT(Blom et al.,2013) Katz Biased MF Test pair rank 0.08 0.07 CATAPULT(Blom et al.,2013) Katz Biased MF Singleton gene rank 0.12 0.06 0.1 0.05 Pr(Rank <= x) 0.08 0.06 Pr(Rank <= x) 0.04 0.03 0.04 0.02 0.02 0.01 0 0 10 20 30 40 50 60 70 80 90 100 Rank 0 0 10 20 30 40 50 60 70 80 90 100 Rank 58

Sign Prediction Epinions dataset (+ve & -ve reviews): 131K nodes, 840K edges, 15% edges negative. MF-ALS is faster and achieves higher accuracy. MF-ALS takes 455 secs on network with 1.1M nodes & 120M edges 59

Conclusions Rapid growth of unstructured data demands scalable machine learning solutions for analysis Machine learning problems arising in network analysis can be cast in the matrix completion framework Our proposed asynchronous distributed algorithm NOMAD outperforms state-of-the-art matrix completion solvers Beyond Matrix Completion: Asynchronous distributed framework for solving machine learning problems 60