Graphs / Networks CSE 6242/ CX Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

Similar documents
Graphs / Networks CSE 6242/ CX Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

Graphs / Networks. CSE 6242/ CX 4242 Feb 18, Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

Graphs III. CSE 6242 A / CS 4803 DVA Feb 26, Duen Horng (Polo) Chau Georgia Tech. (Interactive) Applications

Graphs / Networks CSE 6242/ CX Interactive applications. Duen Horng (Polo) Chau Georgia Tech

Analytics Building Blocks

Data Integration. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Text Analytics (Text Mining)

Text Analytics (Text Mining)

Graphs / Networks CSE 6242/ CX Basics, how to build & store graphs, laws, etc. Centrality, and algorithms you should know

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Basics, how to build & store graphs, laws, etc. Centrality, and algorithms you should know

Basics, how to build & store graphs, laws, etc. Centrality, and algorithms you should know

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Scaling Up HBase. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Data Collection, Simple Storage (SQLite) & Cleaning

Link Analysis and Web Search

Scaling Up 1 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. Hadoop, Pig

Big Data Analytics CSCI 4030

Lecture 27: Learning from relational data

Social Network Analysis

Agenda. Math Google PageRank algorithm. 2 Developing a formula for ranking web pages. 3 Interpretation. 4 Computing the score of each page

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS

How to organize the Web?

COMP5331: Knowledge Discovery and Data Mining

Classification Key Concepts

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)

Data Mining Concepts & Tasks

COMP Page Rank

modern database systems lecture 10 : large-scale graph processing

Big Data Analytics CSCI 4030

Social-Network Graphs

Part 1: Link Analysis & Page Rank

Information Networks: PageRank

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank

Introduction to Data Mining

Scaling Up Pig. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Scaling Up Pig. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Social Network Analysis

Motivation. Motivation

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

A brief history of Google

Extracting Information from Complex Networks

Lecture 8: Linkage algorithms and web search

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

Data Mining Concepts & Tasks

Detecting and Analyzing Communities in Social Network Graphs for Targeted Marketing

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

Pregel. Ali Shah

Data Mining Concepts. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Graphs (Part II) Shannon Quinn

Analysis of Biological Networks. 1. Clustering 2. Random Walks 3. Finding paths

CENTRALITIES. Carlo PICCARDI. DEIB - Department of Electronics, Information and Bioengineering Politecnico di Milano, Italy

Graph Theory and Network Measurment

Mining Web Data. Lijun Zhang

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Degree Distribution: The case of Citation Networks

Apache Giraph. for applications in Machine Learning & Recommendation Systems. Maria Novartis

Graph Data Management

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Mosaic: Processing a Trillion-Edge Graph on a Single Machine

CSE 190 Lecture 16. Data Mining and Predictive Analytics. Small-world phenomena

CS6220: DATA MINING TECHNIQUES

Graph and Link Mining

Mining Social Network Graphs

Link Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld

The Anatomy of a Large-Scale Hypertextual Web Search Engine

PEGASUS: A peta-scale graph mining system Implementation. and observations. U. Kang, C. E. Tsourakakis, C. Faloutsos

A New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader

Algorithmic and Economic Aspects of Networks. Nicole Immorlica

Similarity Ranking in Large- Scale Bipartite Graphs

Jure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah

Advanced Computer Architecture: A Google Search Engine

Using! to Teach Graph Theory

DSCI 575: Advanced Machine Learning. PageRank Winter 2018

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

CS425: Algorithms for Web Scale Data

Data mining --- mining graphs

PageRank. CS16: Introduction to Data Structures & Algorithms Spring 2018

Overview of this week

Slides based on those in:

Graph Theory Problem Ideas

Sampling Large Graphs for Anticipatory Analysis

Structure of Social Networks

A Survey of Google's PageRank

Network Centrality. Saptarshi Ghosh Department of CSE, IIT Kharagpur Social Computing course, CS60017

Sergei Silvestrov, Christopher Engström. January 29, 2013

ScienceDirect A NOVEL APPROACH FOR ANALYZING THE SOCIAL NETWORK

Graph Analytics and Machine Learning A Great Combination Mark Hornick

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Introduction Types of Social Network Analysis Social Networks in the Online Age Data Mining for Social Network Analysis Applications Conclusion

Large Scale Complex Network Analysis using the Hybrid Combination of a MapReduce Cluster and a Highly Multithreaded System

HW 4: PageRank & MapReduce. 1 Warmup with PageRank and stationary distributions [10 points], collaboration

Diffusion and Clustering on Large Graphs

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today

Transcription:

CSE 6242/ CX 4242 Graphs / Networks Centrality measures, algorithms, interactive applications Duen Horng (Polo) Chau Georgia Tech Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Le Song

Centrality = Importance

Why Node Centrality? What can we do if we can rank all the nodes in a graph (e.g., Facebook, LinkedIn, Twitter)? Find celebrities or influential people in a social network (Twitter) Find gatekeepers who connect communities (headhunters love to find them on LinkedIn) What else? 3

More generally Helps graph analysis, visualization, understanding, e.g., Let us rank nodes, group or study them by centrality Only show subgraph formed by the top 100 nodes, out of the millions in the full graph Similar to google search results (ranked, and they only show you 10 per page) Most graph analysis packages already have centrality algorithms implemented. Use them! Can also compute edge centrality. Here we focus on node centrality. 4

Degree Centrality (easiest) Degree = number of neighbors For directed graphs In degree = No. of incoming edges 1 2 3 Out degree = No. of outgoing edges 4 For undirected graphs, only degree is defined. Algorithms? Sequential scan through edge list What about for a graph stored in SQLite? 5

Computing Degrees using SQL Recall simplest way to store a graph in SQLite: edges(source_id, target_id) 1. If slow, first create index for each column 2. Use group by statement to find in degrees select count(*) from edges group by source_id; 6

Betweenness Centrality High betweenness = gatekeeper Betweenness of a node v = Number of shortest paths between s and t that goes through v Number of shortest paths between s and t = how often a node serves as the bridge that connects two other nodes. Betweenness is very well studied. http://en.wikipedia.org/wiki/centrality#betweenness_centrality 7

(Local) Clustering Coefficient A node s clustering coefficient is a measure of how close the node s neighbors are from forming a clique. 1 = neighbors form a clique 0 = No edges among neighbors (Assuming undirected graph) Local means it s for a node; can also compute a graph s global coefficient Image source: http://en.wikipedia.org/wiki/clustering_coefficient 8

Computing Clustering Coefficients... Requires triangle counting Real social networks have a lot of triangles Friends of friends are friends Triangles are expensive to compute (neighborhood intersections; several approx. algos) Can we do that quickly? Algorithm details: Faster Clustering Coefficient Using Vertex Covers http://www.cc.gatech.edu/~ogreen3/_docs/2013vertexcoverclusteringcoefficients.pdf 9

Super Fast Triangle Counting [Tsourakakis ICDM 2008] details But: triangles are expensive to compute (3-way join; several approx. algos) Q: Can we do that quickly? A: Yes! #triangles = 1/6 Sum ( λi 3 ) (and, because of skewness, we only need the top few eigenvalues! 10

Power Law in Eigenvalues of Adjacency Matrix Eigenvalue Eigen exponent = slope = -0.48 Rank of decreasing eigenvalue 11

1000x+ speed-up, >90% accuracy 12

More Centrality Measures Degree Betweenness Closeness, by computing Shortest paths Proximity (usually via random walks) used successfully in a lot of applications Eigenvector 13

PageRank (Google) Larry Page Sergey Brin Brin, Sergey and Lawrence Page (1998). Anatomy of a Large-Scale Hypertextual Web Search Engine. 7th Intl World Wide Web Conf.

PageRank: Problem Given a directed graph, find its most interesting/central node A node is important, if it is connected with important nodes (recursive, but OK!)

PageRank: Solution Given a directed graph, find its most interesting/central node Proposed solution: use random walk; spot most popular node (-> steady state probability (ssp)) A node has high ssp, if it is connected with high ssp nodes state = webpage (recursive, but OK!)

(Simplified) PageRank Let B be the transition matrix: transposed, column-normalized To From B 1 2 3 = 4 5

(Simplified) PageRank B p = p B p = p 1 2 3 = 4 5

(Simplified) PageRank B p = 1 * p Thus, p is the eigenvector that corresponds to the highest eigenvalue (=1, since the matrix is column-normalized) Why does such a p exist? p exists if B is nxn, nonnegative, irreducible [Perron Frobenius theorem]

(Simplified) PageRank In short: imagine a particle randomly moving along the edges Compute its steady-state probability (ssp) Full version of algorithm: with occasional random jumps Why? To make the matrix irreducible

Full Algorithm With probability 1-c, fly-out to 1 2 3 a random node Then, we have p = c B p + (1-c)/n 1 => p = (1-c)/n [I - c B] -1 1 4 5

How to compute PageRank for huge matrix? 1 2 3 Use the power iteration method http://en.wikipedia.org/wiki/power_iteration 4 p = c B p + (1-c)/n 1 5 p B p = c + (1-c) 1/n Can initialize this vector to any non-zero vector, e.g., all 1 s

http://www.cs.duke.edu/csed/principles/pagerank/ 23

PageRank for graphs (generally) You can compute PageRank for any graphs Should be in your algorithm toolbox Better than simple centrality measure (e.g., degree) Fast to compute for large graphs (O(E)) But can be misled (Google Bomb) How? 24

Personalized PageRank Make one small variation of PageRank Intuition: not all pages are equal, some more relevant to a person s specific needs How? 25

Personalizing PageRank With probability 1-c, fly-out to a random node some preferred nodes Then, we have p = c B p + (1-c)/n 1 => p = (1-c)/n [I - c B] -1 1

Why learn Personalized PageRank? Can be used for recommendation, e.g., If I like this webpage, what would I also be interested? If I like this product, what other products I also like? (in a user-product bipartite graph) Also helps with visualizing large graphs Instead of visualizing every single nodes, visualize the most important ones Again, very flexible. Can be run on any graph. 27

Interactive Application

Building an interactive application Will show you an example application (Apolo) that uses a diffusion-based algorithm to perform recommendation on a large graph Personalized PageRank (= Random Walk with Restart) Belief Propagation (powerful inference algorithm, for fraud detection, image segmentation, error-correcting codes, etc.) Spreading activation or degree of interest in Human- Computer Interaction (HCI) Guilt-by-association techniques 29

Building an interactive application Why diffusion-based algorithms are widely used? Intuitive to interpret uses network effect, homophily, etc. Easy to implement Math is relatively simple Fast run time linear to #edges, or better Probabilistic meaning 30

Human-In-The-Loop Graph Mining Apolo: Machine Learning + Visualization CHI 2011 Apolo: Making Sense of Large Network Data by Combining Rich User Interaction and Machine Learning 31

Finding More Relevant Nodes HCI Paper Citation network Data Mining Paper 32

Finding More Relevant Nodes HCI Paper Citation network Data Mining Paper 32

Finding More Relevant Nodes HCI Paper Citation network Data Mining Paper Apolo uses guilt-by-association (Belief Propagation, similar to personalized PageRank) 32

Demo: Mapping the Sensemaking Literature Nodes: 80k papers from Google Scholar (node size: #citation) Edges: 150k citations 33

Key Ideas (Recap) Specify exemplars Find other relevant nodes (BP) 35

Apolo s Contributions 1 Human + Machine It was like having a partnership with the machine. Apolo User 2 Personalized Landscape 36

Apolo 2009 37

Apolo 2010 38

Apolo 2011 22,000 lines of code. Java 1.6. Swing. Uses SQLite3 to store graph on disk 39

User Study Used citation network Task: Find related papers for 2 sections in a survey paper on user interface Model-based generation of UI Rapid prototyping tools 40

Between subjects design Participants: grad student or research staff 41

41

41

Judges Scores 16 Apolo Scholar Score 8 0 Modelbased *Prototyping *Average Higher is better. Apolo wins. * Statistically significant, by two-tailed t test, p <0.05 42

Apolo: Recap A mixed-initiative approach for exploring and creating personalized landscape for large network data Apolo = ML + Visualization + Interaction 43

Feldspar Finding Information by Association. CHI 2008 Polo Chau, Brad Myers, Andrew Faulring YouTube: http://www.youtube.com/watch?v=q0tiv8f_o_e&feature=youtu.be&list=ulq0tiv8f_o_e Paper: http://www.cs.cmu.edu/~dchau/feldspar/feldspar-chi08.pdf 44

Feldspar What to Do When Search Fails: Finding Information by Association Polo Chau, Brad Myers, Andrew Faulring 45

Feldspar A system that helps people find things on their computers when typical search or browsing tools don t work What to Do When Search Fails: Finding Information by Association Polo Chau, Brad Myers, Andrew Faulring 45

Feldspar A system that helps people find things on their computers when typical search or browsing tools don t work An example scenario What to Do When Search Fails: Finding Information by Association Polo Chau, Brad Myers, Andrew Faulring 45

Find the webpage mentioned in the email from the person I met at an event What to Do When Search Fails: Finding Information by Association Polo Chau, Brad Myers, Andrew Faulring 46

Find the webpage mentioned in the email from the person I met at an event If I can t remember the specifics, such as any text in the webpage, email, etc. à Can t search What to Do When Search Fails: Finding Information by Association Polo Chau, Brad Myers, Andrew Faulring 46

Find the webpage mentioned in the email from the person I met at an event If I can t remember the specifics, such as any text in the webpage, email, etc. à Can t search If I haven t bookmarked the webpage à Can t browse What to Do When Search Fails: Finding Information by Association Polo Chau, Brad Myers, Andrew Faulring 46

Find the webpage mentioned in the email from the person I met at an event What to Do When Search Fails: Finding Information by Association Polo Chau, Brad Myers, Andrew Faulring 47

Find the webpage mentioned in the email from the person I met at an event But I can describe the webpage with a chain of associations. What to Do When Search Fails: Finding Information by Association Polo Chau, Brad Myers, Andrew Faulring 47

Find the webpage mentioned in the email from the person I met at an event But I can describe the webpage with a chain of associations. webpage email person event What to Do When Search Fails: Finding Information by Association Polo Chau, Brad Myers, Andrew Faulring 47

Find the webpage mentioned in the email from the person I met at an event But I can describe the webpage with a chain of associations. webpage email person event The psychology literature has shown that people often remember things exactly like this. What to Do When Search Fails: Finding Information by Association Polo Chau, Brad Myers, Andrew Faulring 47

Natural question: Can I find things by associations? What to Do When Search Fails: Finding Information by Association Polo Chau, Brad Myers, Andrew Faulring 48

Natural question: Can I find things by associations? Can I find the webpage by specifying its associated information (email, person, and event)? What to Do When Search Fails: Finding Information by Association Polo Chau, Brad Myers, Andrew Faulring 48

Natural question: Can I find things by associations? Can I find the webpage by specifying its associated information (email, person, and event)? We created Feldspar, which supports this associative retrieval of information. What to Do When Search Fails: Finding Information by Association Polo Chau, Brad Myers, Andrew Faulring 48

Feldspar stands for. What to Do When Search Fails: Finding Information by Association Polo Chau, Brad Myers, Andrew Faulring 49

F Finding E Elements by L Leveraging Feldspar stands for. D Diverse Sources of P Pertinent A Associative Recollection What to Do When Search Fails: Finding Information by Association Polo Chau, Brad Myers, Andrew Faulring 49

Implementation: Overview Create a graph database to store the associations among items on the computer Develop an algorithm that processes the query and returns results What to Do When Search Fails: Finding Information by Association Polo Chau, Brad Myers, Andrew Faulring 50

Creating an Association Database (a graph) Install Google Desktop and let it index all the items on the computer What to Do When Search Fails: Finding Information by Association Polo Chau, Brad Myers, Andrew Faulring 51

Creating an Association Database (a graph) Focus on 7 types Install Google Desktop and let it index all the items on the computer What to Do When Search Fails: Finding Information by Association Polo Chau, Brad Myers, Andrew Faulring 51

Creating an Association Database (a graph) Identify associations and build our database, which is a directed graph Focus on 7 types Install Google Desktop and let it index all the items on the computer What to Do When Search Fails: Finding Information by Association Polo Chau, Brad Myers, Andrew Faulring 51

Creating an Association Database (a graph) Identify associations and build our database, which is a directed graph Focus on 7 types Install Google Desktop and let it index all the items on the computer What to Do When Search Fails: Finding Information by Association Polo Chau, Brad Myers, Andrew Faulring 51

Creating an Association Database (a graph) Identify associations and build our database, which is a directed graph Focus on 7 types Install Google Desktop and let it index all the items on the computer What to Do When Search Fails: Finding Information by Association Polo Chau, Brad Myers, Andrew Faulring 51

Practitioners guide to building (interactive) applications Think about scalability early e.g., picking a scalable algorithm early on When building interactive applications, use iterative design approach (as in Apolo) Why? It s hard to get it right the first time Create prototype, evaluate, modify prototype, evaluate,... Quick evaluation helps you identify important fixes early (can save you a lot of time) 52

Practitioners guide to building (interactive) applications How to do iterative design? What kinds of prototypes? Paper prototype, lo-fi prototype, high-fi prototype What kinds of evaluation? Important to involve REAL users as early as possible Recruit your friends to try your tools Lab study (controlled, as in Apolo) Longitudinal study (usage over months) Deploy it and see the world s reaction! To learn more: CS 6750 Human-Computer Interaction CS 6455 User Interface Design and Evaluation 53

If you want to know more about people http://amzn.com/0321767535 54

Polonium: Web-Scale Malware Detection SDM 2011 Polonium: Tera-Scale Graph Mining and Inference for Malware Detection

Typical Malware Detection Method Signature-based detection 1.Collect malware 2.Generate signatures 3.Distribute to users 4.Scan computers for matches What about zero-day malware? No samples à No signatures à No detection How to detect them early? 56

Reputation-Based Detection Computes reputation score for each application e.g., MSWord.exe Poor reputation = Malware 57

Polonium Text Patented I led initial design and development Serving 120 million users Answered trillions of queries 58

Polonium Text Propagation of leverage of network influence unearths malware Patented I led initial design and development Serving 120 million users Answered trillions of queries 58

Polonium works with 60 Terabyte Data 50 million machines anonymously reported their executable files 900 million unique files (Identified by their cryptographic hash values) Goal: label malware and good files 59

Why A Hard Problem? Existing Research Polonium Small dataset Huge dataset (60 terabytes) Detects specific malware (e.g., worm, trojans) Detects all types (needs a general method) Many false alarms (>10%) Strict (<1%) 60

Polonium: Problem Definition Given Undirected machine-file bipartite graph 37 billion edges, 1 billion nodes (machines, files) Some file labels from Symantec (good or bad) Find Labels for all unknown files 61

Where to Get Good and Bad Labels? Symantec has a ground truth database of known-good and known-bad files e.g., set known-good file s prior to 0.9 62

How to Gauge Machine Reputation? Computed using Symantec s proprietary formula; a value between 0 and 1 Derived from anonymous aspects of machine s usage and behavior 63

How to propagate known information to the unknown? 64

Key Idea: Guilt-by-Association GOOD files likely appear on GOOD machines BAD files likely appear on BAD machines Also known as Homophily Machine Good Bad Good 0.9 0.1 File Bad 0.1 0.9 65

How to propagate known information to the unknown? Adapts Belief Propagation (BP) A powerful inference algorithm Used in image processing, computer vision, error-correcting codes, etc. 66

Propagating Reputation Example Machine Good Bad File Good 0.9 0.1 Bad 0.1 0.9 Machines 0.6 0.45 0.35 A B C Files 0.9 0.5 0.5 0.1 1 2 3 4 67

Propagating Reputation Example Machine Good Bad File Good 0.9 0.1 Bad 0.1 0.9 Machines 0.6 0.45 0.35 A B C Files 0.9 0.5 0.5 0.1 1 2 3 4 67

Propagating Reputation Example Machine Good Bad File Good 0.9 0.1 Bad 0.1 0.9 Machines 0.6 0.45 0.35 A B C Files 0.92 0.58 0.38 0.5 0.06 0.1 1 2 3 4 67

Propagating Reputation Example Machine Good Bad File Good 0.9 0.1 Bad 0.1 0.9 Machines 0.6 0.45 0.35 A B C Files 0.92 0.58 0.38 0.5 0.06 0.1 1 2 3 4 67

Propagating Reputation Example Machine Good Bad File Good 0.9 0.1 Bad 0.1 0.9 Machines 0.87 0.6 0.45 0.81 0.35 0.1 A B C Files 0.92 0.58 0.38 0.5 0.06 0.1 1 2 3 4 67

Two Equations in Belief Propagation Details 68

Computing Node Belief (Reputation) Details 69

Computing Node Belief (Reputation) Details Belief Prior belief Neighbors opinions 69

Computing Node Belief (Reputation) A B C Details 0.5 1 2 3 4 Belief Prior belief Neighbors opinions 69

Creating Message for Neighbor Details 70

Creating Message for Neighbor Details Opinion for neighbor Edge potential Belief Good Bad Good 0.9 0.1 Bad 0.1 0.9 70

Creating Message for Neighbor A B C Details 1 2 3 4 Opinion for neighbor Edge potential Belief Good Bad Good 0.9 0.1 Bad 0.1 0.9 70

Evaluation Using millions of ground truth files,10-fold cross validation True Positive Rate % of bad correctly labeled Ideal 85% True Positive Rate 1% False Alarms False Positive Rate (False Alarms) % of good labeled as bad 71

Evaluation Using millions of ground truth files,10-fold cross validation True Positive Rate % of bad correctly labeled Ideal 85% True Positive Rate 1% False Alarms Boosted existing methods by 10 absolute % point False Positive Rate (False Alarms) % of good labeled as bad 71

Multi-Iteration Results 765 4 3 2 1 True Positive Rate % of bad correctly labeled False Positive Rate (False Alarm) % of good labeled as bad 72

Scalability Running Time Per Iteration Linux 16-core Opteron 256GB RAM 3 hours, 37 billion edges 73

Scalability How Did I Scale Up BP? Details 1.Early termination (after 6 iterations) à Faster 2.Keep edges on disk à Saves 200GB of RAM 3.Computes half of the messages à Twice as fast 74

Further Scale Up Belief Propagation Use Hadoop if graph doesn t fit in memory [ICDE 11] Speed scales up linearly with number of machines Scale-up Yahoo! M45 cluster 480 machines 1.5 PB storage 3.5TB machine Number of machines 75