Sampling Large Graphs for Anticipatory Analysis

Similar documents
Sampling Operations on Big Data

Anomaly Detection in Very Large Graphs Modeling and Computational Considerations

[CoolName++]: A Graph Processing Framework for Charm++

Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search

Graph Exploitation Testbed

CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp

DataSToRM: Data Science and Technology Research Environment

Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman

General Instructions. Questions

Graphs / Networks CSE 6242/ CX Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

Mining Web Data. Lijun Zhang

Scalable Selective Traffic Congestion Notification

Analysis and Mapping of Sparse Matrix Computations

Kara Greenfield, William Campbell, Joel Acevedo-Aviles

Jaccard Coefficients as a Potential Graph Benchmark

Latent Space Model for Road Networks to Predict Time-Varying Traffic. Presented by: Rob Fitzgerald Spring 2017

Social Behavior Prediction Through Reality Mining

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

A New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader

Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Infinite data. Filtering data streams

Data Sources for Cyber Security Research

Deep Character-Level Click-Through Rate Prediction for Sponsored Search

CS535 Big Data Fall 2017 Colorado State University 10/10/2017 Sangmi Lee Pallickara Week 8- A.

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Scott Philips, Edward Kao, Michael Yee and Christian Anderson. Graph Exploitation Symposium August 9 th 2011

With turing you can: Identify, locate and mitigate the effects of botnets or other malware abusing your infrastructure

LLMORE: Mapping and Optimization Framework

Tracking Groups in Mobile Network Traces

Highly Scalable, Non-RDMA NVMe Fabric. Bob Hansen,, VP System Architecture

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

ECLT 5810 Clustering

Signal Processing on Databases

Big Data - Some Words BIG DATA 8/31/2017. Introduction

Edge Classification in Networks

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011

Linear Regression Optimization

Bipartite Edge Prediction via Transductive Learning over Product Graphs

Lec 8: Adaptive Information Retrieval 2

Cluster-based 3D Reconstruction of Aerial Video

Graph Data Management

Collaborative Filtering for Netflix

Predictive Indexing for Fast Search

CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul

Data preprocessing Functional Programming and Intelligent Algorithms

Lecture 25: Review I

PARALLELIZATION OF THE NELDER-MEAD SIMPLEX ALGORITHM

BigDataBench: a Benchmark Suite for Big Data Application

Graphs / Networks CSE 6242/ CX Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

Part 1: Link Analysis & Page Rank

Microsoft Advance Threat Analytics (ATA) at LLNL NLIT Summit 2018

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

ECLT 5810 Clustering

Introduction to Data Mining

Robustness of Centrality Measures for Small-World Networks Containing Systematic Error

Covert and Anomalous Network Discovery and Detection (CANDiD)

Feature Selection for fmri Classification

BIG DATA TESTING: A UNIFIED VIEW

CENTRALITIES. Carlo PICCARDI. DEIB - Department of Electronics, Information and Bioengineering Politecnico di Milano, Italy

Demystifying movie ratings 224W Project Report. Amritha Raghunath Vignesh Ganapathi Subramanian

Collective Mind. Early Warnings of Systematic Failures of Equipment. Dr. Artur Dubrawski. Dr. Norman Sondheimer. Auton Lab Carnegie Mellon University

Massive Data Analysis

Homework 4: Clustering, Recommenders, Dim. Reduction, ML and Graph Mining (due November 19 th, 2014, 2:30pm, in class hard-copy please)

IEEE 802 network enhancements for the next decade Industry Connections Activity Initiation Document (ICAID) Version: 1.

Mission Impossible: Surviving Today s Flood of Critical Data

ActiveClean: Interactive Data Cleaning For Statistical Modeling. Safkat Islam Carolyn Zhang CS 590

CSCI5070 Advanced Topics in Social Computing

Link Prediction for Social Network

WHITE PAPER. Operationalizing Threat Intelligence Data: The Problems of Relevance and Scale

Nonparametric Importance Sampling for Big Data

An Automated System for Data Attribute Anomaly Detection

Similarity Ranking in Large- Scale Bipartite Graphs

Object and Action Detection from a Single Example

PageRank. CS16: Introduction to Data Structures & Algorithms Spring 2018

Case Study: Social Network Analysis. Part II

BubbleNet: A Cyber Security Dashboard for Visualizing Patterns

Package mmtsne. July 28, 2017

2013 AWS Worldwide Public Sector Summit Washington, D.C.

Reduce and Aggregate: Similarity Ranking in Multi-Categorical Bipartite Graphs

Node Similarity. Ralucca Gera, Applied Mathematics Dept. Naval Postgraduate School Monterey, California

Finding a needle in Haystack: Facebook's photo storage

GraFBoost: Using accelerated flash storage for external graph analytics

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

CS435 Introduction to Big Data Spring 2018 Colorado State University. 3/21/2018 Week 10-B Sangmi Lee Pallickara. FAQs. Collaborative filtering

Sentiment analysis under temporal shift

Weka ( )

TELCOM2125: Network Science and Analysis

Fraud Mobility: Exploitation Patterns and Insights

Improved Detection of Low-Profile Probes and Denial-of-Service Attacks*

Link Analysis in the Cloud

Section 7.12: Similarity. By: Ralucca Gera, NPS

MSA220 - Statistical Learning for Big Data

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

UNSUPERVISED LEARNING FOR ANOMALY INTRUSION DETECTION Presented by: Mohamed EL Fadly

Insight: that s for NSA Decision making: that s for Google, Facebook. so they find the best way to push out adds and products

Machine Learning Applications for Data Center Optimization

Multi-Centrality Graph Spectral Decompositions and Their Application to Cyber Intrusion Detection

Building an Effective Threat Intelligence Capability. Haider Pasha, CISSP, C EH Director, Security Strategy Emerging Markets Office of the CTO

The HPEC Challenge Benchmark Suite

UNCLASSIFIED. R-1 ITEM NOMENCLATURE PE D8Z: Data to Decisions Advanced Technology FY 2012 OCO

Transcription:

Sampling Large Graphs for Anticipatory Analysis Lauren Edwards*, Luke Johnson, Maja Milosavljevic, Vijay Gadepally, Benjamin A. Miller IEEE High Performance Extreme Computing Conference September 16, 2015 This work is sponsored by the Intelligence Advanced Research Projects Activity (IARPA) under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.

Outline Introduction Sampling Techniques Link Prediction Experimental Setup Results Summary

Common Big Data Challenge Operators Analysts Commanders Users Rapidly increasing - Data volume - Data velocity - Data variety - Data veracity (security) Data Gap Users 2000 2005 2010 2015 & Beyond Data <html> OSINT Weather HUMINT C2 Ground Maritime Air Space Cyber

What is a Graph? Graphs are mathematical representations of physical or logical relationships between sets of objects A graph is a pair of sets: A set of vertices/ nodes (entities) A set of edges/links (connections, transactions, and relationships)

Applications of Graph Analytics NSA-RD-2013-05 Brain scale... ISR Social 100 billion vertices, 100 trillion edges Cyber2.08 mna bytes 2 (molar bytes) adjacency Biomatrix 2.84 PB adjacency list 2.84 PB edge list Human connectome. Gerhard et al., Frontiers in Neuroinformatics 5(3), 2011 Graphs represent entities and relationships detected through multiple data Graphs represent relationships between individuals or documents Graphs represent Graphs represent 2 N = 6.022 10 mol communication connectivity between Paul Burkhardt, Chris Waring An NSA Big Graph experiment patterns of computers brain regions on a network ~1B 1T regions and ~1K 1M entities and connections ~10K 10M individuals and interactions ~1M 1B network events GOAL: Identify anomalous patterns GOAL: Identify hidden social networks GOAL: Detect attack or malicious software A 23 1 connections GOAL: Detect regions affected by a neurological condition

Big Data Challenge NSA-RD-2013-056001v1 Brain scale... 100 billion vertices, 100 trillion edges Memory required for storing Sparse Matrix (GB) 2,097,152 2.08 mna bytes 2 (molar bytes) adjacency matrix 2.84 PB adjacency list Petabyte 2.84 PB edge list NSA-RD-2013-056001v1 Web scale... 262,144 50 billion vertices, 1 trillion edges 271 EB adjacency matrix 29.5 TB adjacency list 29.1 TB edge list 32,768 NSA-RD-2013-056001v1 Social scale... Brain Human connectome. Gerhard et al., Frontiers in Neuroinformatics 5(3), 2011 2 NA = 6.022 1023 mol 1 Paul Burkhardt, Chris Waring An NSA Big Graph experiment 1 billion vertices, 100 billion edges 111 PB adjacency matrix 4,096 2.92 TB adjacency list Terabyte 512 Web graph from the SNAP database (http://snap.stanford.edu/data) 2.92 TB edge list Internet graph from the Opte Project (http://www.opte.org/maps) Paul Burkhardt, Chris Waring 64 An NSA Big Graph experiment Web Social Twitter graph from Gephi dataset (http://www.gephi.org) Cyber 8 Paul Burkhardt, Chris Waring An NSA Big Graph experiment P. Burkhardt and C. Waring, An NSA Big Graph experiment, National Security Agency, Tech. Rep. NSA-RD-2013-056002v1, 2013. ISR 1 1 16 256 4,096 65,536 1,048,576 NSA Big Graph Experiment Graph Vertices (millions) How can these massive graphs be leveraged to provide accurate anticipatory intelligence?

Outline Introduction Sampling Techniques Link Prediction Experimental Setup Results Summary

Methods for Sampling Graphs Edge sampling methods Consider each edges individually, keep it in the sample based on some criterion Vertex sampling methods Sample individual vertices or local substructures Random walk methods Walk from vertex to vertex by some criterion, keep edges along the path Snowball sampling methods Start with a set of seed vertices, grow the sample from those

Random Edge Sampling Each edge in the graph has equal probability of being included in the sample Methods Used Popularity-Based Sampling Probability of sampling a given edge is proportional to a decreasing function of the vertex degrees Wedge Sampling Choose a vertex at random, and randomly select two of its adjacent edges to be included in the sample

Random Edge Sampling Each edge in the graph has equal probability of being included in the sample Methods Used Popularity-Based Sampling Probability of sampling a given edge is proportional to a decreasing function of the vertex degrees Wedge Sampling Choose a vertex at random, and randomly select two of its adjacent edges to be included in the sample

Random Edge Sampling Each edge in the graph has equal probability of being included in the sample Methods Used Popularity-Based Sampling Probability of sampling a given edge is proportional to a decreasing function of the vertex degrees Wedge Sampling Choose a vertex at random, and randomly select two of its adjacent edges to be included in the sample

Random Edge Sampling Each edge in the graph has equal probability of being included in the sample Methods Used Popularity-Based Sampling Probability of sampling a given edge is proportional to a decreasing function of the vertex degrees Wedge Sampling Choose a vertex at random, and randomly select two of its adjacent edges to be included in the sample

Random Edge Sampling Each edge in the graph has equal probability of being included in the sample Methods Used Popularity-Based Sampling Probability of sampling a given edge is proportional to a decreasing function of the vertex degrees Wedge Sampling Choose a vertex at random, and randomly select two of its adjacent edges to be included in the sample

Random Edge Sampling Each edge in the graph has equal probability of being included in the sample Methods Used Popularity-Based Sampling Probability of sampling a given edge is proportional to a decreasing function of the vertex degrees Wedge Sampling Choose a vertex at random, and randomly select two of its adjacent edges to be included in the sample

Random Edge Sampling Each edge in the graph has equal probability of being included in the sample Methods Used Popularity-Based Sampling Probability of sampling a given edge is proportional to a decreasing function of the vertex degrees Wedge Sampling Choose a vertex at random, and randomly select two of its adjacent edges to be included in the sample

Random Edge Sampling Each edge in the graph has equal probability of being included in the sample Methods Used Popularity-Based Sampling Probability of sampling a given edge is proportional to a decreasing function of the vertex degrees Wedge Sampling Choose a vertex at random, and randomly select two of its adjacent edges to be included in the sample

Random Walk Sampling Move along edges from vertex to vertex, adding edges to the sample along the way At each step, with probability 0.15, jump to any vertex in the graph Random Area Sampling Choose a set of seed vertices, and add all edges adjacent to those vertices to the sample Methods Used, Cont d

Random Walk Sampling Move along edges from vertex to vertex, adding edges to the sample along the way At each step, with probability 0.15, jump to any vertex in the graph Random Area Sampling Choose a set of seed vertices, and add all edges adjacent to those vertices to the sample Methods Used, Cont d

Random Walk Sampling Move along edges from vertex to vertex, adding edges to the sample along the way At each step, with probability 0.15, jump to any vertex in the graph Random Area Sampling Choose a set of seed vertices, and add all edges adjacent to those vertices to the sample Methods Used, Cont d

Random Walk Sampling Move along edges from vertex to vertex, adding edges to the sample along the way At each step, with probability 0.15, jump to any vertex in the graph Random Area Sampling Choose a set of seed vertices, and add all edges adjacent to those vertices to the sample Methods Used, Cont d

Random Walk Sampling Move along edges from vertex to vertex, adding edges to the sample along the way At each step, with probability 0.15, jump to any vertex in the graph Random Area Sampling Choose a set of seed vertices, and add all edges adjacent to those vertices to the sample Methods Used, Cont d

Random Walk Sampling Move along edges from vertex to vertex, adding edges to the sample along the way At each step, with probability 0.15, jump to any vertex in the graph Random Area Sampling Choose a set of seed vertices, and add all edges adjacent to those vertices to the sample Methods Used, Cont d

Random Walk Sampling Move along edges from vertex to vertex, adding edges to the sample along the way At each step, with probability 0.15, jump to any vertex in the graph Random Area Sampling Choose a set of seed vertices, and add all edges adjacent to those vertices to the sample Methods Used, Cont d

Random Walk Sampling Move along edges from vertex to vertex, adding edges to the sample along the way At each step, with probability 0.15, jump to any vertex in the graph Random Area Sampling Choose a set of seed vertices, and add all edges adjacent to those vertices to the sample Methods Used, Cont d

Random Walk Sampling Move along edges from vertex to vertex, adding edges to the sample along the way At each step, with probability 0.15, jump to any vertex in the graph Random Area Sampling Choose a set of seed vertices, and add all edges adjacent to those vertices to the sample Methods Used, Cont d

Random Walk Sampling Move along edges from vertex to vertex, adding edges to the sample along the way At each step, with probability 0.15, jump to any vertex in the graph Random Area Sampling Choose a set of seed vertices, and add all edges adjacent to those vertices to the sample Methods Used, Cont d

Random Walk Sampling Move along edges from vertex to vertex, adding edges to the sample along the way At each step, with probability 0.15, jump to any vertex in the graph Random Area Sampling Choose a set of seed vertices, and add all edges adjacent to those vertices to the sample Methods Used, Cont d

Random Walk Sampling Move along edges from vertex to vertex, adding edges to the sample along the way At each step, with probability 0.15, jump to any vertex in the graph Random Area Sampling Choose a set of seed vertices, and add all edges adjacent to those vertices to the sample Methods Used, Cont d

Methods Used, Cont d Random Walk Sampling Move along edges from vertex to vertex, adding edges to the sample along the way At each step, with probability 0.15, jump to any vertex in the graph Random Area Sampling Choose a set of seed vertices, and add all edges adjacent to those vertices to the sample Selected methods comprise a broad array of network sampling techniques

Outline Introduction Sampling Techniques Link Prediction Experimental Setup Results Summary

Link Prediction Scenario: Two vertices currently do not share an edge Question: How likely are these vertices to form a connection in the near future?? Examples: How likely is a connection to be formed between two people in a social network? How likely are two political entities to engage in diplomatic or military conflict?? Link prediction: Anticipation of new connections that currently do not exist

Link Prediction Statistics Common neighbors Vertices with more neighbors in common are more likely to connect Jaccard s coefficient Normalize common neighbors by total number of neighbors Low-rank approximation Use a low-rank approximation for the square of the adjacency matrix 2 U Σ 2 U T Tensor decomposition Break data into vertex-vertextime factors, extrapolate along temporal axis vertex vertex time time

How Does Sampling Hurt? If edge sampling probability is p, probability of maintaining a common neighbor is p 2 This smears the distribution of common neighbors, inhibiting detection Sample 1/8 of Edges For eigenvector-based techniques, removing edges causes the distribution of eigenvalues to contract The outlier eigenvalues, which provide the most information, contract more quickly eigenvalue

Outline Introduction Sampling Techniques Link Prediction Experimental Setup Results Summary

Experiments aggregate input graph sampling sampling sampling sampling sampling prediction statistics prediction statistics prediction statistics prediction statistics prediction statistics evaluation Simulated graph Method Random edge Popularity-based Random walk Random area Wedge Amount Factor by which data size should be reduced Link Prediction Common neighbors Jaccard s coefficient Low-rank approximation Tensor decomposition Prediction performance Running time

Simulation Setup Graph at time t Normalized common neighbors Random changes Graph at time t+1 BTER Model: Simulate graphs evolving over several time steps First time step: Create a graph from a generative model Future time steps: Compute common neighbors Normalize using cosine similarity Randomly select based on scores Create new random edges from generative model Simulates chance encounters Randomly remove edges

Simulated Graphs Approximately 10,000 vertices, average degree of 17, 5% increase in edge count per time step Two degree distributions: one lighter tailed and one heavier tailed Variations designed to determine most robust sampling methods

Performance Metrics Area under the ROC curve (AUC) Measure area under the receiver operating characteristic (ROC) curve, which maps false alarm rate to detection rate Running time Time required (in Matlab implementation) to compute prediction statistics Probability of Detection ROC shaded area = 0.144 AUC = 0.856 Probability of False Alarm Memory Footprint Storage required for prediction statistics Fundamental metrics for anticipatory network analytics: Predictive performance Latency Computing resource use

Outline Introduction Sampling Techniques Link Prediction Experimental Setup Results Summary

Link Prediction in Baseline Simulation Common Neighbors Low-Rank Approximation Jaccard s Coefficient Tensor Decomposition Common neighbors and Jaccard s coefficient are virtually identical Random area sampling has the most significant initial performance drop in all cases Random walk sampling improves AUC at a higher sampling factor with tensor factorization

Time for Link Prediction Common Neighbors Jaccard s Coefficient Common neighbors and Jaccard: time is substantially reduced as sampling becomes more aggressive Best initial decrease is with random area sampling However, this comes with the biggest decrease in predictive performance

Time for Link Prediction Low-Rank Common Approximation Neighbors Tensor Jaccard s Decomposition Coefficient Low Rank: time increases with higher sampling rates Changes in the distribution of eigenvalues alters convergence rate Tensor: High sampling rates significantly reduces runtime Random area sampling produces high number of isolated vertices, reduces the dimensionality of the adjacency matrix

Outline Introduction Sampling Techniques Link Prediction Experimental Setup Results Summary

Summary Network analytics are of tremendous importance for anticipatory analytics Forecast future connections Networks of interest are often extremely large, and data sampling enables computation on a dataset of a reasonable size Study considered the effect of sampling on anticipatory network analytics Random edge sampling is typically the most stable Random area sampling loses performance the fastest Promising areas for future work: test on larger simulated and real graph data, investigate emergent community detection, aggregate samples for higher predictive capabilities, integration with high-performance graph analytics hardware

Backup