Edge Classification in Networks

Charu C. Aggarwal (IBM T. J. Watson Research Center), Peixiang Zhao, and Gewen He (Florida State University). ICDE Conference, 2016.

Introduction We consider in this paper the edge classification problem in networks, which is defined as follows. Given a graph-structured network $G(N, A)$, where $N$ is a set of vertices and $A \subseteq N \times N$ is a set of edges, in which a subset $A_l \subseteq A$ of edges is properly labeled a priori, determine the unknown labels of the edges in $A_u = A \setminus A_l$.

Applications The edge classification problem has numerous applications in a wide array of network-centric scenarios: A social network may contain relationships of various types, such as friends, families, or even rivals. A subset of the edges may be labeled with the relevant relationships, and the determination of unlabeled relationships can be useful for making targeted recommendations. An adversarial or terrorist network may contain unknown relationships between its participants, although a subset of the relationships is already known. In some cases, the edges may have numerical quantities associated with them, corresponding to strengths of relationships or to specifications of ratings, likes, or dislikes.

Contributions We formulate the edge classification problem and the edge regression problem in graph-structured networks. We propose a structural similarity model for edge classification. In order to support edge classification in large-scale networks that are either disk-resident or stream-like, we devise probabilistic, min-hash based edge classification algorithms that are efficient and cost-effective without compromising the classification accuracy significantly. We carry out extensive experimental studies on different real-world networks, and the experimental results verify the efficiency and accuracy of our edge classification techniques.

Problem Definition We provide a formal model for edge classification in networks. We assume that we have an undirected network $G = (N, A)$, where $N$ denotes the set of vertices, $A$ denotes the set of edges, $n = |N|$, and $m = |A|$. A subset of edges in $A$, denoted by $A_l$, is associated with edge labels; the label of the edge $(u, v)$ is denoted by $l_{uv}$. The remaining subset of unlabeled edges is denoted by $A_u = A \setminus A_l$.

Formal Definitions Given an undirected network $G = (N, A)$ and a set $A_l \subseteq A$ of labeled edges, where each edge $(u, v) \in A_l$ has a label $l_{uv}$, the edge classification problem is to determine the labels for the edges in $A_u = A \setminus A_l$. In some scenarios, it is also possible for the edges in $A$ to be labeled with numerical quantities. In that case, edge classification becomes the edge regression problem in networks, defined as follows: consider an undirected network $G = (N, A)$ and a set $A_l \subseteq A$ of edges, each of which is annotated with a numerical label $l_{uv} \in \mathbb{R}$. The edge regression problem is to determine the numerical labels for the edges in $A_u = A \setminus A_l$.

Overview We will propose a structural similarity model for edge classification. This approach shares a number of similarities with nearest-neighbor classification, except that the structural similarity of one edge to another is much harder to define in the context of a network.

Overview of Steps Consider an edge $(u, v) \in A_u$ which needs to be classified within the network $G = (N, A)$. The top-$k$ most similar vertices to $u$, denoted by $S(u) = \{u_1, \ldots, u_k\}$, are first determined based on a structural similarity measure discussed slightly later. The top-$k$ most similar vertices to $v$, denoted by $S(v) = \{v_1, \ldots, v_k\}$, are determined based on the same structural similarity measure as in the case of vertex $u$. The edges in $A_l \cap [S(u) \times S(v)]$ are then selected, and the dominant class label of these edges is assigned as the relevant edge label for the edge $(u, v)$.
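As a rough sketch in Python, the procedure might look as follows; the helper names (classify_edge, similarity, and so on) are illustrative rather than taken from the paper, and similarity stands for the structural measure defined on the next slides:

```python
from collections import Counter

def classify_edge(u, v, labels, similarity, vertices, k):
    """Predict the label of an unlabeled edge (u, v) by a nearest-neighbor
    vote over labeled edges (a sketch, not the authors' implementation)."""
    # Step 1: top-k most similar vertices to u (excluding u itself).
    S_u = sorted((w for w in vertices if w != u),
                 key=lambda w: similarity(u, w), reverse=True)[:k]
    # Step 2: top-k most similar vertices to v, using the same measure.
    S_v = sorted((w for w in vertices if w != v),
                 key=lambda w: similarity(v, w), reverse=True)[:k]
    # Step 3: collect labeled edges in A_l joining S(u) and S(v),
    # and return their dominant class label.
    votes = Counter()
    for ur in S_u:
        for vq in S_v:
            for e in ((ur, vq), (vq, ur)):
                if e in labels:               # labels: dict edge -> class
                    votes[labels[e]] += 1
    return votes.most_common(1)[0][0] if votes else None
```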

Similarity Computation How are the most similar vertices to a given vertex $u \in N$ determined in a network? Here the key step is to define a pair-wise similarity function for vertices. We consider the Jaccard coefficient for vertex-wise similarity quantification.

Similarity Computation Let $I^-(u) \subseteq N$ be the set of vertices incident on the vertex $u$ via edges with label $-1$. Let $I^+(u) \subseteq N$ be the set of vertices incident on $u$ via edges with label $+1$. The Jaccard coefficient of vertices $u$ and $v$, $J^+(u, v)$, on the positive edges (bearing the edge label $+1$) is defined as follows: $J^+(u, v) = \frac{|I^+(u) \cap I^+(v)|}{|I^+(u) \cup I^+(v)|}$ (1)

Similarity Computation (Contd.) The Jaccard coefficient on the negative edges can be defined in an analogous way, as follows: $J^-(u, v) = \frac{|I^-(u) \cap I^-(v)|}{|I^-(u) \cup I^-(v)|}$ (2) As a result, the similarity between vertices $u$ and $v$ can be defined by a weighted average of the values of $J^+(u, v)$ and $J^-(u, v)$. The most effective way to achieve this goal is to use the fractions of vertices belonging to the two classes.

Weighted Jaccard The fraction of vertices incident on $u$ via edges with label $+1$ is given by $f^+(u) = \frac{|I^+(u)|}{|I^+(u)| + |I^-(u)|}$. Similarly, the value of $f^-(u)$ is defined as $1 - f^+(u)$. The average fractions across both vertices are defined by $f^+(u, v) = (f^+(u) + f^+(v))/2$ and $f^-(u, v) = (f^-(u) + f^-(v))/2$. Therefore, the overall Jaccard similarity, $J(u, v)$, of the vertices $u$ and $v$ is defined by the weighted average: $J(u, v) = f^+(u, v) \cdot J^+(u, v) + f^-(u, v) \cdot J^-(u, v)$ (3)
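The following minimal Python sketch implements Equations (1)-(3); it assumes the positively and negatively labeled neighbor sets are available as dictionaries of sets (I_plus and I_minus are illustrative names, not from the paper):

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets; 0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def weighted_jaccard(u, v, I_plus, I_minus):
    """Weighted Jaccard similarity J(u, v) of Eq. (3).

    I_plus[x] / I_minus[x]: neighbors of x over positively / negatively
    labeled edges."""
    def f_plus(x):
        pos, neg = len(I_plus[x]), len(I_minus[x])
        return pos / (pos + neg) if pos + neg else 0.0

    fp = (f_plus(u) + f_plus(v)) / 2.0   # f^+(u, v)
    fm = 1.0 - fp                        # f^-(u, v) = 1 - f^+(u, v)
    return (fp * jaccard(I_plus[u], I_plus[v])
            + fm * jaccard(I_minus[u], I_minus[v]))
```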

Differential Weight An important variation of the edge classification process is that not all edges are given an equal weight in the process of determining the dominant edge label. Consider the labeled edge $(u_r, v_q)$, where $u_r \in S(u)$ and $v_q \in S(v)$. The weight of the label of the edge $(u_r, v_q)$ can be set to $J(u, u_r) \cdot J(v, v_q)$ in order to ensure greater importance of edges which are more similar to the target vertices $u$ and $v$. It is also interesting to note that the above approach can easily be extended to the $k$-way case by defining a separate Jaccard coefficient on a class-wise basis, and then aggregating in a similar way, as discussed above.
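Concretely, the unit votes in the earlier sketch would be replaced by these similarity-product weights; again an illustrative sketch, reusing the hypothetical names from above:

```python
from collections import defaultdict

def classify_edge_weighted(u, v, labels, similarity, S_u, S_v):
    """Weighted variant of the vote: a labeled edge (u_r, v_q)
    contributes J(u, u_r) * J(v, v_q) instead of a unit vote."""
    votes = defaultdict(float)
    for ur in S_u:
        for vq in S_v:
            for e in ((ur, vq), (vq, ur)):
                if e in labels:
                    votes[labels[e]] += similarity(u, ur) * similarity(v, vq)
    return max(votes, key=votes.get) if votes else None
```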

Sparse Labeling In real-world cases, the edges are often labeled rather sparsely, which makes it extremely difficult to compute similarities with only the labeled edges. In such cases, a separate component of the similarity, $J^0(u, v)$, between vertices $u$ and $v$ is computed over only the unlabeled edges, just like $J^+(u, v)$ and $J^-(u, v)$, which are the Jaccard coefficients over the positive and negative edges, respectively: $J(u, v) = f^+(u, v) \cdot J^+(u, v) + f^-(u, v) \cdot J^-(u, v) + \mu \cdot J^0(u, v)$ (4) Here, $\mu$ is a discount factor less than 1, which is used to prevent excessive impact of the unlabeled edges on the similarity computation.
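In code, Eq. (4) is a one-line extension of the earlier weighted_jaccard sketch; the default value of mu below is an arbitrary illustrative choice, not one prescribed by the paper:

```python
def weighted_jaccard_sparse(u, v, I_plus, I_minus, I_unlab, mu=0.5):
    """Eq. (4): add a discounted Jaccard term over unlabeled neighbors.

    I_unlab[x]: neighbors of x over unlabeled edges; mu < 1 limits
    their influence on the similarity."""
    return (weighted_jaccard(u, v, I_plus, I_minus)
            + mu * jaccard(I_unlab[u], I_unlab[v]))
```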

Numerical Labels The case of numerical class variables in edge regression is similar to that of binary or categorical class variables in edge classification, except that the vertex similarities and the class-averaging steps are computed differently in order to account for the numerical class variables. Let $I^+(u)$ be the set of all vertices incident on a particular vertex $u$ such that, for each $v \in I^+(u)$, the edge $(u, v)$ has a numerical label whose value is greater than the average of the labeled edges incident on $u$. The remaining vertices incident on $u$, whose labeled edge values are below the average, are put in another set, $I^-(u)$.

The similarity values $J^+(u, v)$, $J^-(u, v)$, and $J(u, v)$ can then be defined in a manner analogous to the previous binary edge classification case.

Classification with Numeric Labels In order to determine the numeric label of an edge $(u, v)$, a procedure similar to the previous case is employed. The first step is to determine the closest $k$ vertices $S(u) = \{u_1, \ldots, u_k\}$ to $u$, and the closest $k$ vertices $S(v) = \{v_1, \ldots, v_k\}$ to $v$, with the use of the aforementioned Jaccard similarity function. Then, the numeric labels of the edges in $A_l \cap [S(u) \times S(v)]$ are averaged. As in the case of edge classification, different edges can be given different weights in the regression process.
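A minimal sketch of the regression step, under the same illustrative conventions as before (an unweighted average; per-edge similarity weights could be added as in the classification case):

```python
def regress_edge(u, v, numeric_labels, S_u, S_v):
    """Predict a numeric label for (u, v) as the average of the
    labeled edges joining S(u) and S(v).

    numeric_labels: dict mapping a labeled edge (a, b) -> real value."""
    vals = [numeric_labels[e]
            for ur in S_u for vq in S_v
            for e in ((ur, vq), (vq, ur)) if e in numeric_labels]
    return sum(vals) / len(vals) if vals else None
```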

Challenges The main problem with the exact approaches discussed in the previous section is that they work most efficiently when the entire graph is memory-resident. When the graph is very large and disk-resident, the computation of pair-wise vertex similarity is likely to be computationally expensive, because all the adjacent edges of different vertices need to be accessed, resulting in a large number of random accesses to hard disks. Furthermore, in many real-world dynamic settings, the graph on which edge classification is performed is no longer static but evolves at a fast speed.

Basic Approach Use a synopsis structure to summarize the graph. The synopsis structure can be efficiently updated in real time, and it can be leveraged to perform classification efficiently.

Broad Approach In order to achieve this goal, we consider a probabilistic, min-hash based approach, which can be applied effectively to both disk-resident graphs and graph streams. In this min-hash approach, the core idea is to associate a succinct data structure, termed the min-hash index, with each vertex, which keeps track of the set of adjacent vertices. Specifically, the positive, negative, and unlabeled edges (and their incident adjacent vertices) of a given vertex are handled separately in different min-hash indexes.
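A standard min-hash construction of this kind might look as follows in Python; this is a generic sketch of the technique (one signature per neighbor set, with the fraction of agreeing positions estimating the Jaccard coefficient), not the exact index layout used in the paper:

```python
import random

PRIME = 2_147_483_647  # a large prime for the hash family

def make_hashes(num_hashes, seed=42):
    """Random linear hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME))
            for _ in range(num_hashes)]

def minhash_signature(vertex_set, hashes):
    """Min-hash index of a neighbor set (vertex ids assumed integers):
    per hash function, keep the minimum hash value over the set."""
    if not vertex_set:
        return [PRIME] * len(hashes)
    return [min((a * x + b) % PRIME for x in vertex_set)
            for (a, b) in hashes]

def estimate_jaccard(sig1, sig2):
    """The fraction of agreeing positions estimates the Jaccard
    coefficient of the two underlying sets."""
    return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / len(sig1)
```

A newly arriving adjacent vertex can be absorbed into an existing signature by taking element-wise minima, in time proportional to the number of hash functions, which is what makes such an index suitable for graph streams.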

Min-Hash Technique The use of the min-hash index is standard in these settings. Details of the use of the min-hash index, along with theoretical bounds, are provided in the paper.

Experimental Results Experimental studies demonstrate the effectiveness and efficiency of our proposed edge classification methods on real-world networks. We consider a variety of effectiveness and efficiency metrics on diverse networked data sets to show the broad applicability of our methods.

Data Sets Epinions (binary), Slashdot (binary), Wikipedia (binary), and Youtube (multiple labels).

Metrics Accuracy: we randomly select 1,500 edges from each network dataset and examine the average classification accuracy for these edges w.r.t. the true labels provided in the network. Time: we gauge the average time consumed for edge classification for the 1,500 edges selected randomly from within the networks. Space: as opposed to the exact methods, which need to explicitly materialize the whole networks for edge classification, the probabilistic methods build space-efficient min-hash indexes, which are supposed to be succinct in size and memory-resident. We record the total space cost (in megabytes) of the min-hash structures in the probabilistic methods, which need not store the entire graphs for edge classification.

Accuracy [Figure: classification accuracy (%) of ExtUF, ProbUF, ExtWF, ProbWF, ExtWP, and ProbWP on (a) Epinion, (b) Slashdot, (c) Wikipedia, and (d) Youtube.]

Running Time [Figure: average runtime (sec.) of ExtUF, ProbUF, ExtWF, ProbWF, ExtWP, and ProbWP on (a) Epinion, (b) Slashdot, (c) Wikipedia, and (d) Youtube.]

Effect of the number of nearest neighbors [Figure: accuracy (%) of ExtUF, ProbUF, ExtWF, ProbWF, ExtWP, and ProbWP as the number of top similar neighbors $k$ varies from 40 to 100, on (a) Epinion, (b) Slashdot, (c) Wikipedia, and (d) Youtube.]

Effect of hash functions [Figure: (a) accuracy (%), (b) average runtime (sec.), and (c) space cost (MB) on the Epinion, Slashdot, Wikipedia, and Youtube networks as the number of hash functions varies from 20 to 50.]

Effect of the number of unlabeled edges [Figure: accuracy (%) of ExtWP and ProbWP as the percentage of unlabeled edges $t$ varies from 10% to 70%, on (a) Epinion, (b) Slashdot, (c) Wikipedia, and (d) Youtube.]

Conclusions We presented a new technique for edge classification in networks, which uses probabilistic min-hash data structures for efficient classification.