Graph Sampling Approach for Reducing. Computational Complexity of. Large-Scale Social Network

Similar documents
Reducing Computational Complexity of Network Analysis using Graph Compression Method for Brand Awareness Effort

FAST SUMMARIZATION OF LARGE-SCALE SOCIAL NETWORK USING GRAPH PRUNING BASED ON K-CORE PROPERTY

Solution of Maximum Clique Problem. by Using Branch and Bound Method

Conditional Volatility Estimation by. Conditional Quantile Autoregression

Solutions of Stochastic Coalitional Games

A New Energy-Aware Routing Protocol for. Improving Path Stability in Ad-hoc Networks

The Number of Fuzzy Subgroups of Cuboid Group

Pseudo-random Bit Generation Algorithm Based on Chebyshev Polynomial and Tinkerbell Map

A Comparative Study on Optimization Techniques for Solving Multi-objective Geometric Programming Problems

Mining and Analyzing Online Social Networks

Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing Environments

Ennumeration of the Number of Spanning Trees in the Lantern Maximal Planar Graph

Algorithms and Applications in Social Networks. 2017/2018, Semester B Slava Novgorodov

An Empirical Analysis of Communities in Real-World Networks

Hyperbola for Curvilinear Interpolation

Stochastic Coalitional Games with Constant Matrix of Transition Probabilities

THE EFFECT OF SEGREGATION IN NON- REPEATED PRISONER'S DILEMMA

On the Parallel Implementation of Best Fit Decreasing Algorithm in Matlab

Graceful Labeling for Some Star Related Graphs

Some Algebraic (n, n)-secret Image Sharing Schemes

Heronian Mean Labeling of. Disconnected Graphs

A Computational Study on the Number of. Iterations to Solve the Transportation Problem

Constructing a G(N, p) Network

Constructing a G(N, p) Network

Introduction Types of Social Network Analysis Social Networks in the Online Age Data Mining for Social Network Analysis Applications Conclusion

What is the Optimal Bin Size of a Histogram: An Informal Description

Alessandro Del Ponte, Weijia Ran PAD 637 Week 3 Summary January 31, Wasserman and Faust, Chapter 3: Notation for Social Network Data

Dominator Coloring of Prism Graph

Association Rule with Frequent Pattern Growth. Algorithm for Frequent Item Sets Mining

Generating Topology on Graphs by. Operations on Graphs

Record Linkage using Probabilistic Methods and Data Mining Techniques

Heronian Mean Labeling of Graphs

Flexible-Hybrid Sequential Floating Search in Statistical Feature Selection

Effective Clustering Algorithms for VLSI Circuit. Partitioning Problems

AN APPROACH FOR LOAD BALANCING FOR SIMULATION IN HETEROGENEOUS DISTRIBUTED SYSTEMS USING SIMULATION DATA MINING

NSGA-II for Biological Graph Compression

CLUSTERING BIG DATA USING NORMALIZATION BASED k-means ALGORITHM

A Monotonic Sequence and Subsequence Approach in Missing Data Statistical Analysis

The Generalized Stability Indicator of. Fragment of the Network. II Critical Performance Event

Automatic Group-Outlier Detection

Solving Traveling Salesman Problem Using Parallel Genetic. Algorithm and Simulated Annealing

The Cover Pebbling Number of the Join of Some Graphs

On the Test and Estimation of Fractional Parameter. in ARFIMA Model: Bootstrap Approach

A New Approach to Evaluate Operations on Multi Granular Nano Topology

Iteration Reduction K Means Clustering Algorithm

Small World Properties Generated by a New Algorithm Under Same Degree of All Nodes

A Note on Vertex Arboricity of Toroidal Graphs without 7-Cycles 1

Research Article A Two-Level Cache for Distributed Information Retrieval in Search Engines

Job Shop Scheduling Problem (JSSP) Genetic Algorithms Critical Block and DG distance Neighbourhood Search

6.207/14.15: Networks Lecture 5: Generalized Random Graphs and Small-World Model

Chapter 4: Implicit Error Detection

Clustering Part 4 DBSCAN

A New Secure Mutual Authentication Scheme with Smart Cards Using Bilinear Pairings

A New Approach to Meusnier s Theorem in Game Theory

Erdős-Rényi Model for network formation

Detecting and Analyzing Communities in Social Network Graphs for Targeted Marketing

Structural Analysis of Paper Citation and Co-Authorship Networks using Network Analysis Techniques

Domination Number of Jump Graph

TELCOM2125: Network Science and Analysis

Implementation on Real Time Public. Transportation Information Using GSM Query. Response System

RETRACTED ARTICLE. Web-Based Data Mining in System Design and Implementation. Open Access. Jianhu Gong 1* and Jianzhi Gong 2

Study of Data Mining Algorithm in Social Network Analysis

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks

An Optimization Algorithm of Selecting Initial Clustering Center in K means

Mining Distributed Frequent Itemset with Hadoop

Center-Based Sampling for Population-Based Algorithms

Intro to Random Graphs and Exponential Random Graph Models

Groups with Isomorphic Tables of Marks Orders: 32, 48, 72 and 80 1

Lecture Note: Computation problems in social. network analysis

Rainbow Vertex-Connection Number of 3-Connected Graph

Enumeration of Minimal Control Sets of Vertices in Oriented Graph

Social Network Analysis

Robust EC-PAKA Protocol for Wireless Mobile Networks

Overlapping Community Detection in Dynamic Networks

A SOCIAL NETWORK ANALYSIS APPROACH TO ANALYZE ROAD NETWORKS INTRODUCTION

Collaborative Rough Clustering

A Generating Function Approach to Analyze Random Graphs

Fast Approximate Minimum Spanning Tree Algorithm Based on K-Means

Image Reconstruction Using Rational Ball Interpolant and Genetic Algorithm

New Optimal Load Allocation for Scheduling Divisible Data Grid Applications

CLUSTERING. CSE 634 Data Mining Prof. Anita Wasilewska TEAM 16

Image Segmentation Based on. Modified Tsallis Entropy

MIDTERM EXAMINATION Networked Life (NETS 112) November 21, 2013 Prof. Michael Kearns

Communication Networks I December 4, 2001 Agenda Graph theory notation Trees Shortest path algorithms Distributed, asynchronous algorithms Page 1

Centrality Measures to Identify Traffic Congestion on Road Networks: A Case Study of Sri Lanka

Election Analysis and Prediction Using Big Data Analytics

Volume 2, Issue 11, November 2014 International Journal of Advance Research in Computer Science and Management Studies

An Investigation into the Free/Open Source Software Phenomenon using Data Mining, Social Network Theory, and Agent-Based

Taccumulation of the social network data has raised

The k-means Algorithm and Genetic Algorithm

A Partition Method for Graph Isomorphism

Proximal Manifold Learning via Descriptive Neighbourhood Selection

Properties of Biological Networks

Problems of Sensor Placement for Intelligent Environments of Robotic Testbeds

Distributed Detection in Sensor Networks: Connectivity Graph and Small World Networks

Complete Bipartite Graphs with No Rainbow Paths

Privacy-Preserving of Check-in Services in MSNS Based on a Bit Matrix

Enhancing Stratified Graph Sampling Algorithms based on Approximate Degree Distribution

Rough Connected Topologized. Approximation Spaces

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

Transcription:

Journal of Innovative Technology and Education, Vol. 3, 216, no. 1, 131-137 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/1.12988/jite.216.6828 Graph Sampling Approach for Reducing Computational Complexity of Large-Scale Social Network Andry Alamsyah 1,2, Yahya Peranginangin 1, Intan Muchtadi-Alamsyah 3, Budi Rahardjo 2 and Kuspriyanto 2 1 School of Economics and Business, Telkom University, Indonesia 2 School of Electrical Engineering, Bandung Institute of Technology, Indonesia 3 Faculty of Mathematics and Natural Science Bandung Institute of Technology Indonesia Copyright 215 Andry Alamsyah et al. This article is distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Abstract The online social network services provide platform for human social interactions. Nowadays, many kinds of online interactions generate large-scale social network data. Network analysis helps to mine knowledge and pattern from relationship between actors inside the network. This approach is important to support predictions and decision-making process in many real-world applications. The social network analysis methodology, which borrow approach from graph theory provides several metrics that enabled us to measure specific properties of the networks. Some of metrics calculations were built with no scalability in minds, thus it is computationally expensive. In this paper, we propose graph sampling approach to reduce social network size, thus reducing computation operations. The performance comparison between natural graph sampling strategies using edge random sampling, node random sampling and random walks are presented on each selected graph property. We found that the performance of graph sampling strategies depends on graph properties measured. Keywords: Graph Sampling, Graph Theory, Large-Scale Social Network, Computational Complexity, Social Network Analysis

132 Andry Alamsyah et al. 1 Introduction Our daily online interactions have contributed to the rise of data production. This lead to massive data available, which in turn can offer opportunity for us to model the phenomenon and make predictions that support decision making process in many real world applications. The legacy methodology in social sciences are based on directly working with the population and few take the route to gain insight using massive data available in online. We use the Social Network Analysis (SNA) [1] [2] methodology to analyze the network data. SNA foundation are based on graph theory [3]. SNA provides many metrics that support network topology measurement but some of them for example Betweenness Centrality were built for small network. As the network becomes bigger, it increases the computational complexity of finding the shortest path between all pairs in network. Betweenness Centrality have computational complexity reaching O( N 3 ) [4], where N is the number of nodes in the network. In our previous attempt to reduce graph size and make representative sample, we propose Graph Summary [5] based on Minimum Description Length [6] principle. Our effort can reduce the graph size by 5%, but the time needed to construct graph summary is high, because of merging nodes operation complexity. For that reason, we try another alternative with more simple approach such as using graph sampling methodology. Graph sampling [7] is a technique to pick a subset of nodes and/or edges from original graph. Although being smaller in size, sampled graph may be similar to original graph in some way (see. Figure 1). In this paper, we are interested on how properties behave after graph sampling implementation. We generate different size of graph sample in order to see the properties evolution in different size of network. We investigate which properties are preserved in certain sampling technique and which properties are not. If a property in sampled graph is preserved, then we can estimate the property value in original network. We define original graph G(NG, EG) where NG is set of nodes and EG is set of edges. The graph sample S(NS, ES) is subset of G(NG, EG), where NS NG and ES EG. We denote N as the number of nodes and E as the number of edges. Figure 1. (a) left: Original network with 439 nodes and 88234 edges. (b). right: Sampling of original network using RW strategy with 1 nodes and 213 edges

Graph sampling approach for reducing computational complexity 133 2 Social Network Properties There are many social network metrics available to describe certain social network properties. We use the following properties [2] as comparison in each graph sample size. 1. Average Degree is the number of edges E compared to number of nodes N. We denote Average Degree = E / N 2. Density is the number of edges E in the network compared to maximum number of edges between nodes in the network. We denote Density = E /( N *( N -1)) 3. Modularity measures fraction of the edges that fall within the given groups minus the expected of such fraction is edges were distributed at random. The bigger M value means the boundary between groups in the network are more distinct. 4. Average Clustering Coefficient measures the total degree to which nodes are tend to cluster together in network compared to total number of nodes. 5. Diameter measures the longest of shortest path between any pair of nodes in network. 6. Average Path Length calculated by finding the shortest path between all pairs of nodes, adding them up and then dividing by the total number of pairs. 7. Connected Component shows how many network components in which any two nodes are connected to each other by paths. 3 Graph Sampling Strategies We propose three strategies of graph sampling by using Edge Random Sampling (ERS), Node Random Sampling (NRS) and Random Walks Sampling (RW) methods. We explain each of the strategy and algorithm construction as follows: 3.1 Edge Random Sampling The Edge Random Sampling (ERS) strategy is the simplest strategy of three. We pick random edges uniformly from the network. The selected edges are used to construct new smaller network. The ERS algorithm is as follows: 1: input network G(N G, E G) 2: choose set of random edges E S from G 3: construct network sample S(N S, E S), where E S are the chosen edges from E G and N S are selected nodes connecting the E S 3.2 Node Random Sampling The Node Random Sampling (NRS) strategy works as follows. We pick random

134 Andry Alamsyah et al. nodes uniformly from the network, then we construct new smaller network from the chosen nodes. The NRS algorithm is as follows: 1: input network G(N G, E G) 2: choose set of random nodes N C from G 3: create permutations list P = P(N G, 2) 4: create intersection list I = P & G, where I is the edge list from the connections between selected N G from G 5: construct network sample S(N S, E S) from list I 3.3 Random Walks Sampling The Random Walks (RW) strategy are based on random walks idea, where given a network and a starting node, then we select neighbor of this node at random and move to it. The random sequence selected, which consists of nodes and edges are kept as the result of RW strategy. We run each RW process for certain number of iterations which we choose carefully depends what we see as the sufficient sample of network size. The x most visited nodes from the iterations become our candidate for network sample. From this point, the process similar with NRS process. The difference RW process and NRS process is nodes are not chosen by uniformly random sampling, but by random walks mechanism. The RW algorithm is as follows: 1: input network G(N G, E G) 2: while r < number of run 3: while i < number of iterations: We randomly choose N G(i+1) where N G(i+1) is neighbor of N G(i) 4: list L r = (N G(i), N G(i+1),N G(i+2),,N G(number of iterations)) 5: list L = L 1 + L 2 +. + L number of run 6: list M = descending sorted list L by number times a node has been selected in RW strategy 7: list M x = x most visited node 8: create permutations list P = P(M x, 2) 9: create intersection list I = P & G, where I is the edge list from the connections between selected N G from G 1: construct network sample S(N S, E S) from list I 4 Experiments We use ego-facebook dataset from Stanford Network Analysis Project (SNAP) repositories for our experiment to construct graph sample, using three strategies in section 3. This dataset network consists of 439 nodes and 88234 edges. The dataset network properties are Density.11, Average Degree 43,691, Modularity.835, Average Clustering Coefficient.617, Diameter 8, Average Path Length 3.693 and Connected Component 1 The ERS algorithm experiment is run by pick a set of random edges which each round chosen by stepping up 1 edges from 1 edges to 7 edges. The random sampling nature of this algorithm produces fluctuate values. To overcome discrepancy between each result, on each round, we run the algorithm 1 times. The average network properties value from all run become our final result of the ERS strategy.

Graph sampling approach for reducing computational complexity 135 The NRS algorithm experiment is run by pick some randomly uniform nodes, start from 1 nodes, and then continue to 5, 1, 15, 2, 25, 3, 35 nodes. This set of nodes NC become candidate of our graph sample. To check whether there is a connection between a pair of selected nodes, we construct permutations mechanism P(NC, 2), where we list all possible permutations between pairs of nodes. On the last step, we check whether there is actual connection between a pair of nodes from the list of possible connection between nodes resulted from permutations mechanism P by using intersection list I between list of NG and list of NC. The list I converted to edge list and become network sample S(NS, ES). The RW algorithm experiment is run with the scenario as follows. At one random walks process, we set the number of iterations i as the numbers of how many nodes collected on each random walks process. We choose the value i = 1, as we consider this number is suitable enough to get representative number of nodes. We run 1 random walks process. On each random walks process, we pick x most visited nodes, throughout the experimentation we change variable x to 1, 5, 1, 15, 2, 25, and 3 nodes respectively. The sum of all 1 random walks process are collected, and we pick the top 1, 5, 1, 15, 2, 25 and 3 most visited nodes respectively. The network from the top most visited nodes becomes our graph sample S(NS, ES). 1 12 8 6 4 2 1 8 6 4 2 1 2 3 4 5 2 4 6 8 1 (a). Graph Sample Size (nodes vs edges) (b). Average Degree (size vs value),5 1,4,8,3,6,2,4,1,2 2 4 6 8 1 2 4 6 8 1 (c). Density (size vs value) (d). Modularity (size vs value)

136 Andry Alamsyah et al. 1 3,8,6,4,2 25 2 15 1 5 2 4 6 8 1 2 4 6 8 1 (e). Average Clustering Coefficient (size vs value) (f). Diameter (size vs value) 1 1 8 8 6 6 4 4 2 2 2 4 6 8 1 2 4 6 8 1 (g). Average Path Length (size vs value) (h). Connected Component (size vs value) Figure 2. Experiment result chart (a). Graph Sample Size, (b). Average Degree, (c). Density, (d). Modularity, (e). Average Clustering Coefficient, (f). Diameter, (g). Average Path Length, (h). Connected Component 5 Discussions and Conclusions Figure 2 show the performance chart of three sampling methods. In Graph Sample Size, RW create sample graph with smaller number of nodes but with much bigger number of edges compared to the other two, this signify that sample from RW are faster and more effective in constructing representative graph sample. A method performs better in a graph property if, 1). it converges to the value of the original graph property with predicted result, both linearly or exponentially and 2). the gap between predicted value and the original value is small. With this regards, RW methods are performs better in properties Average Clustering Coefficient, Diameter, Average Path Length, Connected Component. Both NRS and ERS are performs better in properties Average Degree, Modularity and Density. To reduce the computational complexity of Betweenness Centrality metrics as the sample of our original problems. We pick a property that related to our problem, in this case is finding shortest path between all pairs. The problem is related to Average Path Length and Diameter properties; thus we choose RW methods.

Graph sampling approach for reducing computational complexity 137 Our conclusions are 1). We can use graph sampling based on three strategies to reduce graph size while still retain some properties or retain some information that we use as estimator of original graph properties. 2). Different graph properties need different graph sampling methodology. 3). For the future works, we need to compare the computations complexity of each sampling methods in order to get the conclusive results regarding each sampling methods. References [1] J. Scott, Social Network Analysis: a Handbook, Sage Publications, 2. [2] M.E.J. Newman, Network: An Introduction, Oxford University Press, 21. http://dx.doi.org/1.193/acprof:oso/9781992665.1.1 [3] R. Diestel, Graph Theory: Electronic, Edition 25, Springer-Verlag Heidelberg, New York, 1997, 2, 25. [4] U. Brandes, A Faster Algorithm for Betweenness Centrality, Journal of Mathematical Sociology, 25 (2), no. 2, 163-177. http://dx.doi.org/1.18/2225x.21.999249 [5] A. Alamsyah, Y. Peranginangin, B. Rahardjo, I. Muchtadi-Alamsyah, Kuspriyanto Kuspriyanto, Reducing Computational Complexity of Network Analysis using Graph Compression Method for Brand Awareness Effort, The 3 rd International Conference on Computational Science and Technology, (215), 1-6. http://dx.doi.org/1.2991/iccst-15.215.26 [6] P.D. Grunwald, The Minimum Description Length Principle, The MIT Press, 27. [7] P. Hu, W.C. Lau, A Survey and Taxonomy of Graph Sampling, arxiv:138.5865[cs.si], 213. Received: November 15, 215; Published: August 29, 216